Causal models for qualitative and mixed methods inference

Mixed methods

Macartan Humphreys and Alan Jacobs

1 Population parameters

Alan

1.1 Population-level queries

  • Average causal effects for a population
    • e.g., What is the average effect of \(X\) on \(Y\)?
  • Proportion of different effects in a population
    • What share of cases in the population have positive effects?
    • What share have negative effects?
    • For what share of treated units with positive outcomes was the outcome due to the treatment?
  • Causal pathways
    • e.g., How commonly does \(X\) affect \(Y\) through \(M\) (vs. through \(W\)) in the population?

1.2 The population-level ATE

  • What is the average effect of \(X\) on \(Y\) in the population?
    • This is a question about the values in \(\lambda^Y\): the shares of the different causal types in the population.
    • In process tracing, these were fixed parameters of a causal model.
    • Key move: we now posit that we do not know \(\lambda\) and need to learn about it

1.3 The population-level ATE

  • With binary variables, the average effect is always the difference between
    • The share of cases with a positive effect
    • The share of cases with a negative effect

\[ATE = \lambda_{01} - \lambda_{10}\]
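As a quick numeric check (a Python sketch; the shares below are made up purely for illustration):

```python
# Hypothetical shares of the four causal types (illustrative values only):
# "10" = negative effect, "01" = positive effect,
# "00" = Y is 0 regardless of X, "11" = Y is 1 regardless of X.
lam = {"10": 0.1, "01": 0.5, "00": 0.2, "11": 0.2}

assert abs(sum(lam.values()) - 1) < 1e-9  # shares form a distribution

# ATE = share with positive effects minus share with negative effects
ate = lam["01"] - lam["10"]
print(ate)
```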

1.4 ATE, now with a mediator

Which types define the query depends, again, on the structural model.

To estimate the ATE in this mediation model, we need to:

  • Combine two ways of generating positive effects
  • Combine two ways of generating negative effects
  • Subtract the second from the first

1.5 ATE, now with a mediator

  • Two ways of generating positive effects: \(\lambda^M_{01} \times \lambda^Y_{01}\) + \(\lambda^M_{10} \times \lambda^Y_{10}\)
  • Two ways of generating negative effects: \(\lambda^M_{10} \times \lambda^Y_{01}\) + \(\lambda^M_{01} \times \lambda^Y_{10}\)
  • Subtract the second from the first
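A quick numeric check of this composition (a Python sketch; the \(\lambda^M\) and \(\lambda^Y\) shares are made up for illustration):

```python
# Illustrative (made-up) type shares for M (given X) and for Y (given M)
lamM = {"10": 0.1, "01": 0.6, "00": 0.2, "11": 0.1}
lamY = {"10": 0.1, "01": 0.5, "00": 0.2, "11": 0.2}

# Two chains produce a positive X -> Y effect: positive-positive and negative-negative
pos = lamM["01"] * lamY["01"] + lamM["10"] * lamY["10"]
# Two chains produce a negative effect: negative-positive and positive-negative
neg = lamM["10"] * lamY["01"] + lamM["01"] * lamY["10"]

ate = pos - neg
print(pos, neg, ate)
```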

1.6 Share of positive effects

Query: How often does \(X\) have a positive effect on \(Y\)?

  • Proportion of cases with positive effect = \(\lambda^Y_{01}\)

1.7 Share of positive effects

Query: How often does \(X\) have a positive effect on \(Y\)?

  • Proportion of cases with positive effect = \(\lambda^M_{01} \times \lambda^Y_{01}\) + \(\lambda^M_{10} \times \lambda^Y_{10}\)

1.8 Attribution

Query: How often is \(Y=1\) due to \(X=1\)?

  • Proportion of cases with positive effect among those with \(Y=1\), when \(X=1\):

\[Attribution = \frac{\lambda^Y_{01}}{\lambda^Y_{01} + \lambda^Y_{11}}\]
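The same calculation in Python (again with made-up shares):

```python
# Made-up shares of Y's causal types (illustrative only)
lamY = {"10": 0.1, "01": 0.5, "00": 0.2, "11": 0.2}

# Among treated cases with Y = 1 (types "01" and "11"), the share
# for whom the outcome is due to the treatment is the "01" fraction:
attribution = lamY["01"] / (lamY["01"] + lamY["11"])
print(attribution)
```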

1.9 Pathways

  • We can also pose the pathway query at population level
    • What is the share of cases for which \(X\) has a positive effect on \(Y\) through \(M\)?
  • A question about joint \(\lambda^M\) and \(\lambda^Y\) distributions

2 Updating Models: Intuition

Alan

2.1 How will we answer questions about populations?

  • We need to learn about those \(\lambda\)’s
    • About the proportions of the population with different kinds of causal effects
  • We will have prior beliefs about those proportions
  • Then, when we see data on lots of cases, we will update our beliefs about proportions
    • From a prior distribution over \(\lambda\) \(\Rightarrow\) a posterior distribution over \(\lambda\)

2.2 How do we “update” our models?

  • We’ve talked about process tracing a single case to answer a case-level query
    • Here the model is fixed
    • We use the model + case data to answer questions about the case
  • We can also use data to “update” our models
    • Use data on many cases to learn about causal effects in the population
  • Allows mixing methods: using data on lots of cases, we can learn about the probative value of process-tracing evidence
  • The core logic: we learn by updating population-level causal beliefs toward beliefs more consistent with the data

2.3 Start with a DAG

2.4 Large-\(N\) estimation of \(ATE\): what happens to beliefs over parameters

  • Say we only collect data on \(I\), \(M\), and \(D\) for a large number of cases
  • We update on \(\lambda^I\) , \(\lambda^M\) and \(\lambda^D\) to place more weight on values that are likely to give rise to the pattern of data we see
  • We will come to put more weight on a joint distribution of \(\lambda^M\) and \(\lambda^D\) in line with the data and less posterior weight on all other combinations

Question: What would you infer if you saw a very high correlation between \(I\) and \(D\) and a low correlation between \(I\) and \(M\)? Or a low correlation between \(M\) and \(D\)?

3 General procedure

Macartan

3.1 Bayes rule again

Key insight:

  • If we suppose a given set of parameter values, we can figure out the likelihood of the data given those values.
  • We can do this for all possible parameter values and see which ones are more in line with the data

That, with priors, is enough to update:

\[p(\lambda | D) = \frac{p(D | \lambda)p(\lambda)}{p(D)}\]

3.2 Illustration: Causal inference on a grid

Consider this joint distribution over binary \(X\) and binary \(Y\):

|           | \(Y = 0\)                           | \(Y = 1\)                           |
|-----------|-------------------------------------|-------------------------------------|
| \(X = 0\) | \(\lambda_{01}/2 + \lambda_{00}/2\) | \(\lambda_{10}/2 + \lambda_{11}/2\) |
| \(X = 1\) | \(\lambda_{10}/2 + \lambda_{00}/2\) | \(\lambda_{01}/2 + \lambda_{11}/2\) |

reminder: \(\lambda_{10}\) is share with negative effects, \(\lambda_{01}\) is share with positive effects…

3.2.1 Causal inference on a grid: strategy

Say we now had (finite) data filling out this table. What posteriors should we form over \(\lambda_{10},\lambda_{01},\lambda_{00},\lambda_{11}\)?

|           | \(Y = 0\)  | \(Y = 1\)  |
|-----------|------------|------------|
| \(X = 0\) | \(n_{00}\) | \(n_{01}\) |
| \(X = 1\) | \(n_{10}\) | \(n_{11}\) |

Let's start with a flat prior over the shares and then update over possible shares based on the data.

This time we will start with draws of possible shares and then compute a posterior weight for each drawn share.

3.2.2 Causal inference on a grid: likelihood

\[ \Pr(n_{00}, n_{01}, n_{10}, n_{11} \mid \lambda_{10},\lambda_{01},\lambda_{00},\lambda_{11}) = f_{\text{multinomial}}\left( n_{00}, n_{01}, n_{10}, n_{11} \mid \sum n, w \right) \] where:

\[w = \left(\frac12(\lambda_{01} + \lambda_{00}), \frac12(\lambda_{10}+\lambda_{11}), \frac12(\lambda_{10}+\lambda_{00}), \frac12(\lambda_{01}+\lambda_{11})\right)\] The weights are just the probability of each data type, given \(\lambda\).

why multinomial?
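Why multinomial? With \(n\) independent cases, each falling into one of the four \((X, Y)\) cells with the probabilities in \(w\), the cell counts follow a multinomial distribution. A quick Python check that \(w\) is a proper distribution over data types (with illustrative \(\lambda\) values):

```python
# Illustrative lambda values: negative, positive, always-0, always-1 shares
lam10, lam01, lam00, lam11 = 0.1, 0.5, 0.2, 0.2

# Probability of each data type (X0Y0, X0Y1, X1Y0, X1Y1), as in the text's w
w = [(lam01 + lam00) / 2,
     (lam10 + lam11) / 2,
     (lam10 + lam00) / 2,
     (lam01 + lam11) / 2]

print(w)
print(sum(w))  # the four data-type probabilities sum to 1
```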

3.2.3 Causal inference on a grid: execution

A prior draw with 10,000 possibilities:

# draw 10,000 candidate share vectors from a flat Dirichlet prior
x <- gtools::rdirichlet(10000, alpha = c(1,1,1,1)) |> 
  as.data.frame()

names(x) <- letters[1:4]

x |> head() |> kable(digits = 3)

| a     | b     | c     | d     |
|-------|-------|-------|-------|
| 0.296 | 0.173 | 0.528 | 0.003 |
| 0.504 | 0.207 | 0.286 | 0.003 |
| 0.106 | 0.378 | 0.129 | 0.387 |
| 0.634 | 0.020 | 0.142 | 0.204 |
| 0.479 | 0.093 | 0.184 | 0.244 |
| 0.266 | 0.071 | 0.516 | 0.147 |

  • we are using \(a,b,c,d\) as labels for \(\lambda_{10}, \lambda_{01}, \lambda_{00}, \lambda_{11}\)
  • we are using a handy distribution: the Dirichlet distribution
  • each row sums to 1 (each point (row) lies on the simplex)

3.2.4 Causal inference on a grid: execution

Imagine we had data (number of units with given values of X and Y):

\[n_{00} = 400, n_{01} = 100, n_{10} = 100, n_{11} = 400\]

Difference in means = 0.6.

Then we update (if we do it manually!) like this:

# add likelihood and calculate posterior;
# counts are in (X0Y0, X0Y1, X1Y0, X1Y1) order and the probabilities
# follow w, with a, b, c, d for lambda_10, lambda_01, lambda_00, lambda_11

x <- x |> 
  rowwise() |>  # dmultinom() is not vectorized, so compute row by row
  mutate(
    likelihood = dmultinom(
      c(400, 100, 100, 400),
      prob = c(b + c, a + d, a + c, b + d) / 2
    )
  ) |> 
  ungroup() |>
  mutate(posterior = likelihood / sum(likelihood))
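The same grid logic can be sketched in plain Python (standard library only; this is an illustrative re-implementation of the steps above, not the original code, and uses log-likelihoods to avoid underflow):

```python
import math
import random

random.seed(1)

# Draw 10,000 candidate share vectors (a, b, c, d) from a flat
# Dirichlet(1,1,1,1) prior via normalized Gamma(1,1) draws;
# a, b, c, d stand for lambda_10, lambda_01, lambda_00, lambda_11.
def rdirichlet_flat():
    g = [random.gammavariate(1, 1) for _ in range(4)]
    s = sum(g)
    return [v / s for v in g]

draws = [rdirichlet_flat() for _ in range(10_000)]

# Observed counts in (X0Y0, X0Y1, X1Y0, X1Y1) order
n = [400, 100, 100, 400]

# Multinomial log-likelihood up to a constant: sum_i n_i * log(w_i)
def loglik(a, b, c, d):
    w = [(b + c) / 2, (a + d) / 2, (a + c) / 2, (b + d) / 2]
    return sum(ni * math.log(wi) for ni, wi in zip(n, w))

lls = [loglik(*d) for d in draws]
m = max(lls)
wts = [math.exp(ll - m) for ll in lls]  # subtract max for stability
total = sum(wts)
post = [wt / total for wt in wts]

# Posterior means of the negative-effect (a) and positive-effect (b) shares
a_hat = sum(p * d[0] for p, d in zip(post, draws))
b_hat = sum(p * d[1] for p, d in zip(post, draws))
print(round(a_hat, 2), round(b_hat, 2), round(b_hat - a_hat, 2))
```

The posterior-weighted ATE estimate lands near the 0.6 difference in means, as in the R version.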

3.2.5 Causal inference on a grid: execution

| a    | b    | c    | d    | likelihood | posterior |
|------|------|------|------|------------|-----------|
| 0.30 | 0.17 | 0.53 | 0.00 | 2.10e-212  | 1.72e-209 |
| 0.50 | 0.21 | 0.29 | 0.00 | 1.26e-221  | 1.03e-218 |
| 0.11 | 0.38 | 0.13 | 0.39 | 7.80e-46   | 6.38e-43  |
| 0.63 | 0.02 | 0.14 | 0.20 | 0.00e+00   | 0.00e+00  |
| 0.48 | 0.09 | 0.18 | 0.24 | 1.97e-231  | 1.61e-228 |
| 0.27 | 0.07 | 0.52 | 0.15 | 3.99e-194  | 3.26e-191 |

3.2.6 Causal inference on a grid: inferences

We calculate queries like this:

x |> summarize(a = weighted.mean(a, posterior),
               b = weighted.mean(b, posterior),
               ATE = b - a) |>
  kable(digits = 2)
| a   | b    | ATE  |
|-----|------|------|
| 0.1 | 0.69 | 0.59 |

3.2.7 Causal inference on a grid: inferences

x |> ggplot(aes(b, a, size = posterior)) + geom_point(alpha = .5) 

Spot the ridge

3.3 In sum: learning from data

  • For any data pattern, we gain confidence in parameter values more consistent with the data
  • For single-case inference, we must bring background beliefs about population-level causal effects
  • For multiple cases, we can learn about effects from the data
  • Large-\(N\) data can thus provide probative value for small-\(N\) process-tracing
  • All inference is conditional on the model

4 Mixed methods

Alan

Combining wide and deep data

4.1 A DAG

  • We’ll want to learn about the \(\theta\)’s and the \(\lambda\)’s
  • We need to observe nodes to learn about other nodes
  • We can potentially observe 3 nodes here: \(X, M\), and \(Y\)

4.2 A typical “quantitative” data structure

  • Data on exogenous variables and a key outcome for many cases

  • E.g., data on inequality (\(I\)) and democracy (\(D\)) for many cases

4.3 A typical “qualitative” data structure

  • Data on exogenous variables and a key outcome plus elements of process for a small number of cases

  • E.g., data on inequality (\(I\)), mass mobilization (\(M\)), and democracy (\(D\)) for a small number of cases

4.4 Mixing qualitative and quantitative

  • What if we combine extensive data on many cases with intensive data on a few cases?
  • A non-rectangular data structure

  • Finite resources \(\Rightarrow\) we can’t always go “deep” on all cases

4.5 Non-rectangular data

  • A data structure that neither standard quantitative nor standard qualitative approaches can handle in a systematic way
  • Not a problem for the Integrated Inferences approach
  • We simply ask:
    • Which causal effects in the population are most and least consistent with the data pattern we observe?
  • That is, what distribution of causal effects in the population, for each node, is most consistent with this data pattern?
  • CausalQueries uses information wherever it finds it

4.6 Mixing in practice

For Bayesian approaches this mixing is not hard.

Critically, though, we maintain the assumption that cases for “in depth” analysis are chosen at random; otherwise we would have to account for the selection process.

What is the probability of seeing these two cases:

  1. \(X=1, M = 1, Y = 1\) case
  2. \(X=1, Y=1\) (no data on \(M\))

given parameters \(\lambda\)?

4.7 Mixing in practice

The probability of 1 is:

\[p_{111}= \lambda^X_1 \times (\lambda^M_{01} + \lambda^M_{11}) \times (\lambda^Y_{01} +\lambda^Y_{11})\]

The probability of 2 is:

\[p_{1?1} = \lambda^X_1\times \left((\lambda^M_{01} + \lambda^M_{11}) \times (\lambda^Y_{01} +\lambda^Y_{11}) + (\lambda^M_{10} + \lambda^M_{00}) \times (\lambda^Y_{10} +\lambda^Y_{11}) \right)\]

So the probability of this data is just:

\[p(D|\lambda) = p_{111} \times p_{1?1}\]
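A minimal Python check of these two formulas, with made-up \(\lambda\) values for an \(X \rightarrow M \rightarrow Y\) chain:

```python
# Illustrative lambda values (made up for this sketch)
lamX1 = 0.5                                          # Pr(X = 1)
lamM = {"10": 0.1, "01": 0.6, "00": 0.2, "11": 0.1}  # M's types given X
lamY = {"10": 0.1, "01": 0.5, "00": 0.2, "11": 0.2}  # Y's types given M

# Case 1: X = 1, M = 1, Y = 1 fully observed
p_111 = lamX1 * (lamM["01"] + lamM["11"]) * (lamY["01"] + lamY["11"])

# Case 2: X = 1, Y = 1 with M unobserved: sum over both values of M
p_1q1 = lamX1 * (
    (lamM["01"] + lamM["11"]) * (lamY["01"] + lamY["11"])
    + (lamM["10"] + lamM["00"]) * (lamY["10"] + lamY["11"])
)

# Probability of observing both (independent) cases
p_data = p_111 * p_1q1
print(p_111, p_1q1, p_data)
```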

4.8 Mixing in practice

Insight:

If we imagine possible parameter values, we can figure out the likelihood of any data type: quantitative, qualitative, or mixed.

That, with priors, is enough to update:

\[p(\lambda | D) = \frac{p(D | \lambda)p(\lambda)}{p(D)}= \frac{p(D | \lambda)p(\lambda)}{\int_{\lambda'}p(D|\lambda')p(\lambda')d\lambda'}\]

4.9 Why is this useful?

4.9.1 How qual can inform quant: confounding

Remember:

  • Say we just observe a positive Inequality-Democratization correlation
  • Could be because Inequality causes Democratization
  • Could be because of confounding

4.9.2 How qual can inform quant: confounding

  • Observing \(M\) helps
  • Process data helps address the deep problem of confounding
  • Key point: we don’t need \(M\) for all cases. Can learn from \(I\) and \(D\) for lots of cases and \(M\) for a subset.

4.9.3 How qual can inform quant: observable confounder

  • Another example: \(M\) as the confounder

4.9.4 How qual can inform quant: observable confounder

  • How much can we learn from \(M\) data for some cases?

4.9.5 How quant can inform qual: getting probative value of a clue from the data

  • Suppose we go to the field and we learn that mass mobilization DID occur in Malawi

    • So \(M=1\)
  • What can we conclude?

  • NOTHING YET!

4.9.6 How quant can inform qual: getting probative value of a clue from the data

  • The pure process-tracing solution: assign our beliefs about causal effects in the population
    • E.g., beliefs that linked positive effects are more likely than linked negative effects
    • Meaning that \(M=1\) in an \(I=1, D=1\) case speaks in favor of \(I=1\) causing \(D=1\)
  • The mixed-methods solution: learn about population-level effects from large-\(N\) data

4.9.7 How quant can inform qual: getting probative value of a clue from the data

  • Suppose we have data on \(I\), \(D\), and \(M\) for a large number of cases
  • Suppose we observe a strong positive correlation across all 3 variables
  • What have we learned, under this model?
    • Positive \(I \rightarrow M\) effects more likely than negative
    • Positive \(M \rightarrow D\) effects more likely than negative
  • So linked positive effects more common than linked negative effects
  • Meaning that \(M=1\) in an \(I=1, D=1\) case speaks in favor of \(I=1\) causing \(D=1\)

4.9.8 How quant can inform qual: getting probative value of a clue from the data

  • But now we’ve drawn our population-level beliefs from the data
  • Now, we can go and process-trace
    • Did high inequality cause democratization in Malawi?
    • Observe \(M\)
  • With conclusions grounded in case-level AND population-level evidence

5 Example 1: Historical data

Alan

5.1 Application from the book: rule-of-law institutions and long-term growth

  • We start with flat priors over causal types
  • Gather data on all nodes for many cases

5.2 Rule-of-law and growth: process-tracing probative value from large-\(N\) data

5.3 Rule-of-law and growth: learning about confounding

  • We allowed for confounding between rule of law and growth
    • Mortality’s effects on institutions may be correlated with institutions’ effects on growth
  • We learn about that confounding from the data
    • Rule of law more often has a positive effect on growth where mortality has a negative effect on institutions
  • Consistent with selection effects:
    • When mortality is low, settlers make institutional choices in anticipation of their growth effects

5.4 Rule-of-law and growth: learning about confounding

What this looks like in our posteriors over nodal types:

A type where RoL has a positive effect on Growth is more common when Mortality has a negative effect on RoL.

6 Example 2: Impact evaluation

Macartan

6.1 Development Impact Evaluation Example

  • We (WZB team) are working with GIZ to understand the impact of a development intervention.

  • The implementers have a “theory of change” for how the intervention should work

  • We implement a mixed method approach in which we:

    1. Elicit the theory of change as a DAG from practitioners and local experts
    2. Gather priors on elements of the theory of change
    3. Coordinate with qualitative researchers to gather information about intermediate nodes on the DAG.
    4. Combine their insights with survey and administrative data to:
      • better understand program effects
      • better understand likely pathways

6.2 Development Impact Evaluation Example

6.3 Use of prior data

Priors are then used to implement four analyses:

  • causal queries implied by uniform priors
  • causal queries implied by stakeholder priors
  • posterior values of causal queries using a model in which data and uniform priors are used
  • posterior values of causal queries using a model in which data and stakeholder priors are used

6.4 Priors form

6.5 Qualitative data form

6.6 Updating on sub-DAGs

7 Mixed methods in CausalQueries

Macartan

7.1 Big picture

CausalQueries brings these elements together by allowing users to:

  1. Make model: specify a DAG; CausalQueries figures out all principal strata and places a prior on these
  2. Update model: provide data to the DAG; CausalQueries writes a Stan model and updates on all parameters
  3. Query updated model: CausalQueries figures out which parameters correspond to a given causal query

7.2 Illustration \(X \rightarrow Y\) model

Consider this problem:

|           | \(Y = 0\)  | \(Y = 1\)  |
|-----------|------------|------------|
| \(X = 0\) | \(n_{00}\) | \(n_{01}\) |
| \(X = 1\) | \(n_{10}\) | \(n_{11}\) |

where \(X\) is randomized and both \(X\) and \(Y\) are binary

7.3 Model, update, query

model <- make_model("X -> Y")

data = fabricate(
  N = 1000, 
  X = rbinom(N, 1, prob = .5),  
  Y = rbinom(N, 1, prob = .2 + .4*X))

data |> collapse_data(model) |> head()
  event strategy count
1  X0Y0       XY   370
2  X1Y0       XY   206
3  X0Y1       XY   105
4  X1Y1       XY   319
model <- model |> update_model(data)

7.4 Model, update, query

model |> inspect("posterior_distribution") 

posterior_distribution
Summary statistics of model parameters posterior distributions:

  Distributions matrix dimensions are 
  4000 rows (draws) by 6 cols (parameters)

     mean   sd
X.0  0.48 0.02
X.1  0.52 0.02
Y.00 0.28 0.07
Y.10 0.12 0.07
Y.01 0.50 0.07
Y.11 0.11 0.07

7.5 Model, update, query

model |> grab("posterior_distribution") |> 
  ggplot(aes(Y.01, Y.10)) + geom_point(alpha = .2)

Posterior draws

7.6 Model, update, query

model |> query_model(
  query = c(ATE = "Y[X=1] - Y[X=0]", 
            POS = "Y[X=1] > Y[X=0]", 
            SOME = "Y[X=1] != Y[X=0]" ),
  using = c("priors", "posteriors")) |>
  plot()

7.7 Generalization: Procedure

The CausalQueries approach generalizes to settings in which nodes are categorical:

  1. Identify all principal strata: that is, the universe of possible response types or “causal types”: \(\theta\)
  2. Define as parameters of interest the probability of each of these response types: \(\lambda\)
  3. Place a prior over \(\lambda\): e.g. Dirichlet
  4. Figure out \(\Pr(\text{Data} | \lambda)\)
  5. Use Stan to figure out \(\Pr(\lambda | \text{Data})\)

7.8 Generalization: Procedure

Also possible when there is unobserved confounding

…where dotted lines mean that the response types for two nodes are not independent

8 Extra slides

8.1 Illustration: “Lipids” data

Example of an IV model. What are the principal strata (response types)? What relations of conditional independence are implied by the model?

data("lipids_data")

lipids_data |> kable()
| event  | strategy | count |
|--------|----------|-------|
| Z0X0Y0 | ZXY      | 158   |
| Z1X0Y0 | ZXY      | 52    |
| Z0X1Y0 | ZXY      | 0     |
| Z1X1Y0 | ZXY      | 23    |
| Z0X0Y1 | ZXY      | 14    |
| Z1X0Y1 | ZXY      | 12    |
| Z0X1Y1 | ZXY      | 0     |
| Z1X1Y1 | ZXY      | 78    |

Note that in compact form we simply record the number of units (“count”) that display each possible pattern of outcomes on the three variables (“event”).

8.2 Model

model <- make_model("Z -> X -> Y; X <-> Y") 
model |> plot()

8.3 Updating and querying

Queries can be conditioned on observable or counterfactual quantities.

model |>
  update_model(lipids_data, refresh = 0) |>
  query_model(queries = c(
      ATE  = "Y[X=1] - Y[X=0]",
      PoC  = "Y[X=1] - Y[X=0] :|: X==0 & Y==0",
      LATE = "Y[X=1] - Y[X=0] :|: X[Z=1] > X[Z=0]"),
      using = "posteriors") 
Table 1: Replication of Chickering and Pearl (1996).
| query           | given           | mean | sd   | cred.low.2.5% | cred.high.97.5% |
|-----------------|-----------------|------|------|---------------|-----------------|
| Y[X=1] - Y[X=0] | -               | 0.55 | 0.10 | 0.37          | 0.73            |
| Y[X=1] - Y[X=0] | X==0 & Y==0     | 0.64 | 0.15 | 0.37          | 0.89            |
| Y[X=1] - Y[X=0] | X[Z=1] > X[Z=0] | 0.70 | 0.05 | 0.59          | 0.80            |