Causal models for qualitative and mixed methods inference

Mixed methods

Macartan Humphreys and Alan Jacobs

1 Population parameters

Alan

1.1 Population-level queries

  • Average causal effects for a population
    • e.g., What is the average effect of \(X\) on \(Y\)?
  • Proportion of different effects in a population
    • What share of cases in the population have positive effects?
    • What share have negative effects?
    • For what share of treated units with positive outcomes was the outcome due to the treatment?
  • Causal pathways
    • e.g., How commonly does \(X\) affect \(Y\) through \(M\) (vs. through \(W\)) in the population?

1.2 The population-level ATE

  • What is the average effect of \(X\) on \(Y\) in the population?
    • This is a question about the values in \(\lambda^Y\): the shares of the different causal types in the population.
    • In process tracing, these were fixed parameters of a causal model.
    • Key move: we now posit that we do not know \(\lambda\) and need to learn about it

1.3 The population-level ATE

  • With binary variables, the average effect is always the difference between
    • The share of cases with a positive effect
    • The share of cases with a negative effect

\[ATE = \lambda_{01} - \lambda_{10}\]
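As a quick numeric check (a Python sketch; the shares below are made up purely for illustration):

```python
# Hypothetical shares of the four causal types (illustrative values only):
# "10" = negative effect, "01" = positive effect,
# "00" = Y is 0 regardless of X, "11" = Y is 1 regardless of X.
lam = {"10": 0.1, "01": 0.5, "00": 0.2, "11": 0.2}

assert abs(sum(lam.values()) - 1) < 1e-9  # shares form a distribution

# ATE = share with positive effects minus share with negative effects
ate = lam["01"] - lam["10"]
print(ate)
```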

1.4 ATE, now with a mediator

Which types define the query depends, again, on the structural model.

To estimate the ATE in this mediation model, we need to:

  • Combine two ways of generating positive effects
  • Combine two ways of generating negative effects
  • Subtract the second from the first

1.5 ATE, now with a mediator

  • Two ways of generating positive effects: \(\lambda^M_{01} \times \lambda^Y_{01}\) + \(\lambda^M_{10} \times \lambda^Y_{10}\)
  • Two ways of generating negative effects: \(\lambda^M_{10} \times \lambda^Y_{01}\) + \(\lambda^M_{01} \times \lambda^Y_{10}\)
  • Subtract the second from the first
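A quick numeric check of this composition (a Python sketch; the \(\lambda^M\) and \(\lambda^Y\) shares are made up for illustration):

```python
# Illustrative (made-up) type shares for M (given X) and for Y (given M)
lamM = {"10": 0.1, "01": 0.6, "00": 0.2, "11": 0.1}
lamY = {"10": 0.1, "01": 0.5, "00": 0.2, "11": 0.2}

# Two chains produce a positive X -> Y effect: positive-positive and negative-negative
pos = lamM["01"] * lamY["01"] + lamM["10"] * lamY["10"]
# Two chains produce a negative effect: negative-positive and positive-negative
neg = lamM["10"] * lamY["01"] + lamM["01"] * lamY["10"]

ate = pos - neg
print(pos, neg, ate)
```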

1.6 Share of positive effects

Query: How often does \(X\) have a positive effect on \(Y\)?

  • Proportion of cases with positive effect = \(\lambda^Y_{01}\)

1.7 Share of positive effects

Query: How often does \(X\) have a positive effect on \(Y\)?

  • Proportion of cases with positive effect = \(\lambda^M_{01} \times \lambda^Y_{01}\) + \(\lambda^M_{10} \times \lambda^Y_{10}\)

1.8 Attribution

Query: How often is \(Y=1\) due to \(X=1\)?

  • Proportion of cases with positive effect among those with \(Y=1\), when \(X=1\):

\[Attribution = \frac{\lambda^Y_{01}}{\lambda^Y_{01} + \lambda^Y_{11}}\]
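The same calculation in Python (again with made-up shares):

```python
# Made-up shares of Y's causal types (illustrative only)
lamY = {"10": 0.1, "01": 0.5, "00": 0.2, "11": 0.2}

# Among treated cases with Y = 1 (types "01" and "11"), the share
# for whom the outcome is due to the treatment is the "01" fraction:
attribution = lamY["01"] / (lamY["01"] + lamY["11"])
print(attribution)
```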

1.9 Pathways

  • We can also pose the pathway query at population level
    • What is the share of cases for which \(X\) has a positive effect on \(Y\) through \(M\)?
  • A question about joint \(\lambda^M\) and \(\lambda^Y\) distributions

2 Updating Models: Intuition

Alan

2.1 How will we answer questions about populations?

  • We need to learn about those \(\lambda\)’s
    • About the proportions of the population with different kinds of causal effects
  • We will have prior beliefs about those proportions
  • Then, when we see data on lots of cases, we will update our beliefs about proportions
    • From a prior distribution over \(\lambda\) \(\Rightarrow\) a posterior distribution over \(\lambda\)

2.2 How do we “update” our models?

  • We’ve talked about process tracing a single case to answer a case-level query
    • Here the model is fixed
    • We use the model + case data to answer questions about the case
  • We can also use data to “update” our models
    • Use data on many cases to learn about causal effects in the population
  • Allows mixing methods: using data on lots of cases, we can learn about the probative value of process-tracing evidence
  • The core logic: we learn by updating population-level causal beliefs toward beliefs more consistent with the data

2.3 Start with a DAG

2.4 Large-\(N\) estimation of \(ATE\): what happens to beliefs over parameters

  • Say we only collect data on \(I\), \(M\), and \(D\) for a large number of cases
  • We update on \(\lambda^I\) , \(\lambda^M\) and \(\lambda^D\) to place more weight on values that are likely to give rise to the pattern of data we see
  • We will come to put more weight on a joint distribution of \(\lambda^M\) and \(\lambda^D\) in line with the data and less posterior weight on all other combinations

Question: What would you infer if you saw a very high correlation between \(I\) and \(D\) and a low correlation between \(I\) and \(M\)? Or a low correlation between \(M\) and \(D\)?

3 General procedure

Macartan

3.1 Bayes rule again

Key insight:

  • If we suppose a given set of parameter values, we can figure out the likelihood of the data given those values.
  • We can do this for all possible parameter values and see which ones are more in line with the data

That, with priors, is enough to update:

\[p(\lambda | D) = \frac{p(D | \lambda)p(\lambda)}{p(D)}\]

3.2 Illustration: Causal inference on a grid

Consider this joint distribution over binary \(X\) and binary \(Y\):

|           | \(Y = 0\)                           | \(Y = 1\)                           |
|-----------|-------------------------------------|-------------------------------------|
| \(X = 0\) | \(\lambda_{01}/2 + \lambda_{00}/2\) | \(\lambda_{10}/2 + \lambda_{11}/2\) |
| \(X = 1\) | \(\lambda_{10}/2 + \lambda_{00}/2\) | \(\lambda_{01}/2 + \lambda_{11}/2\) |

reminder: \(\lambda_{10}\) is share with negative effects, \(\lambda_{01}\) is share with positive effects…

3.2.1 Causal inference on a grid: strategy

Say we now had (finite) data filling out this table. What posteriors should we form over \(\lambda_{10},\lambda_{01},\lambda_{00},\lambda_{11}\)?

|           | \(Y = 0\)  | \(Y = 1\)  |
|-----------|------------|------------|
| \(X = 0\) | \(n_{00}\) | \(n_{01}\) |
| \(X = 1\) | \(n_{10}\) | \(n_{11}\) |

Let's start with a flat prior over the shares and then update over possible shares based on the data.

This time we will start with draws of possible shares and then compute a posterior weight for each drawn share.

3.2.2 Causal inference on a grid: likelihood

\[ \Pr(n_{00}, n_{01}, n_{10}, n_{11} \mid \lambda_{10},\lambda_{01},\lambda_{00},\lambda_{11}) = f_{\text{multinomial}}\left( n_{00}, n_{01}, n_{10}, n_{11} \mid \sum n, w \right) \] where:

\[w = \left(\frac12(\lambda_{01} + \lambda_{00}), \frac12(\lambda_{10}+\lambda_{11}), \frac12(\lambda_{10}+\lambda_{00}), \frac12(\lambda_{01}+\lambda_{11})\right)\] The weights are just the probability of each data type, given \(\lambda\).

why multinomial?
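Why multinomial? With \(n\) independent cases, each falling into one of the four \((X, Y)\) cells with the probabilities in \(w\), the cell counts follow a multinomial distribution. A quick Python check that \(w\) is a proper distribution over data types (with illustrative \(\lambda\) values):

```python
# Illustrative lambda values: negative, positive, always-0, always-1 shares
lam10, lam01, lam00, lam11 = 0.1, 0.5, 0.2, 0.2

# Probability of each data type (X0Y0, X0Y1, X1Y0, X1Y1), as in the text's w
w = [(lam01 + lam00) / 2,
     (lam10 + lam11) / 2,
     (lam10 + lam00) / 2,
     (lam01 + lam11) / 2]

print(w)
print(sum(w))  # the four data-type probabilities sum to 1
```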

3.2.3 Causal inference on a grid: execution

A prior draw with 10,000 possibilities:

# draw 10,000 candidate share vectors from a flat Dirichlet prior
x <- gtools::rdirichlet(10000, alpha = c(1,1,1,1)) |> 
  as.data.frame()

names(x) <- letters[1:4]

x |> head() |> kable(digits = 3)

| a     | b     | c     | d     |
|-------|-------|-------|-------|
| 0.296 | 0.173 | 0.528 | 0.003 |
| 0.504 | 0.207 | 0.286 | 0.003 |
| 0.106 | 0.378 | 0.129 | 0.387 |
| 0.634 | 0.020 | 0.142 | 0.204 |
| 0.479 | 0.093 | 0.184 | 0.244 |
| 0.266 | 0.071 | 0.516 | 0.147 |

  • we are using \(a,b,c,d\) as labels for \(\lambda_{10}, \lambda_{01}, \lambda_{00}, \lambda_{11}\)
  • we are using a handy distribution: the Dirichlet distribution
  • each row sums to 1 (each point (row) lies on the simplex)

3.2.4 Causal inference on a grid: execution

Imagine we had data (number of units with given values of X and Y):

\[n_{00} = 400, n_{01} = 100, n_{10} = 100, n_{11} = 400\]

Difference in means = 0.6.

Then we update (if we do it manually!) like this:

# add likelihood and calculate posterior;
# counts are in (X0Y0, X0Y1, X1Y0, X1Y1) order and the probabilities
# follow w, with a, b, c, d for lambda_10, lambda_01, lambda_00, lambda_11

x <- x |> 
  rowwise() |>  # dmultinom() is not vectorized, so compute row by row
  mutate(
    likelihood = dmultinom(
      c(400, 100, 100, 400),
      prob = c(b + c, a + d, a + c, b + d) / 2
    )
  ) |> 
  ungroup() |>
  mutate(posterior = likelihood / sum(likelihood))
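The same grid logic can be sketched in plain Python (standard library only; this is an illustrative re-implementation of the steps above, not the original code, and uses log-likelihoods to avoid underflow):

```python
import math
import random

random.seed(1)

# Draw 10,000 candidate share vectors (a, b, c, d) from a flat
# Dirichlet(1,1,1,1) prior via normalized Gamma(1,1) draws;
# a, b, c, d stand for lambda_10, lambda_01, lambda_00, lambda_11.
def rdirichlet_flat():
    g = [random.gammavariate(1, 1) for _ in range(4)]
    s = sum(g)
    return [v / s for v in g]

draws = [rdirichlet_flat() for _ in range(10_000)]

# Observed counts in (X0Y0, X0Y1, X1Y0, X1Y1) order
n = [400, 100, 100, 400]

# Multinomial log-likelihood up to a constant: sum_i n_i * log(w_i)
def loglik(a, b, c, d):
    w = [(b + c) / 2, (a + d) / 2, (a + c) / 2, (b + d) / 2]
    return sum(ni * math.log(wi) for ni, wi in zip(n, w))

lls = [loglik(*d) for d in draws]
m = max(lls)
wts = [math.exp(ll - m) for ll in lls]  # subtract max for stability
total = sum(wts)
post = [wt / total for wt in wts]

# Posterior means of the negative-effect (a) and positive-effect (b) shares
a_hat = sum(p * d[0] for p, d in zip(post, draws))
b_hat = sum(p * d[1] for p, d in zip(post, draws))
print(round(a_hat, 2), round(b_hat, 2), round(b_hat - a_hat, 2))
```

The posterior-weighted ATE estimate lands near the 0.6 difference in means, as in the R version.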

3.2.5 Causal inference on a grid: execution

| a    | b    | c    | d    | likelihood | posterior |
|------|------|------|------|------------|-----------|
| 0.30 | 0.17 | 0.53 | 0.00 | 2.10e-212  | 1.72e-209 |
| 0.50 | 0.21 | 0.29 | 0.00 | 1.26e-221  | 1.03e-218 |
| 0.11 | 0.38 | 0.13 | 0.39 | 7.80e-46   | 6.38e-43  |
| 0.63 | 0.02 | 0.14 | 0.20 | 0.00e+00   | 0.00e+00  |
| 0.48 | 0.09 | 0.18 | 0.24 | 1.97e-231  | 1.61e-228 |
| 0.27 | 0.07 | 0.52 | 0.15 | 3.99e-194  | 3.26e-191 |

3.2.6 Causal inference on a grid: inferences

We calculate queries like this:

x |> summarize(a = weighted.mean(a, posterior),
               b = weighted.mean(b, posterior),
               ATE = b - a) |>
  kable(digits = 2)
| a   | b    | ATE  |
|-----|------|------|
| 0.1 | 0.69 | 0.59 |

3.2.7 Causal inference on a grid: inferences

x |> ggplot(aes(b, a, size = posterior)) + geom_point(alpha = .5) 

Spot the ridge

3.3 In sum: learning from data

  • For any data pattern, we gain confidence in parameter values more consistent with the data
  • For single-case inference, we must bring background beliefs about population-level causal effects
  • For multiple cases, we can learn about effects from the data
  • Large-\(N\) data can thus provide probative value for small-\(N\) process-tracing
  • All inference is conditional on the model

4 Mixed methods

Alan

Combining wide and deep data

4.1 A DAG

  • We’ll want to learn about the \(\theta\)’s and the \(\lambda\)’s
  • We need to observe nodes to learn about other nodes
  • We can potentially observe 3 nodes here: \(X, M\), and \(Y\)

4.2 A typical “quantitative” data structure

  • Data on exogenous variables and a key outcome for many cases

  • E.g., data on inequality (\(I\)) and democracy (\(D\)) for many cases

4.3 A typical “qualitative” data structure

  • Data on exogenous variables and a key outcome plus elements of process for a small number of cases

  • E.g., data on inequality (\(I\)), mass mobilization (\(M\)), and democracy (\(D\)) for a small number of cases

4.4 Mixing qualitative and quantitative

  • What if we combine extensive data on many cases with intensive data on a few cases?
  • A non-rectangular data structure

  • Finite resources \(\Rightarrow\) we can’t always go “deep” on all cases

4.5 Non-rectangular data

  • A data structure that neither standard quantitative nor standard qualitative approaches can handle in a systematic way
  • Not a problem for the Integrated Inferences approach
  • We simply ask:
    • Which causal effects in the population are most and least consistent with the data pattern we observe?
  • That is, what distribution of causal effects in the population, for each node, is most consistent with this data pattern?
  • CausalQueries uses information wherever it finds it

4.6 Mixing in practice

For Bayesian approaches this mixing is not hard.

Critically, though, we maintain the assumption that cases for “in depth” analysis are chosen at random; otherwise we would have to account for the selection process.

What is the probability of seeing these two cases:

  1. \(X=1, M = 1, Y = 1\) case
  2. \(X=1, Y=1\) (no data on \(M\))

given parameters \(\lambda\)?

4.7 Mixing in practice

The probability of 1 is:

\[p_{111}= \lambda^X_1 \times (\lambda^M_{01} + \lambda^M_{11}) \times (\lambda^Y_{01} +\lambda^Y_{11})\]

The probability of 2 is:

\[p_{1?1} = \lambda^X_1\times \left((\lambda^M_{01} + \lambda^M_{11}) \times (\lambda^Y_{01} +\lambda^Y_{11}) + (\lambda^M_{10} + \lambda^M_{00}) \times (\lambda^Y_{10} +\lambda^Y_{11}) \right)\]

So the probability of this data is just:

\[p(D|\lambda) = p_{111} \times p_{1?1}\]
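A minimal Python check of these two formulas, with made-up \(\lambda\) values for an \(X \rightarrow M \rightarrow Y\) chain:

```python
# Illustrative lambda values (made up for this sketch)
lamX1 = 0.5                                          # Pr(X = 1)
lamM = {"10": 0.1, "01": 0.6, "00": 0.2, "11": 0.1}  # M's types given X
lamY = {"10": 0.1, "01": 0.5, "00": 0.2, "11": 0.2}  # Y's types given M

# Case 1: X = 1, M = 1, Y = 1 fully observed
p_111 = lamX1 * (lamM["01"] + lamM["11"]) * (lamY["01"] + lamY["11"])

# Case 2: X = 1, Y = 1 with M unobserved: sum over both values of M
p_1q1 = lamX1 * (
    (lamM["01"] + lamM["11"]) * (lamY["01"] + lamY["11"])
    + (lamM["10"] + lamM["00"]) * (lamY["10"] + lamY["11"])
)

# Probability of observing both (independent) cases
p_data = p_111 * p_1q1
print(p_111, p_1q1, p_data)
```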

4.8 Mixing in practice

Insight:

If we imagine possible parameter values, we can figure out the likelihood of any data type: quantitative, qualitative, or mixed.

That, with priors, is enough to update:

\[p(\lambda | D) = \frac{p(D | \lambda)p(\lambda)}{p(D)}= \frac{p(D | \lambda)p(\lambda)}{\int_{\lambda'}p(D|\lambda')p(\lambda')d\lambda'}\]

4.9 Why is this useful?

4.9.1 How qual can inform quant: confounding

Remember:

  • Say we just observe a positive Inequality-Democratization correlation
  • Could be because Inequality causes Democratization
  • Could be because of confounding

4.9.2 How qual can inform quant: confounding

  • Observing \(M\) helps
  • Process data helps address the deep problem of confounding
  • Key point: we don’t need \(M\) for all cases. Can learn from \(I\) and \(D\) for lots of cases and \(M\) for a subset.

4.9.3 How qual can inform quant: observable confounder

  • Another example: \(M\) as the confounder

4.9.4 How qual can inform quant: observable confounder

  • How much can we learn from \(M\) data for some cases?

4.9.5 How quant can inform qual: getting probative value of a clue from the data

  • Suppose we go to the field and we learn that mass mobilization DID occur in Malawi

    • So \(M=1\)
  • What can we conclude?

  • NOTHING YET!

4.9.6 How quant can inform qual: getting probative value of a clue from the data

  • The pure process-tracing solution: assign our beliefs about causal effects in the population
    • E.g., beliefs that linked positive effects are more likely than linked negative effects
    • Meaning that \(M=1\) in an \(I=1, D=1\) case speaks in favor of \(I=1\) causing \(D=1\)
  • The mixed-methods solution: learn about population-level effects from large-\(N\) data

4.9.7 How quant can inform qual: getting probative value of a clue from the data

  • Suppose we have data on \(I\), \(D\), and \(M\) for a large number of cases
  • Suppose we observe a strong positive correlation across all 3 variables
  • What have we learned, under this model?
    • Positive \(I \rightarrow M\) effects more likely than negative
    • Positive \(M \rightarrow D\) effects more likely than negative
  • So linked positive effects more common than linked negative effects
  • Meaning that \(M=1\) in an \(I=1, D=1\) case speaks in favor of \(I=1\) causing \(D=1\)

4.9.8 How quant can inform qual: getting probative value of a clue from the data

  • But now we’ve drawn our population-level beliefs from the data
  • Now, we can go and process-trace
    • Did high inequality cause democratization in Malawi?
    • Observe \(M\)
  • With conclusions grounded in case-level AND population-level evidence

5 Example 1: Historical data

Alan

5.1 Application from the book: rule-of-law institutions and long-term growth

  • We start with flat priors over causal types
  • Gather data on all nodes for many cases

5.2 Rule-of-law and growth: process-tracing probative value from large-\(N\) data

5.3 Rule-of-law and growth: learning about confounding

  • We allowed for confounding between rule of law and growth
    • Mortality’s effects on institutions may be correlated with institutions’ effects on growth
  • We learn about that confounding from the data
    • Rule of law more often has a positive effect on growth where mortality has a negative effect on institutions
  • Consistent with selection effects:
    • When mortality is low, settlers make institutional choices in anticipation of their growth effects

5.4 Rule-of-law and growth: learning about confounding

What this looks like in our posteriors over nodal types:

A type where RoL has a positive effect on Growth is more common when Mortality has a negative effect on RoL.

6 Example 2: Impact evaluation

Macartan

6.1 Development Impact Evaluation Example

  • We (WZB team) are working with GIZ to understand the impact of a development intervention.

  • The implementers have a “theory of change” for how the intervention should work

  • We implement a mixed method approach in which we:

    1. Elicit the theory of change as a DAG from practitioners and local experts
    2. Gather priors on elements of the theory of change
    3. Coordinate with qualitative researchers to gather information about intermediate nodes on the DAG.
    4. Combine their insights with survey and administrative data to:
      • better understand program effects
      • better understand likely pathways

6.2 Development Impact Evaluation Example

6.3 Use of prior data

Priors are then used to implement four analyses:

  • causal queries implied by uniform priors
  • causal queries implied by stakeholder priors
  • posterior values of causal queries using a model in which data and uniform priors are used
  • posterior values of causal queries using a model in which data and stakeholder priors are used

6.4 Priors form

6.5 Qualitative data form

6.6 Updating on sub-DAGs

7 Mixed methods in CausalQueries

Macartan

7.1 Big picture

CausalQueries brings these elements together by allowing users to:

  1. Make model: specify a DAG; CausalQueries figures out all principal strata and places a prior on these
  2. Update model: provide data to the DAG; CausalQueries writes a Stan model and updates on all parameters
  3. Query updated model: CausalQueries figures out which parameters correspond to a given causal query

7.2 Illustration \(X \rightarrow Y\) model

Consider this problem:

|           | \(Y = 0\)  | \(Y = 1\)  |
|-----------|------------|------------|
| \(X = 0\) | \(n_{00}\) | \(n_{01}\) |
| \(X = 1\) | \(n_{10}\) | \(n_{11}\) |

where \(X\) is randomized and both \(X\) and \(Y\) are binary

7.3 Model, update, query

model <- make_model("X -> Y")

data = fabricate(
  N = 1000, 
  X = rbinom(N, 1, prob = .5),  
  Y = rbinom(N, 1, prob = .2 + .4*X))

data |> collapse_data(model) |> head()
  event strategy count
1  X0Y0       XY   370
2  X1Y0       XY   206
3  X0Y1       XY   105
4  X1Y1       XY   319
model <- model |> update_model(data)

7.4 Model, update, query

model |> inspect("posterior_distribution") 

posterior_distribution
Summary statistics of model parameters posterior distributions:

  Distributions matrix dimensions are 
  4000 rows (draws) by 6 cols (parameters)

     mean   sd
X.0  0.48 0.02
X.1  0.52 0.02
Y.00 0.28 0.07
Y.10 0.12 0.07
Y.01 0.50 0.07
Y.11 0.11 0.07

7.5 Model, update, query

model |> grab("posterior_distribution") |> 
  ggplot(aes(Y.01, Y.10)) + geom_point(alpha = .2)

Posterior draws

7.6 Model, update, query

model |> query_model(
  query = c(ATE = "Y[X=1] - Y[X=0]", 
            POS = "Y[X=1] > Y[X=0]", 
            SOME = "Y[X=1] != Y[X=0]" ),
  using = c("priors", "posteriors")) |>
  plot()

7.7 Generalization: Procedure

The CausalQueries approach generalizes to settings in which nodes are categorical:

  1. Identify all principal strata: that is, the universe of possible response types or “causal types”: \(\theta\)
  2. Define as parameters of interest the probability of each of these response types: \(\lambda\)
  3. Place a prior over \(\lambda\): e.g. Dirichlet
  4. Figure out \(\Pr(\text{Data} | \lambda)\)
  5. Use Stan to figure out \(\Pr(\lambda | \text{Data})\)

7.8 Generalization: Procedure

Also possible when there is unobserved confounding

…where dotted lines mean that the response types for two nodes are not independent

8 Extra slides

8.1 Illustration: “Lipids” data

Example of an IV model. What are the principal strata (response types)? What relations of conditional independence are implied by the model?

data("lipids_data")

lipids_data |> kable()
| event  | strategy | count |
|--------|----------|-------|
| Z0X0Y0 | ZXY      | 158   |
| Z1X0Y0 | ZXY      | 52    |
| Z0X1Y0 | ZXY      | 0     |
| Z1X1Y0 | ZXY      | 23    |
| Z0X0Y1 | ZXY      | 14    |
| Z1X0Y1 | ZXY      | 12    |
| Z0X1Y1 | ZXY      | 0     |
| Z1X1Y1 | ZXY      | 78    |

Note that in compact form we simply record the number of units (“count”) that display each possible pattern of outcomes on the three variables (“event”).

8.2 Model

model <- make_model("Z -> X -> Y; X <-> Y") 
model |> plot()

8.3 Updating and querying

Queries can be conditioned on observable or counterfactual quantities.

model |>
  update_model(lipids_data, refresh = 0) |>
  query_model(queries = c(
      ATE  = "Y[X=1] - Y[X=0]",
      PoC  = "Y[X=1] - Y[X=0] :|: X==0 & Y==0",
      LATE = "Y[X=1] - Y[X=0] :|: X[Z=1] > X[Z=0]"),
      using = "posteriors") 
Table 1: Replication of Chickering and Pearl (1996).
| query           | given           | mean | sd   | cred.low.2.5% | cred.high.97.5% |
|-----------------|-----------------|------|------|---------------|-----------------|
| Y[X=1] - Y[X=0] | -               | 0.55 | 0.10 | 0.37          | 0.73            |
| Y[X=1] - Y[X=0] | X==0 & Y==0     | 0.64 | 0.15 | 0.37          | 0.89            |
| Y[X=1] - Y[X=0] | X[Z=1] > X[Z=0] | 0.70 | 0.05 | 0.59          | 0.80            |