We can also pose the pathway query at the population level:
What is the share of cases for which \(X\) has a positive effect on \(Y\) through \(M\)?
A question about joint \(\lambda^M\) and \(\lambda^Y\) distributions
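As a rough illustration (not a query from these slides), a pathway-style population query can be posed in CausalQueries; here the effect "through \(M\)" is operationalized, hypothetically, as \(X\) raising \(M\) and \(M\) raising \(Y\):

```r
# Hedged sketch: one possible operationalization of "X has a positive
# effect on Y through M" as (X raises M) and (M raises Y);
# other definitions of a pathway effect exist
library(CausalQueries)

make_model("X -> M -> Y") |>
  query_model(
    query = "(M[X=1] > M[X=0]) & (Y[M=1] > Y[M=0])",
    using = "priors"
  )
```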
2 Updating Models: Intuition
Alan
2.1 How will we answer questions about populations?
We need to learn about those \(\lambda\)’s
About the proportions of the population with different kinds of causal effects
We will have prior beliefs about those proportions
Then, when we see data on lots of cases, we will update our beliefs about proportions
From a prior distribution over \(\lambda\) \(\Rightarrow\) a posterior distribution over \(\lambda\)
2.2 How do we “update” our models?
We’ve talked about process tracing a single case to answer a case-level query
Here the model is fixed
We use the model + case data to answer questions about the case
We can also use data to “update” our models
Use data on many cases to learn about causal effects in the population
Allows mixing methods: with data on many cases, we can learn about the probative value of process-tracing evidence
The core logic: we learn by updating population-level causal beliefs toward beliefs more consistent with the data
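A stylized sketch of this updating logic for a single share (hypothetical data: 7 of 10 cases display the pattern of interest):

```r
# Grid updating for one share lambda: flat prior, binomial likelihood,
# posterior weights proportional to prior * likelihood
lambda     <- seq(0, 1, by = 0.01)                 # candidate shares
prior      <- rep(1, length(lambda))               # flat prior
likelihood <- dbinom(7, size = 10, prob = lambda)  # hypothetical data
posterior  <- prior * likelihood / sum(prior * likelihood)
```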
2.3 Start with a DAG
2.4 Large-\(N\) estimation of \(ATE\): what happens to beliefs over parameters
Say we only collect data on \(I\), \(M\), and \(D\) for a large number of cases
We update on \(\lambda^I\), \(\lambda^M\), and \(\lambda^D\) to place more weight on values that are likely to give rise to the pattern of data we see
We will place more posterior weight on joint distributions of \(\lambda^M\) and \(\lambda^D\) that are in line with the data, and less on all other combinations
Question: What would you infer if you saw a very high correlation between \(I\) and \(D\) and a low correlation between \(I\) and \(M\)? Or a low correlation between \(M\) and \(D\)?
3 General procedure
Macartan
3.1 Bayes rule again
Key insight:
If we suppose a given set of parameter values, we can figure out the likelihood of the data given those values.
We can do this for all possible parameter values and see which ones are more in line with the data
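For instance (a made-up numerical illustration), the same data counts are far more likely under event probabilities close to the observed frequencies than under uniform probabilities:

```r
# Likelihood of the same data under two candidate event-probability vectors
data_counts <- c(40, 10, 10, 40)
dmultinom(data_counts, prob = c(.4, .1, .1, .4))      # close to data: higher
dmultinom(data_counts, prob = c(.25, .25, .25, .25))  # uniform: much lower
```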
Consider this joint distribution with binary \(X\) and binary \(Y\):

|       | Y = 0 | Y = 1 |
|-------|-------|-------|
| X = 0 | \(\lambda_{01}/2 + \lambda_{00}/2\) | \(\lambda_{10}/2 + \lambda_{11}/2\) |
| X = 1 | \(\lambda_{10}/2 + \lambda_{00}/2\) | \(\lambda_{01}/2 + \lambda_{11}/2\) |
Reminder: \(\lambda_{10}\) is the share with negative effects, \(\lambda_{01}\) the share with positive effects, \(\lambda_{00}\) the share for whom \(Y = 0\) regardless, and \(\lambda_{11}\) the share for whom \(Y = 1\) regardless
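These cell probabilities can be computed directly; a small illustrative helper (using the \(a = \lambda_{10}\), \(b = \lambda_{01}\), \(c = \lambda_{00}\), \(d = \lambda_{11}\) labels that the code further below also uses):

```r
# Event probabilities implied by the shares, assuming X is assigned
# with probability 1/2
event_probs <- function(a, b, c, d) {
  c(p00 = (b + c) / 2,   # X = 0, Y = 0
    p01 = (a + d) / 2,   # X = 0, Y = 1
    p10 = (a + c) / 2,   # X = 1, Y = 0
    p11 = (b + d) / 2)   # X = 1, Y = 1
}

event_probs(a = .1, b = .5, c = .2, d = .2)
```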
3.2.1 Causal inference on a grid: strategy
Say we now had (finite) data filling out this table. What posteriors should we form over \(\lambda_{10},\lambda_{01},\lambda_{00},\lambda_{11}\)?
|       | Y = 0      | Y = 1      |
|-------|------------|------------|
| X = 0 | \(n_{00}\) | \(n_{01}\) |
| X = 1 | \(n_{10}\) | \(n_{11}\) |
Let's start with a flat prior over the shares and then update over possible shares based on the data.
This time we will start with draws of possible shares and compute a posterior weight for each drawn share.
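The construction of the grid `x` is not shown on these slides; here is a minimal sketch, assuming 10,000 uniform draws from the simplex (i.e., a flat Dirichlet prior):

```r
# Draw candidate share vectors (a, b, c, d) uniformly from the simplex:
# normalized Exp(1) draws are distributed Dirichlet(1, 1, 1, 1)
library(dplyr)

set.seed(1)
n_draws <- 10000
draws <- matrix(rexp(4 * n_draws), ncol = 4)
draws <- draws / rowSums(draws)        # normalize rows to sum to 1
colnames(draws) <- c("a", "b", "c", "d")
x <- as_tibble(draws)
```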
3.2.2 Causal inference on a grid: likelihood
\[
\Pr(n_{00}, n_{01}, n_{10}, n_{11} \mid \lambda_{10},\lambda_{01},\lambda_{00},\lambda_{11}) =
f_{\text{multinomial}}\left( n_{00}, n_{01}, n_{10}, n_{11} \mid \sum n, w \right)
\] where:
\[w = \left(\frac12(\lambda_{01} + \lambda_{00}), \frac12(\lambda_{10}+\lambda_{11}), \frac12(\lambda_{10}+\lambda_{00}), \frac12(\lambda_{01}+\lambda_{11})\right)\] The weights are just the probabilities of the four data types, given \(\lambda\).
```r
# add likelihood and calculate posterior
x <- x |>
  rowwise() |>   # ensures row-wise operations (dmultinom is not vectorized)
  mutate(
    likelihood = dmultinom(
      # counts ordered as (X=0,Y=0), (X=1,Y=0), (X=0,Y=1), (X=1,Y=1)
      c(400, 100, 100, 400),
      prob = c(b + c, a + c, a + d, b + d) / 2
    )
  ) |>
  ungroup() |>
  mutate(posterior = likelihood / sum(likelihood))
```
3.2.5 Causal inference on a grid: execution
| a    | b    | c    | d    | likelihood | posterior |
|------|------|------|------|------------|-----------|
| 0.30 | 0.17 | 0.53 | 0.00 | 2.10e-212  | 1.72e-209 |
| 0.50 | 0.21 | 0.29 | 0.00 | 1.26e-221  | 1.03e-218 |
| 0.11 | 0.38 | 0.13 | 0.39 | 7.80e-46   | 6.38e-43  |
| 0.63 | 0.02 | 0.14 | 0.20 | 0.00e+00   | 0.00e+00  |
| 0.48 | 0.09 | 0.18 | 0.24 | 1.97e-231  | 1.61e-228 |
| 0.27 | 0.07 | 0.52 | 0.15 | 3.99e-194  | 3.26e-191 |
3.2.6 Causal inference on a grid: inferences
We calculate queries like this:
```r
x |>
  summarize(
    a = weighted.mean(a, posterior),
    b = weighted.mean(b, posterior),
    ATE = b - a
  ) |>
  kable(digits = 2)
```
| a   | b    | ATE  |
|-----|------|------|
| 0.1 | 0.69 | 0.59 |
3.2.7 Causal inference on a grid: inferences
```r
x |>
  ggplot(aes(b, a, size = posterior)) +
  geom_point(alpha = .5)
```
Spot the ridge: the data pin down the difference \(b - a\) (the ATE) but not \(a\) and \(b\) separately, so posterior weight concentrates along a line of constant \(b - a\)
3.3 In sum: learning from data
For any data pattern, we gain confidence in parameter values more consistent with the data
For single-case inference, we must bring background beliefs about population-level causal effects
For multiple cases, we can learn about effects from the data
Large-\(N\) data can thus provide probative value for small-\(N\) process-tracing
All inference is conditional on the model
4 Mixed methods
Alan
Combining wide and deep data
4.1 A DAG
We’ll want to learn about the \(\theta\)’s and the \(\lambda\)’s
We need to observe nodes to learn about other nodes
We can potentially observe 3 nodes here: \(X, M\), and \(Y\)
4.2 A typical “quantitative” data structure
Data on exogenous variables and a key outcome for many cases
E.g., data on inequality (\(I\)) and democracy (\(D\)) for many cases
4.3 A typical “qualitative” data structure
Data on exogenous variables and a key outcome plus elements of process for a small number of cases
E.g., data on inequality (\(I\)), mass mobilization (\(M\)), and democracy (\(D\)) for a small number of cases
4.4 Mixing qualitative and quantitative
What if we combine extensive data on many cases with intensive data on a few cases?
A non-rectangular data structure
Finite resources \(\Rightarrow\) we can’t always go “deep” on all cases
4.5 Non-rectangular data
A data structure that neither standard quantitative nor standard qualitative approaches can handle in a systematic way
Not a problem for the Integrated Inferences approach
We simply ask:
Which causal effects in the population are most and least consistent with the data pattern we observe?
That is, what distribution of causal effects in the population, for each node, is most consistent with this data pattern?
CausalQueries uses information wherever it finds it
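A minimal sketch (with hypothetical data) of how such non-rectangular data enter CausalQueries: "deep" observations of \(M\) exist for only some cases, and the rest are simply `NA`:

```r
# Wide data on I and D for all cases; deep data on M for a subset
library(CausalQueries)

wide_deep_data <- data.frame(
  I = c(0, 1, 1, 0, 1),
  M = c(NA, NA, 1, NA, 0),  # M observed for only two cases
  D = c(0, 1, 1, 0, 1)
)

model <- make_model("I -> M -> D") |>
  update_model(wide_deep_data)
```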
4.6 Mixing in practice
For Bayesian approaches this mixing is not hard.
Critically, though, we maintain the assumption that cases selected for "in-depth" analysis are chosen at random; otherwise we would have to account for the selection process.
What is the probability of seeing these two cases?
```r
posterior_distribution
```

```
Summary statistics of model parameters posterior distributions:

Distributions matrix dimensions are
4000 rows (draws) by 6 cols (parameters)

     mean   sd
X.0  0.48 0.02
X.1  0.52 0.02
Y.00 0.28 0.07
Y.10 0.12 0.07
Y.01 0.50 0.07
Y.11 0.11 0.07
```
7.5 Model, update, query
```r
model |>
  grab("posterior_distribution") |>
  ggplot(aes(Y.01, Y.10)) +
  geom_point(alpha = .2)
```
Posterior draws
7.6 Model, update, query
```r
model |>
  query_model(
    query = c(
      ATE  = "Y[X=1] - Y[X=0]",
      POS  = "Y[X=1] > Y[X=0]",
      SOME = "Y[X=1] != Y[X=0]"
    ),
    using = c("priors", "posteriors")
  ) |>
  plot()
```
7.7 Generalization: Procedure
The CausalQueries approach generalizes to settings in which nodes are categorical:
Identify all principal strata: that is, the universe of possible response types or “causal types”: \(\theta\)
Define as parameters of interest the probability of each of these response types: \(\lambda\)
Place a prior over \(\lambda\): e.g. Dirichlet
Figure out \(\Pr(\text{Data} \mid \lambda)\)
Use Stan to figure out \(\Pr(\lambda \mid \text{Data})\); a minimal sketch follows below
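A minimal sketch of this workflow (hypothetical data; CausalQueries supplies Dirichlet priors by default and calls Stan inside `update_model()`):

```r
library(CausalQueries)

model <- make_model("X -> Y")  # defines response types theta and shares lambda

model <- update_model(
  model,
  data = data.frame(X = c(0, 0, 1, 1), Y = c(0, 1, 0, 1))
)

query_model(model, query = "Y[X=1] - Y[X=0]", using = "posteriors")
```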
7.8 Generalization: Procedure
Also possible when there is unobserved confounding
…where a dotted line means that the response types of the two connected nodes are not independent
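In CausalQueries this is declared with a bidirected edge, as in the lipids model below; a minimal example:

```r
# "X <-> Y" allows the response types of X and Y to be correlated
# (unobserved confounding)
make_model("X -> Y; X <-> Y") |> plot()
```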
8 Extra slides
8.1 Illustration: “Lipids” data
Example of an IV model. What are the principal strata (response types)? What relations of conditional independence are implied by the model?
data("lipids_data")lipids_data |>kable()
| event  | strategy | count |
|--------|----------|-------|
| Z0X0Y0 | ZXY      | 158   |
| Z1X0Y0 | ZXY      | 52    |
| Z0X1Y0 | ZXY      | 0     |
| Z1X1Y0 | ZXY      | 23    |
| Z0X0Y1 | ZXY      | 14    |
| Z1X0Y1 | ZXY      | 12    |
| Z0X1Y1 | ZXY      | 0     |
| Z1X1Y1 | ZXY      | 78    |
Note that in compact form we simply record the number of units (“count”) that display each possible pattern of outcomes on the three variables (“event”).[^1]
8.2 Model
model <-make_model("Z -> X -> Y; X <-> Y") model |>plot()
8.3 Updating and querying
Queries can be conditioned on observable or counterfactual quantities
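For example (a sketch, assuming the model has been updated), the `given` argument conditions a query on observed or counterfactual quantities:

```r
# ATE among cases observed with X = 1 and Y = 1, and among cases for
# which Y would be 0 if X were 0 (a counterfactual condition)
model |>
  query_model(
    query = "Y[X=1] - Y[X=0]",
    given = c("X==1 & Y==1", "Y[X=0]==0"),
    using = "posteriors"
  )
```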