Chapter 8 Process tracing
8.1 What to infer from what
The simplest application of the CausalQueries
package is to figure out what inferences to make about a case upon observing within-case data, given a model. One might observe many pieces of evidence and have to figure out how to update from these jointly.
In Integrated Inferences we explore an inequality-democratization model where for a case with low inequality and democratization (say) one is interested in whether the democratization was due to the low inequality. In the simple model, inequality can give rise to popular mobilization which in turn forces democratization; or alternatively, inequality could prevent democratization by generating a threat from elites. In addition other forces, such as international pressure, could give rise to democratization. The question is: how do we update on our beliefs that low inequality caused democratization when we observe mobilization or international pressure?
model <- make_model("I -> M -> D <- I; P -> D") |>
set_restrictions(c(
"(M[I=1] < M[I=0])",
"(D[I=1] > D[I=0]) | (D[M=1] < D[M=0]) | (D[P=1] < D[P=0])"))
We can read inferences directly from query_model
:
query_model(model,
query = list(`I = 0 caused D = 1` = "D[I=1] != D[I=0]"),
using = "parameters",
given = c("I==0 & D==1",
"I==0 & D==1 & M==0",
"I==0 & D==1 & M==1",
"I==0 & D==1 & P==0",
"I==0 & D==1 & P==1",
"I==0 & D==1 & M == 0 & P==0",
"I==0 & D==1 & M == 1 & P==0",
"I==0 & D==1 & M == 0 & P==1",
"I==0 & D==1 & M == 1 & P==1")) |> kable()
model | query | given | using | case_level | mean | sd | cred.low.2.5% | cred.high.97.5% |
---|---|---|---|---|---|---|---|---|
model_1 | I = 0 caused D = 1 | I==0 & D==1 | parameters | FALSE | 0.4384 | 0.4384 | 0.4384 | |
model_1 | I = 0 caused D = 1 | I==0 & D==1 & M==0 | parameters | FALSE | 0.4750 | 0.4750 | 0.4750 | |
model_1 | I = 0 caused D = 1 | I==0 & D==1 & M==1 | parameters | FALSE | 0.3939 | 0.3939 | 0.3939 | |
model_1 | I = 0 caused D = 1 | I==0 & D==1 & P==0 | parameters | FALSE | 0.6154 | 0.6154 | 0.6154 | |
model_1 | I = 0 caused D = 1 | I==0 & D==1 & P==1 | parameters | FALSE | 0.3404 | 0.3404 | 0.3404 | |
model_1 | I = 0 caused D = 1 | I==0 & D==1 & M == 0 & P==0 | parameters | FALSE | 0.6667 | 0.6667 | 0.6667 | |
model_1 | I = 0 caused D = 1 | I==0 & D==1 & M == 1 & P==0 | parameters | FALSE | 0.5714 | 0.5714 | 0.5714 | |
model_1 | I = 0 caused D = 1 | I==0 & D==1 & M == 0 & P==1 | parameters | FALSE | 0.3929 | 0.3929 | 0.3929 | |
model_1 | I = 0 caused D = 1 | I==0 & D==1 & M == 1 & P==1 | parameters | FALSE | 0.2632 | 0.2632 | 0.2632 |
We see in this example that learning about a rival cause—the moderator \(P\) (international pressure)—induces larger changes in beliefs than learning about the mediator, \(M\) (mobilization). The two clues substitute for each other marginally.
The importance of different clues depends however on what one wants to explain. In the next analysis, we see that if we want to know if inequality explained democratization, learning that \(M=0\) has a large impact on beliefs.
model | query | given | using | case_level | mean | sd | cred.low.2.5% | cred.high.97.5% |
---|---|---|---|---|---|---|---|---|
model_1 | I = 1 caused D = 1 | I==1 & D==1 | parameters | FALSE | 0.1277 | 0.1277 | 0.1277 | |
model_1 | I = 1 caused D = 1 | I==1 & D==1 & M==0 | parameters | FALSE | 0.0000 | 0.0000 | 0.0000 | |
model_1 | I = 1 caused D = 1 | I==1 & D==1 & M==1 | parameters | FALSE | 0.1500 | 0.1500 | 0.1500 | |
model_1 | I = 1 caused D = 1 | I==1 & D==1 & P==0 | parameters | FALSE | 0.2308 | 0.2308 | 0.2308 | |
model_1 | I = 1 caused D = 1 | I==1 & D==1 & P==1 | parameters | FALSE | 0.0882 | 0.0882 | 0.0882 |
Note that inferences are taken here based on the model made by make_model
, without any updating of the model using data. In this sense the approach simply makes the model used for process tracing explicit, but it does not justify. It is possible however to first update a model using data from many cases and then use the updated model to draw inferences about a single case.
8.2 Probative value and \(d\)-separation
Observation of a node (a “clue”) is potentially informative for a query when it is not \(d\)-separated1 from query-relevant nodes (See Integrated Inferences, Ch 6).
An implication of this is that the observation of some nodes may render other nodes more or less informative. From the graph alone you can sometimes tell when additional data will be uninformative for a query.
To wit:
model <- make_model("X -> Y -> S <- W") |>
set_restrictions(complements("Y", "W", "S"), keep = TRUE)
plot(model)
query_model(model,
query = "Y[X=1] > Y[X=0]",
using = "parameters",
given = c("X==1",
"X==1 & W==1",
"X==1 & S==1",
"X==1 & S==1 & W==1",
"X==1 & Y==1",
"X==1 & W==1 & S==1 & Y==1")) |> kable()
model | query | given | using | case_level | mean | sd | cred.low.2.5% | cred.high.97.5% |
---|---|---|---|---|---|---|---|---|
1 | Y[X=1] > Y[X=0] | X==1 | parameters | FALSE | 0.25 | 0.25 | 0.25 | |
2 | Y[X=1] > Y[X=0] | X==1 & W==1 | parameters | FALSE | 0.25 | 0.25 | 0.25 | |
3 | Y[X=1] > Y[X=0] | X==1 & S==1 | parameters | FALSE | 0.25 | 0.25 | 0.25 | |
4 | Y[X=1] > Y[X=0] | X==1 & S==1 & W==1 | parameters | FALSE | 0.40 | 0.40 | 0.40 | |
5 | Y[X=1] > Y[X=0] | X==1 & Y==1 | parameters | FALSE | 0.50 | 0.50 | 0.50 | |
6 | Y[X=1] > Y[X=0] | X==1 & W==1 & S==1 & Y==1 | parameters | FALSE | 0.50 | 0.50 | 0.50 |
In this example \(W\) is not informative for the \(X\) causes \(Y\) query (a query about \(\theta^Y\), a parent of \(Y\)), when \(Y\) and \(S\) are unobserved (Row 1 = Row 3). It becomes informative, however, when \(S\), a symptom of \(Y\), is observed (Row 3 \(\neq\) Row 4). But when \(Y\) is observed neither \(S\) nor \(W\) are informative (Row 5 = Row 6).
The reason is that \(W\) is \(d\)-separated from \(\theta^Y\) when \(Y\) and \(S\) are unobserved. But \(S\) is a “collider” for \(Y\) and \(W\) and so \(W\) becomes informative about \(Y\) once \(S\) is observed, and hence of \(\theta^Y\) (so long as \(Y\) is unobserved). When \(Y\) is observed however now \(S\) and \(W\) become \(d\)-separated from \(\theta^Y\) and neither is informative.
8.3 Foundations for Van Evera’s tests
Students of process tracing often refer to a set of classical “qualitative tests” that are used to link within-case evidence to inferences around specific (often case-level) hypotheses. The four classical tests as described by Collier (2011) and drawing on Van Evera (1997) are “smoking gun” tests, “hoop” tests, “doubly decisive” tests, and “straw-in-the-wind” tests. A hoop test is one which, if failed, bodes especially badly for a claim; a smoking gun test is one that bodes well for a hypothesis if passed; a doubly decisive test is strongly conclusive no matter what is found, and a straw-in-the-wind test is suggestive, though not conclusive, either way.
In some treatments (such as Humphreys and Jacobs (2015)) formalization involves specifying a prior that a hypothesis is true and an independent set of beliefs about the probability of seeing some data if the hypothesis is true and if it is false. Then updating proceeds using Bayes’ rule.
This simple approach suffers from two related weaknesses however: first, there is no good reason to expect these probabilities to be independent; second, there is nothing in the set-up to indicate how beliefs around the probative value of clues can be established or justified.
Both of these problems are easily resolved if the problem is articulated using fully specified causal models.
Many different causal models might justify Van Evera’s tests. We illustrate using one in which the requisite background knowledge to justify the tests can be derived from a factorial experiment and in which one treatment serves as a clue for the effect of another.
For the illustration we first make use of a function that generates data from a model with a constrained set of types for \(Y\) and a given prior distribution over clue \(K\).
van_evera_data <- function(y_types, k_types)
make_model("X -> Y <- K") |>
set_restrictions(labels = list(Y = y_types), keep = TRUE) |>
set_parameters(param_type = "define", node = "K", parameters = c(1 - k_types, k_types)) |>
make_data(n = 1000)
We then use a function that draws inferences, given different values of a clue \(K\), from a model that has been updated using available data. Note that the model that is updated has no constraints on \(Y\), has flat beliefs over the distribution of \(K\), and imposes no assumption that \(K\) is informative for how \(Y\) reacts to \(X\).
van_evera_inference <- function(data)
make_model("X -> Y <- K") |>
update_model(data = data) |>
query_model(query = "Y[X=1] > Y[X=0]",
given = c(TRUE, "K==0", "K==1"),
using = "posteriors")
We can now generate posterior beliefs, given \(K\), for different types of tests where the tests are now justified by different types of data, coupled with a common prior causal model.
Results:
doubly_decisive <- van_evera_data("0001", .5) |> van_evera_inference()
hoop <- van_evera_data(c("0001", "0101"), .9) |> van_evera_inference()
smoking_gun <- van_evera_data(c("0001", "0011"), .1) |> van_evera_inference()
straw_in_wind <- van_evera_data(c("0001", "0101", "0011"), .5) |> van_evera_inference()
model | query | given | using | case_level | mean | sd | cred.low.2.5% | cred.high.97.5% |
---|---|---|---|---|---|---|---|---|
model_1 | Y[X=1] > Y[X=0] | - | posteriors | FALSE | 0.4912 | 0.0159 | 0.4597 | 0.5214 |
model_1 | Y[X=1] > Y[X=0] | K==0 | posteriors | FALSE | 0.0097 | 0.0053 | 0.0025 | 0.0229 |
model_1 | Y[X=1] > Y[X=0] | K==1 | posteriors | FALSE | 0.9755 | 0.0079 | 0.9575 | 0.9881 |
model | query | given | using | case_level | mean | sd | cred.low.2.5% | cred.high.97.5% |
---|---|---|---|---|---|---|---|---|
model_1 | Y[X=1] > Y[X=0] | - | posteriors | FALSE | 0.4455 | 0.0217 | 0.4031 | 0.4871 |
model_1 | Y[X=1] > Y[X=0] | K==0 | posteriors | FALSE | 0.0431 | 0.0264 | 0.0091 | 0.1097 |
model_1 | Y[X=1] > Y[X=0] | K==1 | posteriors | FALSE | 0.4964 | 0.0238 | 0.4511 | 0.5423 |
model | query | given | using | case_level | mean | sd | cred.low.2.5% | cred.high.97.5% |
---|---|---|---|---|---|---|---|---|
model_1 | Y[X=1] > Y[X=0] | - | posteriors | FALSE | 0.5373 | 0.0214 | 0.4955 | 0.5789 |
model_1 | Y[X=1] > Y[X=0] | K==0 | posteriors | FALSE | 0.4960 | 0.0231 | 0.4518 | 0.5422 |
model_1 | Y[X=1] > Y[X=0] | K==1 | posteriors | FALSE | 0.8967 | 0.0372 | 0.8089 | 0.9564 |
model | query | given | using | case_level | mean | sd | cred.low.2.5% | cred.high.97.5% |
---|---|---|---|---|---|---|---|---|
model_1 | Y[X=1] > Y[X=0] | - | posteriors | FALSE | 0.4944 | 0.0220 | 0.4520 | 0.5376 |
model_1 | Y[X=1] > Y[X=0] | K==0 | posteriors | FALSE | 0.3251 | 0.0298 | 0.2671 | 0.3847 |
model_1 | Y[X=1] > Y[X=0] | K==1 | posteriors | FALSE | 0.6714 | 0.0307 | 0.6101 | 0.7297 |
We see that these tests all behave as expected. Importantly, however, the approach to thinking about the tests is quite different to that described in Collier (2011) or Humphreys and Jacobs (2015). Rather than having a belief about the probative value of a clue, and a prior over a hypothesis, inferences are drawn directly from a causal model that relates a clue to possible causal effects. Critically, with this approach, the inferences made from observing clues can be justified by reference to a more fundamental, agnostic model, that has been updated in light of data. The updated model yields both a prior over the proposition, belief about probative values, and guidance for what conclusions to draw given knowledge of \(K\).
8.4 Clue selection: clues at the center of chains can be more informative
Model querying can also be used to assess which types of clues are more informative among a set of informative clues. Consider a chain linking \(X\) to \(Y\) via \(M_1\), \(M_2\), \(M_3\). To keep things simple let’s assume that the chain is monotonic: no node in the chain has a negative effect on the next node in the chain.
Which clue is most informative for the proposition that \(X\) caused \(Y\) in a case with \(X=Y=1\)?
In all case we will conclude that \(X\) did not cause \(Y\) if we see a 0 along the chain (since a 1 can not cause a 0). But what do we conclude if we see a 1?
model <- make_model("X -> M1 -> M2 -> M3 -> Y") |>
set_restrictions(labels = list(M1 = "10", M2 = "10", M3 = "10", Y = "10"))
In imposing monotonicity and using default parameter values we are assuming that the effect of each node on the next node is 1/3. What does this imply for our query? We get the answer using query_model
.
query_model(model,
query = "Y[X=1] > Y[X=0]",
given = c("X==1 & Y==1", "X==1 & Y==1 & M1==1", "X==1 & Y==1 & M2==1",
"X==1 & Y==1 & M3==1", "X==1 & Y==1 & M1==1 & M2==1 & M3==1"),
using= "parameters") |> kable()
model | query | given | using | case_level | mean | sd | cred.low.2.5% | cred.high.97.5% |
---|---|---|---|---|---|---|---|---|
model_1 | Y[X=1] > Y[X=0] | X==1 & Y==1 | parameters | FALSE | 0.0244 | 0.0244 | 0.0244 | |
model_1 | Y[X=1] > Y[X=0] | X==1 & Y==1 & M1==1 | parameters | FALSE | 0.0357 | 0.0357 | 0.0357 | |
model_1 | Y[X=1] > Y[X=0] | X==1 & Y==1 & M2==1 | parameters | FALSE | 0.0400 | 0.0400 | 0.0400 | |
model_1 | Y[X=1] > Y[X=0] | X==1 & Y==1 & M3==1 | parameters | FALSE | 0.0357 | 0.0357 | 0.0357 | |
model_1 | Y[X=1] > Y[X=0] | X==1 & Y==1 & M1==1 & M2==1 & M3==1 | parameters | FALSE | 0.0625 | 0.0625 | 0.0625 |
A couple of features are worth noting. First without any data our beliefs that \(X\) caused \(Y\) are quite low. This is due to the fact that even though the ATE at each step is reasonably large, the ATE over the whole chain is small, only \((1/3)^4)\) (incidentally, a beautiful number: 0.01234568).
Second we learn from which nodes we learn the most. We update most strongly from positive evidence on the middle mediator. One can also show that not only is there greater updating higher if a positive outcome is seen on the middle mediator, but the expected reduction in posterior variance is also greater (expected reduction in posterior variance takes account of the probability of observing different outcomes, which can also be calculated from the model given available data.)2
Last, while we update most strongly when we observe positive evidence on all steps, even that does not produce a large posterior probability that \(X=1\) caused \(Y=1\). Positive evidence on a causal chain is often not very informative. Explanations for this are in P. Dawid, Humphreys, and Musio (2019).
References
\(d\)-separation is a key idea in the study of directed acyclic graphs; for an introduction see \(d\)-separation without tears.↩︎
These quantities can be calculated by the
CQtools
package, still in alpha, via:CQtools::expected_learning(model, "Y[X=1] > Y[X=0]", given = "X==1 & Y==1", strategy = "M2")
↩︎