6 Theories as Causal Models
We embed the notion of a “theory” into the causal-models framework. We describe a conceptual hierarchy in which a theory is a “lower level” model that explains or justifies a “higher level” model. The approach has implications for the logical consistency of our inferences and for assessing when and how theory is useful for strengthening causal claims.
In Chapter 3, we described a set of theories and represented them as causal models. But so far we haven’t been very explicit about what we mean by a theory or how theory maps onto a causal-model framework.
In this book, we will think of theory as a type of explanation: A theory provides an account of how or under what conditions a set of causal relationships operate. We generally express both a theory and the claims being theorized as causal models: A theory is a model that implies another model—possibly with the help of some data.
To fix ideas: a simple claim might be that “A caused B in case j.” A theory that supports this claim might take any of the following forms:
- “A always causes B”
- “A always causes B whenever C, and C holds in case j”, or
- “A invariably causes M and invariably M causes B”.
All of these theories have in common that they are arguments that could be provided to support the simple claim that A causes B in a particular case. In each case, if you believe the theory, you believe the implication.
We can also think about theoretical implications in probabilistic terms. Suppose that we start with a simple claim of the form “A likely caused B in case j.” A theory implies this claim if, under the theory—perhaps in combination with data on the case—the probability that A caused B in the case is high.
The rest of this chapter builds out this logic and uses it to provide a way of characterizing when a theory is useful or not.
In the first section, we consider multiple senses in which one model might imply, and thus serve as a theory of, another model.
First, we consider how one causal structure can imply (serve as a theory of) another causal structure, by including additional detail that explains how or when causal effects in the other model will unfold. If structural model \(\mathcal M'\) implies structural model \(\mathcal M\), then \(\mathcal M'\) is a theory of \(\mathcal M\).

We then turn to logical relations between probabilistic models. We show how the distributions over nodal types in a simpler model structure can be underwritten by distributions over nodal types in a more detailed model structure. Here, a claim about the prevalence (or probability) of causal effects in a causal network is justified by claims about the prevalence or probability of causal effects in a more granular rendering of that causal network.
Finally, we show how a probabilistic model plus data can provide a theoretical underpinning for a new, stronger model. The new model is again implied by another model, together with data.
In the second section, we consider how models-as-theories-of can be useful. In embedding theorization within the world of causal models, we ultimately have an empirical objective in mind. In our framework, theorizing a causal relationship of interest means elaborating our causal beliefs about the world in greater detail. As we show in later chapters, theorizing in the form of specifying underlying causal models allows us to generate research designs: to identify sources of inferential leverage and to explicitly and systematically link observations of components of a causal system to the causal questions we seek to answer. In this chapter, we point to ways in which the usefulness of theories can be assessed.
In the chapter’s third and final section, we discuss the connection between the kinds of theories we focus on—what might be called empirical theories—and analytic theories of the kind developed, for instance, by formal theorists. Moving from one to the other requires a translation, and we illustrate how this might be done by showing how we can generate a causal model from a game-theoretic model.
6.1 Models as Theories Of
Let us say that a causal model, \(\mathcal M'\), implies another causal model, \(\mathcal M\), if the claims embodied in \(\mathcal M\) can be derived from \(\mathcal M'\)—possibly in combination with data.

A theory, then, is a lower level model, \(\mathcal M'\), that implies a higher level model, \(\mathcal M\), of interest: \(\mathcal M'\) explains or justifies \(\mathcal M\).
Both structural models and probabilistic models—possibly in combination with data—imply other models.
6.1.1 Implications of Structural Causal Models
A structural model can imply multiple other simpler structural models. Similarly, a structural model can be implied by multiple more complex models.
Theorization often involves a refinement of causal types, implemented through the addition of nodes. Take the very simple model \(\mathcal M\): \(X \rightarrow Y\), as our higher level model.

What theories might justify \(\mathcal M\)? Figure 6.1 illustrates two possibilities.

Model \(\mathcal M'\) adds a mediation step: \(X \rightarrow M \rightarrow Y\). Here, any effect of \(X\) on \(Y\) operates through \(M\).

Model \(\mathcal M''\) adds a moderator: \(X \rightarrow Y \leftarrow K\). Here, the effect of \(X\) on \(Y\) may depend on the value of \(K\).

Both of these models imply \(\mathcal M\): if you believe either of them, you believe that \(Y\) is a function—possibly a null function—of \(X\).

Importantly, both \(\mathcal M'\) and \(\mathcal M''\) imply \(\mathcal M\), but neither is implied by it: the higher level model is silent about mediators and moderators.

As we move down a level, we can think of parts of the lower level model as accounting for parts of the higher level model.

So, for instance, in \(\mathcal M'\), the nodal types \(\theta^M\) and \(\theta^Y\) jointly determine the higher level nodal type \(\theta^Y\): \(X\) has a positive effect on \(Y\) in the higher level model if \(X\) has a positive effect on \(M\) and \(M\) has a positive effect on \(Y\), or if both of these effects are negative.

Consider next model \(\mathcal M''\). Here, the higher level nodal type \(\theta^Y\) is determined jointly by \(K\) and the lower level \(\theta^Y\): for instance, \(X\) might have a positive effect on \(Y\) when \(K = 1\) but no effect when \(K = 0\).
6.1.2 Probabilistic Models Implied by Lower Level Probabilistic Models
We used Figure 6.1 to show how one structural model can be implied by another. In the same way, one probabilistic model can be implied by another. If a higher level probabilistic model is to be implied by a lower level probabilistic model, consistency requires that the probability distributions over exogenous nodes for the higher level model are those that are implied by the distributions over the exogenous nodes in the lower level model.
To illustrate, let us add a distribution over the nodal types of the higher level model \(\mathcal M\) and of the lower level mediation model \(\mathcal M'\).

Recall that we use \(\lambda\) to denote the distribution over nodal types: for instance, \(\lambda^Y_{01}\) denotes the probability that \(Y\) responds positively to its parent—that is, that \(\theta^Y = \theta^Y_{01}\).
In \(\mathcal M\), the probability that \(X\) has a positive effect on \(Y\) is \(\lambda^Y_{01}\). In \(\mathcal M'\)—where \(\lambda^M\) now describes \(M\)’s response to \(X\) and \(\lambda^Y\) describes \(Y\)’s response to \(M\)—the probability that \(X\) has a positive effect on \(Y\) is \(\lambda^M_{01}\lambda^Y_{01} + \lambda^M_{10}\lambda^Y_{10}\). That is, it is the probability that we have a chain of linked positive effects plus the probability that we have a chain of linked negative effects—the two ways in which we can get a positive total effect of \(X\) on \(Y\) in this model.
Consistency then requires a particular equality between the distributions (\(\lambda\)) in the two models: the higher level quantity \(\lambda^Y_{01}\) must equal the lower level quantity \(\lambda^M_{01}\lambda^Y_{01} + \lambda^M_{10}\lambda^Y_{10}\).

In other words, the probability of a positive effect of \(X\) on \(Y\) in the higher level model must equal the combined probability of the two effect chains that produce a positive effect in the lower level model.
While the probability distributions in a lower level model must imply the probability distributions in the higher level model that it supports, the converse may not be true: Knowing the distribution over exogenous nodes of a higher level model does not provide sufficient information to recover distributions over exogenous nodes in the lower level model. So, for instance, knowing the higher level \(\lambda^Y_{01}\) does not tell us how likely it is that a positive effect runs through a chain of positive effects rather than a chain of negative effects in the lower level model.
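To make the consistency condition concrete, here is a minimal numeric sketch—ours, not the book’s, with illustrative nodal-type shares—that computes the higher level share \(\lambda^Y_{01}\) implied by a chain model and then exhibits a second, different lower level parameterization implying the very same higher level share:

```python
# Illustrative shares of the four nodal types ("01" = positive effect,
# "10" = negative, "00"/"11" = no effect) in the chain model X -> M -> Y.
lam_m = {"01": 0.5, "10": 0.1, "00": 0.2, "11": 0.2}  # M's response to X
lam_y = {"01": 0.4, "10": 0.2, "00": 0.2, "11": 0.2}  # Y's response to M

# Positive X -> Y effect: positive-positive or negative-negative chain.
print(lam_m["01"] * lam_y["01"] + lam_m["10"] * lam_y["10"])  # 0.22

# A different lower level parameterization that implies the same higher
# level share: the higher level model cannot distinguish the two.
lam_m2 = {"01": 0.2, "10": 0.3, "00": 0.3, "11": 0.2}
lam_y2 = {"01": 0.8, "10": 0.2, "00": 0.0, "11": 0.0}
print(lam_m2["01"] * lam_y2["01"] + lam_m2["10"] * lam_y2["10"])  # 0.22
```

The two parameterizations agree on the higher level quantity while disagreeing sharply about, for instance, how common negative \(M \rightarrow Y\) effects are—precisely the information lost in moving up a level.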
6.1.3 Models Justified by Theory and Data
Finally, we can think of a higher level model as being supported by a lower level model combined with data. For this reason, we can fruitfully think of an initial model—when coupled with data—as constituting a theory of an updated model.
To see how this might work, imagine a scholar arguing: “\(X\) caused \(Y\) in this case because \(X\) caused \(M\), and \(M\) in turn caused \(Y\).”

Here, the higher level claim—that \(X\) caused \(Y\)—is justified by a lower level mediation model together with case-level data on the mediator \(M\).

We can take this further. If pushed now as to why we should believe that \(X\) caused \(M\), or that \(M\) caused \(Y\), the scholar might appeal to a still lower level model that, combined with further data, justifies these claims in turn.

As further justifications are sought, researchers seek acceptable lower level models that, together with data, can justify higher level models. Note that, as we move down levels in this hierarchy of models, we may be—helpfully—moving from models that are harder to accept down to models that are easier to accept, because we are bringing data to bear. So, in the above example, it should be easier to accept the lower level mediation model before seeing data on \(M\) than to accept outright the claim that \(X\) caused \(Y\): the data carry us from the weaker premise to the stronger conclusion.
6.2 Gains from Theory
We now turn to consider how to think about whether a theory is useful. We are comfortable with the idea that theories, or models more generally, are wrong. Models are not full and faithful reflections of reality; they are maps designed for a particular purpose. We make use of them because we think that they help in some way.
But how do they actually help, and can we quantify the gains we get from using them?
We think we can.
6.2.1 Illustration: Gains from a Front-Door Theory
Here is an illustration with a theory that allows the use of the “front-door criterion” (Pearl 2009). The key idea is that by invoking a theory for a model—which itself may require justification—one can draw inferences that would not have been possible without the theory.
Imagine we have a structural causal model, \(\mathcal M\), in which \(X\) possibly affects \(Y\) but an unobserved confound affects both \(X\) and \(Y\).

Now let’s form a probabilistic causal model by placing a prior distribution over \(\mathcal M\)’s nodal types. Because of the unobserved confounding, data on \(X\) and \(Y\) alone cannot settle whether \(X\) causes \(Y\): we cannot tell how much of any observed correlation is causal and how much is due to the confound.

Suppose, however, that we now posit the lower level structural model \(\mathcal M'\), in which any effect of \(X\) on \(Y\) runs through a mediator, \(M\), and in which the confound does not affect \(M\). This is exactly the structure exploited by the front-door criterion.
Specifically, we can now:

- turn \(\mathcal M'\) into a probabilistic model by placing a distribution over its nodal types;
- use data on \(X\), \(M\), and \(Y\) to move to an updated version of this model; notably, data on the mediator may help us sort out whether the \(X, Y\) correlation is causal or a consequence of confounding;
- pose our causal question to the updated model, which has been informed by data on \(M\).
This perhaps all seems a bit convoluted, so it is fair to ask: What are the gains? This depends on the data we observe, of course. If we observe, for instance, that \(M\) is uncorrelated with \(X\), or that \(M\) is uncorrelated with \(Y\) given \(X\), we have grounds to conclude that the observed \(X, Y\) correlation is spurious rather than causal—an inference we could not have drawn from data on \(X\) and \(Y\) alone.

Thus, in return for specifying a theory of \(\mathcal M\)—and gathering data on the mediator that the theory posits—we are able to draw inferences that would not otherwise have been possible.
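The mechanics can be seen in a small self-contained sketch—ours, not the book’s; the generative probabilities are illustrative assumptions. An unobserved confound \(U\) drives both \(X\) and \(Y\), while \(X\) affects \(Y\) only through \(M\). Applying the front-door formula to the observable joint distribution over \(X\), \(M\), and \(Y\) recovers the true average effect, which the naive \(X\)-\(Y\) comparison overstates:

```python
from itertools import product

# Illustrative structural model: U confounds X and Y; X -> M -> Y.
p_u = {0: 0.5, 1: 0.5}                      # P(U = u)
p_x_u = {0: 0.1, 1: 0.9}                    # P(X = 1 | U = u)
p_m_x = {0: 0.2, 1: 0.8}                    # P(M = 1 | X = x)
p_y_mu = {(0, 0): 0.1, (0, 1): 0.5,         # P(Y = 1 | M = m, U = u)
          (1, 0): 0.6, (1, 1): 0.9}

def bern(p1, v):                            # P(V = v) when P(V = 1) = p1
    return p1 if v == 1 else 1 - p1

# Observable joint P(x, m, y), with the confound U marginalized out.
joint = {(x, m, y): sum(p_u[u] * bern(p_x_u[u], x) * bern(p_m_x[x], m)
                        * bern(p_y_mu[(m, u)], y) for u in (0, 1))
         for x, m, y in product((0, 1), repeat=3)}

def pr(**kw):                               # observable (marginal) probability
    return sum(v for (x, m, y), v in joint.items()
               if all({"x": x, "m": m, "y": y}[k] == val
                      for k, val in kw.items()))

def front_door(x_val):
    """P(Y = 1 | do(X = x)) = sum_m P(m|x) sum_x' P(x') P(Y = 1|m, x')."""
    return sum((pr(x=x_val, m=m) / pr(x=x_val))
               * sum(pr(x=xp) * pr(x=xp, m=m, y=1) / pr(x=xp, m=m)
                     for xp in (0, 1))
               for m in (0, 1))

def truth(x_val):                           # interventional truth, using U
    return sum(p_u[u] * bern(p_m_x[x_val], m) * p_y_mu[(m, u)]
               for u in (0, 1) for m in (0, 1))

print(front_door(1) - front_door(0))        # 0.27: front-door estimate
print(truth(1) - truth(0))                  # 0.27: true average effect
print(pr(x=1, y=1) / pr(x=1)
      - pr(x=0, y=1) / pr(x=0))             # 0.55: naive, confounded contrast
```

Without the mediator theory, only the confounded 0.55 contrast is available; with it, the causal quantity becomes recoverable from observational data.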
In other situations, we might imagine invoking a theory that does not necessarily involve new data, but that allows us to make different, perhaps tighter inferences using the same data. An example might be the invocation of theory that involves a monotonicity restriction or exclusion restriction that allows for the identification of a quantity that would not be identifiable without the theory.
Thus, one reason to theorize our models—develop lower level models that make stronger claims—is to be able to reap greater inferential leverage from the more elaborated theory when we go to the data.
6.2.2 Quantifying Gains
Can we quantify how much better off we are?
We need some evaluation criterion—some notion of “better off”—to answer this question. Two of the more intuitive criteria might be based on:
- Error: An error-based evaluation asks whether the theory helped reduce the (absolute) difference between an estimate and a target; similarly, we might focus on squared error, which essentially places more weight on bigger errors.
- Uncertainty: We might instead assess gains in terms of reduced uncertainty. We might measure uncertainty using the variance of our beliefs, or we might use relative entropy to assess reductions in uncertainty.
Other criteria (or loss functions) might focus on other features. For instance, we might ask whether the data we see are explained by the theory, in the sense that they are more likely—less surprising—given the theory. Or we might want a criterion that takes account of the costs of collecting additional data or the risks associated with false conclusions. For instance, in Heckerman, Horvitz, and Nathwani (1991), an objective function is generated using the expected utility gains from diagnoses informed by new information over diagnoses based on what is already believed.
Beyond specifying a criterion, we can also approach any criterion from a “subjective” or an “objective” position. Are we concerned with how uncertain we will be as researchers, or do we seek to benchmark our inferences against the true state of the world?
We can, further, distinguish between evaluation from an ex ante or an ex post perspective. Are we evaluating how we expect to do under a given theory before we have seen the data, or how we have done after we have drawn our inferences? Table 6.1 shows how these two dimensions might be crossed to generate four different approaches to evaluating learning.
Table 6.1: Four approaches to evaluating learning.

| | Ex ante | Ex post |
|---|---|---|
| Subjective | Expected posterior variance | Posterior variance; change in beliefs; wisdom |
| Objective | Expected mean squared error | Error; squared error |
So, for instance, in the top left quadrant (subjective/ex ante), we are interested in how uncertain we expect to be if we work with a given theory; in the bottom left quadrant (objective/ex ante), we are asking how far off we expect to be from some ground truth. In the second column, we are asking how uncertain we are about an inference we have made (subjective/ex post), or about how far off we have ended up from a ground truth (objective/ex post).
We now use this framework to work through the gains from theory in a simple numeric example.

In this setup, we imagine that there is an unknown parameter, \(q \in \{0, 1\}\), representing the answer to our query, and a clue, \(K \in \{0, 1\}\), that we can go looking for in a case.

For the illustrations that follow, imagine that we start with a (subjective) prior that \(q = 1\) with probability 0.2; our prior variance is thus \(0.2 \times 0.8 = 0.16\).

Now, we have a theory under which we believe that the clue is informative about the query: specifically, that \(\Pr(K = 1 \mid q = 1) = 0.8\) and \(\Pr(K = 1 \mid q = 0) = 0.2\). Suppose, however, that the theory is in fact badly wrong: objectively, the clue is uninformative, appearing with probability 0.5 regardless of \(q\).

The key features of this example are summarized in Table 6.2. Each row here (or “event”) represents a different situation we might end up in: a different combination of what the true answer to our query is (\(q\)) and what value of the clue we observe (\(K\)). For each event, the table reports the probability we subjectively assign to it under our theory, the probability with which it objectively occurs, the inference on \(q\) we would draw upon observing \(K\), and the implied squared error and posterior variance.

Say, now, that we in fact observe \(K = 1\). Applying Bayes’ rule under our theory, we update our belief that \(q = 1\) from 0.2 to 0.5.

Let’s now think about different ways of characterizing the gains (or losses) from observing this clue with the help of the theory.
Table 6.2: Events in the numeric example: probabilities, inferences, and losses.

| \(q\) | \(K\) | Event probability (subjective) | Event probability (objective) | Inference on \(q\) | Actual (squared) error | Posterior variance |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.64 | 0.4 | 0.059 | 0.003 | 0.055 |
| 0 | 1 | 0.16 | 0.4 | 0.500 | 0.250 | 0.250 |
| 1 | 0 | 0.04 | 0.1 | 0.059 | 0.886 | 0.055 |
| 1 | 1 | 0.16 | 0.1 | 0.500 | 0.250 | 0.250 |
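As a check on these numbers, here is a short sketch—ours—that reproduces the rows of Table 6.2 from the example’s primitives, the prior of 0.2 and the theorized clue probabilities:

```python
prior = {0: 0.8, 1: 0.2}                    # subjective prior over q
lik = {0: 0.2, 1: 0.8}                      # theorized P(K = 1 | q)

def posterior_q1(k):
    """Pr(q = 1 | K = k) under the theory, by Bayes' rule."""
    joint = {q: prior[q] * (lik[q] if k == 1 else 1 - lik[q])
             for q in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

for q in (0, 1):
    for k in (0, 1):
        subj = prior[q] * (lik[q] if k == 1 else 1 - lik[q])
        obj = prior[q] * 0.5                # objectively, K is a coin flip
        guess = posterior_q1(k)             # inference on q
        print(q, k, round(subj, 3), round(obj, 3), round(guess, 3),
              round((guess - q) ** 2, 3),     # actual squared error
              round(guess * (1 - guess), 3))  # posterior variance
```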
6.2.2.1 Objective, ex post
If we are willing to posit an external ground truth, then we can define “better” in objective terms. For instance, we might calculate the size of the error (or, more typically, the squared error) we make in our conclusions relative to the ground truth. We can then compare the error we make when we use the theory (and the clue that that theory makes usable) to draw an inference to the error that we make when we draw an inference without the aid of the theory (and its associated clue).
The difference in squared errors is given by:

$$\left(q^* - \hat q_{\text{prior}}\right)^2 - \left(q^* - \hat q_{\text{posterior}}\right)^2,$$

where \(q^*\) is the ground truth, \(\hat q_{\text{prior}}\) is the guess we would have made without the theory (and its clue), and \(\hat q_{\text{posterior}}\) is the guess we make with them.

In the numeric example, our objective ex post (squared) error after observing \(K = 1\) is \((0.5 - q^*)^2 = 0.25\), whatever the true value of \(q\). Had we ignored the theory and stuck with our prior guess of 0.2, our squared error would have been 0.04 if \(q^* = 0\) and 0.64 if \(q^* = 1\). Whether the theory has helped us ex post thus depends on the unknown ground truth.
6.2.2.2 Objective, ex ante
Rather than asking how wrong we are given the data pattern we happened to observe, we can ask how wrong we are expected to be when we go looking for a clue that our theory makes usable (we say “are expected to be” rather than “we expect to be” because the evaluation may be made using beliefs that differ from the beliefs we bring with us when we draw inferences). An objective ex ante approach would ask what the expected error is from the conclusions that we will draw given a theory. For instance: how wrong are we likely to be if we base our best guess on our posterior mean, given the observation of a clue that the theory lets us make use of? “How wrong” might again be operationalized in different ways: for instance, in terms of expected squared error—the square of the distance between the truth and the posterior mean.
The expected squared error (see also Section 5.1.6) is:

$$\mathbb{E}[\text{Loss}] = \sum_{d} p(d \mid q^*)\left(\hat q(d) - q^*\right)^2,$$

where \(d\) ranges over the possible data realizations and \(\hat q(d)\) is the inference we would draw upon observing \(d\).

This equation yields the error that one would expect to get with respect to any true value of the parameter (\(q^*\)), given the inferences that would be drawn from each data pattern that might be observed.
Returning to the numeric example, we can calculate the expected (actual) squared error with respect to the objective event probabilities in Table 6.2. This yields here 0.215. This might be compared (unfavorably) to the expected error if we just used the prior (0.2) on \(q\) and ignored the clue: \(0.8 \times 0.04 + 0.2 \times 0.64 = 0.16\). In expectation, invoking this theory leaves us worse off.
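The two expectations can be checked directly—a sketch of ours, using the (rounded) inferences from Table 6.2:

```python
# (q, K, objective event probability, inference on q after observing K)
events = [(0, 0, 0.4, 0.059), (0, 1, 0.4, 0.5),
          (1, 0, 0.1, 0.059), (1, 1, 0.1, 0.5)]

with_theory = sum(p * (guess - q) ** 2 for q, k, p, guess in events)
with_prior = sum(p * (0.2 - q) ** 2 for q, k, p, guess in events)
print(round(with_theory, 3))                # 0.215: theory-guided inferences
print(round(with_prior, 3))                 # 0.16: sticking with the prior
```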
We do badly in expectation not just because the theory is wrong, but because it is very wrong. We might have done better, and gained from the theory, in expectation, had the theory only been moderately wrong. To see this, imagine instead that in fact the clue is informative about \(q\), only less so than our theory maintains; then the inferences drawn with the help of the clue could still reduce expected error relative to relying on the prior alone.
6.2.2.3 Subjective, ex post
The problem, of course, with an objective approach is that we do not have the information—the true values of our queries—that we need to calculate objective errors.
A more subjective approach involves asking about the reduction in posterior variance. Ex post we can define “better” as the reduction in posterior variance from drawing an inference that makes use of a theory and its associated clue compared to an inference that does not.
A problem with this measure, however, is that posterior variance is not guaranteed to go down: Our uncertainty can increase as we gather more data. Importantly, however, that increase in uncertainty would not mean that we have not been learning. Rather, we have learned that things are not as simple as we thought—so we become less certain than we were before, in a manner justified by what we have observed.
One approach that addresses this issue asks: How much better are our guesses, having observed the clue \(K\), than the guesses we would have made without it—where both guesses are assessed in light of what we believe now? We refer to this quantity as wisdom and define it as:

$$\text{Wisdom} = \frac{\int \left[\left(q - \hat q_{\text{prior}}\right)^2 - \left(q - \hat q_{\text{posterior}}\right)^2\right] p(q \mid d)\, dq}{\text{Prior variance}}.$$

The numerator in this expression captures how much better off we are with the guess we have made given current data (\(\hat q_{\text{posterior}}\)) than we would have been with our earlier guess (\(\hat q_{\text{prior}}\)), where the evaluation is made using our posterior beliefs—our best current understanding of the world.
Returning to the numeric example, our posterior variance after observing \(K = 1\) is 0.25—larger than our prior variance of 0.16, so by a pure variance criterion we have become less certain. Our wisdom, however, is positive: \((0.5 - 0.2)^2 / 0.16 = 0.5625\). By our own current lights, the guess we now hold is substantially better than the one we started with.
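In code—ours, with the wisdom formula following the definition given above, the squared shift in the best guess divided by the prior variance:

```python
prior_mean = 0.2
prior_var = prior_mean * (1 - prior_mean)   # 0.16
post_mean = 0.5                             # inference after observing K = 1
post_var = post_mean * (1 - post_mean)      # 0.25: uncertainty has increased
wisdom = (post_mean - prior_mean) ** 2 / prior_var
print(post_var, wisdom)                     # 0.25 0.5625
```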
6.2.2.4 Subjective, ex ante
Finally, we might think about the contributions to learning that we expect from a theory before observing the data. We can conceptualize expected learning as the reduction in expected posterior variance: How certain do we expect we will be after we make use of new information? (See also our discussion in Section 5.1.6.)
For any data realization \(d\) that we might observe, we can calculate the posterior variance we would hold upon observing it. The expected posterior variance averages these quantities over the possible data realizations:

$$\mathbb{E}[\text{Posterior variance}] = \sum_{d} p(d)\,\mathrm{Var}(q \mid d).$$

This equation takes the posterior variance, given some data, over all the possible data that one might encounter, weighting each realization by its probability under distribution \(p(d)\).

The key move is in recognizing that \(p(d)\)—the probability of each possible data realization—is itself implied by our model and priors: It is the prior predictive distribution.
Returning to the numeric example in Table 6.2, the expected posterior variance (with expectations taken with respect to the subjective event probability distribution) is 0.118. Note that we would also get 0.118 if we took the expectation of the actual squared error with respect to the subjective event probability distribution. The reduction in posterior variance over the prior variance of 0.16 is 26.47%.
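Again, a quick check—ours—averaging posterior variances over the subjective probability of each clue realization:

```python
p_k = {0: 0.68, 1: 0.32}                    # subjective P(K = k)
post = {0: 0.04 / 0.68, 1: 0.5}             # Pr(q = 1 | K = k)
epv = sum(p_k[k] * post[k] * (1 - post[k]) for k in (0, 1))
print(round(epv, 3))                        # 0.118
print(round((0.16 - epv) / 0.16, 4))        # 0.2647: a 26.47% reduction
```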
We have described a set of possible metrics for gains from theory, but there is no single right metric. The right metric for assessing gains fundamentally depends on what the researcher values—whether that is making fewer errors, being confident in conclusions, avoiding overconfidence, or something else.
6.3 Formal Theories and Causal Models
It is relatively easy to see how the ideas above play out for what might be called empirical models. But in the social sciences, “theory” is a term sometimes reserved for what might be called “analytic theories.” In this last section, we work through how to use this framework when seeking to bring analytic theories to data.
As an example of an analytic theory, we might consider the existence of “Nash equilibria.” Nash considered a class of settings (“normal form games”) in which each player \(i\) chooses a strategy from a set of available strategies and receives a payoff, given by a utility function, that depends on the strategies chosen by all players.
Nash’s theorem relates to the existence of a collection of strategies—one for each player—with the property that each player’s strategy yields that player the greatest attainable utility, given the strategies of the other players. Such a collection of strategies is called a Nash equilibrium.
The claim that such a collection of strategies exists in these settings is an analytic claim. Unless there are errors in the derivation of the result, the claim is true in the sense that the conclusions follow from the assumptions. There is no evidence that we could go looking for in the world to assess the claim. The same can be said of the theoretical claims of many formal models in the social sciences; they are theoretical conclusions of the if-then variety (Clarke and Primo 2012).
For this reason we will refer to theories of this form as “analytic theories.”
When researchers refer to a theory of populism or a theory of democratization, however, they often do not have such analytic theories in mind. Rather, they have in mind what might be called “applied theories” (or perhaps more simply “scientific theories” or “empirical theories”): general claims about the relations between objects in the world. The distinction here corresponds to the distinction in Peressini (1999) between “pure mathematical theories” and “mathematized scientific theories.”
Applied theory, in this sense, is a collection of claims with empirical content: An applied theory refers to a set of propositions regarding causal relations in the world that might or might not hold, and is susceptible to assessment using data. These theories might look formally a lot like analytic theories, but it is better to think of them as translations at most. The relations between nodes of an applied theory are a matter of conjecture, not a matter of necessity.
Though it is not standard practice, formal models produced by game theorists can often be translated into applied theoretical analogs and then represented using the notation of structural causal models. Moreover, doing so may be fruitful. Using the approach described above, we can assess the utility of the applied theory, if not the analytic theory itself.
For two players, for instance, we might imagine a representation of a standard normal form game as shown in Figure 6.3.
The model includes all the primitives of a normal form game: We can read off the number of players, the strategy sets (the range of the strategy nodes) and the mapping from actions to utilities. Here the only causal functions are the utility functions. In an analytic theory, these functions are known. In an applied translation of the theory these are a matter of conjecture: The functions capture the researchers’ beliefs that actual actions will produce actual payoffs. So far the model does not capture any claims about behavior or expected behavior.
In contrast to Nash’s theorem regarding the existence of equilibria, a behavioral theory might claim that in problems that can be represented as normal form games, players indeed play Nash equilibrium. This is a theory about how people act in the world. We might call it Nash’s theory.
How might this theory be represented as a causal model? Figure 6.4 provides one representation.
Here, player beliefs about the game form (\(\Gamma\)) determine the actions players take, and actions in turn determine payoffs; the causal functions for the action nodes encode the behavioral claim that players play an equilibrium of the game they understand themselves to be playing.
This model represents what we expect to happen in a game under Nash’s theory, and we can indeed check whether the relations between nodes in the world look like what we expect under the theory. The relations are nevertheless a matter of conjecture, to be contrasted with the exact claims on strategy profiles produced by an analytic theory that assumes Nash equilibria are played.
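To make the pieces concrete, here is a hedged sketch—ours, not the book’s—of a two-player game represented in this spirit: utility functions are the only causal functions of the game form, and Nash’s theory is then a causal function from the game form to actions. The payoff matrix is an illustrative assumption (a prisoner’s dilemma):

```python
# Payoff matrix: (a1, a2) -> (u1, u2); 0 = cooperate, 1 = defect.
payoffs = {(0, 0): (3, 3), (0, 1): (0, 4),
           (1, 0): (4, 0), (1, 1): (1, 1)}

def u1(a1, a2):                             # causal function for node U1
    return payoffs[(a1, a2)][0]

def u2(a1, a2):                             # causal function for node U2
    return payoffs[(a1, a2)][1]

# Nash's *theory* as a causal function from game form to actions:
# profiles from which no player gains by deviating unilaterally.
def nash_actions(payoffs):
    return [(a1, a2)
            for a1 in (0, 1) for a2 in (0, 1)
            if payoffs[(a1, a2)][0] >= max(payoffs[(d, a2)][0] for d in (0, 1))
            and payoffs[(a1, a2)][1] >= max(payoffs[(a1, d)][1] for d in (0, 1))]

print(nash_actions(payoffs))                # [(1, 1)]: mutual defection
print(u1(1, 1), u2(1, 1))                   # payoffs at the predicted profile
```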
So far, the model does not provide much of an explanation for behavior. A lower level causal model might help. In Figure 6.5, the game form affects players’ beliefs—about the game itself and about the actions others will take—and beliefs, in turn, produce actions, which produce outcomes.
This representation implies a set of relations that can be compared against empirical patterns. Do players indeed hold these beliefs when playing a given game? Are actions indeed consistent with beliefs in ways specified by the theory? It provides a theory of beliefs and a theory of individual behavior as well as an explanation for social outcomes.
The model in Figure 6.5 provides a foundation of sorts for Nash’s theory. It suggests that players play Nash equilibria because they expect others to and because they are utility maximizers. But this is not the only explanation that can be provided; alternatively, behavior might line up with the theory without passing through beliefs at all, as suggested by some accounts from evolutionary game theory, which show how selection processes can produce behavior corresponding to Nash even if agents are unaware of the game they are playing.
One might step still further back and ask why actors would form these beliefs, or take these actions, and answer in terms of assumptions about actor rationality. Figure 6.6, for instance, is a model in which actor rationality might vary and might influence beliefs about the actions of others as well as reactions to those beliefs. Fully specified causal functions might specify not only how actors act when rational but also how they react when they are not. In this sense, the model in Figure 6.6 both nests Nash’s theory and provides an explanation for why actors conform to the predictions of the theory.
In a final elaboration, we can represent a kind of underspecification of Nash’s theory that makes it difficult to take the theory to data. In the above, we assumed that players choose actions based on expectations that the other player will play the Nash equilibrium—or that the theory specifies which equilibrium is played in the case of multiplicity. But it is well known that Nash’s theory often does not provide a unique solution. This indeterminacy can be captured in the causal model as shown in Figure 6.7, where a common shock—labeled \(\nu\)—feeds into the expectations of both players and can determine which equilibrium, if any, they coordinate on.
The causal function for expectations can then allow for (i) the possibility that a particular equilibrium is invariably chosen and played by both players; (ii) a guarantee that players are playing one or another equilibrium together, but with uncertainty over which one is played; or (iii) the possibility that players are in fact out of sync, with each playing optimal strategies given their beliefs but the two nevertheless not playing the same equilibrium.
Nash’s theory likely corresponds to position (ii). It can be captured by causal functions under which both players’ expectations track the common shock \(\nu\): the players share an understanding of which equilibrium is being played, even if the researcher does not know which one it is.
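A sketch of this last elaboration—ours; the label \(\nu\) follows the text, and the coordination-game payoffs are illustrative. With multiple equilibria, the causal function for actions can condition on a common shock \(\nu\) that selects the equilibrium on which both players’ expectations settle, capturing position (ii):

```python
from itertools import product

# A coordination game with two pure-strategy equilibria.
coordination = {(0, 0): (2, 2), (0, 1): (0, 0),
                (1, 0): (0, 0), (1, 1): (1, 1)}

def nash_profiles(payoffs):
    """Pure-strategy profiles from which no player gains by deviating."""
    return [(a1, a2) for a1, a2 in product((0, 1), repeat=2)
            if all(payoffs[(a1, a2)][0] >= payoffs[(d, a2)][0] for d in (0, 1))
            and all(payoffs[(a1, a2)][1] >= payoffs[(a1, d)][1] for d in (0, 1))]

def actions(payoffs, nu):
    """Causal function for actions under position (ii): both players'
    expectations track nu, so they coordinate on the nu-th equilibrium."""
    eqs = nash_profiles(payoffs)
    return eqs[nu % len(eqs)]

print(nash_profiles(coordination))          # [(0, 0), (1, 1)]: two equilibria
print(actions(coordination, 0))             # (0, 0)
print(actions(coordination, 1))             # (1, 1)
```

Under position (iii), by contrast, each player’s action would condition on a private draw, so mismatched profiles such as (0, 1) could arise.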
We highlight three points from this discussion.
First, the discussion highlights that thinking of theory as causal models does not force a sharp move away from abstract analytic theories; close analogs of these can often be incorporated in the same framework. This is true even for equilibrium analysis that seems to involve a kind of simultaneity at first blush.
Second, the discussion highlights how the causal modeling framework can make demands for specificity from formal theories. For instance, specifying a functional relation from game form to actions requires a specification of a selection criterion in the event of multiple equilibria. Including agent rationality as a justification for the theory invites a specification for what would happen absent rationality.
Third, the example shows a way of building a bridge from pure theory to empirical claims. One can think of Nash’s theory as an entirely data-free set of claims. When translated into an applied theory—a set of propositions about the ways actual players might behave—and represented as a causal model, we are on a path to being able to use data to refine the theory. Thus, we might begin with a formal specification like that in Figure 6.7 but with initial uncertainty about player rationality, optimizing behavior, and equilibrium selection. This theory nests Nash but does not presume the theory to be a valid description of processes in the world. Combined with data, however, we shift to a more refined theory that might select Nash from the lower level model.
Finally, we can apply the ideas of Section 6.2 to formal theories and ask: Is the theory useful? For instance, does data on player rationality help us better understand the relationship between game structure and welfare?
The claim could also follow from a theory that reflected beliefs about heterogeneity of causal processes. For a review of rival approaches to scientific explanation, see Woodward (2003).

We note that our definition of theory differs somewhat from that given in Pearl (2009, 207): there, a theory is a structural causal model together with a restriction over the possible values of exogenous nodes, but not a probability distribution over these nodes. Our definition also considers probabilistic models as theories, allowing statements such as “the average effect of \(X\) on \(Y\) in some domain is 0.5.”

As we emphasize further below, it is in fact only the random, unknown component of the \(X \rightarrow M\) link that makes the addition of \(M\) potentially informative as a matter of research design: If \(M\) were a deterministic function of \(X\) only, then knowledge of \(X\) would provide full knowledge of \(M\), and nothing could be learned from observing \(M\).

The numerator simplifies according to:

$$\int \left[\left(q - \hat q_{\text{prior}}\right)^2 - \left(q - \hat q_{\text{posterior}}\right)^2\right] p(q \mid d)\, dq = \left(\hat q_{\text{posterior}} - \hat q_{\text{prior}}\right)^2.$$

From this we see that the measure does not depend on either prior or posterior variance (except through the denominator). Note also that wisdom, though non-negative, can exceed 1 in situations in which there is a radical re-evaluation of a prior theory, even if uncertainty rises. As an illustration, if our prior on some share is given by a Beta(2, 18) distribution, then our prior mean is 0.1 and our prior variance is very small, at 0.0043. If we then observe four positive cases, our posterior mean becomes 1/4 and our posterior variance increases to 0.0075. We have shifted our beliefs upward and, at the same time, become more uncertain. But we are also wiser, since we are now confident that our prior best guess of 0.1 was surely an underestimate. Our wisdom is 5.25—a dramatic gain.

This can be seen from the law of total variance, which can be written as:

$$\mathrm{Var}(q \mid W) = \mathbb{E}_K\left[\mathrm{Var}(q \mid W, K)\right] + \mathrm{Var}_K\left(\mathbb{E}\left[q \mid W, K\right]\right).$$
The expression is written here to highlight the gains from observation of \(K\), given what is already known from observation of \(W\). See Raiffa and Schlaifer (1961). A similar expression can be given for the expected posterior variance from learning \(K\) in addition to \(W\) when \(W\) itself is not yet known. See, for example, Proposition 3 in Geweke and Amisano (2014). Note also that an implication is that the expected reduction in variance is then always positive, provided you are changing beliefs at all. In contrast, the (objective) expected error measure can be assessed under rival theoretical propositions, allowing for the real possibility that the gains of invoking a theory are negative.

That is, since:

$$\mathbb{E}_{q,d}\left[\left(q - \hat q(d)\right)^2\right] = \sum_d p(d) \int \left(q - \hat q(d)\right)^2 p(q \mid d)\, dq,$$

we have, when \(\hat q(d)\) is the posterior mean:

$$\int \left(q - \hat q(d)\right)^2 p(q \mid d)\, dq = \mathrm{Var}(q \mid d),$$

and so:

$$\mathbb{E}_{q,d}\left[\left(q - \hat q(d)\right)^2\right] = \sum_d p(d)\, \mathrm{Var}(q \mid d),$$

which is the expected posterior variance.

Peressini (1999) distinguishes between “applied mathematical theories” and “mathematized scientific theories” on the grounds that not all mathematized theories are an application of a pure theory.