How did economics learn to make credible causal claims?

For thirty years the best economists in the world built giant models of the whole economy and trusted them to tell you what a policy would do. Then three people, on three different flanks, showed the models could not deliver what they promised. The fix was not a better model. It was a different question about evidence — and it changed what counts as knowing that one thing causes another. This walkthrough follows the identification problem itself, from the Cowles structural program to the credibility revolution to the synthesis frontier. Each generation’s answer failed; the next was the response.

Stage 1 of 4

The structural program

“Econometrics… aims at a unification of the theoretical-quantitative and the empirical-quantitative approach to economic problems… It is this unification that constitutes econometrics.”

— Ragnar Frisch, opening editorial of Econometrica, 1933

Read that as a promise. The ambition was not to fit a curve to data — it was to estimate the actual machinery of the economy, the deep relationships that govern how households and firms behave, and then to use that estimated machinery to answer any policy question you cared to ask. Lawrence Klein, who built the first models that tried to do exactly this, put the working version plainly: estimate the structure of the economy as a system of equations, and you can simulate any policy you like before you try it. For two decades it looked like that promise was being kept.

The move that made the promise conceivable came from Trygve Haavelmo in 1944. His Probability Approach in Econometrics argued that economic data should be treated as draws from a joint probability model — that an economy is a stochastic system, and you can reason about it with the same statistical machinery you use for any other random process. Before Haavelmo this was not obvious; after him, structural estimation had a foundation. The Cowles Commission — Tjalling Koopmans, Jacob Marschak, and their collaborators — built the apparatus on top of it.

The problem they had to solve is older than it looks. Suppose you want the demand curve for a good and you have price-and-quantity data over many years. You cannot just run a regression of quantity on price, because price and quantity are determined together — every observation is the intersection of a supply curve and a demand curve, both moving. The cloud of points you see traces out neither curve. This is the simultaneous-equations problem, and the Cowles answer to it is identification: you can recover the structural parameters only if you can impose enough credible restrictions to pin down which curve is which.

A structural system writes the endogenous variables $\mathbf{y}$ and exogenous variables $\mathbf{x}$ as $\mathbf{B}\mathbf{y} + \mathbf{\Gamma}\mathbf{x} = \mathbf{u}$. Provided $\mathbf{B}$ is invertible, the reduced form is $\mathbf{y} = -\mathbf{B}^{-1}\mathbf{\Gamma}\,\mathbf{x} + \mathbf{B}^{-1}\mathbf{u}$, which is all the data can show you. An equation is identified only if enough restrictions are imposed to recover its structural coefficients from the reduced form; typically these are exclusions, variables assumed absent from that equation. The order condition is necessary but not sufficient: the number of exogenous variables excluded from the equation must be at least the number of endogenous variables included in it, minus one. The rank condition is the full requirement.
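As a worked illustration of the order condition, consider a two-equation market. The symbols $\alpha$, $\beta$, $\gamma$, and $z$ are introduced here for the example only; $z$ is an assumed exogenous supply shifter that is excluded from demand.

$$
\begin{aligned}
\text{demand:}\quad q &= \alpha p + u_d \\
\text{supply:}\quad q &= \beta p + \gamma z + u_s
\end{aligned}
$$

Equating the two and solving (with $\alpha \neq \beta$) gives the reduced form:

$$
p = \pi_p z + v_p, \qquad q = \pi_q z + v_q, \qquad \pi_p = \frac{\gamma}{\alpha - \beta}, \qquad \pi_q = \frac{\alpha\gamma}{\alpha - \beta}
$$

Each equation includes two endogenous variables, $q$ and $p$, so the order condition asks for at least $2 - 1 = 1$ excluded exogenous variable. Demand excludes $z$ and passes: as long as $\gamma \neq 0$, its slope is recovered as $\alpha = \pi_q / \pi_p$. Supply excludes nothing and fails: two reduced-form coefficients cannot separate $\beta$ from $\gamma$.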

Intuition mode

To trace out a demand curve, you need something that shifts supply without touching demand — a bad harvest, say — so that as supply slides along the fixed demand curve, the points it sweeps out reveal the curve’s shape. Identification is the art of finding, or assuming, those shifters. The whole edifice rests on whether the assumed shifters are real. Hold that thought; it is where the next decade attacks.
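The shifter logic can be sketched as a small simulation. This is a minimal illustration on synthetic data, not a reconstruction of any Cowles-era model: the parameter values, the variable names, and the harvest shock `z` are all assumptions made for the example. It compares a plain regression of quantity on price with the ratio estimator that uses the supply shifter.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical structural parameters (illustrative values only)
demand_slope = -1.0    # slope of the demand curve
supply_slope = 1.0     # slope of the supply curve
harvest_effect = 1.0   # how strongly the harvest shock z shifts supply

z = rng.normal(size=n)     # harvest shock: enters supply, excluded from demand
u_d = rng.normal(size=n)   # unobserved demand shifter
u_s = rng.normal(size=n)   # unobserved supply shifter

# Market clearing:
#   demand: q = demand_slope * p + u_d
#   supply: q = supply_slope * p + harvest_effect * z + u_s
p = (u_d - u_s - harvest_effect * z) / (supply_slope - demand_slope)
q = demand_slope * p + u_d

# Naive regression of quantity on price mixes both curves
ols_slope = np.cov(p, q)[0, 1] / np.var(p, ddof=1)

# Using only the variation in p and q that comes from the supply shifter
iv_slope = np.cov(z, q)[0, 1] / np.cov(z, p)[0, 1]

print(f"true demand slope: {demand_slope:.2f}")
print(f"OLS slope:         {ols_slope:.2f}")
print(f"IV slope:          {iv_slope:.2f}")
```

In this setup the naive slope lands between the two curves, while the shifter-based ratio recovers the demand slope. The result depends entirely on `z` really being absent from demand, which is the assumption the simulation builds in by construction.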

Klein took this machinery to scale. The Klein-Goldberger model (1955) estimated the US economy as a system of dozens of equations; the Brookings model and the FRB-MIT-Penn model of the 1960s ran to hundreds. Governments used them to forecast and to simulate — raise this tax, change that interest rate, and read the predicted effect off the system. The intellectual home of this structural program in the history of economic thought is the postwar mainstream synthesis, the era that took models to data with new confidence.

Take the structural ambition at its full strength, because it is the right ambition and the rest of the thread only makes sense once you see that. The goal was never to describe correlations. It was to estimate the deep parameters — the preferences, the technologies, the genuine behavioral relationships — that do not change when you change policy. If you actually had those, you could do something no amount of curve-fitting can ever do: evaluate a policy that has never been tried, on a population that has never been treated, by simulating the estimated system forward under the new rule. That is the holy grail of policy economics, and structural estimation is the only approach that even attempts it.

And the assumptions were on the table. A structural model is a complete, stated account of the mechanism: here are the equations, here are the exclusion restrictions, here is what we are assuming about how the economy works. You can argue with it because it commits to something. A reduced-form summary of what happened in the data commits to nothing about what would happen under a different regime. The structural program was the disciplined, honest, ambitious way to do economics — estimate the machine, then ask the machine. The question that would tear it open is not whether the ambition was right. It is whether the program could actually deliver the deep parameters it promised, or only something that looked like them until the policy changed.

The structural program had the right ambition, and for two decades it looked like it was working: the big models forecast, they simulated, they advised presidents and central banks. The whole thing turned on a single word — deep. Were the estimated coefficients actually the structural, policy-invariant parameters the program promised? Or were they reduced-form combinations that would shift the moment policy shifted, leaving you with a model that fit the past and lied about the future? And were the identifying restrictions — the exclusions that let you trace one curve rather than another — ever credible, or just convenient? Those two questions, is what you estimated actually invariant and is your identification actually believable, are the crack the next decade pried open from three different directions at once.

By 1976 a Chicago economist not yet forty had a one-line argument that the entire enterprise rested on a mistake. By 1980 a paper titled Macroeconomics and Reality called the identifying restrictions everyone imposed, in its author’s word, incredible. And by 1986 someone simply checked the standard non-experimental estimators against an actual randomized experiment, and they failed. Three blows, three different flanks, the same target. And as you will see, every one of them was the same kind of failure.

Stage 2 of 4

The critiques

“Given that the structure of an econometric model consists of optimal decision rules of economic agents, and that optimal decision rules vary systematically with changes in the structure of series relevant to the decision maker, it follows that any change in policy will systematically alter the structure of econometric models.”

— Robert Lucas, “Econometric Policy Evaluation: A Critique,” 1976

Strip the formality and the punch is brutal. The coefficients in those big models are not deep parameters. They are mixtures of deep parameters and the policy rule people were living under when the data was generated. Change the rule and people change their behavior, so the coefficients move — which means simulating a policy change while holding the coefficients fixed is not just imprecise. It is invalid in principle. The single most-used feature of the structural program, policy simulation, rested on assuming away the one thing a policy change does.
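A toy simulation makes the point concrete. It is a sketch under strong assumptions, not Lucas's own model: policy `x` follows a hypothetical AR(1) rule with persistence `rho`, agents respond to next period's expected policy with a deep parameter `theta`, and the econometrician regresses behavior on current policy. All names and values are invented for the illustration.

```python
import numpy as np

def estimated_coefficient(rho, theta=2.0, n=50_000, seed=0):
    """Regress behavior y on policy x when x follows x[t] = rho * x[t-1] + shock."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + shocks[t]
    # Agents act on expected future policy: E[x[t+1] | x[t]] = rho * x[t]
    y = theta * rho * x + rng.normal(scale=0.5, size=n)
    # OLS slope of y on x
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

coef_old_regime = estimated_coefficient(rho=0.9)
coef_new_regime = estimated_coefficient(rho=0.3)

print(f"coefficient under old rule (rho=0.9): {coef_old_regime:.2f}")
print(f"coefficient under new rule (rho=0.3): {coef_new_regime:.2f}")
```

The deep parameter `theta` is identical in both runs, yet the estimated coefficient is roughly `theta * rho` and so moves with the rule. Simulating the new rule with the old coefficient would therefore misstate the response.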

Three critiques landed within a decade, each on a different flank of the same identification strategy. Take them in turn, compressed to the logic.

Lucas (1976): the parameters are not invariant. An estimated equation captures how people behaved under the policy regime that produced the data. People form expectations and optimize against the rule they expect. If the central bank changes its rule, expectations change, behavior changes, and the equation you estimated under the old rule no longer holds. So the coefficients are not the deep structure — they are the deep structure entangled with the old policy. Simulate a policy change with them and you get an answer to a question nobody asked.

Sims (1980): the restrictions are not credible. To identify their systems, the big models had to impose dozens of exclusion restrictions — assertions that this variable does not enter that equation. Christopher Sims argued these were not theory; they were imposed because the modeler needed them to achieve identification. His word was incredible: the restrictions could not actually be believed. His alternative was the vector autoregression — model all the variables symmetrically, be honest that what you have recovered is the reduced-form dynamics, and impose only the minimal, defensible restrictions needed to read off structural shocks. Stop pretending to more identification than you have.

LaLonde (1986): the estimates do not replicate an experiment. Robert LaLonde took the National Supported Work data — an actual randomized job-training experiment with a real control group — threw away the experimental control group, and asked whether the standard non-experimental econometric estimators could recover the experimental answer from observational comparison groups. They could not. Different reasonable estimators gave widely different numbers, and few were close to the experimental benchmark. The structural toolkit, asked to reproduce a result an experiment had already settled, missed.

Lucas first, at full strength, because the argument is a genuine logical demonstration and not a quibble. The central use of the big models was policy simulation. The thing they held fixed to run the simulation — the estimated coefficients — is exactly the thing a policy change moves, because the coefficients encode behavior optimized against the prevailing rule. There is no patch for this inside the fixed-coefficient framework; the framework’s validity for its own central task is what fails. But here is the calibration that the rest of the thread turns on, and it is the most important single piece of honesty in this story. Lucas was right about the disease and wrong about which cure would win. His proposed remedy was to estimate the truly deep parameters — preferences, technology — that are policy-invariant, and rebuild macro on those microfoundations. The trouble is that those deep parameters are no easier to identify credibly than the structural coefficients they replaced. Identifying a utility function from aggregate data runs into the same wall. That is a large part of why the turn that actually won applied economics was not the rational-expectations-structural turn Lucas wanted. It was something else.

Sims next, and his case is the cleanest statement of the identification disease as a disease of belief. The structural models achieved identification by fiat: they excluded variables from equations not because any theory demanded the exclusion but because the exclusion was needed to make the system solvable. Dress that up however you like; the restriction is still something nobody has a reason to believe. Sims’s “incredible” is precise and damning. His VAR is the honest response — do not claim identification you have not earned; model the joint dynamics, recover what the data actually pins down, and reserve structural interpretation for the few restrictions you can defend out loud.

LaLonde last, and his is the one that drew blood empirically rather than logically. Lucas and Sims argued that the structural program could not credibly identify what it claimed. LaLonde showed it did not. He had something the others did not: a real experiment to check against. When the non-experimental estimators were asked to reproduce the experimental answer, they disagreed with the experiment and with each other. This is the founding negative result the next rung would name as its problem statement: the assumptions the structural estimators depend on are, in LaLonde’s framing, assumptions the data cannot test — and when you find a case where you can test them, against an experiment, they lose.
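The shape of LaLonde's exercise can be sketched on synthetic data. Nothing below uses or approximates the National Supported Work numbers: the earnings equation, the unobserved `ability` variable, and every parameter value are assumptions made for illustration. The sketch keeps the treated group, swaps the randomized controls for a comparison group drawn from a different population, and compares the two estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect = 1.0  # hypothetical training effect on earnings (arbitrary units)

# Experimental sample: disadvantaged workers, treatment assigned by coin flip
n_exp = 20_000
ability_exp = rng.normal(loc=-1.0, size=n_exp)  # unobserved by the analyst
treated = rng.integers(0, 2, size=n_exp).astype(bool)
earnings_exp = (5.0 + 2.0 * ability_exp + true_effect * treated
                + rng.normal(size=n_exp))

# Benchmark: treated versus randomized controls
experimental_estimate = earnings_exp[treated].mean() - earnings_exp[~treated].mean()

# Observational comparison group: untreated, but from a better-off population
n_obs = 20_000
ability_obs = rng.normal(loc=0.0, size=n_obs)
earnings_obs = 5.0 + 2.0 * ability_obs + rng.normal(size=n_obs)

# Same treated group, non-experimental comparison
naive_estimate = earnings_exp[treated].mean() - earnings_obs.mean()

print(f"experimental estimate:     {experimental_estimate:.2f}")
print(f"non-experimental estimate: {naive_estimate:.2f}")
```

Here the non-experimental comparison even gets the sign wrong, because the two groups differ in something the analyst does not observe. Adjustment methods try to close that gap, and whether they succeed is exactly what a randomized benchmark lets you check.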

Three critiques, three different flanks, one target — and notice what they have in common. Lucas: the parameters are not invariant, so you cannot simulate. Sims: the identifying restrictions are not credible, so you do not even have the parameters you think you have. LaLonde: when an actual experiment is available, the structural estimates do not reproduce it. Every one of these is an identification failure. The structural program’s entire machinery existed to identify deep parameters from non-experimental data, and all three critiques say the same thing in three registers: it didn’t, it couldn’t, and here is the proof. Once you see that the common failure is identification, the next move is almost forced. If identification from imposed assumptions keeps failing — stop imposing. Go find a setting where the variation is already exogenous, where you do not have to assume your way to a clean comparison because the world handed you one.

That setting turned out to be everywhere, once economists learned to look. A minimum-wage hike on one side of a state line but not the other. A draft lottery that randomly pulled some young men into the military and not others. A test-score cutoff that lands one student in a program and the next student out. They called it finding a natural experiment — and within twenty years it had taken over applied economics.

Stage 3 of 4

The credibility revolution

“On April 1, 1992, New Jersey’s minimum wage rose… We find no indication that the rise in the minimum wage reduced employment.”

— David Card & Alan Krueger, American Economic Review, 1994

Notice what Card and Krueger did not do. They did not write down a structural model of the fast-food labor market and impose restrictions to identify it. They found a setting where one state raised its minimum wage and a neighboring one did not, surveyed restaurants on both sides of the line before and after, and just looked. New Jersey is the treatment; Pennsylvania is the control the world supplied for free. The identifying assumption is one sentence a reader can argue with: absent the wage change, the two sides of the border would have moved together. That is the whole revolution in one study — not a better model, a found comparison.
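The arithmetic behind a two-group, two-period comparison like this one is short enough to write down. The sketch below uses made-up group averages, not the figures from Card and Krueger's survey, and the function name `diff_in_diff` is introduced here for illustration.

```python
def diff_in_diff(treat_before, treat_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treat_after - treat_before
    control_change = control_after - control_before
    return treated_change - control_change

# Made-up average employment per store, for illustration only
estimate = diff_in_diff(
    treat_before=10.0, treat_after=12.0,
    control_before=11.0, control_after=12.0,
)
print(estimate)  # 1.0
```

The control group's change stands in for what would have happened to the treated group without the policy. That substitution is the parallel-trends assumption stated in the text, and the estimate is only as good as that assumption.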

The design-based move inverts the structural strategy. Instead of imposing restrictions to identify parameters from whatever data you have, you go find a source of variation that is plausibly as-good-as-random, and you build the analysis around it. Four tools do most of the work. Instrumental variables done right: a variable that shifts treatment but plausibly affects the outcome only through treatment — with a concrete, defensible story for why. Regression discontinuity: compare units just above and just below an arbitrary cutoff, who are alike in everything except which side of the line they fell on. Difference-in-differences: a treated group and a comparable control, before and after, differencing out what they share. And randomization itself, where you can run the experiment.
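Of the four tools, regression discontinuity is perhaps the easiest to sketch in a few lines. The example below runs on synthetic data with an assumed cutoff, an assumed jump of 0.5, and a hand-picked bandwidth; it fits a separate line on each side of the cutoff and compares the two fitted values at the cutoff. Real applications choose the bandwidth and inference method far more carefully.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
cutoff = 0.0
true_jump = 0.5  # hypothetical treatment effect at the cutoff

score = rng.uniform(-1, 1, size=n)   # running variable, e.g. a test score
treated = score >= cutoff            # treatment is assigned by the cutoff rule
outcome = (1.0 + 0.8 * score + true_jump * treated
           + rng.normal(scale=0.5, size=n))

def rd_estimate(score, outcome, cutoff, bandwidth):
    """Local linear fit on each side; return the gap in intercepts at the cutoff."""
    left = (score >= cutoff - bandwidth) & (score < cutoff)
    right = (score >= cutoff) & (score <= cutoff + bandwidth)
    # np.polyfit with degree 1 returns [slope, intercept]
    fit_left = np.polyfit(score[left] - cutoff, outcome[left], 1)
    fit_right = np.polyfit(score[right] - cutoff, outcome[right], 1)
    return fit_right[1] - fit_left[1]

rd_jump = rd_estimate(score, outcome, cutoff, bandwidth=0.2)
print(f"estimated jump at cutoff: {rd_jump:.2f}")
```

The estimate is informative only about units near the cutoff, which is the same locality that the LATE result makes explicit for instruments.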

The framework that made this rigorous is the Angrist-Imbens-Rubin potential-outcomes language. Each unit has a treated outcome and an untreated outcome; we only ever see one. Causal effects are comparisons of potential outcomes, and a research design is credible when it makes a defensible claim about which units’ missing potential outcome the comparison group stands in for. The sharpest result of the framework is also its sharpest limit: the local average treatment effect.

With a binary instrument $Z$, treatment $D$, and outcome $Y$, instrumental variables identify the LATE:

$$\text{LATE} = \frac{\mathbb{E}[Y \mid Z=1] - \mathbb{E}[Y \mid Z=0]}{\mathbb{E}[D \mid Z=1] - \mathbb{E}[D \mid Z=0]}$$

Under instrument independence (the instrument is as good as randomly assigned), exclusion, monotonicity, and a nonzero first stage, this ratio is the average treatment effect for the compliers, the units whose treatment status the instrument actually moved. It is not the population average, and it is not the effect on always-takers or never-takers.

Intuition mode

A natural experiment only moves some people. The draft lottery changes military service for the men who served because their number came up and would not have served otherwise. It changes nothing for the volunteer who was going to enlist anyway, or for the man who was never going to serve regardless. What you learn is the effect on the people the experiment actually moved. That is real knowledge, cleanly identified. It is also not the same as the effect on everyone.
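A simulation with known complier types shows the gap between the LATE and the population average. The population shares, the effect sizes, and all variable names below are assumptions chosen for illustration; the ratio computed at the end is the Wald form of the LATE expression, with a binary instrument `z`, treatment `d`, and outcome `y`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical population: shares and treatment effects differ by type
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.5, 0.3])
effect = np.where(types == "complier", 2.0,
                  np.where(types == "always", 5.0, 0.0))

z = rng.integers(0, 2, size=n)  # randomly assigned binary instrument
# Always-takers take treatment, never-takers refuse, compliers follow z
d = np.where(types == "always", 1, np.where(types == "never", 0, z))

y0 = rng.normal(size=n)   # untreated potential outcome
y = y0 + effect * d       # observed outcome

# Wald ratio: reduced-form difference over first-stage difference
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
population_ate = effect.mean()

print(f"Wald / LATE estimate:        {wald:.2f}")
print(f"complier effect (by design): 2.00")
print(f"population average effect:   {population_ate:.2f}")
```

The ratio recovers the complier effect and says nothing about the always-takers, whose larger effect pulls the population average elsewhere. Monotonicity holds here by construction because the simulation contains no defiers.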

And here is the epistemic shift that is the whole point. The identifying assumption is now checkable. A referee can ask one sharp question — is this instrument really exogenous? is this cutoff really arbitrary? would the two sides of the border really have moved together? — and the paper lives or dies on the answer. Compare that to the structural models’ dozens of exclusion restrictions, which no referee could verify because they were never meant to be verified. The history of economic thought places this turn squarely in the modern era, where it became the discipline’s evidentiary standard.

Argue the design-based turn at full strength, in its own voice, because it is the answer to all three of Stage 2’s critiques at once, and the mapping is exact. It does not simulate fixed deep parameters, so the Lucas critique does not bite: there is nothing being held invariant across a regime change, because the method estimates an effect in a setting, not a structure to be extrapolated. It does not impose incredible restrictions, so Sims’s critique does not bite: the identifying assumption is a single, stated, checkable claim about one source of variation, not a dozen exclusions imposed by convenience. And it reproduces experiments because the entire point is to find variation that mimics an experiment — which turns LaLonde’s demonstration from an indictment into the program’s founding motivation. The thing that killed the structural program is the thing the design-based program is built to deliver: identification you can check.

Joshua Angrist and Jörn-Steffen Pischke named the whole shift the “credibility revolution,” and their claim was not modest: empirical work in economics is far more credible today than it was thirty years ago, because the standard for a causal claim changed. The change was not better technique alone — it was a change in what economists will accept as having shown that one thing causes another. The discipline ratified it with its highest honors: the 2019 Nobel to Banerjee, Duflo, and Kremer for the experimental approach to poverty, and the 2021 Nobel to Card, Angrist, and Imbens for the empirical labor work and the framework that made natural-experiment evidence interpretable.

Labor economics was where the revolution proved itself first and hardest. Card-Krueger on the minimum wage, Angrist on the Vietnam draft lottery, the monopsony program that gave the minimum-wage non-effect a mechanism — this is where the design-based move went from clever trick to field standard, and it is a story with its own lineage worth following on its own terms.

The credibility revolution did the one thing the structural program never could: it made the identifying assumption checkable. That is the whole game. A referee in 1975 could not verify a structural model’s restrictions; a referee in 2005 could ask whether the instrument was really exogenous and watch the paper stand or fall on the answer. By the bar it set for itself, the revolution won, and it won decisively in applied micro. But look hard at what it gave up to win. It learned the effect on the compliers — the people the natural experiment happened to move. It did not learn the effect on everyone else, or the effect of a policy nobody has tried yet, or the answer on a population the found variation never touched. The structural program’s impossible ambition was at least the right ambition for those questions. So the sharp question the next rung has to face is this: did the revolution solve identification — or did it solve identification by quietly restricting itself to the questions where exogenous variation happens to lie around?

That question has a name and a defender who has spent thirty years pressing it. James Heckman won the 2000 Nobel for his methods for analyzing selective samples, work rooted in the structural tradition. He watched the field walk away from that tradition, and he never stopped arguing that the design-based program forgets what structure was for. The last move in the thread is not a winner. It is an attempt to have both.

Stage 4 of 4

The synthesis frontier

“The gap between credible internal validity and the external validity we ultimately care about has narrowed but not closed… combining design-based identification with explicit modeling is where the action now is.”

— Guido Imbens, in the spirit of his 2021 Nobel lecture on causal inference

The pointed thing about who is saying this: Imbens is of the design-based tradition — he built the LATE framework that defined it. And he is now a leader of the effort to put structure back on top of it. When the revolution’s own architects start building bridges back toward the program the revolution displaced, that is the signal that the thread has reached its synthesis phase. The field is not relitigating who won. It is asking how to keep the checkable-identification floor while recovering the extrapolation the floor cannot give you.

Lay the trade-off out honestly, because the synthesis only makes sense once both sides are stated at full strength. Design-based / reduced-form: minimal assumptions, transparent identification, internally valid, but a LATE on a found population does not extrapolate to a different population or an untried policy. Structural: more assumptions and, in exchange, broader claims (extrapolation, ex-ante counterfactual policy evaluation) that hold only under those assumptions, which are exactly what is hard to defend. Neither dominates. They answer different questions, and both questions matter.

The frontier is the set of attempts to combine them. Structural-causal models formalize when a causal effect can be identified from a stated model of the mechanism, making the structural assumptions explicit objects you can inspect. Causal machine learning — Susan Athey and Guido Imbens’s program for the methods economists should know, and the Chernozhukov double/debiased ML framework — keeps the design-based identification discipline up front but lets flexible ML estimate the high-dimensional nuisance pieces, so you can have credible identification and rich heterogeneity. The point of the whole frontier is stated plainly: it is neither a return to pre-revolution structural confidence nor a permanent reduced-form settlement. It is an attempt to keep the credibility revolution’s checkable-identification standard while recovering the structural program’s reach.
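The partialling-out idea behind double/debiased ML can be sketched without any ML library. The sketch assumes a partially linear model, uses two-fold cross-fitting, and lets a degree-6 polynomial regression stand in for the flexible learner; a real application would plug in a proper ML method. The data-generating process, the name `fit_predict`, and all parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
theta = 1.5  # hypothetical true effect of treatment d on outcome y

x = rng.uniform(-2, 2, size=n)                           # observed confounder
d = np.sin(x) + rng.normal(scale=0.5, size=n)            # treatment depends on x
y = theta * d + x + x**2 + rng.normal(scale=0.5, size=n) # outcome, nuisance g(x) = x + x^2

def fit_predict(x_train, target_train, x_test, degree=6):
    """Stand-in for a flexible learner: polynomial regression on x."""
    coefs = np.polyfit(x_train, target_train, degree)
    return np.polyval(coefs, x_test)

# Two-fold cross-fitting: predict each fold from a model fit on the other fold
folds = rng.permutation(n) % 2
y_res = np.empty(n)
d_res = np.empty(n)
for k in (0, 1):
    train, test = folds != k, folds == k
    y_res[test] = y[test] - fit_predict(x[train], y[train], x[test])
    d_res[test] = d[test] - fit_predict(x[train], d[train], x[test])

# Final stage: regress outcome residuals on treatment residuals
theta_hat = (d_res @ y_res) / (d_res @ d_res)

# For contrast: regression of y on d that ignores x
naive_slope = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

print(f"true effect:          {theta:.2f}")
print(f"cross-fit estimate:   {theta_hat:.2f}")
print(f"naive slope (no x):   {naive_slope:.2f}")
```

The learner handles only the nuisance pieces, the predictions of `y` and `d` from `x`. The causal claim still rests on the identification assumption that `x` captures all confounding, which the flexible fit does nothing to establish.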

Heckman’s defense, at full strength, because the structural tradition fighting back is the predecessor framework surfaced at its best one last time. The structural program was designed to do something the design-based program structurally cannot: ex-ante policy evaluation — predict the effect of a policy that has not been implemented, on a population that has not been treated, under explicit, stated assumptions. A LATE on the compliers of one natural experiment is silent about a different population, a scaled-up version of the program, or a policy nobody has tried. If those are the questions you need answered — and for most genuinely forward-looking policy design, they are — then you need structure, because only structure even attempts them. Heckman’s “building bridges” argument is not a rejection of the revolution; it is a reminder that the revolution’s win came with a real cost, and the cost is the questions structure was built to answer. The calibration matters: this engagement carries Heckman’s methodological frame, which is right, not his record on specific empirical reanalyses, several of which did not hold up while the design-based findings they contested — Card-Krueger among them — did. The frame survives independently of the contested cases.

And the integration is the thread’s real destination, voiced by the revolution’s own people. Imbens, Athey, Chernozhukov — design-tradition leaders — are now importing structural ambitions back in, disciplined by the checkability standard the revolution established. The synthesis is not “structure was right after all,” and it is not “reduced-form forever.” It is something more precise and more useful: credible identification is the floor, and combining it with structural policy-relevance is the live work. The bridges run both ways now, and the people building them are the ones who tore down the old structure in the first place.

Here is where the thread landed. The credibility revolution genuinely raised the bar. Each predecessor rung failed at identification in a way the next rung diagnosed correctly: Lucas was right that the parameters were not invariant, Sims was right that the restrictions were often incredible, and LaLonde was right that the structural estimates did not replicate experiments. The design-based turn solved the common problem by making the identifying assumption checkable. That is a durable epistemic gain and the central achievement of the whole lineage. But it narrowed what economics attempts. The structural-extrapolation questions got harder, not easier, and Heckman’s defense holds precisely for them: the discipline partly retreated from the questions only the structural tradition was equipped to ask. The synthesis is ongoing: the frontier integrates the two, and this walkthrough declares no winner among the frontier directions, because there isn’t one yet. And the win is domain-local. Design-based identification won applied micro (labor, public, development, applied IO) decisively, but macro mostly did not adopt the design-based toolkit; DSGE and calibrated structural macro persist for reasons the credibility revolution never addressed. Calling the whole discipline “design-based” is wrong.

So the verdict is a position, not a punt. The thread did not end in “the debate continues.” It ended somewhere specific: credible identification became the floor of applied economics — the bar a causal claim has to clear is now “can a reader check the identifying assumption” — and the unfinished work is putting structural policy-relevance back on top of that floor without losing it. That is the live frontier, and it is genuinely open. One frame sits outside the thread entirely: a tradition running from Mises and Hayek rejects the identification project at its premise, holding that the deep questions of economics are not empirically decidable at all — a position this thread does not engage because it is a different thread, not a rung in this one.

This thread traced how the methods accumulated — each rung a response to a predecessor’s identification failure.

Where the thread landed

We started with a promise: estimate the deep machinery of the economy, then simulate any policy you like. The Cowles structural program had the right ambition and built real apparatus to pursue it. Then three critiques, on three flanks, showed that the apparatus could not credibly deliver the deep parameters it promised — and every one of the three was the same kind of failure, a failure of identification. The credibility revolution answered all three at once by changing the question: stop imposing assumptions, find exogenous variation, and make the one assumption you keep a claim a reader can check. It won applied micro decisively. It also learned the effect on the compliers and not on everyone, which is why the structural tradition’s defense survives for the questions only it can ask.

So the thread landed on a floor, not a finish line. Credible identification is now the bar every causal claim in applied economics has to clear, and that is a real and durable gain. Putting structural policy-relevance back on top of that floor, through structural-causal models, causal machine learning, and design-based structural estimation, is the unfinished work, and the people doing it are the revolution’s own architects. The next time someone tells you a study “proves” a policy works, you have the one question the whole lineage earned: can you check the identifying assumption? To see the methods lineage as a connected graph, showing who responded to whom across eras that era-organized books keep apart, open the methodology lineage view.