Did the credibility revolution change what economics knows?

In the 1990s economics decided to stop trusting models it couldn’t check and start running experiments. It got far better answers. It also stopped asking some of the biggest questions.

Stage 1 of 4

What the revolution made tractable

“Contrary to the central prediction of the textbook model… we find no indication that the rise in the minimum wage reduced employment.”
— David Card & Alan Krueger, American Economic Review, 1994

In 1992 New Jersey raised its minimum wage and Pennsylvania did not. Card and Krueger counted fast-food jobs on both sides of the border, before and after, and found the textbook prediction failed. The result is famous for what it concluded. It matters far more for how it concluded it: no structural model, no untestable assumptions, just a natural experiment doing the work. That move is the credibility revolution.

To see why that move was a break and not just a clever paper, start with what it replaced. Through the 1970s and 1980s, applied economics ran on structural models: write down a model of how agents behave, estimate its parameters from observational data, and read policy effects off the estimated structure. The trouble was that the answers depended on assumptions the data could not check: functional forms, error distributions, which variables were exogenous. Robert LaLonde made the problem undeniable in 1986. He took a real randomized job-training experiment, swapped its randomized control group for comparison groups drawn from survey data, and asked whether the leading econometric estimators could recover the experimental answer. They could not. Different reasonable specifications gave wildly different effects, and there was no internal way to tell which was right.

The credibility move was to stop trying to model the whole world and instead hunt for a setting where treatment was assigned as good as randomly — a state that raised its minimum wage while its neighbor didn’t, a lottery that admitted some students and not others, a discontinuity in eligibility at an arbitrary cutoff. Where you find that, you can identify a causal effect with assumptions a reader can actually scrutinize. The toolkit that grew up around this — instrumental variables done carefully, difference-in-differences, regression discontinuity — shares one feature: the identifying assumption is out in the open.
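The difference-in-differences logic reduces to a few lines of arithmetic. The sketch below uses made-up employment averages (they are illustrative only, not Card and Krueger's data), and the function name `diff_in_diff` is hypothetical. It assumes parallel trends: without the policy, the treated group would have changed by as much as the control group did.

```python
# Hypothetical average employment per restaurant, by group and period.
# Illustrative numbers only; they are not taken from any real study.
mean_employment = {
    ("treated", "before"): 20.0,
    ("treated", "after"): 20.5,
    ("control", "before"): 23.0,
    ("control", "after"): 21.0,
}


def diff_in_diff(means):
    """Change in the treated group minus change in the control group."""
    treated_change = means[("treated", "after")] - means[("treated", "before")]
    control_change = means[("control", "after")] - means[("control", "before")]
    return treated_change - control_change


effect = diff_in_diff(mean_employment)
print(f"Treated change minus control change: {effect:+.1f} jobs per restaurant")
```

The identifying assumption is the subtraction itself: the control group's change stands in for what would have happened to the treated group. A reader who doubts the estimate knows exactly which comparison to attack.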

The formal object the toolkit estimates is narrower than it looks, and the narrowness is the point. It is not the average effect of treatment over everyone. It is the local average treatment effect — the effect on the people whose treatment actually changed because of the natural experiment.

Write $Y_i(1)$ and $Y_i(0)$ for unit $i$’s outcome with and without treatment: the two potential outcomes, only one of which is ever observed. With a binary instrument $Z$ that is as good as randomly assigned, moves treatment $D$ in one direction only (no one is pushed out of treatment by it), and does not otherwise affect $Y$, the estimand is

$$\text{LATE} = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]} = E\big[Y_i(1) - Y_i(0) \,\big|\, \text{compliers}\big]$$

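The numerator is the effect of the instrument on the outcome, averaged over everyone, including the many people whose treatment it never changed. The denominator is the share of people whose treatment status the instrument actually flipped. Dividing by that share scales the diluted average back up to the effect on those people alone. What survives is the effect on the compliers, not the whole population.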


You never see the same person both treated and untreated, so you can never measure an individual effect directly. The natural experiment gives you a clean comparison only for the people whose situation it actually changed — the New Jersey restaurants that hired differently because of the wage hike. So the honest answer the method delivers is “here is the effect, for these specific people, in this specific setting.” That is less than economists used to claim. It is also far more defensible.
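A small simulation makes the complier logic concrete. This is a sketch under stated assumptions, not anyone's published design: the population shares, the effect sizes, and the names `SHARES`, `EFFECTS`, `draw_unit`, and `wald_estimate` are all hypothetical. Compliers have a true effect of 2.0, always-takers a larger effect and a higher baseline, and the instrument is a fair coin flip.

```python
import random

random.seed(0)

# Assumed population shares and treatment effects (all hypothetical).
SHARES = {"complier": 0.4, "always_taker": 0.3, "never_taker": 0.3}
EFFECTS = {"complier": 2.0, "always_taker": 5.0, "never_taker": -1.0}


def draw_unit():
    """Return (z, d, y) for one simulated unit."""
    kind = random.choices(list(SHARES), weights=list(SHARES.values()))[0]
    z = random.random() < 0.5  # instrument: a fair coin flip
    # Monotonicity: the instrument only ever moves compliers into treatment.
    d = kind == "always_taker" or (kind == "complier" and z)
    # Always-takers start from a higher baseline, which biases naive comparisons.
    y0 = random.gauss(0.0, 1.0) + (1.0 if kind == "always_taker" else 0.0)
    y = y0 + (EFFECTS[kind] if d else 0.0)
    return z, d, y


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def wald_estimate(units):
    """Ratio of the instrument's effect on Y to its effect on D."""
    reduced_form = mean(y for z, _, y in units if z) - mean(y for z, _, y in units if not z)
    first_stage = mean(d for z, d, _ in units if z) - mean(d for z, d, _ in units if not z)
    return reduced_form / first_stage, first_stage


units = [draw_unit() for _ in range(200_000)]
late_hat, first_stage = wald_estimate(units)
naive = mean(y for _, d, y in units if d) - mean(y for _, d, y in units if not d)

print(f"share of compliers (first stage): {first_stage:.2f}")
print(f"Wald estimate: {late_hat:.2f}   naive treated-vs-untreated gap: {naive:.2f}")
```

Under these assumptions the Wald ratio lands near the compliers' effect of 2.0, while the naive comparison of treated and untreated units is far larger because it mixes in the always-takers. The never-takers' effect never enters, since the instrument never treats them.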

One more piece made the revolution stick: inference caught up with identification. Bertrand, Duflo, and Mullainathan showed in 2004 that the standard way of computing standard errors in difference-in-differences studies was wildly overconfident, treating correlated observations as independent and manufacturing significance that wasn’t there. Cleaning that up — clustering, honest uncertainty — meant a credible causal estimate now had to survive not just a checkable identifying assumption but a checkable claim about how sure you were allowed to be.
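The overconfidence problem can be reproduced in a toy setting. The sketch below is a simplified analogue, not the design Bertrand, Duflo, and Mullainathan used: it assumes a placebo policy with zero true effect assigned to whole states, with every unit in a state sharing a common shock. All parameter values and function names are hypothetical. The naive test treats units as independent; the second test collapses each state to its mean first.

```python
import random

random.seed(1)


def mean(xs):
    return sum(xs) / len(xs)


def t_stat(a, b):
    """Two-sample t statistic that treats every entry of a and b as independent."""
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    se = (var(a) / len(a) + var(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def placebo_study(n_states=20, n_per_state=50, state_sd=1.0, noise_sd=1.0):
    """One fake study: a 'policy' with zero true effect, assigned by state."""
    treated_states = set(random.sample(range(n_states), n_states // 2))
    unit_y = {True: [], False: []}
    state_means = {True: [], False: []}
    for s in range(n_states):
        shock = random.gauss(0.0, state_sd)  # shared by every unit in the state
        ys = [shock + random.gauss(0.0, noise_sd) for _ in range(n_per_state)]
        is_treated = s in treated_states
        unit_y[is_treated].extend(ys)
        state_means[is_treated].append(mean(ys))
    return unit_y, state_means


def rejection_rates(n_sims=400):
    """Share of placebo studies declared 'significant' at a nominal 5% level."""
    naive = collapsed = 0
    for _ in range(n_sims):
        unit_y, state_means = placebo_study()
        naive += int(abs(t_stat(unit_y[True], unit_y[False])) > 1.96)
        # 2.10 is roughly the 5% critical value with 10 states per arm.
        collapsed += int(abs(t_stat(state_means[True], state_means[False])) > 2.10)
    return naive / n_sims, collapsed / n_sims


naive_rate, collapsed_rate = rejection_rates()
print(f"false-positive rate, units treated as independent: {naive_rate:.2f}")
print(f"false-positive rate, data collapsed to state means: {collapsed_rate:.2f}")
```

Because the policy does nothing, a well-calibrated test should reject about 5% of the time. Under these assumptions the naive test rejects far more often, since it counts a thousand units as a thousand independent pieces of evidence when there are really only twenty states' worth.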

This is the next move after the Lucas counter-revolution, and the textbook tradition tells it as a lineage. The counter-revolutionaries had attacked Keynesian models for resting on relationships that fell apart the moment policy used them; their answer was deeper microfoundations. The credibility revolution answered the same anxiety differently: not better models, but identification you could check. The History of Economic Thought treatment puts both moves in one chapter, and walks the labor-economics ignition of the credibility turn directly.

The formal econometrics — the identification problem, instrumental variables, difference-in-differences, regression discontinuity, and the potential-outcomes framework that underpins the LATE — has its full home in the Economics textbook.

The defenders’ case, at full strength

“Empirical microeconomics has experienced a credibility revolution… Better research design is integral to the improved reliability of empirical microeconomics.”
— Joshua Angrist & Jörn-Steffen Pischke, Journal of Economic Perspectives, 2010

Angrist and Pischke are not claiming economics found new truths. They are claiming it found a way to tell which of its empirical claims are even worth believing, and the distinction is sharper than it sounds. Before the revolution, a referee reading a structural paper had no independent way to check the assumptions that drove its headline number; the assumptions and the result were tangled together, and disagreement bottomed out in taste. After it, the identifying assumption sits on the surface where a hostile reader can attack it directly: is the Pennsylvania border a valid control for New Jersey? Is the eligibility cutoff really arbitrary? Those are answerable questions. The gain is not a better estimate of any one effect. It is that causal claims became contestable on their merits — a publishable applied-micro paper in 2025 lives or dies on a research-design argument that a 1985 paper never had to make and could not have won.

And the discipline scaled the move ruthlessly. Once the field agreed that credible identification was the price of entry, it reorganized graduate training, journal refereeing, and the entire incentive structure of applied work around finding clean variation. Whole literatures — the returns to schooling, the effect of class size, the impact of incarceration — were rebuilt on quasi-experimental footing and produced answers that held up under replication in a way the prior generation’s had not. That is the strongest form of the defenders’ claim, and on its own terms it is correct.

Where this leaves us

The credibility revolution made one specific thing visible that wasn’t visible before: whether the identifying assumption behind a causal claim is checkable by a reader. That is not nothing — it is the methodological line between a publishable applied-micro paper today and one from 1985. But notice what it does not do. It does not settle whether the question being asked is the right question, and it buys its certainty by narrowing what it answers, from “the effect of this policy” to “the effect on the people this particular experiment happened to move.” That narrowing is where the next stage starts to bite.

Getting a clean causal estimate for one specific question on one specific population is a different thing from knowing how the world works. The development economists who took this seriously didn’t run one experiment. They ran a thousand. What did they actually learn — and what did the field give up to learn it?

Stage 2 of 4

What was made visible — and what got displaced

Esther Duflo: Social experiments to fight poverty (TED 2010)

“We don’t really know whether [our aid] is doing any good… The good news is that we can find out. We can run experiments.”
— Esther Duflo, TED, “Social experiments to fight poverty”, 2010

Duflo’s pitch was deflating and radical at once: development economics had spent decades arguing over grand theories of growth while having almost no credible evidence about which interventions actually helped. Her answer was to treat policy like medicine — randomize, measure, accumulate. It is the credibility revolution taken to its cleanest possible form. And it is where the cost of that cleanliness comes into view.

Banerjee and Duflo, together with Sendhil Mullainathan, founded MIT’s Poverty Action Lab in 2003 and built an infrastructure for running randomized field experiments at scale. By the late 2010s the program had produced on the order of a thousand trials (on deworming, microfinance, conditional cash transfers, teacher attendance, bednets, immunization incentives), and the 2019 Nobel to Banerjee, Duflo, and Michael Kremer ratified it as the dominant empirical paradigm in development. The development-economics lineage, from Lewis and the structuralists through the RCT turn, is walked in the History of Economic Thought treatment.

What the RCTs made visible was a different kind of fact than the field was used to. The pre-revolution development literature trafficked in large structural claims — the big push, balanced growth, the two-gap model — that named what made whole economies grow. The trials replaced those with small, specific, falsifiable effects on identified populations: deworming raised school attendance in this part of Kenya by this much; microfinance did far less than its champions promised, almost everywhere it was tested. The honest unit of knowledge shrank from “what causes development” to “what this intervention did, here.” For the things the method can reach, that is a real upgrade: the field stopped bluffing about effects it had never credibly measured.

But the revolution did not just add a tool. It moved the center of gravity of the whole discipline, and other programs lost ground in the move. Cross-country growth regressions in the Barro mold — once a thriving industry — came to look hopeless, their identifying assumptions unfalsifiable in exactly the way LaLonde had warned. Calibrated structural macro and the ambitious policy-evaluation econometrics of the prior generation lost journal space and graduate-student attention to clean field experiments. The displaced program traces directly to the Lucas rational-expectations move, whose discipline-wide ambition the credibility turn quietly undercut in applied work.

Banerjee himself sits in the intellectual-genealogy graph among the modern pluralists; you can see the RCT program’s place in the methodology lineage in the interactive thought graph.

Two things, both true

“The practice of running experiments has fundamentally changed how we think about understanding the world… a tool to discover what works, rather than a way to confirm what we already believe.”
— Abhijit Banerjee & Esther Duflo, Nobel Prize lecture, “Field Experiments and the Practice of Economics”, 2019

Banerjee and Duflo’s claim is not that experiments are more rigorous — though they are. It is that experiments produce a different, more honest kind of knowledge claim: modest, context-specific, and falsifiable, where the prior generation had offered confident generality it could not back up. They are explicit that this is humbler economics, and they argue the humility is the gain. The field used to answer “how do poor countries develop?” with theories no one could test; now it answers “does this work, here?” with evidence anyone can check, and lets answers accumulate. On the questions it can reach, this is the strongest case that the revolution made the field know more, not just know more carefully.

“The method of randomization… cannot answer many of the questions that policy analysts want answered… Structural models are required to extrapolate beyond the experimental population and to evaluate policies that have not yet been tried.”
— James Heckman, Journal of Economic Literature, 2010 (“Building Bridges…”)

Heckman is making the displacement argument at its strongest, and it is not nostalgia. The structural program was built to do something the experimental program structurally cannot: evaluate a policy that has not been tried yet, on a population that has not been treated yet, by spelling out the behavioral model and the assumptions and accepting that the answer is only as good as they are. A randomized trial cannot tell you the effect of a minimum wage you have not enacted, in a labor market you have not observed. To get there you need a model of how people respond — which is to say, you need the structural apparatus the revolution displaced. His point is not that Card-Krueger was wrong; it held up. It is that a discipline that can only answer questions an experiment happens to make available has given up on a whole class of questions policy actually needs answered.

Where this leaves us

Two things are true at once. The development-RCT turn made visible a kind of knowledge the pre-revolution literature was bluffing about — small, context-specific, falsifiable effects on identified populations, where before there had been confident theory and almost no credible measurement. And the displacement of the structural program was a real loss, not merely a corrective: there are questions — the effect of an untried policy, the answer for a population no experiment reached — that the structural program was designed to address and the credibility-revolution toolkit structurally cannot. Both are gains-and-losses on the same ledger, and the next stage is where the critics press the second one.

If the revolution is so successful at what it does, why have its most thoughtful critics — Deaton, Heckman, the macro-side reckoning crowd — spent fifteen years arguing that something important got lost? They are not making one complaint. They are making three. What is the actual bill the revolution still owes?

Stage 3 of 4

The live critiques

“RCTs are valuable, but they have no special status, and they are subject to many problems… The deeper question — how to use experimental evidence to make decisions about people who were not in the experiment — has not been adequately addressed.”
— Angus Deaton & Nancy Cartwright, Social Science & Medicine, 2018

Deaton is a Nobel laureate who spent a career measuring poverty, not a structural die-hard defending lost turf. That is what makes the critique land. The mainstream defenders and the mainstream critics of the credibility revolution are not arguing about one thing. They are arguing about three at once — external validity, what structural models were for, and whether any of this aggregates up to macro. Take each at its strongest.

Start with the trade-off the whole debate turns on. A reduced-form estimate — the clean LATE from Stage 1 — makes minimal assumptions and is internally valid: within its experiment, the number is trustworthy. A structural estimate makes more assumptions and earns broader claims under those assumptions — but the assumptions are exactly what is hard to defend. Neither side gets both. The credibility revolution chose internal validity and minimal assumptions, and the three critiques are three different invoices for that choice.

External validity. A clean estimate on one sample tells you nothing about a different population unless you assume the treatment effect generalizes in some known way — and the moment you assume that, you are doing structural work, just without admitting it. The deworming effect in Kenya does not transfer to Bihar, where the parasites, schools, and households differ, and no amount of within-trial rigor can recover what the trial never observed. The bill is structural, not a matter of running more trials.
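The transfer problem can be sketched in a few lines. Everything here is hypothetical: the two sites are generic, the outcome scale is arbitrary, and the mechanism in `true_effect` (the treatment helps in proportion to baseline infection) is an assumption the trial itself never observes.

```python
import random

random.seed(2)


def mean(xs):
    return sum(xs) / len(xs)


def true_effect(infection_rate):
    # Assumed mechanism (hypothetical): the gain scales with baseline infection.
    return 10.0 * infection_rate


def run_trial(village_rates, n=50_000):
    """Randomize treatment within one site and return the difference in means."""
    treated, control = [], []
    for _ in range(n):
        rate = random.choice(village_rates)  # this unit's village infection rate
        baseline = random.gauss(50.0, 5.0)
        if random.random() < 0.5:
            treated.append(baseline + true_effect(rate))
        else:
            control.append(baseline)
    return mean(treated) - mean(control)


trial_site = [0.6, 0.7, 0.8]       # high-infection villages where the trial ran
target_site = [0.05, 0.10, 0.15]   # low-infection villages no trial reached

estimate = run_trial(trial_site)
target_truth = mean([true_effect(r) for r in target_site])

print(f"trial estimate at the trial site: {estimate:.2f}")
print(f"true average effect at the target site: {target_truth:.2f}")
```

The trial recovers its own site's effect accurately and still overstates the target site's effect several times over. The only thing that closes the gap is `true_effect`, which is exactly the behavioral model the trial does not supply.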

What structural was for. The point Heckman pressed in Stage 2 sharpens into a critique here: extrapolation to untried policies and unreached populations is not a bonus feature of structural modeling, it is the job it was built to do. A discipline that abandons it has not refined its toolkit; it has dropped a function.

Macro doesn’t aggregate. A credible micro estimate — say, the marginal propensity to consume of hand-to-mouth households — can inform a macro model’s calibration, but it does not replace the structural-macro machinery needed to answer what a tax cut does to output, or whether a rate hike will cause a recession. The credibility toolkit does not climb from clean micro effects up to general-equilibrium questions, which is a large part of why the revolution feels like an applied-micro event rather than a discipline-wide one. The post-2008 macro side has its own, deeper reckoning — treated in its own walkthrough.

In the textbook lineage, the external-validity critique and the post-credibility methodological landscape live in the modern-pluralism chapter, which walks the J-PAL program and the Deaton-Pritchett-Rodrik objection together. The structural-econometrics defense traces through the econometric-methodology turn that the information-economics chapter co-locates with its measurement lineage.

Three critics, three different bills

“Without knowing why things happen and why people do things, we run the risk of worthless casual (‘fairy story’) causal theorizing, and we will never know whether our results will generalize.”
— Angus Deaton, “Instruments, Randomization, and Learning about Development”

Deaton’s external-validity critique is a correct observation about the structural limits of the toolkit, not a curmudgeon’s complaint. His charge is precise: a trial that delivers a clean local effect, with no account of why the effect occurred, has bought identification credibility at the price of understanding — and understanding is exactly what you need to say anything about people who weren’t in the trial. The mechanism that produced the Kenyan deworming effect is the only thing that could tell you what to expect in Bihar, and the experiment, by design, stays silent on it. The bill is real and, fifteen years on, largely unpaid: the policy-evaluation infrastructure the revolution built has not solved external validity, and pretending otherwise is the field’s standing self-deception.

“The most informative evaluations combine experimental and structural methods… Atheoretical approaches… provide no framework for interpreting their estimates or for extrapolating to new environments.”
— James Heckman, “Econometric Causality”

Heckman’s defense of structural econometrics rests on its methodological frame, and that frame is right even where his empirical record is not. He has been wrong on specifics: some of his reanalyses of Card-Krueger did not replicate, and Card-Krueger held up. But the methodological claim is independent of that scoreboard: a structural model is the only object that supplies the parameters needed to interpret an estimate and carry it to a new environment, because it states what behavior generated the number. Strip that out and you have a measurement with no theory of why, useful for the case at hand and mute about every other case. The reduced-form program cannot do ex ante policy evaluation under explicit assumptions, and that is not a tuning problem to be fixed with cleaner instruments. It is a thing the program, by construction, does not do.

“We are digging ourselves, one step at a time, deeper and deeper into a Fantasyland, with economic agents who can solve… problems that humans cannot… The techniques we use affect our thinking in deep and not always conscious ways.”
— Ricardo Caballero, Journal of Economic Perspectives, 2010 (“Macroeconomics after the Crisis”)

The macro-aggregation critique is a frame-level argument, not a defense of bad pre-revolution structural work. Caballero’s complaint is that macro questions are different in kind from applied-micro questions: they live in dynamic general equilibrium, where the object of interest is the whole system’s response, and credibility-revolution-grade evidence informs that response without ever replacing the structural model needed to compute it. You can pin down the MPC of hand-to-mouth households with a beautiful natural experiment; you still cannot read off what a stimulus does to aggregate output without a model of how the pieces interact. The micro evidence is an input to macro, not a substitute for it — which is precisely why DSGE and calibrated structural macro persist in a field the credibility revolution otherwise reshaped.
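A toy calculation shows why the micro number is an input and not an answer. The sketch uses the textbook Keynesian-cross transfer multiplier, with and without import leakage, purely as a stand-in for "a model of how the pieces interact"; it is not the machinery of DSGE or calibrated structural macro. The MPC value and the import share are hypothetical.

```python
def transfer_multiplier(mpc, import_share=0.0):
    """Output change per unit of transfers in a textbook Keynesian cross.

    import_share is the marginal propensity to import; 0.0 means a closed economy.
    """
    return mpc / (1.0 - mpc + import_share)


mpc_hat = 0.5  # hypothetical micro estimate from a clean natural experiment

closed_economy = transfer_multiplier(mpc_hat)
open_economy = transfer_multiplier(mpc_hat, import_share=0.3)

print(f"same MPC, closed economy: multiplier {closed_economy:.3f}")
print(f"same MPC, open economy:   multiplier {open_economy:.3f}")
```

The identical, perfectly identified MPC implies different aggregate effects depending on which model it is plugged into. The experiment pins down the parameter; the choice of model, which the experiment cannot adjudicate, determines the macro answer.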

Where this leaves us

The three critiques are not making the same complaint, and that is the whole point. Deaton and Cartwright are correct that the external-validity bill is real and largely unpaid. Heckman is correct that the structural program was designed to do something — ex ante evaluation under explicit assumptions — that the reduced-form program structurally cannot, and that the loss of it is real. Caballero and the macro-reckoning crowd are correct that micro-empirical work does not aggregate up to the questions macro needs to answer. None of these invalidates Stage 1’s verdict: causal-claim standards genuinely rose. All of them constrain what that verdict is allowed to mean. The revolution won the argument it picked. The critics are pointing at the arguments it stopped having.

So if causal-claim standards genuinely rose, and the external-validity bill is real, and the displacement was partly loss — what does economics actually know now that it didn’t know before? And what is the discipline doing about the bill?

Stage 4 of 4

What economics actually knows now

“The gold standard for drawing inferences about the effect of a policy is a randomized controlled experiment. However… researchers are often interested in questions that… cannot be answered by experimentation. The challenge is to combine the credibility of experimental methods with the generalizability of structural methods.”
— Guido Imbens, Nobel Prize lecture / 2021 surveys on causal inference

Imbens shared the 2021 Nobel for the very potential-outcomes apparatus that powered Stage 1. He is not declaring victory. He is naming the unfinished job: the credible methods answer a narrow class of questions, the questions that matter often fall outside it, and the live work of the field is to bridge the two without retreating to the old false confidence. That bridge is the answer to “what does economics know now.”

The discipline’s response to the external-validity bill is not a return to pre-revolution structural confidence. It is a set of attempts to keep the identification discipline and recover the policy reach. Three are worth naming. Imbens and his collaborators have pushed the potential-outcomes framework toward questions of extrapolation and external validity directly — treating “does this transfer?” as a formal problem rather than a hope. Susan Athey and Guido Imbens have built a program of causal machine learning, using flexible algorithms to estimate how treatment effects vary across people and settings — the heterogeneity that external validity turns on — while keeping a credible identification strategy underneath. And Raj Chetty’s big-data work links millions of administrative records to causal designs, answering questions about mobility and opportunity at a scale and resolution the small-sample RCT never could.

The framing that unites them is the opposite of nostalgia. Pearl’s structural-causal-models program and the double/debiased machine-learning methods sit alongside this work as further tools for the same project; the point of the frontier is not to undo the credibility revolution but to combine its checkable identification with the structural program’s ambition to speak about cases no experiment reached. The modern-pluralism chapter stages this landscape as the discipline’s current methodological condition: not one paradigm, but a set of programs negotiating the same trade-off.

The threads against each other

Weigh the three threads honestly against the question they were always about — not “is the revolution good” but “what does economics now know.” Stage 1’s defenders are right and uncontested: in labor, public, development, and applied micro, what counts as a believable causal claim is far stricter and far more transparent than it was in 1985. That is a genuine epistemic gain, and it is permanent. Stage 2’s displacement is also real: the range of questions the field actively pursues narrowed in some domains and changed shape, and the narrowing was part corrective — LaLonde was right that pre-revolution structural claims were often overstated — and part loss, because the structural program did real work that nothing has replaced.

Stage 3’s critics constrain the verdict without overturning it. External validity is an open bill. Heckman’s function — ex ante evaluation under explicit assumptions — is one the reduced-form program cannot perform. Macro mostly did not aggregate from the toolkit. Put together, the threads do not net out to “economics knows more.” They net out to something more precise and more interesting: economics knows different things, better. It holds a far stronger grip on a narrower class of causal questions, a weaker claim on the big structural ones it used to answer with unearned confidence, and an honest open problem — external validity — where it once would have waved its hands.

The verdict

Did the credibility revolution change what economics knows? Yes, but the honest answer is calibrated, not a slogan. Economics knows different things, better, not necessarily more things; and which it is depends entirely on the question you ask. Four commitments hold that up.

Causal-claim standards genuinely rose: in applied micro, the bar for a believable causal claim is far higher and far more transparent than pre-1990, and this is not seriously contested.

The range of questions narrowed in some domains and changed shape: cross-country growth regressions, calibrated structural-macro counterfactuals, and the structural-econometrics program lost ground, some of that displacement corrective, some of it genuine loss.

External validity is a real and unresolved bill: the policy-evaluation infrastructure did not solve the problem of using one population’s evidence to decide for another, and the discipline should stop pretending it did.

Macro mostly didn’t aggregate: DSGE and calibrated structural macro persist because the credibility toolkit does not replace what they do, which is why the revolution reads as an applied-micro event more than a discipline-wide one.

The caveat that keeps this honest: the narrowing is local, not global. Calling 2025 economics “narrower” across the board is wrong — the field reaches into causal questions it could not credibly touch a generation ago. And the post-credibility synthesis — Imbens on extrapolation, Athey-Imbens on causal machine learning, Chetty on big-data causal inference, with Pearl’s structural-causal models and double/debiased ML alongside — is the live response to the bill, not a settled answer. This walkthrough does not crown a winner among those directions, because the field hasn’t, and pretending otherwise would be exactly the unearned confidence the revolution was a reaction against.

The credibility revolution is one thread in the larger question of whether economics is a single unified science or a federation of specialties — for that integration, see the sibling walkthrough Is economics one science, or many?.

Where this leaves us

We started with fast-food restaurants on either side of a state line and a result that broke a textbook prediction. What made Card and Krueger matter was not the answer but the move: stop trusting models you can’t check, find variation a reader can scrutinize, and accept a narrower, more honest claim in exchange. The development economists pushed that move to its cleanest form and learned a thousand small, specific, falsifiable things, while the structural program that once answered the big questions lost its place. The critics named the price: an unpaid external-validity bill, a real lost function in ex ante policy evaluation, and a macro that never aggregated from the toolkit. The frontier (causal machine learning, big-data causal inference, the formal study of extrapolation) is the field’s working attempt to keep the discipline and recover the reach.

What survives the integration is a single calibrated sentence: economics knows different things, better, not necessarily more things, and which it is depends on the question. The standard for a causal claim rose and will not fall back. The range of what the field confidently pursues narrowed in places, partly for good reasons and partly at real cost. And the hardest question — how to use what you learned about the people in your experiment to decide for the people who weren’t — is still open, which is exactly where the live work is happening. The next time someone tells you the credibility revolution either fixed economics or hollowed it out, you have the tools to push past both slogans to the calibration each one skips.
