Darpa Wants to Solve Science's Replication Crisis With Robots

Say this much for the “reproducibility crisis” in science: It’s poorly timed. At the same instant that a significant chunk of elected and appointed policymakers seem to disbelieve the science behind global warming, and a significant chunk of parents seem to disbelieve the science behind vaccines … a bunch of actual scientists come along and point out that vast swaths of the social sciences don’t stand up to scrutiny. They don’t replicate—which is to say, if someone else does the same experiment, they get different (often contradictory) results. The scientific term for that is bad.

What’s good, though, is that the scientific method is built for self-correction. Researchers are trying to fix the problem. They’re encouraging more sharing of data sets and urging each other to preregister their hypotheses—declaring what they intend to find and how they intend to find it. The idea is to cut down on the statistical shenanigans and memory-holing of negative results that got the field into this mess. No more collecting a giant blob of data and then combing through it for a publishable outcome, a practice known as “HARKing”—hypothesizing after results are known.

And self-appointed teams are even going back through old work, manually, to see what holds up and what doesn’t. That means doing the same experiment again, or trying to expand it to see if the effect generalizes. It’s a slog—boring, expensive, and time-consuming. To the Defense Advanced Research Projects Agency, the Pentagon’s mad-science wing, the problem demands an obvious solution: Robots.

A Darpa program called Systematizing Confidence in Open Research and Evidence—yes, SCORE—aims to assign a “credibility score” (see what they did there) to research findings in the social and behavioral sciences, a set of related fields to which the reproducibility crisis has been particularly unkind. In 2017, I called the project a bullshit detector for science, somewhat to the project director’s chagrin. Well, now it’s game on: Darpa has promised $7.6 million to the Center for Open Science, a nonprofit organization that’s leading the charge for reproducibility. COS is going to aggregate a database of 30,000 claims from the social sciences. For 3,000 of those claims, the Center will either attempt to replicate them or subject them to a prediction market—asking human beings to essentially bet on whether the claims would replicate or not. (Prediction markets are pretty good at this; in a study of reproducibility in the social sciences last summer, for example, a betting market and a survey of other researchers performed about as well as actual do-overs of the studies.)

“The replication work is an assessment of ground-truth fact,” a final call on whether a study held up or failed, says Tim Errington, director of research at COS. “That’s going to get benchmarked against algorithms. Other teams are going to come up with a way to do that automatically, and then you assess each against the other.”
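For a rough sense of what that benchmarking could look like, here is a minimal sketch. It assumes each method emits a probability that a claim will replicate and that the replication attempts supply a 0-or-1 ground truth. The Brier score used here is one common way to grade probabilistic forecasts; the article does not say which metric SCORE will use, and the function name and all the numbers below are invented for illustration.

```python
def brier_score(predictions, outcomes):
    """Mean squared gap between predicted probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecaster, and always guessing
    0.5 earns 0.25.
    """
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must be the same length")
    if not outcomes:
        raise ValueError("need at least one claim to score")
    squared_gaps = [(p - o) ** 2 for p, o in zip(predictions, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)


# Hypothetical ground truth: 1 means the claim replicated, 0 means it did not.
replicated = [1, 0, 1, 0]

# Hypothetical credibility scores for the same four claims.
market_scores = [0.8, 0.3, 0.6, 0.4]     # prediction market (made-up numbers)
algorithm_scores = [0.9, 0.2, 0.7, 0.1]  # automated scorer (made-up numbers)

market_error = brier_score(market_scores, replicated)
algorithm_error = brier_score(algorithm_scores, replicated)

print(f"market Brier score:    {market_error:.4f}")
print(f"algorithm Brier score: {algorithm_error:.4f}")
```

In this toy data the algorithm comes out ahead, but only because the numbers were chosen that way. The point is the shape of the comparison: every method gets graded against the same replication results.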

In other words, first you get a database, then you do some human assessment, and then the future machine overlords come in? “I would say ‘machine partners,’” says Adam Russell, an anthropologist and SCORE program manager at Darpa. He’s hoping that the machine-driven “phase II” of the program, which starts taking applications in March, will lead to algorithms that outperform bettors in a prediction market. (Some early work has already suggested that it’s possible.) “It’ll potentially provide insight into ways we can do things better,” Russell says. Russell wants the Defense Department to understand problems in national security: how insurgencies form, how humanitarian aid gets distributed, how to deter enemy action. To do that, the department needs to know which research studies are worth paying attention to.

But if SCORE should happen to also address fundamental weaknesses in the social sciences? Yeah, that’d be cool or whatever. In 2017, a sociologist at Microsoft Research named Duncan Watts wrote a resonant critique of what he called an “incoherency problem” in his field. Watts warned that the social and behavioral sciences were having trouble reproducing scientific claims—a key test of validity—because they don’t have a unifying theoretical structure. Even if an individual article made a claim that withstood rigorous testing and statistical analysis, it might not use the same words as an adjacent article, or it’d use the same words but intend different meanings.

Take the case of research into the significance of informal networks within organizations. Everyone knows those are super-important. Watercooler talk, Slack DMs, those people who are always in each other’s offices—those interactions matter. They’re where all the real decisions get made, right? Figure out a way to structure them, and you can improve any organization. “That sounds like a claim, right? And you want to know, is that claim correct?” Watts says. “The problem is, it’s not really a claim. It’s, like, 100 different claims.” What’s an “organization?” What does “matter” mean? What counts as a network? Without nailing that kind of thing down, “you’re basically sort of doing what you might call ‘strategic ambiguity,’ or ‘creative interpretation,’” Watts says. “Or just kind of bullshitting.”

From that perspective, even figuring out what belongs in that database of 30,000 claims will be key to getting a useful outcome. But if it works, it’s even possible that the algorithmic tools will learn to predict reproducibility by picking up on more than the expected red flags that a replication study or a bettors’ market would seize on. The sheer size of the cross-disciplinary database might reveal all sorts of new variables. “We’ve never actually done something like this, where we’ve aggregated multiple data sets,” Errington says. “It really pushes the envelope on everything that we and other groups have been working on. And then, of course, we’ll see what we can do with it.”

That’s what Darpa wants, too: algorithms that go beyond what humans already understand. And, since one of the requirements of the program is that the algorithms be interpretable (as opposed to inscrutable “black boxes”), they’ll be able to teach those new principles for credible science to us lowly meatbags. “We want to pick up lots of weak signals well beyond human bandwidth, and combine them to help us make better decisions,” Russell says. Built in there somewhere could even be the infrastructure for forcing all those squishy social science constructs to actually relate to one another.

Believe it or not, even Watts is optimistic about whether it’ll work. No one’s more surprised than he is. “It’s such a Darpa thing to do, where they’re like, ‘We’re Darpa, we can just blaze in there and do this super-hard thing that nobody else has even thought about touching,’” he says. “Good for them, man. I want to help.”

Yes. That’s just what the machine overlords were hoping he’d say.
