With Big Data and Predictive Analytics, Scientists Are Getting Smarter About Outbreaks

A Learning Process

In the world of epidemiology, diseases that have seen an uptick in recent years are called “emerging infectious diseases.” But are there really more cases of these diseases, or have we just become better at spotting them? According to Barbara Han, a disease ecologist at the non-profit Cary Institute of Ecosystem Studies in New York, it’s not just us getting better. “It’s actually an increasing problem of infectious diseases,” she says. And most of these diseases originate in animals.

Han decided to figure out what makes certain animals more likely to host specific diseases. “There is something inherent about a species that enables it to carry disease, compared to the vast majority that don’t,” she says. “I want to know what the data can give me, what can the data show me, about what distinguishes those two.” She turned to algorithms and machine learning.

Han starts with a list of species that researchers have already flagged as disease carriers or non-carriers. She then trains a computer algorithm to separate the species on the list by dozens of traits. The carrier labels are withheld during this step, so the algorithm doesn’t know which species is which. For example, the algorithm may start by looking at an animal’s body mass, followed by its age of sexual maturity and finally by whether it’s nocturnal or not. At the end of this sorting, the algorithm will ideally have grouped species by whether they’re disease carriers or not.
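One pass of this kind of sorting can be pictured as a small decision tree. The sketch below is only an illustration: the `Species` fields, the cutoff values and the group names are all invented here and are not taken from Han's work.

```python
from dataclasses import dataclass


@dataclass
class Species:
    name: str
    body_mass_g: float
    maturity_days: float
    nocturnal: bool


def sort_species(s: Species) -> str:
    """Ask about one trait at a time: mass, then maturity, then nocturnality."""
    if s.body_mass_g > 1000:      # hypothetical cutoff
        return "likely non-carrier"
    if s.maturity_days > 120:     # hypothetical cutoff
        return "likely non-carrier"
    return "likely carrier" if s.nocturnal else "uncertain"


# Made-up species used only to exercise the tree.
examples = [
    Species("species A", 25, 50, True),
    Species("species B", 30, 300, True),
    Species("species C", 5000, 60, False),
]
groups = {s.name: sort_species(s) for s in examples}
```

A real model would learn both the order of the questions and the cutoffs from the data rather than having them written in by hand.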

But this first sort gets a fair bit wrong. To make the algorithm more accurate, Han has the computer do another round of sorting, this time focusing on the species it miscategorized the first time. By repeating this over and over, the algorithm learns. And, importantly, it learns which factors contribute to whether a species carries a transmissible disease. “At the end of that process, you get a very powerful predictor,” Han says. When the model examines a species that’s a question mark, one whose disease status isn’t known beforehand, it can use what it’s learned to study that species’ traits, compare them with traits from known carriers and predict the likelihood of that species hosting a disease.
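Repeated sorting that concentrates on earlier mistakes is the idea behind a family of methods called boosting. The article does not name the exact method Han uses, so the sketch below substitutes AdaBoost, one well-known member of that family, built from one-question "stumps". The trait names, the eight rows of data and their labels are invented for illustration.

```python
import math

# Hypothetical toy data, invented for illustration (not from Han's study).
# Columns: body mass (g), age at sexual maturity (days), nocturnal (1/0).
TRAITS = ["body_mass_g", "maturity_days", "nocturnal"]
X = [
    (20, 45, 1), (25, 50, 1), (30, 60, 0), (400, 70, 1),
    (35, 200, 1), (300, 240, 0), (500, 300, 0), (28, 55, 0),
]
y = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = known carrier, -1 = known non-carrier


def stump_predict(stump, row):
    """A stump asks one question: is the trait value <= threshold?"""
    trait, threshold, polarity = stump
    return polarity if row[trait] <= threshold else -polarity


def best_stump(X, y, weights):
    """Try every trait, threshold and polarity; keep the lowest weighted error."""
    best, best_err = None, float("inf")
    for trait in range(len(X[0])):
        for threshold in sorted({row[trait] for row in X}):
            for polarity in (1, -1):
                stump = (trait, threshold, polarity)
                err = sum(w for row, label, w in zip(X, y, weights)
                          if stump_predict(stump, row) != label)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def boost(X, y, rounds):
    """AdaBoost: each round up-weights the species the last stump got wrong."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []  # list of (vote weight, stump)
    for _ in range(rounds):
        stump, err = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Misclassified rows are multiplied by exp(+alpha), correct ones by exp(-alpha).
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model, weights


def carrier_score(model, row):
    """Weighted vote of all stumps; positive means 'looks like a carrier'."""
    return sum(alpha * stump_predict(stump, row) for alpha, stump in model)


first_round, weights_after_one = boost(X, y, rounds=1)
model, _ = boost(X, y, rounds=10)
```

On this toy data the first stump splits on age at maturity and misclassifies only the last species. After reweighting, that single species carries half of the total weight, so the next round is pushed to get it right. Calling `carrier_score(model, row)` on the traits of a species of unknown status gives a score whose sign and size indicate how carrier-like it looks to the model.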

The algorithm can also create a list of animals ranked by their risk of carrying disease, as well as a description of the traits that determine that risk. For example, when Han trained the algorithm with hundreds of mouse species, it determined that disease-carrying risk was associated with a rapid life cycle: early sexual maturity, frequent reproduction and fast growth rates. Knowing which animals and which traits are most likely to be associated with disease allows researchers to zero in on where the next pandemic could originate and prepare for it.
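Both outputs, the ranked list and the trait summary, can be read off a trained model fairly directly. The sketch below assumes a boosted model has already been fitted. The species names, risk values, trait names and vote weights are placeholders invented for illustration, and summing vote weights per trait is a deliberately simplified measure of importance (real libraries usually measure how much each split improves the fit).

```python
# Hypothetical output of an already-trained boosted model (invented numbers).
# Each entry: (trait used by that round's split, vote weight of that round).
rounds = [
    ("maturity_days", 0.97),
    ("litters_per_year", 0.80),
    ("body_mass_g", 0.35),
    ("maturity_days", 0.60),
    ("growth_rate", 0.48),
]

# Hypothetical predicted carrier probabilities for species of unknown status.
predicted_risk = {"species A": 0.82, "species B": 0.15, "species C": 0.57}


def rank_by_risk(risk):
    """Return (species, risk) pairs, highest risk first."""
    return sorted(risk.items(), key=lambda item: item[1], reverse=True)


def trait_importance(rounds):
    """Share of the total vote weight attributable to each trait."""
    totals = {}
    for trait, weight in rounds:
        totals[trait] = totals.get(trait, 0.0) + weight
    grand_total = sum(totals.values())
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {trait: weight / grand_total for trait, weight in ordered}


ranking = rank_by_risk(predicted_risk)
importance = trait_importance(rounds)
```

Here `ranking` is the watch list of species to sample first, and `importance` shows which traits the model leaned on most when making its calls.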