Problem:
Determining the number of people killed in wars is immensely difficult: chaos, poor communication, and propaganda can wildly distort the figures.
Solution:
Rebecca Steorts, an assistant professor of statistics at Duke University, is using advanced data-analysis techniques to help human rights groups get definitive casualty counts.
Since the Syrian civil war began in 2011, six private organizations have been building databases of death totals. There is also an “official” governmental tally. But compiling them into one master document is a data nightmare because of duplicates, misspelled names, inaccurate dates, and even wrong genders. One estimate showed that running a basic comparison algorithm on the combined lists would take 57 days. In 2013, Steorts realized that by combining a Bayesian statistical approach with a machine-learning technique called blocking, she could reliably merge the databases—and do it in less than a day.
Blocking works by placing items that are similar to one another—say, similar names or approximate dates of death—in the same group for comparison. (A simple analogy: if you were trying to compile one whole set of cards out of two incomplete decks, you’d separate them into suits first and then discard the duplicates.) Only after it has assembled the various blocks does Steorts’s software do the intensive work of linking individual records.
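To make the idea concrete, here is a minimal sketch of blocking on a handful of invented records. It is not Steorts's software: the record fields, the blocking key (first letter of the surname plus year of death), the similarity threshold, and the use of Python's `difflib` in place of her Bayesian linkage model are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (source list, name, date of death as YYYY-MM-DD).
RECORDS = [
    ("list_a", "Ahmad Khalil", "2012-03-14"),
    ("list_b", "Ahmed Khalil", "2012-03-15"),
    ("list_a", "Sara Haddad", "2013-07-02"),
    ("list_c", "Sarah Hadad", "2013-07-02"),
    ("list_b", "Omar Nasser", "2012-11-30"),
    ("list_c", "Khaled Aziz", "2013-01-09"),
]


def block_key(record):
    """Cheap grouping key (an assumption): surname initial plus year of death."""
    _, name, date = record
    surname = name.split()[-1]
    return (surname[0].upper(), date[:4])


def build_blocks(records):
    """Group records so that only plausible matches share a block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    return blocks


def similar(a, b, threshold=0.8):
    """Illustrative name-similarity check; a stand-in for a real linkage model."""
    ratio = SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()
    return ratio >= threshold


def candidate_duplicates(records):
    """Run the expensive pairwise comparison only inside each block."""
    pairs = []
    compared = 0
    for members in build_blocks(records).values():
        for a, b in combinations(members, 2):
            compared += 1
            if similar(a, b):
                pairs.append((a, b))
    return pairs, compared


matches, compared = candidate_duplicates(RECORDS)
all_pairs = len(RECORDS) * (len(RECORDS) - 1) // 2
print(f"compared {compared} of {all_pairs} possible pairs")
for a, b in matches:
    print(a[1], "<->", b[1])
```

On these six records, blocking cuts the work from 15 pairwise comparisons to 2. The trade-off is that a key this crude would miss a duplicate whose surname initial or year was recorded wrongly, which is one reason a real system needs a more careful grouping step and a statistical model for the final linking.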
The Human Rights Data Analysis Group, a nonprofit that publishes an annual death toll for Syria, is testing Steorts’s method to see whether it can be incorporated into the estimate it will release in 2016.
—Patrick Doyle