The reams of data that many modern businesses collect, dubbed “big data,” can provide powerful insights. Such data is the key to Netflix’s recommendation engines, Facebook’s social ads, and even Amazon’s methods for speeding up Silk, the new Web browser that comes with its new Fire tablet.
But big data is like any powerful tool. Using it carelessly can have dangerous results.
A new paper presented at a recent Symposium on the Dynamics of the Internet and Society spells out the reasons that businesses and academics should proceed with caution. While privacy invasions—both deliberate and accidental—are obvious issues, the paper also warns that data can easily be incomplete and distorted.
“With big data comes big responsibilities,” says Kate Crawford, an associate professor at the University of New South Wales, who was involved with the work. “There’s been the emergence of a philosophy that big data is all you need,” she adds. “We would suggest that, actually, numbers don’t speak for themselves.”
Crawford’s paper, written with Microsoft senior researcher Danah Boyd, illustrates the ways that big data sets can fall down, particularly when used to make claims about people’s behavior. “Big data sets are never complete,” Crawford says. For example, researchers often study Facebook to analyze people’s social relationships, using connections made through the social network as a stand-in for real-world ties. But it’s common for Facebook to show a distorted picture of people’s closest social relationships, such as with parents, live-in romantic partners, or friends seen daily. “Facebook is not the world,” Crawford says.
Google is a poster child for the power of data. The company has transformed a massive amount of information, gathered through its search engine, into a commanding ad network and a powerful role as the gatekeeper of much of the world’s information.
At a conference on Knowledge Discovery and Data Mining in August, I watched Google’s director of research, Peter Norvig, demonstrate the true power of a large data set, using the example of machine translation. Norvig showed that training algorithms on very large data sets, like those Google has collected from the many Web pages it crawls that are available in multiple languages, can produce dramatic results. With enough data, Norvig said, even the worst algorithm performs far better than what can be achieved with a smaller data set.
But Crawford and Boyd’s work shows that studying large data still requires finesse. Twitter, which is commonly scrutinized for insights about people’s moods, attitudes toward politics, and other aspects of daily life, presents a number of problems, the researchers say. About 40 percent of Twitter’s active users sign in to listen, not to post, which, Crawford and Boyd say, suggests that posts could come from a certain type of person rather than from a random sample of users. They also note that few researchers have access to all Twitter posts; most use smaller samples provided by the company. Without better information about how those samples were collected, studies could arrive at skewed results, they argue.
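The sampling concern can be made concrete with a small sketch. The Python below is an illustration only, not anything taken from Crawford and Boyd's paper: the population size, the `User` type, and the approval rates are all invented, and the one figure borrowed from the article is the 40 percent share of users who only listen. It assumes, purely for the sake of argument, that listeners and posters hold different opinions.

```python
from dataclasses import dataclass


@dataclass
class User:
    posts: bool     # False for users who sign in only to listen
    approves: bool  # hypothetical opinion a researcher wants to measure


def build_population():
    # Invented numbers: 1,000 users, 40 percent of whom never post.
    # Assumption for illustration: 70% of listeners approve, 40% of posters do.
    listeners = [User(posts=False, approves=(i < 280)) for i in range(400)]
    posters = [User(posts=True, approves=(i < 240)) for i in range(600)]
    return listeners + posters


def approval_rate(users):
    # Fraction of the given users who approve.
    return sum(u.approves for u in users) / len(users)


population = build_population()
# A study based on posts can only "see" the users who actually post.
visible = [u for u in population if u.posts]

true_rate = approval_rate(population)
observed_rate = approval_rate(visible)

print(f"approval across all users:   {true_rate:.2f}")
print(f"approval among posters only: {observed_rate:.2f}")
```

Under these made-up numbers, the posters-only estimate lands well below the rate for the whole population. The size and direction of the gap depend entirely on the assumed difference between the two groups, which is exactly the information a researcher working from posts alone does not have.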
Crawford notes that many big data sets—particularly social data—come from companies that have no obligation to support scientific inquiry. Getting access to the data might mean paying for it, or keeping the company happy by not performing certain types of studies.
The researchers add that big data can also raise serious ethical concerns.
Many times, Crawford notes, combining data from different sources can lead to unexpected results for the people involved. For example, other researchers have previously shown that they can identify individuals by using social media data in combination with supposedly anonymized behavioral data provided by companies.
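A rough sketch shows how this kind of re-identification can work in principle. The Python below is a hypothetical illustration, not the method used by any of the researchers mentioned here: every record, name, and field (`zip`, `birth_year`, `gender`) is invented. It assumes the "anonymized" records still carry a few background attributes that also appear on public profiles.

```python
from collections import defaultdict

# Hypothetical "anonymized" behavioral records: names removed,
# but a few background attributes left in place.
anonymized = [
    {"zip": "02139", "birth_year": 1975, "gender": "F", "query": "late tax filing"},
    {"zip": "02139", "birth_year": 1982, "gender": "M", "query": "knee surgery cost"},
    {"zip": "94105", "birth_year": 1990, "gender": "F", "query": "job openings"},
]

# Hypothetical public social media profiles with the same attributes.
public_profiles = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1975, "gender": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "gender": "M"},
    {"name": "Carol Example", "zip": "94105", "birth_year": 1990, "gender": "F"},
    {"name": "Dana Example", "zip": "94105", "birth_year": 1990, "gender": "F"},
]

KEYS = ("zip", "birth_year", "gender")


def reidentify(records, profiles):
    """Map a record's index to a name when exactly one profile shares its attributes."""
    names_by_key = defaultdict(list)
    for profile in profiles:
        names_by_key[tuple(profile[k] for k in KEYS)].append(profile["name"])

    matches = {}
    for index, record in enumerate(records):
        candidates = names_by_key.get(tuple(record[k] for k in KEYS), [])
        if len(candidates) == 1:  # a unique match ties the record to one person
            matches[index] = candidates[0]
    return matches


linked = reidentify(anonymized, public_profiles)
for index, name in linked.items():
    print(f"{name} -> {anonymized[index]['query']}")
```

In this toy data, the first two records are tied to a single named person, while the third stays ambiguous because two profiles share the same attributes. Neither data set names anyone's behavior on its own; the exposure comes from the combination.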
Jennifer Chayes, managing director of Microsoft Research New England, says her lab has had firsthand experience with such problems. The lab wanted to run a contest for researchers to analyze a set of search data, she says, and was going over the data carefully to avoid the sorts of deanonymizing scandals that have followed search data releases in the past. The researchers discovered that people often entered search terms that were personally identifying and embarrassing, such as “Is my wife Jane Doe cheating on me?” The lab nixed the contest. Chayes says, “We began to realize how much we didn’t understand about human behavior around search engines.”
Handling big data sets takes almost impossible care, agrees Alessandro Acquisti, an associate professor at Carnegie Mellon who has studied the unintended information that data sets can reveal. Even public data sets raise questions, such as what to do with information that people post and then subsequently want to delete, he says.
Given the quantity of information now available on the Internet, Crawford argues, researchers need to slow down and think about the methods they use. “[The effect of the availability of big data] did shock a lot of people,” she says. “And it should.”