Figuring out how human beings do human things is one of the most exciting things that science—psychology, sociology, economics, anthropology—can do. It’s also one of the hardest. Reliable, meaningful methods that distill real-world behavior into experimental variables have been, let’s say, elusive. That might be part of the reason the “reproducibility crisis”—concerns about the validity of some scientific findings because of statistical and methodological shortcomings—hit the so-called soft sciences first and hardest.
Matt Salganik, a sociologist at Princeton, is trying to solve that hard problem. He wants to know how human beings behave and why, especially in a socially mediated world. To do it, Salganik has become a hardcore data nerd. The digital traces everyone now leaves on servers provide inexhaustible fuel for the science of human behavior, he says, and learning to use them wisely could also fix the various crises that science now sees in its own practices. Salganik’s new book Bit by Bit: Social Research in the Digital Age, out December 13, lays down the new (and not-so-new) rules for bringing data and the social sciences together.
WIRED: The book has a sort of interesting origin story.
Salganik: My dissertation research was an online experiment. We created a website where people could download new music, but we could control what information people had about what other people were doing. This allowed us to create and test social fads. By doing it on a website rather than in a traditional on-campus lab, we were able to have about 100 times the number of participants you’d normally have. We got 27,000 people.
The paper was published in November of 2006, and since then I’ve been doing research using digital-age techniques and teaching it to students. This book is the result of that experience. I wanted to help others get started doing this kind of research, and help others who are already doing it in one field to see connections with other fields.
When the book went in for traditional peer review, it also went online for a parallel open review. I converted the book into a series of websites, and anyone could come and read them and annotate them. I was able to collect a tremendous amount of feedback that helped with the book, and I was able to collect a lot of data about how people interacted with the book in the wild. All the big data techniques that big media and tech companies use, we were using those as well. And now we’ve released an open review toolkit that other authors can use.
Was the feedback you got through the open review very different from the more formal peer review?
The feedback I got from the peer review was from experts who often had ideas about how they thought the book should have been written.
Some of them were good ideas, and it was helpful. The feedback I got from the open review was different. It included non-experts, and I want my book to be readable and helpful to non-experts. So that was very helpful in diagnosing some of the problems in the writing. There was an annotation about me skipping a step in an argument, and I looked at it and thought, ‘Oh yeah, I did skip a step.’ To the peer reviewers and to me it was an obvious step, but to the non-experts, it wasn’t.
Who do you think will be able to use the book? Who’s the audience?
I hope the audience will be broad. People in the social sciences are facing this set of issues. People in data science. And then outside universities, many companies have data scientists trained in computer science, engineering, and statistics who are now working with social data. They are essentially social scientists, but they have none of the training of social scientists. For those people, I hope the book introduces them to some of the ideas from social science and the ways social scientists do their work. I did a sabbatical at Microsoft Research, and there were some very sophisticated engineers there who just didn’t know a lot about social science.
In a few places you make some points about the differences between data scientists and social scientists. Where do those cultures diverge?
I see these communities as having a lot to learn from each other and contribute to each other. Social scientists in the past have generally worked with data that was specifically created for the purposes of research. In the book I call this “custom-made data.” And data scientists tend to work with “ready-made data,” originally made for one purpose and being repurposed for research. So for example if social scientists wanted to study public opinion, their natural first thought would be to look at a survey like the General Social Survey, done by researchers for other researchers. A data scientist’s first stop might be to look at Twitter.
Some of these differences come from what is valued in these different communities. For social scientists, it’s often being able to make an empirical statement about some bigger theory. For data scientists, it’s often more to do something neat or interesting or novel with data. Those kinds of differences in values can lead to different approaches.
Also there are differences in training. Social scientists are trained in how survey data is collected and how to analyze it; data scientists often don’t have this training, but they have training in other things, like how to work with very large data sets. So social science can learn a lot from the techniques and viewpoints of data scientists, and likewise data scientists can learn a lot from social scientists. If you want to study public opinion, it doesn’t make sense to say the General Social Survey is better than Twitter. You have to ask which data source is most useful for the question that we have.
One chapter that particularly grabbed me had to do with ethics. You write that social scientists mostly only think about ethics when they have to deal with the seemingly intractable bureaucracy of an Institutional Review Board’s rules for how they treat living subjects, and that data scientists basically don’t think about ethics at all.
My statement was definitely sort of broad and sweeping, but it’s a statement of what the world is and not of what it should be. Among the researchers I talk to, no one wants to be unethical, but the ethics of a lot of analog-era social science research—lab experiments on campus, surveys, ethnography—has more or less been settled. Generally there’s agreement on what you can and can’t do. The way that social scientists approached ethics prior to a lot of this big data research had become, I would say, somewhat routinized.
And now there is a possibility for us to do very different things. Our ability to observe millions of people without consent or awareness, and our ability to enroll people in experiments without consent or awareness, these are new things we can do, and I don’t think we as academics have figured out how to use that power responsibly. Similar questions have arisen in industry and government. A big challenge for us in the digital age is to figure out how to take advantage of these opportunities in a way that’s responsible. In the book I try to lay out some principles we can follow that will help people think about and talk about that.
Those are respect for persons, beneficence, justice, and respect for law and public interest.
Yeah, and these ideas are not ones I created. One reason I’m confident they’re likely to be useful in the future is that they have been enduring. The Belmont Report, from which I drew some of those principles, was published more than 40 years ago. One of the reasons to go with a principles-based approach rather than a rules-based approach is that we can be confident the abilities we’re going to have are going to change. To reason about those new capabilities, we need to have somewhat abstract principles.
The one most researchers who work with people talk about is informed consent, making sure the people you’re working with know what they’re signing up for.
That’s a key part of the four principles I lay out, but those principles are broader than just consent. Right now there’s a huge emphasis on informed consent, and it’s obviously important, but we may be putting too much emphasis on that one specific thing and not enough on the broader idea of respect for persons, which is the principle from which informed consent is derived.
It’s interesting that you’re suggesting a data-driven approach to social scientists at the exact moment that the social sciences are dealing with a crisis that’s about data—reproducibility problems and statistical manipulations that call into question some of the field’s key findings.
I would say the transition from the analog age to the digital age, which is what’s driving a lot of these new sources of data, is also enabling social scientists to have new work practices. It makes it easier for us to share our data and code, and it makes it easier for us to provide access to our research to everyone, not just people who are lucky enough to be at universities with subscriptions to expensive journals. The digital age has the possibility of helping us change and improve our scientific practices in ways that I think people are excited about and starting to embrace.
What, specifically, has changed in that transition to the digital age?
When I started graduate school the kinds of data that researchers worked with were generally data created for researchers by researchers. That had some good things about it, because the data was usually related to topics of scientific interest. It was usually available to all other researchers, which is important.
Now there’s a lot of data being generated as a byproduct of everyday actions. This is “digital trace data” or “digital exhaust.” It’s often at a much bigger scale, which creates a lot of interesting research opportunities, but it also comes with some problems. The data often has the goals of the company or government baked into it. This is called “algorithmic confounding.”
What does that mean?
Learning about human behavior from Facebook data is like learning about human behavior by watching people in a casino. You can definitely learn from watching people in a casino, but a casino is a highly engineered environment designed to encourage some behavior and discourage other behavior. Facebook is similar. When people look at Facebook they think, “Oh, this is people’s natural behavior.” And that’s not true at all. The goals of the system designer are not the goals of the researcher in many cases.
And then there’s access. Facebook and Twitter have enormous amounts of data that are not available to every researcher, and there are good reasons for that—complicated ethical, legal, and business reasons. But if there’s a situation where some researchers have access and others don’t, this can create concerns about reproducibility, the role some companies play in allowing certain projects to go forward and not others, and the role they could play in encouraging certain types of results.
The challenge for all of us is to figure out how this data that could be beneficial to scientists and society in general can be made available in ways that would be safe for the people providing the data and safe for the companies.
But this science goes way beyond just social media.
My kids, who are 8 and 4, are growing up talking to Alexa. They’re going to interact with the world in a different way than I did. Those kinds of psychological impacts will take a while for us to be able to observe and understand, but we’re already starting to see major changes in industry and social relations.
There’s a lot of opportunity in general in any kind of transaction record. With Facebook and Twitter, a lot of this is data people are intentionally creating, but there’s big potential in data created more implicitly—for example, the location data created by my cell phone. Bitcoin is another good example of that. In the process of economic transactions, this ledger is created. I have a colleague making tools for researchers to understand what’s happening in the Bitcoin ledger.
It’s getting easier for lots of people to interact with each other, either through a company’s platform or through distributed peer-to-peer systems. And to the extent all of these interactions are digitally mediated, they create records. Those records are all really exciting to researchers.