Can academics and journalists collaborate on big data projects? The SilverLining Project wants to find out

“We noticed in the media there’s more and more spectacular break-ins and people stealing data and putting it on the dark web,” says Gary King, an expert in big data analysis. “We were wondering whether we could create some good out of all this bad.”

The internet was made for transactions. Whether it’s photos of grandkids or a pallet of toilet paper, the internet connects people who have something with people who want it.

That includes illegal goods and services. In hidden corners of the internet called the dark web, users can buy and sell anonymously over encrypted networks.

Hackers regularly break into private databases owned by companies and governments, and hacked data often circulates for sale on the dark web. Gary King, director of Harvard University’s Institute for Quantitative Social Science, launched The SilverLining Project in mid-2020 to squeeze a little light out of the darkness.

“We noticed in the media there’s more and more spectacular break-ins and people stealing data and putting it on the dark web,” says King, an expert in big data analysis. “We were wondering whether we could create some good out of all this bad.”

The SilverLining Project seeks to advise academics who want to use data on the dark web for scholarly analysis without running afoul of the law and university rules.

King is also eager to collaborate with journalists who receive large data leaks and to provide them with technical expertise. He worked closely with the Freedom of the Press Foundation to create the first SecureDrop site for a university. Major news outlets regularly use SecureDrop to receive information leaks. Data shared via SecureDrop is encrypted and subpoena-proof.

“We will never scoop journalists or get in their way,” King says. “We also can provide value to them. We have skills in automated video and text analysis and the technology to deal with big, messy databases of emails and all different types of data formats. We can absolutely give those tools to a journalist who might be interested.”

I recently spoke with King about why journalists might consider working with him — and how he has worked with news outlets in the past to understand journalism’s widespread effects on the national social media dialogue. Our conversation has been edited for clarity.

Clark Merrefield: What’s your SilverLining Project elevator pitch, particularly for journalists? Why should they share data with you?

Gary King: The data, in itself, is incredibly informative and that’s why people are upset when it’s stolen. So we wanted to know: Is there some legitimate way to create public good from these data while also not encouraging people to hack in any way?

These data we are talking about, the leaks and various things, are often stolen from companies or governments. So what these criminals do sometimes is they steal data and put it on the dark web, which is like the regular web except it’s impossible to know where anything came from.

One of the things we do to acquire data is to look around the dark web and see what’s there. And then some subset of that [data], we’re allowed to look at to see whether we can create value. A second approach is that major journalism outlets — The New York Times, Bloomberg, The Wall Street Journal — have SecureDrop sites, which is about the most secure way of transmitting data from one person to another without either knowing who the other person is. So, if you want to be a whistleblower and leak information to the Times in the most secure way, you send it to their SecureDrop site.

We are just releasing this project now and we are starting to create good relationships with journalists. There are gains from trade. We can amplify their work and create long-term value over and above what they are going to be writing about right away.

CM: Talk about encryption — how can potential newsroom collaborators be sure their data is secure and that you won’t share their data with other news outlets?

GK: We researched, and eventually created, the first SecureDrop site for a university. We worked with the Freedom of the Press Foundation to modify the SecureDrop site so that it would work within a university setting. That means if someone wants to send data they can send it to our SecureDrop.

So the third way we acquire data is we work with people we know. And we can sign [non-disclosure agreements] and agree not to scoop [journalists]. The interesting thing is that we can’t step on each other’s toes. If a journalist got a big leak of data and sent us a copy, it would be illegal for us to take that data, in most cases, and give it to their competitors or anyone else.

And our goals are different. [Journalists are] interested in breaking the news. We’re interested in creating long-term value. So if a journalist receives a dataset, they will publish something from that immediately, or work on it for months and release it as soon as possible. For us, we are probably going to work on it for years and nail some kind of particular generalization that might be hard for journalists because they don’t have those types of skills and those types of goals.

We will never scoop journalists or get in their way. We also can provide value to them. We have skills in automated video and text analysis and the technology to deal with big, messy databases of emails and all different types of data formats. We can absolutely give those tools to a journalist who might be interested.

Imagine a journalist who receives a leak of some data. They could share it with us and we could show them better ways of analyzing it and better ways of finding what they want to find in the data. We would not publish anything until they’re done. But they would get our help in analyzing, and how to systematize the data, how to make it machine actionable.

More and more journalists are becoming great data scientists and producing great data visualizations. At worst, they would give us access to the data and we would produce some value in the longer term. If you gave me the data and at some point we have some great finding, we will tell you about that before anyone else. You would get the exclusive on that. To be fair, it might be a year or two — that’s the way we work.

If we work together, we won’t hurt you, and can’t hurt you, and we’re likely to be able to help you. I feel like it’s our responsibility to work with journalists if they want to work with us.

CM: Can you talk about an example of a journalist or newsroom you’ve worked with and how that went?

GK: We don’t have anything that’s public, so that’s the end of my sentence.

I will say, a more general point I should have made before, it’s called The SilverLining Project and the subtitle is “Finding social good in the clouds on the dark web.” That’s the idea. This bad stuff is going to keep happening unless we discover some way to stop it. But it seems to be happening more and more. There’s a question as to whether we can do better. So for an academic to say, “Hey, let’s collaborate,” that’s what we do. Science is a team sport. Journalism is a team sport within an organization but not across organizations, except in some instances. The Panama Papers was a big collaboration, so journalists are getting more collaborative across organizations. Collaborating with us is even easier because we’re not competitors at all. What we’re going to do is amplify their work and produce more value than what they were doing previously.

We also have tremendous experience in dealing with the most sensitive data there is. Data that if I made it available, people would die within minutes or hours or days — like [information from] students who would interview known terrorists. We’re not making that available. We are not allowed to hurt our research subjects no matter who they are. We have procedures to protect data as good as anybody. So you can trust us with the data.

CM: Bylines and proper credit are big deals for journalists. How does that work for journalists who want to collaborate? Do you expect a co-byline?

GK: We would be open to that, but it’s not necessary. If you call me up and ask for [analysis] methods advice, that’s for free. Unless it’s going to take hours and hours, that’s free. And I’ll learn from that, too. Students and colleagues all over the world do that with me regularly.

Sometimes, I learn the structure of the data and I realize it is a particular data structure where there’s no methodological solution. And, for me, that’s really great because I’ll develop a methodological solution for it. And I’ll realize that the unusual structure of the data you have, actually that exists elsewhere and we never really realized that was a thing — but now we have a solution for it.

Understanding the problems you run into, that’s really useful information for me. I’ll write a technical methods paper about it and you’ll use the method we developed to do what you do. And if you want to thank us, that’s fine. And if you don’t, that’s OK.

In academia the coin of our realm is not writing newspaper articles. Having an article in Science or the American Political Science Review would not be the coolest thing to [a journalist]. You get your byline.

CM: A few years ago you came out with a paper, “How the News Media Activate Public Expression and Influence National Agendas,” in Science. You recruited 48 small news outlets to write and publish articles on topics you approved on randomly assigned dates. I’m curious how you persuaded these news outlets to write about topics you approved?

GK: It seems utterly impossible any journalist would do it. What we as academics wanted to know, and what these journalists wanted to know, was, what is the effect of the media? Turns out that’s a very difficult thing to study because, for the most part, the media is not trying to influence people — they are trying to stay in business. They follow people. If people get interested in that thing over there, the media run in that direction. If they don’t follow, they go out of business. But, sometimes, they might also have an impact on people. So when things are going in both directions, it’s difficult to parse out which is doing which. The only way to break into that is to intervene — is to control what is published.

If there were no requirements at all, we would randomly assign to the media outlets what they would publish and when. Just like the vaccine trials, they are randomly assigned whether you are getting the placebo or vaccine. It’s very important to do that. The gold standard is random assignment. That’s simple and straightforward. We need total control over what’s published. And if I explain that to a journalist, they’ll say, “OK, I will never work with you.”

So we had this impossible-to-satisfy constraint. The academics and journalists both needed absolute control over what’s published and when. But just because it’s impossible doesn’t mean we shouldn’t do it.

Here’s the setup: we got the 48 media outlets and they understood what I just said, the constraints, and they also had a goal, which was the same as ours — they wanted to know what impact they were having.

So what we did is we first, as the investigators, chose a subject area — a big subject area, like the economy. And then we would ask for volunteers, among the 48 outlets, for three or four that might want to write on that area. If they had no expertise on the area, we would say that they’re not in our experiment. So we had total control. But, of course, they could still do [stories on the topic] if they wanted — so they had total control.

Then we said, “Pick an angle.” So suppose they picked whether Uber drivers in Philadelphia were worried about driverless cars. If the specific angle was not within the big area, we would say, “No, that’s out of the experiment.” But, of course, if they still wanted to do [the story], they could still do it. But it wouldn’t be in the experiment.

Next we would say, “The story can be any type of story.” It can be deep, investigative journalism. It can be opinion. Except it has to be something that is not breaking news that you have to publish today. It has to be something that’s OK to publish next week or the week after. If they chose something that’s breaking news, we’d say, “OK, you’re out of the experiment.” But, they could still do it.

They would then agree on one thing. We, as the investigators, would pick a two-week period for them to publish. A lot of news is quite predictable. We know there is going to be a jobs report on a particular day. We would predict what news would be released during each two-week period into the future and find a pair of weeks that, to the best of our predictive abilities, didn’t have a big event related to jobs. [The project] was done during the Obama administration. So, if Obama was going to give a big speech on jobs, we wouldn’t use those two weeks. If a jobs report was going to come out, we wouldn’t use those two weeks. We would then flip a coin and the outlets that got heads would run it in the first week. If tails, they would run it in the second week, not the first. That’s the one thing they agreed to do. We get total control over what’s published and when. Because of flipping the coin, we can figure out what the effect is. If there’s breaking jobs news, they can cover it. They can do it, but they just wouldn’t be in our experiment. Those were the procedures. Both sides, at the end of the day, could say they had total control.

What we’re trying to convey in The SilverLining Project, like what we did with that project, is we’re not scary and we’re not trying to violate your journalistic integrity. That would be bad for you. That would be bad for the science as well.
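The coin-flip procedure King describes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's actual code: the function names, the week numbers, and the post counts are all hypothetical, and the sketch assumes a single coin flip per pair of weeks for the whole group of outlets, which the interview does not spell out.

```python
import random
from statistics import mean


def eligible_pairs(week_pairs, predicted_event_weeks):
    """Keep only pairs of weeks with no predicted big event on the topic."""
    return [
        (first, second)
        for first, second in week_pairs
        if first not in predicted_event_weeks
        and second not in predicted_event_weeks
    ]


def flip_for_pair(pair, rng):
    """One coin flip per pair: heads publishes in the first week, tails in the second.

    The week that is not chosen serves as the comparison week.
    """
    first, second = pair
    heads = rng.random() < 0.5
    return {
        "publish": first if heads else second,
        "comparison": second if heads else first,
    }


def average_difference(post_counts, assignments):
    """Mean of (posts in the publish week minus posts in the comparison week)."""
    return mean(
        post_counts[a["publish"]] - post_counts[a["comparison"]]
        for a in assignments
    )


# Hypothetical inputs: four candidate pairs of weeks, with a big event
# (say, a major speech on jobs) predicted for week 4.
rng = random.Random(0)  # seeded only so the sketch is repeatable
pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
predicted_event_weeks = {4}

usable = eligible_pairs(pairs, predicted_event_weeks)
assignments = [flip_for_pair(pair, rng) for pair in usable]
```

Because the coin, and not the outlets or the researchers, decides which week gets the stories, any systematic gap in posts between publish weeks and comparison weeks can be attributed to the stories themselves. Once real post counts per week are collected, `average_difference` turns the assignments into a simple estimate of that gap.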

CM: So the topline finding from the Science paper was that Twitter posts on a given topic increased about 20% after the news outlets published their stories, resulting in about 13,000 more posts over a week on the broad policy area. Were those social media posts meant to be a proxy for audience engagement?

GK: We thought of the results as the ability of three or four media outlets to set the national agenda. There was a big area like jobs or the economy, and then there was the causal intervention, the vaccine. And the vaccine was writing about Uber drivers in Philadelphia. We did it in lots of different subject areas. Overall, the effect was the discussion, the agenda. The focus of people’s attention switched from one subject to another. And then more over the rest of the week. If they published on Tuesday, it’s 20% [more Twitter posts] the next day and then it’s a little less and less throughout the week. If you pile those up, it’s like a 62% increase in the national discussion. Who would have thought these little media outlets could have such a big effect on the national discussion?

CM: How did newsrooms use the findings? Did they affect their ongoing discourse with their audiences?

GK: I think they liked the results. I know they use them to justify themselves to their supporters and funders and boards. Now they have real scientific evidence. We may continue to learn about which types of interventions by the media might have bigger effects. Could you get that 62% up even higher if different media outlets published more together, or more on separate topics rather than the same, or if they published above the fold or below the fold, or using particular language? There are other things to study.

CM: Anything else you’d like journalists to know about The SilverLining Project?

GK: The main thing is we would love to talk to journalists and find ways to make your work have more of an impact. Maybe we can help you and you can help us and we can do something together.

Contact The SilverLining Project on Signal at 339-337-2605, by PGP-encrypted email at iqss_silverlining@protonmail.com using their public key, or by SecureDrop.