Big Data, Big Questions


Oct 16, 2017

Speech bubbles imposed over a world map

Image by Olena_T/Getty Images

Researchers at RAND made a surprising discovery as they sifted through millions of Arabic tweets. For all its vaunted social-media savvy, the Islamic State was losing the war of words on Twitter. Its opponents outnumbered its supporters six to one. They were calling the group and its fighters the “dogs of fire.”

The report last year received widespread attention. What got less notice was how the researchers did it. A team at RAND had built a computer program that could scan millions of lines of text and identify what people were talking about, how they fit into communities, and how they saw the world.

The program, known as RAND-Lex, has since shed light on how al-Qa'ida affiliates communicate, how Russian internet trolls operate, and how the American public thinks about health. It has helped carry an old lesson of linguistics into the digital age: How people speak speaks volumes about them—even when it's 140 characters at a time.

A Holistic Approach to Text Analytics

Bill Marcellino once spent six months with a company of Marines, slogging through obstacle courses and gutting out 15-mile hikes, just to understand how they talk. He came to RAND in 2010 as a social and behavioral scientist and found himself sharing an office with a computer scientist named Zev Winkelman. Winkelman had left a job in the financial industry after the Sept. 11 terrorist attacks to work on big-data approaches to national security.

Bill Marcellino (left) and Zev Winkelman

Photo by Dori Gordon Walker/RAND Corporation

They soon realized they were working on the same kinds of puzzles, just from different perspectives. Marcellino was using what he knew about the big picture of language to understand what was unique or telling about individual pieces of text. Winkelman was looking at text, too—but he was using computers to identify the distinct pieces first, and then work back to the bigger picture.

“We realized we could bring together social science and computer science to make meaning out of huge data sets of text,” Winkelman says. “We could build something more holistic, something that people could use, a center of gravity for text analytics.”

Distinguishing ISIS Supporters and Opponents on Twitter

Marcellino and Winkelman started coming in early and staying late to turn their ideas into computer code. Their first version, RAND-Lex 1.0, could scroll through millions of lines of text and compare them against a linguistic baseline. It was looking for surprises—words or phrases that appeared more often than expected, statistical outliers. It might flag the words “single-payer,” “preexisting,” and “Obamacare” in a transcript for a health-care debate, for example—not necessarily the most frequent words, but the most distinct.
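The idea of flagging words that appear more often than a baseline predicts can be sketched with a log-likelihood keyness score, a common corpus-linguistics statistic. This is an illustration only: the article does not say which statistic RAND-Lex uses, and the function names `tokenize` and `keyness` and the simple tokenizer are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase words, keeping internal hyphens and apostrophes (assumed tokenizer).
    return re.findall(r"[a-z][a-z'-]*", text.lower())


def keyness(target_tokens, baseline_tokens, top_n=5):
    """Rank words that are overused in the target relative to the baseline.

    Uses the log-likelihood (G2) statistic; higher means more distinctive.
    """
    target = Counter(target_tokens)
    baseline = Counter(baseline_tokens)
    c = sum(target.values())    # total tokens in target
    d = sum(baseline.values())  # total tokens in baseline
    scores = {}
    for word, a in target.items():
        b = baseline.get(word, 0)
        # Keep only words that are relatively more frequent in the target.
        if a / c <= b / d:
            continue
        # Expected counts if the word were equally likely in both corpora.
        e1 = c * (a + b) / (c + d)
        e2 = d * (a + b) / (c + d)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        scores[word] = g2
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

A word like "the" may be the most frequent in a transcript, but it is just as common in the baseline, so it scores low or is dropped; a rarer word that barely appears in the baseline rises to the top.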

That's how researchers at RAND were able to get an unprecedented look at the online messaging battle between ISIS supporters and opponents. They found that supporters almost always referred to the group by its full name, the Islamic State. Opponents preferred to belittle the group by abbreviating its name in Arabic to Daesh.

But when the researchers fed only those Daesh tweets into RAND-Lex, they found that, for all their numbers, opponents often were speaking past each other. Gulf State Shia blamed Saudi Arabia for the rise of ISIS; Saudi Arabia and its Sunni neighbors blamed Shia Iran. And none of them matched up with the Syrian mujahideen, who sometimes applauded ISIS fighters even while denouncing the group's brutality.

The study revealed fierce opposition to ISIS across communities on Arabic Twitter. But it also showed that a one-size-fits-all approach to countering ISIS's online message would fall flat.

Following the Linguistic Fingerprints of ISIS

The RAND-Lex team narrowed its focus to Egypt in another study. The researchers wanted to see if they could measure how well ISIS's message resonated with people far outside its home turf of Iraq and Syria.

To do that, they ran ISIS speeches, proclamations, and articles through RAND-Lex, looking for distinct words—the group's linguistic fingerprints. Then they looked for those same words in more than 6 million Egyptian tweets, to see whether people were starting to talk like ISIS.
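A rough sketch of that second step, assuming the fingerprint is already available as a set of terms: count the share of accounts whose tweets use at least a few of them. The function name `imitation_rate`, the `min_hits` threshold, and the input layout are hypothetical, and real Arabic text would need normalization that this simple tokenizer does not attempt.

```python
import re


def tokenize(text):
    # Set of lowercase word tokens in one tweet (assumed tokenizer).
    return set(re.findall(r"\w+", text.lower()))


def imitation_rate(tweets_by_account, fingerprint, min_hits=2):
    """Share of accounts using at least `min_hits` distinct fingerprint terms.

    tweets_by_account maps an account name to a list of tweet strings.
    """
    fingerprint = {term.lower() for term in fingerprint}
    if not tweets_by_account:
        return 0.0
    flagged = 0
    for tweets in tweets_by_account.values():
        used = set()
        for tweet in tweets:
            # Collect the distinct fingerprint terms this account has used.
            used |= tokenize(tweet) & fingerprint
        if len(used) >= min_hits:
            flagged += 1
    return flagged / len(tweets_by_account)
```

Running the same calculation on tweets grouped by month or by region would give the kind of trend the study describes, under whatever threshold the analyst chooses.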

They found that only around 1 or 2 percent of the population was borrowing words from ISIS. Egyptians were much more likely to describe the world in terms taken from the Muslim Brotherhood. But the number of ISIS-imitating accounts grew over the months the researchers tracked, especially in poorer places like the Sinai, a sign that the group's message was starting to stick with some Egyptians.

The next update to RAND-Lex helped researchers understand why. It was able to not just pull out distinct words and phrases, but also assign values to them—to discern angry words from happy words, for example, or future-facing words from backward-looking words. It could start to get a feel for the text.
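One simple way to assign values to words is a category lexicon: each category lists words that signal it, and a text is scored by how often those words occur. The sketch below is a toy illustration, not RAND-Lex's method; the `LEXICON` categories, their word lists, and the function name `category_rates` are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; a real one would hold many words per category.
LEXICON = {
    "anger": {"furious", "hate", "outrage"},
    "future": {"will", "soon", "tomorrow"},
    "past": {"was", "were", "yesterday"},
    "we": {"we", "us", "our"},
}


def category_rates(text, lexicon=LEXICON):
    """Return, per category, the fraction of tokens that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {category: counts[category] / total for category in lexicon}
```

Comparing these rates across collections of text is what lets an analyst say one body of speech leans more toward "we" language or future-oriented language than another.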

It found that ISIS speech was not as hateful and negative as some might expect. Instead, it was often intense, future-oriented, focused on social values and relationships—a rallying cry. It used “we” phrases, but not so much “them” phrases.

The language of al-Qa'ida in the Arabian Peninsula, by comparison, was informational, even technical—less a call to action than a report of how and why something had happened. It often read, the researchers noted, like a how-to manual.

“We can show what is unique about how different people talk about the world, and how they tackle the world,” Marcellino said. “They're inextricably linked. We talk about the world in ways that reflect how we see the world.”

More Needles, More Haystacks

In more recent months, RAND researchers have scanned tens of millions of American tweets into RAND-Lex to understand how people talk about health and wellness. They found that people are more likely to talk about being sick than about staying well—a possible opening for healthy-living campaigns to change the conversation.

Researchers have also used RAND-Lex to examine Russian propaganda on Twitter—and found a running online battle between Russian propagandists and Ukrainian activists. Marcellino ran hundreds of blog items through RAND-Lex to see how Americans were talking about privilege; most addressed “white privilege” or “male privilege,” he found, but almost none mentioned class privilege.

The computer program he and Winkelman built by hand, with help and support from across RAND, has expanded beyond keyword testing and value comparisons. It can search through volumes of text and pull out the major themes; it can learn from small samples of text how to classify much larger collections. It can tease out the overall stance of a text in English or Arabic, with Russian in the works. RAND recently made it available to outside researchers as a subscription service.
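Learning from a small labeled sample in order to sort a much larger collection is the basic pattern of supervised text classification. The article does not describe how RAND-Lex does this, so the sketch below substitutes a plain multinomial Naive Bayes classifier with add-one smoothing; the function names `train` and `classify` and the sample labels are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_texts):
    """Build word counts per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest Naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A handful of hand-labeled examples goes into `train`, and `classify` can then be applied to every item in a larger collection; in practice the labeled sample would need to be far bigger than a toy example to be reliable.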

“We live in a world where the amount of data is increasing all the time,” Marcellino said. “It's not just that the haystacks are getting bigger and bigger. The number of haystacks is increasing exponentially. We need new ways to find the needles.

“We've realized that if you leverage what machines are good at and what humans are good at, you can do really, really important work, at massive scales.”

It is, if anything, a growth industry. In the time it takes you to finish this sentence, 6,000 new messages will have whistled across Twitter alone.

Doug Irving