The Promise—and Pitfalls—of Researching Extremism Online


Jul 17, 2023

Facebook, TikTok, Twitter, YouTube, and Instagram apps are seen on a smartphone, July 13, 2021, photo by Dado Ruvic/Reuters

Social scientists are fascinated with social media. These new technologies have altered how humans interact with one another, providing new windows through which to observe how humans communicate, learn, and build relationships. The result has been a “data gold rush,” with scholars mining for insights into human behavior.

For the subset of us studying extremism, social media has a particular allure. Previously, doing such research demanded physically infiltrating groups to observe how they operated. Now, from laptops or cell phones, researchers can monitor extremists as they network, recruit, radicalize, communicate, and mobilize.

But while online spaces are key enablers for extremist movements, social media research hasn't provided many answers to fundamental questions. How big a problem is extremism, in the United States or around the world? Is it getting worse? Are social media platforms responsible, or did the internet simply reveal existing trends? Why do some people become violent?

We have few answers because this research is easy to do poorly and hard to do well. The challenges fall into three buckets: users, platforms, and content.

Users Don't Equal People

Like the ever-controversial Twitter owner Elon Musk, extremism researchers want to know how many real people exist behind the vast number of user accounts. Researchers must account for the possibility that many of these accounts are inauthentic. In addition to automated bots, a single user can control many accounts, share an online identity with others, or participate in a “cyborg” account operated by a mix of bots and real people.

Anonymity also intertwines with the user issue. Some social media platforms attempt to verify users' identities, but many require only an email address, and some don't require any registration at all. This makes it difficult to determine demographic details—like gender, age, or location—that might produce meaningful conclusions about who is involved in extremist movements. And extremists likely are drawn to platforms that provide anonymity.

Researchers have developed some tools to distinguish inauthentic accounts and infer basic demographic information. Still, these methods are imperfect, and those involved in extremist networks have many incentives to hide or obscure their identities. At the same time, not everyone chooses to be on social media, so the available data may be skewed by an overrepresentation of certain demographic groups.

These constraints complicate answering some of the most basic questions: How big a problem are we facing? When responding to extremism, who do we need to help?

Platforms Are Dynamic and Opaque

Extremism researchers suffer from the problems of too much and too little data simultaneously. Social media platforms have mushroomed, and researchers can now access previously unfathomable quantities of data: Every day, 500 million tweets are sent. Every minute, 500 hours of video are uploaded to YouTube. Brandwatch, a popular social media analytics company, sells access to 1.4 trillion historical posts captured across multiple platforms. This much information makes it difficult to distinguish the signal from the noise.

Staying abreast of trends in online communities isn't easy either. Researchers need to be able to predict where “deplatformed” users—those banned for violating a platform's user agreement—will migrate, and under what names. Research on a specific set of platforms can quickly become out of date as accounts open, close, or reopen under a new name or shift to a new platform. This fluidity makes it extremely difficult to study behavior over time and identify trends. The internet itself is also evolving; today's internet is literally different from the internet of yesterday, and radically different from the internet a decade ago.

Further, it is nearly impossible to account for the influence of platforms' proprietary algorithms, which determine what content users see, because social media companies are not transparent about how these work. Algorithms can distort users' online behavior and contribute to patterns they may not have otherwise sought out.

Once the right platforms have been identified, researchers still need to figure out how to access the requisite data. Counterintuitively, this can be most difficult for mainstream platforms. Large social media companies such as Facebook and TikTok are not obligated to share data with researchers. Several have been known to cut off or reduce independent researchers' access to their platforms. When they do share data, it often represents only a fraction of the content hosted on their platform.

As a result, research on extremist use of social media has been skewed toward platforms like Twitter, which historically granted researchers the greatest access. Admittedly, such popular platforms host plenty of extremist content despite their content moderation policies. In our recent research for the State Department, we identified a community of 300,000 individual Twitter users who employed language consistent with a set of personality traits, known as the Dark Triad, that correlates with violent behaviors. Those 300,000 users outnumbered the combined user bases of the extremist-tilting platforms Gab and Stormfront. But if we only study Twitter, we cannot answer basic questions like: Is someone more likely to encounter extremist material on Facebook than on YouTube or Twitter? Or are there more extremists on Facebook than Twitter?

Inconsistent Ideas of What's Extremist

This brings us to an even thornier question: What qualifies as extremist content? The definition is hotly debated by policymakers and researchers alike. Without a common standard, researchers often apply their own definitions, making it difficult to knit together existing analysis.

Capturing extremist content, once defined, introduces more difficulties. Researchers must contend with the constant flood of new material and with the likelihood that existing content may disappear at any moment. Although it is difficult to permanently delete content from the internet, the visible surface is ephemeral. This is particularly true of extremist material, which may be removed by platform moderators and, on most platforms, by users themselves.

Research in this area also raises important ethical concerns. Scraping content from some platforms violates their platform use policies, for instance. Researchers must also decide whether to use data that has been illicitly acquired and published by hackers, particularly when it may compromise users' personal information.

The unique nature of speech on social media platforms also can be a problem. Slang runs rampant online, and the internet has its own set of acronyms (IYKYK). Language-processing software can sometimes account for platform-specific language, but such programs are often baselined on a large platform like Twitter, so they can be ineffective on niche platforms that have their own jargon. Further, extremist discourse often uses coded language. Researchers cannot go through millions of social media posts manually, and yet our pretrained language-processing tools aren't tailored for this type of content. In fact, lexical AI models like ChatGPT have sometimes been purposely trained to exclude extremist, violent, and other offensive content. This raises its own ethical questions, as these lexical models are often developed by humans who can be traumatized by exposure to extreme content in the process.

In the end, researchers are left trying to parse messy content, messy user data, and messy platform inputs. While society hungers for effective policy solutions to extremism online, or even a clear understanding of it, this messy combination produces suggestive, rather than strong, conclusions.

Heather Williams, a senior policy researcher at the nonprofit, nonpartisan RAND Corporation, is associate director of the International Security and Defense Policy Program. Alexandra T. Evans is a policy researcher at RAND. Luke Matthews is a behavioral and social scientist at RAND.