Addressing the National Security Risks of Bulk Data in the Age of AI 


Aug 23, 2024

The ChatGPT app is seen on a mobile device, May 14, 2024. Photo by Jaap Arriens/NurPhoto via Reuters

By Nate Lavoy

As artificial intelligence (AI) continues to revolutionize various sectors, the potential for both groundbreaking advancements and unprecedented threats grows exponentially. The aggregation of vast datasets and computational power has led to AI models with capabilities that were once in the realm of science fiction.

Current U.S. regulatory measures fall short in addressing the complexities and dangers associated with these powerful systems. There is an urgent need for a more nuanced and empirically grounded approach to data regulation, one that recognizes how seemingly innocuous data points can, when amassed in large quantities, pose substantial risks to national security.

The U.S. government is beginning to recognize the threats of large-scale AI development and has placed export controls on semiconductors to diminish the computational resources of countries of concern. Cross-border flows of training data, however, have largely been overlooked. Executive Order 14117 is an attempt to combat this problem by preventing the transfer of data above certain quantity thresholds (bulk data) to countries of concern, but it is an inadequate solution due to its narrow scope and loose definition of “bulk data.”

To understand the gravity of these limitations, it's crucial to recognize that data fuels artificial intelligence, enabling it to learn, adapt, and ultimately exhibit advanced capabilities. In many cases, significant improvements in AI performance can be accomplished solely by feeding a model more training examples.

The risks of bulk data transfer to countries of concern may not be apparent at the individual datapoint level, but massive datasets, even those composed of non-sensitive information, may reveal patterns and insights that can be exploited to compromise national security when used as inputs to AI. For instance, models trained on large bodies of benign code can inadvertently learn to create malicious scripts.

By identifying common programming structures and techniques from training examples, AI models can generate new code that mimics these patterns, allowing them to adapt to novel use cases. This capability can be repurposed, however, to create harmful software that can exploit vulnerabilities, craft sophisticated malware, or develop evasion techniques to bypass security measures.

Similarly, massive amounts of videos (think the millions of daily posts on social media apps like TikTok) provide the perfect training material for generative AI systems that can create advanced deepfakes. These deepfakes can be used to launch discreet and highly targeted influence campaigns.

Executive Order 14117 recognizes that AI can “analyze and manipulate bulk sensitive personal data to engage in espionage, influence, kinetic, or cyber operations,” but neither code nor videos are included in the categories of data it protects. In fact, the data covered by the EO is limited to personal identifiers, biometrics, genomics, health and financial information, and geolocation data. All of these are important, but they only comprise a small fraction of the threat surface.

The Order loosely defines bulk data as “an amount of sensitive personal data that meets or exceeds a threshold over a set period of time” and delegates the specifics to the Department of Justice (DOJ). The DOJ set thresholds in its Advance Notice of Proposed Rulemaking “based on a preliminary risk assessment” but gave no indication of how they were determined.

In fairness, assigning these thresholds is immensely difficult. America is built on freedom, and it is tough to balance regulations on data with the freedom of speech, expression, and the free flow of information. Careful consideration is important when creating restrictions on what people can create and release in the United States. Moreover, different types of data become dangerous at very different volumes. For example, state-of-the-art large language models require trillions of tokens of text to develop worrisome behavior, while a system trained on biological data may require only a few thousand datapoints to create significant national security risks.

Effective regulation must be grounded in a comprehensive understanding of specific types of data and the amounts necessary to train systems with concerning capabilities. This involves rigorous empirical analysis and red teaming to ensure that policies are not only protective but also pragmatic and necessary.

Before the age of AI, it was simple to determine which datapoints needed to be controlled—if the information could directly threaten national security, it should be kept away from adversaries. Now, that view needs to be expanded to consider the consequences of the aggregation of seemingly insignificant types of data. Balancing data regulation with the preservation of individual freedoms and the free flow of information is complex, but essential. As adversaries refine their strategies, U.S. regulatory frameworks must evolve to maintain technological leadership and robust national security.

More About This Commentary

Nate Lavoy is a summer associate at RAND, a nonprofit, nonpartisan think tank. He is pursuing a master's degree in computer science at New York University.

Commentary gives RAND researchers a platform to convey insights based on their professional expertise and often on their peer-reviewed research and analysis.