A Hadoop computer cluster of Cubieboards running the Lubuntu operating system

Methods Center

Center for Scalable Computing and Analysis

An Apache Hadoop cluster built out of low-cost consumer electronics

Context

The RAND Center for Scalable Computing and Analysis (SCAN) supports data science within RAND by fostering a community of experts on best practices for the use of large-scale data. This internal community also leverages external partnerships with academia and industry. The center has three main aims. First, SCAN acquires, maintains, and develops a suite of methods and software tools for large-scale analysis. Second, SCAN identifies policy questions that could benefit from large-scale data and tools, and catalyzes interactions among subject-matter, methods, and data experts. Third, and perhaps most important, members of the SCAN Methods Center consider and discuss the implications of large-scale data analysis for ethics, equity, privacy, and other social dimensions.

"... A wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it"

Herbert Simon

Methodologies & Tools

Big Data is an approach to analysis in which decisions are supported by pervasively collected data using scalable algorithms and infrastructure. One way to organize thinking around Big Data is to consider three important characteristics: Infrastructure, Algorithms and Implications.

Infrastructure refers to the modern technology that supports Big Data, both in generating data (e.g., the sensor and social media systems that produce enormous amounts of digital content) and in processing it efficiently with modern computational approaches. The underlying computational infrastructure involves a wide variety of solutions designed for various use cases, but a common theme of most Big Data infrastructure is distributed computation executing on a cluster of loosely connected computers (e.g., open source, distributed processing frameworks such as Hadoop). The key advantage of cluster-based computing is that such systems are "scalable": as the amount of data to be processed increases, new computing power can be added to the cluster incrementally, without the need for accurate advance predictions of resource needs. While the SCAN Methods Center largely focuses on tools and insights developed for Big Data, SCAN also explores how these tools can be applied to a wide variety of problems, some of which do not involve very large-scale data sets. Our approach ensures that RAND researchers will be able to incorporate large-scale data sets in their analysis as data used for policy research increases in size and complexity.
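The scalability property described above can be illustrated with a minimal, single-machine sketch of the map-and-reduce pattern that frameworks such as Hadoop popularized. This is not Hadoop's API and not SCAN's tooling: the function names, the round-robin sharding, and the sample records are all assumptions made for illustration. The point it shows is that the same job produces the same answer whether the data is split across two hypothetical workers or eight, which is what lets a cluster grow incrementally.

```python
from collections import Counter
from functools import reduce

def split_into_shards(records, num_workers):
    """Deal records round-robin into one shard per (hypothetical) worker node."""
    return [records[i::num_workers] for i in range(num_workers)]

def map_shard(shard):
    """Map step: a worker counts words using only the records in its own shard."""
    counts = Counter()
    for line in shard:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-shard partial counts into a single total."""
    return reduce(lambda a, b: a + b, partials, Counter())

def word_count(records, num_workers):
    """Run the whole job; in a real cluster each map_shard call would run on a separate node."""
    shards = split_into_shards(records, num_workers)
    return reduce_counts(map_shard(shard) for shard in shards)

# Illustrative input only; not data from any RAND project.
records = [
    "big data needs big clusters",
    "clusters scale out",
    "data data data",
]

small_cluster = word_count(records, num_workers=2)
large_cluster = word_count(records, num_workers=8)
```

Here `small_cluster` and `large_cluster` are equal, because each map step depends only on its own shard and the reduce step is a simple merge. In this toy version the workers run one after another; a real framework also handles scheduling, data locality, and node failures.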

Algorithms are recipes that encode a set of rules based on data inputs. In the case of Big Data, several qualities affect the way algorithms are employed. First, Big Data algorithms take advantage of and extend the distributed processing approaches of the underlying infrastructure to accommodate the characteristics of the data being processed. For example, some Big Data algorithms are designed for filtering rapid streams of data (as might be produced by Twitter). A second quality of Big Data algorithms is the ability to incorporate a much richer set of predictor variables by leveraging previously inaccessible data (e.g., data made possible by inconspicuous, wearable sensors that can inform healthcare decision processes). Extracting useful information from such sensor data often requires a data fusion process to effectively integrate data from a variety of sources. A third quality is the tendency to use "machine learning" algorithms that emphasize modeling prediction (e.g., predicting an output based on an input) rather than modeling how the underlying data are generated; such algorithms are able to handle large-scale complexity and, if used judiciously, are more broadly applicable than explicitly causal models.
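As a rough illustration of the first quality, filtering a rapid stream, the sketch below processes messages one at a time with a Python generator, so memory use does not grow with the length of the stream. The function name, the keyword-matching rule, and the sample messages are hypothetical; production stream processors add windowing, parallelism, and fault tolerance that are not shown here.

```python
from typing import Iterable, Iterator, Set

def filter_stream(messages: Iterable[str], keywords: Set[str]) -> Iterator[str]:
    """Yield only the messages that contain at least one keyword.

    Each message is examined once and then discarded, so the filter can
    run over an unbounded stream without storing it.
    """
    for message in messages:
        tokens = set(message.lower().split())
        if tokens & keywords:  # non-empty intersection means a keyword matched
            yield message

# Hypothetical sample stream; a real source might be a socket or message queue.
incoming = iter([
    "Traffic jam on the interstate",
    "Lunch time",
    "Heavy TRAFFIC downtown",
])

matches = list(filter_stream(incoming, {"traffic"}))
```

Matching is case-insensitive and token-based, so "TRAFFIC" matches but a word with attached punctuation such as "traffic," would not; a fuller version would normalize punctuation first.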

Implications considers the policy questions raised by society's use of Big Data in decisionmaking processes. One of the most widely discussed implications of Big Data is its impact on privacy. Whereas traditional analysis methods typically involve data collected from a well-controlled, small sample of users who explicitly give consent, Big Data analysis often involves data that are collected from a large set of users as a byproduct of technology use and then repurposed for analysis without user consent. Even when consent is granted, data usage that is deemed appropriate in one scenario is often considered inappropriate when the data are repurposed. The result is a trade-off between privacy concerns and the potential for public good. Beyond privacy, other issues arise, such as the risks of depending on automated algorithms and the inequities that can unknowingly be encoded into those algorithms.

Real-World Applications

MareNostrum III supercomputing cluster

Photo by Josep Tomàs / CC BY-NC-SA 2.0

At RAND, SCAN sponsors the Computational Efficiency Project, an effort to improve the efficiency of computational analysis software created by RAND so that those programs can be scaled up to process larger quantities of data more quickly. To date, we have analyzed the RAND Health COMPARE program and shown how COMPARE can be modified with minimal effort to run on a cloud-computing infrastructure. Our analysis and recommended modifications resulted in an order-of-magnitude speedup.

NSF Big Data Regional Innovation Hubs

Along with the Pardee RAND Graduate School, the SCAN Methods Center is participating in the National Science Foundation's (NSF) Big Data Regional Innovation Hubs, which NSF established to foster an ecosystem of collaborations among academia, industry, and government.

Washington D.C. Apache Spark Meetup

To engage with the external community on Big Data activities, the SCAN Methods Center hosted the Washington D.C. Area Apache Spark Meetup in May 2015. The meetup is an interactive gathering of Washington D.C., Virginia, and Maryland users, enthusiasts, and explorers of Apache Spark. Spark is an open source data processing framework built around speed, ease of use, and sophisticated analytics that extends and accelerates Hadoop.

Federal Highway Administration Big Data Webinar

RAND researchers presented a webinar series entitled Scalable, Data Driven Analysis for the Federal Highway Administration (FHWA) of the Department of Transportation. The series covered the historical evolution and definitions of Big Data and gave an overview of key computing models, including batch, iterative, and stream processing. It presented alternative data storage models, including NoSQL, graph databases, and distributed file systems. It also discussed the implications of Big Data, including privacy and ethical issues, as well as applications of the Internet of Things within smart cities and the trade-offs that should be considered in technology investment decisions.

RAND hosted the Washington DC Spark Meetup, April 2014

Photo by Donna F. / Meetup

Expertise

The RAND Center for Scalable Computing and Analysis Co-Directors

RAND's History of Achievement in Computing

RAND staff designed and built one of the earliest computers, developed an early on-line interactive terminal-based computer system, and invented the telecommunications technique that has become the basis for modern computer networks. Notable RAND alumni associated with computing technology include John von Neumann (a pioneer of the modern digital computer), Paul Baran (who developed packet-switched networking), and Allen Newell and Herbert Simon (who were joint recipients of the ACM Turing Award for basic contributions to artificial intelligence). Today, our analytical work includes Big Data tools and methodologies for processing larger and more varied data sets.

Learn More

RAND Contributions to the Development of Computing

Johnniac computer, Computer History Museum, CA

Photo by Andrew Lih / CC BY-SA 2.0

Work with Us

Interested in how SCAN's methods or software tools can be applied to the problems that interest you? Fill out the form below and we'll get back to you soon.

You can also contact: SCAN@rand.org
