RAND Statistics Seminar Series

Visualization Databases for Lossless Analysis of Complex Data Sets

Presented by William Cleveland, Purdue University
Thursday, February 15, 2007
RAND Corporation, Santa Monica, CA


Large, complex data sets are ubiquitous, the standard now rather than the exception. They present challenging problems of analysis because of their size and the complexity of their data structures and patterns.

One approach is to compute summary statistics at the outset to reduce the complexity, but this expedient risks losing important information in the data. The goal should be lossless analysis: analyze the data at a level of detail and comprehensiveness that does not sacrifice information.

Achieving lossless analysis of complex data today is immensely challenging. New fundamental approaches and methods are needed for each of the different areas that come into play in the analysis of the data — databases, data processing, data structures, statistical models and methods, machine learning algorithms, data visualization, computational algorithms, software environments, and hardware environments. In fact, it has never been harder to achieve lossless analysis because complexity has increased faster than our innovations in these areas.

Nothing serves lossless analysis better than data visualization, the only practical way to absorb large amounts of information in detail.

But for today's complex data sets we must visualize far larger amounts of data than in the past. We must be ready to accept large displays, each covering tens or even hundreds of screensful (pages). For a single data set it is reasonable to have hundreds of such displays. These displays, produced from the data, become a new database that is itself queried and studied.

For a display of 500 pages, we might query and study all or just a few of the pages depending on the task.

Producing, querying, and studying a visualization database requires new ideas. There are different modes of viewing the many pages, and the many panels per page, of a large display, from slow focused study to very rapid scans. We need creative interfaces to facilitate the different modes. We cannot fuss with very large displays, interacting with the micro-elements to get them right, because there is too much; instead there should be smart automation algorithms that get the large display right the first time. We must consider the physical screen space, its size and resolution, to make it work most effectively for the visual study. We need methods of display that result in pre-attentive visual formation of gestalts that show instantaneously the relevant patterns in the data. This necessitates, strangely, more displays, ranging from broad-brush looks to derivative displays whose redesigns show specific aspects of the broad-brush view more effectively. It also requires the study of visual perception.

Attending a Seminar

RAND Visitors are welcome to attend the statistics seminars, but must RSVP at least one day prior to the seminar.

For directions to RAND see: http://www.rand.org/about/locations/santa-monica.html
Visitors must enter through the north parking garage in our new office building (1776 Main Street), which is accessible from Main Street. Inform the attendant that you are here for the Statistics Seminar Series and you will be directed to the appropriate parking area. If there is no attendant present, use the intercom and say that you are here for the Statistics Seminar Series. After parking, follow the instructions to the appropriate conference area.

Reminder: the old RAND surface parking lots have been permanently closed.