Divide and Recombine for the Analysis of Very Large Datasets

May 11, 2010

11:00am - 11:00am

E1 106

Speaker

William S. Cleveland
Purdue University
http://www.stat.purdue.edu/~wsc/

Description

Divide and recombine (D&R) is a framework for the analysis of very large datasets, ubiquitous today in science, engineering, business, and government. The data are divided into subsets, an analysis method is applied to each subset or to each subset in a sample, and the subset outputs of the method are recombined.
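
The three steps above can be illustrated with a minimal sketch. It is not taken from the talk: the function names (`divide`, `analyze`, `recombine`), the random division, and the choice of a mean as the analysis method are all assumptions made for illustration, and a real D&R analysis would use a far richer method and a distributed backend.

```python
import random
import statistics


def divide(data, n_subsets, seed=0):
    """Randomly partition data into n_subsets roughly equal subsets.

    Assumes n_subsets <= len(data) so that no subset is empty.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def analyze(subset):
    """Hypothetical analysis method for one subset: its mean and its size."""
    return statistics.fmean(subset), len(subset)


def recombine(outputs):
    """Recombine subset outputs as a size-weighted average of subset means."""
    total = sum(n for _, n in outputs)
    return sum(mean * n for mean, n in outputs) / total


def divide_and_recombine(data, n_subsets):
    """Divide the data, analyze each subset, and recombine the outputs."""
    subsets = divide(data, n_subsets)
    outputs = [analyze(s) for s in subsets]
    return recombine(outputs)
```

For a mean, the size-weighted recombination reproduces the all-data result up to floating-point rounding. For most analysis methods the recombined output is only an approximation of the all-data output, which is why the choice of division and recombination method matters.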

The goal of data analysis, whether the dataset is very large or very small, should be comprehensive analysis that does not miss important information in the data. The thousands of analysis methods of statistics and machine learning can be divided into two groups. Mathematical methods, which result in numerical output, enable automated learning by the computer. Visualization methods, which result in visual output, enable human guidance of the process of automated learning. Both mathematical methods and visualization methods are critical to comprehensive analysis.

The computing of D&R is embarrassingly parallel. The recent development of very effective distributed software environments that exploit this property has made the computation feasible. D&R thus provides a mechanism for comprehensive analysis of very large datasets because it enables both mathematical and visualization methods. In a D&R analysis, mathematical methods are typically applied to all subsets, and visualization methods are typically applied to a representative sample of subsets, guided by variables from the mathematical methods.
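
As a rough sketch of this pattern, the code below applies a mathematical method to every subset in parallel and then uses the resulting variable to pick a small sample of subsets to visualize. Everything here is an assumption for illustration: the names `subset_summary`, `summarize_all`, and `representative_sample` are hypothetical, the subset mean stands in for a real mathematical method, a local thread pool stands in for a distributed environment, and spreading the sample evenly across the sorted summaries is only one possible way to make it representative.

```python
from concurrent.futures import ThreadPoolExecutor
import statistics


def subset_summary(subset):
    """Hypothetical mathematical method: one numeric variable per subset."""
    return statistics.fmean(subset)


def summarize_all(subsets, max_workers=4):
    """Apply the method to every subset independently.

    No subset's computation depends on another's, so the work maps
    directly onto a pool of workers. Results keep the input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(subset_summary, subsets))


def representative_sample(summaries, k):
    """Return indices of k subsets spread evenly across the sorted summaries."""
    order = sorted(range(len(summaries)), key=lambda i: summaries[i])
    if k >= len(order):
        return order
    if k == 1:
        return [order[len(order) // 2]]
    step = (len(order) - 1) / (k - 1)
    return [order[round(j * step)] for j in range(k)]
```

The indices returned by `representative_sample` identify the subsets that would be passed on to a visualization method, so that a person inspects a few subsets covering the range of the computed variable rather than all of them.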

To achieve its potential, D&R requires much further research in all areas that are involved in the analysis of data: computational environments, mathematical methods, visualization methods, and theory. The goal of the research is to discover methods of division and recombination that provide optimal results from the analysis methods, given that the data must be divided.
