Classroom Demonstrations of Big Data

Eric A. Suess

California State University, East Bay

JSM 2014, Boston

August 6, 2014

Abstract

We present examples of accessing and analyzing large data sets for use in a classroom at the first-year graduate or senior undergraduate level. Progressively larger data sets are used: the beginning examples focus on in-memory data sets, and the later examples extend beyond the physical memory available on the computer.

Simulated data sets are suggested and sources for real world data are given.

Data visualization, classification, and prediction are shown.

Software: R, Revolution Analytics R, and MySQL.

Introduction

We became interested in Big Data and the use of parallel computation (distributed computing), and to a lesser extent parallel data storage (distributed data storage), during the excitement of the Netflix competition a few years ago (2009). Learning that the winners of the competition used Amazon EC2 clusters and S3 storage inspired us to investigate and learn about these new computing environments, now referred to as "cloud computing".

Introduction

The Heritage Health Prize competition (2012) was also very exciting. And now, with Kaggle serving as a home on the Internet for these types of competitions, open data analysis competitions are posted regularly. Many of the Kaggle competitions focus on Large Data problems that benefit from parallel computation and can be used as a stepping stone toward working with Big Data, Analytics, and Data Science.

Introduction

The computing techniques and hardware needed to work with Big Data currently seem far removed from introductory coursework in Statistics at all levels: lower division and upper division Statistics courses, and first-year graduate courses in Statistics.

Introduction

However, with some effort to motivate the ideas of Big Data, parallel computing, and distributed storage at earlier stages in Statistics Education, it would then be possible to

  1. cover some practical applications of Large Data
  2. discuss the next steps toward working with Big Data
  3. show parallel computation in action
  4. introduce distributed storage much earlier in the undergraduate and Master's-level curricula.

This would be very exciting to students, and it would connect their studies with the discussions going on in the media and beyond.

Why is Big Data not being discussed earlier in Statistics?

This is an important question!

All of the core curriculum focuses on traditional methods — t-tests, ANOVA, linear regression, etc. — and uses confidence intervals, hypothesis testing, and p-values. These topics will remain the core of the Statistics curriculum.

Why is Big Data not being discussed earlier in Statistics?

This is an important question!

With the advances in computer technology, accessing and analyzing Large Data sets is now possible on PCs and laptops. However, opportunities to interact with such data are not currently a core part of the common Statistics curriculum.

Big Data is being discussed by students and is constantly the focus of questions (direct or indirect) put to faculty these days. The discussion of Big Data seems to be disconnected from the curriculum.

In the end, ...

Students of Statistics need to become much more capable with, and more knowledgeable users of, their own computers. These machines are now inherently capable of storing Large Data sets for analysis, thanks to the larger amounts of RAM being installed (8, 16, or 32 gigabytes), very large hard drives for storage (500 gigabytes, 1 terabyte, and beyond), and multiple cores for parallel computation (Core 2 Duo, Core i3, Core i5, Core i7, with 2 to 8 cores).

In the end, ...

Statistics educators should try to incorporate more use of Large Data sets and create exercises that include data munging or data wrangling.

The size of the data should become a topic of discussion when presenting standard statistical techniques, such as linear and logistic regression.

What is Big Data?

The idea of Big Data seems to focus on very large databases containing data sets that may span many data tables, with very large numbers of variables and very large numbers of observations. Big Data may also contain unstructured data that does not fall into a nice format (natural language or images, for example); this is a next step in the Big Data discussion.

These types of data sets may be stored in .csv files and/or databases (other formats are also used), and they may be stored across multiple data servers. The size of the data sets far exceeds the RAM in a usual PC or laptop and far exceeds the usual hard drive space.

What is Big Data?

The enormity of Big Data makes giving students access to such data very difficult when the students are providing their own computer hardware.

What can be taught in the classroom?

While hands-on experience with Big Data is not yet easily accessible for most students at the introductory undergraduate level, and is difficult even at the MS level, there are many foundational computing experiences that could be included in the current curriculum. These would be very valuable to students as they build experience with Large Data and prepare to eventually work with Big Data.

Computing platform(s) and software

Suggestion: Students need to learn GNU/Linux (or Mac OS X/BSD, or Cygwin).

(Feel free to disagree with me.) It appears that Unix skills are assumed when working with Big Data.

Students need to become familiar with command-line tools such as ls, cp, and ssh. Commands such as head, tail, more, less, and split should become common knowledge among Statistics undergraduate and MS-level students, as in the sketch below.
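As a minimal sketch, these tools can even be driven from within R using system(), assuming a GNU/Linux machine. The file name trips.csv is hypothetical, used only for illustration.

    # Inspect and split a large file from the command line, called from R.
    system("head -n 5 trips.csv")                # look at the first 5 lines
    system("tail -n 5 trips.csv")                # look at the last 5 lines
    system("wc -l trips.csv")                    # count the number of rows
    system("split -l 1000000 trips.csv trips_")  # 1,000,000-line pieces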

Computing platform(s) and software

R is an open source software package. Using it on an open source operating system, such as GNU/Linux, is the natural next step.

The basics of Perl and Python should also be added.

Learn about databases and SQL

Suggestion: Students need to learn MySQL (or MariaDB or SQLite).

Students need to become familiar with SQL commands such as SELECT, FROM, and WHERE.

SQL should become common knowledge among Statistics senior undergraduate and MS level students.
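A minimal sketch of a query from within R, assuming an ODBC data source has been configured for the MySQL server; the data source name "classdb", the table "trips", and the column "fare_amount" are all hypothetical.

    library(RODBC)
    # Connect to the hypothetical MySQL data source and run a query.
    con <- odbcConnect("classdb", uid = "student", pwd = "xxxxxx")
    big_fares <- sqlQuery(con, "SELECT * FROM trips WHERE fare_amount > 50")
    odbcClose(con)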

Example 1. Simulated data for cluster analysis.

  • Simulate a Large Data set that has millions of observations with different clusters. Ideally the data set can be loaded into memory, within R, so a final analysis can be performed.
  • Split the file so the parts can be loaded quickly into memory, within R.
  • Access the same data in a database, from within R.
  • Propose a sampling procedure of the Large Data set to produce an appropriate random sample from the overall Large Data set.

Example 1. Simulated data for cluster analysis.

  • Propose various forms of analysis and perform them on the sample.
  • Propose an overall analysis and perform it.
  • Evaluate the analysis.
  • Communicate the results clearly.

Example 1. Analysis

  • Consider simulating data for cluster analysis. See the MixSim package in R.
  • Use the first 1000 observations to develop the R code.
  • Export the simulated data set from R and import it into a MySQL database. Start with the first 1000 observations. Use the write.csv() function in R and the RODBC package to connect to the MySQL database (see the sketch below).
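A minimal sketch of this step, developed on the first 1000 observations; the data source name "mixsimdb" and table name "clusters" are hypothetical.

    library(MixSim)
    library(RODBC)
    Q <- MixSim(MaxOmega = 0.20, BarOmega = 0.05, K = 5, p = 2)
    A <- simdataset(n = 1000, Pi = Q$Pi, Mu = Q$Mu, S = Q$S)
    dat <- data.frame(A$X, id = A$id)
    write.csv(dat, "clusters.csv", row.names = FALSE)  # export from R
    con <- odbcConnect("mixsimdb")             # hypothetical MySQL DSN
    sqlSave(con, dat, tablename = "clusters")  # import into the database
    odbcClose(con)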

Example 1. Analysis

  • Propose a sampling procedure using the sample() function. For example, to draw three rows at random:

    A$X[sample(nrow(A$X), 3), ]

  • Since we are considering cluster analysis, try kmeans() with different numbers of clusters.

  • Try to find the best number of clusters using the entire data set. Compare with the simulated group labels and compute the classification errors (see the sketch below).

  • Examine the plots. Consider exporting the data to a .csv file and loading it into rattle() to use the Partition and the Evaluate tabs.
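One common way to compare numbers of clusters, sketched below, is an "elbow" plot of the total within-cluster sum of squares, followed by a cross-table of the fitted clusters against the simulated ids. This is one approach, not necessarily the only one.

    # Fit k-means for several k and plot the total within-cluster SS.
    wss <- sapply(2:8, function(k) kmeans(A$X, centers = k)$tot.withinss)
    plot(2:8, wss, type = "b", xlab = "number of clusters k",
         ylab = "total within-cluster SS")
    # Check classification against the simulated ids for the chosen k.
    B <- kmeans(A$X, centers = 5)
    table(true = A$id, fitted = B$cluster)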

Example 1. Analysis

  • Communicate the results. Try posting the code on GitHub and writing a blog post about the data and the analysis. Try making the final plot using the cloud-based software Tableau Public.

Example 1. Some R code

library(MixSim)

# simulate n.size observations from a K = 5 cluster mixture in p = 2
n.size <- 500000000   # 5e8 points: several gigabytes in RAM
Q <- MixSim(MaxOmega = 0.20, BarOmega = 0.05, K = 5, p = 2)
A <- simdataset(n = n.size, Pi = Q$Pi, Mu = Q$Mu, S = Q$S)

Example 1. Some R code

# plot the simulated data, colored by the true cluster id
X11()
colors <- c("red", "green", "blue", "brown", "magenta")
par(mar = c(0.1, 0.1, 0.1, 0.1))
plot(A$X, col = colors[A$id], pch = 19, cex = 0.8)

# fit k-means with 5 clusters and plot the fitted assignments
B <- kmeans(A$X, 5)
X11()
plot(A$X, col = colors[B$cluster], pch = 19, cex = 0.8)

Example 1. Summary

Note: The simulation can get quite large and still fit within the RAM on the computer.

Note: Writing to a .csv file gives a clear view of how large the data file is in KB.

Note: The plot() function is very slow. There is a clear need for an alternative method of visualizing the data; perhaps hexbin() or smoothScatter() could be introduced, as in the sketch below.
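A minimal sketch of both alternatives. smoothScatter() is in base R's graphics package; hexbin is a separate CRAN package.

    # Density-based views that scale to millions of points.
    smoothScatter(A$X)                            # smoothed density plot
    library(hexbin)
    plot(hexbin(A$X[, 1], A$X[, 2], xbins = 50))  # hexagonal binning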

Note: The kmeans() clustering method has limitations.

Example 2. NYC taxi data.

  • Read online about the FOIL request that Chris Whong made (2014), described on his blog, and the efforts that followed to decode the data. There are two Large Data sets: Trip Data (11.0 GB) and Fare Data (7.7 GB).
  • Split the files so the parts can be loaded quickly into memory, within R.
  • Access the same data in a database, from within R.

Example 2. NYC taxi data. Chris Whong's blog.

  • Propose a sampling procedure of the Large Data set to produce an appropriate random sample from the overall Large Data set.
  • Propose various forms of analysis and perform them on the sample.
  • Propose an overall analysis and perform it.
  • Evaluate the analysis.
  • Communicate the results clearly.

Example 2. Analysis

  • These data sets give students an opportunity to merge data. Doing the merge using SQL in a MySQL database may be challenging.
  • Sampling is needed when the full data file cannot be read into R. How should one proceed?
  • This Large Data set includes GPS data for pick-up and drop-off. How can the GPS information be used? It also includes a variable with the time and date in the same field. How can this field be split? (See the sketch below.)
  • Consider linear regression and logistic regression.
  • Read the other blog posts that followed.
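A hedged sketch of the chunked reading and the date-time split. The file and column names here follow the published trip data but should be checked against the actual files.

    # Read one manageable chunk of a file that is too big for RAM.
    chunk <- read.csv("trip_data_1.csv", nrows = 1000000)
    # Split the combined date-time field into separate pieces.
    pu <- as.POSIXct(chunk$pickup_datetime, format = "%Y-%m-%d %H:%M:%S")
    chunk$pickup_date <- as.Date(pu)
    chunk$pickup_hour <- as.integer(format(pu, "%H"))
    # The merge with the fare data would use the shared key columns, e.g.
    # merge(chunk, fares, by = c("medallion", "pickup_datetime")).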

Example 3. Airline on-time performance - Data expo 2009

  • Read online about the data and the original student data competition.
  • Split each file so the parts can be loaded quickly into memory, within R.
  • Access the same data in a database, from within R.
  • Propose a sampling procedure of the Large Data set to produce an appropriate random sample from the overall Large Data set (see the sketch below).
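A minimal sketch of one such procedure: read the file in chunks and keep each line independently with probability 0.01, giving an approximate 1% simple random sample. 2008.csv is one of the Data Expo files.

    # Approximate 1% simple random sample from a .csv too large to load.
    con <- file("2008.csv", open = "r")
    header <- readLines(con, n = 1)
    keep <- character(0)
    repeat {
      chunk <- readLines(con, n = 100000)   # read 100,000 lines at a time
      if (length(chunk) == 0) break
      keep <- c(keep, chunk[runif(length(chunk)) < 0.01])
    }
    close(con)
    flights <- read.csv(textConnection(c(header, keep)))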

Example 3. Airline on-time performance - Data expo 2009

  • Propose various forms of analysis and perform them on the sample.
  • Propose an overall analysis and perform it.
  • Evaluate the analysis.
  • Communicate the results clearly.

Example 4. Titanic

  • Kaggle: Predict survival on the Titanic.
  • Read the website to learn about the current and past competitions.
  • This is an excellent introduction to how these competition sites work.

Example 5. Airline data again

  • Revolution Analytics' rxLogit().
  • See the YouTube video showing a large data analysis.
  • It should be possible for students to replicate the analysis if they have a powerful enough computer, or they can try it for free on Amazon EC2 (a sketch follows).
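A hedged sketch of the RevoScaleR workflow; it requires Revolution R Enterprise, and the derived variable ArrDel15 and the model are illustrative, not a transcription of the video.

    library(RevoScaleR)
    # Import the .csv into the scalable .xdf format, creating a 0/1
    # indicator for arrival delays over 15 minutes along the way.
    airXdf <- rxImport(inData = "2008.csv", outFile = "air.xdf",
                       transforms = list(ArrDel15 = as.numeric(ArrDelay > 15)),
                       overwrite = TRUE)
    # Logistic regression on the out-of-memory data set.
    fit <- rxLogit(ArrDel15 ~ DayOfWeek, data = airXdf)
    summary(fit)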

Example 6. For introductory Statistics at the Freshman level

  • What do your passwords look like?
  • Do they look like this?

5'+M{bh6U7VGsT!2T&Zr}zv&HDSvi2M

  • How can random passwords be generated? (See the sketch below.)
  • How should passwords be stored in a database?
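A minimal sketch of the generation step: draw each character independently and uniformly from a pool, so that every possible string of that length is equally likely. (For storage, programs like KeePass keep passwords in an encrypted database rather than in plain text.)

    # Generate a 31-character random password, each character equally
    # likely from a pool of letters, digits, and symbols.
    pool <- c(letters, LETTERS, 0:9, strsplit("!@#$%^&*()[]{}+-", "")[[1]])
    paste(sample(pool, 31, replace = TRUE), collapse = "")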

Example 6. Summary

  • The usual answer is some short collection of words and symbols.

  • No. This is one reason for studying probability and understanding equally likely outcomes these days.

  • KeePass is a nice program for generating and storing such passwords.

Next Steps

Further efforts need to be made to develop Statistics faculty experience with …

Conclusions

Having recently learned about the Oracle Big Data Lite effort, I see that I have been trying to do what Oracle has produced, using R alone on Linux. My next step is to get access to an install or a VM of Oracle Linux.

Learning about Big Data and the computing software related to accessing and analyzing Large Data sets is now possible.

Statistics education needs to find a place for introducing these ideas to the students.

References

Melnykov, V., Chen, W.-C., and Maitra, R. (2012). MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms. Journal of Statistical Software.

Whong, C. (2014). FOILing NYC's Taxi Trip Data. Blog post.

NYC taxi trips data.

Kaggle. Predict survival on the Titanic.

Hey, T., Tansley, S., and Tolle, K., eds. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.