Eric A. Suess
California State University, East Bay
JSM 2014 Boston
August 6, 2014
We present examples of accessing and analyzing large data sets for use in a classroom at the first-year graduate level or senior undergraduate level, with progressively larger data sets. The beginning examples focus on in-memory data sets; the later examples extend beyond the physical memory available on the computer.
Simulated data sets are suggested and sources for real world data are given.
Data visualization, classification, and prediction are shown.
R, Revolution Analytics R, and MySQL.
We became interested in Big Data and the use of parallel computation (distributed computing), and to a lesser extent parallel data storage (distributed data storage), during the excitement of the Netflix competition a few years ago (2009). Learning that the winners of the competition used Amazon EC2 clusters and S3 storage inspired us to investigate and learn about these new computing environments, now referred to as “cloud computing”.
The Heritage Health Prize competition (2012) was also very exciting. And now, with Kaggle serving as an online home for these types of competitions, open data analysis competitions are posted regularly. Many of the Kaggle competitions focus on Large Data problems that benefit from parallel computation and can be used as a stepping stone toward working with Big Data, Analytics, and Data Science.
The computing techniques and hardware needed to work with Big Data currently seem far removed from introductory course work in Statistics at all levels: lower division and upper division Statistics courses, and first-year graduate courses in Statistics.
However, with some effort to motivate the ideas of Big Data, parallel computing, and distributed storage at earlier stages in Statistics education, it would be possible to bring these topics into the classroom.
This would be very exciting to students, and it would connect their studies with the discussions going on in the media and beyond.
This is an important question!
All of the core curriculum focuses on traditional methods, t-tests, ANOVA, Linear Regression, etc., and uses Confidence Intervals, Hypothesis Testing, and p-values. These are core topics that will remain at the core of the Statistics curriculum.
This is an important question!
With the advances in computer technology, accessing and analyzing Large Data sets is now possible on PCs and laptops. However, opportunities to interact with such data are not currently a core part of the common Statistics curriculum.
Big Data is being discussed by students and is constantly the focus of questions (direct or indirect) asked of faculty these days. The discussion of Big Data seems to be disconnected from the curriculum.
Students of Statistics need to become much more capable with, and more knowledgeable users of, their own computers, which are now inherently capable of storing Large Data sets for analysis: larger amounts of RAM (8, 16, or 32 gigabytes), very large hard drives (500 gigabytes, 1 terabyte, and beyond), and multi-core processors for parallel computation (Core 2 Duo, Core i3, Core i5, Core i7, with 2, 4, or 8 cores).
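As a small in-class exercise (our suggestion, not part of the original examples), students can query their own machine from R to see what it offers:
parallel::detectCores()   # number of CPU cores available for parallel computation
# RAM footprint of a 10-million-row, 2-column matrix of doubles
print(object.size(matrix(rnorm(2e7), ncol = 2)), units = "MB")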
Statistics educators should try to incorporate more use of Large Data sets and create exercises that include data munging or data wrangling.
The size of the data should become a topic of discussion when presenting standard statistical techniques, such as linear and logistic regression.
The idea of Big Data tends to focus on very large databases containing many data tables with very large numbers of variables and very large numbers of observations, or on unstructured data that does not fall into a nice format (natural language or images, for example); the latter is a next step in the Big Data discussion.
These types of data sets may be stored in .csv files and/or databases (other formats are also used), and they may be spread across multiple data servers. The size of the data sets far exceeds the RAM in a usual PC or laptop and far exceeds the usual hard drive space.
The enormity of Big Data makes giving students access to such data very difficult when the students are providing their own computer hardware.
While hands-on experience with Big Data is not yet easily accessible for most students at the introductory undergraduate level, and is difficult even at the MS level, there are many foundational computing experiences that could be included in the current curriculum. These would be very valuable for building students' experience with Large Data so that they are eventually prepared to work with Big Data.
Suggestion: Students need to learn GNU/Linux (or Mac/BSD, or Cygwin).
(Feel free to disagree with me.) It appears that Unix skills are assumed when working with Big Data.
Students need to become familiar with command-line tools such as ls, cp, and ssh. Commands such as head, tail, more, less, and split should become common knowledge among Statistics undergraduate and MS-level students.
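These same commands can also be run from inside an R session with system(); a minimal sketch, assuming a Unix-like system and a hypothetical file simdata.csv in the working directory:
system("head -n 5 simdata.csv")                 # first few lines
system("tail -n 5 simdata.csv")                 # last few lines
system("wc -l simdata.csv")                     # number of rows
system("split -l 1000000 simdata.csv chunk_")   # split into 1,000,000-line pieces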
Export the simulated data with the write.csv() function in R. Use the RODBC library to connect to the MySQL database. Propose a sampling procedure using the sample() function. For example,
A$X[sample(nrow(A$X), 3), ]   # a random sample of 3 rows from the simulated matrix
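A minimal sketch of these suggestions, assuming the simulated data set A created in the code below has been loaded into a hypothetical MySQL table simdata reachable through an ODBC data source named mysql_dsn (the DSN, user name, and table name are placeholders):
library(RODBC)
# Export the simulated data (created below) to a .csv file for loading into MySQL.
write.csv(data.frame(A$X, id = A$id), "simdata.csv", row.names = FALSE)
# Connect through ODBC and pull a random sample of rows on the database side.
con <- odbcConnect("mysql_dsn", uid = "student", pwd = "********")
samp <- sqlQuery(con, "SELECT * FROM simdata ORDER BY RAND() LIMIT 1000")
odbcClose(con)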
Since we are considering cluster analysis, try kmeans() with different numbers of clusters.
Try to find the best number of clusters with the entire data set. Compare with the simulated values for the groups. Compute the errors in classification. (A sketch of this comparison follows the code below.)
Examine the plots. Consider exporting the data to a .csv file and loading it into rattle() to use the Partition and the Evaluate tabs.
library(MixSim)   # simulate Gaussian mixture data with controlled overlap

n.size <- 500000000   # number of observations to simulate (adjust to fit the RAM on your machine)

# Draw a 5-component mixture in 2 dimensions with specified pairwise overlap.
Q <- MixSim(MaxOmega = 0.20, BarOmega = 0.05, K = 5, p = 2)
A <- simdataset(n = n.size, Pi = Q$Pi, Mu = Q$Mu, S = Q$S)

# Plot the simulated points colored by their true group labels.
X11()
colors <- c("red", "green", "blue", "brown", "magenta")
par(mar = c(0.1, 0.1, 0.1, 0.1))
plot(A$X, col = colors[A$id], pch = 19, cex = 0.8)

# Cluster with k-means (k = 5) and plot the points colored by fitted cluster.
B <- kmeans(A$X, 5)
X11()
plot(A$X, col = colors[B$cluster], pch = 19, cex = 0.8)
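Following the suggestion above, a minimal sketch of comparing different numbers of clusters and computing the classification errors; the subsample size and the range of k are our own choices, not part of the original example:
# Work on a random subsample for speed when n.size is very large.
idx <- sample(nrow(A$X), 100000)

# Total within-cluster sum of squares for k = 2, ..., 8; look for an "elbow".
wss <- sapply(2:8, function(k) kmeans(A$X[idx, ], centers = k, nstart = 5)$tot.withinss)
plot(2:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# Cross-tabulate the true simulated groups against the k = 5 clustering from above;
# after matching labels, the off-diagonal counts are the classification errors.
table(A$id, B$cluster)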
Note: The simulation can get quite large and still fit within the RAM on the computer.
Note: Writing to a .csv file gives a clear view of how large the data file is in KB.
Note: The plot() function is very slow. There is a clear need for an alternative method of visualizing the data; maybe hexbin() or smoothScatter() could be introduced (a sketch follows these notes).
Note: The kmeans() clustering method has limitations.
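A minimal sketch of the two alternatives mentioned above; both scale far better than plot() on millions of points (the number of bins is an arbitrary choice):
# Density-shaded scatterplot from base R graphics.
smoothScatter(A$X)

# Hexagonal binning with the hexbin package.
library(hexbin)
plot(hexbin(A$X[, 1], A$X[, 2], xbins = 60))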
rxLogit(): see the YouTube video, which shows a large data analysis.
It should be possible for students to replicate the data analysis if they have a powerful enough computer, or they can give it a try for free on Amazon EC2.
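For students who want to try something similar, a minimal sketch using the RevoScaleR functions from Revolution R; the file names, column names, and the derived 0/1 label are hypothetical, not taken from the video:
library(RevoScaleR)

# Import the .csv into the scalable .xdf format, adding a 0/1 label on the fly.
simXdf <- rxImport(inData = "simdata.csv", outFile = "simdata.xdf",
                   transforms = list(big = as.integer(id > 2)),
                   overwrite = TRUE)

# Logistic regression computed in chunks, so the data need not fit in RAM.
fit <- rxLogit(big ~ X1 + X2, data = simXdf)
summary(fit)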
5'+M{bh6U7VGsT!2T&Zr}zv&HDSvi2M
The usual answer is some short collection of words and symbols.
No. This is one reason for studying probability and understanding equally likely outcomes these days.
KeePass is a nice program.
Further efforts need to be made to develop Statistics faculty experience with …
Oracle R Distribution, Oracle Big Data Lite
Determine connections to courses offered in Business
Determine connections to courses offered in CS
Having recently learned about the Oracle Big Data Lite effort, I see that I have been trying to do what Oracle has produced, using R alone on Linux. My next step is to get access to an install or a VM of Oracle Linux.
Learning about Big Data and the computing software related to accessing and analyzing Large Data sets is now possible.
Statistics education needs to find a place for introducing these ideas to the students.
Melnykov, Chen, and Maitra (2013). MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms. Journal of Statistical Software.
Chris Whong blog, FOILing NYC's Taxi Trip Data
Kaggle, Predict survival on the Titanic