Data, Data, More Data

Shige

Use data generated by other statistical packages

Using the foreign package, R can read and write data in many different formats, including

  • SAS
  • SPSS
  • Stata
  • Systat
  • Minitab
  • EpiInfo
  • dBase/Foxbase

Example 1: Stata

library(foreign)
auto <- read.dta("http://www.stata-press.com/data/r12/auto.dta")
names(auto)
 [1] "make"         "price"        "mpg"          "rep78"       
 [5] "headroom"     "trunk"        "weight"       "length"      
 [9] "turn"         "displacement" "gear_ratio"   "foreign"     

We can do the same thing in Stata.

Demo using Stata: run “read_data.do”

Excel

There are a number of R packages that can read and write Excel files, including:

  • XLConnect
  • excel.link
  • xlsx
  • RODBC
  • gdata

Here is a nice introduction. Again, it is a bad idea to use Excel for serious data analysis!
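
As a quick taste, a minimal sketch using the xlsx package (the file name “data.xlsx” is hypothetical):

library(xlsx)
dat <- read.xlsx("data.xlsx", sheetIndex = 1)  # read the first sheet into a data frame
head(dat)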

Whenever possible, store your data in plain text format

  • Proprietary software comes and goes, and so do proprietary binary data formats
  • If you want your data to stay usable for as long as possible, save them as plain text files
  • Comma-separated values (CSV) is the most widely used plain text format
  • Any decent statistical package can read and write CSV

Let's take a look at an example.

CSV with R

  • The ability to read and write CSV files is part of base R
  • The syntax is very simple:
x08 <- read.csv("2008.csv", na.strings = "NA")

This creates an R data frame called “x08” from the plain text file “2008.csv”.
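
Writing works the same way; for example (the output file name is arbitrary):

write.csv(x08, "2008_copy.csv", row.names = FALSE)  # write the data frame back out as CSV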

CSV with Stata

A simple demo importing CSV data into Stata

Some HUGE CSV files

We are going to work with some huge CSV files.

I will use the data for 2004, 2005, 2006, 2007, and 2008. Combined, we are dealing with CSV files roughly 3.5 GB in size; the resulting R data frame exceeds 5 GB.

  • My computer has 4 GB of physical memory
  • I am running 64-bit Windows
  • With Firefox and RStudio running, I have roughly 2.4 GB of free physical memory
  • In other words, the data are too big to fit in memory
  • Stata, which loads the entire dataset into memory, will not work

R for "big" data: I

  • R was not originally designed for huge data (not surprising, given its origins in a statistics department)
  • Over the years, a number of facilities have been developed that enable R to process huge data sets (i.e., larger than available RAM)
  • This is thanks to the free and open-source nature of R
  • More information can be found here

R for "big" data: II

In short, out-of-memory data can be handled in the following ways:

  • Use R with a SQL database (see the sketch after this list)
  • Use R with Hadoop
  • Use R packages (ff, bigmemory, etc.)
  • Use Revolution R
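
As a taste of the database route, a minimal sketch using the DBI and RSQLite packages (the database file, table name, and column are assumptions, not from the demo):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "airline.db")  # on-disk SQLite database (hypothetical)
# Let SQL do the filtering, so only the matching rows ever enter RAM
late <- dbGetQuery(con, "SELECT * FROM flights WHERE ArrDelay > 60")
dbDisconnect(con)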

R for "big" data: III

  • The “ff” package is my big-data package of choice
  • It provides a data.frame-like construct (unlike bigmemory, which only provides vectors and matrices); see the sketch below
  • It is written in C++ with R wrapper functions
  • It uses highly efficient binary “swap” files on disk
  • The size of your data is limited by your hard drive instead of RAM
  • More helper functions are provided by the ffbase package
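
A minimal sketch of reading a big CSV into an ffdf, reusing the “2008.csv” file from the earlier example (the chunk sizes are arbitrary):

library(ff)
# Read the file in chunks into an on-disk ffdf; first.rows/next.rows
# control how many rows are read per chunk
x08 <- read.csv.ffdf(file = "2008.csv", header = TRUE,
                     first.rows = 10000, next.rows = 50000)
class(x08)  # "ffdf"
dim(x08)    # rows and columns, as for an ordinary data frame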

A demo session

  • Using airline data for 2004, 2005, 2006, 2007, and 2008
  • Each file is about 700 MB in size and includes roughly 7 million cases and 29 variables
  • I am going to “stack” all five data sets, do some manipulation, and conduct some simple statistical analysis
  • Again, we are playing with 35 million records here

On to the demo. This could take some time, be warned!
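
For reference, a rough sketch of the stacking step, assuming the yearly files are named “2004.csv” through “2008.csv” and using ffdfappend() from ffbase (this is an illustration, not the exact demo code):

library(ff)
library(ffbase)
# Read the first year, then append the remaining years on disk
air <- read.csv.ffdf(file = "2004.csv", header = TRUE)
for (y in 2005:2008) {
  chunk <- read.csv.ffdf(file = paste0(y, ".csv"), header = TRUE)
  air <- ffdfappend(air, chunk)  # factor columns with mismatched levels may need extra care
}
nrow(air)  # roughly 35 million rows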

In summary

  • Store your raw data in plain text (or in a SQL database)
  • Read them in as “ffdf”, the data.frame-like format provided by the ff package
  • Do your data manipulation, paying special attention to factor variables with lots of levels (i.e., distinct values)
  • Subset your ffdf data and conduct statistical analysis on the subset of choice (see the sketch after this list)
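
For the last two steps, a minimal sketch reusing the stacked “air” ffdf from the demo (Month, ArrDelay, and Distance are columns in the airline data; the model itself is just illustrative):

library(ffbase)
dec <- subset(air, Month == 12)  # subset.ffdf filters on disk
dec_df <- as.data.frame(dec)     # the subset is small enough to pull into RAM
fit <- lm(ArrDelay ~ Distance, data = dec_df)
summary(fit)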

Going further

  • Materials on the ff and ffbase packages are sparse
  • I benefited greatly from this and this blog post
  • The online documentation for the “ff”, “ffbase”, and “biglm” packages is helpful, although it is not the most pleasant reading on earth

Revolution R

Pros:

  • Has core functions similar to the “ff” and “ffbase” combo
  • Better documented (it is a commercial product)
  • Supports more types of statistical analysis, at least for now

Cons:

  • Huge in size
  • Windows only (deal breaker)
  • Expensive (though free for academics)