Data, Data, More Data

Shige

Use data generated by other statistical packages

Using the foreign package, R can read and write data in many different formats, including

  • SAS
  • SPSS
  • Stata
  • Systat
  • Minitab
  • EpiInfo
  • dBase/Foxbase

Example 1: Stata

library(foreign)
auto <- read.dta("http://www.stata-press.com/data/r12/auto.dta")
names(auto)
 [1] "make"         "price"        "mpg"          "rep78"       
 [5] "headroom"     "trunk"        "weight"       "length"      
 [9] "turn"         "displacement" "gear_ratio"   "foreign"     

We can do the same thing in Stata.

Demo using Stata: run “read_data.do”

Excel

There are a number of R packages that can read and write Excel files, including:

  • XLConnect
  • excel.link
  • xlsx
  • RODBC
  • gdata

Here is a nice introduction. Again, it is a bad idea to use Excel for serious data analysis!
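
As a quick taste, a minimal sketch using the xlsx package (the file name “data.xlsx” is hypothetical):

library(xlsx)
dat <- read.xlsx("data.xlsx", sheetIndex = 1)  # read the first sheet into a data frame
head(dat)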

Whenever possible, store your data in plain text format

  • Proprietary software comes and goes, and so do proprietary binary data formats
  • If you want your data to stay usable for as long as possible, save them as plain text files
  • Comma-separated values (CSV) is the most widely used plain text format
  • Any decent statistical package can read and write CSV

Let's take a look at an example.

CSV with R

  • The ability to read and write CSV files is part of base R
  • The syntax is very simple:
x08 <- read.csv("2008.csv", na.strings = "NA")

This creates an R data frame called “x08” from the plain text file “2008.csv”.
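
Writing works the same way; for example (the output file name is arbitrary):

write.csv(x08, "2008_copy.csv", row.names = FALSE)  # write the data frame back out as CSV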

CSV with Stata

A simple demo importing CSV data into Stata

Some HUGE CSV files

We are going to work with some huge CSV files.

I will use the data for 2004, 2005, 2006, 2007, and 2008. Combined, we are dealing with CSV files roughly 3.5 GB in size; the resulting R data frame exceeds 5 GB.

  • My computer has 4 GB of physical memory
  • I am running 64-bit Windows
  • With Firefox and RStudio running, I have roughly 2.4 GB of free physical memory
  • In other words, the data are too big to fit in memory
  • Stata, which loads the entire dataset into memory, will not work

R for "big" data: I

  • R was not originally designed for huge data (not surprising, given its origins in a statistics department)
  • Over the years, a number of facilities have been developed that enable R to process huge data sets (i.e., larger than available RAM)
  • This is thanks to the free and open-source nature of R
  • More information can be found here

R for "big" data: II

In short, out-of-memory data can be handled in the following ways:

  • Use R with a SQL database (see the sketch after this list)
  • Use R with Hadoop
  • Use R packages (ff, bigmemory, etc.)
  • Use Revolution R
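
As a taste of the database route, a minimal sketch using the DBI and RSQLite packages (the database file, table name, and column are assumptions, not from the demo):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "airline.db")  # on-disk SQLite database (hypothetical)
# Let SQL do the filtering, so only the matching rows ever enter RAM
late <- dbGetQuery(con, "SELECT * FROM flights WHERE ArrDelay > 60")
dbDisconnect(con)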

R for "big" data: III

  • The “ff” package is my big-data package of choice
  • It provides a data.frame-like construct (unlike bigmemory, which only provides vectors and matrices); see the sketch below
  • It is written in C++ with R wrapper functions
  • It uses highly efficient binary “swap” files on disk
  • The size of your data is limited by your hard drive instead of RAM
  • More helper functions are provided by the ffbase package
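
A minimal sketch of reading a big CSV into an ffdf, reusing the “2008.csv” file from the earlier example (the chunk sizes are arbitrary):

library(ff)
# Read the file in chunks into an on-disk ffdf; first.rows/next.rows
# control how many rows are read per chunk
x08 <- read.csv.ffdf(file = "2008.csv", header = TRUE,
                     first.rows = 10000, next.rows = 50000)
class(x08)  # "ffdf"
dim(x08)    # rows and columns, as for an ordinary data frame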

A demo session

  • Using airline data for 2004, 2005, 2006, 2007, and 2008
  • Each file is about 700 MB in size and includes roughly 7 million cases and 29 variables
  • I am going to “stack” all five data sets, do some manipulation, and conduct some simple statistical analysis
  • Again, we are playing with 35 million records here

On to the demo. This could take some time, be warned!
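
For reference, a rough sketch of the stacking step, assuming the yearly files are named “2004.csv” through “2008.csv” and using ffdfappend() from ffbase (this is an illustration, not the exact demo code):

library(ff)
library(ffbase)
# Read the first year, then append the remaining years on disk
air <- read.csv.ffdf(file = "2004.csv", header = TRUE)
for (y in 2005:2008) {
  chunk <- read.csv.ffdf(file = paste0(y, ".csv"), header = TRUE)
  air <- ffdfappend(air, chunk)  # factor columns with mismatched levels may need extra care
}
nrow(air)  # roughly 35 million rows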

In summary

  • Store your raw data in plain text (or in a SQL database)
  • Read them in as “ffdf”, the data.frame-like format provided by the ff package
  • Do your data manipulation, paying special attention to factor variables with lots of levels (i.e., distinct values)
  • Subset your ffdf data and conduct statistical analysis on the subset of choice (see the sketch after this list)
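
For the last two steps, a minimal sketch reusing the stacked “air” ffdf from the demo (Month, ArrDelay, and Distance are columns in the airline data; the model itself is just illustrative):

library(ffbase)
dec <- subset(air, Month == 12)  # subset.ffdf filters on disk
dec_df <- as.data.frame(dec)     # the subset is small enough to pull into RAM
fit <- lm(ArrDelay ~ Distance, data = dec_df)
summary(fit)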

Going further

  • Materials on the ff and ffbase packages are sparse
  • I benefited greatly from this and this blog post
  • The online documentation for the “ff”, “ffbase”, and “biglm” packages is helpful, although it is not the most pleasant reading on earth

Revolution R

Pros:

  • Has core functions similar to the “ff” and “ffbase” combo
  • Better documented (it is a commercial product)
  • Supports more types of statistical analysis, at least for now

Cons:

  • Huge in size
  • Windows only (deal breaker)
  • Expensive (though free for academics)