Setting Up

Reading Data

Some base R functions to read data

Writing Data

Similar functions to write data

Know Thy System

In general, when using R with larger datasets, it’s useful to know a few things about your system.

Calculating Memory Requirements

I have a data frame with 2,000,000 rows and 150 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?

# bytes
2000000*150*8 
## [1] 2.4e+09
# bytes/MB
2000000*150*8/2^{20}
## [1] 2288.818
# MB
round(2000000*150*8/2^{20},2)
## [1] 2288.82
# GB
round(2000000*150*8/2^{20}/1024, 2)
## [1] 2.24

Data files come from

Where you can get data

Downloading file from the web

wd = setwd('~/Syncplicity/WPI/Dingari Lectures/Lecture 2')
fileUrl = "https://raw.githubusercontent.com/dnchari/Rdatasets/master/csv/MASS/Cars93.csv"
download.file(fileUrl,destfile="./data/cars93.csv",method="curl")
list.files("./data")
## [1] "camera.xlsx" "cars93.csv"  "species.csv" "surveys.csv"
dateDownloaded = date()
dateDownloaded
## [1] "Sat May 16 12:22:23 2015"

Some notes about download.file()

Loading data you have saved - read.table()

Example - Cars93 data

cars = read.csv("./data/cars93.csv")
head(cars)
##   X Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
## 1 1        Acura Integra   Small      12.9  15.9      18.8       25
## 2 2        Acura  Legend Midsize      29.2  33.9      38.7       18
## 3 3         Audi      90 Compact      25.9  29.1      32.3       20
## 4 4         Audi     100 Midsize      30.8  37.7      44.6       19
## 5 5          BMW    535i Midsize      23.7  30.0      36.2       22
## 6 6        Buick Century Midsize      14.2  15.7      17.3       22
##   MPG.highway            AirBags DriveTrain Cylinders EngineSize
## 1          31               None      Front         4        1.8
## 2          25 Driver & Passenger      Front         6        3.2
## 3          26        Driver only      Front         6        2.8
## 4          26 Driver & Passenger      Front         6        2.8
## 5          30        Driver only       Rear         4        3.5
## 6          31        Driver only      Front         4        2.2
##   Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1        140 6300         2890             Yes               13.2
## 2        200 5500         2335             Yes               18.0
## 3        172 5500         2280             Yes               16.9
## 4        172 5500         2535             Yes               21.1
## 5        208 5700         2545             Yes               21.1
## 6        110 5200         2565              No               16.4
##   Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1          5    177       102    68          37           26.5
## 2          5    195       115    71          38           30.0
## 3          5    180       102    67          37           28.0
## 4          6    193       106    70          37           31.0
## 5          4    186       109    69          39           27.0
## 6          6    189       105    69          41           28.0
##   Luggage.room Weight  Origin          Make
## 1           11   2705 non-USA Acura Integra
## 2           15   3560 non-USA  Acura Legend
## 3           14   3375 non-USA       Audi 90
## 4           17   3405 non-USA      Audi 100
## 5           13   3640 non-USA      BMW 535i
## 6           16   2880     USA Buick Century

Package xlsx

library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
fileUrl = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/camera.xlsx",method="curl")
cameraData = read.xlsx("./data/camera.xlsx",sheetIndex=1)
head(cameraData)
##                          address direction      street  crossStreet
## 1       S CATON AVE & BENSON AVE       N/B   Caton Ave   Benson Ave
## 2       S CATON AVE & BENSON AVE       S/B   Caton Ave   Benson Ave
## 3 WILKENS AVE & PINE HEIGHTS AVE       E/B Wilkens Ave Pine Heights
## 4        THE ALAMEDA & E 33RD ST       S/B The Alameda      33rd St
## 5        E 33RD ST & THE ALAMEDA       E/B      E 33rd  The Alameda
## 6        ERDMAN AVE & N MACON ST       E/B      Erdman     Macon St
##                 intersection                      Location.1
## 1     Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
## 2     Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
## 3 Wilkens Ave & Pine Heights  (39.2720252302, -76.676960806)
## 4     The Alameda  & 33rd St (39.3285013141, -76.5953545714)
## 5      E 33rd  & The Alameda (39.3283410623, -76.5953594625)
## 6         Erdman  & Macon St (39.3068045671, -76.5593167803)

Getting data off webpages

cite = url('http://scholar.google.com/citations?user=dC24-QsAAAAJ')
library(XML)
htmlCode = readLines(cite)
## Warning in readLines(cite): incomplete final line found on
## 'http://scholar.google.com/citations?user=dC24-QsAAAAJ'
close(cite)
html3 = htmlTreeParse("http://scholar.google.com/citations?user=dC24-QsAAAAJ", useInternalNodes=T)
xpathSApply(html3, "//title", xmlValue)
## [1] "Narahara Chari Dingari - Google Scholar Citations"

Other packages for getting data off webpages

Connecting to Databases

library(RJDBC)
# Oracle Database
drv = JDBC(driverClass = "oracle.jdbc.OracleDriver", classPath = "ojdbc/ojdbc6.jar", identifier.quote="`")
conn = dbConnect(drv, "jdbc:oracle:thin:username/password@myHost:1521:db")
# MySQL Database
drv = JDBC("com.mysql.jdbc.Driver", "driver/mysql-connector-java-5.1.34-bin.jar", identifier.quote="`")
conn = dbConnect(drv, "jdbc:mysql://myHost","username","password")

# RMySQL Package
library(RMySQL)
conn = dbConnect(MySQL(), user = 'myUserName', password = 'myPassword', host = 'myHost.com', dbname='dbName')
dbListTables(conn)
t1 = dbReadTable(conn, "table1")
t1Data = dbSendQuery(conn,"SELECT column1,column2,column2 FROM table1")

dbListTables(conn)
data1 = dbReadTable(conn, "Table_Name")
data2 = dbGetQuery(conn, "select * from SCHEMA.TABLE")
dbDisconnect(conn)

stringr package

library(stringr)
wpi = "Worcester Polytechnic Institute"
str_sub(wpi, 1, 9)
## [1] "Worcester"
str_sub(wpi, end=9)
## [1] "Worcester"
str_sub(wpi, -9)
## [1] "Institute"
months = month.name
str_detect(months, "a") #Contains
##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE
months[str_detect(months, "a")]
## [1] "January"  "February" "March"    "May"
months[str_detect(months, "^a")] # Starts with 
## character(0)
months[str_detect(months, ignore.case("^a"))] # Making case-insensitive
## [1] "April"  "August"
months[str_detect(months, ignore.case("y$"))] # Ends with
## [1] "January"  "February" "May"      "July"
months[str_detect(months, ignore.case('[aeiou]'))] # Contains Vowels
##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December"

Another example

library(RCurl)
slink = 'https://raw.githubusercontent.com/dnchari/DS501/dnchari-functions/numbers2words.R'
n2w = getURL(slink, ssl.verifypeer = FALSE)
eval(parse(text = n2w),envir=.GlobalEnv)
numbers=numbers2words(1:99)
numbers
##  [1] "one"           "two"           "three"         "four"         
##  [5] "five"          "six"           "seven"         "eight"        
##  [9] "nine"          "ten"           "eleven"        "twelve"       
## [13] "thirteen"      "fourteen"      "fifteen"       "sixteen"      
## [17] " seventeen"    "eighteen"      "nineteen"      "twenty"       
## [21] "twenty one"    "twenty two"    "twenty three"  "twenty four"  
## [25] "twenty five"   "twenty six"    "twenty seven"  "twenty eight" 
## [29] "twenty nine"   "thirty"        "thirty one"    "thirty two"   
## [33] "thirty three"  "thirty four"   "thirty five"   "thirty six"   
## [37] "thirty seven"  "thirty eight"  "thirty nine"   "forty"        
## [41] "forty one"     "forty two"     "forty three"   "forty four"   
## [45] "forty five"    "forty six"     "forty seven"   "forty eight"  
## [49] "forty nine"    "fifty"         "fifty one"     "fifty two"    
## [53] "fifty three"   "fifty four"    "fifty five"    "fifty six"    
## [57] "fifty seven"   "fifty eight"   "fifty nine"    "sixty"        
## [61] "sixty one"     "sixty two"     "sixty three"   "sixty four"   
## [65] "sixty five"    "sixty six"     "sixty seven"   "sixty eight"  
## [69] "sixty nine"    "seventy"       "seventy one"   "seventy two"  
## [73] "seventy three" "seventy four"  "seventy five"  "seventy six"  
## [77] "seventy seven" "seventy eight" "seventy nine"  "eighty"       
## [81] "eighty one"    "eighty two"    "eighty three"  "eighty four"  
## [85] "eighty five"   "eighty six"    "eighty seven"  "eighty eight" 
## [89] "eighty nine"   "ninety"        "ninety one"    "ninety two"   
## [93] "ninety three"  "ninety four"   "ninety five"   "ninety six"   
## [97] "ninety seven"  "ninety eight"  "ninety nine"

Detect if “A B C D” are there in this string

numbers[str_detect(numbers, ignore.case('[abcd]'))]
## character(0)
numbers=numbers2words(100:105)
numbers[str_detect(numbers, ignore.case('[abcd]'))]
## [1] "one hundred"       "one hundred one"   "one hundred two"  
## [4] "one hundred three" "one hundred four"  "one hundred five"
str_count(months, ignore.case("[aeiou]"))
##  [1] 3 3 1 2 1 2 1 3 3 3 3 3

Packages reshape2 and tidyr

library(reshape2)
aq = airquality[1:10,]
aq
##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10
# Wide to long
melt(aq, id.vars = c("Month", "Day"))
##    Month Day variable value
## 1      5   1    Ozone  41.0
## 2      5   2    Ozone  36.0
## 3      5   3    Ozone  12.0
## 4      5   4    Ozone  18.0
## 5      5   5    Ozone    NA
## 6      5   6    Ozone  28.0
## 7      5   7    Ozone  23.0
## 8      5   8    Ozone  19.0
## 9      5   9    Ozone   8.0
## 10     5  10    Ozone    NA
## 11     5   1  Solar.R 190.0
## 12     5   2  Solar.R 118.0
## 13     5   3  Solar.R 149.0
## 14     5   4  Solar.R 313.0
## 15     5   5  Solar.R    NA
## 16     5   6  Solar.R    NA
## 17     5   7  Solar.R 299.0
## 18     5   8  Solar.R  99.0
## 19     5   9  Solar.R  19.0
## 20     5  10  Solar.R 194.0
## 21     5   1     Wind   7.4
## 22     5   2     Wind   8.0
## 23     5   3     Wind  12.6
## 24     5   4     Wind  11.5
## 25     5   5     Wind  14.3
## 26     5   6     Wind  14.9
## 27     5   7     Wind   8.6
## 28     5   8     Wind  13.8
## 29     5   9     Wind  20.1
## 30     5  10     Wind   8.6
## 31     5   1     Temp  67.0
## 32     5   2     Temp  72.0
## 33     5   3     Temp  74.0
## 34     5   4     Temp  62.0
## 35     5   5     Temp  56.0
## 36     5   6     Temp  66.0
## 37     5   7     Temp  65.0
## 38     5   8     Temp  59.0
## 39     5   9     Temp  61.0
## 40     5  10     Temp  69.0
m1 = melt(aq, id.vars = c("Month", "Day"), variable.name='Var', value.name='Val')
head(m1)
##   Month Day   Var Val
## 1     5   1 Ozone  41
## 2     5   2 Ozone  36
## 3     5   3 Ozone  12
## 4     5   4 Ozone  18
## 5     5   5 Ozone  NA
## 6     5   6 Ozone  28
# Long to wide
dcast(m1, Month+Day~Var)
## Using Val as value column: use value.var to override.
##    Month Day Ozone Solar.R Wind Temp
## 1      5   1    41     190  7.4   67
## 2      5   2    36     118  8.0   72
## 3      5   3    12     149 12.6   74
## 4      5   4    18     313 11.5   62
## 5      5   5    NA      NA 14.3   56
## 6      5   6    28      NA 14.9   66
## 7      5   7    23     299  8.6   65
## 8      5   8    19      99 13.8   59
## 9      5   9     8      19 20.1   61
## 10     5  10    NA     194  8.6   69
library(tidyr)
aq = airquality[1:10,]
aq
##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10
# Wide to long
m1 = gather(aq, Var, Val, -Month, -Day)
head(m1)
##   Month Day   Var Val
## 1     5   1 Ozone  41
## 2     5   2 Ozone  36
## 3     5   3 Ozone  12
## 4     5   4 Ozone  18
## 5     5   5 Ozone  NA
## 6     5   6 Ozone  28
# Long to wide
spread(m1, Var, Val)
##    Month Day Ozone Solar.R Wind Temp
## 1      5   1    41     190  7.4   67
## 2      5   2    36     118  8.0   72
## 3      5   3    12     149 12.6   74
## 4      5   4    18     313 11.5   62
## 5      5   5    NA      NA 14.3   56
## 6      5   6    28      NA 14.9   66
## 7      5   7    23     299  8.6   65
## 8      5   8    19      99 13.8   59
## 9      5   9     8      19 20.1   61
## 10     5  10    NA     194  8.6   69

plyr package for Split-Apply-Combine

plyr has functions for operating on lists, data.frames and arrays. Each function performs:

INPUT/OUTPUT Array Data frame List Discarded
Array aaply adply alply a_ply
Data frame daply ddply dlply d_ply
list laply ldply llply l_ply
library(plyr)
d1 = chickwts
d1s = ddply(d1, c('feed'), summarise, mean.weight=mean(weight))
head(d1s)
##        feed mean.weight
## 1    casein    323.5833
## 2 horsebean    160.2000
## 3   linseed    218.7500
## 4  meatmeal    276.9091
## 5   soybean    246.4286
## 6 sunflower    328.9167
d1t = ddply(d1, c('feed'), transform, mean.weight=mean(weight))
head(d1t)
##   weight   feed mean.weight
## 1    368 casein    323.5833
## 2    390 casein    323.5833
## 3    379 casein    323.5833
## 4    260 casein    323.5833
## 5    404 casein    323.5833
## 6    318 casein    323.5833
# Coefficiet of Variation - Default function
d1f = ddply(d1, "feed", function(x) {
  mean.weight = mean(x$weight)
  sd.weight = sd(x$weight)
  cv = sd.weight/mean.weight
  data.frame(cv.weight = cv)
})
d1m = ddply(d1, c('feed'), mutate, mean.weight=mean(weight), sd.weight=sd(weight), cv.weight=mean.weight/sd.weight)
head(d1m)
##   weight   feed mean.weight sd.weight cv.weight
## 1    368 casein    323.5833  64.43384  5.021947
## 2    390 casein    323.5833  64.43384  5.021947
## 3    379 casein    323.5833  64.43384  5.021947
## 4    260 casein    323.5833  64.43384  5.021947
## 5    404 casein    323.5833  64.43384  5.021947
## 6    318 casein    323.5833  64.43384  5.021947

Additional capabilities

Dealing with errors

You can use the failwith function to control how errors are dealt with.

f = function(x) if (x == 1) stop("Error!") else 1
safe.f = failwith(NA, f, quiet = TRUE)
# llply(1:2, f)
llply(1:2, safe.f)
## [[1]]
## [1] NA
## 
## [[2]]
## [1] 1

Parallel processing

In conjunction with a package such as doParallel you can run your function separately on each core of your computer. On a dual core machine this make your code up to twice as fast. Simply register the cores and then set .parallel = TRUE. Look at the elapsed time in these examples.

x = c(1:10)
wait = function(i) Sys.sleep(0.1)
system.time(llply(x, wait))
##    user  system elapsed 
##   0.005   0.001   1.036
system.time(sapply(x, wait))
##    user  system elapsed 
##   0.001   0.001   1.035
library(doParallel)
registerDoParallel(cores = 16)
system.time(llply(x, wait, .parallel = TRUE))
##    user  system elapsed 
##   0.019   0.025   0.127

Progress bars

By specifying .progress = 'text' to your plyr function parameters, plyr will display a progress bar for long-running plyr operations.

Print

The *_ply functions have a .print option which will automatically wrap your function in a print() statement so outputs get written to display devices such as PDFs, PNGs etc.

Reasons for not using plyr

plyr can be slow, particularly if you are working with very large datasets that involve a lot of subsetting. Other faster options are

system.time(ddply(baseball, "id", summarize, length(year)))
##    user  system elapsed 
##   0.356   0.006   0.363
system.time(tapply(baseball$year, baseball$id,function(x) length(x)))
##    user  system elapsed 
##   0.006   0.000   0.006
library(data.table)
dt = data.table(baseball, key = "id")
system.time(dt[, length(year), by = list(id)])
##    user  system elapsed 
##   0.004   0.000   0.004
library(dplyr)
system.time(baseball %>%
  group_by(id) %>%
  summarize(length(year)))
##    user  system elapsed 
##   0.005   0.000   0.005

References

  • xlsx package by Adrian A. Dragulescu
  • XML and RCurl package by Duncan Temple Lang
  • RJDBC package by Simon Urbanek
  • RMySQL package by Jeroen Ooms et. al.
  • data.table package by Matt Dowle
  • stringr, tidyr, reshape2,plyr and dplyr packages by Hadley Wickham
  • doParallel package by Revolution Analytics, Steve Weston