Data Collection and Data Blending

Setting Up

setwd, set the working directory
getwd, returns an absolute filepath representing the current working directory
list.files or dir, list the files
system invokes the OS command

Reading Data

Some base R functions to read data

read.table, read.csv to read flat files
readLines, for reading lines of a text file
source, for reading in R code files
load, for reading in saved workspaces
download.file, for downloading file from webpage

Writing Data

Similar functions to write data

write.table, write.csv to wrote data
writeLines, for writing lines to a text file
save, writes an external representation of R objects to the specified file.

Know Thy System

In general, when using R with larger datasets, it’s useful to know a few things about your system.

How much memory is available?
What other applications are in use?
Are there other users logged into the same system?
What operating system?
Is the OS 32 or 64 bit?

Calculating Memory Requirements

I have a data frame with 2,000,000 rows and 150 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?

2,000,000 × 150 × 8 bytes/numeric

# bytes
2000000*150*8

## [1] 2.4e+09

# bytes/MB
2000000*150*8/2^{20}

## [1] 2288.818

# MB
round(2000000*150*8/2^{20},2)

## [1] 2288.82

# GB
round(2000000*150*8/2^{20}/1024, 2)

## [1] 2.24

2.24 GB

Data files come from

Tab-delimited text, Comma-separated text
Excel file
JSON File
HTML/XML file
Database

Where you can get data

From a colleague
From the web
From an application programming interface
By scraping a web page

Downloading file from the web

wd = setwd('~/Syncplicity/WPI/Dingari Lectures/Lecture 2')
fileUrl = "https://raw.githubusercontent.com/dnchari/Rdatasets/master/csv/MASS/Cars93.csv"
download.file(fileUrl,destfile="./data/cars93.csv",method="curl")
list.files("./data")

## [1] "camera.xlsx" "cars93.csv"  "species.csv" "surveys.csv"

dateDownloaded = date()
dateDownloaded

## [1] "Sat May 16 12:22:23 2015"

Some notes about download.file()

If the url starts with http you can use download.file()
If the url starts with https on Windows you may be ok
If the url starts with https on Mac you may need to set method=“curl”
If the file is big, this might take a while

Loading data you have saved - read.table()

This is the main function for reading data into R
Flexible and robust but requires more parameters
Reads the data into RAM - big data can cause problems
Important parameters file, header, sep, row.names, nrows
Related: read.csv, read.csv2

Example - Cars93 data

cars = read.csv("./data/cars93.csv")
head(cars)

##   X Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
## 1 1        Acura Integra   Small      12.9  15.9      18.8       25
## 2 2        Acura  Legend Midsize      29.2  33.9      38.7       18
## 3 3         Audi      90 Compact      25.9  29.1      32.3       20
## 4 4         Audi     100 Midsize      30.8  37.7      44.6       19
## 5 5          BMW    535i Midsize      23.7  30.0      36.2       22
## 6 6        Buick Century Midsize      14.2  15.7      17.3       22
##   MPG.highway            AirBags DriveTrain Cylinders EngineSize
## 1          31               None      Front         4        1.8
## 2          25 Driver & Passenger      Front         6        3.2
## 3          26        Driver only      Front         6        2.8
## 4          26 Driver & Passenger      Front         6        2.8
## 5          30        Driver only       Rear         4        3.5
## 6          31        Driver only      Front         4        2.2
##   Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1        140 6300         2890             Yes               13.2
## 2        200 5500         2335             Yes               18.0
## 3        172 5500         2280             Yes               16.9
## 4        172 5500         2535             Yes               21.1
## 5        208 5700         2545             Yes               21.1
## 6        110 5200         2565              No               16.4
##   Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1          5    177       102    68          37           26.5
## 2          5    195       115    71          38           30.0
## 3          5    180       102    67          37           28.0
## 4          6    193       106    70          37           31.0
## 5          4    186       109    69          39           27.0
## 6          6    189       105    69          41           28.0
##   Luggage.room Weight  Origin          Make
## 1           11   2705 non-USA Acura Integra
## 2           15   3560 non-USA  Acura Legend
## 3           14   3375 non-USA       Audi 90
## 4           17   3405 non-USA      Audi 100
## 5           13   3640 non-USA      BMW 535i
## 6           16   2880     USA Buick Century

Package `xlsx`

read.xlsx, to read the contents of a worksheet into an R data.frame. read.xlsx(file, sheetIndex, sheetName=NULL, header=TRUE, ...)
write.xlsx, to write a data.frame to an Excel workbook. write.xlsx(x, file, sheetName="Sheet1", row.names=TRUE, append=FALSE, ...)

library(xlsx)

## Loading required package: rJava
## Loading required package: xlsxjars

fileUrl = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/camera.xlsx",method="curl")
cameraData = read.xlsx("./data/camera.xlsx",sheetIndex=1)
head(cameraData)

##                          address direction      street  crossStreet
## 1       S CATON AVE & BENSON AVE       N/B   Caton Ave   Benson Ave
## 2       S CATON AVE & BENSON AVE       S/B   Caton Ave   Benson Ave
## 3 WILKENS AVE & PINE HEIGHTS AVE       E/B Wilkens Ave Pine Heights
## 4        THE ALAMEDA & E 33RD ST       S/B The Alameda      33rd St
## 5        E 33RD ST & THE ALAMEDA       E/B      E 33rd  The Alameda
## 6        ERDMAN AVE & N MACON ST       E/B      Erdman     Macon St
##                 intersection                      Location.1
## 1     Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
## 2     Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
## 3 Wilkens Ave & Pine Heights  (39.2720252302, -76.676960806)
## 4     The Alameda  & 33rd St (39.3285013141, -76.5953545714)
## 5      E 33rd  & The Alameda (39.3283410623, -76.5953594625)
## 6         Erdman  & Macon St (39.3068045671, -76.5593167803)

Getting data off webpages

cite = url('http://scholar.google.com/citations?user=dC24-QsAAAAJ')
library(XML)
htmlCode = readLines(cite)

## Warning in readLines(cite): incomplete final line found on
## 'http://scholar.google.com/citations?user=dC24-QsAAAAJ'

close(cite)

html3 = htmlTreeParse("http://scholar.google.com/citations?user=dC24-QsAAAAJ", useInternalNodes=T)
xpathSApply(html3, "//title", xmlValue)

## [1] "Narahara Chari Dingari - Google Scholar Citations"

Other packages for getting data off webpages

XML, httr, RCurl

Connecting to Databases

Packages RJDBC, RMySQL
Function JDBC creates a new DBI driver that can be used to start JDBC connections.

library(RJDBC)
# Oracle Database
drv = JDBC(driverClass = "oracle.jdbc.OracleDriver", classPath = "ojdbc/ojdbc6.jar", identifier.quote="`")
conn = dbConnect(drv, "jdbc:oracle:thin:username/password@myHost:1521:db")
# MySQL Database
drv = JDBC("com.mysql.jdbc.Driver", "driver/mysql-connector-java-5.1.34-bin.jar", identifier.quote="`")
conn = dbConnect(drv, "jdbc:mysql://myHost","username","password")

# RMySQL Package
library(RMySQL)
conn = dbConnect(MySQL(), user = 'myUserName', password = 'myPassword', host = 'myHost.com', dbname='dbName')
dbListTables(conn)
t1 = dbReadTable(conn, "table1")
t1Data = dbSendQuery(conn,"SELECT column1,column2,column2 FROM table1")

dbListTables(conn)
data1 = dbReadTable(conn, "Table_Name")
data2 = dbGetQuery(conn, "select * from SCHEMA.TABLE")
dbDisconnect(conn)

`stringr` package

Extract substrings from a character vector.

library(stringr)
wpi = "Worcester Polytechnic Institute"
str_sub(wpi, 1, 9)

## [1] "Worcester"

str_sub(wpi, end=9)

## [1] "Worcester"

str_sub(wpi, -9)

## [1] "Institute"

Detect the presence or absence of a pattern in a string.

months = month.name
str_detect(months, "a") #Contains

##  [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE

months[str_detect(months, "a")]

## [1] "January"  "February" "March"    "May"

months[str_detect(months, "^a")] # Starts with

## character(0)

months[str_detect(months, ignore.case("^a"))] # Making case-insensitive

## [1] "April"  "August"

months[str_detect(months, ignore.case("y$"))] # Ends with

## [1] "January"  "February" "May"      "July"

months[str_detect(months, ignore.case('[aeiou]'))] # Contains Vowels

##  [1] "January"   "February"  "March"     "April"     "May"      
##  [6] "June"      "July"      "August"    "September" "October"  
## [11] "November"  "December"

Another example

library(RCurl)
slink = 'https://raw.githubusercontent.com/dnchari/DS501/dnchari-functions/numbers2words.R'
n2w = getURL(slink, ssl.verifypeer = FALSE)
eval(parse(text = n2w),envir=.GlobalEnv)
numbers=numbers2words(1:99)
numbers

##  [1] "one"           "two"           "three"         "four"         
##  [5] "five"          "six"           "seven"         "eight"        
##  [9] "nine"          "ten"           "eleven"        "twelve"       
## [13] "thirteen"      "fourteen"      "fifteen"       "sixteen"      
## [17] " seventeen"    "eighteen"      "nineteen"      "twenty"       
## [21] "twenty one"    "twenty two"    "twenty three"  "twenty four"  
## [25] "twenty five"   "twenty six"    "twenty seven"  "twenty eight" 
## [29] "twenty nine"   "thirty"        "thirty one"    "thirty two"   
## [33] "thirty three"  "thirty four"   "thirty five"   "thirty six"   
## [37] "thirty seven"  "thirty eight"  "thirty nine"   "forty"        
## [41] "forty one"     "forty two"     "forty three"   "forty four"   
## [45] "forty five"    "forty six"     "forty seven"   "forty eight"  
## [49] "forty nine"    "fifty"         "fifty one"     "fifty two"    
## [53] "fifty three"   "fifty four"    "fifty five"    "fifty six"    
## [57] "fifty seven"   "fifty eight"   "fifty nine"    "sixty"        
## [61] "sixty one"     "sixty two"     "sixty three"   "sixty four"   
## [65] "sixty five"    "sixty six"     "sixty seven"   "sixty eight"  
## [69] "sixty nine"    "seventy"       "seventy one"   "seventy two"  
## [73] "seventy three" "seventy four"  "seventy five"  "seventy six"  
## [77] "seventy seven" "seventy eight" "seventy nine"  "eighty"       
## [81] "eighty one"    "eighty two"    "eighty three"  "eighty four"  
## [85] "eighty five"   "eighty six"    "eighty seven"  "eighty eight" 
## [89] "eighty nine"   "ninety"        "ninety one"    "ninety two"   
## [93] "ninety three"  "ninety four"   "ninety five"   "ninety six"   
## [97] "ninety seven"  "ninety eight"  "ninety nine"

Detect if “A B C D” are there in this string

numbers[str_detect(numbers, ignore.case('[abcd]'))]

## character(0)

numbers=numbers2words(100:105)
numbers[str_detect(numbers, ignore.case('[abcd]'))]

## [1] "one hundred"       "one hundred one"   "one hundred two"  
## [4] "one hundred three" "one hundred four"  "one hundred five"

String count

str_count(months, ignore.case("[aeiou]"))

##  [1] 3 3 1 2 1 2 1 3 3 3 3 3

Similarly str_extract, str_extract_all, str_length, str_sub

Packages `reshape2` and `tidyr`

In reshape2 package

library(reshape2)
aq = airquality[1:10,]
aq

##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10

# Wide to long
melt(aq, id.vars = c("Month", "Day"))

##    Month Day variable value
## 1      5   1    Ozone  41.0
## 2      5   2    Ozone  36.0
## 3      5   3    Ozone  12.0
## 4      5   4    Ozone  18.0
## 5      5   5    Ozone    NA
## 6      5   6    Ozone  28.0
## 7      5   7    Ozone  23.0
## 8      5   8    Ozone  19.0
## 9      5   9    Ozone   8.0
## 10     5  10    Ozone    NA
## 11     5   1  Solar.R 190.0
## 12     5   2  Solar.R 118.0
## 13     5   3  Solar.R 149.0
## 14     5   4  Solar.R 313.0
## 15     5   5  Solar.R    NA
## 16     5   6  Solar.R    NA
## 17     5   7  Solar.R 299.0
## 18     5   8  Solar.R  99.0
## 19     5   9  Solar.R  19.0
## 20     5  10  Solar.R 194.0
## 21     5   1     Wind   7.4
## 22     5   2     Wind   8.0
## 23     5   3     Wind  12.6
## 24     5   4     Wind  11.5
## 25     5   5     Wind  14.3
## 26     5   6     Wind  14.9
## 27     5   7     Wind   8.6
## 28     5   8     Wind  13.8
## 29     5   9     Wind  20.1
## 30     5  10     Wind   8.6
## 31     5   1     Temp  67.0
## 32     5   2     Temp  72.0
## 33     5   3     Temp  74.0
## 34     5   4     Temp  62.0
## 35     5   5     Temp  56.0
## 36     5   6     Temp  66.0
## 37     5   7     Temp  65.0
## 38     5   8     Temp  59.0
## 39     5   9     Temp  61.0
## 40     5  10     Temp  69.0

m1 = melt(aq, id.vars = c("Month", "Day"), variable.name='Var', value.name='Val')
head(m1)

##   Month Day   Var Val
## 1     5   1 Ozone  41
## 2     5   2 Ozone  36
## 3     5   3 Ozone  12
## 4     5   4 Ozone  18
## 5     5   5 Ozone  NA
## 6     5   6 Ozone  28

# Long to wide
dcast(m1, Month+Day~Var)

## Using Val as value column: use value.var to override.

##    Month Day Ozone Solar.R Wind Temp
## 1      5   1    41     190  7.4   67
## 2      5   2    36     118  8.0   72
## 3      5   3    12     149 12.6   74
## 4      5   4    18     313 11.5   62
## 5      5   5    NA      NA 14.3   56
## 6      5   6    28      NA 14.9   66
## 7      5   7    23     299  8.6   65
## 8      5   8    19      99 13.8   59
## 9      5   9     8      19 20.1   61
## 10     5  10    NA     194  8.6   69

In tidyr package

library(tidyr)
aq = airquality[1:10,]
aq

##    Ozone Solar.R Wind Temp Month Day
## 1     41     190  7.4   67     5   1
## 2     36     118  8.0   72     5   2
## 3     12     149 12.6   74     5   3
## 4     18     313 11.5   62     5   4
## 5     NA      NA 14.3   56     5   5
## 6     28      NA 14.9   66     5   6
## 7     23     299  8.6   65     5   7
## 8     19      99 13.8   59     5   8
## 9      8      19 20.1   61     5   9
## 10    NA     194  8.6   69     5  10

# Wide to long
m1 = gather(aq, Var, Val, -Month, -Day)
head(m1)

##   Month Day   Var Val
## 1     5   1 Ozone  41
## 2     5   2 Ozone  36
## 3     5   3 Ozone  12
## 4     5   4 Ozone  18
## 5     5   5 Ozone  NA
## 6     5   6 Ozone  28

# Long to wide
spread(m1, Var, Val)

##    Month Day Ozone Solar.R Wind Temp
## 1      5   1    41     190  7.4   67
## 2      5   2    36     118  8.0   72
## 3      5   3    12     149 12.6   74
## 4      5   4    18     313 11.5   62
## 5      5   5    NA      NA 14.3   56
## 6      5   6    28      NA 14.9   66
## 7      5   7    23     299  8.6   65
## 8      5   8    19      99 13.8   59
## 9      5   9     8      19 20.1   61
## 10     5  10    NA     194  8.6   69

`plyr` package for Split-Apply-Combine

plyr has functions for operating on lists, data.frames and arrays. Each function performs:

A splitting operation
An apply where a function is called on each split in turn.
A recombination of output data as a single data object.

INPUT/OUTPUT	Array	Data frame	List	Discarded
Array	aaply	adply	alply	a_ply
Data frame	daply	ddply	dlply	d_ply
list	laply	ldply	llply	l_ply

library(plyr)
d1 = chickwts
d1s = ddply(d1, c('feed'), summarise, mean.weight=mean(weight))
head(d1s)

##        feed mean.weight
## 1    casein    323.5833
## 2 horsebean    160.2000
## 3   linseed    218.7500
## 4  meatmeal    276.9091
## 5   soybean    246.4286
## 6 sunflower    328.9167

d1t = ddply(d1, c('feed'), transform, mean.weight=mean(weight))
head(d1t)

##   weight   feed mean.weight
## 1    368 casein    323.5833
## 2    390 casein    323.5833
## 3    379 casein    323.5833
## 4    260 casein    323.5833
## 5    404 casein    323.5833
## 6    318 casein    323.5833

# Coefficiet of Variation - Default function
d1f = ddply(d1, "feed", function(x) {
  mean.weight = mean(x$weight)
  sd.weight = sd(x$weight)
  cv = sd.weight/mean.weight
  data.frame(cv.weight = cv)
})

mutate is similar to transform but lets us built on calculated columns

d1m = ddply(d1, c('feed'), mutate, mean.weight=mean(weight), sd.weight=sd(weight), cv.weight=mean.weight/sd.weight)
head(d1m)

##   weight   feed mean.weight sd.weight cv.weight
## 1    368 casein    323.5833  64.43384  5.021947
## 2    390 casein    323.5833  64.43384  5.021947
## 3    379 casein    323.5833  64.43384  5.021947
## 4    260 casein    323.5833  64.43384  5.021947
## 5    404 casein    323.5833  64.43384  5.021947
## 6    318 casein    323.5833  64.43384  5.021947

Additional capabilities

Dealing with errors

You can use the failwith function to control how errors are dealt with.

f = function(x) if (x == 1) stop("Error!") else 1
safe.f = failwith(NA, f, quiet = TRUE)
# llply(1:2, f)
llply(1:2, safe.f)

## [[1]]
## [1] NA
## 
## [[2]]
## [1] 1

Parallel processing

In conjunction with a package such as doParallel you can run your function separately on each core of your computer. On a dual core machine this make your code up to twice as fast. Simply register the cores and then set .parallel = TRUE. Look at the elapsed time in these examples.

x = c(1:10)
wait = function(i) Sys.sleep(0.1)
system.time(llply(x, wait))

##    user  system elapsed 
##   0.005   0.001   1.036

system.time(sapply(x, wait))

##    user  system elapsed 
##   0.001   0.001   1.035

library(doParallel)
registerDoParallel(cores = 16)
system.time(llply(x, wait, .parallel = TRUE))

##    user  system elapsed 
##   0.019   0.025   0.127

Progress bars

By specifying .progress = 'text' to your plyr function parameters, plyr will display a progress bar for long-running plyr operations.

Print

The *_ply functions have a .print option which will automatically wrap your function in a print() statement so outputs get written to display devices such as PDFs, PNGs etc.

Reasons for not using `plyr`

plyr can be slow, particularly if you are working with very large datasets that involve a lot of subsetting. Other faster options are

Base R apply function

system.time(ddply(baseball, "id", summarize, length(year)))

##    user  system elapsed 
##   0.356   0.006   0.363

system.time(tapply(baseball$year, baseball$id,function(x) length(x)))

##    user  system elapsed 
##   0.006   0.000   0.006

data.table package

library(data.table)
dt = data.table(baseball, key = "id")
system.time(dt[, length(year), by = list(id)])

##    user  system elapsed 
##   0.004   0.000   0.004

dplyr package

library(dplyr)
system.time(baseball %>%
  group_by(id) %>%
  summarize(length(year)))

##    user  system elapsed 
##   0.005   0.000   0.005

References

xlsx package by Adrian A. Dragulescu
XML and RCurl package by Duncan Temple Lang
RJDBC package by Simon Urbanek
RMySQL package by Jeroen Ooms et. al.
data.table package by Matt Dowle
stringr, tidyr, reshape2,plyr and dplyr packages by Hadley Wickham
doParallel package by Revolution Analytics, Steve Weston

Data Collection and Data Blending

DS 501 - Worcester Polytechnic Institute

April 23, 2015

Setting Up

Reading Data

Writing Data

Know Thy System

Calculating Memory Requirements

Data files come from

Where you can get data

Downloading file from the web

Some notes about download.file()

Loading data you have saved - read.table()

Example - Cars93 data

Package `xlsx`

Getting data off webpages

Other packages for getting data off webpages

Connecting to Databases

`stringr` package

Packages `reshape2` and `tidyr`

`plyr` package for Split-Apply-Combine

Additional capabilities

Dealing with errors

Parallel processing

Progress bars

Print

Reasons for not using `plyr`

References

Data Collection and Data Blending

DS 501 - Worcester Polytechnic Institute

April 23, 2015

Setting Up

Reading Data

Writing Data

Know Thy System

Calculating Memory Requirements

Data files come from

Where you can get data

Downloading file from the web

Some notes about download.file()

Loading data you have saved - read.table()

Example - Cars93 data

Package xlsx

Getting data off webpages

Other packages for getting data off webpages

Connecting to Databases

stringr package

Packages reshape2 and tidyr

plyr package for Split-Apply-Combine

Additional capabilities

Dealing with errors

Parallel processing

Progress bars

Print

Reasons for not using plyr

References

Package `xlsx`

`stringr` package

Packages `reshape2` and `tidyr`

`plyr` package for Split-Apply-Combine

Reasons for not using `plyr`