setwd, set the working directorygetwd, returns an absolute filepath representing the current working directorylist.files or dir, list the filessystem invokes the OS commandSome base R functions to read data
read.table, read.csv to read flat filesreadLines, for reading lines of a text filesource, for reading in R code filesload, for reading in saved workspacesdownload.file, for downloading file from webpageSimilar functions to write data
write.table, write.csv to wrote datawriteLines, for writing lines to a text filesave, writes an external representation of R objects to the specified file.In general, when using R with larger datasets, it’s useful to know a few things about your system.
I have a data frame with 2,000,000 rows and 150 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?
# bytes
2000000*150*8
## [1] 2.4e+09
# bytes/MB
2000000*150*8/2^{20}
## [1] 2288.818
# MB
round(2000000*150*8/2^{20},2)
## [1] 2288.82
# GB
round(2000000*150*8/2^{20}/1024, 2)
## [1] 2.24
wd = setwd('~/Syncplicity/WPI/Dingari Lectures/Lecture 2')
fileUrl = "https://raw.githubusercontent.com/dnchari/Rdatasets/master/csv/MASS/Cars93.csv"
download.file(fileUrl,destfile="./data/cars93.csv",method="curl")
list.files("./data")
## [1] "camera.xlsx" "cars93.csv" "species.csv" "surveys.csv"
dateDownloaded = date()
dateDownloaded
## [1] "Sat May 16 12:22:23 2015"
file, header, sep, row.names, nrowsread.csv, read.csv2cars = read.csv("./data/cars93.csv")
head(cars)
## X Manufacturer Model Type Min.Price Price Max.Price MPG.city
## 1 1 Acura Integra Small 12.9 15.9 18.8 25
## 2 2 Acura Legend Midsize 29.2 33.9 38.7 18
## 3 3 Audi 90 Compact 25.9 29.1 32.3 20
## 4 4 Audi 100 Midsize 30.8 37.7 44.6 19
## 5 5 BMW 535i Midsize 23.7 30.0 36.2 22
## 6 6 Buick Century Midsize 14.2 15.7 17.3 22
## MPG.highway AirBags DriveTrain Cylinders EngineSize
## 1 31 None Front 4 1.8
## 2 25 Driver & Passenger Front 6 3.2
## 3 26 Driver only Front 6 2.8
## 4 26 Driver & Passenger Front 6 2.8
## 5 30 Driver only Rear 4 3.5
## 6 31 Driver only Front 4 2.2
## Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1 140 6300 2890 Yes 13.2
## 2 200 5500 2335 Yes 18.0
## 3 172 5500 2280 Yes 16.9
## 4 172 5500 2535 Yes 21.1
## 5 208 5700 2545 Yes 21.1
## 6 110 5200 2565 No 16.4
## Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1 5 177 102 68 37 26.5
## 2 5 195 115 71 38 30.0
## 3 5 180 102 67 37 28.0
## 4 6 193 106 70 37 31.0
## 5 4 186 109 69 39 27.0
## 6 6 189 105 69 41 28.0
## Luggage.room Weight Origin Make
## 1 11 2705 non-USA Acura Integra
## 2 15 3560 non-USA Acura Legend
## 3 14 3375 non-USA Audi 90
## 4 17 3405 non-USA Audi 100
## 5 13 3640 non-USA BMW 535i
## 6 16 2880 USA Buick Century
xlsxread.xlsx, to read the contents of a worksheet into an R data.frame. read.xlsx(file, sheetIndex, sheetName=NULL, header=TRUE, ...)write.xlsx, to write a data.frame to an Excel workbook. write.xlsx(x, file, sheetName="Sheet1", row.names=TRUE, append=FALSE, ...)library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
fileUrl = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/camera.xlsx",method="curl")
cameraData = read.xlsx("./data/camera.xlsx",sheetIndex=1)
head(cameraData)
## address direction street crossStreet
## 1 S CATON AVE & BENSON AVE N/B Caton Ave Benson Ave
## 2 S CATON AVE & BENSON AVE S/B Caton Ave Benson Ave
## 3 WILKENS AVE & PINE HEIGHTS AVE E/B Wilkens Ave Pine Heights
## 4 THE ALAMEDA & E 33RD ST S/B The Alameda 33rd St
## 5 E 33RD ST & THE ALAMEDA E/B E 33rd The Alameda
## 6 ERDMAN AVE & N MACON ST E/B Erdman Macon St
## intersection Location.1
## 1 Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
## 2 Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
## 3 Wilkens Ave & Pine Heights (39.2720252302, -76.676960806)
## 4 The Alameda & 33rd St (39.3285013141, -76.5953545714)
## 5 E 33rd & The Alameda (39.3283410623, -76.5953594625)
## 6 Erdman & Macon St (39.3068045671, -76.5593167803)
cite = url('http://scholar.google.com/citations?user=dC24-QsAAAAJ')
library(XML)
htmlCode = readLines(cite)
## Warning in readLines(cite): incomplete final line found on
## 'http://scholar.google.com/citations?user=dC24-QsAAAAJ'
close(cite)
html3 = htmlTreeParse("http://scholar.google.com/citations?user=dC24-QsAAAAJ", useInternalNodes=T)
xpathSApply(html3, "//title", xmlValue)
## [1] "Narahara Chari Dingari - Google Scholar Citations"
XML, httr, RCurlRJDBC, RMySQLJDBC creates a new DBI driver that can be used to start JDBC connections.library(RJDBC)
# Oracle Database
drv = JDBC(driverClass = "oracle.jdbc.OracleDriver", classPath = "ojdbc/ojdbc6.jar", identifier.quote="`")
conn = dbConnect(drv, "jdbc:oracle:thin:username/password@myHost:1521:db")
# MySQL Database
drv = JDBC("com.mysql.jdbc.Driver", "driver/mysql-connector-java-5.1.34-bin.jar", identifier.quote="`")
conn = dbConnect(drv, "jdbc:mysql://myHost","username","password")
# RMySQL Package
library(RMySQL)
conn = dbConnect(MySQL(), user = 'myUserName', password = 'myPassword', host = 'myHost.com', dbname='dbName')
dbListTables(conn)
t1 = dbReadTable(conn, "table1")
t1Data = dbSendQuery(conn,"SELECT column1,column2,column2 FROM table1")
dbListTables(conn)
data1 = dbReadTable(conn, "Table_Name")
data2 = dbGetQuery(conn, "select * from SCHEMA.TABLE")
dbDisconnect(conn)
stringr packagelibrary(stringr)
wpi = "Worcester Polytechnic Institute"
str_sub(wpi, 1, 9)
## [1] "Worcester"
str_sub(wpi, end=9)
## [1] "Worcester"
str_sub(wpi, -9)
## [1] "Institute"
months = month.name
str_detect(months, "a") #Contains
## [1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE
months[str_detect(months, "a")]
## [1] "January" "February" "March" "May"
months[str_detect(months, "^a")] # Starts with
## character(0)
months[str_detect(months, ignore.case("^a"))] # Making case-insensitive
## [1] "April" "August"
months[str_detect(months, ignore.case("y$"))] # Ends with
## [1] "January" "February" "May" "July"
months[str_detect(months, ignore.case('[aeiou]'))] # Contains Vowels
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"
Another example
library(RCurl)
slink = 'https://raw.githubusercontent.com/dnchari/DS501/dnchari-functions/numbers2words.R'
n2w = getURL(slink, ssl.verifypeer = FALSE)
eval(parse(text = n2w),envir=.GlobalEnv)
numbers=numbers2words(1:99)
numbers
## [1] "one" "two" "three" "four"
## [5] "five" "six" "seven" "eight"
## [9] "nine" "ten" "eleven" "twelve"
## [13] "thirteen" "fourteen" "fifteen" "sixteen"
## [17] " seventeen" "eighteen" "nineteen" "twenty"
## [21] "twenty one" "twenty two" "twenty three" "twenty four"
## [25] "twenty five" "twenty six" "twenty seven" "twenty eight"
## [29] "twenty nine" "thirty" "thirty one" "thirty two"
## [33] "thirty three" "thirty four" "thirty five" "thirty six"
## [37] "thirty seven" "thirty eight" "thirty nine" "forty"
## [41] "forty one" "forty two" "forty three" "forty four"
## [45] "forty five" "forty six" "forty seven" "forty eight"
## [49] "forty nine" "fifty" "fifty one" "fifty two"
## [53] "fifty three" "fifty four" "fifty five" "fifty six"
## [57] "fifty seven" "fifty eight" "fifty nine" "sixty"
## [61] "sixty one" "sixty two" "sixty three" "sixty four"
## [65] "sixty five" "sixty six" "sixty seven" "sixty eight"
## [69] "sixty nine" "seventy" "seventy one" "seventy two"
## [73] "seventy three" "seventy four" "seventy five" "seventy six"
## [77] "seventy seven" "seventy eight" "seventy nine" "eighty"
## [81] "eighty one" "eighty two" "eighty three" "eighty four"
## [85] "eighty five" "eighty six" "eighty seven" "eighty eight"
## [89] "eighty nine" "ninety" "ninety one" "ninety two"
## [93] "ninety three" "ninety four" "ninety five" "ninety six"
## [97] "ninety seven" "ninety eight" "ninety nine"
Detect if “A B C D” are there in this string
numbers[str_detect(numbers, ignore.case('[abcd]'))]
## character(0)
numbers=numbers2words(100:105)
numbers[str_detect(numbers, ignore.case('[abcd]'))]
## [1] "one hundred" "one hundred one" "one hundred two"
## [4] "one hundred three" "one hundred four" "one hundred five"
str_count(months, ignore.case("[aeiou]"))
## [1] 3 3 1 2 1 2 1 3 3 3 3 3
str_extract, str_extract_all, str_length, str_subreshape2 and tidyrreshape2 packagelibrary(reshape2)
aq = airquality[1:10,]
aq
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
# Wide to long
melt(aq, id.vars = c("Month", "Day"))
## Month Day variable value
## 1 5 1 Ozone 41.0
## 2 5 2 Ozone 36.0
## 3 5 3 Ozone 12.0
## 4 5 4 Ozone 18.0
## 5 5 5 Ozone NA
## 6 5 6 Ozone 28.0
## 7 5 7 Ozone 23.0
## 8 5 8 Ozone 19.0
## 9 5 9 Ozone 8.0
## 10 5 10 Ozone NA
## 11 5 1 Solar.R 190.0
## 12 5 2 Solar.R 118.0
## 13 5 3 Solar.R 149.0
## 14 5 4 Solar.R 313.0
## 15 5 5 Solar.R NA
## 16 5 6 Solar.R NA
## 17 5 7 Solar.R 299.0
## 18 5 8 Solar.R 99.0
## 19 5 9 Solar.R 19.0
## 20 5 10 Solar.R 194.0
## 21 5 1 Wind 7.4
## 22 5 2 Wind 8.0
## 23 5 3 Wind 12.6
## 24 5 4 Wind 11.5
## 25 5 5 Wind 14.3
## 26 5 6 Wind 14.9
## 27 5 7 Wind 8.6
## 28 5 8 Wind 13.8
## 29 5 9 Wind 20.1
## 30 5 10 Wind 8.6
## 31 5 1 Temp 67.0
## 32 5 2 Temp 72.0
## 33 5 3 Temp 74.0
## 34 5 4 Temp 62.0
## 35 5 5 Temp 56.0
## 36 5 6 Temp 66.0
## 37 5 7 Temp 65.0
## 38 5 8 Temp 59.0
## 39 5 9 Temp 61.0
## 40 5 10 Temp 69.0
m1 = melt(aq, id.vars = c("Month", "Day"), variable.name='Var', value.name='Val')
head(m1)
## Month Day Var Val
## 1 5 1 Ozone 41
## 2 5 2 Ozone 36
## 3 5 3 Ozone 12
## 4 5 4 Ozone 18
## 5 5 5 Ozone NA
## 6 5 6 Ozone 28
# Long to wide
dcast(m1, Month+Day~Var)
## Using Val as value column: use value.var to override.
## Month Day Ozone Solar.R Wind Temp
## 1 5 1 41 190 7.4 67
## 2 5 2 36 118 8.0 72
## 3 5 3 12 149 12.6 74
## 4 5 4 18 313 11.5 62
## 5 5 5 NA NA 14.3 56
## 6 5 6 28 NA 14.9 66
## 7 5 7 23 299 8.6 65
## 8 5 8 19 99 13.8 59
## 9 5 9 8 19 20.1 61
## 10 5 10 NA 194 8.6 69
tidyr packagelibrary(tidyr)
aq = airquality[1:10,]
aq
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
# Wide to long
m1 = gather(aq, Var, Val, -Month, -Day)
head(m1)
## Month Day Var Val
## 1 5 1 Ozone 41
## 2 5 2 Ozone 36
## 3 5 3 Ozone 12
## 4 5 4 Ozone 18
## 5 5 5 Ozone NA
## 6 5 6 Ozone 28
# Long to wide
spread(m1, Var, Val)
## Month Day Ozone Solar.R Wind Temp
## 1 5 1 41 190 7.4 67
## 2 5 2 36 118 8.0 72
## 3 5 3 12 149 12.6 74
## 4 5 4 18 313 11.5 62
## 5 5 5 NA NA 14.3 56
## 6 5 6 28 NA 14.9 66
## 7 5 7 23 299 8.6 65
## 8 5 8 19 99 13.8 59
## 9 5 9 8 19 20.1 61
## 10 5 10 NA 194 8.6 69
plyr package for Split-Apply-Combineplyr has functions for operating on lists, data.frames and arrays. Each function performs:
apply where a function is called on each split in turn.| INPUT/OUTPUT | Array | Data frame | List | Discarded |
|---|---|---|---|---|
| Array | aaply | adply | alply | a_ply |
| Data frame | daply | ddply | dlply | d_ply |
| list | laply | ldply | llply | l_ply |
library(plyr)
d1 = chickwts
d1s = ddply(d1, c('feed'), summarise, mean.weight=mean(weight))
head(d1s)
## feed mean.weight
## 1 casein 323.5833
## 2 horsebean 160.2000
## 3 linseed 218.7500
## 4 meatmeal 276.9091
## 5 soybean 246.4286
## 6 sunflower 328.9167
d1t = ddply(d1, c('feed'), transform, mean.weight=mean(weight))
head(d1t)
## weight feed mean.weight
## 1 368 casein 323.5833
## 2 390 casein 323.5833
## 3 379 casein 323.5833
## 4 260 casein 323.5833
## 5 404 casein 323.5833
## 6 318 casein 323.5833
# Coefficiet of Variation - Default function
d1f = ddply(d1, "feed", function(x) {
mean.weight = mean(x$weight)
sd.weight = sd(x$weight)
cv = sd.weight/mean.weight
data.frame(cv.weight = cv)
})
mutate is similar to transform but lets us built on calculated columnsd1m = ddply(d1, c('feed'), mutate, mean.weight=mean(weight), sd.weight=sd(weight), cv.weight=mean.weight/sd.weight)
head(d1m)
## weight feed mean.weight sd.weight cv.weight
## 1 368 casein 323.5833 64.43384 5.021947
## 2 390 casein 323.5833 64.43384 5.021947
## 3 379 casein 323.5833 64.43384 5.021947
## 4 260 casein 323.5833 64.43384 5.021947
## 5 404 casein 323.5833 64.43384 5.021947
## 6 318 casein 323.5833 64.43384 5.021947
You can use the failwith function to control how errors are dealt with.
f = function(x) if (x == 1) stop("Error!") else 1
safe.f = failwith(NA, f, quiet = TRUE)
# llply(1:2, f)
llply(1:2, safe.f)
## [[1]]
## [1] NA
##
## [[2]]
## [1] 1
In conjunction with a package such as doParallel you can run your function separately on each core of your computer. On a dual core machine this make your code up to twice as fast. Simply register the cores and then set .parallel = TRUE. Look at the elapsed time in these examples.
x = c(1:10)
wait = function(i) Sys.sleep(0.1)
system.time(llply(x, wait))
## user system elapsed
## 0.005 0.001 1.036
system.time(sapply(x, wait))
## user system elapsed
## 0.001 0.001 1.035
library(doParallel)
registerDoParallel(cores = 16)
system.time(llply(x, wait, .parallel = TRUE))
## user system elapsed
## 0.019 0.025 0.127
By specifying .progress = 'text' to your plyr function parameters, plyr will display a progress bar for long-running plyr operations.
The *_ply functions have a .print option which will automatically wrap your function in a print() statement so outputs get written to display devices such as PDFs, PNGs etc.
plyrplyr can be slow, particularly if you are working with very large datasets that involve a lot of subsetting. Other faster options are
R apply functionsystem.time(ddply(baseball, "id", summarize, length(year)))
## user system elapsed
## 0.356 0.006 0.363
system.time(tapply(baseball$year, baseball$id,function(x) length(x)))
## user system elapsed
## 0.006 0.000 0.006
data.table packagelibrary(data.table)
dt = data.table(baseball, key = "id")
system.time(dt[, length(year), by = list(id)])
## user system elapsed
## 0.004 0.000 0.004
dplyr packagelibrary(dplyr)
system.time(baseball %>%
group_by(id) %>%
summarize(length(year)))
## user system elapsed
## 0.005 0.000 0.005
xlsx package by Adrian A. DragulescuXML and RCurl package by Duncan Temple LangRJDBC package by Simon UrbanekRMySQL package by Jeroen Ooms et. al.data.table package by Matt Dowlestringr, tidyr, reshape2,plyr and dplyr packages by Hadley WickhamdoParallel package by Revolution Analytics, Steve Weston