Due: 11:59pm Feb 7, 2016
Follow these instructions paying full attention to details to get proper credit. Failure to do so will cost you points!
q_1a, q_2, q_4i, etc. Your answer code chunks will be evaluated using R as an expression, the value of which is your answer. Where you are asked to name the resulting variable, do so.eval options. It might be easier for you to split this markdown into pieces for Q1, Q2 and Q3 together, Q4 and Q5 while working on the problems. You are free to do so, but it is your responsibility to put things together as one file in the end.ass2.html) file by uploading it to stat290.stanford.edu. Put it under your private directory and ensure that its name is ass2.Rmd. NOTE: No other name, case exactly as noted, no other format!stat290.ass2_1.0.tar.gz, also on stat290.stanford.edu in your private directory. This is in addition to the ass2.Rmd file.1. In lecture 4 code, we showed how to exploit scoping rules (see person.html) and created the function makePerson. Here is the excerpted relevant part.
makePerson <- function(lastName = "", firstName = "") {
getLastName <- function() lastName
getFirstName <- function() firstName
setLastName <- function(what)
lastName <<- what
setFirstName <- function(what)
firstName <<- what
getFullName <- function()
paste(lastName, firstName, sep = ", ")
list(getFullName = getFullName,
getFirstName = getFirstName,
getLastName = getLastName,
setFirstName = setFirstName,
setLastName = setLastName)
}
1a. What are the local variables to makePerson? Your answers should be a character vector of the local variables like c("a", "b") with a and b replaced appropriately. Empty vectors should be left as c() as shown.
c("getLastName","getFirstName","setLastName","setFirstName","getFullName")
## [1] "getLastName" "getFirstName" "setLastName" "setFirstName"
## [5] "getFullName"
1b. What are the bound variables to makePerson? Same answer format as 1a.
c("lastName","firstName")
## [1] "lastName" "firstName"
1c. What are the local variables to setLastName?
c()
## NULL
1d. What are the bound variables to setLastName?
c("what")
## [1] "what"
1e. The above code can be used as follows:
p1 <- makePerson("Blow", "Joe")
p1$setLastName("Sixpack")
p1$getLastName()
## [1] "Sixpack"
Now suppose only the arguments of makePerson above are changed so that it now becomes:
makePerson <- function(x = "", y = "") {
getLastName <- function() lastName
getFirstName <- function() firstName
setLastName <- function(what)
lastName <<- what
setFirstName <- function(what)
firstName <<- what
getFullName <- function()
paste(lastName, firstName, sep = ", ")
list(getFullName = getFullName,
getFirstName = getFirstName,
getLastName = getLastName,
setFirstName = setFirstName,
setLastName = setLastName)
}
Now consider what happens when we use it in an R session.
p2 <- makePerson("Blow", "Joe")
p2$setLastName("Sixpack")
p2$getLastName()
## [1] "Sixpack"
Are the above two versions of makePerson equivalent? Answer TRUE if yes, FALSE if no. (Not quoted "TRUE" or "FALSE", just plain R values, without quotes.)
FALSE
## [1] FALSE
1f. Justify your answer to 1e (1-2 sentences). Your answer should be a quoted string (line breaks ok). If you need to use quotes within the answer, use single quote.
"These two 'makePerson' functions are not the same because of the difference in bound variables ('x' and 'y' in the modified versus 'firtName' and 'lastName' in the original). The reason that the answers came out to be the same is because the variable 'lastName' was created by deep binding ('<<-') in both cases and so it's available in the global environment when it was called. If let's say we call the 'getFirstName' function using the modified 'makePerson', we'd get an error because it was never assigned anything inside the modified function (it's an unbound variable in the modified version)."
2. Robert Carver in the paper describes a data cleaning exercise. This data, scraped from http://www.centennialofflight.net/chrono/log/1904HuffmanPrairie.htm is provided to you in 1904Prarie.rds as a data frame named d and also the paper as carver.pdf. Produce the numbers for table 2 on page 133 of the paper using simple one-liners that we will evaluate. Most of your numbers should match the table, although some will differ as noted. Hint: see help on sum, grep.
d <- readRDS("/Users/bwu/Desktop/Assignment-2/1904Prarie.rds")
2a. Total number of flights documented (will get 90 instead)
dim(subset(d, !(PILOT=="" & TIME=="" & DISTANCE=="" & ALTITUDE==""& REMARKS=="")))[1]
## [1] 93
2b. Some flight data reported (will get 87 instead)
dim(subset(d, !(PILOT=="" & TIME=="" & DISTANCE=="" & ALTITUDE=="")))[1]
## [1] 87
2c. Pilot identified
dim(subset(d, PILOT!=""))[1]
## [1] 78
2d. Time (flight duration)
dim(subset(d, TIME!=""))[1]
## [1] 68
2e. Distance (ground covered)
dim(subset(d, DISTANCE!=""))[1]
## [1] 80
2f. Altitude
dim(subset(d, ALTITUDE!=""))[1]
## [1] 1
2g. Both time and distance recorded
dim(subset(d, TIME!="" & DISTANCE!=""))[1]
## [1] 68
2h. Distance measured in feet
sum(grepl("ft", d$DISTANCE))
## [1] 26
2i. Distance measured in meters (will get 52 instead)
sum(grepl(" m", d$DISTANCE))
## [1] 52
2j. Other distance metric
length(grep(" m|ft", d$DISTANCE, invert=TRUE)) - dim(subset(d, DISTANCE==""))[1]
## [1] 2
2k. Distance estimated (e.g. “ca. 25 ft.”)
sum(grepl("ca.", d$DISTANCE))
## [1] 4
3. The 1904 Huffman Prairie data has a DATE column with some empty values. It would be nice to ensure that all columns have the appropriate date value, even if they are repeated.
3a. Modify the one date July to July 15.
d$DATE <- gsub("July","July 15",d$DATE)
3b. Modify the FLIGHT variable so that it is integer (NAs are ok).
d$FLIGHT <- as.integer(d$FLIGHT)
3c. Provide code that will ensure all DATE fields are populated correctly. There are many convoluted ways to do this, but the R way (in 3 short lines of code) would use the fact that FLIGHT is an integer in a computable range. This will help you detect rows that have a date versus those that don’t. Exploit this with the function cumsum. Your result should be:
[1] "May 26" "Jun 10" "Jun 21" "Jun 21" "Jun 21" "Jun 21" "Jun 23"
[8] "Jun 23" "Jun 23" "Jun 25" "July 15" "Aug 2" "Aug 2" "Aug 2"
...
[120] "Dec 1" "Dec 1" "Dec 1" "Dec 5" "Dec 5" "Dec 6" "Dec 6"
[127] "Dec 7" "Dec 7" "Dec 9" "Dec 9" "Dec 9"
missingRows <- which(is.na(d$FLIGHT))
filledDate <- rep(d$DATE[missingRows], times = diff(c(missingRows, length(d$DATE) + 1)))
filledDate
## [1] "May 26" "Jun 10" "Jun 21" "Jun 21" "Jun 21" "Jun 21" "Jun 23"
## [8] "Jun 23" "Jun 23" "Jun 25" "July 15" "Aug 2" "Aug 2" "Aug 2"
## [15] "Aug 4" "Aug 4" "Aug 4" "Aug 5" "Aug 5" "Aug 5" "Aug 6"
## [22] "Aug 6" "Aug 6" "Aug 6" "Aug 8" "Aug 8" "Aug 10" "Aug 10"
## [29] "Aug 10" "Aug 13" "Aug 13" "Aug 13" "Aug 13" "Aug 13" "Aug 16"
## [36] "Aug 16" "Aug 22" "Aug 22" "Aug 22" "Aug 22" "Aug 22" "Aug 23"
## [43] "Aug 23" "Aug 23" "Aug 24" "Aug 24" "Aug 24" "Sep 7" "Sep 7"
## [50] "Sep 7" "Sep 7" "Sep 9" "Sep 9" "Sep 9" "Sep 9" "Sep 13"
## [57] "Sep 13" "Sep 13" "Sep 14" "Sep 14" "Sep 15" "Sep 15" "Sep 15"
## [64] "Sep 20" "Sep 20" "Sep 20" "Sep 26" "Sep 26" "Sep 26" "Sep 27"
## [71] "Sep 27" "Sep 27" "Sep 27" "Sep 28" "Sep 28" "Sep 28" "Sep 30"
## [78] "Sep 30" "Oct 1" "Oct 1" "Oct 1" "Oct 4" "Oct 4" "Oct 4"
## [85] "Oct 11" "Oct 11" "Oct 13" "Oct 13" "Oct 13" "Oct 14" "Oct 14"
## [92] "Oct 14" "Oct 14" "Oct 15" "Oct 15" "Oct 26" "Oct 26" "Nov 2"
## [99] "Nov 2" "Nov 3" "Nov 3" "Nov 9" "Nov 9" "Nov 9" "Nov 9"
## [106] "Nov 16" "Nov 16" "Nov 16" "Nov 16" "Nov 16" "Nov 22" "Nov 22"
## [113] "Nov 25" "Nov 25" "Nov 25" "Nov 25" "Nov 25" "Nov 25" "Dec 1"
## [120] "Dec 1" "Dec 1" "Dec 1" "Dec 5" "Dec 5" "Dec 6" "Dec 6"
## [127] "Dec 7" "Dec 7" "Dec 9" "Dec 9" "Dec 9"
3d. Assuming that the year is 1904, convert the DATE column to an actual date like:
[1] "1904-05-26" "1904-06-10" "1904-06-21" "1904-06-21" "1904-06-21"
[6] "1904-06-21" "1904-06-23" "1904-06-23" "1904-06-23" "1904-06-25"
...
[126] "1904-12-06" "1904-12-07" "1904-12-07" "1904-12-09" "1904-12-09"
[131] "1904-12-09"
d$Date <- as.Date(unlist(lapply(filledDate, function(x) paste(x,"1904"))), "%b%d%Y")
d$Date
## [1] "1904-05-26" "1904-06-10" "1904-06-21" "1904-06-21" "1904-06-21"
## [6] "1904-06-21" "1904-06-23" "1904-06-23" "1904-06-23" "1904-06-25"
## [11] "1904-07-15" "1904-08-02" "1904-08-02" "1904-08-02" "1904-08-04"
## [16] "1904-08-04" "1904-08-04" "1904-08-05" "1904-08-05" "1904-08-05"
## [21] "1904-08-06" "1904-08-06" "1904-08-06" "1904-08-06" "1904-08-08"
## [26] "1904-08-08" "1904-08-10" "1904-08-10" "1904-08-10" "1904-08-13"
## [31] "1904-08-13" "1904-08-13" "1904-08-13" "1904-08-13" "1904-08-16"
## [36] "1904-08-16" "1904-08-22" "1904-08-22" "1904-08-22" "1904-08-22"
## [41] "1904-08-22" "1904-08-23" "1904-08-23" "1904-08-23" "1904-08-24"
## [46] "1904-08-24" "1904-08-24" "1904-09-07" "1904-09-07" "1904-09-07"
## [51] "1904-09-07" "1904-09-09" "1904-09-09" "1904-09-09" "1904-09-09"
## [56] "1904-09-13" "1904-09-13" "1904-09-13" "1904-09-14" "1904-09-14"
## [61] "1904-09-15" "1904-09-15" "1904-09-15" "1904-09-20" "1904-09-20"
## [66] "1904-09-20" "1904-09-26" "1904-09-26" "1904-09-26" "1904-09-27"
## [71] "1904-09-27" "1904-09-27" "1904-09-27" "1904-09-28" "1904-09-28"
## [76] "1904-09-28" "1904-09-30" "1904-09-30" "1904-10-01" "1904-10-01"
## [81] "1904-10-01" "1904-10-04" "1904-10-04" "1904-10-04" "1904-10-11"
## [86] "1904-10-11" "1904-10-13" "1904-10-13" "1904-10-13" "1904-10-14"
## [91] "1904-10-14" "1904-10-14" "1904-10-14" "1904-10-15" "1904-10-15"
## [96] "1904-10-26" "1904-10-26" "1904-11-02" "1904-11-02" "1904-11-03"
## [101] "1904-11-03" "1904-11-09" "1904-11-09" "1904-11-09" "1904-11-09"
## [106] "1904-11-16" "1904-11-16" "1904-11-16" "1904-11-16" "1904-11-16"
## [111] "1904-11-22" "1904-11-22" "1904-11-25" "1904-11-25" "1904-11-25"
## [116] "1904-11-25" "1904-11-25" "1904-11-25" "1904-12-01" "1904-12-01"
## [121] "1904-12-01" "1904-12-01" "1904-12-05" "1904-12-05" "1904-12-06"
## [126] "1904-12-06" "1904-12-07" "1904-12-07" "1904-12-09" "1904-12-09"
## [131] "1904-12-09"
dplyr manipulations and summaries4. For this exercise, download the SQLite database named 2012.sqlite3 (783Mb) from here. It has one table called ontime with columns described in Stat. Computing 2009 Expo site. Use dplyr for working with this data set.
Hints There are many functions in R that have no equivalents in SQL. It is advantageous to only bring summaries into R after doing as much processing in SQL as possible. The function collect in dplyr will collect the results of a SQL query and return the results into R as a tbl_df. Of course, the summary has to be of a reasonable size to fit into R, but once it is in R, you can use all the (vectorized) R functions to process them. So the idea is often to decide what can be done in SQL and what can be done in R to get the results desired. Remember dplyr works on SQL databses as well as data frames!
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
airline2012_db <- src_sqlite("/Users/bwu/Desktop/Assignment-2/2012.sqlite3")
ontime <- tbl(src_sqlite("/Users/bwu/Desktop/Assignment-2/2012.sqlite3"), "ontime")
4a. Construct a data frame (of class tbl_df in dplyr which is the result of a collect call) that contains the number of flights by each plane (identified by unique tail number). The result should have two columns named TailNum, count with highest count first. (Drop all empty tail numbers.) Your data frame should be named d4a.
library(magrittr)
d4a <- ontime %>% filter(TailNum != "") %>% group_by(TailNum) %>% summarize(count = n(TailNum)) %>% arrange(desc(count)) %>% collect
d4a
## Source: local data frame [4,721 x 2]
##
## TailNum count
## (chr) (int)
## 1 N486HA 4090
## 2 N477HA 3913
## 3 N485HA 3890
## 4 N479HA 3881
## 5 N476HA 3867
## 6 N484HA 3835
## 7 N487HA 3711
## 8 N481HA 3660
## 9 N488HA 3509
## 10 N475HA 3473
## .. ... ...
if (exists("d4a") && ("tbl_df" %in% class(d4a))) {
cat(sprintf("d4a has %d rows\n", nrow(d4a)))
} else {
cat("d4a missing!\n")
}
## d4a has 4721 rows
4b. Which airline did the plane that flew most frequently belong to? Expected answer is a single row tbl_df with column names TailNum and UniqueCarrier. Your answer data frame should be named d4b.
d4b <- ontime %>% filter(TailNum == "N486HA") %>% group_by(TailNum, UniqueCarrier) %>% summarize() %>% collect
d4b
## Source: local data frame [1 x 2]
## Groups: TailNum [1]
##
## TailNum UniqueCarrier
## (chr) (chr)
## 1 N486HA HA
if (exists("d4b") && ("tbl_df" %in% class(d4b))) {
cat(sprintf("d4b has %d rows\n", nrow(d4b)))
} else {
cat("d4b missing!\n")
}
## d4b has 1 rows
4c. Compute the number of flights of this frequently used plane between each origin, destination pair, and the average duration in terms of actual elapsed time (use columns ActualElapsedTime). Your result should be a three column tbl_df with columns airport1, airport2 and avg_ftime. The airport1 should be lexicographically less than airport2 for each row. Hint You can use pmin and pmax functions in R. Your answer should be named d4c.
d4c <- ontime %>%
filter(TailNum == "N486HA") %>%
group_by(Origin, Dest) %>%
summarize(total_flights = n(TailNum), total_time = sum(ActualElapsedTime)) %>%
transform(airport1 = pmax(Origin,Dest), airport2 = pmin(Origin, Dest)) %>%
group_by(airport1, airport2) %>%
summarize(grouped_time = sum(total_time), grouped_flights = sum(total_flights)) %>%
mutate(avg_ftime = grouped_time/grouped_flights) %>%
select(airport1, airport2, avg_ftime) %>%
collect
d4c
## Source: local data frame [7 x 3]
## Groups: airport1 [4]
##
## airport1 airport2 avg_ftime
## (fctr) (fctr) (dbl)
## 1 ITO HNL 50.85070
## 2 KOA HNL 43.77014
## 3 LIH HNL 35.73340
## 4 OGG HNL 36.39436
## 5 OGG ITO 38.42529
## 6 OGG KOA 33.04082
## 7 OGG LIH 46.27473
if (exists("d4c") && ("tbl_df" %in% class(d4c))) {
cat(sprintf("d4c has %d rows\n", nrow(d4c)))
} else {
cat("d4c missing!\n")
}
## d4c has 7 rows
4d. Construct a data frame that lists each origin, destination pair and the number of flights between them, either way. Your result named flights should be a three column tbl_df with columns airport1, airport2 and count. The airport1 should be lexicographically less than airport2 for each row.
flights <- ontime %>%
filter(TailNum != "") %>%
group_by(Origin, Dest) %>%
summarize(total_flights = n(TailNum)) %>%
transform(airport1 = pmax(Origin,Dest), airport2 = pmin(Origin, Dest)) %>%
group_by(airport1, airport2) %>%
summarize(count = sum(total_flights)) %>%
collect
flights
## Source: local data frame [2,380 x 3]
## Groups: airport1 [?]
##
## airport1 airport2 count
## (fctr) (fctr) (int)
## 1 ANC ADK 210
## 2 ANC ADQ 1340
## 3 ANC AKN 168
## 4 ATL ABE 1097
## 5 ATL ABQ 1841
## 6 ATL ABY 1998
## 7 ATL ACY 23
## 8 ATL AEX 2764
## 9 ATL AGS 6034
## 10 ATL ALB 1530
## .. ... ... ...
if (exists("flights") && ("tbl_df" %in% class(flights))) {
cat(sprintf("flights has %d rows\n", nrow(flights)))
} else {
cat("flights missing!\n")
}
## flights has 2380 rows
4e. Using the example of Nathan Yau at Flowing Data, use his code to produce a plot shown in figure below.
The code from Nathan is included here as a function so you don’t have to cut and paste. The file airports.RDS provided with this assignment which contains the airport latitudes and longitudes. The function doPlot below expects the flights data from question 4d above and the data frame containing the airport latitudes and longitudes. You need to install the packages maps and geosphere for this exercise.
installIfNeeded <- function(packages, ...) {
toInstall <- setdiff(packages, installed.packages()[, 1])
if (length(toInstall) > 0) {
install.packages(toInstall, ...)
}
}
installIfNeeded(c("maps", "geosphere"))
library(maps)
##
## # ATTENTION: maps v3.0 has an updated 'world' map. #
## # Many country borders and names have changed since 1990. #
## # Type '?world' or 'news(package="maps")'. See README_v3. #
library(geosphere)
## Loading required package: sp
## Nathan's code wrapped into a function
doPlot <- function(flightData, airportInfo) {
## SOURCE: Nathan Yau @ flowing data
## URL: http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles
xlim <- c(-171.738281, -56.601563)
ylim <- c(12.039321, 71.856229)
## Color
pal <- colorRampPalette(c("#333333", "white", "#1292db"))
colors <- pal(100)
map("world", col = "#191919", fill = TRUE, bg = "#000000", lwd = 0.05, xlim = xlim, ylim = ylim)
flightData <- flightData[order(flightData$count), ]
maxcount <- max(flightData$count)
for (j in seq_len(nrow(flightData))) {
air1 <- airportInfo[flightData$airport1[j], ]
air2 <- airportInfo[flightData$airport2[j], ]
inter <- gcIntermediate(c(air1[1, ]$long, air1[1, ]$lat),
c(air2[1, ]$long, air2[1, ]$lat), n = 100, addStartEnd = TRUE)
colindex <- round( (flightData$count[j] / maxcount) * length(colors) )
lines(inter, col = colors[colindex], lwd = 0.6)
}
}
Use doPlot on your data to produce the graphic.
airportInfo <- readRDS("/Users/bwu/Desktop/Assignment-2/airports.RDS")
doPlot(flights, airportInfo)
5. This exercise leads you through the package building process that will be necessary for project. You will be using some functions and data provided in class lectures. You only submit the resulting package file as noted above, no answers here. However, note the requirements.
5a. Your package should be named stat290.ass2.
5b. The version number of your package will be 1.0.
5c. You are provided a data set that will be part of the package. It is in bitly.RDS. As part of your package, you must include a dataset named bitly, so that when you package is used by a user as below, the following will work.
library(stat290.ass2)
data(bitly)
head(bitly)
5d. Your package should export two functions: one is the plotTzByOS function here and the other is the convolve following function. Use the function as is (no need to modify). Document this function briefly so that it passes checks and include an example in the documentation using the bitly data as shown in class in lecture 10.
plotTzByOS <- function(data, color = c("#999999", "#E69F00")) {
tz <- unlist(lapply(data, function(x) x$tz))
tz[tz == ""] <- "Unknown"
tzTable <- sort(table(tz), decreasing=TRUE)
d <- head(tzTable, 10)
d <- data.frame(tz = names(d), count = d, stringsAsFactors = FALSE)
d$tz <- factor(d$tz, levels=d$tz)
agents <- unlist(lapply(data, function(x) x$a))
win <- ifelse(grepl("Windows", agents), "Windows", "Nonwindows")
tzWin <- table(tz, win)
d2 <- data.frame(tzWin[as.character(d$tz), ])
d2$tz <- factor(rownames(d2), levels = rownames(d2))
d3 <- d2 %>% gather(key=os, value = "value", Nonwindows, Windows)
ggplot(d3, aes(x = tz, y = value, fill = os)) +
geom_bar(position = "fill", stat = "identity") +
scale_y_continuous(labels = percent_format()) +
scale_fill_manual(values = color) +
coord_flip()
}
5e. Also place the files convolve.R and convolve.c in the appropriate areas of the package source tree. Document convolve.R with an example. You can borrow from the example in lecture 5.
5f. Build your package, as will be shown in class and ensure that the resulting file, which will be stat290.ass2_1.0.tar.gz passes R CMD check with only NOTEs. No errors, no warnings.
R CMD check stat290.ass2_1.0.tar.gz