Stat 290: Assignment 2

Due: 11:59pm Feb 7, 2016

Instructions

Follow these instructions paying full attention to details to get proper credit. Failure to do so will cost you points!

R version 3.2.3 should be used for all exercises.
This problem set uses data that is provided to you. You’ll have to download the SQLite database as noted for question 4. They are all expected to be in the same directory as this markdown.
Fill in your answers into the markdown chunks named q_1a, q_2, q_4i, etc. Your answer code chunks will be evaluated using R as an expression, the value of which is your answer. Where you are asked to name the resulting variable, do so.
If you are unable to answer a question, leave the corresponding chunk blank, as in the original file.
Do not edit or shorten this markdown file, or rename code chunks, named or unnamed, or modify eval options. It might be easier for you to split this markdown into pieces for Q1, Q2 and Q3 together, Q4 and Q5 while working on the problems. You are free to do so, but it is your responsibility to put things together as one file in the end.
Check that you can knit this markdown in Rstudio without error before you enter anything, and, after you enter your answers.
Submit the markdown file (not the html file ass2.html) file by uploading it to stat290.stanford.edu. Put it under your private directory and ensure that its name is ass2.Rmd. NOTE: No other name, case exactly as noted, no other format!
For question 5, you will submit a file that will be named (exactly!) stat290.ass2_1.0.tar.gz, also on stat290.stanford.edu in your private directory. This is in addition to the ass2.Rmd file.
Your submission should be in the right place by the due date and time.

Q1 (Short answers)

1. In lecture 4 code, we showed how to exploit scoping rules (see person.html) and created the function makePerson. Here is the excerpted relevant part.

makePerson <- function(lastName = "", firstName = "") {
  getLastName <- function() lastName
  getFirstName <- function() firstName
  setLastName <- function(what)
      lastName <<- what
  setFirstName <- function(what)
      firstName <<- what
  getFullName <- function()
      paste(lastName, firstName, sep = ", ")
  list(getFullName = getFullName,
       getFirstName = getFirstName,
       getLastName = getLastName,
       setFirstName = setFirstName,
       setLastName = setLastName)
}

1a. What are the local variables to makePerson? Your answers should be a character vector of the local variables like c("a", "b") with a and b replaced appropriately. Empty vectors should be left as c() as shown.

c("getLastName","getFirstName","setLastName","setFirstName","getFullName")

## [1] "getLastName"  "getFirstName" "setLastName"  "setFirstName"
## [5] "getFullName"

1b. What are the bound variables to makePerson? Same answer format as 1a.

c("lastName","firstName")

## [1] "lastName"  "firstName"

1c. What are the local variables to setLastName?

c()

## NULL

1d. What are the bound variables to setLastName?

c("what")

## [1] "what"

1e. The above code can be used as follows:

p1 <- makePerson("Blow", "Joe")
p1$setLastName("Sixpack")
p1$getLastName()

## [1] "Sixpack"

Now suppose only the arguments of makePerson above are changed so that it now becomes:

makePerson <- function(x = "", y = "") {
    getLastName <- function() lastName
    getFirstName <- function() firstName
    setLastName <- function(what)
        lastName <<- what
    setFirstName <- function(what)
        firstName <<- what
    getFullName <- function()
        paste(lastName, firstName, sep = ", ")
    list(getFullName = getFullName,
         getFirstName = getFirstName,
         getLastName = getLastName,
         setFirstName = setFirstName,
         setLastName = setLastName)
}

Now consider what happens when we use it in an R session.

p2 <- makePerson("Blow", "Joe")
p2$setLastName("Sixpack")
p2$getLastName()

## [1] "Sixpack"

Are the above two versions of makePerson equivalent? Answer TRUE if yes, FALSE if no. (Not quoted "TRUE" or "FALSE", just plain R values, without quotes.)

FALSE

## [1] FALSE

1f. Justify your answer to 1e (1-2 sentences). Your answer should be a quoted string (line breaks ok). If you need to use quotes within the answer, use single quote.

"These two 'makePerson' functions are not the same because of the difference in bound variables ('x' and 'y' in the modified versus 'firtName' and 'lastName' in the original). The reason that the answers came out to be the same is because the variable 'lastName' was created by deep binding ('<<-') in both cases and so it's available in the global environment when it was called. If let's say we call the 'getFirstName' function using the modified 'makePerson', we'd get an error because it was never assigned anything inside the modified function (it's an unbound variable in the modified version)."

Q2. Data Summary

2. Robert Carver in the paper describes a data cleaning exercise. This data, scraped from http://www.centennialofflight.net/chrono/log/1904HuffmanPrairie.htm is provided to you in 1904Prarie.rds as a data frame named d and also the paper as carver.pdf. Produce the numbers for table 2 on page 133 of the paper using simple one-liners that we will evaluate. Most of your numbers should match the table, although some will differ as noted. Hint: see help on sum, grep.

d <- readRDS("/Users/bwu/Desktop/Assignment-2/1904Prarie.rds")

2a. Total number of flights documented (will get 90 instead)

dim(subset(d, !(PILOT=="" & TIME=="" & DISTANCE=="" & ALTITUDE==""& REMARKS=="")))[1]

## [1] 93

2b. Some flight data reported (will get 87 instead)

dim(subset(d, !(PILOT=="" & TIME=="" & DISTANCE=="" & ALTITUDE=="")))[1]

## [1] 87

2c. Pilot identified

dim(subset(d, PILOT!=""))[1]

## [1] 78

2d. Time (flight duration)

dim(subset(d, TIME!=""))[1]

## [1] 68

2e. Distance (ground covered)

dim(subset(d, DISTANCE!=""))[1]

## [1] 80

2f. Altitude

dim(subset(d, ALTITUDE!=""))[1]

## [1] 1

2g. Both time and distance recorded

dim(subset(d, TIME!="" & DISTANCE!=""))[1]

## [1] 68

2h. Distance measured in feet

sum(grepl("ft", d$DISTANCE))

## [1] 26

2i. Distance measured in meters (will get 52 instead)

sum(grepl(" m", d$DISTANCE))

## [1] 52

2j. Other distance metric

length(grep(" m|ft", d$DISTANCE, invert=TRUE)) - dim(subset(d, DISTANCE==""))[1]

## [1] 2

2k. Distance estimated (e.g. “ca. 25 ft.”)

sum(grepl("ca.", d$DISTANCE))

## [1] 4

Q3. Data Cleanup

3. The 1904 Huffman Prairie data has a DATE column with some empty values. It would be nice to ensure that all columns have the appropriate date value, even if they are repeated.

3a. Modify the one date July to July 15.

d$DATE <- gsub("July","July 15",d$DATE)

3b. Modify the FLIGHT variable so that it is integer (NAs are ok).

d$FLIGHT <- as.integer(d$FLIGHT)

3c. Provide code that will ensure all DATE fields are populated correctly. There are many convoluted ways to do this, but the R way (in 3 short lines of code) would use the fact that FLIGHT is an integer in a computable range. This will help you detect rows that have a date versus those that don’t. Exploit this with the function cumsum. Your result should be:

  [1] "May 26"  "Jun 10"  "Jun 21"  "Jun 21"  "Jun 21"  "Jun 21"  "Jun 23"
  [8] "Jun 23"  "Jun 23"  "Jun 25"  "July 15" "Aug 2"   "Aug 2"   "Aug 2"
...
[120] "Dec 1"   "Dec 1"   "Dec 1"   "Dec 5"   "Dec 5"   "Dec 6"   "Dec 6"
[127] "Dec 7"   "Dec 7"   "Dec 9"   "Dec 9"   "Dec 9"

missingRows <- which(is.na(d$FLIGHT))
filledDate <- rep(d$DATE[missingRows], times = diff(c(missingRows, length(d$DATE) + 1)))
filledDate

##   [1] "May 26"  "Jun 10"  "Jun 21"  "Jun 21"  "Jun 21"  "Jun 21"  "Jun 23" 
##   [8] "Jun 23"  "Jun 23"  "Jun 25"  "July 15" "Aug 2"   "Aug 2"   "Aug 2"  
##  [15] "Aug 4"   "Aug 4"   "Aug 4"   "Aug 5"   "Aug 5"   "Aug 5"   "Aug 6"  
##  [22] "Aug 6"   "Aug 6"   "Aug 6"   "Aug 8"   "Aug 8"   "Aug 10"  "Aug 10" 
##  [29] "Aug 10"  "Aug 13"  "Aug 13"  "Aug 13"  "Aug 13"  "Aug 13"  "Aug 16" 
##  [36] "Aug 16"  "Aug 22"  "Aug 22"  "Aug 22"  "Aug 22"  "Aug 22"  "Aug 23" 
##  [43] "Aug 23"  "Aug 23"  "Aug 24"  "Aug 24"  "Aug 24"  "Sep 7"   "Sep 7"  
##  [50] "Sep 7"   "Sep 7"   "Sep 9"   "Sep 9"   "Sep 9"   "Sep 9"   "Sep 13" 
##  [57] "Sep 13"  "Sep 13"  "Sep 14"  "Sep 14"  "Sep 15"  "Sep 15"  "Sep 15" 
##  [64] "Sep 20"  "Sep 20"  "Sep 20"  "Sep 26"  "Sep 26"  "Sep 26"  "Sep 27" 
##  [71] "Sep 27"  "Sep 27"  "Sep 27"  "Sep 28"  "Sep 28"  "Sep 28"  "Sep 30" 
##  [78] "Sep 30"  "Oct 1"   "Oct 1"   "Oct 1"   "Oct 4"   "Oct 4"   "Oct 4"  
##  [85] "Oct 11"  "Oct 11"  "Oct 13"  "Oct 13"  "Oct 13"  "Oct 14"  "Oct 14" 
##  [92] "Oct 14"  "Oct 14"  "Oct 15"  "Oct 15"  "Oct 26"  "Oct 26"  "Nov 2"  
##  [99] "Nov 2"   "Nov 3"   "Nov 3"   "Nov 9"   "Nov 9"   "Nov 9"   "Nov 9"  
## [106] "Nov 16"  "Nov 16"  "Nov 16"  "Nov 16"  "Nov 16"  "Nov 22"  "Nov 22" 
## [113] "Nov 25"  "Nov 25"  "Nov 25"  "Nov 25"  "Nov 25"  "Nov 25"  "Dec 1"  
## [120] "Dec 1"   "Dec 1"   "Dec 1"   "Dec 5"   "Dec 5"   "Dec 6"   "Dec 6"  
## [127] "Dec 7"   "Dec 7"   "Dec 9"   "Dec 9"   "Dec 9"

3d. Assuming that the year is 1904, convert the DATE column to an actual date like:

  [1] "1904-05-26" "1904-06-10" "1904-06-21" "1904-06-21" "1904-06-21"
  [6] "1904-06-21" "1904-06-23" "1904-06-23" "1904-06-23" "1904-06-25"
...
[126] "1904-12-06" "1904-12-07" "1904-12-07" "1904-12-09" "1904-12-09"
[131] "1904-12-09"

d$Date <- as.Date(unlist(lapply(filledDate, function(x) paste(x,"1904"))), "%b%d%Y")
d$Date

##   [1] "1904-05-26" "1904-06-10" "1904-06-21" "1904-06-21" "1904-06-21"
##   [6] "1904-06-21" "1904-06-23" "1904-06-23" "1904-06-23" "1904-06-25"
##  [11] "1904-07-15" "1904-08-02" "1904-08-02" "1904-08-02" "1904-08-04"
##  [16] "1904-08-04" "1904-08-04" "1904-08-05" "1904-08-05" "1904-08-05"
##  [21] "1904-08-06" "1904-08-06" "1904-08-06" "1904-08-06" "1904-08-08"
##  [26] "1904-08-08" "1904-08-10" "1904-08-10" "1904-08-10" "1904-08-13"
##  [31] "1904-08-13" "1904-08-13" "1904-08-13" "1904-08-13" "1904-08-16"
##  [36] "1904-08-16" "1904-08-22" "1904-08-22" "1904-08-22" "1904-08-22"
##  [41] "1904-08-22" "1904-08-23" "1904-08-23" "1904-08-23" "1904-08-24"
##  [46] "1904-08-24" "1904-08-24" "1904-09-07" "1904-09-07" "1904-09-07"
##  [51] "1904-09-07" "1904-09-09" "1904-09-09" "1904-09-09" "1904-09-09"
##  [56] "1904-09-13" "1904-09-13" "1904-09-13" "1904-09-14" "1904-09-14"
##  [61] "1904-09-15" "1904-09-15" "1904-09-15" "1904-09-20" "1904-09-20"
##  [66] "1904-09-20" "1904-09-26" "1904-09-26" "1904-09-26" "1904-09-27"
##  [71] "1904-09-27" "1904-09-27" "1904-09-27" "1904-09-28" "1904-09-28"
##  [76] "1904-09-28" "1904-09-30" "1904-09-30" "1904-10-01" "1904-10-01"
##  [81] "1904-10-01" "1904-10-04" "1904-10-04" "1904-10-04" "1904-10-11"
##  [86] "1904-10-11" "1904-10-13" "1904-10-13" "1904-10-13" "1904-10-14"
##  [91] "1904-10-14" "1904-10-14" "1904-10-14" "1904-10-15" "1904-10-15"
##  [96] "1904-10-26" "1904-10-26" "1904-11-02" "1904-11-02" "1904-11-03"
## [101] "1904-11-03" "1904-11-09" "1904-11-09" "1904-11-09" "1904-11-09"
## [106] "1904-11-16" "1904-11-16" "1904-11-16" "1904-11-16" "1904-11-16"
## [111] "1904-11-22" "1904-11-22" "1904-11-25" "1904-11-25" "1904-11-25"
## [116] "1904-11-25" "1904-11-25" "1904-11-25" "1904-12-01" "1904-12-01"
## [121] "1904-12-01" "1904-12-01" "1904-12-05" "1904-12-05" "1904-12-06"
## [126] "1904-12-06" "1904-12-07" "1904-12-07" "1904-12-09" "1904-12-09"
## [131] "1904-12-09"

Q4. `dplyr` manipulations and summaries

4. For this exercise, download the SQLite database named 2012.sqlite3 (783Mb) from here. It has one table called ontime with columns described in Stat. Computing 2009 Expo site. Use dplyr for working with this data set.

Hints There are many functions in R that have no equivalents in SQL. It is advantageous to only bring summaries into R after doing as much processing in SQL as possible. The function collect in dplyr will collect the results of a SQL query and return the results into R as a tbl_df. Of course, the summary has to be of a reasonable size to fit into R, but once it is in R, you can use all the (vectorized) R functions to process them. So the idea is often to decide what can be done in SQL and what can be done in R to get the results desired. Remember dplyr works on SQL databses as well as data frames!

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

airline2012_db <- src_sqlite("/Users/bwu/Desktop/Assignment-2/2012.sqlite3")
ontime <- tbl(src_sqlite("/Users/bwu/Desktop/Assignment-2/2012.sqlite3"), "ontime")

4a. Construct a data frame (of class tbl_df in dplyr which is the result of a collect call) that contains the number of flights by each plane (identified by unique tail number). The result should have two columns named TailNum, count with highest count first. (Drop all empty tail numbers.) Your data frame should be named d4a.

library(magrittr)
d4a <- ontime %>% filter(TailNum != "") %>% group_by(TailNum) %>% summarize(count = n(TailNum)) %>% arrange(desc(count)) %>% collect
d4a

## Source: local data frame [4,721 x 2]
## 
##    TailNum count
##      (chr) (int)
## 1   N486HA  4090
## 2   N477HA  3913
## 3   N485HA  3890
## 4   N479HA  3881
## 5   N476HA  3867
## 6   N484HA  3835
## 7   N487HA  3711
## 8   N481HA  3660
## 9   N488HA  3509
## 10  N475HA  3473
## ..     ...   ...

if (exists("d4a") && ("tbl_df" %in% class(d4a))) {
    cat(sprintf("d4a has %d rows\n", nrow(d4a)))
} else {
    cat("d4a missing!\n")
}

## d4a has 4721 rows

4b. Which airline did the plane that flew most frequently belong to? Expected answer is a single row tbl_df with column names TailNum and UniqueCarrier. Your answer data frame should be named d4b.

d4b <- ontime %>% filter(TailNum == "N486HA") %>% group_by(TailNum, UniqueCarrier) %>% summarize() %>% collect
d4b

## Source: local data frame [1 x 2]
## Groups: TailNum [1]
## 
##   TailNum UniqueCarrier
##     (chr)         (chr)
## 1  N486HA            HA

if (exists("d4b") && ("tbl_df" %in% class(d4b))) {
    cat(sprintf("d4b has %d rows\n", nrow(d4b)))
} else {
    cat("d4b missing!\n")
}

## d4b has 1 rows

4c. Compute the number of flights of this frequently used plane between each origin, destination pair, and the average duration in terms of actual elapsed time (use columns ActualElapsedTime). Your result should be a three column tbl_df with columns airport1, airport2 and avg_ftime. The airport1 should be lexicographically less than airport2 for each row. Hint You can use pmin and pmax functions in R. Your answer should be named d4c.

d4c <- ontime %>% 
  filter(TailNum == "N486HA") %>% 
  group_by(Origin, Dest) %>% 
  summarize(total_flights = n(TailNum), total_time = sum(ActualElapsedTime)) %>%
  transform(airport1 = pmax(Origin,Dest), airport2 = pmin(Origin, Dest)) %>%
  group_by(airport1, airport2) %>% 
  summarize(grouped_time = sum(total_time), grouped_flights = sum(total_flights)) %>% 
  mutate(avg_ftime = grouped_time/grouped_flights) %>% 
  select(airport1, airport2, avg_ftime) %>% 
  collect

d4c

## Source: local data frame [7 x 3]
## Groups: airport1 [4]
## 
##   airport1 airport2 avg_ftime
##     (fctr)   (fctr)     (dbl)
## 1      ITO      HNL  50.85070
## 2      KOA      HNL  43.77014
## 3      LIH      HNL  35.73340
## 4      OGG      HNL  36.39436
## 5      OGG      ITO  38.42529
## 6      OGG      KOA  33.04082
## 7      OGG      LIH  46.27473

if (exists("d4c") && ("tbl_df" %in% class(d4c))) {
    cat(sprintf("d4c has %d rows\n", nrow(d4c)))
} else {
    cat("d4c missing!\n")
}

## d4c has 7 rows

4d. Construct a data frame that lists each origin, destination pair and the number of flights between them, either way. Your result named flights should be a three column tbl_df with columns airport1, airport2 and count. The airport1 should be lexicographically less than airport2 for each row.

flights <- ontime %>% 
      filter(TailNum != "") %>% 
      group_by(Origin, Dest) %>% 
      summarize(total_flights = n(TailNum)) %>%
      transform(airport1 = pmax(Origin,Dest), airport2 = pmin(Origin, Dest)) %>%
      group_by(airport1, airport2) %>% 
      summarize(count = sum(total_flights)) %>% 
      collect

flights

## Source: local data frame [2,380 x 3]
## Groups: airport1 [?]
## 
##    airport1 airport2 count
##      (fctr)   (fctr) (int)
## 1       ANC      ADK   210
## 2       ANC      ADQ  1340
## 3       ANC      AKN   168
## 4       ATL      ABE  1097
## 5       ATL      ABQ  1841
## 6       ATL      ABY  1998
## 7       ATL      ACY    23
## 8       ATL      AEX  2764
## 9       ATL      AGS  6034
## 10      ATL      ALB  1530
## ..      ...      ...   ...

if (exists("flights") && ("tbl_df" %in% class(flights))) {
    cat(sprintf("flights has %d rows\n", nrow(flights)))
} else {
    cat("flights missing!\n")
}

## flights has 2380 rows

4e. Using the example of Nathan Yau at Flowing Data, use his code to produce a plot shown in figure below.

The code from Nathan is included here as a function so you don’t have to cut and paste. The file airports.RDS provided with this assignment which contains the airport latitudes and longitudes. The function doPlot below expects the flights data from question 4d above and the data frame containing the airport latitudes and longitudes. You need to install the packages maps and geosphere for this exercise.

installIfNeeded <- function(packages, ...) {
    toInstall <- setdiff(packages, installed.packages()[, 1])
    if (length(toInstall) > 0) {
        install.packages(toInstall, ...)
    }
}
installIfNeeded(c("maps", "geosphere"))
library(maps)

## 
##  # ATTENTION: maps v3.0 has an updated 'world' map.        #
##  # Many country borders and names have changed since 1990. #
##  # Type '?world' or 'news(package="maps")'. See README_v3. #

library(geosphere)

## Loading required package: sp

## Nathan's code wrapped into a function
doPlot <- function(flightData, airportInfo) {
    ## SOURCE: Nathan Yau @ flowing data
    ## URL: http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles
    xlim <- c(-171.738281, -56.601563)
    ylim <- c(12.039321, 71.856229)

    ## Color
    pal <- colorRampPalette(c("#333333", "white", "#1292db"))
    colors <- pal(100)

    map("world", col = "#191919", fill = TRUE, bg = "#000000", lwd = 0.05, xlim = xlim, ylim = ylim)
    flightData <- flightData[order(flightData$count), ]
    maxcount <- max(flightData$count)
    for (j in seq_len(nrow(flightData))) {
        air1 <- airportInfo[flightData$airport1[j], ]
        air2 <- airportInfo[flightData$airport2[j], ]
        inter <- gcIntermediate(c(air1[1, ]$long, air1[1, ]$lat),
                                c(air2[1, ]$long, air2[1, ]$lat), n = 100, addStartEnd = TRUE)
        colindex <- round( (flightData$count[j] / maxcount) * length(colors) )
        lines(inter, col = colors[colindex], lwd = 0.6)
    }
}

Use doPlot on your data to produce the graphic.

airportInfo <- readRDS("/Users/bwu/Desktop/Assignment-2/airports.RDS")
doPlot(flights, airportInfo)

Q5. Package building and checking

5. This exercise leads you through the package building process that will be necessary for project. You will be using some functions and data provided in class lectures. You only submit the resulting package file as noted above, no answers here. However, note the requirements.

5a. Your package should be named stat290.ass2.

5b. The version number of your package will be 1.0.

5c. You are provided a data set that will be part of the package. It is in bitly.RDS. As part of your package, you must include a dataset named bitly, so that when you package is used by a user as below, the following will work.

library(stat290.ass2)
data(bitly)
head(bitly)

5d. Your package should export two functions: one is the plotTzByOS function here and the other is the convolve following function. Use the function as is (no need to modify). Document this function briefly so that it passes checks and include an example in the documentation using the bitly data as shown in class in lecture 10.

plotTzByOS <- function(data, color = c("#999999", "#E69F00")) {
    tz <- unlist(lapply(data, function(x) x$tz))
    tz[tz == ""] <- "Unknown"
    tzTable <- sort(table(tz), decreasing=TRUE)
    d <- head(tzTable, 10)
    d <- data.frame(tz = names(d), count = d, stringsAsFactors = FALSE)
    d$tz <- factor(d$tz, levels=d$tz)
    agents <- unlist(lapply(data, function(x) x$a))
    win <- ifelse(grepl("Windows", agents), "Windows", "Nonwindows")

    tzWin <- table(tz, win)
    d2 <- data.frame(tzWin[as.character(d$tz), ])
    d2$tz <- factor(rownames(d2), levels = rownames(d2))
    d3 <- d2 %>% gather(key=os, value = "value", Nonwindows, Windows)
    ggplot(d3, aes(x = tz, y = value, fill = os)) +
        geom_bar(position = "fill", stat = "identity") +
        scale_y_continuous(labels = percent_format()) +
        scale_fill_manual(values = color) +
        coord_flip()
}

5e. Also place the files convolve.R and convolve.c in the appropriate areas of the package source tree. Document convolve.R with an example. You can borrow from the example in lecture 5.

5f. Build your package, as will be shown in class and ensure that the resulting file, which will be stat290.ass2_1.0.tar.gz passes R CMD check with only NOTEs. No errors, no warnings.

R CMD check stat290.ass2_1.0.tar.gz