Assignment5

Load libraries:

tidyr tidies the data
dplyr transforms the data

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Read the .CSV file:

fileData <- read.csv(file="//Users/suma/Desktop/CUNY SPS - Masters Data Science/Data 607/Assignment5/ArrivalDelays.csv", header = TRUE)

## Warning in read.table(file = file, header = header, sep = sep, quote
## = quote, : incomplete final line found by readTableHeader on '//Users/
## suma/Desktop/CUNY SPS - Masters Data Science/Data 607/Assignment5/
## ArrivalDelays.csv'

head(fileData)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61

Tidy the data using tidyr:

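Here, we use tidyr's gather function to reshape the data from wide to long. In tidy data, each case has its own row, each variable its own column, and each value its own cell. gather collapses the five destination columns (columns 3 to 7) into two columns: Destination, which holds the former column names, and ArrivalDelay, which holds the values.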

fileData <- gather(fileData, Destination, ArrivalDelay, 3:7)
names(fileData) <- c("Airline", "Status", "Destination", "ArrivalDelay")

fileData

##    Airline  Status   Destination ArrivalDelay
## 1   ALASKA on time   Los.Angeles          497
## 2          delayed   Los.Angeles           62
## 3  AM WEST on time   Los.Angeles          694
## 4          delayed   Los.Angeles          117
## 5   ALASKA on time       Phoenix          221
## 6          delayed       Phoenix           12
## 7  AM WEST on time       Phoenix         4840
## 8          delayed       Phoenix          415
## 9   ALASKA on time     San.Diego          212
## 10         delayed     San.Diego           20
## 11 AM WEST on time     San.Diego          383
## 12         delayed     San.Diego           65
## 13  ALASKA on time San.Francisco          503
## 14         delayed San.Francisco          102
## 15 AM WEST on time San.Francisco          320
## 16         delayed San.Francisco          129
## 17  ALASKA on time       Seattle         1841
## 18         delayed       Seattle          305
## 19 AM WEST on time       Seattle          201
## 20         delayed       Seattle           61
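
The long table above still has a blank Airline in every second row. As a minimal sketch of one way to handle that, the example below rebuilds the four rows printed by head(fileData) as an inline data frame (the name `wide` and the already-renamed Airline and Status columns are assumptions made for illustration), fills the blanks downward with tidyr's fill, and then gathers the destination columns.

```r
library(tidyr)

# Inline copy of the four rows printed by head(fileData), with the first two
# columns already named Airline and Status (an assumption for readability).
wide <- data.frame(
  Airline = c("ALASKA", "", "AM WEST", ""),
  Status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San.Diego = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61),
  stringsAsFactors = FALSE
)

# Mark blank airline cells as missing, then carry the last seen name downward.
wide$Airline[wide$Airline == ""] <- NA
wide <- fill(wide, Airline)

# Collapse the five destination columns into Destination/ArrivalDelay pairs.
long <- gather(wide, Destination, ArrivalDelay, Los.Angeles:Seattle)
```

After this, every row of `long` carries its airline name. In newer tidyr releases, pivot_longer covers the same reshaping as gather.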

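We want to further tidy the data by making it wide again with the spread function, which would split the Status column into two columns, on time and delayed, with values taken from the ArrivalDelay column. Note: the spread call below produced an error that I did not resolve in time, so the data is left unspread. A likely cause is that every second row has a blank Airline, so several rows share the same Airline, Status, and Destination and spread cannot tell them apart; the extra 3:7 argument is also not a column range in that position, since spread's fourth argument is fill.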

#fileData <- spread(fileData, Status, ArrivalDelay, 3:7)
#fileData
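
For reference, here is a hedged sketch of how the spread step might work once every row has an airline name. The data frame `long` is an inline copy of the 20 rows printed earlier with the blank Airline cells filled in by hand; that filling, and the names `long` and `wide_status`, are assumptions for illustration and not part of the original analysis.

```r
library(tidyr)

# Inline copy of the long table, with the blank Airline cells filled in.
long <- data.frame(
  Airline = rep(c("ALASKA", "ALASKA", "AM WEST", "AM WEST"), times = 5),
  Status = rep(c("on time", "delayed"), times = 10),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), each = 4),
  ArrivalDelay = c(497, 62, 694, 117,
                   221, 12, 4840, 415,
                   212, 20, 383, 65,
                   503, 102, 320, 129,
                   1841, 305, 201, 61),
  stringsAsFactors = FALSE
)

# spread() needs only the key column (Status) and the value column
# (ArrivalDelay); no column range is passed.
wide_status <- spread(long, Status, ArrivalDelay)
```

The result has one row per airline and destination, with `delayed` and `on time` as separate columns. Because the second name contains a space, it has to be written with backticks, for example wide_status$`on time`.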

Transform the data using dplyr:

Here, we filter the data as an example of data transformation, keeping only the rows whose Destination is Phoenix.

filter(fileData, Destination == 'Phoenix')

##   Airline  Status Destination ArrivalDelay
## 1  ALASKA on time     Phoenix          221
## 2         delayed     Phoenix           12
## 3 AM WEST on time     Phoenix         4840
## 4         delayed     Phoenix          415
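
As a small extension of the filter above, the sketch below combines two conditions and sorts the result. It uses an inline copy of the long table with the blank Airline cells filled in by hand (an assumption, as are the names `long` and `phoenix_delayed`), so it runs without the CSV file.

```r
library(dplyr)

# Inline copy of the long table, with the blank Airline cells filled in.
long <- data.frame(
  Airline = rep(c("ALASKA", "ALASKA", "AM WEST", "AM WEST"), times = 5),
  Status = rep(c("on time", "delayed"), times = 10),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), each = 4),
  ArrivalDelay = c(497, 62, 694, 117,
                   221, 12, 4840, 415,
                   212, 20, 383, 65,
                   503, 102, 320, 129,
                   1841, 305, 201, 61),
  stringsAsFactors = FALSE
)

# Keep only delayed rows for Phoenix, largest value first.
# Conditions separated by commas in filter() are combined with AND.
phoenix_delayed <- long %>%
  filter(Destination == "Phoenix", Status == "delayed") %>%
  arrange(desc(ArrivalDelay))
```

This leaves two rows, one per airline, with AM WEST first.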

Statistical Analysis:

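Here we analyze the data. The summary below gives some useful statistics for ArrivalDelay. For the plot: ggplot works best with long data, which is the shape this data is in, but the graph below is still not a meaningful analysis, because it puts the categorical Airline column on the x-axis and draws both Airline and ArrivalDelay as lines against the same y-axis. As mentioned earlier, because of the coding error, I did not spread this data.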

summary(fileData)

##     Airline       Status   Destination         ArrivalDelay    
##         :10   delayed:10   Length:20          Min.   :  12.00  
##  ALASKA : 5   on time:10   Class :character   1st Qu.:  92.75  
##  AM WEST: 5                Mode  :character   Median : 216.50  
##                                               Mean   : 550.00  
##                                               3rd Qu.: 435.50  
##                                               Max.   :4840.00

ggplot(fileData, aes(Airline)) +
  geom_line(aes(y = Airline, colour = "Airline")) +
  geom_line(aes(y = ArrivalDelay, colour = "ArrivalDelay"))
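
One possible alternative, offered as a sketch and not as the intended analysis, is a grouped bar chart built directly from the long data: one bar per Status within each Destination, and one panel per airline. The inline data frame `long` repeats the table printed earlier with the blank Airline cells filled in by hand, which is an assumption; the object name `p` is also illustrative.

```r
library(ggplot2)

# Inline copy of the long table, with the blank Airline cells filled in.
long <- data.frame(
  Airline = rep(c("ALASKA", "ALASKA", "AM WEST", "AM WEST"), times = 5),
  Status = rep(c("on time", "delayed"), times = 10),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), each = 4),
  ArrivalDelay = c(497, 62, 694, 117,
                   221, 12, 4840, 415,
                   212, 20, 383, 65,
                   503, 102, 320, 129,
                   1841, 305, 201, 61),
  stringsAsFactors = FALSE
)

# Bars compare ArrivalDelay by Status within each Destination;
# facet_wrap() draws a separate panel for each airline.
p <- ggplot(long, aes(x = Destination, y = ArrivalDelay, fill = Status)) +
  geom_col(position = "dodge") +
  facet_wrap(~ Airline)

# print(p) draws the chart in an interactive session.
```

Bars suit this data better than lines because Destination is categorical, so there is no natural order to connect the points along the x-axis.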