library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
fileData <- read.csv(file="//Users/suma/Desktop/CUNY SPS - Masters Data Science/Data 607/Assignment5/ArrivalDelays.csv", header = TRUE)
## Warning in read.table(file = file, header = header, sep = sep, quote
## = quote, : incomplete final line found by readTableHeader on '//Users/
## suma/Desktop/CUNY SPS - Masters Data Science/Data 607/Assignment5/
## ArrivalDelays.csv'
head(fileData)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61
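Here, we use the gather function from tidyr to reshape the data from wide to long. In tidy data there is one row per case, a column for each variable, and a cell for each value. gather turns the five destination columns into rows: the former column names go into a new Destination column, and the counts they held go into a new ArrivalDelay column.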
fileData <- gather(fileData, Destination, ArrivalDelay, 3:7)
names(fileData) <- c("Airline", "Status", "Destination", "ArrivalDelay")
fileData
##    Airline  Status   Destination ArrivalDelay
## 1   ALASKA on time   Los.Angeles          497
## 2          delayed   Los.Angeles           62
## 3  AM WEST on time   Los.Angeles          694
## 4          delayed   Los.Angeles          117
## 5   ALASKA on time       Phoenix          221
## 6          delayed       Phoenix           12
## 7  AM WEST on time       Phoenix         4840
## 8          delayed       Phoenix          415
## 9   ALASKA on time     San.Diego          212
## 10         delayed     San.Diego           20
## 11 AM WEST on time     San.Diego          383
## 12         delayed     San.Diego           65
## 13  ALASKA on time San.Francisco          503
## 14         delayed San.Francisco          102
## 15 AM WEST on time San.Francisco          320
## 16         delayed San.Francisco          129
## 17  ALASKA on time       Seattle         1841
## 18         delayed       Seattle          305
## 19 AM WEST on time       Seattle          201
## 20         delayed       Seattle           61
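The same wide-to-long reshape can be sketched on a self-contained data frame with pivot_longer(), which tidyr (version 1.0.0 and later) offers as the newer replacement for gather(). This is only an illustration: the data frame `wide` is rebuilt by hand from the head(fileData) output above, and the column names Airline and Status are assumed in place of X and X.1.

```r
library(tidyr)

# Rebuilt from the head(fileData) output; Airline and Status are assumed
# names for the columns that read.csv called X and X.1.
wide <- data.frame(
  Airline       = c("ALASKA", "", "AM WEST", ""),
  Status        = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497, 62, 694, 117),
  Phoenix       = c(221, 12, 4840, 415),
  San.Diego     = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle       = c(1841, 305, 201, 61),
  stringsAsFactors = FALSE
)

# Collapse the five destination columns into name/value pairs.
long <- pivot_longer(
  wide,
  cols      = Los.Angeles:Seattle,
  names_to  = "Destination",
  values_to = "ArrivalDelay"
)

dim(long)
```

One difference to keep in mind: pivot_longer() returns the rows grouped by original row (all five destinations for the first row, then the second, and so on), whereas gather() stacks them destination by destination as in the output above.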
To tidy the data further, we want to make it wider again with the spread function, splitting the Status column into two columns, on time and delayed, whose values come from the ArrivalDelay column. Note: the spread call kept raising an error that I could not resolve in time, so it is commented out and the data stays in long format.
#fileData <- spread(fileData, Status, ArrivalDelay, 3:7)
#fileData
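A possible fix, sketched under assumptions rather than confirmed against the original error message: two things in the commented-out call look likely to cause trouble. First, the fourth positional argument of spread() is `fill`, not a column selection, so `3:7` does not belong there. Second, the delayed rows have a blank Airline, so several rows share the same Airline, Destination, and Status, and spread() cannot identify them uniquely. The sketch below uses a small hand-built data frame (`long`, limited to Phoenix) and assumes each blank airline name belongs to the airline on the row above it.

```r
library(tidyr)
library(dplyr)

# Same shape as the gather() output above, reduced to one destination
# to keep the sketch short.
long <- data.frame(
  Airline      = c("ALASKA", "", "AM WEST", ""),
  Status       = c("on time", "delayed", "on time", "delayed"),
  Destination  = "Phoenix",
  ArrivalDelay = c(221, 12, 4840, 415),
  stringsAsFactors = FALSE
)

wide <- long %>%
  mutate(Airline = na_if(Airline, "")) %>%  # treat blank names as missing
  fill(Airline) %>%                          # carry the last airline name down
  spread(Status, ArrivalDelay)               # key = Status, value = ArrivalDelay

wide
```

The result has one row per airline and destination, with separate `delayed` and `on time` columns. The fill step depends on row order, so it should run before any sorting or filtering.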
Here, as an example of data transformation, we use the filter function from dplyr to keep only the rows whose destination is Phoenix.
filter(fileData, Destination == 'Phoenix')
##   Airline  Status Destination ArrivalDelay
## 1  ALASKA on time     Phoenix          221
## 2         delayed     Phoenix           12
## 3 AM WEST on time     Phoenix         4840
## 4         delayed     Phoenix          415
Here we analyze our data. The summary function gives some useful summary statistics for ArrivalDelay. For the plot: ggplot works best with long data rather than wide data, but the graph below is still not a meaningful data analysis. As mentioned earlier, because of the coding error I did not spread this data, so the on time and delayed counts remain mixed in a single ArrivalDelay column.
summary(fileData)
##     Airline       Status   Destination         ArrivalDelay
##         :10   delayed:10   Length:20          Min.   :  12.00
##  ALASKA : 5   on time:10   Class :character   1st Qu.:  92.75
##  AM WEST: 5                Mode  :character   Median : 216.50
##                                               Mean   : 550.00
##                                               3rd Qu.: 435.50
##                                               Max.   :4840.00
ggplot(fileData, aes(Airline)) +
  geom_line(aes(y = Airline, colour = "Airline")) +
  geom_line(aes(y = ArrivalDelay, colour = "ArrivalDelay"))
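One way to get a more readable picture, offered as a sketch and not as the intended analysis, is to compute the share of delayed flights for each airline and destination and draw it as a grouped bar chart. The data frame `flights` below is typed in from the tables above. It assumes that every blank airline name belongs to the airline on the row above, and it uses the hypothetical column name `Flights` for what the report calls ArrivalDelay, since the values are flight counts.

```r
library(dplyr)
library(ggplot2)

# Values copied from the gather() output, with the airline name filled in
# on every row (an assumption) and the count column renamed to Flights.
flights <- data.frame(
  Airline     = rep(c("ALASKA", "ALASKA", "AM WEST", "AM WEST"), times = 5),
  Status      = rep(c("on time", "delayed"), times = 10),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), each = 4),
  Flights     = c(497, 62, 694, 117,
                  221, 12, 4840, 415,
                  212, 20, 383, 65,
                  503, 102, 320, 129,
                  1841, 305, 201, 61)
)

# Share of flights that were delayed, per airline and destination.
delay_rate <- flights %>%
  group_by(Airline, Destination) %>%
  summarise(DelayRate = sum(Flights[Status == "delayed"]) / sum(Flights),
            .groups = "drop")

# Grouped bars: one pair of bars per destination, one colour per airline.
p <- ggplot(delay_rate, aes(x = Destination, y = DelayRate, fill = Airline)) +
  geom_col(position = "dodge") +
  labs(y = "Share of flights delayed")
p
```

Plotting a rate instead of raw counts keeps the comparison fair between destinations with very different traffic, such as AM WEST's 4840 on-time flights into Phoenix against ALASKA's 221.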