Assignment

Source: Numbersense, Kaiser Fung, McGraw Hill, 2013
Source: Numbersense, Kaiser Fung, McGraw Hill, 2013

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines.

Read data from CSV file

flights <- read.csv("FlightData.csv", header = T)
dim(flights)
## [1] 5 7
head(flights)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     210
## 5         delayed         117     415        65           129      61

Get rid of blank row

flights <- flights[c(1,2,4,5), ]
flights
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     210
## 5         delayed         117     415        65           129      61

Fix column 1 so that there is data in every row

flights[c(2,4),1] <- flights[c(1,3),1]
flights
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     210
## 5 AM WEST delayed         117     415        65           129      61

Fix first two column names

colnames(flights)[1:2] <- c('airline', 'status')
flights
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     210
## 5 AM WEST delayed         117     415        65           129      61

Use tidyr to move header to column

tidyflights <- gather(flights, key = 'destination', value = 'n', 3:7)
tidyflights
##    airline  status   destination    n
## 1   ALASKA on time   Los.Angeles  497
## 2   ALASKA delayed   Los.Angeles   62
## 3  AM WEST on time   Los.Angeles  694
## 4  AM WEST delayed   Los.Angeles  117
## 5   ALASKA on time       Phoenix  221
## 6   ALASKA delayed       Phoenix   12
## 7  AM WEST on time       Phoenix 4840
## 8  AM WEST delayed       Phoenix  415
## 9   ALASKA on time     San.Diego  212
## 10  ALASKA delayed     San.Diego   20
## 11 AM WEST on time     San.Diego  383
## 12 AM WEST delayed     San.Diego   65
## 13  ALASKA on time San.Francisco  503
## 14  ALASKA delayed San.Francisco  102
## 15 AM WEST on time San.Francisco  320
## 16 AM WEST delayed San.Francisco  129
## 17  ALASKA on time       Seattle 1841
## 18  ALASKA delayed       Seattle  305
## 19 AM WEST on time       Seattle  210
## 20 AM WEST delayed       Seattle   61

Use tidyr to spread data into two columns for Status

tidyflights <- spread(tidyflights, status, n)
tidyflights
##    airline   destination delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     210

Use dplyr to add columns for total, percent on time, and percent delayed

tidyflights <- tidyflights %>% 
  mutate(total = delayed + `on time`,
         percent_on_time = round(`on time`/total*100, 2),
         percent_delayed = round(delayed/total*100, 2))
datatable(tidyflights)

Use dplyr summarize function to compare airlines

compareAll <- tidyflights %>% 
  group_by(airline) %>%
  summarize(OnTime = sum(`on time`), Delayed = sum(delayed), 
            PercentOnTime = round(OnTime/sum(OnTime,Delayed),2))%>%
  select(airline, PercentOnTime)
compareAll %>% arrange(desc(PercentOnTime))
## # A tibble: 2 x 2
##   airline PercentOnTime
##   <fct>           <dbl>
## 1 AM WEST         0.890
## 2 ALASKA          0.870

This is not what I expected because Alaska airlines has 4 out of the top five spots for percentage of on time flights! AM West has a better overall average because they have a very large number of flights to Phoenix with a high on time percentage to that destination.

Use dplyr to filter out Phoenix to see the comparison accross the other destinations

minusPhoenix <- tidyflights %>%
  filter(!destination %in% c('Phoenix')) %>%
  group_by(airline) %>%
  summarize(OnTime = sum(`on time`), Delayed = sum(delayed), 
            PercentOnTime = round(OnTime/sum(OnTime,Delayed),2)) %>%
  select(airline, PercentOnTime)
minusPhoenix %>% arrange(desc(PercentOnTime))
## # A tibble: 2 x 2
##   airline PercentOnTime
##   <fct>           <dbl>
## 1 ALASKA          0.860
## 2 AM WEST         0.810

When you filter out Phoenix you can see that now Alaska Airlines has a better average on time percentage.

So if you are flying to Phoenix take AM West! But if you are flying to any of the other destinations you have a better chance of being on time with Alaska airlines.

Alaska <- tidyflights %>% filter(airline %in% c('ALASKA')) %>% 
  select(airline, destination, percent_on_time)
Alaska %>% arrange(desc(percent_on_time))
##   airline   destination percent_on_time
## 1  ALASKA       Phoenix           94.85
## 2  ALASKA     San.Diego           91.38
## 3  ALASKA   Los.Angeles           88.91
## 4  ALASKA       Seattle           85.79
## 5  ALASKA San.Francisco           83.14
AMWest <- tidyflights %>% filter(airline %in% c('AM WEST')) %>% 
  select(airline, destination, percent_on_time)
AMWest %>% arrange(desc(percent_on_time))
##   airline   destination percent_on_time
## 1 AM WEST       Phoenix           92.10
## 2 AM WEST   Los.Angeles           85.57
## 3 AM WEST     San.Diego           85.49
## 4 AM WEST       Seattle           77.49
## 5 AM WEST San.Francisco           71.27