Assignment:

Read untidy data from .CSV file into R, and use tidyr and dplyr as needed to tidy and transform data. Perform analysis to compare the arrival delays for the two airlines.

1) Load tidyr and dplyr packages

library(tidyr)
library(dplyr)

2) Read .CSV file from Github

airlines <- read.csv("https://raw.githubusercontent.com/miachen410/DATA607/master/airlines.csv", header = TRUE, stringsAsFactors = FALSE)
airlines
##       ï..       X Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 3                          NA                NA            NA        
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

Take a look at the data:

  • First two columns have no names;
  • There is a blank row;
  • Two values are missing in column 1.

3) Add column names

names(airlines)[1] <- "Airline"
names(airlines)[2] <- "Status"
airlines
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 3                          NA                NA            NA        
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

4) Remove the blank row that contains NAs

#https://stackoverflow.com/questions/6437164/removing-empty-rows-of-a-data-file-in-r

airlines <- airlines[!apply(is.na(airlines) | airlines == "", 1, all),] 
airlines
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

5) Fill in the missing values

airlines$Airline[2] <- airlines$Airline[1]
airlines$Airline[4] <- airlines$Airline[3]
airlines
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

6) Gather columns into rows

airlines_tidy1 <- gather(airlines, "City", "Flights", 3:7) #using gather function in tidyr, I gather cities from column 3 to 7, and transform them into rows
airlines_tidy1
##    Airline  Status          City Flights
## 1   ALASKA on time   Los.Angeles     497
## 2   ALASKA delayed   Los.Angeles      62
## 3  AM WEST on time   Los.Angeles     694
## 4  AM WEST delayed   Los.Angeles     117
## 5   ALASKA on time       Phoenix     221
## 6   ALASKA delayed       Phoenix      12
## 7  AM WEST on time       Phoenix   4,840
## 8  AM WEST delayed       Phoenix     415
## 9   ALASKA on time     San.Diego     212
## 10  ALASKA delayed     San.Diego      20
## 11 AM WEST on time     San.Diego     383
## 12 AM WEST delayed     San.Diego      65
## 13  ALASKA on time San.Francisco     503
## 14  ALASKA delayed San.Francisco     102
## 15 AM WEST on time San.Francisco     320
## 16 AM WEST delayed San.Francisco     129
## 17  ALASKA on time       Seattle   1,841
## 18  ALASKA delayed       Seattle     305
## 19 AM WEST on time       Seattle     201
## 20 AM WEST delayed       Seattle      61

7) Spread rows into columns

airlines_tidy2 <- spread(airlines_tidy1, Status, Flights) #using spread function in tidyr, I spread the rows from status into two columns
airlines_tidy2
##    Airline          City delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305   1,841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415   4,840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201

8) Remove the commas in numeric values and change type to as.numeric

airlines_tidy2$delayed <- as.numeric(gsub("\\,", "", airlines_tidy2$delayed))
airlines_tidy2$`on time` <- as.numeric(gsub("\\,", "", airlines_tidy2$`on time`))
airlines_tidy2
##    Airline          City delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201

9) Perform analysis on airlines

compare_airlines <- airlines_tidy2 %>%
  group_by(Airline) %>% 
  summarize(total_delayed = sum(delayed), total_flights = sum(delayed) + sum(`on time`), delay_ratio = sum(delayed) / (sum(delayed) + sum(`on time`))) 

compare_airlines
## # A tibble: 2 x 4
##   Airline total_delayed total_flights delay_ratio
##   <chr>           <dbl>         <dbl>       <dbl>
## 1 ALASKA            501          3775       0.133
## 2 AM WEST           787          7225       0.109

From airlines delay comparison on total flights and delay ratio, we see that AM WEST had more flights than ALASKA (about twice as more as ALASKA), yet a smaller delay ratio. It’s likely for someone to assume that AM WEST performs better than ALASKA based on this comparison.

10) Perform analysis on delays in cities

compare_cities <- airlines_tidy2 %>%
  group_by(City, Airline) %>% 
  summarize(delay_ratio = sum(delayed) / (sum(delayed) + sum(`on time`)))
compare_cities %>% spread(Airline, delay_ratio)
## # A tibble: 5 x 3
## # Groups:   City [5]
##   City          ALASKA `AM WEST`
##   <chr>          <dbl>     <dbl>
## 1 Los.Angeles   0.111     0.144 
## 2 Phoenix       0.0515    0.0790
## 3 San.Diego     0.0862    0.145 
## 4 San.Francisco 0.169     0.287 
## 5 Seattle       0.142     0.233

From cities delay comparison, we see that ALASKA outperforms AM WEST in all five cities since its delay ratio is smaller than AM WEST’s in every city. Frankly, it’s the opposite of what we have assumed from the previous analysis.