library(tidyr)
library(dplyr)
airlines <- read.csv("https://raw.githubusercontent.com/miachen410/DATA607/master/airlines.csv", header = TRUE, stringsAsFactors = FALSE)
airlines
## ï.. X Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA
## 4 AM WEST on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
Take a look at the data:
- First two columns have no names;
- There is a blank row;
- Two values are missing in column 1.
names(airlines)[1] <- "Airline"
names(airlines)[2] <- "Status"
airlines
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA
## 4 AM WEST on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
#https://stackoverflow.com/questions/6437164/removing-empty-rows-of-a-data-file-in-r
airlines <- airlines[!apply(is.na(airlines) | airlines == "", 1, all),]
airlines
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4,840 383 320 201
## 5 delayed 117 415 65 129 61
airlines$Airline[2] <- airlines$Airline[1]
airlines$Airline[4] <- airlines$Airline[3]
airlines
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4,840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
airlines_tidy1 <- gather(airlines, "City", "Flights", 3:7) #using gather function in tidyr, I gather cities from column 3 to 7, and transform them into rows
airlines_tidy1
## Airline Status City Flights
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4,840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST on time San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA on time San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST on time San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA on time Seattle 1,841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
airlines_tidy2 <- spread(airlines_tidy1, Status, Flights) #using spread function in tidyr, I spread the rows from status into two columns
airlines_tidy2
## Airline City delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1,841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4,840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
airlines_tidy2$delayed <- as.numeric(gsub("\\,", "", airlines_tidy2$delayed))
airlines_tidy2$`on time` <- as.numeric(gsub("\\,", "", airlines_tidy2$`on time`))
airlines_tidy2
## Airline City delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
compare_airlines <- airlines_tidy2 %>%
group_by(Airline) %>%
summarize(total_delayed = sum(delayed), total_flights = sum(delayed) + sum(`on time`), delay_ratio = sum(delayed) / (sum(delayed) + sum(`on time`)))
compare_airlines
## # A tibble: 2 x 4
## Airline total_delayed total_flights delay_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3775 0.133
## 2 AM WEST 787 7225 0.109
From airlines delay comparison on total flights and delay ratio, we see that AM WEST had more flights than ALASKA (about twice as more as ALASKA), yet a smaller delay ratio. It’s likely for someone to assume that AM WEST performs better than ALASKA based on this comparison.
compare_cities <- airlines_tidy2 %>%
group_by(City, Airline) %>%
summarize(delay_ratio = sum(delayed) / (sum(delayed) + sum(`on time`)))
compare_cities %>% spread(Airline, delay_ratio)
## # A tibble: 5 x 3
## # Groups: City [5]
## City ALASKA `AM WEST`
## <chr> <dbl> <dbl>
## 1 Los.Angeles 0.111 0.144
## 2 Phoenix 0.0515 0.0790
## 3 San.Diego 0.0862 0.145
## 4 San.Francisco 0.169 0.287
## 5 Seattle 0.142 0.233
From cities delay comparison, we see that ALASKA outperforms AM WEST in all five cities since its delay ratio is smaller than AM WEST’s in every city. Frankly, it’s the opposite of what we have assumed from the previous analysis.