Next is to reshape the data. I would like to have all the destinations(column 3-7) in a column and add an AMOUNT column for the number of times the flight was on time and delayed.
flight.chart <- gather(flight.chart, "Destination", "Amount", 3:7 )
colnames(flight.chart) <- c("Airline", "Status", "Destination", "Amount")
flight.chart <-dplyr::arrange(flight.chart, Airline)
flight.chart
## Airline Status Destination Amount
## 1 ALASKA on time LA 497
## 2 ALASKA delayed LA 62
## 3 ALASKA on time PHO 221
## 4 ALASKA delayed PHO 12
## 5 ALASKA on time San Diego 212
## 6 ALASKA delayed San Diego 20
## 7 ALASKA on time San Fran. 503
## 8 ALASKA delayed San Fran. 102
## 9 ALASKA on time SEA 1841
## 10 ALASKA delayed SEA 305
## 11 AM WEST on time LA 694
## 12 AM WEST delayed LA 117
## 13 AM WEST on time PHO 4840
## 14 AM WEST delayed PHO 415
## 15 AM WEST on time San Diego 383
## 16 AM WEST delayed San Diego 65
## 17 AM WEST on time San Fran. 320
## 18 AM WEST delayed San Fran. 129
## 19 AM WEST on time SEA 201
## 20 AM WEST delayed SEA 61
I want to create an on time and delayed column with appropriate counts in each column.
flight.chart <- tidyr::spread(flight.chart, Status, Amount)
colnames(flight.chart) <- c("Airline", "Destination","Delayed","OnTime", "Unknown")
flight.chart
## Airline Destination Delayed OnTime Unknown
## 1 ALASKA LA 62 497 <NA>
## 2 ALASKA PHO 12 221 <NA>
## 3 ALASKA San Diego 20 212 <NA>
## 4 ALASKA San Fran. 102 503 <NA>
## 5 ALASKA SEA 305 1841 <NA>
## 6 AM WEST LA 117 <NA> 694
## 7 AM WEST PHO 415 <NA> 4840
## 8 AM WEST San Diego 65 <NA> 383
## 9 AM WEST San Fran. 129 <NA> 320
## 10 AM WEST SEA 61 <NA> 201
# I noticed that my result was split up so I did the following to fix it.
flight.chart$OnTime <- c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201)
flight.chart
## Airline Destination Delayed OnTime Unknown
## 1 ALASKA LA 62 497 <NA>
## 2 ALASKA PHO 12 221 <NA>
## 3 ALASKA San Diego 20 212 <NA>
## 4 ALASKA San Fran. 102 503 <NA>
## 5 ALASKA SEA 305 1841 <NA>
## 6 AM WEST LA 117 694 694
## 7 AM WEST PHO 415 4840 4840
## 8 AM WEST San Diego 65 383 383
## 9 AM WEST San Fran. 129 320 320
## 10 AM WEST SEA 61 201 201
flight.chart$Unknown <- NULL
flight.chart
## Airline Destination Delayed OnTime
## 1 ALASKA LA 62 497
## 2 ALASKA PHO 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Fran. 102 503
## 5 ALASKA SEA 305 1841
## 6 AM WEST LA 117 694
## 7 AM WEST PHO 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Fran. 129 320
## 10 AM WEST SEA 61 201
Step 2: We will now perform some analysis on the data. My main goal is to figure out which airline is more on time overall.
I’ll need to ensure that my Delayed and On Time columns are numeric so I can do calculations on them
flight.chart$Delayed <- as.numeric(as.character((flight.chart$Delayed)))
flight.chart$OnTime <- as.numeric(as.character(flight.chart$OnTime))
flight.chart <- tbl_df(flight.chart) #This was more so to confirm that those two columns were converted to numeric.
flight.chart
## # A tibble: 10 × 4
## Airline Destination Delayed OnTime
## * <chr> <chr> <dbl> <dbl>
## 1 ALASKA LA 62 497
## 2 ALASKA PHO 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Fran. 102 503
## 5 ALASKA SEA 305 1841
## 6 AM WEST LA 117 694
## 7 AM WEST PHO 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Fran. 129 320
## 10 AM WEST SEA 61 201
I wanted to create a column that would show the ratio of delays vs. on time for each trip. Just from skimming the latest output, we see that the Alaska flight to Phoenix had the lowest rate of delays, while the AM West flight to San Fransico had the highest delay rate.
flight.chart <- mutate(flight.chart, Ratio = (Delayed /(OnTime + Delayed)))
flight.chart <- arrange(flight.chart, desc(Ratio))
flight.chart
## # A tibble: 10 × 5
## Airline Destination Delayed OnTime Ratio
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AM WEST San Fran. 129 320 0.28730512
## 2 AM WEST SEA 61 201 0.23282443
## 3 ALASKA San Fran. 102 503 0.16859504
## 4 AM WEST San Diego 65 383 0.14508929
## 5 AM WEST LA 117 694 0.14426634
## 6 ALASKA SEA 305 1841 0.14212488
## 7 ALASKA LA 62 497 0.11091234
## 8 ALASKA San Diego 20 212 0.08620690
## 9 AM WEST PHO 415 4840 0.07897241
## 10 ALASKA PHO 12 221 0.05150215
*In order to get a better view of which airline has more delays, I want to produce a grouped bar graph. I chose the destination to be my x-axis, and for my y-axis, I went with the delay ratio (delays/(delays+ontime)
ggplot(flight.chart,aes(x=Destination, y=Ratio, fill=factor(Airline))) +
geom_bar(stat="identity",position="dodge") +
scale_fill_brewer(palette = "Set1", name="Airline", breaks =c("ALASKA", "AM WEST"), labels=c("Alaska", "AM WEST")) +
xlab("Destination") +
ylab("Delayed Ratio")

# from: http://stackoverflow.com/questions/17721126/simplest-way-to-do-grouped-barplot
*From this, we can easily see that overall, AM WEST has more delays than ALASKA. This is shown by every destination having a higher delay rate for AM wEST THAN Alaska.