The chart above describes arrival delays for two airlines across five destinations.
Your task is to:
I created a MySQL database data607_wk5asg and entered the provided information into a table called delays. I used a dynamic query to format the data in a wide structure, similar to a pivot table in M.S. Excel.
library(RMySQL)
library(DBI)
localuserpassword <- "root"
con <- dbConnect(MySQL(),
user='root', password=localuserpassword,
dbname='data607_wk5asg',
host='localhost')
dbReadTable(con, "delays")
## id airline stat city n
## 1 1 ALASKA on time Los Angeles 497
## 2 1 ALASKA on time Phoenix 221
## 3 1 ALASKA on time San Diego 212
## 4 1 ALASKA on time San Francisco 503
## 5 1 ALASKA on time Seattle 1841
## 6 1 <NA> delayed Los Angeles 62
## 7 1 <NA> delayed Phoenix 12
## 8 1 <NA> delayed San Diego 20
## 9 1 <NA> delayed San Francisco 102
## 10 1 <NA> delayed Seattle 305
## 11 2 AM WEST on time Los Angeles 694
## 12 2 AM WEST on time Phoenix 4840
## 13 2 AM WEST on time San Diego 383
## 14 2 AM WEST on time San Francisco 320
## 15 2 AM WEST on time Seattle 201
## 16 2 <NA> delayed Los Angeles 117
## 17 2 <NA> delayed Phoenix 415
## 18 2 <NA> delayed San Diego 65
## 19 2 <NA> delayed San Francisco 129
## 20 2 <NA> delayed Seattle 61
dbDisconnect(con)
## [1] TRUE
.CSV file into R & use tidyr and dplyr to tidy and transform data.library(stringr)
delayscsv <- read.csv(paste0("/Users/15082/Documents/CUNY_MSDS/DATA607/delayquery.csv"), stringsAsFactors = F)
delayscsv$stat <- str_replace_all(delayscsv$stat, "on time", "on_time")
head(delayscsv)
## airline stat Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on_time 497 221 212 503 1841
## 2 NULL delayed 62 12 20 102 305
## 3 AM WEST on_time 694 4840 383 320 201
## 4 NULL delayed 117 415 65 129 61
library(tidyr)
library(dplyr)
delays <- delayscsv %>%
mutate(airline=c("ALASKA", "ALASKA", "AM WEST", "AM WEST")) %>%
gather(city, n, 3:7) %>%
arrange(airline, desc(stat), city, n)
knitr::kable(delays, caption = 'Airline Delays Dataset', format = "html") %>%
kable_styling(bootstrap_options = "condensed", full_width = F, position = "left") %>%
row_spec(row = 0:0, background = "#D4E0F7") %>%
column_spec(column = 1, bold = T)
| airline | stat | city | n |
|---|---|---|---|
| ALASKA | on_time | Los.Angeles | 497 |
| ALASKA | on_time | Phoenix | 221 |
| ALASKA | on_time | San.Diego | 212 |
| ALASKA | on_time | San.Francisco | 503 |
| ALASKA | on_time | Seattle | 1841 |
| ALASKA | delayed | Los.Angeles | 62 |
| ALASKA | delayed | Phoenix | 12 |
| ALASKA | delayed | San.Diego | 20 |
| ALASKA | delayed | San.Francisco | 102 |
| ALASKA | delayed | Seattle | 305 |
| AM WEST | on_time | Los.Angeles | 694 |
| AM WEST | on_time | Phoenix | 4840 |
| AM WEST | on_time | San.Diego | 383 |
| AM WEST | on_time | San.Francisco | 320 |
| AM WEST | on_time | Seattle | 201 |
| AM WEST | delayed | Los.Angeles | 117 |
| AM WEST | delayed | Phoenix | 415 |
| AM WEST | delayed | San.Diego | 65 |
| AM WEST | delayed | San.Francisco | 129 |
| AM WEST | delayed | Seattle | 61 |
To avoid future syntax errors, I first used the stringr package to replace the space in value “on time” with a “_" character. Using the dplyr package, I then applied the mutate and c() function to overwrite the airline null values with their proper names. Last, I used the gather function from tidyr to combind the multiple city columns into one field.
status <- delays %>%
group_by(airline) %>%
group_by(airline, stat) %>%
mutate(id = row_number()) %>%
spread(stat, n) %>%
select(-id, -city) %>%
summarise_all(funs(sum)) %>%
group_by(airline) %>%
mutate(total = sum(delayed) + sum(on_time)) %>%
summarise(delay_ratio = delayed / total, on_time_ratio = on_time / total, total_flights = total)
citystatus <- delays %>%
unite(city_airline, city, airline, sep = "_") %>%
arrange(desc(city_airline)) %>%
group_by(city_airline) %>%
group_by(city_airline, stat) %>%
mutate(id = row_number()) %>%
spread(stat, n) %>%
select(-id) %>%
summarise_all(funs(sum)) %>%
group_by(city_airline) %>%
mutate(total = sum(delayed) + sum(on_time)) %>%
summarise(delay_ratio = delayed / total, on_time_ratio = on_time / total, total_flights = total)
I again used the dplyr and tidyr packages to calculate the percentages of delayed and on time arrivals for each airline.
- The group_by function was used to collapse repeated data within the identified columns.
- The mutate function was used to append new columns of data to allow for further synthesis.
- The spread function was used to separate the status of airline flights.
- The select function was used in conjunction with the - indicator to remove columns from the selection.
- The summarise function was used to calculate the ratio of delays and on_time arrivals into a single row of values for each element.
| airline | delay_ratio | on_time_ratio | total_flights |
|---|---|---|---|
| ALASKA | 0.1327152 | 0.8672848 | 3775 |
| AM WEST | 0.1089273 | 0.8910727 | 7225 |
| city_airline | delay_ratio | on_time_ratio | total_flights |
|---|---|---|---|
| Los.Angeles_ALASKA | 0.1109123 | 0.8890877 | 559 |
| Los.Angeles_AM WEST | 0.1442663 | 0.8557337 | 811 |
| Phoenix_ALASKA | 0.0515021 | 0.9484979 | 233 |
| Phoenix_AM WEST | 0.0789724 | 0.9210276 | 5255 |
| San.Diego_ALASKA | 0.0862069 | 0.9137931 | 232 |
| San.Diego_AM WEST | 0.1450893 | 0.8549107 | 448 |
| San.Francisco_ALASKA | 0.1685950 | 0.8314050 | 605 |
| San.Francisco_AM WEST | 0.2873051 | 0.7126949 | 449 |
| Seattle_ALASKA | 0.1421249 | 0.8578751 | 2146 |
| Seattle_AM WEST | 0.2328244 | 0.7671756 | 262 |
In total, Alaska Airlines flights were found to arrive on-time 86.72%, compared to AM West, which arrived on-time 89.10%. However, if you look at the ratio’s in the citystatus table, you will see that AM West experiences higher delays at each city.