I start with a screenshot of a small dataset and the goal is to put it into a tidy data format and then perform analysis to compare the arrival delays for the two airlines.
For the sake of practice, I’ll create two separate tables in MySQL, one for each airline. In MySQL I create the empty tables with the variables in such a way that the data will be in a long structure per the grading rubric. I then input the handful of rows of data.
Image of MySQL Code.
First I load the RMySQL package in order to connect with MySQL. I also load the keyring package so I can access my stored password and keep it hidden from the world. Let’s load kable too for tables.
library('keyring')
library("RMySQL")
library("kableExtra")
Now we must access the datasets from MySQL.And let’s preview them quick as well.
mysqlpass <- key_get('mysql', 'root')
rachdb = dbConnect(MySQL(), user='root', password=mysqlpass,
dbname='607assign5_flights', host='localhost')
amwest_fl <- dbGetQuery(rachdb, "select * from amwest_fl")
alaska_fl <- dbGetQuery(rachdb, "select * from alaska_fl")
kable(amwest_fl, format = "markdown")
| destination | on_time | delay |
|---|---|---|
| Los Angeles | 694 | 117 |
| Phoenix | 4840 | 415 |
| San Diego | 383 | 65 |
| San Francisco | 320 | 129 |
| Seattle | 201 | 61 |
kable(alaska_fl, format = "markdown")
| destination | on_time | delay |
|---|---|---|
| Los Angeles | 497 | 62 |
| Phoenix | 221 | 12 |
| San Diego | 212 | 20 |
| San Francisco | 503 | 102 |
| Seattle | 1841 | 305 |
Using dplyr we need to combine our two datasets into one and add a column to identify which airline the data is from.
library(dplyr)
#add airline variable with correct airline for each dataset
alaska_fl <- mutate(alaska_fl, airline = "ALASKA")
amwest_fl <- mutate(amwest_fl, airline = "AMWEST")
#bind the two datasets together vertically then arrange by destination
flights <- bind_rows(alaska_fl, amwest_fl)
flights <- arrange(flights, destination)
kable(flights, format = "markdown")
| destination | on_time | delay | airline |
|---|---|---|---|
| Los Angeles | 497 | 62 | ALASKA |
| Los Angeles | 694 | 117 | AMWEST |
| Phoenix | 221 | 12 | ALASKA |
| Phoenix | 4840 | 415 | AMWEST |
| San Diego | 212 | 20 | ALASKA |
| San Diego | 383 | 65 | AMWEST |
| San Francisco | 503 | 102 | ALASKA |
| San Francisco | 320 | 129 | AMWEST |
| Seattle | 1841 | 305 | ALASKA |
| Seattle | 201 | 61 | AMWEST |
First by looking at a summary of the full dataset we see that the median frequency of on time flights is 440 and the median frequency of delayed flights is 83.5. One airline, we don’t know which yet, has the max of 415 delayed flights to a certain destination.
kable((summary(flights)), format = "markdown")
| destination | on_time | delay | airline | |
|---|---|---|---|---|
| Length:10 | Min. : 201.0 | Min. : 12.00 | Length:10 | |
| Class :character | 1st Qu.: 245.8 | 1st Qu.: 61.25 | Class :character | |
| Mode :character | Median : 440.0 | Median : 83.50 | Mode :character | |
| NA | Mean : 971.2 | Mean :128.80 | NA | |
| NA | 3rd Qu.: 646.2 | 3rd Qu.:126.00 | NA | |
| NA | Max. :4840.0 | Max. :415.00 | NA |
Looking by airline we can see a comparison between the two. ALASKA has a much higher median on-time frequency across the destinations at 597 compared to AMWEST’s 383. AMWEST has a larger IQR suggesting more variance in their on-time rates by destination.
kable((flights %>%
group_by(airline) %>%
summarise(median_on_time = median(on_time), iqr_on_time = IQR(on_time),
min_on_time = min(on_time), max_on_time = max(on_time))),
format = "markdown")
| airline | median_on_time | iqr_on_time | min_on_time | max_on_time |
|---|---|---|---|---|
| ALASKA | 497 | 282 | 212 | 1841 |
| AMWEST | 383 | 374 | 201 | 4840 |
Since this data is already aggregated, a nice way to visualize it might be this first dot plot below. The number of on-time flights for Los Angeles, San Diego, and San Francisco look similar, but there are large difference in the other two cities. In Phoenix AMWEST has nearly 5,000 on-time flights while ALASKA only has 250. In a flip, ALASKA has the greater amount of on-time flights to Seattle, around 2000, with AMWEST only have around 200.
library(ggplot2)
ggplot(data = flights, aes(x = on_time, y = destination)) +
geom_point(aes(color = airline, size = 5)) +
ylab('Destination City') +
xlab('Frequency of On-Time Flights') +
labs(title="Frequency of On-Time Flights to Destinations, by Airline") +
coord_flip()
Below, in looking at the delayed flights variable we can see that across all destinations, except for Seattle, AMWEST has more delays. The largest discrepancies are found in Seattle and Phoenix. Seattle is the only destination where ALASKA has more delays, at over 300 compared to AMWEST’s around 60. Phoenix is the inverse of that, ALASKA has what looks to be around 10 delays while that is the worst amount of delays in the entire dataset for AMWEST, that 415 maximum delays.
From this view one would conclude if you were flying to these destinations you’d want to fly ALASKA, unless you’re going to Seattle, then it’s AMWEST. However…
ggplot(data = flights, aes(x = delay, y = destination)) +
geom_point(aes(color = airline, size = 5)) +
ylab('Destination City') +
xlab('Frequency of Delays') +
labs(title="Frequency of Delays to Destinations, by Airline") +
coord_flip()
It might be more valuable to look at these not as just counts, but as percents out of the total flights to that destination for that airline. In this final dot plot below, now we see that for every destination, regardless of the raw number of flights, ALASKA has a greater percentage of on-time flights.
This plot also shows us that AMWEST has a particularly hard time being on-time in San Francisco, it’s the lowest dot on the chart near 71%.
flights <- mutate(flights, on_time_perc = flights$on_time/(flights$on_time + flights$delay))
ggplot(data = flights, aes(x = on_time_perc, y = destination)) +
geom_point(aes(color = airline, size = 5)) +
ylab('Destination City') +
xlab('% On-Time Flights') +
labs(title="% On-Time Flights to Destinations, by Airline") +
coord_flip()
This process shows how important it is to get the data in the correct format for analysis and graphing, and the importance of not just plotting the variables you have but taking time to create a more logical measure, such as the percent of on-time flights as opposed to the frequency. While the number of times an airline flies on time to a destination certainly holds merit, the percentage allows us to compare between the two airlines better as in some cases one simply didn’t fly to a certain destination as often.