I decided to create a MySQL db for this assignment to help myself get more familiar with connecting R to MySQL and creating MySQL databases. First thing I did was create a table in MySQL and insert the following data so that it resembled the table below.
| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| - | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| - | delayed | 117 | 415 | 65 | 129 | 61 |
The table describes the arrival delays for two airlines at five airports located in the West United States, however, the data isn’t organized perfectly to be analyzed. I will use dyplr and tidyr to transform the data to make analyzing it easier. Imagine a scenario where my supervisor tells me there will be lots of employees traveling over the course of the next year in the Western US and they are want to know which airline will be most reliable when it comes to arriving on time. Using the data provided we can create some barplots to help visualize the number of delays for each airline.
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 3.4.4
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.4.4
library(plyr)
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Once connection is established with the MySQL DB I can then query the table and save it as a data frame.
delays <- dbGetQuery(mydb, "SELECT * FROM delays")
delays <- as.data.frame(delays)
delays
## airline status Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Summarizing the data confirms that there are NA’s within the columns from the one blank row between the two airlines.
summary(delays)
## airline status Los_Angeles Phoenix
## Length:5 Length:5 Min. : 62.0 Min. : 12.0
## Class :character Class :character 1st Qu.:103.2 1st Qu.: 168.8
## Mode :character Mode :character Median :307.0 Median : 318.0
## Mean :342.5 Mean :1372.0
## 3rd Qu.:546.2 3rd Qu.:1521.2
## Max. :694.0 Max. :4840.0
## NA's :1 NA's :1
## San_Diego San_Francisco Seattle
## Min. : 20.00 Min. :102.0 Min. : 61
## 1st Qu.: 53.75 1st Qu.:122.2 1st Qu.: 166
## Median :138.50 Median :224.5 Median : 253
## Mean :170.00 Mean :263.5 Mean : 602
## 3rd Qu.:254.75 3rd Qu.:365.8 3rd Qu.: 689
## Max. :383.00 Max. :503.0 Max. :1841
## NA's :1 NA's :1 NA's :1
Lets remove the ugly row of NA’s out of the middle of our dataframe.
delays <- delays %>% drop_na(Los_Angeles)
delays
## airline status Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
And fill the NA’s below the airlines using fill from tidyr.
delays <- delays %>% fill(airline)
Now lets change to layout from wide to long using gather from tidyr.
long <- delays %>% gather(key=airport,value=count,3:7)
long
## airline status airport count
## 1 ALASKA on time Los_Angeles 497
## 2 ALASKA delayed Los_Angeles 62
## 3 AM WEST on time Los_Angeles 694
## 4 AM WEST delayed Los_Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San_Diego 212
## 10 ALASKA delayed San_Diego 20
## 11 AM WEST on time San_Diego 383
## 12 AM WEST delayed San_Diego 65
## 13 ALASKA on time San_Francisco 503
## 14 ALASKA delayed San_Francisco 102
## 15 AM WEST on time San_Francisco 320
## 16 AM WEST delayed San_Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
Looking at out long dataframe, I would like to change the way it is assorted with arrange.
long <- long %>% arrange(airline, status)
## Warning: package 'bindrcpp' was built under R version 3.4.4
long
## airline status airport count
## 1 ALASKA delayed Los_Angeles 62
## 2 ALASKA delayed Phoenix 12
## 3 ALASKA delayed San_Diego 20
## 4 ALASKA delayed San_Francisco 102
## 5 ALASKA delayed Seattle 305
## 6 ALASKA on time Los_Angeles 497
## 7 ALASKA on time Phoenix 221
## 8 ALASKA on time San_Diego 212
## 9 ALASKA on time San_Francisco 503
## 10 ALASKA on time Seattle 1841
## 11 AM WEST delayed Los_Angeles 117
## 12 AM WEST delayed Phoenix 415
## 13 AM WEST delayed San_Diego 65
## 14 AM WEST delayed San_Francisco 129
## 15 AM WEST delayed Seattle 61
## 16 AM WEST on time Los_Angeles 694
## 17 AM WEST on time Phoenix 4840
## 18 AM WEST on time San_Diego 383
## 19 AM WEST on time San_Francisco 320
## 20 AM WEST on time Seattle 201
Now that we have tidied up the data a bit we can group the data in different ways to summarize it. I can group by airline and status to get the total number of delayed and on time flights for each airline. Then I can group by just the airline to get the total number of flights for each airline. I would like to calculate the proportion of on time and delayed flights for each airline so I then combine the two data frames.
grpstatus <- long %>% group_by(airline, status) %>% summarise(total_count_by_airline_status = sum(count))
grpstatus
## # A tibble: 4 x 3
## # Groups: airline [?]
## airline status total_count_by_airline_status
## <chr> <chr> <int>
## 1 ALASKA delayed 501
## 2 ALASKA on time 3274
## 3 AM WEST delayed 787
## 4 AM WEST on time 6438
grpairlines <- long %>% group_by(airline) %>% summarise(total_count = sum(count))
grpairlines
## # A tibble: 2 x 2
## airline total_count
## <chr> <int>
## 1 ALASKA 3775
## 2 AM WEST 7225
combined <- rbind.fill(grpstatus,grpairlines)
combined
## airline status total_count_by_airline_status total_count
## 1 ALASKA delayed 501 NA
## 2 ALASKA on time 3274 NA
## 3 AM WEST delayed 787 NA
## 4 AM WEST on time 6438 NA
## 5 ALASKA <NA> NA 3775
## 6 AM WEST <NA> NA 7225
This creates a dataframe with a bunch of NA’s. To remove the NA’s in the total_count column I assign the total count of flights for each airline to all rows where the airline name matches. I then remove the bottom two rows from the data frame.
combined$total_count[combined$airline == 'ALASKA'] <- combined$total_count[combined$airline == 'ALASKA'] %>% tail(1)
combined$total_count[combined$airline == 'AM WEST'] <- combined$total_count[combined$airline == 'AM WEST'] %>% tail(1)
combined <- combined[-c(5,6),]
I then can easily calculate the proportion of delayed and on time flights for both airlines. I then rearrange the data frame to sort by the status so that it will be easier to compare the airlines in the barplots.
combined$prop <- combined$total_count_by_airline_status/combined$total_count
paste(combined$airline, combined$status, sep = ' ')
## [1] "ALASKA delayed" "ALASKA on time" "AM WEST delayed" "AM WEST on time"
par(mar = c(10,4,4,2))
combined <- combined %>% arrange(status)
barplot(combined$total_count_by_airline_status,
names.arg = (paste(combined$airline, combined$status, sep = ' ')),
las = 2, main = 'Totals for each airline')
barplot(combined$prop,
names.arg = (paste(combined$airline, combined$status, sep = ' ')),
las = 2, yaxp = c(0.0, 0.9,9),
main = 'Proportion of flights for each airline')
In conclusion, the barplot of the totals for each ariline makes it seem as if AM WEST would be the clear choice to with because it has almost double the number of on time flights as ALASKA. However, looking at the porportion of flights that were delayed or on time for each airline shows that there was not much of a difference between the airlines. In this case I would select AM WEST airline for our office travel needs because it had nearly double the amount of flights as ALASKA and the greater number of flights might make it easier to select flight times when booking travel for meetings.