Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
First we input the data into a MySQL database. My first attempt at this created a table that was too tidy. I reorganized the table, using the instructional video: https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/ as a reference to create something more untidy to fit with the spirit of the exercise.
dbSendQuery(db, "drop table if exists ontime, cities;")
## <MySQLResult:72374840,0,0>
#I'm am doing this as two tables.
dbSendQuery(db, "CREATE TABLE cities(
city_id INT,
city varchar(50),
PRIMARY KEY (city_id)
);")
## <MySQLResult:196357432,0,1>
dbSendQuery(db, "CREATE TABLE ontime(
city_id INT,
air_line varchar(50),
arrived varchar(50),
n INT,
FOREIGN KEY (city_id) REFERENCES cities(city_id)
);")
## <MySQLResult:199386456,0,2>
#populate the cities table
dbSendQuery(db, "INSERT INTO cities(
city_id,city)
VALUES (1, 'Los Angeles');")
## <MySQLResult:2,0,3>
dbSendQuery(db, "INSERT INTO cities(
city_id,city)
VALUES (2, 'Phoenix');")
## <MySQLResult:200437344,0,4>
dbSendQuery(db, "INSERT INTO cities(
city_id,city)
VALUES (3, 'San Diego');")
## <MySQLResult:0,0,5>
dbSendQuery(db, "INSERT INTO cities(
city_id,city)
VALUES (4, 'San Francisco');")
## <MySQLResult:0,0,6>
dbSendQuery(db, "INSERT INTO cities(
city_id,city)
VALUES (5, 'Seatle');")
## <MySQLResult:190551376,0,7>
#populate the ontime table for the Alaska Flights
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(1, 'Alaska', 'ontime', 497);")
## <MySQLResult:72357400,0,8>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(1, 'Alaska', 'delayed', 62);")
## <MySQLResult:0,0,9>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(2, 'Alaska', 'ontime', 221);")
## <MySQLResult:1,0,10>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(2, 'Alaska', 'delayed', 12);")
## <MySQLResult:2,0,11>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(3, 'Alaska', 'ontime', 212);")
## <MySQLResult:111687552,0,12>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(3, 'Alaska', 'delayed', 20);")
## <MySQLResult:150622672,0,13>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(4, 'Alaska', 'ontime', 503);")
## <MySQLResult:72355472,0,14>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(4, 'Alaska', 'delayed', 102);")
## <MySQLResult:72355472,0,15>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(5, 'Alaska', 'ontime', 1803);")
## <MySQLResult:150623400,0,16>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(5, 'Alaska', 'delayed', 305);")
## <MySQLResult:2,0,17>
#Populate the ontime table for AM West Flights
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(1, 'AM West', 'ontime', 694);")
## <MySQLResult:72355472,0,18>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(1, 'AM West', 'delayed', 117);")
## <MySQLResult:111781816,0,19>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(2, 'AM West', 'ontime', 4840);")
## <MySQLResult:2,0,20>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(2, 'AM West', 'delayed', 415);")
## <MySQLResult:0,0,21>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(3, 'AM West', 'ontime', 383);")
## <MySQLResult:114690696,0,22>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(3, 'AM West', 'delayed', 65);")
## <MySQLResult:147931528,0,23>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(4, 'AM West', 'ontime', 320);")
## <MySQLResult:111781816,0,24>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(4, 'AM West', 'delayed', 129);")
## <MySQLResult:72355472,0,25>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(5, 'AM West', 'ontime', 201);")
## <MySQLResult:2,0,26>
dbSendQuery(db, "INSERT INTO ontime
(city_id, air_line, arrived, n)
VALUES(5, 'AM West', 'delayed', 61);")
## <MySQLResult:1364425037,0,27>
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
First we will load tidyr and dplyr:
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next I’ll read the information from the MySQL database into r.
arrival<- db %>% dbSendQuery("SELECT DISTINCT a.air_line, c.city, a.arrived, a.n FROM ontime a, cities c WHERE a.city_id = c.city_id GROUP BY a.air_line, c.city, a.arrived;") %>% fetch() %>% tbl_df()
arrival
## # A tibble: 20 x 4
## air_line city arrived n
## <chr> <chr> <chr> <int>
## 1 Alaska Los Angeles delayed 62
## 2 Alaska Los Angeles ontime 497
## 3 Alaska Phoenix delayed 12
## 4 Alaska Phoenix ontime 221
## 5 Alaska San Diego delayed 20
## 6 Alaska San Diego ontime 212
## 7 Alaska San Francisco delayed 102
## 8 Alaska San Francisco ontime 503
## 9 Alaska Seatle delayed 305
## 10 Alaska Seatle ontime 1803
## 11 AM West Los Angeles delayed 117
## 12 AM West Los Angeles ontime 694
## 13 AM West Phoenix delayed 415
## 14 AM West Phoenix ontime 4840
## 15 AM West San Diego delayed 65
## 16 AM West San Diego ontime 383
## 17 AM West San Francisco delayed 129
## 18 AM West San Francisco ontime 320
## 19 AM West Seatle delayed 61
## 20 AM West Seatle ontime 201
#First time I used the %>% operator and it worked. I have not smiled this big in awhile.
It looks like the query I made put the data in a vertical table. I will use tidyr’s spread to organize it further:
arrival <- arrival %>% spread(arrived, n)
arrival
## # A tibble: 10 x 4
## air_line city delayed ontime
## * <chr> <chr> <int> <int>
## 1 Alaska Los Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San Diego 20 212
## 4 Alaska San Francisco 102 503
## 5 Alaska Seatle 305 1803
## 6 AM West Los Angeles 117 694
## 7 AM West Phoenix 415 4840
## 8 AM West San Diego 65 383
## 9 AM West San Francisco 129 320
## 10 AM West Seatle 61 201
I also want to use dplyr’s mutate to add a row of the percentage of on time flights, since the numbers range from 4840 to 12, it’s difficult to look at the data and tell which city or airline has the best performance.
arrival <- arrival %>% mutate(total_flights = ontime+delayed)
## Warning: package 'bindrcpp' was built under R version 3.4.1
arrival <- arrival %>% mutate(percent_ontime = ontime / total_flights*100)
arrival
## # A tibble: 10 x 6
## air_line city delayed ontime total_flights percent_ontime
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Alaska Los Angeles 62 497 559 88.90877
## 2 Alaska Phoenix 12 221 233 94.84979
## 3 Alaska San Diego 20 212 232 91.37931
## 4 Alaska San Francisco 102 503 605 83.14050
## 5 Alaska Seatle 305 1803 2108 85.53131
## 6 AM West Los Angeles 117 694 811 85.57337
## 7 AM West Phoenix 415 4840 5255 92.10276
## 8 AM West San Diego 65 383 448 85.49107
## 9 AM West San Francisco 129 320 449 71.26949
## 10 AM West Seatle 61 201 262 76.71756
From this table it looks as if AM West handles more flights in this region than Alaska Airlines. AM West handles more flights out of Phoenix than Alaska handles in total, but more on that in the next section.
Perform analysis to compare the arrival delays for the two airlines.
First we will do some simple summary statistics.
arrival %>% summarise(median_percent = median(percent_ontime), sd_percent = sd(percent_ontime))
## # A tibble: 1 x 2
## median_percent sd_percent
## <dbl> <dbl>
## 1 85.55234 7.167932
The median ontime percentage is 85.6% with a standard deviation of 7.2% this will come in useful when we evaluate top performing cities and and airlines below.
arrival %>% arrange(desc(percent_ontime))
## # A tibble: 10 x 6
## air_line city delayed ontime total_flights percent_ontime
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Alaska Phoenix 12 221 233 94.84979
## 2 AM West Phoenix 415 4840 5255 92.10276
## 3 Alaska San Diego 20 212 232 91.37931
## 4 Alaska Los Angeles 62 497 559 88.90877
## 5 AM West Los Angeles 117 694 811 85.57337
## 6 Alaska Seatle 305 1803 2108 85.53131
## 7 AM West San Diego 65 383 448 85.49107
## 8 Alaska San Francisco 102 503 605 83.14050
## 9 AM West Seatle 61 201 262 76.71756
## 10 AM West San Francisco 129 320 449 71.26949
arrival %>% filter(percent_ontime > median(percent_ontime)) %>% select(air_line,percent_ontime) %>% arrange()
## # A tibble: 5 x 2
## air_line percent_ontime
## <chr> <dbl>
## 1 Alaska 88.90877
## 2 Alaska 94.84979
## 3 Alaska 91.37931
## 4 AM West 85.57337
## 5 AM West 92.10276
Alaska Air holds 3 of the 5 top spots in on time percentage with 2 above 90%, whereas AM West only has 1 above 90%.
arrival %>% filter(percent_ontime > median(percent_ontime)) %>% select(city,percent_ontime) %>% arrange()
## # A tibble: 5 x 2
## city percent_ontime
## <chr> <dbl>
## 1 Los Angeles 88.90877
## 2 Phoenix 94.84979
## 3 San Diego 91.37931
## 4 Los Angeles 85.57337
## 5 Phoenix 92.10276
Interestingly Phoenix is a top performing city for both airlines, however AM West handles 4840 flights out of Phoenix, whereas Alaska only handles 233.
arrival %>% filter(percent_ontime < median(percent_ontime)) %>% select(air_line,percent_ontime) %>% arrange()
## # A tibble: 5 x 2
## air_line percent_ontime
## <chr> <dbl>
## 1 Alaska 83.14050
## 2 Alaska 85.53131
## 3 AM West 85.49107
## 4 AM West 71.26949
## 5 AM West 76.71756
Conversely, AM West holds 3 of 5 bottom-half performing cities, with two below 80%. Alaska lowest city is 83.1%
arrival %>% filter(percent_ontime < median(percent_ontime)) %>% select(city,percent_ontime) %>% arrange()
## # A tibble: 5 x 2
## city percent_ontime
## <chr> <dbl>
## 1 San Francisco 83.14050
## 2 Seatle 85.53131
## 3 San Diego 85.49107
## 4 San Francisco 71.26949
## 5 Seatle 76.71756
Seattle is a bottom-half performing city for both airlines. However, that is a small market for AM West at 262 flights total, but that is Alaska’s top market with 2108 flights total.
par(las=2)
barplot(arrival$percent_ontime,names.arg =arrival$city,horiz = TRUE,cex.names = 0.5, col = c("darkgreen", "darkgreen","darkgreen","darkgreen","darkgreen", "Orange","Orange","Orange","Orange","Orange" ), legend = c("Orange = AM West", "Green = Alaska"),args.legend = list(x ='topright', bty='n', inset=c(-0.075,0), cex=0.8))
city_by_airline <- arrival %>% select(city,air_line,percent_ontime) %>%spread(air_line,percent_ontime)
boxplot(city_by_airline$Alaska, city_by_airline$`AM West`, names = c("Alaska", "AM West") )
Again, from these plots it appears as if Alaska Airlines has more cities with a higher percent on time than AM West. However, we should not stop here as there are other factors to consider.
par(las=2)
barplot(arrival$total_flights,names.arg =arrival$city,horiz = TRUE,cex.names = 0.5, col = c("darkgreen", "darkgreen","darkgreen","darkgreen","darkgreen", "Orange","Orange","Orange","Orange","Orange" ), legend = c("Orange = AM West", "Green = Alaska"),args.legend = list(x ='topright', bty='n', inset=c(-0.075,0), cex=0.8))
As stated above, AM West handles more flights out of Phoenix (a top-half market) than Alaska does all five cities combined. Furthermore, Alaska’s top market is in the bottom-half of performance. Below we will take the ratio of AM West flights to Alaska flights.
alaska <- arrival %>% filter(air_line == "Alaska")
al_sum <-sum(alaska$total_flights)
am_west <- arrival %>% filter(air_line == "AM West")
am_sum <- sum(am_west$total_flights)
am_sum/al_sum
## [1] 1.933369
AM West handles nearly twice as much traffic. With this in mind we need to devise a better indicator of on time percentage. I will calculate the total ontime flights divided by the total flights for each airline to get a total on time percentage.
alaska <- arrival %>% filter(air_line == "Alaska")
al_tot_on <-sum(alaska$ontime)
al_tot_on/al_sum
## [1] 0.8659352
am_west <- arrival %>% filter(air_line == "AM West")
am_tot_on <- sum(am_west$ontime)
am_tot_on/am_sum
## [1] 0.8910727
So it appears that AM West slightly out-performs Alaska overall with 89.1% on time flights versus 86.6%. Weighting the on time percentages by the size of each city’s flights over the total for each airline may be a better indicator.
al_weighted <- arrival %>% filter(air_line == "Alaska")%>%mutate(weighted_ave = percent_ontime*total_flights/al_sum)
al_weighted %>% select(city,weighted_ave)
## # A tibble: 5 x 2
## city weighted_ave
## <chr> <dbl>
## 1 Los Angeles 13.299438
## 2 Phoenix 5.913835
## 3 San Diego 5.673000
## 4 San Francisco 13.459995
## 5 Seatle 48.247257
am_weighted <- arrival %>% filter(air_line == "AM West")%>%mutate(weighted_ave = percent_ontime*total_flights/am_sum)
am_weighted %>% select(city,weighted_ave)
## # A tibble: 5 x 2
## city weighted_ave
## <chr> <dbl>
## 1 Los Angeles 9.605536
## 2 Phoenix 66.989619
## 3 San Diego 5.301038
## 4 San Francisco 4.429066
## 5 Seatle 2.782007
Finally we see that for Alaska, Seattle which is it’s largest market, but a bottom-half performing city makes up 48% of it’s on time flights, whereas for AM West Phoenix which is by large the largest market in the set and a top-half performing city makes up 67% of AM West’s on time flights. This further supports the claim the AM West does better overall even though Alaska has more top-half cities in it’s list.