I retrieved data from MySQL.
For security purpose, I blanked password, userID and host. Actual codes for MySQL connection are written like:
install.packages(“RMySQL”)
library(RMySQL)
mydb = dbConnect(MySQL(), user=‘’, password=’‘, dbname=’airlines’, host=’’)
You may refer to RMD file in my github to see the details.
dbListTables(mydb)
## [1] "carrier_airlines" "carrier_flight_count" "carrier_status"
flight <- dbGetQuery(mydb, 'select c.carrier_name, b.status_name, a.Los_Angeles, a.Phoenix, a.San_Diego, a.San_Francisco, a.Seattle
from carrier_flight_count a
join carrier_status b on a.status_id = b.status_id
join carrier_airlines c on a.carrier_id = c.carrier_id')
sorted <- gather(flight, "city", "count", 3:7)
sorted
## carrier_name status_name city count
## 1 ALASKA on_time Los_Angeles 497
## 2 ALASKA delayed Los_Angeles 62
## 3 AM WEST on_time Los_Angeles 694
## 4 AM WEST delayed Los_Angeles 117
## 5 ALASKA on_time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on_time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on_time San_Diego 212
## 10 ALASKA delayed San_Diego 20
## 11 AM WEST on_time San_Diego 383
## 12 AM WEST delayed San_Diego 65
## 13 ALASKA on_time San_Francisco 503
## 14 ALASKA delayed San_Francisco 102
## 15 AM WEST on_time San_Francisco 320
## 16 AM WEST delayed San_Francisco 129
## 17 ALASKA on_time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on_time Seattle 201
## 20 AM WEST delayed Seattle 61
sorted_2 <- spread(sorted, status_name, count)
sorted_2
## carrier_name city delayed on_time
## 1 ALASKA Los_Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San_Diego 20 212
## 4 ALASKA San_Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los_Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San_Diego 65 383
## 9 AM WEST San_Francisco 129 320
## 10 AM WEST Seattle 61 201
str(sorted_2)
## 'data.frame': 10 obs. of 4 variables:
## $ carrier_name: chr "ALASKA" "ALASKA" "ALASKA" "ALASKA" ...
## $ city : chr "Los_Angeles" "Phoenix" "San_Diego" "San_Francisco" ...
## $ delayed : int 62 12 20 102 305 117 415 65 129 61
## $ on_time : int 497 221 212 503 1841 694 4840 383 320 201
Q1. Get total delay by carriers and compare. Get proportion of delay by carriers and compare.
Q2. Get total delay by cities and compare. Get proportion of delay by cities and compare.
Q3. Get total delay by carriers broken down by cities and compare. Get proportion of delay by carriers broken down by cities and compare.
Conclusion. Summarize the findings.
I will use summarize, group_by, ggplot, mutate and etc to transform the data.
carriers <- group_by(sorted_2, carrier_name)
summary_carriers <- summarize(
carriers,
sum_delayed = sum(delayed),
sum_on_time = sum(on_time),
total_flights = sum_delayed + sum_on_time,
avg_delayed = mean(delayed),
avg_on_time = mean(on_time),
sd_delayed = sd(delayed),
sd_on_time = sd(on_time),
prop_delayed = round(sum_delayed / total_flights,2)
)
## Warning: package 'bindrcpp' was built under R version 3.4.1
data.frame(summary_carriers)
## carrier_name sum_delayed sum_on_time total_flights avg_delayed
## 1 ALASKA 501 3274 3775 100.2
## 2 AM WEST 787 6438 7225 157.4
## avg_on_time sd_delayed sd_on_time prop_delayed
## 1 654.8 120.0175 678.0982 0.13
## 2 1287.6 147.1625 1994.1889 0.11
ggplot(summary_carriers, aes(carrier_name, prop_delayed)) +
geom_bar(aes(fill = carrier_name), position = "dodge", stat = "identity") +
ylab("Prop. Delayed")
AM West has larger total number of flights than Alaska. Because of this, proportion of delayed flight between 2 carriers serves as much better indication when it comes to comparing who is doing better or worse. From the result above, Alaska has larger prop_delayed. However, we see that avg_delayed and sd_delayed for Alaska are all lower than AM West. At the same time, avg_on_time and sd_on_time for Alaska are all lower than AM West. As you can see, since variables are conflicting each other (counter-intuitive), it is hard to tell which carrier is doing better in general. However, when we talk about “managing overall delayed flights”, we have to focus on prop_delayed mianly. Therefore, I would say AM West is doing better, at least in grouping by carrier_name.
cities <- group_by(sorted_2, city)
summary_city <- summarize(
cities,
sum_delayed = sum(delayed),
sum_on_time = sum(on_time),
total_flights = sum_delayed + sum_on_time,
avg_delayed = mean(delayed),
avg_on_time = mean(on_time),
sd_delayed = sd(delayed),
sd_on_time = sd(on_time),
prop_delayed = round(sum_delayed / total_flights,2)
)
data.frame(summary_city)
## city sum_delayed sum_on_time total_flights avg_delayed
## 1 Los_Angeles 179 1191 1370 89.5
## 2 Phoenix 427 5061 5488 213.5
## 3 San_Diego 85 595 680 42.5
## 4 San_Francisco 231 823 1054 115.5
## 5 Seattle 366 2042 2408 183.0
## avg_on_time sd_delayed sd_on_time prop_delayed
## 1 595.5 38.89087 139.3000 0.13
## 2 2530.5 284.96403 3266.1262 0.08
## 3 297.5 31.81981 120.9153 0.12
## 4 411.5 19.09188 129.4005 0.22
## 5 1021.0 172.53405 1159.6551 0.15
ggplot(summary_city, aes(city, prop_delayed)) +
geom_bar(aes(fill = city), position = "dodge", stat = "identity") +
ylab("Prop. Delayed")
From summary_city, we know that Pheonix has the lowest prop_delayed where as San Francisco has the highest prop_delayed. We should particularly focus on Seattle and Pheonix as sd_delayed and sd_on_time are unusually large.
Let’s now see how prop_delayed, grouping by carrier_name, will change if we get rid of Seattle and Pheonix (outliers)
filt_city <- filter(sorted_2, city != 'Seattle' & city != 'Phoenix' )
filt_city
## carrier_name city delayed on_time
## 1 ALASKA Los_Angeles 62 497
## 2 ALASKA San_Diego 20 212
## 3 ALASKA San_Francisco 102 503
## 4 AM WEST Los_Angeles 117 694
## 5 AM WEST San_Diego 65 383
## 6 AM WEST San_Francisco 129 320
carrier <- group_by(filt_city, carrier_name)
summary_carriers_filt <- summarize(
carrier,
sum_delayed = sum(delayed),
sum_on_time = sum(on_time),
total_flights = sum_delayed + sum_on_time,
avg_delayed = mean(delayed),
avg_on_time = mean(on_time),
sd_delayed = sd(delayed),
sd_on_time = sd(on_time),
prop_delayed = round(sum_delayed / total_flights,2)
)
data.frame(summary_carriers_filt)
## carrier_name sum_delayed sum_on_time total_flights avg_delayed
## 1 ALASKA 184 1212 1396 61.33333
## 2 AM WEST 311 1397 1708 103.66667
## avg_on_time sd_delayed sd_on_time prop_delayed
## 1 404.0000 41.00406 166.3039 0.13
## 2 465.6667 34.01960 200.2357 0.18
Indeed, Alaska has lower prop_delayed once we get rid of 2 outliers. Before we jump the gun, We need to ask, is it really ok to exclude these 2 cities? Well, we need to be careful. In my opinon, excluding 2 cities, just because they look like outliers, is not a very logical approach. In business logic, it could be true that some cities prefer specific carriers to other carriers and it is a general norm. Thus, we should not rule out 2 cities when we are comparing OVERALL prop_delayed in 2 carriers. Therefore, my conclusion of analysis will be based on all 5 cities, not 3 cities.
sorted_3 <- mutate(sorted_2, total_flights = delayed + on_time, prop_delayed = round (delayed / (delayed + on_time),2))
sorted_3
## carrier_name city delayed on_time total_flights prop_delayed
## 1 ALASKA Los_Angeles 62 497 559 0.11
## 2 ALASKA Phoenix 12 221 233 0.05
## 3 ALASKA San_Diego 20 212 232 0.09
## 4 ALASKA San_Francisco 102 503 605 0.17
## 5 ALASKA Seattle 305 1841 2146 0.14
## 6 AM WEST Los_Angeles 117 694 811 0.14
## 7 AM WEST Phoenix 415 4840 5255 0.08
## 8 AM WEST San_Diego 65 383 448 0.15
## 9 AM WEST San_Francisco 129 320 449 0.29
## 10 AM WEST Seattle 61 201 262 0.23
summary(sorted_3)
## carrier_name city delayed on_time
## Length:10 Length:10 Min. : 12.00 Min. : 201.0
## Class :character Class :character 1st Qu.: 61.25 1st Qu.: 245.8
## Mode :character Mode :character Median : 83.50 Median : 440.0
## Mean :128.80 Mean : 971.2
## 3rd Qu.:126.00 3rd Qu.: 646.2
## Max. :415.00 Max. :4840.0
## total_flights prop_delayed
## Min. : 232.0 Min. :0.050
## 1st Qu.: 308.5 1st Qu.:0.095
## Median : 504.0 Median :0.140
## Mean :1100.0 Mean :0.145
## 3rd Qu.: 759.5 3rd Qu.:0.165
## Max. :5255.0 Max. :0.290
ggplot(sorted_3, aes(city, prop_delayed)) +
geom_bar(aes(fill = carrier_name), position = "dodge", stat = "identity") +
ylab("Prop. Delayed")
Using sorted_3, we can see that total_flights in Seattle and Pheonix exceed 1100 (mean of total flights) and definately outliers. Note that 2146 (Seattle) total flights belong to Alaska and 5255 (Pheonix) total fights belong to AM West alone. This shows why sd_delayed and sd_on_time were unsually large for these 2 cities. Not only that, in each every city, Alaska’s prop_delayed is lower than AM West’s. Given that Pheonix had the lowest prop_delayed among 5 cities and Seattle had 2nd largest prop_delayed in general, from summary_city, it is no surprise that prop_delayed is lower in AM west as Pheonix is the largest total_flights for AM West where as Seattle, the 2nd largest prop_delayed city, is the largest portion of total flights for Alaska.
That is hard to tell since each carrier has comparative advantage from one another.
When it comes to grouping by carrier_name, we can say that AM West is doing better in “managing OVERALL delayed flights” and Alaska is doing better in “managing delayed flights ON AVERAGE”
data.frame(summary_carriers)
## carrier_name sum_delayed sum_on_time total_flights avg_delayed
## 1 ALASKA 501 3274 3775 100.2
## 2 AM WEST 787 6438 7225 157.4
## avg_on_time sd_delayed sd_on_time prop_delayed
## 1 654.8 120.0175 678.0982 0.13
## 2 1287.6 147.1625 1994.1889 0.11
When it comes to grouping by carrier_name and city, we can say that Alaska is doing better in “managing delayed flights by CITY”
data.frame(sorted_3)
## carrier_name city delayed on_time total_flights prop_delayed
## 1 ALASKA Los_Angeles 62 497 559 0.11
## 2 ALASKA Phoenix 12 221 233 0.05
## 3 ALASKA San_Diego 20 212 232 0.09
## 4 ALASKA San_Francisco 102 503 605 0.17
## 5 ALASKA Seattle 305 1841 2146 0.14
## 6 AM WEST Los_Angeles 117 694 811 0.14
## 7 AM WEST Phoenix 415 4840 5255 0.08
## 8 AM WEST San_Diego 65 383 448 0.15
## 9 AM WEST San_Francisco 129 320 449 0.29
## 10 AM WEST Seattle 61 201 262 0.23
If we really do want to understand which carriers are doing better in “managing delayed flights” in general, we have to consider other confounding variables such as weather, skills of pilots and airport road conditions and etc. From summary_city, we can see that some cities happened to have higher or lower prop_delayed for unknown reason (Pheonix and San Francisco). It seems that number of flights in these cities are affected by confounding variables. Not only that, we know that each carrier tends to be “skwed” to particular city, Alaska to Seattle and AM West to Phoenix, overall prop_delayed variable could be a bit biased.