I retrieved data from MySQL.

For security purpose, I blanked password, userID and host. Actual codes for MySQL connection are written like:

install.packages(“RMySQL”)

library(RMySQL)

mydb = dbConnect(MySQL(), user=‘’, password=’‘, dbname=’airlines’, host=’’)

You may refer to RMD file in my github to see the details.

dbListTables(mydb)

## [1] "carrier_airlines"     "carrier_flight_count" "carrier_status"

flight <- dbGetQuery(mydb, 'select c.carrier_name, b.status_name, a.Los_Angeles, a.Phoenix, a.San_Diego, a.San_Francisco, a.Seattle
from carrier_flight_count a
join carrier_status b on a.status_id = b.status_id
join carrier_airlines c on a.carrier_id = c.carrier_id')

sorted <- gather(flight, "city", "count", 3:7)
sorted

##    carrier_name status_name          city count
## 1        ALASKA     on_time   Los_Angeles   497
## 2        ALASKA     delayed   Los_Angeles    62
## 3       AM WEST     on_time   Los_Angeles   694
## 4       AM WEST     delayed   Los_Angeles   117
## 5        ALASKA     on_time       Phoenix   221
## 6        ALASKA     delayed       Phoenix    12
## 7       AM WEST     on_time       Phoenix  4840
## 8       AM WEST     delayed       Phoenix   415
## 9        ALASKA     on_time     San_Diego   212
## 10       ALASKA     delayed     San_Diego    20
## 11      AM WEST     on_time     San_Diego   383
## 12      AM WEST     delayed     San_Diego    65
## 13       ALASKA     on_time San_Francisco   503
## 14       ALASKA     delayed San_Francisco   102
## 15      AM WEST     on_time San_Francisco   320
## 16      AM WEST     delayed San_Francisco   129
## 17       ALASKA     on_time       Seattle  1841
## 18       ALASKA     delayed       Seattle   305
## 19      AM WEST     on_time       Seattle   201
## 20      AM WEST     delayed       Seattle    61

sorted_2 <- spread(sorted, status_name, count)
sorted_2

##    carrier_name          city delayed on_time
## 1        ALASKA   Los_Angeles      62     497
## 2        ALASKA       Phoenix      12     221
## 3        ALASKA     San_Diego      20     212
## 4        ALASKA San_Francisco     102     503
## 5        ALASKA       Seattle     305    1841
## 6       AM WEST   Los_Angeles     117     694
## 7       AM WEST       Phoenix     415    4840
## 8       AM WEST     San_Diego      65     383
## 9       AM WEST San_Francisco     129     320
## 10      AM WEST       Seattle      61     201

str(sorted_2)

## 'data.frame':    10 obs. of  4 variables:
##  $ carrier_name: chr  "ALASKA" "ALASKA" "ALASKA" "ALASKA" ...
##  $ city        : chr  "Los_Angeles" "Phoenix" "San_Diego" "San_Francisco" ...
##  $ delayed     : int  62 12 20 102 305 117 415 65 129 61
##  $ on_time     : int  497 221 212 503 1841 694 4840 383 320 201

Road map of analysis

Q1. Get total delay by carriers and compare. Get proportion of delay by carriers and compare.

Q2. Get total delay by cities and compare. Get proportion of delay by cities and compare.

Q3. Get total delay by carriers broken down by cities and compare. Get proportion of delay by carriers broken down by cities and compare.

Conclusion. Summarize the findings.

Techniques used for analysis

I will use summarize, group_by, ggplot, mutate and etc to transform the data.

Q1. summary by carrier

carriers <- group_by(sorted_2, carrier_name)

summary_carriers <- summarize(
  carriers, 
  sum_delayed = sum(delayed),
  sum_on_time = sum(on_time), 
  total_flights = sum_delayed + sum_on_time,
  avg_delayed = mean(delayed),
  avg_on_time = mean(on_time),
  sd_delayed = sd(delayed),
  sd_on_time = sd(on_time),
  prop_delayed = round(sum_delayed / total_flights,2)
)

## Warning: package 'bindrcpp' was built under R version 3.4.1

data.frame(summary_carriers)

##   carrier_name sum_delayed sum_on_time total_flights avg_delayed
## 1       ALASKA         501        3274          3775       100.2
## 2      AM WEST         787        6438          7225       157.4
##   avg_on_time sd_delayed sd_on_time prop_delayed
## 1       654.8   120.0175   678.0982         0.13
## 2      1287.6   147.1625  1994.1889         0.11

ggplot(summary_carriers, aes(carrier_name, prop_delayed)) + 
  geom_bar(aes(fill = carrier_name), position = "dodge", stat = "identity") + 
  ylab("Prop. Delayed")

AM West has larger total number of flights than Alaska. Because of this, proportion of delayed flight between 2 carriers serves as much better indication when it comes to comparing who is doing better or worse. From the result above, Alaska has larger prop_delayed. However, we see that avg_delayed and sd_delayed for Alaska are all lower than AM West. At the same time, avg_on_time and sd_on_time for Alaska are all lower than AM West. As you can see, since variables are conflicting each other (counter-intuitive), it is hard to tell which carrier is doing better in general. However, when we talk about “managing overall delayed flights”, we have to focus on prop_delayed mianly. Therefore, I would say AM West is doing better, at least in grouping by carrier_name.

Q2. summary by city

cities <- group_by(sorted_2, city)

summary_city <- summarize(
  cities, 
  sum_delayed = sum(delayed),
  sum_on_time = sum(on_time), 
  total_flights = sum_delayed + sum_on_time,
  avg_delayed = mean(delayed),
  avg_on_time = mean(on_time),
  sd_delayed = sd(delayed),
  sd_on_time = sd(on_time),
  prop_delayed = round(sum_delayed / total_flights,2)
)

data.frame(summary_city)

##            city sum_delayed sum_on_time total_flights avg_delayed
## 1   Los_Angeles         179        1191          1370        89.5
## 2       Phoenix         427        5061          5488       213.5
## 3     San_Diego          85         595           680        42.5
## 4 San_Francisco         231         823          1054       115.5
## 5       Seattle         366        2042          2408       183.0
##   avg_on_time sd_delayed sd_on_time prop_delayed
## 1       595.5   38.89087   139.3000         0.13
## 2      2530.5  284.96403  3266.1262         0.08
## 3       297.5   31.81981   120.9153         0.12
## 4       411.5   19.09188   129.4005         0.22
## 5      1021.0  172.53405  1159.6551         0.15

ggplot(summary_city, aes(city, prop_delayed)) + 
  geom_bar(aes(fill = city), position = "dodge", stat = "identity") + 
  ylab("Prop. Delayed")

From summary_city, we know that Pheonix has the lowest prop_delayed where as San Francisco has the highest prop_delayed. We should particularly focus on Seattle and Pheonix as sd_delayed and sd_on_time are unusually large.

Let’s now see how prop_delayed, grouping by carrier_name, will change if we get rid of Seattle and Pheonix (outliers)

filt_city <- filter(sorted_2, city != 'Seattle' & city != 'Phoenix' )
filt_city

##   carrier_name          city delayed on_time
## 1       ALASKA   Los_Angeles      62     497
## 2       ALASKA     San_Diego      20     212
## 3       ALASKA San_Francisco     102     503
## 4      AM WEST   Los_Angeles     117     694
## 5      AM WEST     San_Diego      65     383
## 6      AM WEST San_Francisco     129     320

carrier <- group_by(filt_city, carrier_name)

summary_carriers_filt <- summarize(
  carrier, 
  sum_delayed = sum(delayed),
  sum_on_time = sum(on_time), 
  total_flights = sum_delayed + sum_on_time,
  avg_delayed = mean(delayed),
  avg_on_time = mean(on_time),
  sd_delayed = sd(delayed),
  sd_on_time = sd(on_time),
  prop_delayed = round(sum_delayed / total_flights,2)
)


data.frame(summary_carriers_filt)

##   carrier_name sum_delayed sum_on_time total_flights avg_delayed
## 1       ALASKA         184        1212          1396    61.33333
## 2      AM WEST         311        1397          1708   103.66667
##   avg_on_time sd_delayed sd_on_time prop_delayed
## 1    404.0000   41.00406   166.3039         0.13
## 2    465.6667   34.01960   200.2357         0.18

Indeed, Alaska has lower prop_delayed once we get rid of 2 outliers. Before we jump the gun, We need to ask, is it really ok to exclude these 2 cities? Well, we need to be careful. In my opinon, excluding 2 cities, just because they look like outliers, is not a very logical approach. In business logic, it could be true that some cities prefer specific carriers to other carriers and it is a general norm. Thus, we should not rule out 2 cities when we are comparing OVERALL prop_delayed in 2 carriers. Therefore, my conclusion of analysis will be based on all 5 cities, not 3 cities.

Q3. summary by carrier and city

sorted_3 <- mutate(sorted_2, total_flights = delayed + on_time, prop_delayed = round (delayed / (delayed + on_time),2))
sorted_3

##    carrier_name          city delayed on_time total_flights prop_delayed
## 1        ALASKA   Los_Angeles      62     497           559         0.11
## 2        ALASKA       Phoenix      12     221           233         0.05
## 3        ALASKA     San_Diego      20     212           232         0.09
## 4        ALASKA San_Francisco     102     503           605         0.17
## 5        ALASKA       Seattle     305    1841          2146         0.14
## 6       AM WEST   Los_Angeles     117     694           811         0.14
## 7       AM WEST       Phoenix     415    4840          5255         0.08
## 8       AM WEST     San_Diego      65     383           448         0.15
## 9       AM WEST San_Francisco     129     320           449         0.29
## 10      AM WEST       Seattle      61     201           262         0.23

summary(sorted_3)

##  carrier_name           city              delayed          on_time      
##  Length:10          Length:10          Min.   : 12.00   Min.   : 201.0  
##  Class :character   Class :character   1st Qu.: 61.25   1st Qu.: 245.8  
##  Mode  :character   Mode  :character   Median : 83.50   Median : 440.0  
##                                        Mean   :128.80   Mean   : 971.2  
##                                        3rd Qu.:126.00   3rd Qu.: 646.2  
##                                        Max.   :415.00   Max.   :4840.0  
##  total_flights     prop_delayed  
##  Min.   : 232.0   Min.   :0.050  
##  1st Qu.: 308.5   1st Qu.:0.095  
##  Median : 504.0   Median :0.140  
##  Mean   :1100.0   Mean   :0.145  
##  3rd Qu.: 759.5   3rd Qu.:0.165  
##  Max.   :5255.0   Max.   :0.290

ggplot(sorted_3, aes(city, prop_delayed)) + 
  geom_bar(aes(fill = carrier_name), position = "dodge", stat = "identity") + 
  ylab("Prop. Delayed")

Using sorted_3, we can see that total_flights in Seattle and Pheonix exceed 1100 (mean of total flights) and definately outliers. Note that 2146 (Seattle) total flights belong to Alaska and 5255 (Pheonix) total fights belong to AM West alone. This shows why sd_delayed and sd_on_time were unsually large for these 2 cities. Not only that, in each every city, Alaska’s prop_delayed is lower than AM West’s. Given that Pheonix had the lowest prop_delayed among 5 cities and Seattle had 2nd largest prop_delayed in general, from summary_city, it is no surprise that prop_delayed is lower in AM west as Pheonix is the largest total_flights for AM West where as Seattle, the 2nd largest prop_delayed city, is the largest portion of total flights for Alaska.

conclusion: Which carrier is doing better in “managing delayed flights” in general?

That is hard to tell since each carrier has comparative advantage from one another.

When it comes to grouping by carrier_name, we can say that AM West is doing better in “managing OVERALL delayed flights” and Alaska is doing better in “managing delayed flights ON AVERAGE”

data.frame(summary_carriers)

##   carrier_name sum_delayed sum_on_time total_flights avg_delayed
## 1       ALASKA         501        3274          3775       100.2
## 2      AM WEST         787        6438          7225       157.4
##   avg_on_time sd_delayed sd_on_time prop_delayed
## 1       654.8   120.0175   678.0982         0.13
## 2      1287.6   147.1625  1994.1889         0.11

When it comes to grouping by carrier_name and city, we can say that Alaska is doing better in “managing delayed flights by CITY”

data.frame(sorted_3)

##    carrier_name          city delayed on_time total_flights prop_delayed
## 1        ALASKA   Los_Angeles      62     497           559         0.11
## 2        ALASKA       Phoenix      12     221           233         0.05
## 3        ALASKA     San_Diego      20     212           232         0.09
## 4        ALASKA San_Francisco     102     503           605         0.17
## 5        ALASKA       Seattle     305    1841          2146         0.14
## 6       AM WEST   Los_Angeles     117     694           811         0.14
## 7       AM WEST       Phoenix     415    4840          5255         0.08
## 8       AM WEST     San_Diego      65     383           448         0.15
## 9       AM WEST San_Francisco     129     320           449         0.29
## 10      AM WEST       Seattle      61     201           262         0.23

Things that need to be adjusted

If we really do want to understand which carriers are doing better in “managing delayed flights” in general, we have to consider other confounding variables such as weather, skills of pilots and airport road conditions and etc. From summary_city, we can see that some cities happened to have higher or lower prop_delayed for unknown reason (Pheonix and San Francisco). It seems that number of flights in these cities are affected by confounding variables. Not only that, we know that each carrier tends to be “skwed” to particular city, Alaska to Seattle and AM West to Phoenix, overall prop_delayed variable could be a bit biased.

Week5.assignment.DATA607.

Sang Yoon (Andy) Hwang