Please indicate
Plot a “time series” of the proportion of flights that were delayed by > 30 minutes on each day.
Using this plot, describe the seasonality of when delays over 30 minutes tend to occur.
Note: Before drawing any conclusions, I should investigate whether there is a pattern to when there are missing values. By removing them, I could be missing something.
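As a rough sketch of the computation behind this plot, the daily proportion and the daily count of missing values can be calculated side by side, so the missingness pattern can be inspected before anything is dropped. The toy data frame and the column names `date` and `dep_delay` are assumptions for illustration, not the real data.

```r
library(dplyr)

# Toy stand-in for the flights table; `date` and `dep_delay` are assumed names.
toy <- data.frame(
  date = as.Date(c("2011-01-01", "2011-01-01", "2011-01-01",
                   "2011-01-02", "2011-01-02", "2011-01-02")),
  dep_delay = c(45, 5, NA, 31, 90, 0)
)

daily <- toy %>%
  group_by(date) %>%
  summarise(
    # how many delay values are missing on this day
    n_missing = sum(is.na(dep_delay)),
    # share of the non-missing flights delayed by more than 30 minutes
    prop_delayed = mean(dep_delay > 30, na.rm = TRUE)
  )

print(daily)
```

If `n_missing` clusters on particular days, dropping the missing values may bias the proportions for exactly those days.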
It appears in this plot that outliers may be obscuring seasonal trends: they stretch the y scale and make the typical values hard to see. Perhaps I could bin the days into months and draw box plots to better show the trend that I’m trying to see. I don’t want to get rid of the outliers completely, as they may say something (for example, there may be more delays on holidays).
This still may not be the ideal visualization, but I like it better. It is definitely worth noting all the outliers. Based on the median, which I think is a good statistic to look at here because it isn’t skewed by outliers, there is an increase in the proportion of delayed flights during June and July, while there is a notable decrease in September, October and November.
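A minimal sketch of the monthly binning, using made-up daily proportions (the values and the column name `prop_delayed` are assumptions). It also shows why the median is the safer summary here: one extreme day moves the mean but not the median.

```r
library(dplyr)

# Made-up daily proportions; one June day is an outlier.
daily <- data.frame(
  date = as.Date(c("2011-06-01", "2011-06-02", "2011-06-03",
                   "2011-09-01", "2011-09-02", "2011-09-03")),
  prop_delayed = c(0.10, 0.12, 0.60, 0.04, 0.05, 0.06)
)

monthly <- daily %>%
  # extract the month number from the date (base R, no extra packages)
  mutate(month = as.integer(format(date, "%m"))) %>%
  group_by(month) %>%
  summarise(
    median_prop = median(prop_delayed),
    mean_prop = mean(prop_delayed)
  )

print(monthly)

# With ggplot2 loaded, the box plots could be drawn along these lines:
# ggplot(mutate(daily, month = format(date, "%m")),
#        aes(x = month, y = prop_delayed)) + geom_boxplot()
```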
Some people prefer flying on older planes. Even though they aren’t as nice, they tend to have more room. Which airlines should these people favor?
| carrier | mean_yr |
|---|---|
| MQ | 1981.579 |
| AA | 1986.675 |
| DL | 1990.240 |
| US | 1991.922 |
| UA | 1996.365 |
| WN | 1997.943 |
| XE | 2000.442 |
| CO | 2001.137 |
| FL | 2002.110 |
| F9 | 2003.878 |
| YV | 2003.886 |
| EV | 2004.586 |
| OO | 2004.983 |
| AS | 2005.554 |
| B6 | 2006.080 |
This shows which airlines have, on average, the oldest planes. Based on the mean year of manufacture per airline, MQ and AA had the oldest planes on average. However, a mean like this could easily be driven by a few really old planes. In fact, when isolating MQ, we can see that only 57 of 4,648 entries have a value for the plane’s year. It would be good to see the distribution for each carrier.
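One way to check how much data sits behind each mean is to count the non-missing years per carrier and put the median next to the mean. This is only a sketch on toy data; the column names `carrier` and `year` follow the tables here, and the values are invented.

```r
library(dplyr)

# Invented example: MQ has a single known year, AA has all three.
toy <- data.frame(
  carrier = c("MQ", "MQ", "MQ", "MQ", "AA", "AA", "AA"),
  year = c(1975, NA, NA, NA, 1985, 1988, 1990)
)

coverage <- toy %>%
  group_by(carrier) %>%
  summarise(
    n_rows = n(),                           # all entries for the carrier
    n_known = sum(!is.na(year)),            # entries with a recorded year
    mean_yr = mean(year, na.rm = TRUE),
    median_yr = median(year, na.rm = TRUE)
  )

print(coverage)
```

A carrier whose `n_known` is tiny relative to `n_rows` should probably be flagged rather than ranked.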
## Warning: Removed 11402 rows containing non-finite values (stat_bin).
But what would really be effective is if we could see what year planes started getting crammed and then look at proportion of planes from before that year …
GOOGLE KNOWS ALL!
USA Today on airline seats gives some insight into how the seat pitch and seat width of commercial airplanes have evolved over recent years. While different airlines show different ranges of change in seat pitch, seat width took a dive around 1995. Thus, I’ll look at the proportion of planes made by ’95 on each airline.
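```r
library(dplyr)

# Attach each plane's build year to its flights (one row per flight).
flights <-
  full_join(select(planes, plane, year), flights, by = "plane") %>%
  group_by(carrier) %>%
  mutate(p_old = mean(year <= 1995, na.rm = TRUE))

# p_old is the share of each carrier's flights flown on planes built in or
# before 1995. Sort on the numeric value, then format each value for display
# (format() returns character strings, which would otherwise sort as text).
p_old <- flights %>%
  group_by(carrier) %>%
  summarise(proportion_old = mean(p_old, na.rm = TRUE)) %>%
  arrange(desc(proportion_old)) %>%
  mutate(proportion_old = sapply(proportion_old, format, digits = 4))

p_old %>% knitr::kable()
```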
| carrier | proportion_old |
|---|---|
| MQ | 1 |
| AA | 0.9878 |
| US | 0.7919 |
| DL | 0.7386 |
| UA | 0.4093 |
| WN | 0.3898 |
| CO | 0.1029 |
| YV | 0.01266 |
| OO | 0.009401 |
| FL | 0.008172 |
| AS | 0 |
| B6 | 0 |
| EV | 0 |
| F9 | 0 |
| XE | 0 |
This analysis shows MQ, AA, US and DL as the carriers with the highest proportion of planes made in or before 1995, and thus probably the most spacious economy seating. In my opinion, this second analysis is more informative for the traveler who is interested in spacious seats. I would have to discount MQ, as too many of its values are missing (discussed above). Based on the histograms, I would say US, AA, DL and WN are pretty good bets.
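One caveat: because the year is joined onto one row per flight, the proportion is weighted by how often each plane flies. A possible alternative, sketched here on invented data, is to count each plane once with `distinct()` before averaging. The tail numbers and years below are hypothetical.

```r
library(dplyr)

# Invented data: plane N1 flies three times, the others once.
toy <- data.frame(
  carrier = c("AA", "AA", "AA", "AA", "WN", "WN"),
  plane = c("N1", "N1", "N1", "N2", "N3", "N4"),
  year = c(1990, 1990, 1990, 2001, 1994, 2005)
)

# Weighted by flights: every row counts.
per_flight <- toy %>%
  group_by(carrier) %>%
  summarise(prop_old = mean(year <= 1995, na.rm = TRUE))

# Weighted by planes: each (carrier, plane) pair counts once.
per_plane <- toy %>%
  distinct(carrier, plane, year) %>%
  group_by(carrier) %>%
  summarise(prop_old = mean(year <= 1995, na.rm = TRUE))

print(per_flight)
print(per_plane)
```

Which weighting is preferable depends on the question: a traveler picking a random flight cares about the per-flight figure, while a description of the fleet calls for the per-plane figure.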
For example, Southwest Airlines Flight 60 to Dallas consists of a single flight path, but since it flew 299 times in 2013, it would be counted as 299 flights.
| state | n |
|---|---|
| TX | 17230 |
| FL | 3992 |
| LA | 3362 |
| CA | 2792 |
| OK | 2350 |
| IL | 2094 |
| NV | 1543 |
| CO | 1438 |
| TN | 1377 |
| AZ | 1369 |
| MO | 1334 |
| MD | 1227 |
| NM | 1019 |
| MS | 1016 |
| NA | 729 |
| AL | 691 |
| SC | 588 |
| PA | 437 |
| NJ | 390 |
| AR | 365 |

| state | count |
|---|---|
| TX | 770 |
| FL | 297 |
| LA | 236 |
| CA | 219 |
| OK | 178 |
| IL | 163 |
| NV | 123 |
| CO | 121 |
| TN | 107 |
| AZ | 106 |
| MO | 101 |
| NM | 80 |
| MD | 79 |
| MS | 78 |
| AL | 52 |
| NA | 52 |
| SC | 43 |
| PA | 35 |
| AR | 28 |
| NJ | 26 |
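The difference between counting flights and counting distinct flight paths could be sketched as follows. The toy rows and the choice to identify a path by carrier plus flight number are assumptions made for illustration.

```r
library(dplyr)

# Invented data: WN flight 60 appears three times but is one path.
toy <- data.frame(
  carrier = c("WN", "WN", "WN", "WN", "CO"),
  flight = c(60, 60, 60, 12, 1),
  state = c("TX", "TX", "TX", "LA", "TX")
)

# Every row is one flight.
n_flights <- toy %>%
  count(state, sort = TRUE)

# Collapse repeated (carrier, flight) pairs first, then count paths.
n_paths <- toy %>%
  distinct(carrier, flight, state) %>%
  count(state, sort = TRUE, name = "count")

print(n_flights)
print(n_paths)
```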
I want to know proportionately what regions (NE, south, west, midwest) each carrier flies to/from Houston in the month of July. Consider the month() function from the lubridate package.
| carrier | region | n | carrier_sum | proportion |
|---|---|---|---|---|
| AA | south | 273 | 273 | 1 |
| AS | west | 31 | 31 | 1 |
| B6 | NE | 62 | 62 | 1 |
| CO | midwest | 713 | 6190 | 0.1152 |
| CO | NE | 1060 | 6190 | 0.1712 |
| CO | south | 2073 | 6190 | 0.3349 |
| CO | west | 2246 | 6190 | 0.3628 |
| CO | NA | 98 | 6190 | 0.0158 |
| DL | midwest | 47 | 226 | 0.208 |
| DL | south | 179 | 226 | 0.792 |
| EV | midwest | 95 | 164 | 0.579 |
| EV | south | 69 | 164 | 0.421 |
| F9 | west | 88 | 88 | 1 |
| FL | south | 195 | 213 | 0.9155 |
| FL | NA | 18 | 213 | 0.0845 |
| MQ | midwest | 117 | 410 | 0.285 |
| MQ | south | 200 | 410 | 0.488 |
| MQ | west | 93 | 410 | 0.227 |
| OO | midwest | 349 | 1586 | 0.2201 |
| OO | NE | 91 | 1586 | 0.0574 |
| OO | south | 663 | 1586 | 0.4180 |
| OO | west | 483 | 1586 | 0.3045 |
| UA | midwest | 53 | 247 | 0.215 |
| UA | south | 47 | 247 | 0.190 |
| UA | west | 147 | 247 | 0.595 |
| US | south | 181 | 319 | 0.567 |
| US | west | 138 | 319 | 0.433 |
| WN | midwest | 296 | 3956 | 0.0748 |
| WN | NE | 227 | 3956 | 0.0574 |
| WN | south | 2650 | 3956 | 0.6699 |
| WN | west | 721 | 3956 | 0.1823 |
| WN | NA | 62 | 3956 | 0.0157 |
| XE | midwest | 1251 | 6778 | 0.184568 |
| XE | NE | 1 | 6778 | 0.000148 |
| XE | south | 5262 | 6778 | 0.776335 |
| XE | west | 264 | 6778 | 0.038950 |
| YV | south | 5 | 5 | 1 |
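A table with this shape can be built by counting flights per carrier and region and then dividing by each carrier's total. The sketch below uses invented rows and assumes a `date` column; the month is extracted with base R, and `lubridate::month(date) == 7` should be an equivalent filter if lubridate is loaded.

```r
library(dplyr)

# Invented rows; one June flight should be filtered out.
toy <- data.frame(
  date = as.Date(c("2011-07-01", "2011-07-02", "2011-07-03", "2011-07-04",
                   "2011-07-05", "2011-07-06", "2011-07-07", "2011-06-30")),
  carrier = c("DL", "DL", "DL", "DL", "DL", "AS", "AS", "AS"),
  region = c("south", "south", "south", "south", "midwest",
             "west", "west", "west")
)

props <- toy %>%
  filter(as.integer(format(date, "%m")) == 7) %>%   # keep July only
  count(carrier, region) %>%                        # flights per carrier-region
  group_by(carrier) %>%
  mutate(carrier_sum = sum(n),                      # carrier's July total
         proportion = n / carrier_sum) %>%
  ungroup()

print(props)
```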
```
## Source: local data frame [178 x 6]
## Groups: carrier [3]
##
##          date flight  dest state region carrier
##        (date)  (int) (chr) (chr)  (chr)   (chr)
## 1  2011-07-23      1   HNL    HI     NA      CO
## 2  2011-07-17      1   HNL    HI     NA      CO
## 3  2011-07-11      1   HNL    HI     NA      CO
## 4  2011-07-07      1   HNL    HI     NA      CO
## 5  2011-07-29   1755   ANC    AK     NA      CO
## 6  2011-07-26   1755   ANC    AK     NA      CO
## 7  2011-07-06   1755   ANC    AK     NA      CO
## 8  2011-07-04   1755   ANC    AK     NA      CO
## 9  2011-07-26      1   HNL    HI     NA      CO
## 10 2011-07-19      1   HNL    HI     NA      CO
## ..        ...    ...   ...   ...    ...     ...
```
## Warning: Removed 3 rows containing missing values (position_stack).
Note: the question asks about flights “to/from Houston”, but we only have data for the destination. Note: only 3 “states” have a region of NA: HI, AK and PR. The NA category is therefore worth keeping, but it is better thought of as a new category of its own (non-contiguous states and territories).
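Treating the NA region as its own category could be done by recoding it before counting or plotting, so those rows are no longer dropped as missing. In this sketch the label `"non-contiguous"` is a made-up name and the rows are toy data.

```r
library(dplyr)

# Toy rows; HI, AK and PR have no region assigned.
toy <- data.frame(
  dest = c("HNL", "ANC", "SJU", "DFW"),
  state = c("HI", "AK", "PR", "TX"),
  region = c(NA, NA, NA, "south"),
  stringsAsFactors = FALSE
)

recoded <- toy %>%
  # coalesce() keeps existing regions and fills only the missing ones
  mutate(region = coalesce(region, "non-contiguous"))

print(recoded)
```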