
Administrative:

Please indicate

Question 1:

Plot a “time series” of the proportion of flights that were delayed by > 30 minutes on each day. i.e.

Using this plot, describe the seasonality of when delays over 30 minutes tend to occur.

Note: Before drawing any conclusions, I should investigate whether there is a pattern to when there are missing values. By removing them, I could be missing something.
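As a rough sketch of the two checks described above, the daily proportion of delayed flights and the daily share of missing delay values can be computed side by side. The data below is a toy stand-in, and the column name `dep_delay` is an assumption; the real delay column may be named differently.

```r
# Toy stand-in for the flights data; `dep_delay` is an assumed column name.
flights_toy <- data.frame(
  date = as.Date(c("2011-01-01", "2011-01-01", "2011-01-01",
                   "2011-01-02", "2011-01-02", "2011-01-02")),
  dep_delay = c(45, 10, NA, 5, NA, NA)
)

# Proportion of flights delayed by more than 30 minutes on each day,
# computed only over flights whose delay was recorded.
prop_delayed <- tapply(flights_toy$dep_delay > 30, flights_toy$date,
                       mean, na.rm = TRUE)

# Share of flights per day with a missing delay, to check whether the
# missing values cluster on particular days.
prop_missing <- tapply(is.na(flights_toy$dep_delay), flights_toy$date, mean)

daily <- data.frame(
  date = as.Date(names(prop_delayed)),
  prop_delayed = as.vector(prop_delayed),
  prop_missing = as.vector(prop_missing)
)
print(daily)
```

If `prop_missing` spikes on particular days or seasons, dropping the missing values could bias the seasonal picture.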

In this plot, outliers appear to obscure the seasonal trend: they stretch the y-axis and make the typical values hard to see. Binning the days into months and drawing box plots might show the trend better. I don’t want to get rid of the outliers completely, as they may be informative (for example, there may be more delays around holidays).

This still may not be the ideal visualization, but I like it better. The many outliers are definitely worth noting. Based on the monthly median, which I think is a good statistic to look at here because it isn’t skewed by outliers, the proportion of delayed flights increases during June and July and decreases notably in September, October and November.
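To illustrate why the median is the safer summary here, the sketch below bins toy daily proportions by month and compares the monthly median with the monthly mean. The values are invented for illustration only and are not taken from the flights data.

```r
# Toy daily proportions; the third June value plays the role of an outlier.
daily_toy <- data.frame(
  date = as.Date(c("2011-06-01", "2011-06-02", "2011-06-03",
                   "2011-09-01", "2011-09-02", "2011-09-03")),
  prop_delayed = c(0.10, 0.12, 0.60, 0.04, 0.05, 0.06)
)

# Bin each day into its calendar month.
daily_toy$month <- as.integer(format(daily_toy$date, "%m"))

# Median and mean of the daily proportions within each month.
monthly <- data.frame(
  month = sort(unique(daily_toy$month)),
  median = as.vector(tapply(daily_toy$prop_delayed, daily_toy$month, median)),
  mean = as.vector(tapply(daily_toy$prop_delayed, daily_toy$month, mean))
)
print(monthly)

# With ggplot2 loaded, the box plots could be drawn with:
# ggplot(daily_toy, aes(factor(month), prop_delayed)) + geom_boxplot()
```

In the toy June data the single outlier pulls the mean well above the median, while the September values, which have no outlier, give the same mean and median.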

Question 2:

Some people prefer flying on older planes. Even though they aren’t as nice, they tend to have more room. Which airlines should these people favor?

carrier mean_yr
MQ 1981.579
AA 1986.675
DL 1990.240
US 1991.922
UA 1996.365
WN 1997.943
XE 2000.442
CO 2001.137
FL 2002.110
F9 2003.878
YV 2003.886
EV 2004.586
OO 2004.983
AS 2005.554
B6 2006.080

This table ranks the airlines by the mean year in which their planes were made. By this measure, MQ and AA had, on average, the oldest planes. However, a mean could easily be driven by a few really old planes, and it may rest on very little data: when isolating MQ, we can see that only 57 of 4,648 entries have values for the plane’s year. It may be good to see the distribution for each carrier.
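One way to make the missing-data caveat visible is to report, next to each carrier's mean year, how many entries actually have a year. This is a minimal sketch on invented toy data; the column names `carrier` and `year` follow the tables and code in this report, but the values are not real.

```r
# Toy stand-in: one row per flight entry, with the plane's year where known.
planes_toy <- data.frame(
  carrier = c("MQ", "MQ", "MQ", "MQ", "AA", "AA", "AA"),
  year = c(1980, NA, NA, NA, 1985, 1988, NA)
)

# Mean year per carrier, alongside the number of entries it is based on.
summary_by_carrier <- data.frame(
  carrier = sort(unique(planes_toy$carrier)),
  mean_yr = as.vector(tapply(planes_toy$year, planes_toy$carrier,
                             mean, na.rm = TRUE)),
  n_known = as.vector(tapply(!is.na(planes_toy$year), planes_toy$carrier, sum)),
  n_total = as.vector(tapply(planes_toy$year, planes_toy$carrier, length))
)
print(summary_by_carrier)
```

A carrier whose `n_known` is a small fraction of `n_total` has a mean that should be read with caution.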

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11402 rows containing non-finite values (stat_bin).

But what would really be effective is to find the year planes started getting cramped and then look at the proportion of planes from before that year …

GOOGLE KNOWS ALL!

USA Today on airline seats gives some insight into the evolution of seat pitch and seat width on commercial airplanes over recent years. While the change in seat pitch varies by airline, seat width took a dive around 1995. Thus, I’ll look at the proportion of planes made in 1995 or earlier on each airline.

flights <- 
  full_join(select(planes, plane, year), flights, by = "plane") %>%
  group_by(carrier) %>%
  mutate(p_old = mean(year <= 1995, na.rm = TRUE))
      # p_old is, per carrier, the share of rows (flights) with a known plane
      # year whose plane was made in 1995 or earlier

p_old <- flights %>%
  group_by(carrier) %>%
  summarise(prop = mean(p_old, na.rm = TRUE),
            proportion_old = format(prop, digits = 4)) %>%
  arrange(desc(prop)) %>%   # sort on the number, not on the formatted text
  select(-prop)
p_old %>% knitr::kable()

carrier proportion_old
MQ 1
AA 0.9878
US 0.7919
DL 0.7386
UA 0.4093
WN 0.3898
CO 0.1029
YV 0.01266
OO 0.009401
FL 0.008172
AS 0
B6 0
EV 0
F9 0
XE 0

This analysis shows MQ, AA, US and DL as the carriers with the highest proportion of planes made in 1995 or earlier, and thus probably the most spacious economy seating. In my opinion, this second analysis is more informative for the traveler who is interested in spacious seats. I would have to discount MQ, as too many of its values are missing (discussed above). Based on the histograms, I would say US, AA, DL and WN are pretty good bets.

Question 3:

For example, Southwest Airlines Flight 60 to Dallas consists of a single flight path, but since it flew 299 times in 2013, it would be counted as 299 flights.

Carrier codes found here

state n
TX 17230
FL 3992
LA 3362
CA 2792
OK 2350
IL 2094
NV 1543
CO 1438
TN 1377
AZ 1369
MO 1334
MD 1227
NM 1019
MS 1016
NA 729
AL 691
SC 588
PA 437
NJ 390
AR 365

state count
TX 770
FL 297
LA 236
CA 219
OK 178
IL 163
NV 123
CO 121
TN 107
AZ 106
MO 101
NM 80
MD 79
MS 78
AL 52
NA 52
SC 43
PA 35
AR 28
NJ 26
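The Flight 60 example above distinguishes individual flights from flight paths. A minimal sketch of that distinction, on invented toy data, is below. It assumes that a flight path is identified by its flight number together with its destination state, which is one plausible reading and not necessarily the definition used for the tables above.

```r
# Toy stand-in: one row per individual flight.
wn_toy <- data.frame(
  flight = c(60, 60, 60, 12, 12, 7),
  state = c("TX", "TX", "TX", "TX", "TX", "FL")
)

# Every row is one flight, so counting rows per state counts flights.
n_flights <- table(wn_toy$state)

# A flight path is counted once, however many times it was flown
# (assumption: path = flight number + destination state).
paths <- unique(wn_toy[, c("flight", "state")])
n_paths <- table(paths$state)

by_state <- data.frame(
  state = names(n_flights),
  n = as.vector(n_flights),
  count = as.vector(n_paths)
)
print(by_state)
```

In the toy data, flight 60 contributes three flights but only one flight path to TX.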

Question 4:

I want to know proportionately what regions (NE, south, west, midwest) each carrier flies to/from Houston in the month of July. Consider the month() function from the lubridate package.

carrier region n carrier_sum proportion
AA south 273 273 1
AS west 31 31 1
B6 NE 62 62 1
CO midwest 713 6190 0.1152
CO NE 1060 6190 0.1712
CO south 2073 6190 0.3349
CO west 2246 6190 0.3628
CO NA 98 6190 0.0158
DL midwest 47 226 0.208
DL south 179 226 0.792
EV midwest 95 164 0.579
EV south 69 164 0.421
F9 west 88 88 1
FL south 195 213 0.9155
FL NA 18 213 0.0845
MQ midwest 117 410 0.285
MQ south 200 410 0.488
MQ west 93 410 0.227
OO midwest 349 1586 0.2201
OO NE 91 1586 0.0574
OO south 663 1586 0.4180
OO west 483 1586 0.3045
UA midwest 53 247 0.215
UA south 47 247 0.190
UA west 147 247 0.595
US south 181 319 0.567
US west 138 319 0.433
WN midwest 296 3956 0.0748
WN NE 227 3956 0.0574
WN south 2650 3956 0.6699
WN west 721 3956 0.1823
WN NA 62 3956 0.0157
XE midwest 1251 6778 0.184568
XE NE 1 6778 0.000148
XE south 5262 6778 0.776335
XE west 264 6778 0.038950
YV south 5 5 1

## Source: local data frame [178 x 6]
## Groups: carrier [3]
## 
##          date flight  dest state region carrier
##        (date)  (int) (chr) (chr)  (chr)   (chr)
## 1  2011-07-23      1   HNL    HI     NA      CO
## 2  2011-07-17      1   HNL    HI     NA      CO
## 3  2011-07-11      1   HNL    HI     NA      CO
## 4  2011-07-07      1   HNL    HI     NA      CO
## 5  2011-07-29   1755   ANC    AK     NA      CO
## 6  2011-07-26   1755   ANC    AK     NA      CO
## 7  2011-07-06   1755   ANC    AK     NA      CO
## 8  2011-07-04   1755   ANC    AK     NA      CO
## 9  2011-07-26      1   HNL    HI     NA      CO
## 10 2011-07-19      1   HNL    HI     NA      CO
## ..        ...    ...   ...   ...    ...     ...
## Warning: Removed 3 rows containing missing values (position_stack).

Note: the question asks about flights “to/from Houston”, but we only have data for the destination. Note: only 3 “states” are filed as NA: HI, AK and PR. The NA category is therefore worth keeping, but is better thought of as its own category (non-contiguous states and territories).
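A minimal sketch of the proportion calculation, with the NA regions kept as their own category, is shown below. It uses invented toy data and base R only; the label `"other"` is a hypothetical name for the NA group, and `format(date, "%m")` stands in for `month()` from lubridate.

```r
# Toy stand-in for the joined flights data.
july_toy <- data.frame(
  date = as.Date(c("2011-07-01", "2011-07-02", "2011-07-03",
                   "2011-07-04", "2011-06-30")),
  carrier = c("CO", "CO", "CO", "WN", "CO"),
  region = c("south", "west", NA, "south", "NE"),
  stringsAsFactors = FALSE
)

# Keep July only; with lubridate this would be month(date) == 7.
july <- july_toy[format(july_toy$date, "%m") == "07", ]

# Keep the NA regions as an explicit category instead of dropping them.
july$region[is.na(july$region)] <- "other"

# Count flights per carrier and region, dropping empty combinations.
counts <- as.data.frame(table(carrier = july$carrier, region = july$region),
                        responseName = "n", stringsAsFactors = FALSE)
counts <- counts[counts$n > 0, ]

# Each carrier's total, and each region's share of that total.
counts$carrier_sum <- ave(counts$n, counts$carrier, FUN = sum)
counts$proportion <- counts$n / counts$carrier_sum
print(counts)
```

Within each carrier the proportions sum to 1, which is a quick sanity check on the grouping.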