Catie Peranzi Assignment 2

January 2016 Domestic Flights
October 9, 2016
library(dplyr)
library(ggvis)
library(rmarkdown)
library(knitr)
Importing Datasets
First, I imported the dataset domestic_flights_jan_2016.csv and saved it as flights. I then displayed the first six rows and first five columns of the dataset.
flights <- read.csv("domestic_flights_jan_2016.csv", stringsAsFactors = FALSE)

kable(head(flights[, 1:5]))
FlightDate Carrier TailNum FlightNum Origin
1/6/2016 AA N4YBAA 43 DFW
1/7/2016 AA N434AA 43 DFW
1/8/2016 AA N541AA 43 DFW
1/9/2016 AA N489AA 43 DFW
1/10/2016 AA N439AA 43 DFW
1/11/2016 AA N468AA 43 DFW

This dataset contains information on domestic flights in January 2016.

Report Outline

I chose to use this dataset to investigate:

  1. Which carrier has the highest percentage of cancelled flights and which carrier has the lowest percentage of cancelled flights?

  2. Is there a relationship between a carrier's average departure delay and its average arrival delay? If there is a relationship, is it significant?

  3. Which five carriers flew the greatest total distance in January?

  4. Do diverted flights tend to cover a greater distance than flights that are not diverted?

Cancelled Flights by Carrier

#Percentage of cancelled flights per carrier (Cancelled is 1 for a cancelled flight, 0 otherwise)
q1 <- flights %>%
  group_by(Carrier) %>%
  summarize(perc_cancelled = 100 * sum(Cancelled) / n())

#Plot the data
q1 %>% ggvis(~Carrier, ~perc_cancelled) %>% layer_bars() 

We can see that Carrier B6 has the highest percentage of flights cancelled and Carrier HA has the lowest percentage of flights cancelled.
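
The highest and lowest carriers can also be read off the summary table directly rather than from the bar chart. The following is a minimal sketch on a small made-up data frame; toy_flights, cancel_rates, highest, and lowest are hypothetical names and values used for illustration only, not the report's data.

```r
library(dplyr)

# Hypothetical toy data: one row per flight, Cancelled is 0/1 as in the report
toy_flights <- data.frame(
  Carrier   = c("AA", "AA", "AA", "AA", "B6", "B6", "HA", "HA", "HA", "HA"),
  Cancelled = c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Share of cancelled flights per carrier, expressed as a percentage
cancel_rates <- toy_flights %>%
  group_by(Carrier) %>%
  summarize(perc_cancelled = 100 * mean(Cancelled))

# Carrier(s) with the highest and lowest cancellation percentage
# (filtering on max/min keeps ties, if there are any)
highest <- cancel_rates %>% filter(perc_cancelled == max(perc_cancelled))
lowest  <- cancel_rates %>% filter(perc_cancelled == min(perc_cancelled))

print(highest)
print(lowest)
```

Applied to the real summary table, the same two filters would return the carriers named above without relying on a visual reading of the bars.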

Average Departure Delay Versus Average Arrival Delay

Before computing the average departure delay and average arrival delay, we must first look at the incomplete cases, that is, the rows containing “NA” values.

#Filter for Incomplete cases
flights %>% filter(!complete.cases(.)) %>% head(5)
##   FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState
## 1  1/16/2016      AA  N3CXAA        44    SEA    Seattle, WA          WA
## 2  1/23/2016      AA                44    SEA    Seattle, WA          WA
## 3  1/24/2016      AA  N3DGAA        44    SEA    Seattle, WA          WA
## 4  1/25/2016      AA  N3MNAA        44    SEA    Seattle, WA          WA
## 5  1/15/2016      AA  N3JSAA        45    JFK   New York, NY          NY
##   Dest DestCityName DestState CRSDepTime DepTime WheelsOff WheelsOn
## 1  JFK New York, NY        NY        640      NA        NA       NA
## 2  JFK New York, NY        NY        640      NA        NA       NA
## 3  JFK New York, NY        NY        645      NA        NA       NA
## 4  JFK New York, NY        NY        645      NA        NA       NA
## 5  SEA  Seattle, WA        WA       1830      NA        NA       NA
##   CRSArrTime ArrTime Cancelled Diverted CRSElapsedTime ActualElapsedTime
## 1       1501      NA         1        0            321                NA
## 2       1501      NA         1        0            321                NA
## 3       1506      NA         1        0            321                NA
## 4       1506      NA         1        0            321                NA
## 5       2152      NA         1        0            382                NA
##   Distance
## 1     2422
## 2     2422
## 3     2422
## 4     2422
## 5     2422

Looking at the incomplete cases allowed me to determine that the “NA” values were due to cancelled or diverted flights, so I filter for flights that were neither cancelled nor diverted.
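
Because head(5) shows only a handful of rows, one way to check this for every incomplete row is to count them by cancellation and diversion status. This is a minimal sketch on made-up data; toy_flights, na_by_status, and unexplained are hypothetical names, and the values are illustrative only.

```r
library(dplyr)

# Hypothetical toy data with NA times for cancelled or diverted flights
toy_flights <- data.frame(
  Cancelled = c(1, 0, 0, 0, 1),
  Diverted  = c(0, 1, 0, 0, 0),
  DepTime   = c(NA, 640, 700, 815, NA),
  ArrTime   = c(NA, NA, 905, 1010, NA)
)

# Count the incomplete rows in each Cancelled/Diverted combination
na_by_status <- toy_flights %>%
  filter(!complete.cases(.)) %>%
  count(Cancelled, Diverted)

print(na_by_status)

# Incomplete rows that are neither cancelled nor diverted;
# zero means every NA is accounted for by one of the two flags
unexplained <- toy_flights %>%
  filter(!complete.cases(.), Cancelled == 0, Diverted == 0) %>%
  nrow()

print(unexplained)
```

If unexplained is zero on the real data, filtering on Cancelled and Diverted removes every row with missing times.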

#Filter, mutate, and summarize
q2 <- flights %>%
  filter(Cancelled == 0, Diverted == 0) %>%
  group_by(Carrier) %>%
  mutate(DepDelay = CRSDepTime - DepTime, ArrDelay = CRSArrTime - ArrTime) %>%
  summarize(avg_depdel = mean(DepDelay), avg_arrdel = mean(ArrDelay))
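
The time columns in the sample rows (for example 640, 1501, 1830, 2152) look like HHMM clock values rather than minutes. If that assumption holds, subtracting them directly does not give a delay in minutes (1905 minus 1830 is 75, but the two times are 35 minutes apart), and scheduled minus actual is negative when a flight runs late. The sketch below shows one possible conversion; hhmm_to_minutes and delay_minutes are hypothetical helper names, and the 12-hour rule for midnight crossings is my own assumption, not something stated in the dataset.

```r
# Convert an HHMM clock value (e.g. 1830 for 6:30 pm) to minutes after midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Delay in minutes, positive when the flight ran late (actual minus scheduled).
# Assumption: a difference beyond +/- 12 hours is treated as a midnight crossing.
delay_minutes <- function(scheduled, actual) {
  d <- hhmm_to_minutes(actual) - hhmm_to_minutes(scheduled)
  d <- ifelse(d < -720, d + 1440, d)  # e.g. scheduled 2350, actual 0010
  d <- ifelse(d > 720, d - 1440, d)   # e.g. scheduled 0010, actual 2350
  d
}

# Four made-up flights: 15 min late, 35 min late, 20 min late across midnight, 5 min early
example_delays <- delay_minutes(
  scheduled = c(640, 1830, 2350, 900),
  actual    = c(655, 1905, 10, 855)
)
print(example_delays)
```

With helpers like these, the mutate step could use delay_minutes(CRSDepTime, DepTime) and delay_minutes(CRSArrTime, ArrTime). The averages and the regression results would then differ from the figures reported in this section.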

#Plot the data
q2 %>% ggvis(~avg_depdel, ~avg_arrdel) %>% layer_points(fill=~Carrier) %>% layer_model_predictions(model="lm", se=TRUE) 

The graph suggests a slight positive relationship; however, the 95% confidence band around the fitted line is quite wide.

Now, we run a regression model to determine whether the average departure delay is a useful predictor of the average arrival delay.

model.2 <- lm(avg_arrdel ~ avg_depdel, data = q2)
summary(model.2)
## 
## Call:
## lm(formula = avg_arrdel ~ avg_depdel, data = q2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.137  -5.812  -1.426   6.105  16.209 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.4564     2.9885   6.510 6.81e-05 ***
## avg_depdel    0.4249     0.3797   1.119    0.289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.27 on 10 degrees of freedom
## Multiple R-squared:  0.1113, Adjusted R-squared:  0.02246 
## F-statistic: 1.253 on 1 and 10 DF,  p-value: 0.2892

Running the regression model allowed me to determine that the relationship between average arrival delay and average departure delay is not statistically significant: the p-value of 0.2892 is not less than 0.05, and the model explains only about 11% of the variance (R-squared of 0.1113). With only 12 carriers in the model (10 residual degrees of freedom), we cannot conclude that a carrier's average departure delay predicts its average arrival delay.
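
The quantities used in this conclusion can be pulled out of the fitted model instead of being read from the printed summary. This is a minimal sketch on made-up carrier averages; toy, fit, slope_p, r_squared, and significant are hypothetical names, and the numbers are not the report's data.

```r
# Hypothetical carrier-level averages, for illustration only
toy <- data.frame(
  avg_depdel = c(2, 4, 6, 8, 10, 12),
  avg_arrdel = c(5, 9, 6, 12, 10, 15)
)

fit <- lm(avg_arrdel ~ avg_depdel, data = toy)

# Coefficient table: one row per term, with estimate, std. error, t value, p-value
coefs <- summary(fit)$coefficients

# p-value for the slope; with a single predictor it matches the F-test p-value
slope_p <- coefs["avg_depdel", "Pr(>|t|)"]

# Share of variance in avg_arrdel explained by avg_depdel
r_squared <- summary(fit)$r.squared

# Decision at the conventional 0.05 level
significant <- slope_p < 0.05

print(slope_p)
print(r_squared)
print(significant)
```

In a simple regression like this one, the slope's p-value and the overall F-test p-value are the same test, which is why the summary above shows 0.289 in the coefficient table and 0.2892 on the last line.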

Top Five Carriers That Flew the Greatest Distance

flights %>% group_by(Carrier) %>% summarize(total_dist=sum(Distance)) %>% arrange(desc(total_dist)) %>% head(5)
## # A tibble: 5 × 2
##   Carrier total_dist
##     <chr>      <int>
## 1      WN   77253801
## 2      AA   75493609
## 3      DL   59447466
## 4      UA   49274973
## 5      B6   24507923

The top five carriers by total distance flown, in descending order, are WN, AA, DL, UA, and B6.

Distance and Diversion

flights %>% group_by(Diverted) %>% summarize(avg_distance = mean(Distance))
## # A tibble: 2 × 2
##   Diverted avg_distance
##      <int>        <dbl>
## 1        0     844.0332
## 2        1     947.5833

The table shows us that planes that are not diverted travel an average distance of about 844 miles, while planes that are diverted travel an average distance of about 948 miles, a difference of about 104 miles.
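
A comparison of means alone does not show whether the gap is larger than chance variation would produce. One possible follow-up, which was not part of the original analysis, is a Welch two-sample t-test on Distance by Diverted. The sketch below uses made-up data; toy_flights and welch are hypothetical names, and the result says nothing about the real dataset.

```r
# Hypothetical toy data: Diverted is 0/1, Distance is in miles
toy_flights <- data.frame(
  Diverted = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
  Distance = c(300, 550, 800, 950, 1200, 700, 900, 1400, 650, 1100)
)

# Welch two-sample t-test (t.test does not assume equal variances by default)
welch <- t.test(Distance ~ Diverted, data = toy_flights)

# Group means, matching what group_by() and summarize() would report
print(welch$estimate)

# p-value for the null hypothesis that both groups have the same mean distance
print(welch$p.value)
```

On the real data the two groups are likely to be very unequal in size, since diversions are rare, so the Welch version is a safer default than a pooled-variance test.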