library(dplyr)
library(ggvis)
library(rmarkdown)
library(knitr)
flights <- read.csv("domestic_flights_jan_2016.csv", stringsAsFactors = FALSE)
kable(head(flights[, 1:5]))
| FlightDate | Carrier | TailNum | FlightNum | Origin |
|---|---|---|---|---|
| 1/6/2016 | AA | N4YBAA | 43 | DFW |
| 1/7/2016 | AA | N434AA | 43 | DFW |
| 1/8/2016 | AA | N541AA | 43 | DFW |
| 1/9/2016 | AA | N489AA | 43 | DFW |
| 1/10/2016 | AA | N439AA | 43 | DFW |
| 1/11/2016 | AA | N468AA | 43 | DFW |
This dataset contains information on domestic flights in January 2016.
I chose this dataset to investigate the following questions:
1. Which carrier has the highest percentage of cancelled flights, and which has the lowest?
2. Is there a relationship between a carrier's average departure delay and its average arrival delay? If so, is it statistically significant?
3. Which five carriers flew the greatest total distance in January?
4. Is a flight more likely to fly a greater distance if it is diverted?
# Share of cancelled flights per carrier (a proportion between 0 and 1)
q1 <- flights %>%
  group_by(Carrier) %>%
  summarize(perc_cancelled = sum(Cancelled) / n())
# Plot the data
q1 %>% ggvis(~Carrier, ~perc_cancelled) %>% layer_bars()
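The bar chart shows the cancellation rates, but the first question asks for the specific carriers at each extreme. A minimal sketch of how those could be pulled out of a summary table like `q1` follows. The `toy` data frame and its values are made up for illustration and are not the real January 2016 figures; multiplying by 100 to get a true percentage is also my own choice.

```r
library(dplyr)

# Toy data (hypothetical values, NOT taken from the flights file)
toy <- data.frame(
  Carrier   = c("AA", "AA", "AA", "AA", "DL", "DL", "DL", "DL", "WN", "WN"),
  Cancelled = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0)
)

# Cancelled flights per carrier, expressed as a percentage
cancel_rates <- toy %>%
  group_by(Carrier) %>%
  summarize(perc_cancelled = 100 * mean(Cancelled))

# Carriers with the highest and lowest percentage (ties are kept)
highest <- cancel_rates %>% filter(perc_cancelled == max(perc_cancelled))
lowest  <- cancel_rates %>% filter(perc_cancelled == min(perc_cancelled))

print(highest)
print(lowest)
```

Filtering on `max()` and `min()` rather than taking the first sorted row keeps every carrier that ties for the extreme value.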
Before computing the average departure and arrival delays, we need to inspect the incomplete cases, that is, the rows containing “NA” values.
# Filter for incomplete cases
flights %>% filter(!complete.cases(.)) %>% head(5)
## FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState
## 1 1/16/2016 AA N3CXAA 44 SEA Seattle, WA WA
## 2 1/23/2016 AA 44 SEA Seattle, WA WA
## 3 1/24/2016 AA N3DGAA 44 SEA Seattle, WA WA
## 4 1/25/2016 AA N3MNAA 44 SEA Seattle, WA WA
## 5 1/15/2016 AA N3JSAA 45 JFK New York, NY NY
## Dest DestCityName DestState CRSDepTime DepTime WheelsOff WheelsOn
## 1 JFK New York, NY NY 640 NA NA NA
## 2 JFK New York, NY NY 640 NA NA NA
## 3 JFK New York, NY NY 645 NA NA NA
## 4 JFK New York, NY NY 645 NA NA NA
## 5 SEA Seattle, WA WA 1830 NA NA NA
## CRSArrTime ArrTime Cancelled Diverted CRSElapsedTime ActualElapsedTime
## 1 1501 NA 1 0 321 NA
## 2 1501 NA 1 0 321 NA
## 3 1506 NA 1 0 321 NA
## 4 1506 NA 1 0 321 NA
## 5 2152 NA 1 0 382 NA
## Distance
## 1 2422
## 2 2422
## 3 2422
## 4 2422
## 5 2422
Looking at the incomplete cases showed that the “NA” values come from cancelled or diverted flights, so I filter for flights that were neither cancelled nor diverted.
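Five rows are a small sample, so it may be worth checking the claim across the whole table. The sketch below cross-tabulates incomplete rows against cancelled or diverted status. It runs on a hypothetical `toy` data frame with made-up values; only the column names are borrowed from the flights data.

```r
# Toy data (hypothetical values, NOT taken from the flights file)
toy <- data.frame(
  Cancelled = c(1, 0, 0, 0, 1),
  Diverted  = c(0, 1, 0, 0, 0),
  DepTime   = c(NA, 1210, 640, 1835, NA),
  ArrTime   = c(NA, NA, 1455, 2150, NA)
)

# Flag rows with at least one NA
toy$incomplete <- !complete.cases(toy)

# Cross-tabulate: were incomplete rows cancelled or diverted?
status_table <- table(
  disrupted  = toy$Cancelled == 1 | toy$Diverted == 1,
  incomplete = toy$incomplete
)
print(status_table)

# Incomplete rows that are neither cancelled nor diverted
unexplained <- sum(toy$incomplete & toy$Cancelled == 0 & toy$Diverted == 0)
print(unexplained)
```

If `unexplained` is zero on the real data, the filter on `Cancelled` and `Diverted` removes every row with missing values.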
# Filter, mutate, and summarize
# Note: CRSDepTime, DepTime, CRSArrTime, and ArrTime are HHMM clock times, so these
# differences are scheduled minus actual and are not true minutes.
q2 <- flights %>%
  filter(Cancelled == 0, Diverted == 0) %>%
  group_by(Carrier) %>%
  mutate(DepDelay = CRSDepTime - DepTime, ArrDelay = CRSArrTime - ArrTime) %>%
  summarize(avg_depdel = mean(DepDelay), avg_arrdel = mean(ArrDelay))
# Plot the data
q2 %>%
  ggvis(~avg_depdel, ~avg_arrdel) %>%
  layer_points(fill = ~Carrier) %>%
  layer_model_predictions(model = "lm", se = TRUE)
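The delay columns computed above subtract clock times stored as HHMM integers (for example, 640 for 6:40), so a flight scheduled at 6:55 that leaves at 7:00 differs by 45 rather than 5. One possible alternative is to convert to minutes since midnight first. The sketch below is not the method used in this report. The helpers `hhmm_to_minutes` and `delay_minutes` are hypothetical names, and the midnight correction assumes no flight is more than 12 hours early or late.

```r
# Convert an HHMM clock time (e.g. 1830) to minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# Delay in minutes, actual minus scheduled (positive means late)
delay_minutes <- function(scheduled_hhmm, actual_hhmm) {
  d <- hhmm_to_minutes(actual_hhmm) - hhmm_to_minutes(scheduled_hhmm)
  # Assumption: a gap larger than 12 hours means the clock crossed midnight
  d <- ifelse(d < -720, d + 1440, d)
  d <- ifelse(d > 720, d - 1440, d)
  d
}

delay_minutes(640, 655)    # 15 minutes late
delay_minutes(2350, 10)    # 20 minutes late, crossing midnight
delay_minutes(1830, 1825)  # -5, i.e. 5 minutes early
```

With these helpers, the `mutate()` step could use `delay_minutes(CRSDepTime, DepTime)` and `delay_minutes(CRSArrTime, ArrTime)`. The regression results would then differ from the ones reported here.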
The graph suggests a slight positive relationship, but the 95% confidence band is quite wide.
Next, we fit a linear regression to test whether the average departure delay is a good predictor of the average arrival delay.
model.2 <- lm(avg_arrdel ~ avg_depdel, data = q2)
summary(model.2)
##
## Call:
## lm(formula = avg_arrdel ~ avg_depdel, data = q2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.137 -5.812 -1.426 6.105 16.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.4564 2.9885 6.510 6.81e-05 ***
## avg_depdel 0.4249 0.3797 1.119 0.289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.27 on 10 degrees of freedom
## Multiple R-squared: 0.1113, Adjusted R-squared: 0.02246
## F-statistic: 1.253 on 1 and 10 DF, p-value: 0.2892
The regression shows that the relationship between average arrival delay and average departure delay is not statistically significant: the p-value of 0.2892 is not less than 0.05. With only 12 carriers in the model, we cannot conclude that a carrier's average departure delay predicts its average arrival delay.
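The summary reports a p-value of 0.289 for the slope and 0.2892 for the F-statistic. In a regression with a single predictor these are the same test, which the following sketch illustrates by extracting both values from a fitted model. The `toy` data are simulated for illustration and have no connection to the carrier averages above.

```r
set.seed(1)

# Simulated data (hypothetical, NOT the carrier averages from q2)
toy <- data.frame(avg_depdel = c(2, 5, 7, 9, 12, 15))
toy$avg_arrdel <- 3 + 0.8 * toy$avg_depdel + rnorm(6)

fit <- lm(avg_arrdel ~ avg_depdel, data = toy)
s <- summary(fit)

# p-value of the t-test on the slope
slope_p <- coef(s)["avg_depdel", "Pr(>|t|)"]

# p-value of the overall F-test, rebuilt from the stored F-statistic
f <- s$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# With one predictor the two p-values coincide
print(all.equal(slope_p, overall_p))
```

Extracting the p-value this way avoids copying it by hand from the printed summary.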
flights %>% group_by(Carrier) %>% summarize(total_dist=sum(Distance)) %>% arrange(desc(total_dist)) %>% head(5)
## # A tibble: 5 × 2
## Carrier total_dist
## <chr> <int>
## 1 WN 77253801
## 2 AA 75493609
## 3 DL 59447466
## 4 UA 49274973
## 5 B6 24507923
The top five carriers that flew the greatest distance in descending order are: WN, AA, DL, UA, B6.
flights %>% group_by(Diverted) %>% summarize(avg_distance = mean(Distance))
## # A tibble: 2 × 2
## Diverted avg_distance
## <int> <dbl>
## 1 0 844.0332
## 2 1 947.5833
The table shows that flights that were not diverted travelled an average distance of about 844 miles, while diverted flights travelled an average of about 948 miles, roughly 104 miles more. This comparison is descriptive only; no significance test was run on the difference.
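To judge whether a gap like this could be due to chance, one option is a Welch two-sample t-test on distance by diversion status. The sketch below shows the call on simulated data. The group sizes, means, and spread in `toy` are invented, so its output says nothing about the real flights.

```r
set.seed(42)

# Simulated data (hypothetical, NOT the real flight distances)
toy <- data.frame(
  Diverted = rep(c(0, 1), times = c(200, 20)),
  Distance = c(rnorm(200, mean = 850, sd = 200),
               rnorm(20,  mean = 950, sd = 200))
)

# Welch t-test: does mean distance differ between the two groups?
welch <- t.test(Distance ~ Diverted, data = toy)

print(welch$estimate)  # group means
print(welch$p.value)   # p-value for the difference in means
```

Real flight distances are likely to be right-skewed, so a rank-based alternative such as `wilcox.test()` may be worth comparing against the t-test.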