LaGuardia Airport (LGA) in New York City is a common destination for my family coming from New Mexico, but they often face delays while connecting en route. I chose to begin my analysis by looking at all flights arriving at LGA.
To start, I will read in my CSV file.
flights <- read.csv("domestic_flights_jan_2016.csv", header = TRUE, stringsAsFactors = FALSE)
Next, I will load the additional packages I will be using to sort and analyze the data.
library(dplyr)
library(ggvis)
With those loaded, I can create a new data frame containing only flights with LGA as the destination.
LGA_arrivals <- flights %>% filter(Dest == "LGA")
Next, I will compute many of the same metrics Prof. Suleiman demonstrated in the Unit 6 lecture on my new LGA data frame, “LGA_arrivals”. For this reason, I will move through this section without much narrative.
#Clean up date convention
LGA_arrivals$FlightDate <- as.Date(LGA_arrivals$FlightDate, format = "%m/%d/%Y")
#Format and add leading zeros back into time variables
LGA_arrivals <- LGA_arrivals %>% mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))
LGA_arrivals$new_CRSDepTime <- as.POSIXct(LGA_arrivals$new_CRSDepTime, format="%Y-%m-%d %H%M")
LGA_arrivals <- LGA_arrivals %>% mutate(new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
LGA_arrivals$new_CRSArrTime <- as.POSIXct(LGA_arrivals$new_CRSArrTime, format="%Y-%m-%d %H%M")
#Drop cancelled and diverted flights, which have no actual times to parse
LGA_arrivals <- LGA_arrivals %>%
  filter(Cancelled == 0) %>%
  filter(Diverted == 0) %>%
  mutate(
    new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)),
    new_WheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)),
    new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)),
    new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime))
  )
LGA_arrivals$new_DepTime <- as.POSIXct(LGA_arrivals$new_DepTime, format="%Y-%m-%d %H%M")
LGA_arrivals$new_WheelsOff <- as.POSIXct(LGA_arrivals$new_WheelsOff, format="%Y-%m-%d %H%M")
LGA_arrivals$new_WheelsOn <- as.POSIXct(LGA_arrivals$new_WheelsOn, format="%Y-%m-%d %H%M")
LGA_arrivals$new_ArrTime <- as.POSIXct(LGA_arrivals$new_ArrTime, format="%Y-%m-%d %H%M")
#Speed metrics
#Cancelled flights were already dropped above, so no further filtering is needed here
LGA_arrivals <- LGA_arrivals %>%
  mutate(
    TaxiOut = as.integer(difftime(new_WheelsOff, new_DepTime, units = "mins")),
    TaxiIn = as.integer(difftime(new_ArrTime, new_WheelsOn, units = "mins")),
    ArrDelay = as.integer(difftime(new_ArrTime, new_CRSArrTime, units = "mins")),
    ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay),
    ArrDel15 = ifelse(ArrDelay >= 15, 1, 0),
    FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime
  )
LGA_arrivals <- LGA_arrivals %>% mutate(AirTime = ActualElapsedTime - TaxiOut - TaxiIn)
LGA_arrivals <- LGA_arrivals %>% mutate(AirSpeed = Distance / (AirTime / 60))
#Departure delays
#Cancelled and diverted flights were already dropped above
LGA_arrivals <- LGA_arrivals %>%
  mutate(
    DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units = "mins")),
    DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay),
    DepDel15 = ifelse(DepDelay >= 15, 1, 0)
  )
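To convince myself the time handling above does what I expect, here is a minimal sketch in base R on a few made-up values. The date, the hhmm integers, and the helper `to_time()` are hypothetical and are not part of the flight data; I also fix the time zone to UTC here, which is an assumption made only so the toy arithmetic is reproducible.

```r
# Hypothetical toy values: one flight date and three scheduled/actual hhmm integers
flight_date <- as.Date("01/05/2016", format = "%m/%d/%Y")
sched_dep   <- c(5L, 930L, 1745L)
actual_dep  <- c(3L, 952L, 1745L)

# sprintf("%04d", ...) restores the leading zeros that were lost when the
# times were read in as integers (5 becomes "0005")
padded <- sprintf("%04d", sched_dep)

# Hypothetical helper: paste the date and the padded time, then parse as a date-time
to_time <- function(date, hhmm) {
  as.POSIXct(paste(date, sprintf("%04d", hhmm)),
             format = "%Y-%m-%d %H%M", tz = "UTC")
}

# Same three steps as in the analysis: signed delay, delay clamped at zero,
# and a 0/1 flag for delays of 15 minutes or more
dep_delay <- as.integer(difftime(to_time(flight_date, actual_dep),
                                 to_time(flight_date, sched_dep),
                                 units = "mins"))
dep_delay_minutes <- ifelse(dep_delay < 0, 0, dep_delay)
dep_del15 <- ifelse(dep_delay >= 15, 1, 0)

print(data.frame(padded, dep_delay, dep_delay_minutes, dep_del15))
```

In this toy case the first flight leaves two minutes early, so its clamped delay is 0, and only the second flight (22 minutes late) is flagged as a significant delay.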
Now I can start manipulating the data to see if there is a way to avoid delays when flying into LGA.
First, I will look at significant departure delays (15 minutes or more) by carrier.
LGA_arrivals %>% ggvis(x = ~Carrier, y = ~DepDel15) %>% layer_bars()
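One thing to keep in mind when reading this chart: as far as I understand `layer_bars()`, it adds up the y values within each carrier, so with a 0/1 flag the bar heights are counts of delayed flights, and a carrier with more flights can look worse simply because it flies more. A minimal base R sketch with made-up rows (the `toy` data frame is hypothetical) shows the difference between a count and a rate:

```r
# Hypothetical toy data: carrier AA has four flights, DL has two
toy <- data.frame(
  Carrier  = c("AA", "AA", "AA", "AA", "DL", "DL"),
  DepDel15 = c(1, 0, 1, 0, 1, 0)
)

# Count of significantly delayed flights per carrier (sum of the 0/1 flag)
delayed_count <- tapply(toy$DepDel15, toy$Carrier, sum)

# Share of flights significantly delayed per carrier (mean of the 0/1 flag)
delay_rate <- tapply(toy$DepDel15, toy$Carrier, mean)

print(delayed_count)
print(delay_rate)
```

Here AA has twice as many delayed flights as DL, but both carriers have the same delay rate of one half.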
Another parameter to look at is delay by location. Here I calculate the rate of significant delays by origin airport.
DelayRate_ByLocation <- LGA_arrivals %>% group_by(Origin) %>% summarize(DelayRate = sum(DepDel15) / n())
DelayRate_ByLocation %>% ggvis(~Origin, ~DelayRate) %>% layer_bars()
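A caveat on the rate calculation: `sum(DepDel15) / n()` returns NA for an origin if any of its flags is missing. I have not checked whether that happens in this data, so the following is only a sketch on hypothetical values showing how a missing flag propagates and one way to rank origins if it does.

```r
# Hypothetical toy data with one missing flag for ATL
toy <- data.frame(
  Origin   = c("ORD", "ORD", "ORD", "ATL", "ATL", "DEN"),
  DepDel15 = c(1, 1, 0, 0, NA, 0)
)
by_origin <- split(toy$DepDel15, toy$Origin)

# Same arithmetic as sum(DepDel15) / n(): a single NA makes the whole rate NA
rate_strict <- vapply(by_origin, function(x) sum(x) / length(x), numeric(1))

# Alternative: average only the flights whose flag is known
rate_na_rm <- vapply(by_origin, function(x) mean(x, na.rm = TRUE), numeric(1))

# Rank origins from highest to lowest delay rate
ranked <- sort(rate_na_rm, decreasing = TRUE)

print(rate_strict)
print(ranked)
```

Whether dropping missing flags is appropriate depends on why they are missing, so I would treat the second version as an option rather than a default.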
Finally, because I want to graph a linear regression even though this data set lacks an obviously logical pair of variables for one, I will look at the relationship between plane speed (AirSpeed) and departure delay in minutes (DepDelayMinutes).
LGA_arrivals %>% ggvis(x = ~DepDelayMinutes, y = ~AirSpeed) %>% layer_points() %>% layer_model_predictions(model = "lm", se = TRUE, stroke := "red")
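The plot only draws the fitted line and its band. To see the numbers behind a line like this, the same kind of model can be fit directly with base R's `lm()`. This is a minimal sketch on made-up values, not the actual LGA results; the `toy` data frame is hypothetical.

```r
# Hypothetical toy data: departure delay in minutes and air speed in mph
toy <- data.frame(
  DepDelayMinutes = c(0, 10, 20, 30, 40),
  AirSpeed        = c(400, 407, 409, 416, 420)
)

# Ordinary least squares fit of air speed on departure delay
fit <- lm(AirSpeed ~ DepDelayMinutes, data = toy)

# Slope: change in air speed per additional minute of departure delay
slope <- coef(fit)[["DepDelayMinutes"]]

# 95% confidence intervals for the intercept and slope
ci <- confint(fit)

# Fitted values with a confidence band, comparable to the shaded region on the plot
band <- predict(fit, newdata = toy, interval = "confidence")

print(slope)
print(ci)
```

On the real data, running `summary()` on the fitted model would also report the R-squared, which is a quick way to judge whether the relationship is strong enough to mean anything.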