In this project I will be creating a linear model to predict flight arrival delay in minutes from flight departure delay in minutes. I will create a training set from a portion of my data to train my model, and use the remaining data to test my models accuracy at predicting arrival delay from departure delay.
Loading packages to be used in this project + viewing some of the dataset.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.8 ✔ recipes 1.3.0
## ✔ dials 1.4.0 ✔ rsample 1.3.0
## ✔ dplyr 1.1.4 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.2 ✔ tidyr 1.3.1
## ✔ infer 1.0.8 ✔ tune 1.3.0
## ✔ modeldata 1.4.0 ✔ workflows 1.2.0
## ✔ parsnip 1.3.1 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.4 ✔ yardstick 1.3.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.4 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stats)
library(readr)
lax_to_jfk <- read.csv("lax_to_jfk.csv")
head(lax_to_jfk)
## Month DayOfWeek FlightDate Reporting_Airline Origin Dest CRSDepTime
## 1 3 5 2003-03-28 UA LAX JFK 2210
## 2 11 4 2018-11-29 AS LAX JFK 1045
## 3 8 5 2015-08-28 UA LAX JFK 805
## 4 4 7 2003-04-20 DL LAX JFK 2205
## 5 11 3 2005-11-30 UA LAX JFK 840
## 6 4 1 1992-04-06 UA LAX JFK 1450
## CRSArrTime DepTime ArrTime ArrDelay ArrDelayMinutes CarrierDelay WeatherDelay
## 1 615 2209 617 2 2 NA NA
## 2 1912 1049 1851 -21 0 NA NA
## 3 1634 757 1620 -14 0 NA NA
## 4 619 2212 616 -3 0 NA NA
## 5 1653 836 1640 -13 0 NA NA
## 6 2308 1452 2248 -20 0 NA NA
## NASDelay SecurityDelay LateAircraftDelay DepDelay DepDelayMinutes DivDistance
## 1 NA NA NA -1 0 NA
## 2 NA NA NA 4 4 NA
## 3 NA NA NA -8 0 NA
## 4 NA NA NA 7 7 NA
## 5 NA NA NA -4 0 NA
## 6 NA NA NA 2 2 NA
## DivArrDelay
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
Now I will create a subset of the data with just the ArrDelayMinutes and DepDelayMinutes for American Airlines. Let’s view the first 6 rows of the new subset we created.
aa_subset <- lax_to_jfk %>%
select(Reporting_Airline, ArrDelayMinutes, DepDelayMinutes) %>%
filter(Reporting_Airline == "AA")
head(aa_subset)
## Reporting_Airline ArrDelayMinutes DepDelayMinutes
## 1 AA 5 2
## 2 AA 21 0
## 3 AA 0 0
## 4 AA 14 0
## 5 AA 4 0
## 6 AA 0 0
Now let’s separate our data into training and testing groups to train and test our predictive model.
set.seed(1234)
index <- initial_split(aa_subset, prop = 0.8)
training_data <- training(index)
testing_data <- testing(index)
Now let’s specify the model we want to use and fit it to the training data.
linear_reg_lm_spec <-
linear_reg() %>%
set_engine(engine = "lm")
train_fit <- linear_reg_lm_spec %>%
fit(ArrDelayMinutes ~ DepDelayMinutes, data = training_data)
Let’s create our final model, find out it’s accuracy at predicting arrival delay from departure delay in American Airline flights. Let’s compare our model’s predicted values to the actual values from our testing data.
test_results <- train_fit %>%
predict(new_data = testing_data) %>%
mutate(truth = testing_data$ArrDelayMinutes)
test_results
## # A tibble: 220 × 2
## .pred truth
## <dbl> <int>
## 1 2.69 21
## 2 2.69 4
## 3 2.69 0
## 4 9.73 0
## 5 14.1 0
## 6 5.33 0
## 7 2.69 0
## 8 2.69 0
## 9 2.69 0
## 10 2.69 0
## # ℹ 210 more rows
train_results <- train_fit %>%
predict(new_data = training_data) %>%
mutate(truth = training_data$ArrDelayMinutes)
train_results
## # A tibble: 876 × 2
## .pred truth
## <dbl> <int>
## 1 2.69 27
## 2 2.69 0
## 3 31.7 46
## 4 2.69 5
## 5 68.7 61
## 6 3.57 5
## 7 5.33 0
## 8 2.69 0
## 9 2.69 0
## 10 2.69 0
## # ℹ 866 more rows
Values for root mean squared error and R squared for testing and training datasets.
rmsetest <- rmse(test_results, truth = truth, estimate = .pred)
rmsetest
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 10.5
rmsetrain <- rmse(train_results, truth = truth, estimate = .pred)
rmsetrain
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 12.1
rsqtest <- rsq(test_results, truth = truth, estimate = .pred)
rsqtest
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.842
rsqtrain <- rsq(train_results, truth = truth, estimate = .pred)
rsqtrain
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.759
Visualizing a comparison between the training results and test results of our predictive model.
test_results %>%
mutate(train = "testing") %>%
bind_rows(train_results %>%
mutate(train = "training")) %>%
ggplot(aes(truth, .pred)) +
geom_abline(lty = 2, color = "orange", size = 1.5) +
geom_point(color = "darkgreen", alpha = 0.5) +
facet_wrap(~train) +
labs( x = "Truth", y = "Predicted Arrival Delays (mins)") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.