Linear Model Machine Learning Airline Dataset

In this project I will be creating a linear model to predict flight arrival delay in minutes from flight departure delay in minutes. I will create a training set from a portion of my data to train my model, and use the remaining data to test my models accuracy at predicting arrival delay from departure delay.

Loading packages to be used in this project + viewing some of the dataset.

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──

## ✔ broom        1.0.8     ✔ recipes      1.3.0
## ✔ dials        1.4.0     ✔ rsample      1.3.0
## ✔ dplyr        1.1.4     ✔ tibble       3.2.1
## ✔ ggplot2      3.5.2     ✔ tidyr        1.3.1
## ✔ infer        1.0.8     ✔ tune         1.3.0
## ✔ modeldata    1.4.0     ✔ workflows    1.2.0
## ✔ parsnip      1.3.1     ✔ workflowsets 1.1.0
## ✔ purrr        1.0.4     ✔ yardstick    1.3.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stats)
library(readr)
lax_to_jfk <- read.csv("lax_to_jfk.csv")
head(lax_to_jfk)

##   Month DayOfWeek FlightDate Reporting_Airline Origin Dest CRSDepTime
## 1     3         5 2003-03-28                UA    LAX  JFK       2210
## 2    11         4 2018-11-29                AS    LAX  JFK       1045
## 3     8         5 2015-08-28                UA    LAX  JFK        805
## 4     4         7 2003-04-20                DL    LAX  JFK       2205
## 5    11         3 2005-11-30                UA    LAX  JFK        840
## 6     4         1 1992-04-06                UA    LAX  JFK       1450
##   CRSArrTime DepTime ArrTime ArrDelay ArrDelayMinutes CarrierDelay WeatherDelay
## 1        615    2209     617        2               2           NA           NA
## 2       1912    1049    1851      -21               0           NA           NA
## 3       1634     757    1620      -14               0           NA           NA
## 4        619    2212     616       -3               0           NA           NA
## 5       1653     836    1640      -13               0           NA           NA
## 6       2308    1452    2248      -20               0           NA           NA
##   NASDelay SecurityDelay LateAircraftDelay DepDelay DepDelayMinutes DivDistance
## 1       NA            NA                NA       -1               0          NA
## 2       NA            NA                NA        4               4          NA
## 3       NA            NA                NA       -8               0          NA
## 4       NA            NA                NA        7               7          NA
## 5       NA            NA                NA       -4               0          NA
## 6       NA            NA                NA        2               2          NA
##   DivArrDelay
## 1          NA
## 2          NA
## 3          NA
## 4          NA
## 5          NA
## 6          NA

Now I will create a subset of the data with just the ArrDelayMinutes and DepDelayMinutes for American Airlines. Let’s view the first 6 rows of the new subset we created.

aa_subset <- lax_to_jfk %>%
  select(Reporting_Airline, ArrDelayMinutes, DepDelayMinutes) %>%
  filter(Reporting_Airline == "AA")

head(aa_subset)

##   Reporting_Airline ArrDelayMinutes DepDelayMinutes
## 1                AA               5               2
## 2                AA              21               0
## 3                AA               0               0
## 4                AA              14               0
## 5                AA               4               0
## 6                AA               0               0

Now let’s separate our data into training and testing groups to train and test our predictive model.

set.seed(1234)
index <- initial_split(aa_subset, prop = 0.8)
training_data <- training(index)
testing_data <- testing(index)

Now let’s specify the model we want to use and fit it to the training data.

linear_reg_lm_spec <-
  linear_reg() %>%
  set_engine(engine = "lm")

train_fit <- linear_reg_lm_spec %>%
  fit(ArrDelayMinutes ~ DepDelayMinutes, data = training_data)

Let’s create our final model, find out it’s accuracy at predicting arrival delay from departure delay in American Airline flights. Let’s compare our model’s predicted values to the actual values from our testing data.

test_results <- train_fit %>%
  predict(new_data = testing_data) %>%
  mutate(truth = testing_data$ArrDelayMinutes)

test_results

## # A tibble: 220 × 2
##    .pred truth
##    <dbl> <int>
##  1  2.69    21
##  2  2.69     4
##  3  2.69     0
##  4  9.73     0
##  5 14.1      0
##  6  5.33     0
##  7  2.69     0
##  8  2.69     0
##  9  2.69     0
## 10  2.69     0
## # ℹ 210 more rows

train_results <- train_fit %>%
  predict(new_data = training_data) %>%
  mutate(truth = training_data$ArrDelayMinutes)

train_results

## # A tibble: 876 × 2
##    .pred truth
##    <dbl> <int>
##  1  2.69    27
##  2  2.69     0
##  3 31.7     46
##  4  2.69     5
##  5 68.7     61
##  6  3.57     5
##  7  5.33     0
##  8  2.69     0
##  9  2.69     0
## 10  2.69     0
## # ℹ 866 more rows

Values for root mean squared error and R squared for testing and training datasets.

rmsetest <- rmse(test_results, truth = truth, estimate = .pred)
rmsetest

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        10.5

rmsetrain <- rmse(train_results, truth = truth, estimate = .pred)
rmsetrain

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        12.1

rsqtest <- rsq(test_results, truth = truth, estimate = .pred)
rsqtest

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.842

rsqtrain <- rsq(train_results, truth = truth, estimate = .pred)
rsqtrain

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.759

Visualizing a comparison between the training results and test results of our predictive model.

test_results %>%
  mutate(train = "testing") %>%
  bind_rows(train_results %>%
              mutate(train = "training")) %>%
  ggplot(aes(truth, .pred)) + 
  geom_abline(lty = 2, color = "orange", size = 1.5) + 
  geom_point(color = "darkgreen", alpha = 0.5) + 
  facet_wrap(~train) + 
  labs( x = "Truth", y = "Predicted Arrival Delays (mins)") + 
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Linear Model Machine Learning Airline Dataset

Archit Nangavaram

2025-05-02