# Packages
suppressWarnings({
  suppressMessages({
    library(dplyr)
    library(ggplot2)
    library(lubridate)
    library(gridExtra)
    library(forecast)
    library(modeltime)
    library(fabletools)
    library(feasts)
    library(tsibble)
    library(tibble)
    library(fable)
    library(data.table)
    library(knitr)
    library(prophet)
    library(fable.prophet)
    library(tidyr)
    library(slider)
    library(stats)
    library(lagged)
    library(tidyverse)
    library(ggfortify)
    library(zoo)
    library(tseries)

    options(warn = -1)
  })
})

Section 1.1: Exploratory Data Analysis & Time Series Decomposition

To delve into fundamental concepts of time-series decomposition and forecasting, we will utilize data on “Total US Vehicle Sales” sourced from the “FRED” Economic Data platform, curated by the Federal Reserve Bank of St. Louis. This dataset presents non-seasonally adjusted monthly figures (in thousands) representing vehicle sales across the United States, spanning from 1976 to 2023. These numbers, compiled by the U.S. Bureau of Economic Analysis, reflect aggregated sales data from automotive dealers nationwide.

The dynamics of vehicle sales are influenced by a myriad of factors, encompassing the cost of raw materials such as steel and rubber, climatic conditions, economic fluctuations, seasonal variations, insurance premiums, taxation policies, consumer preferences, labor market dynamics, and demographic trends. Given the multifaceted nature of these variables shaping sales trends, generating precise forecasts poses a significant analytical challenge.

Section 1.2: Data Summary

We initiate our analysis by delving into the structure of our dataset, leveraging statistical metrics and visualizations to uncover insights. Along the way, we will provide commentary on noteworthy observations and draw key conclusions as they emerge.

vehicle_df <- read.csv('total_vehicle_sales.csv')

vehicle_sales_tbl_ts <- vehicle_df %>%
  select(date, vehicle_sales) %>%
  mutate(value = vehicle_sales,
         date = yearmonth(date)) %>%
  as_tsibble(index = date)
head(vehicle_df)
##         date vehicle_sales
## 1 1976-01-01         885.2
## 2 1976-02-01         994.7
## 3 1976-03-01        1243.6
## 4 1976-04-01        1191.2
## 5 1976-05-01        1203.2
## 6 1976-06-01        1254.7
summary(vehicle_df)
##      date           vehicle_sales   
##  Length:575         Min.   : 670.5  
##  Class :character   1st Qu.:1117.2  
##  Mode  :character   Median :1268.5  
##                     Mean   :1261.7  
##                     3rd Qu.:1420.2  
##                     Max.   :1845.7
date_range <- range(vehicle_df$date)
date_range
## [1] "1976-01-01" "2023-11-01"
nrow(vehicle_df)
## [1] 575

The data consists of monthly vehicle sales between January 1976 and November 2023, for a total of 575 observations. On average, about 1,261.7 thousand vehicles were sold per month, although monthly sales ranged from a low of 670.5 thousand to a peak of 1,845.7 thousand.

# Histogram of monthly vehicle sales with a kernel density overlay
hist(vehicle_df$vehicle_sales, freq = FALSE)
lines(density(vehicle_sales_tbl_ts$value), lwd = 3, col = "red")

The vehicle sales appear to be fairly normally distributed, with no obvious outliers in the data set.
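For a quick visual check of this impression, a normal Q-Q plot of the same column can be drawn (a brief supplementary sketch):

# Normal Q-Q plot: points close to the reference line support the
# impression that monthly sales are roughly normally distributed.
qqnorm(vehicle_df$vehicle_sales, main = "Normal Q-Q Plot of Vehicle Sales")
qqline(vehicle_df$vehicle_sales, col = "red", lwd = 2)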

ggplot() +
  geom_line(data = vehicle_sales_tbl_ts, aes(x = date, y = value)) +
  labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")

When visualizing the sales data chronologically, we observe a gradual upward trend alongside notable seasonal fluctuations, suggesting a recurring pattern, most likely on an annual basis. Because the data are aggregated monthly, these seasonal swings plausibly account for the regular within-year peaks and troughs.
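One way to inspect the suspected annual pattern more directly is a seasonal plot, for example with gg_season() from the feasts package (a brief sketch using the tsibble built above):

# Seasonal plot: each line traces one year across the twelve months.
# Broadly similar shapes from year to year support an annual seasonal pattern.
vehicle_sales_tbl_ts %>%
  gg_season(value) +
  labs(x = "Month", y = "Vehicle Sales",
       title = "Seasonal Plot of Monthly Vehicle Sales")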

Section 1.3: Time Series Analysis

After examining various moving average plots, it becomes evident that a 6-month moving average strikes a balance between smoothing the data and retaining the underlying seasonal or cyclical patterns.

# Calculate 6-month moving average
vehicle_sales_tbl_ts <- vehicle_sales_tbl_ts %>%
  mutate(ma_6 = slide_dbl(.x = value, .f = ~mean(.x, na.rm = TRUE), .before = 5))

# Create the plot
ggplot(vehicle_sales_tbl_ts, aes(x = date)) +
  geom_line(aes(y = value), color = "blue", na.rm = TRUE) +
  geom_line(aes(y = ma_6), color = "red", na.rm = TRUE) +
  labs(x = "Date", y = "Vehicle Sales", 
       title = "Vehicle Sales with 6-Month",
       subtitle = "Blue: Actual Sales, Red: 6-Month MA") +
  theme_minimal()

Given our belief in the presence of seasonality within the dataset, we proceed with an STL decomposition for further analysis. The decomposition reveals a distinct seasonal pattern, even after removing the overall trend using a moving average. Notably, we observe a recurring yearly pattern characterized by peaks followed by low periods, succeeded by two to three months of heightened activity, and then another downturn before the start of a new seasonal cycle. Considering the observed dates and the pattern's structure, it is reasonable to assume that this time series is strongly influenced by the financial year and holidays. Based on the plot below, the seasonality also appears to be additive rather than multiplicative.

# Calculate the difference between the actual data and the 6-month moving average
vehicle_sales_tbl_ts <- vehicle_sales_tbl_ts %>%
  mutate(diff_ma6 = value - ma_6)

# Convert the tsibble to a monthly ts object (starting January 1976, frequency 12)
vehicle_sales_ts <- ts(vehicle_sales_tbl_ts$value, start = c(1976, 1), frequency = 12)

# Decompose the time series
decomposed <- stl(vehicle_sales_ts, s.window="periodic")

# Plot the decomposed time series
autoplot(decomposed) +
  labs(x = "Date", y = "Vehicle Sales", 
       title = "Decomposition of Vehicle Sales Time Series",
       subtitle = "Trend, Seasonal, and Random components") +
  theme_minimal()
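As a rough corroboration of the additive reading, one heuristic (sketched below, reusing vehicle_sales_ts from above) is to compare how much variance the STL remainder retains on the raw scale versus the log scale; if the log-scale decomposition were markedly cleaner, a multiplicative structure would be more plausible.

# Decompose the raw and log-transformed series and compare the share of
# variance left in the remainder component (smaller means a cleaner fit).
stl_raw <- stl(vehicle_sales_ts, s.window = "periodic")
stl_log <- stl(log(vehicle_sales_ts), s.window = "periodic")

var(stl_raw$time.series[, "remainder"]) / var(vehicle_sales_ts)
var(stl_log$time.series[, "remainder"]) / var(log(vehicle_sales_ts))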

In this section, we construct a model for our dataset using the ARIMA methodology. To begin, we split the data into two parts: the initial 80% serves as training data and the remaining 20% as testing data. The split is visualized below (training in blue, testing in red):

set.seed(123)

# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)

# Split the data into training and testing sets
train_data <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
test_data <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)

# Visualize the training and test sets
ggplot() +
  geom_line(data = train_data, aes(x = date, y = value, color = "Training"), linewidth = 0.3) +
  geom_line(data = test_data, aes(x = date, y = value, color = "Test"), linewidth = 0.3) +
  scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
  labs(title = "Training and Test Sets", x = "Date", y = "vehicle sales") +
  theme_minimal()

Section 2.1: ARIMA Modeling

# Load the necessary library
library(zoo)

# Calculate the 6-month moving average for the training data
train_data <- train_data %>%
  mutate(ma_6 = rollmean(value, 6, align = "center", fill = NA))

# Plot the 6-month moving average
ggplot(train_data, aes(x = date)) +
  geom_line(aes(y = ma_6), color = "blue", na.rm = TRUE) +
  labs(x = "Date", y = "Vehicle Sales", 
       title = "Vehicle Sales with 6-Month Moving Average") +
  theme_minimal()

As observed in the 6-month rolling average plot above, a distinct upward trend indicates that our time series is not mean-stationary and must be corrected before applying ARIMA. The variance, by contrast, appears only mildly non-stationary, with modest fluctuations in the size of peaks and troughs. Although there are occasional jumps within three or four consecutive periods, no variance-stabilizing transformation seems necessary. To confirm this observation, we examine a rolling standard deviation plot:

# Convert the training data to a ts object with a frequency of 12
train_sales_ts <- ts(train_data$value, frequency = 12)

# Calculate the rolling standard deviation for the training data
rolling_sd_train <- rollapply(train_sales_ts, width = 13, FUN = sd, align = "center", fill = NA)

# Create a data frame with the rolling standard deviation for the training data
rolling_sd_train_df <- data.frame(time = time(rolling_sd_train), sd = as.vector(rolling_sd_train))

# Plot the rolling standard deviation for the training data
ggplot(rolling_sd_train_df, aes(x = time, y = sd)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Time", y = "Standard Deviation", 
       title = "Rolling Standard Deviation of Vehicle Sales (Training Data)") +
  theme_minimal()
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
## `geom_smooth()` using formula = 'y ~ x'

The rolling standard deviation plot indicates a slight decreasing trend in variance over time. Given how slowly this trend changes relative to the overall variation observed across the years, we maintain our conclusion that a correction for variance non-stationarity is unnecessary.
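As an optional supporting check, the estimated Box-Cox lambda for the training series gives a rough sense of whether a variance-stabilizing transformation would help; a value near 1 suggests it would add little (a brief sketch using the train_sales_ts object defined above):

# Estimated Box-Cox lambda for the training series; values close to 1
# indicate little benefit from a variance-stabilizing transformation.
BoxCox.lambda(train_sales_ts)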

# Create a seasonally differenced series for the training data
train_data_diff <- diff(train_data$value, lag = 12)

# Convert the differenced training data back into a monthly time series object
# (start() and frequency() on a plain vector default to 1, so set the seasonal
# frequency explicitly)
train_data_diff_ts <- ts(train_data_diff, frequency = 12)

# Create the ACF and PACF plot for the training data
tsdisplay(train_data_diff_ts)

kpss_result <- kpss.test(train_data$value)
kpss_result
## 
##  KPSS Test for Level Stationarity
## 
## data:  train_data$value
## KPSS Level = 1.1514, Truncation lag parameter = 5, p-value = 0.01
kpss_diff <- kpss.test(diff(train_data$value))
kpss_diff
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(train_data$value)
## KPSS Level = 0.013876, Truncation lag parameter = 5, p-value = 0.1

Upon testing the original sales data, we obtain a very small p-value, leading us to reject the KPSS null hypothesis of level stationarity. After first differencing, however, the KPSS test yields a p-value of 0.10, above our 0.05 threshold, so we no longer reject stationarity. This suggests the differenced series is mean-stationary and that no further transformations are required.
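As a quick cross-check of this reading, the unit-root helpers in the forecast package can estimate the required number of regular and seasonal differences directly (a supplementary sketch on the monthly training series):

# Estimated number of regular and seasonal differences needed for stationarity
ndiffs(train_sales_ts)
nsdiffs(train_sales_ts)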

Section 2.2: Analysis of Transformed Time Series and Model Fitting

mod4 = Arima(train_data$value, order = c(2, 0, 2), seasonal = c(0, 0, 0))

summary(mod4)
## Series: train_data$value 
## ARIMA(2,0,2) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2     ma1      ma2       mean
##       0.1891  0.6556  0.3410  -0.3700  1229.7697
## s.e.  0.0998  0.0824  0.1072   0.0578    41.1733
## 
## sigma^2 = 20981:  log likelihood = -2939.5
## AIC=5891.01   AICc=5891.19   BIC=5915.79
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE      MASE       ACF1
## Training set 1.349663 144.0588 117.0126 -1.34284 9.788813 0.9002761 -0.0220673
# Extract the residuals from the model
residuals <- residuals(mod4)

# Perform the Ljung-Box test
ljung_box_test <- Box.test(residuals, type = "Ljung-Box")
ljung_box_test
## 
##  Box-Ljung test
## 
## data:  residuals
## X-squared = 0.22547, df = 1, p-value = 0.6349
forecast_values <- forecast(mod4, h = 6)
forecast_values
##     Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 461       1376.227 1190.596 1561.857 1092.3295 1660.124
## 462       1359.722 1149.627 1569.817 1038.4091 1681.034
## 463       1350.358 1128.390 1572.327 1010.8867 1689.830
## 464       1337.767 1102.476 1573.058  977.9209 1697.614
## 465       1329.248 1085.998 1572.498  957.2291 1701.267
## 466       1319.383 1068.147 1570.618  935.1508 1703.614
# Plot the forecast
autoplot(forecast_values, conf.int = TRUE) +
  ylab('Vehicle Sales') +
  ggtitle('6 Month Forecast of Vehicle Sales')

The Ljung-Box test, performed with various lag choices, yields p-values greater than 0.05. Consequently, we fail to reject the null hypothesis that the residuals are uncorrelated at those lags; the model leaves no significant residual autocorrelation. This suggests that the ARIMA model is a suitable candidate for our forecasting task.
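Since only the default lag is shown above, a short follow-up sketch repeats the test at a few larger lags, setting fitdf to the number of estimated ARMA coefficients:

# Ljung-Box p-values at several lags; fitdf = 4 accounts for the two AR and
# two MA coefficients estimated in mod4.
sapply(c(6, 12, 24), function(L) {
  Box.test(residuals, lag = L, type = "Ljung-Box", fitdf = 4)$p.value
})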

However, before proceeding with the ARIMA model, we also consider Meta’s Prophet model algorithm as an alternative.

Section 3.1: Meta Prophet Modeling

Prophet, a forecasting tool created by Facebook, specializes in analyzing time-series data with daily observations showcasing trends, seasonality, and holiday effects. It stands out for its user-friendly interface, offering intuitive parameter adjustments, automatic feature selection, and robust modeling techniques. Prophet efficiently captures intricate data patterns, making it a valuable asset for accurate time-series forecasting.

The basic model is fit as:

$$y_t = g_t + s_t + h_t + \epsilon_t$$

where $g_t$ is the trend, $s_t$ is the seasonality, $h_t$ captures holiday effects, and $\epsilon_t$ is the error term.
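For reference, the same kind of model can also be fit with the prophet package directly, which expects a data frame with columns ds (date) and y (value). A minimal sketch on the full monthly data frame follows (the analysis below uses fable.prophet instead; the names prophet_df, m, future, and fc are only illustrative):

# Minimal direct use of the prophet package: rename columns to ds/y,
# fit the model, and forecast six months ahead.
prophet_df <- vehicle_df %>%
  transmute(ds = as.Date(date), y = vehicle_sales)

m <- prophet(prophet_df)
future <- make_future_dataframe(m, periods = 6, freq = "month")
fc <- predict(m, future)   # point forecasts are in fc$yhat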

set.seed(123)

# Determine the number of rows for training (80%) and testing (20%)
n_rows <- nrow(vehicle_sales_tbl_ts)
train_rows <- round(0.8 * n_rows)

# Split the data into training and testing sets
vehicle_train <- vehicle_sales_tbl_ts %>% slice(1:train_rows)
vehicle_test <- vehicle_sales_tbl_ts %>% slice((train_rows + 1):n_rows)

Section 3.2: Seasonality Analysis

The Prophet model indicates a strong yearly seasonal pattern, which reinforces our earlier conclusion that seasonality is additive rather than multiplicative. We begin by re-examining the decomposition plot, updated with the latest version of the model:

model = vehicle_train %>%
    model(prophet = fable.prophet::prophet(vehicle_sales))

model %>%
components() %>%
autoplot()

The analysis of the time series data indicates a rising trend until the early 2000s, followed by a decline thereafter. Furthermore, there is an additive seasonal pattern observed, with no evidence of multiplicative seasonality.

changepoints = model %>%
glance() %>%
pull(changepoints) %>%
bind_rows() %>%
.$changepoints

vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
geom_vline(xintercept=as.Date(changepoints),color='red',linetype='dashed')

model2 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 25) + season(period = 'year') + season(period = 'month', order = 12)),
        prophet_10 = fable.prophet::prophet(vehicle_sales ~ growth(n_changepoints = 10) + season(period = 'year') + season(period = 'month', order = 12))
        )


changepoints_orig = model2 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints

changepoints_10 = model2 %>%
glance() %>%
filter(.model == 'prophet_10') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.05) %>%
.$changepoints

vehicle_train %>%
ggplot()+
geom_line(aes(date,vehicle_sales))+
# geom_vline(aes(xintercept=ymd('2000-01-01')))
geom_vline(xintercept=as.Date(changepoints_orig),color='red')+
geom_vline(xintercept=as.Date(changepoints_10),color='blue',linetype='dashed')

model1 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_90_range = fable.prophet::prophet(vehicle_sales~growth(changepoint_range=0.9)+season(period='year')+ season(period = 'month', order = 12))
        )

changepoints_orig = model1 %>%
glance() %>%
filter(.model == 'prophet_orig') %>%
pull(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints

changepoints_90_range = model1 %>%
glance() %>%
filter(.model == 'prophet_90_range') %>%
unnest(changepoints) %>%
bind_rows() %>%
#filter(abs(adjustment)>0.01) %>%
.$changepoints

model1 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_90_range),color='red',linetype='dashed')

model3 = vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_less_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.01)+season(period='year')),
        prophet_more_flexible = fable.prophet::prophet(vehicle_sales~growth(changepoint_prior_scale=0.5)+season(period='year'))
        )

changepoints_orig = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>% 
filter(.model == 'prophet_orig') %>%
filter(abs(adjustment)>=0.05) %>%
.$changepoints

changepoints_more_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_more_flexible') %>%
filter(abs(adjustment)>=0.50) %>%
.$changepoints

changepoints_less_flexible = model3 %>%
glance() %>%
unnest(changepoints) %>%
bind_rows() %>%
filter(.model == 'prophet_less_flexible') %>%
filter(abs(adjustment)>=0.01) %>%
.$changepoints

model3 %>%
forecast(h=68) %>%
autoplot(vehicle_train,level=NULL)+
geom_vline(xintercept=as.Date(changepoints_orig),color='blue',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_more_flexible),color='green',linetype='dashed')+
geom_vline(xintercept=as.Date(changepoints_less_flexible),color='red',linetype='dashed')

vehicle_train %>%
    model(
        prophet_orig = fable.prophet::prophet(vehicle_sales),
        prophet_saturating = fable.prophet::prophet(vehicle_sales~growth(type='linear',capacity=3.7,floor=2.8)+season('year'))
        ) %>%
    forecast(h=68) %>%
    autoplot(vehicle_train)

The forecasts from the less flexible changepoint specification (prophet_less_flexible) outperform the other changepoint options, suggesting that a less flexible approach may better capture the underlying data patterns. This implies that simpler models with fewer effective changepoints could yield more dependable forecasts.

Although the floor and capacity parameters were adjusted substantially, the resulting forecasts deviate only minimally from the base Prophet model. This is not surprising: in Prophet, capacity and floor take effect only under logistic (saturating) growth, so with growth(type = 'linear') they are not expected to alter the fit. Further exploration with a logistic growth specification would be necessary to gain a deeper understanding of their effect.
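One way to move beyond the visual comparison of the changepoint variants is to score model3 against the held-out test set (a sketch using fabletools::accuracy with the full tsibble and vehicle_test objects defined earlier):

# Forecast each model3 variant over the test horizon and rank the variants
# by RMSE against the observed values.
model3 %>%
  forecast(h = nrow(vehicle_test)) %>%
  accuracy(vehicle_sales_tbl_ts) %>%
  arrange(RMSE)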

model4 = vehicle_train %>%
    model(
      additive = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='additive')),
      multiplicative = fable.prophet::prophet(vehicle_sales~growth()+season(period='year',type='multiplicative')))

model4 %>%
components() %>%
autoplot()

The decomposition plots reveal the existence of additive seasonality within the time series, rather than multiplicative seasonality. This implies that the seasonal fluctuations within the data maintain a consistent magnitude over time, rather than scaling proportionally with the trend.

Section 4: Model Comparison and Validation

Now that we have our models constructed, we can measure their effectiveness through time-series cross-validation: we will use each model to build and plot a sequence of rolling-origin forecasts across expanding subsets of our training data and compare their performance:

vehicle_train %>%
bind_rows(vehicle_test) %>%
    ggplot()+
    geom_line(aes(date, vehicle_sales))+
    geom_vline(aes(xintercept=ymd('2014-05-01')),color='red')+
  annotate("text", x = as.Date("1976-01-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9, 
           label = "Train", color = "blue", size = 4) +
  annotate("text", x = as.Date("2023-11-01"), y = max(vehicle_sales_tbl_ts$vehicle_sales) * 0.9, 
           label = "Test", color = "blue", size = 4) +
  labs(x = "Week", y = "vehicle sales", title = "Training and Testing Dataset") +
    theme_bw()

vehicle_train_data <- vehicle_train %>%
  stretch_tsibble(.init = 22*12, .step = 12)

vehicle_train_data %>%
    ggplot()+
    geom_point(aes(date,factor(.id),color=factor(.id)))+
    ylab('Iteration')+
    ggtitle('Samples included in each CV Iteration')

vehicle_train_cv_forecast <- vehicle_train_data %>%
  model(
    arima = ARIMA(vehicle_sales~pdq(2,0,2)),
    prophet = fable.prophet::prophet(vehicle_sales~growth()+season(period='year')),
    naive = NAIVE(vehicle_sales~drift())
  ) %>%
  forecast(h=6)


vehicle_train_cv_forecast %>%
  as_tsibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(
      vehicle_train
  ) %>%
    ggplot()+
    geom_line(aes(date,vehicle_sales))+
    geom_line(aes(date,.mean,color=factor(.id),linetype=.model))+
    scale_color_discrete(name='Iteration')+
    ylab('Vehicle Sales')+
    theme_bw()
## Joining with `by = join_by(date)`

vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>%
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:6)
  ) %>%
  ungroup() %>%
  mutate(error = abs(vehicle_sales - .mean)) %>%
  ggplot()+
  geom_boxplot(aes(factor(months_ahead),error))+
  facet_wrap(~.model,ncol=1)+
  guides(color='none')+
  ylab('Absolute Error')+
  xlab('Months Ahead')+
  ggtitle('Absolute Error by Forecast Horizon')+
  ylim(0,600)
## Joining with `by = join_by(date)`

vehicle_train_cv_forecast %>%
  accuracy(vehicle_train)
## # A tibble: 3 × 10
##   .model  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima   Test  -16.4  91.1  72.0 -1.94  5.88 0.644 0.616 0.533
## 2 naive   Test  -27.2 190.  152.  -4.49 12.6  1.36  1.29  0.278
## 3 prophet Test  -34.4 210.  166.  -4.83 14.5  1.48  1.42  0.865
vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>% 
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:6)
  ) %>%
  ungroup() %>% 
  filter(!is.na(vehicle_sales)) %>%
  group_by(months_ahead,.model) %>%
  summarize(
    rmse = sqrt(mean((vehicle_sales - .mean)^2,na.rm=T)),
  ) %>%
  ungroup() %>%
  ggplot()+
  geom_point(aes(months_ahead,rmse,color=.model),alpha=0.4)+ 
  geom_smooth(aes(months_ahead,rmse,color=.model),se=F)+
  xlab("Months Ahead")+
  ylab("Smoothed RMSE")
## Joining with `by = join_by(date)`
## `summarise()` has grouped output by 'months_ahead'. You can override using the
## `.groups` argument.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

vehicle_train_cv_forecast %>%
  as_tibble() %>%
  dplyr::select(-vehicle_sales) %>%
  left_join(vehicle_train) %>%
  group_by(.id,.model) %>%
  mutate(
    months_ahead = seq(1:6)
  ) %>%
  ungroup() %>%
  group_by(months_ahead,.model) %>%
  mutate(
    mape = mean(abs((vehicle_sales - .mean)/vehicle_sales),na.rm=T)
  ) %>%
  ungroup() %>%
  ggplot()+
  geom_point(aes(months_ahead,mape,color=.model),alpha=0.4)+ 
  geom_smooth(aes(months_ahead,mape,color=.model),se=F)+
  xlab("Months Ahead")+
  ylab("Smoothed MAPE")
## Joining with `by = join_by(date)`
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The ARIMA model exhibits the lowest MAPE and RMSE values in the plots above.

Although the Prophet model adeptly captures the variability and trend within the data, it struggles to accurately estimate the magnitude of the observations. This indicates that while the model effectively tracks the general patterns, it tends to underestimate the true values, leading to weaker performance on the test set. Notably, the ARIMA model exhibits the lowest error rate among all models, as evidenced by its superior RMSE and MAPE.

When contrasting the fitted values with the actual observations, it’s evident that the model displays minimal bias, indicating a strong alignment with the training dataset. Yet, upon scrutinizing the forecast against the test data, a notable reduction in variability is observed, signaling a potential overfitting challenge. This disparity suggests that while the model effectively captures the intrinsic patterns, its ability to generalize to new data is compromised. Further exploration and adjustments may be necessary to mitigate the overfitting issue and enhance the model’s predictive performance.
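To make this hold-out comparison concrete, a final sketch refits the ARIMA specification used in the cross-validation on the full training set, forecasts the test period, and scores the forecasts against the held-out observations (the names final_fit and final_fc are illustrative):

# Refit the selected ARIMA specification on the full training set, forecast
# the held-out period, and compare against the actual test observations.
final_fit <- vehicle_train %>%
  model(arima = ARIMA(vehicle_sales ~ pdq(2, 0, 2)))

final_fc <- final_fit %>%
  forecast(h = nrow(vehicle_test))

final_fc %>%
  accuracy(vehicle_sales_tbl_ts)

final_fc %>%
  autoplot(vehicle_sales_tbl_ts, level = 95) +
  labs(x = "Date", y = "Vehicle Sales",
       title = "ARIMA Forecast vs. Held-Out Test Data")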