Introduction
For this project, we will be creating a Time Series. This time series
looks at the average number of air travel passengers on all U.S.
flights. The time series we will look at will include the number of
passengers on both domestic and international flights. We will create a
monthly time series which looks at the average number of air travel
passengers by month. This time series data was collected from 2003 to
2023, with the first observation in January of 2003, and the final
observation in September 2023.
We will split the time series data up into a training and a testng
data set. And then we create several candidate, exponential smoothing
models for forecasting. These will include a simple exponential model,
several Holt models, and several Holt-Winter’s models. We will see which
of these candidate models provides the best utility for an exponential
smoothing model of our time series. We will determine the best model by
seeing which one reduces the errors and provides the best overall
accuracy.
Data Description
I found this data set on kaggle.com on the following webpage: https://www.kaggle.com/datasets/yyxian/u-s-airline-traffic-data?select=air+traffic.csv
This data set looks at the average number of air travel passengers by
month from 2003 to 2023. The first observation in this data set was
collected in January 2003, and the final observation in this data set
was collected in September 2023. This data set includes variables which
represent the total number of passengers on both domestic and
international flights by month. However, for this time series, we will
look specifically at the variable which represents the total number of
passengers on all flights, both domestic and international. This data is
representative of passengers on all U.S. flights by month.
We will read in the data set from Github and we will call it
“flight”.
flight <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/air%20traffic.csv")
This data set contains many variables regarding air travel
information, but for this time series we are just interested in looking
at the number of the total air travel passengers. This is represented by
the variable Pax. We want to create a time series of the total air
travel passengers by the date, so we will only keep the time specific
variables, Year and Month, and the Pax variable. We will call the new
data set with only these specific variables we are interested in
“flight1”.
flight1 <- flight %>%
select(Year, Month, Pax)
Now, we have a new data set, flight1, with only the variables we are
interested in for our time series project.
Variables
There are three variables from the original data set which we will be
interested in for this time series project.
Year: The year of which the observation is from. This data set
spans from 2003 to 2023.
Month: The month of which the observation is from. Given as a two
digit number representing the month. For example, 01 represents January,
02 represents February, and so on.
Pax: The total number of air travel passengers for a given
observations. This includes air passengers on both domestic and
international flights.
During our exploratory data analysis, we will create a new variable
called Date which will combine the year and the month variables to get
just one, single variable to represent the date of which an observation
occurred. The date variable we will create will be given in the format
of Year-Month.
Exploratory Data
Analysis
First, let’s ensure that there are not any missing variables in this
data set.
colSums(is.na(flight1))
Year Month Pax
0 0 0
As we can see, there are zero missing values in our data set, so that
means we do not have to worry about filling in any missing observations.
We can proceed with our time series project.
Next, we will combine the Year and Month variables into one variable
called Date which is given as year and month combined, in the format of
Year-Month. We will create a new data set with the combined Date
variable and we will call this new data set “flight2”.
flight2 <- flight1 %>%
mutate(Date = paste(Year, sprintf("%02d", Month), sep = "-"))
flight2 <- select(flight2, -1,-2)
Now, we have a data set “flight2” with just the Date, given in a
Year-Month format, and the Pax variable representing the total number of
air travel passengers.
One thing I noticed is that the Pax variable is incorrectly given as
a character variable when it should really be a numeric variable since
it represents the total number of air travel passengers. We will change
the variable type of the Pax variable from a character to a numeric
variable in order to fix this issue. We will remove the commas from the
Pax variable before converting it to a numeric variable to avoid any
problems with the conversion of the variable type.
flight2$Pax <- gsub(",", "", flight2$Pax)
flight2$Pax <- as.numeric(flight2$Pax)
Now we have the proper data set we can use to create the time series.
This data set provides monthly averages for the number of air travel
passengers for all domestic and international flights.
Time Series Plot
Now, we will create a time series object from our flight data and
plot it to help visualize any potential trends that may be occurring
within this time series data set. We will call this time series object
“flight.ts”. Since we are creating a monthly time series looking at the
average number of air travel passengers by month, our frequency will be
12.
flight.ts <- ts(flight2$Pax, start = c(2003, 1), frequency = 12)
plot(flight.ts, main = "Monthly Average Air Travel Passengers from 2003 to 2023", xlab = "Year", ylab = "Air Travel Passengers")

Looking at the time series plot, we can see some notable occurrences
within our data. There is certainly some sort of seasonality going on
within our time series which can be seen by how the plot has a pattern
which repeats after a certain amount of time. It looks like this time
series might have an annual seasonality, with this repetition occurring
after a year has gone by. This would make sense that the number of air
travel passengers would have an annual seasonality, because more people
likely travel during the summer months due to the ideal weather and for
families this would mean their children would be on summer break which
would make having a vacation during the summer an optimal timing. It
would make sense that the number of air travel passengers would peak
during the summer months, and then drop back down until there is another
slight rise around the holiday months in the winter time due to people
travelling to spend the holidays with their family. Overall, an annual
seasonality would make sense for a time series on the number of air
travel passengers.
Another notable occurrence within this time series is that there
appears to be a trend of slight overall increase in the number of air
travel passengers from the start of the time series up until the end.
There is an especially noticeable trend of overall increase from around
2010 up until 2019. We can see that there is an incredibly massive
decrease in the number of air travel passengers around 2020. This can be
attributed to the travel restrictions which were put in place during the
early days of the covid pandemic. This would explain why there is such a
drastic decline in the number of air travel passengers seen in the year
2020. After this decline in 2020, we can see that the number of air
travel passengers begins to gradually increase over the following years,
and it appears that the numbers at the end of the time series in 2023
are fairly similar to what they were right before the major decline seen
in 2020.
Data Split
We will split our time series data into two data sets, one for
training, and one for testing. Since we have a monthly time series, we
will use the 12 most recent observations for the testing data set. The
rest of the observations will be used for the training data set. We will
also create a time series object for just the training data set. We will
call this time series object “flight.ts1”.
test.flight <- flight2$Pax[238:249]
train.flight <- flight2$Pax[1:238]
flight.ts1 <- ts(flight2$Pax[1:238], start = c(2003, 1), frequency = 12)
Now we have split our time series data set up into a training and a
testing data set. We will use the training data set to train the
smoothing models we will create. We will use the testing data set to
identify which of these smoothing models we will create is the best one
to use. So, we will next begin with creating these smoothing models and
we will use the training and testing data sets to help us with building
each of these models and then determining which model is the best one to
use.
Smoothing Models
We will now begin with building the smoothing models so we can
ultimately determine which one does the best and would be the best to
use. We will create three different types of smoothing models. These
different models will include a simple exponential smoothing model, Holt
models (additive and with a possible damped pattern), and Holt-Winter’s
models (additive and multiplicative seasonality with a possible damped
pattern).
Simple Exponential
Smoothing Model
First, we will begin with creating the simple exponential smoothing
model. We will call this simple exponential smoothing model “fit1”.
# Simple Exponential Smoothing Model
fit1 <- ses(flight.ts1, h = 12)
Now we have created our simple exponential smoothing model.
Holt Models
Next, we will create the Holt models. We will use three different
kinds of Holt models. The first of our Holt models will be a linear Holt
model, and we will call this “fit2”. Our next Holt model will be an
additive Holt model with a possible damped pattern, and this will be
called “fit3”. Our last Holt model will be an exponential Holt model
with a possible damped pattern, and this will be called “fit4”.
# Holt Linear Model
fit2 <- holt(flight.ts1, initial = "optimal", h = 12)
# Holt Additive Model with a Possible Damped Pattern
fit3 <- holt(flight.ts1, damped = TRUE, h = 12)
# Holt Exponential Model with a Possible Damped Pattern
fit4 <- holt(flight.ts1, exponential = TRUE, damped = TRUE, h = 12)
Now we have successfully created our Holt models.
Holt-Winter’s
Models
Lastly, we will create our Holt Winter’s models. We will use four
different kinds of Holt-Winter’s models. The first of these will be a
Holt-Winter’s additive model, and this will be called “fit5”. The next
will be a Holt-Winter’s multiplicative model, and this will be called
“fit6”. The next one after will be a Holt-Winter’s additive model with a
possible damped pattern, and this will be called “fit7”. The last of
these models will be a Holt-Winter’s multiplicative model with a
possible damped pattern, and this will be called “fit8”.
# Holt-Winter's Additive Model
fit5 <- hw(flight.ts1, h = 12, seasonal = "additive")
# Holt-Winter's Multiplicative Model
fit6 <- hw(flight.ts1, h = 12, seasonal = "multiplicative")
# Holt-Winter's Additive Model with a Possible Damped Pattern
fit7 <- hw(flight.ts1, h = 12, seasonal = "additive", damped=TRUE)
# Holt-Winter's Multiplicative Model with a Possible Damped Pattern
fit8 <- hw(flight.ts1, h = 12, seasonal = "multiplicative", damped=TRUE)
Now we have successfully created all of our Holt-Winter’s models.
Accuracy Measures
Now that we have created various models to try out, including a
simple exponential model, Holt models, and Holt-Winter’s models, we will
use accuracy measures to compare all of these different models. These
accuracy measures will be based upon the training data set that we
created from our time series data.
accuracy.table = round(rbind(accuracy(fit1), accuracy(fit2), accuracy(fit3), accuracy(fit4), accuracy(fit5), accuracy(fit6), accuracy(fit7), accuracy(fit8)), 4)
row.names(accuracy.table)=c("SES","Holt Linear","Holt Add. Damped", "Holt Mult. Damped", "HW Add.","HW Mult.","HW Add. Damped", "HW Mult. Damped")
kable(accuracy.table, caption = "Acccuracy Measures of the Various Smoothing Models Based upon the Training Data Set")
Acccuracy Measures of the Various Smoothing Models Based upon
the Training Data Set
SES |
102859.87 |
6626499 |
4795473 |
-5.1086 |
13.3855 |
0.6550 |
0.0043 |
Holt Linear |
-485232.07 |
6653815 |
4823863 |
-6.0574 |
13.4850 |
0.6589 |
-0.0019 |
Holt Add. Damped |
62009.49 |
6614114 |
4775792 |
-5.1840 |
13.3544 |
0.6523 |
0.0042 |
Holt Mult. Damped |
65535.10 |
6619923 |
4833567 |
-4.9977 |
13.3576 |
0.6602 |
-0.0276 |
HW Add. |
-240230.44 |
3914979 |
1755071 |
-4.4331 |
7.6521 |
0.2397 |
0.3594 |
HW Mult. |
-326797.03 |
4516990 |
2044228 |
-5.4066 |
8.9673 |
0.2792 |
0.3666 |
HW Add. Damped |
70386.74 |
3885667 |
1765809 |
-3.5371 |
7.6123 |
0.2412 |
0.3325 |
HW Mult. Damped |
70002.38 |
4144943 |
1769723 |
-3.9989 |
8.0597 |
0.2417 |
0.3616 |
To interpret our accuracy table, we have several measures of accuracy
we looked at. The ME, or mean errors, represents how far off the
predictions are from their actual values. The Holt Add. Damped and Holt
Mult. Damped have the smallest absolute value of their ME. Next, we have
RMSE, MAE, and MAPE, all of which a lower value is ideal because it
indicates the least errors. From these accuracy measurements, the HW
Add., and the HW. Add. Damped stand out for having notably lower values
than any of the other models do. For the MASE measurement, a value
closest to zero is ideal. The HW. Add model has the closest MASE to
zero, making this once again seem like a strong model. Lastly, for the
ACF1 measurements, a value closest to zero is ideal for accuracy. From
our models, the Holt Linear model has an ACF1 that is closest to
zero.
Overall, the Holt-Winter’s Additive model stood out for showing ideal
performance in the most accuracy measures of all of the models. This
Holt-Winter’s Additive model had low RMSE, MAE, and MAPE, as well as a
MASE closest to zero. Since the Holt-Winter’s Additive model stood out
in more accuracy measurements than any of the other models, we will
choose this smoothing model as the ideal model in terms of
performance.
Serial Plot of the
Data
We will create a serial plot of the historical data, which is the
data that was used for the training data set. This will allow us to
forecast the values and visualize this forecast.
We will create two serial plots, one will be for the non-seasonal
smoothing models, which will include the simple exponential smoothing
model, the Holt linear, the Holt Additive Damped, and the Holt
Multiplicative Damped models. The other serial plot will be for the
trend and seasonal smoothing models, which will include the HW Additive,
the HW Multiplicative, the HW Additive Damped, and the HW Multiplicative
Damped models.
First, we will create the serial plot for the non-seasonal smoothing
models.
par(mfrow = c(2,1), mar = c(3,4,3,1))
pred.id = 238:249
plot(1:238, flight.ts1, lwd = 2,type = "o", ylab = "Air Travel Passengers", xlab = "",
xlim = c(1, 249), ylim=c(0, 80000000), cex = 0.3,
main = "Non-seasonal Smoothing Models")
lines(pred.id, fit1$mean, col = "pink")
lines(pred.id, fit2$mean, col = "blue")
lines(pred.id, fit3$mean, col = "purple")
lines(pred.id, fit4$mean, col = "navy")
points(pred.id, fit1$mean, pch = 16, col = "pink", cex = 0.5)
points(pred.id, fit2$mean, pch = 17, col = "blue", cex = 0.5)
points(pred.id, fit3$mean, pch = 19, col = "purple", cex = 0.5)
points(pred.id, fit4$mean, pch = 21, col = "navy", cex = 0.5)
legend("bottomright", lty = 1, col = c("pink","blue","purple","navy"), pch = c(16, 17, 19, 21),
c("SES","Holt Linear","Holt Additive Damped", "Holt Multiplicative Damped"),
cex = 0.5, bty = "n")

We can see from the serial plot that this visualized the forecasts
for the non-seasonal smoothing models which we created.
Next, we will create the serial plot for the Holt-Winter’s trend and
seasonal smoothing models.
plot(1:238, flight.ts1, lwd = 2, type = "o", ylab = "Air Travel Passengers", xlab = "",
xlim=c(1,249), ylim=c(0, 80000000), cex = 0.3,
main="Holt-Winters Trend and Seasonal Smoothing Models")
lines(pred.id, fit5$mean, col = "pink")
lines(pred.id, fit6$mean, col = "blue")
lines(pred.id, fit7$mean, col = "purple")
lines(pred.id, fit8$mean, col = "navy")
points(pred.id, fit5$mean, pch = 16, col = "pink", cex = 0.5)
points(pred.id, fit6$mean, pch = 17, col = "blue", cex = 0.5)
points(pred.id, fit7$mean, pch = 19, col = "purple", cex = 0.5)
points(pred.id, fit8$mean, pch = 21, col = "navy", cex = 0.5)
legend("bottomright", lty=1, col = c("pink","blue","purple", "navy"), pch = c(16, 17, 19, 21),
c("HW Additive","HW Multiplicative","HW Additive Damped", "HW Multiplicative Damped"),
cex = 0.5, bty = "n")

We can see from this serial plot the forecasts for the trend and
seasonal smoothing models.
From the two serial plots we created, it looks like the Holt-Winter’s
Additive model once again seems like the best of the models we created.
This matches up with what was found in the accuracy table. It looks like
the Holt Winter’s Additive Model provides a quality forecast in the
serial plot for the Holt-Winter’s trend and seasonal smoothing
models.
Forecasting with the
Testing Data
Now that we have determined which model looks like the ideal choice,
the Holt-Winter’s Additive Model, we will use the testing data set which
we created earlier to verify our best model selection.
We previously used the training data set to train the various
smoothing models we looked at. To create our real-world forecast, we
will refit the model using the testing data as well. This will allow us
to verify our findings and if the model which looked to show the best
performance in forecasting truly looks like the ideal choice.
First, we will refit the model using the entire data set to update
the smoothing parameters.
acc.fun = function(test.data, mod.obj){
PE = 100*(test.data-mod.obj$mean)/mod.obj$mean
MAPE = mean(abs(PE))
E = test.data-mod.obj$mean
MSE = mean(E^2)
accuracy.metric = c(MSE = MSE, MAPE = MAPE)
accuracy.metric
}
Now, we will find the updated accuracy prediction measures for each
of the smoothing models we created.
pred.accuracy = rbind(SES = acc.fun(test.data = test.flight, mod.obj = fit1),
Holt.Add = acc.fun(test.data=test.flight,
mod.obj = fit2),
Holt.Add.Damp = acc.fun(test.data = test.flight,
mod.obj=fit3),
Holt.Exp = acc.fun(test.data = test.flight,
mod.obj = fit4),
HW.Add = acc.fun(test.data = test.flight,
mod.obj = fit5),
HW.Exp = acc.fun(test.data = test.flight,
mod.obj = fit6),
HW.Add.Damp = acc.fun(test.data = test.flight,
mod.obj = fit7),
HW.Exp.Damp = acc.fun(test.data = test.flight,
mod.obj = fit8))
kable(pred.accuracy, caption = "Accuracy Measures of the Various Exponential Smoothing Models Based on the Testing Data Set")
Accuracy Measures of the Various Exponential Smoothing Models
Based on the Testing Data Set
SES |
4.496124e+13 |
7.346870 |
Holt.Add |
3.748179e+13 |
6.273833 |
Holt.Add.Damp |
4.493119e+13 |
7.343571 |
Holt.Exp |
4.454601e+13 |
7.246437 |
HW.Add |
3.049452e+13 |
5.195657 |
HW.Exp |
4.372064e+13 |
7.284188 |
HW.Add.Damp |
3.366352e+13 |
5.255700 |
HW.Exp.Damp |
3.954300e+13 |
5.948060 |
Looking at the accuracy measures based on the testing data set, we
can see that the Holt-Winter’s Additive model has both the lowest MSE
and MAPE. This is ideal because we want our model to reduce the errors
as much as possible. Also, this matches up with what we found earlier,
because once again the Holt-Winter’s Additive model has the best
performance in terms of accuracy. So, it looks like the Holt-Winter’s
Additive model is the ideal choice, and this has been verified with the
testing data set.
Final Model
Based upon our findings, it looks like the Holt-Winter’s Additive
model is the ideal choice out of the many possible candidate models we
created. This model showed its ideal performance by reducing the errors
more significantly than any of the other models did. This Holt-Winter’s
Additive model also showed the best performance in terms of the accuracy
measures, because it had the lowest MSE and MAPE values of out any of
the models. So, the Holt-Winter’s Additive model is the ideal model out
of all the candidate models we created and tried out.
Smoothing
Parameters
We will take a look at the smoothing parameters for the Holt-Winter’s
Additive model.
final.model <- hw(flight.ts1, h = 12, seasonal = "additive")
smoothing.parameter = final.model$model$par[1:3]
kable(smoothing.parameter, caption="The Estimated Values of the Smoothing Parameters in the Holt-Winters Additive Model")
The Estimated Values of the Smoothing Parameters in the
Holt-Winters Additive Model
alpha |
0.9998907 |
beta |
0.0003555 |
gamma |
0.0001081 |
The smoothing parameters for our final model, the Holt-Winter’s
Additive model, include an alpha of 0.9999, a beta of 0.0004, and a
gamma of 0.0001.
Conclusion
Overall, we looked at a data set which was representative of the
total number of air travel passengers in the U.S. through monthly
averages. This data was collected over the course of the years from 2003
to 2023. We created a monthly time series which showed the trends and
seasonality of the average number of air travel passengers by month over
the span of this 20 year period.
We split the time series data up into a training and a testing data
set. After looking at several potenetial smoothing models, the one which
was found to provide the best performance was the Holt-Winter’s Additive
model. This model was the best choice because it did the best job out of
decreasing the errors, and provided the best accuracy overall. So, the
Holt-Winter’s Additive model was the best choice out of the several
candidate, smoothing models we looked at.
Recommendations
Some recommendations I would make for future projects include:
Consider expanding the time series data to include the average
air travel passengers for all countries, not just the U.S. in order to
see if this still results in the Holt-Winter’s Additive model being the
ideal model choice, or if another model seems to provide better
performance in this scenario.
Further look into the seasonality in this time series data set,
since there definitely appeared to be seasonality occurring with more
passengers travelling in certain times of the year, like summer, than
during other times of the year which saw fewer passengers.
---
title: "Monthly Average Air Travel Passengers: Time Series with Exponential Smoothing"
author: "Josie Gallop"
date: "2024-12-11"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("lubridate")) {
   install.packages("lubridate")
   library(lubridate)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
if (!require("forecast")) {
   install.packages("forecast")
   library(forecast)
}
knitr::opts_chunk$set(echo = TRUE,  
                   warning = FALSE,   
                   message = FALSE,  
                   results = TRUE,  
                   comment = NA   
                      )   
```


# Introduction

For this project, we will be creating a Time Series. This time series looks at the average number of air travel passengers on all U.S. flights. The time series we will look at will include the number of passengers on both domestic and international flights. We will create a monthly time series which looks at the average number of air travel passengers by month. This time series data was collected from 2003 to 2023, with the first observation in January of 2003, and the final observation in September 2023.

We will split the time series data up into a training and a testng data set. And then we create several candidate, exponential smoothing models for forecasting. These will include a simple exponential model, several Holt models, and several Holt-Winter's models. We will see which of these candidate models provides the best utility for an exponential smoothing model of our time series. We will determine the best model by seeing which one reduces the errors and provides the best overall accuracy. 


## Data Description

I found this data set on kaggle.com on the following webpage:
https://www.kaggle.com/datasets/yyxian/u-s-airline-traffic-data?select=air+traffic.csv

This data set looks at the average number of air travel passengers by month from 2003 to 2023. The first observation in this data set was collected in January 2003, and the final observation in this data set was collected in September 2023. This data set includes variables which represent the total number of passengers on both domestic and international flights by month. However, for this time series, we will look specifically at the variable which represents the total number of passengers on all flights, both domestic and international. This data is representative of passengers on all U.S. flights by month. 

We will read in the data set from Github and we will call it "flight". 

```{r}
flight <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/air%20traffic.csv")
```

This data set contains many variables regarding air travel information, but for this time series we are just interested in looking at the number of the total air travel passengers. This is represented by the variable Pax. We want to create a time series of the total air travel passengers by the date, so we will only keep the time specific variables, Year and Month, and the Pax variable. We will call the new data set with only these specific variables we are interested in "flight1".

```{r}
flight1 <- flight %>%
  select(Year, Month, Pax)
```

Now, we have a new data set, flight1, with only the variables we are interested in for our time series project.


## Variables

There are three variables from the original data set which we will be interested in for this time series project.

* Year: The year of which the observation is from. This data set spans from 2003 to 2023. 

* Month: The month of which the observation is from. Given as a two digit number representing the month. For example, 01 represents January, 02 represents February, and so on.

* Pax: The total number of air travel passengers for a given observations. This includes air passengers on both domestic and international flights. 

During our exploratory data analysis, we will create a new variable called Date which will combine the year and the month variables to get just one, single variable to represent the date of which an observation occurred. The date variable we will create will be given in the format of Year-Month. 




# Exploratory Data Analysis

First, let's ensure that there are not any missing variables in this data set. 
```{r}
colSums(is.na(flight1))
```

As we can see, there are zero missing values in our data set, so that means we do not have to worry about filling in any missing observations. We can proceed with our time series project.  

Next, we will combine the Year and Month variables into one variable called Date which is given as year and month combined, in the format of Year-Month. We will create a new data set with the combined Date variable and we will call this new data set "flight2". 

```{r}
flight2 <- flight1 %>%
  mutate(Date = paste(Year, sprintf("%02d", Month), sep = "-"))
flight2 <- select(flight2, -1,-2)
```

Now, we have a data set "flight2" with just the Date, given in a Year-Month format, and the Pax variable representing the total number of air travel passengers. 

One thing I noticed is that the Pax variable is incorrectly given as a character variable when it should really be a numeric variable since it represents the total number of air travel passengers. We will change the variable type of the Pax variable from a character to a numeric variable in order to fix this issue. We will remove the commas from the Pax variable before converting it to a numeric variable to avoid any problems with the conversion of the variable type.

```{r}
flight2$Pax <- gsub(",", "", flight2$Pax) 
flight2$Pax <- as.numeric(flight2$Pax)
```

Now we have the proper data set we can use to create the time series. This data set provides monthly averages for the number of air travel passengers for all domestic and international flights. 


## Time Series Plot

Now, we will create a time series object from our flight data and plot it to help visualize any potential trends that may be occurring within this time series data set. We will call this time series object "flight.ts". Since we are creating a monthly time series looking at the average number of air travel passengers by month, our frequency will be 12. 

```{r}
flight.ts <- ts(flight2$Pax, start = c(2003, 1), frequency = 12)
plot(flight.ts, main = "Monthly Average Air Travel Passengers from 2003 to 2023", xlab = "Year", ylab = "Air Travel Passengers")
```

Looking at the time series plot, we can see some notable occurrences within our data. There is certainly some sort of seasonality going on within our time series which can be seen by how the plot has a pattern which repeats after a certain amount of time. It looks like this time series might have an annual seasonality, with this repetition occurring after a year has gone by. This would make sense that the number of air travel passengers would have an annual seasonality, because more people likely travel during the summer months due to the ideal weather and for families this would mean their children would be on summer break which would make having a vacation during the summer an optimal timing. It would make sense that the number of air travel passengers would peak during the summer months, and then drop back down until there is another slight rise around the holiday months in the winter time due to people travelling to spend the holidays with their family. Overall, an annual seasonality would make sense for a time series on the number of air travel passengers. 

Another notable occurrence within this time series is that there appears to be a trend of slight overall increase in the number of air travel passengers from the start of the time series up until the end. There is an especially noticeable trend of overall increase from around 2010 up until 2019. We can see that there is an incredibly massive decrease in the number of air travel passengers around 2020. This can be attributed to the travel restrictions which were put in place during the early days of the covid pandemic. This would explain why there is such a drastic decline in the number of air travel passengers seen in the year 2020. After this decline in 2020, we can see that the number of air travel passengers begins to gradually increase over the following years, and it appears that the numbers at the end of the time series in 2023 are fairly similar to what they were right before the major decline seen in 2020. 




# Data Split

We will split our time series data into two data sets, one for training, and one for testing. Since we have a monthly time series, we will use the 12 most recent observations for the testing data set. The rest of the observations will be used for the training data set. We will also create a time series object for just the training data set. We will call this time series object "flight.ts1".

```{r}
test.flight <- flight2$Pax[238:249]
train.flight <- flight2$Pax[1:238]
flight.ts1 <- ts(flight2$Pax[1:238], start = c(2003, 1), frequency = 12)
```

Now we have split our time series data set up into a training and a testing data set. We will use the training data set to train the smoothing models we will create. We will use the testing data set to identify which of these smoothing models we will create is the best one to use. So, we will next begin with creating these smoothing models and we will use the training and testing data sets to help us with building each of these models and then determining which model is the best one to use. 




# Smoothing Models

We will now begin with building the smoothing models so we can ultimately determine which one does the best and would be the best to use. We will create three different types of smoothing models. These different models will include a simple exponential smoothing model, Holt models (additive and with a possible damped pattern), and Holt-Winter's models (additive and multiplicative seasonality with a possible damped pattern).


## Simple Exponential Smoothing Model

First, we will begin with creating the simple exponential smoothing model. We will call this simple exponential smoothing model "fit1".

```{r}
# Simple Exponential Smoothing Model
fit1 <- ses(flight.ts1, h = 12)
```

Now we have created our simple exponential smoothing model.


## Holt Models

Next, we will create the Holt models. We will use three different kinds of Holt models. The first of our Holt models will be a linear Holt model, and we will call this "fit2". Our next Holt model will be an additive Holt model with a possible damped pattern, and this will be called "fit3". Our last Holt model will be an exponential Holt model with a possible damped pattern, and this will be called "fit4". 

```{r}
# Holt Linear Model
fit2 <- holt(flight.ts1, initial = "optimal", h = 12)       

# Holt Additive Model with a Possible Damped Pattern 
fit3 <- holt(flight.ts1, damped = TRUE, h = 12)                  

# Holt Exponential Model with a Possible Damped Pattern
fit4 <- holt(flight.ts1, exponential = TRUE, damped = TRUE, h = 12) 
```

Now we have successfully created our Holt models. 


## Holt-Winter's Models

Lastly, we will create our Holt Winter's models. We will use four different kinds of Holt-Winter's models. The first of these will be a Holt-Winter's additive model, and this will be called "fit5". The next will be a Holt-Winter's multiplicative model, and this will be called "fit6". The next one after will be a Holt-Winter's additive model with a possible damped pattern, and this will be called "fit7". The last of these models will be a Holt-Winter's multiplicative model with a possible damped pattern, and this will be called "fit8".

```{r}
# Holt-Winter's Additive Model
fit5 <- hw(flight.ts1, h = 12, seasonal = "additive")     

# Holt-Winter's Multiplicative Model
fit6 <- hw(flight.ts1, h = 12, seasonal = "multiplicative")

# Holt-Winter's Additive Model with a Possible Damped Pattern
fit7 <- hw(flight.ts1, h = 12, seasonal = "additive", damped=TRUE)

# Holt-Winter's Multiplicative Model with a Possible Damped Pattern
fit8 <- hw(flight.ts1, h = 12, seasonal = "multiplicative", damped=TRUE)
```

Now we have successfully created all of our Holt-Winter's models. 




# Accuracy Measures

Now that we have created various models to try out, including a simple exponential model, Holt models, and Holt-Winter's models, we will use accuracy measures to compare all of these different models. These accuracy measures will be based upon the training data set that we created from our time series data. 

```{r}
accuracy.table = round(rbind(accuracy(fit1), accuracy(fit2), accuracy(fit3), accuracy(fit4), accuracy(fit5), accuracy(fit6), accuracy(fit7), accuracy(fit8)), 4)
row.names(accuracy.table)=c("SES","Holt Linear","Holt Add. Damped", "Holt Mult. Damped", "HW Add.","HW Mult.","HW Add. Damped", "HW Mult. Damped")

kable(accuracy.table, caption = "Acccuracy Measures of the Various Smoothing Models Based upon the Training Data Set")
```

To interpret our accuracy table, we have several measures of accuracy we looked at. The ME, or mean errors, represents how far off the predictions are from their actual values. The Holt Add. Damped and Holt Mult. Damped have the smallest absolute value of their ME. Next, we have RMSE, MAE, and MAPE, all of which a lower value is ideal because it indicates the least errors. From these accuracy measurements, the HW Add., and the HW. Add. Damped stand out for having notably lower values than any of the other models do. For the MASE measurement, a value closest to zero is ideal. The HW. Add model has the closest MASE to zero, making this once again seem like a strong model. Lastly, for the ACF1 measurements, a value closest to zero is ideal for accuracy. From our models, the Holt Linear model has an ACF1 that is closest to zero.

Overall, the Holt-Winter's Additive model stood out for showing ideal performance in the most accuracy measures of all of the models. This Holt-Winter's Additive model had low RMSE, MAE, and MAPE, as well as a MASE closest to zero. Since the Holt-Winter's Additive model stood out in more accuracy measurements than any of the other models, we will choose this smoothing model as the ideal model in terms of performance. 




# Serial Plot of the Data

We will create a serial plot of the historical data, which is the data that was used for the training data set. This will allow us to forecast the values and visualize this forecast.

We will create two serial plots, one will be for the non-seasonal smoothing models, which will include the simple exponential smoothing model, the Holt linear, the Holt Additive Damped, and the Holt Multiplicative Damped models. The other serial plot will be for the trend and seasonal smoothing models, which will include the HW Additive, the HW Multiplicative, the HW Additive Damped, and the HW Multiplicative Damped models. 

First, we will create the serial plot for the non-seasonal smoothing models. 

```{r}
par(mfrow = c(2,1), mar = c(3,4,3,1))
pred.id = 238:249
plot(1:238, flight.ts1, lwd = 2,type = "o", ylab = "Air Travel Passengers", xlab = "", 
     xlim = c(1, 249), ylim=c(0, 80000000), cex = 0.3,
     main = "Non-seasonal Smoothing Models")
lines(pred.id, fit1$mean, col = "pink")
lines(pred.id, fit2$mean, col = "blue")
lines(pred.id, fit3$mean, col = "purple")
lines(pred.id, fit4$mean, col = "navy")

points(pred.id, fit1$mean, pch = 16, col = "pink", cex = 0.5)
points(pred.id, fit2$mean, pch = 17, col = "blue", cex = 0.5)
points(pred.id, fit3$mean, pch = 19, col = "purple", cex = 0.5)
points(pred.id, fit4$mean, pch = 21, col = "navy", cex = 0.5)

legend("bottomright", lty = 1, col = c("pink","blue","purple","navy"), pch = c(16, 17, 19, 21),
   c("SES","Holt Linear","Holt Additive Damped", "Holt Multiplicative Damped"), 
   cex = 0.5, bty = "n")
```

We can see from the serial plot that this visualized the forecasts for the non-seasonal smoothing models which we created.

Next, we will create the serial plot for the Holt-Winter's trend and seasonal smoothing models.

```{r}
plot(1:238, flight.ts1, lwd = 2, type = "o", ylab = "Air Travel Passengers", xlab = "", 
     xlim=c(1,249), ylim=c(0, 80000000), cex = 0.3,
     main="Holt-Winters Trend and Seasonal Smoothing Models")
lines(pred.id, fit5$mean, col = "pink")
lines(pred.id, fit6$mean, col = "blue")
lines(pred.id, fit7$mean, col = "purple")
lines(pred.id, fit8$mean, col = "navy")

points(pred.id, fit5$mean, pch = 16, col = "pink", cex = 0.5)
points(pred.id, fit6$mean, pch = 17, col = "blue", cex = 0.5)
points(pred.id, fit7$mean, pch = 19, col = "purple", cex = 0.5)
points(pred.id, fit8$mean, pch = 21, col = "navy", cex = 0.5)

legend("bottomright", lty=1, col = c("pink","blue","purple", "navy"), pch = c(16, 17, 19, 21),
   c("HW Additive","HW Multiplicative","HW Additive Damped", "HW Multiplicative Damped"), 
   cex = 0.5, bty = "n")
```

We can see from this serial plot the forecasts for the trend and seasonal smoothing models.

From the two serial plots we created, it looks like the Holt-Winter's Additive model once again seems like the best of the models we created. This matches up with what was found in the accuracy table. It looks like the Holt Winter's Additive Model provides a quality forecast in the serial plot for the Holt-Winter's trend and seasonal smoothing models.




# Forecasting with the Testing Data

Now that we have determined which model looks like the ideal choice, the Holt-Winter's Additive Model, we will use the testing data set which we created earlier to verify our best model selection.

We previously used the training data set to train the various smoothing models we looked at. To create our real-world forecast, we will refit the model using the testing data as well. This will allow us to verify our findings and if the model which looked to show the best performance in forecasting truly looks like the ideal choice.

First, we will refit the model using the entire data set to update the smoothing parameters.

```{r}
acc.fun = function(test.data, mod.obj){
  PE = 100*(test.data-mod.obj$mean)/mod.obj$mean
  MAPE = mean(abs(PE))
  E = test.data-mod.obj$mean
  MSE = mean(E^2)
  accuracy.metric = c(MSE = MSE, MAPE = MAPE)
  accuracy.metric
}
```

Now, we will find the updated accuracy prediction measures for each of the smoothing models we created. 

```{r}
pred.accuracy = rbind(SES = acc.fun(test.data = test.flight, mod.obj = fit1),
                      Holt.Add = acc.fun(test.data=test.flight, 
                                         mod.obj = fit2),
                      Holt.Add.Damp = acc.fun(test.data = test.flight,    
                                              mod.obj=fit3),
                      Holt.Exp = acc.fun(test.data = test.flight, 
                                         mod.obj = fit4),
                      HW.Add = acc.fun(test.data = test.flight, 
                                       mod.obj = fit5),
                      HW.Exp = acc.fun(test.data = test.flight, 
                                       mod.obj = fit6),
                      HW.Add.Damp = acc.fun(test.data = test.flight, 
                                            mod.obj = fit7),
                      HW.Exp.Damp = acc.fun(test.data = test.flight, 
                                            mod.obj = fit8))
kable(pred.accuracy, caption = "Accuracy Measures of the Various Exponential Smoothing Models Based on the Testing Data Set")
```

Looking at the accuracy measures based on the testing data set, we can see that the Holt-Winter's Additive model has both the lowest MSE and MAPE. This is ideal because we want our model to reduce the errors as much as possible. Also, this matches up with what we found earlier, because once again the Holt-Winter's Additive model has the best performance in terms of accuracy. So, it looks like the Holt-Winter's Additive model is the ideal choice, and this has been verified with the testing data set. 




# Final Model

Based upon our findings, it looks like the Holt-Winter's Additive model is the ideal choice out of the many possible candidate models we created. This model showed its ideal performance by reducing the errors more significantly than any of the other models did. This Holt-Winter's Additive model also showed the best performance in terms of the accuracy measures, because it had the lowest MSE and MAPE values of out any of the models. So, the Holt-Winter's Additive model is the ideal model out of all the candidate models we created and tried out.


## Smoothing Parameters

We will take a look at the smoothing parameters for the Holt-Winter's Additive model. 

```{r}
final.model <- hw(flight.ts1, h = 12, seasonal = "additive")
smoothing.parameter = final.model$model$par[1:3]
kable(smoothing.parameter, caption="The Estimated Values of the Smoothing Parameters in the Holt-Winters Additive Model")
```

The smoothing parameters for our final model, the Holt-Winter's Additive model, include an alpha of 0.9999, a beta of 0.0004, and a gamma of 0.0001. 
  



# Conclusion

Overall, we looked at a data set which was representative of the total number of air travel passengers in the U.S. through monthly averages. This data was collected over the course of the years from 2003 to 2023. We created a monthly time series which showed the trends and seasonality of the average number of air travel passengers by month over the span of this 20 year period. 

We split the time series data up into a training and a testing data set. After looking at several potenetial smoothing models, the one which was found to provide the best performance was the Holt-Winter's Additive model. This model was the best choice because it did the best job out of decreasing the errors, and provided the best accuracy overall. So, the Holt-Winter's Additive model was the best choice out of the several candidate, smoothing models we looked at. 


## Recommendations

Some recommendations I would make for future projects include:

* Consider expanding the time series data to include the average air travel passengers for all countries, not just the U.S. in order to see if this still results in the Holt-Winter's Additive model being the ideal model choice, or if another model seems to provide better performance in this scenario.

* Further look into the seasonality in this time series data set, since there definitely appeared to be seasonality occurring with more passengers travelling in certain times of the year, like summer, than during other times of the year which saw fewer passengers. 




# References

The data set I used for my time series was found on kaggle.com. Included below is the citation of the web page where I found this data set. 

YYXian. (2024, January 15). U.S. airline traffic data (2003-2023). Kaggle. https://www.kaggle.com/datasets/yyxian/u-s-airline-traffic-data?select=air%2Btraffic.csv 
