1 Introduction

For this project, we will be creating a Time Series. This time series looks at the average number of air travel passengers on all U.S. flights. The time series we will look at will include the number of passengers on both domestic and international flights. We will create a monthly time series which looks at the average number of air travel passengers by month. This time series data was collected from 2003 to 2023, with the first observation in January of 2003, and the final observation in September 2023.

We will split the time series data up into a training and a testng data set. And then we create several candidate, exponential smoothing models for forecasting. These will include a simple exponential model, several Holt models, and several Holt-Winter’s models. We will see which of these candidate models provides the best utility for an exponential smoothing model of our time series. We will determine the best model by seeing which one reduces the errors and provides the best overall accuracy.

1.1 Data Description

I found this data set on kaggle.com on the following webpage: https://www.kaggle.com/datasets/yyxian/u-s-airline-traffic-data?select=air+traffic.csv

This data set looks at the average number of air travel passengers by month from 2003 to 2023. The first observation in this data set was collected in January 2003, and the final observation in this data set was collected in September 2023. This data set includes variables which represent the total number of passengers on both domestic and international flights by month. However, for this time series, we will look specifically at the variable which represents the total number of passengers on all flights, both domestic and international. This data is representative of passengers on all U.S. flights by month.

We will read in the data set from Github and we will call it “flight”.

flight <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/air%20traffic.csv")

This data set contains many variables regarding air travel information, but for this time series we are just interested in looking at the number of the total air travel passengers. This is represented by the variable Pax. We want to create a time series of the total air travel passengers by the date, so we will only keep the time specific variables, Year and Month, and the Pax variable. We will call the new data set with only these specific variables we are interested in “flight1”.

flight1 <- flight %>%
  select(Year, Month, Pax)

Now, we have a new data set, flight1, with only the variables we are interested in for our time series project.

1.2 Variables

There are three variables from the original data set which we will be interested in for this time series project.

  • Year: The year of which the observation is from. This data set spans from 2003 to 2023.

  • Month: The month of which the observation is from. Given as a two digit number representing the month. For example, 01 represents January, 02 represents February, and so on.

  • Pax: The total number of air travel passengers for a given observations. This includes air passengers on both domestic and international flights.

During our exploratory data analysis, we will create a new variable called Date which will combine the year and the month variables to get just one, single variable to represent the date of which an observation occurred. The date variable we will create will be given in the format of Year-Month.

2 Exploratory Data Analysis

First, let’s ensure that there are not any missing variables in this data set.

colSums(is.na(flight1))
 Year Month   Pax 
    0     0     0 

As we can see, there are zero missing values in our data set, so that means we do not have to worry about filling in any missing observations. We can proceed with our time series project.

Next, we will combine the Year and Month variables into one variable called Date which is given as year and month combined, in the format of Year-Month. We will create a new data set with the combined Date variable and we will call this new data set “flight2”.

flight2 <- flight1 %>%
  mutate(Date = paste(Year, sprintf("%02d", Month), sep = "-"))
flight2 <- select(flight2, -1,-2)

Now, we have a data set “flight2” with just the Date, given in a Year-Month format, and the Pax variable representing the total number of air travel passengers.

One thing I noticed is that the Pax variable is incorrectly given as a character variable when it should really be a numeric variable since it represents the total number of air travel passengers. We will change the variable type of the Pax variable from a character to a numeric variable in order to fix this issue. We will remove the commas from the Pax variable before converting it to a numeric variable to avoid any problems with the conversion of the variable type.

flight2$Pax <- gsub(",", "", flight2$Pax) 
flight2$Pax <- as.numeric(flight2$Pax)

Now we have the proper data set we can use to create the time series. This data set provides monthly averages for the number of air travel passengers for all domestic and international flights.

2.1 Time Series Plot

Now, we will create a time series object from our flight data and plot it to help visualize any potential trends that may be occurring within this time series data set. We will call this time series object “flight.ts”. Since we are creating a monthly time series looking at the average number of air travel passengers by month, our frequency will be 12.

flight.ts <- ts(flight2$Pax, start = c(2003, 1), frequency = 12)
plot(flight.ts, main = "Monthly Average Air Travel Passengers from 2003 to 2023", xlab = "Year", ylab = "Air Travel Passengers")

Looking at the time series plot, we can see some notable occurrences within our data. There is certainly some sort of seasonality going on within our time series which can be seen by how the plot has a pattern which repeats after a certain amount of time. It looks like this time series might have an annual seasonality, with this repetition occurring after a year has gone by. This would make sense that the number of air travel passengers would have an annual seasonality, because more people likely travel during the summer months due to the ideal weather and for families this would mean their children would be on summer break which would make having a vacation during the summer an optimal timing. It would make sense that the number of air travel passengers would peak during the summer months, and then drop back down until there is another slight rise around the holiday months in the winter time due to people travelling to spend the holidays with their family. Overall, an annual seasonality would make sense for a time series on the number of air travel passengers.

Another notable occurrence within this time series is that there appears to be a trend of slight overall increase in the number of air travel passengers from the start of the time series up until the end. There is an especially noticeable trend of overall increase from around 2010 up until 2019. We can see that there is an incredibly massive decrease in the number of air travel passengers around 2020. This can be attributed to the travel restrictions which were put in place during the early days of the covid pandemic. This would explain why there is such a drastic decline in the number of air travel passengers seen in the year 2020. After this decline in 2020, we can see that the number of air travel passengers begins to gradually increase over the following years, and it appears that the numbers at the end of the time series in 2023 are fairly similar to what they were right before the major decline seen in 2020.

3 Data Split

We will split our time series data into two data sets, one for training, and one for testing. Since we have a monthly time series, we will use the 12 most recent observations for the testing data set. The rest of the observations will be used for the training data set. We will also create a time series object for just the training data set. We will call this time series object “flight.ts1”.

test.flight <- flight2$Pax[238:249]
train.flight <- flight2$Pax[1:238]
flight.ts1 <- ts(flight2$Pax[1:238], start = c(2003, 1), frequency = 12)

Now we have split our time series data set up into a training and a testing data set. We will use the training data set to train the smoothing models we will create. We will use the testing data set to identify which of these smoothing models we will create is the best one to use. So, we will next begin with creating these smoothing models and we will use the training and testing data sets to help us with building each of these models and then determining which model is the best one to use.

4 Smoothing Models

We will now begin with building the smoothing models so we can ultimately determine which one does the best and would be the best to use. We will create three different types of smoothing models. These different models will include a simple exponential smoothing model, Holt models (additive and with a possible damped pattern), and Holt-Winter’s models (additive and multiplicative seasonality with a possible damped pattern).

4.1 Simple Exponential Smoothing Model

First, we will begin with creating the simple exponential smoothing model. We will call this simple exponential smoothing model “fit1”.

# Simple Exponential Smoothing Model
fit1 <- ses(flight.ts1, h = 12)

Now we have created our simple exponential smoothing model.

4.2 Holt Models

Next, we will create the Holt models. We will use three different kinds of Holt models. The first of our Holt models will be a linear Holt model, and we will call this “fit2”. Our next Holt model will be an additive Holt model with a possible damped pattern, and this will be called “fit3”. Our last Holt model will be an exponential Holt model with a possible damped pattern, and this will be called “fit4”.

# Holt Linear Model
fit2 <- holt(flight.ts1, initial = "optimal", h = 12)       

# Holt Additive Model with a Possible Damped Pattern 
fit3 <- holt(flight.ts1, damped = TRUE, h = 12)                  

# Holt Exponential Model with a Possible Damped Pattern
fit4 <- holt(flight.ts1, exponential = TRUE, damped = TRUE, h = 12) 

Now we have successfully created our Holt models.

4.3 Holt-Winter’s Models

Lastly, we will create our Holt Winter’s models. We will use four different kinds of Holt-Winter’s models. The first of these will be a Holt-Winter’s additive model, and this will be called “fit5”. The next will be a Holt-Winter’s multiplicative model, and this will be called “fit6”. The next one after will be a Holt-Winter’s additive model with a possible damped pattern, and this will be called “fit7”. The last of these models will be a Holt-Winter’s multiplicative model with a possible damped pattern, and this will be called “fit8”.

# Holt-Winter's Additive Model
fit5 <- hw(flight.ts1, h = 12, seasonal = "additive")     

# Holt-Winter's Multiplicative Model
fit6 <- hw(flight.ts1, h = 12, seasonal = "multiplicative")

# Holt-Winter's Additive Model with a Possible Damped Pattern
fit7 <- hw(flight.ts1, h = 12, seasonal = "additive", damped=TRUE)

# Holt-Winter's Multiplicative Model with a Possible Damped Pattern
fit8 <- hw(flight.ts1, h = 12, seasonal = "multiplicative", damped=TRUE)

Now we have successfully created all of our Holt-Winter’s models.

5 Accuracy Measures

Now that we have created various models to try out, including a simple exponential model, Holt models, and Holt-Winter’s models, we will use accuracy measures to compare all of these different models. These accuracy measures will be based upon the training data set that we created from our time series data.

accuracy.table = round(rbind(accuracy(fit1), accuracy(fit2), accuracy(fit3), accuracy(fit4), accuracy(fit5), accuracy(fit6), accuracy(fit7), accuracy(fit8)), 4)
row.names(accuracy.table)=c("SES","Holt Linear","Holt Add. Damped", "Holt Mult. Damped", "HW Add.","HW Mult.","HW Add. Damped", "HW Mult. Damped")

kable(accuracy.table, caption = "Acccuracy Measures of the Various Smoothing Models Based upon the Training Data Set")
Acccuracy Measures of the Various Smoothing Models Based upon the Training Data Set
ME RMSE MAE MPE MAPE MASE ACF1
SES 102859.87 6626499 4795473 -5.1086 13.3855 0.6550 0.0043
Holt Linear -485232.07 6653815 4823863 -6.0574 13.4850 0.6589 -0.0019
Holt Add. Damped 62009.49 6614114 4775792 -5.1840 13.3544 0.6523 0.0042
Holt Mult. Damped 65535.10 6619923 4833567 -4.9977 13.3576 0.6602 -0.0276
HW Add. -240230.44 3914979 1755071 -4.4331 7.6521 0.2397 0.3594
HW Mult. -326797.03 4516990 2044228 -5.4066 8.9673 0.2792 0.3666
HW Add. Damped 70386.74 3885667 1765809 -3.5371 7.6123 0.2412 0.3325
HW Mult. Damped 70002.38 4144943 1769723 -3.9989 8.0597 0.2417 0.3616

To interpret our accuracy table, we have several measures of accuracy we looked at. The ME, or mean errors, represents how far off the predictions are from their actual values. The Holt Add. Damped and Holt Mult. Damped have the smallest absolute value of their ME. Next, we have RMSE, MAE, and MAPE, all of which a lower value is ideal because it indicates the least errors. From these accuracy measurements, the HW Add., and the HW. Add. Damped stand out for having notably lower values than any of the other models do. For the MASE measurement, a value closest to zero is ideal. The HW. Add model has the closest MASE to zero, making this once again seem like a strong model. Lastly, for the ACF1 measurements, a value closest to zero is ideal for accuracy. From our models, the Holt Linear model has an ACF1 that is closest to zero.

Overall, the Holt-Winter’s Additive model stood out for showing ideal performance in the most accuracy measures of all of the models. This Holt-Winter’s Additive model had low RMSE, MAE, and MAPE, as well as a MASE closest to zero. Since the Holt-Winter’s Additive model stood out in more accuracy measurements than any of the other models, we will choose this smoothing model as the ideal model in terms of performance.

6 Serial Plot of the Data

We will create a serial plot of the historical data, which is the data that was used for the training data set. This will allow us to forecast the values and visualize this forecast.

We will create two serial plots, one will be for the non-seasonal smoothing models, which will include the simple exponential smoothing model, the Holt linear, the Holt Additive Damped, and the Holt Multiplicative Damped models. The other serial plot will be for the trend and seasonal smoothing models, which will include the HW Additive, the HW Multiplicative, the HW Additive Damped, and the HW Multiplicative Damped models.

First, we will create the serial plot for the non-seasonal smoothing models.

par(mfrow = c(2,1), mar = c(3,4,3,1))
pred.id = 238:249
plot(1:238, flight.ts1, lwd = 2,type = "o", ylab = "Air Travel Passengers", xlab = "", 
     xlim = c(1, 249), ylim=c(0, 80000000), cex = 0.3,
     main = "Non-seasonal Smoothing Models")
lines(pred.id, fit1$mean, col = "pink")
lines(pred.id, fit2$mean, col = "blue")
lines(pred.id, fit3$mean, col = "purple")
lines(pred.id, fit4$mean, col = "navy")

points(pred.id, fit1$mean, pch = 16, col = "pink", cex = 0.5)
points(pred.id, fit2$mean, pch = 17, col = "blue", cex = 0.5)
points(pred.id, fit3$mean, pch = 19, col = "purple", cex = 0.5)
points(pred.id, fit4$mean, pch = 21, col = "navy", cex = 0.5)

legend("bottomright", lty = 1, col = c("pink","blue","purple","navy"), pch = c(16, 17, 19, 21),
   c("SES","Holt Linear","Holt Additive Damped", "Holt Multiplicative Damped"), 
   cex = 0.5, bty = "n")

We can see from the serial plot that this visualized the forecasts for the non-seasonal smoothing models which we created.

Next, we will create the serial plot for the Holt-Winter’s trend and seasonal smoothing models.

plot(1:238, flight.ts1, lwd = 2, type = "o", ylab = "Air Travel Passengers", xlab = "", 
     xlim=c(1,249), ylim=c(0, 80000000), cex = 0.3,
     main="Holt-Winters Trend and Seasonal Smoothing Models")
lines(pred.id, fit5$mean, col = "pink")
lines(pred.id, fit6$mean, col = "blue")
lines(pred.id, fit7$mean, col = "purple")
lines(pred.id, fit8$mean, col = "navy")

points(pred.id, fit5$mean, pch = 16, col = "pink", cex = 0.5)
points(pred.id, fit6$mean, pch = 17, col = "blue", cex = 0.5)
points(pred.id, fit7$mean, pch = 19, col = "purple", cex = 0.5)
points(pred.id, fit8$mean, pch = 21, col = "navy", cex = 0.5)

legend("bottomright", lty=1, col = c("pink","blue","purple", "navy"), pch = c(16, 17, 19, 21),
   c("HW Additive","HW Multiplicative","HW Additive Damped", "HW Multiplicative Damped"), 
   cex = 0.5, bty = "n")

We can see from this serial plot the forecasts for the trend and seasonal smoothing models.

From the two serial plots we created, it looks like the Holt-Winter’s Additive model once again seems like the best of the models we created. This matches up with what was found in the accuracy table. It looks like the Holt Winter’s Additive Model provides a quality forecast in the serial plot for the Holt-Winter’s trend and seasonal smoothing models.

7 Forecasting with the Testing Data

Now that we have determined which model looks like the ideal choice, the Holt-Winter’s Additive Model, we will use the testing data set which we created earlier to verify our best model selection.

We previously used the training data set to train the various smoothing models we looked at. To create our real-world forecast, we will refit the model using the testing data as well. This will allow us to verify our findings and if the model which looked to show the best performance in forecasting truly looks like the ideal choice.

First, we will refit the model using the entire data set to update the smoothing parameters.

acc.fun = function(test.data, mod.obj){
  PE = 100*(test.data-mod.obj$mean)/mod.obj$mean
  MAPE = mean(abs(PE))
  E = test.data-mod.obj$mean
  MSE = mean(E^2)
  accuracy.metric = c(MSE = MSE, MAPE = MAPE)
  accuracy.metric
}

Now, we will find the updated accuracy prediction measures for each of the smoothing models we created.

pred.accuracy = rbind(SES = acc.fun(test.data = test.flight, mod.obj = fit1),
                      Holt.Add = acc.fun(test.data=test.flight, 
                                         mod.obj = fit2),
                      Holt.Add.Damp = acc.fun(test.data = test.flight,    
                                              mod.obj=fit3),
                      Holt.Exp = acc.fun(test.data = test.flight, 
                                         mod.obj = fit4),
                      HW.Add = acc.fun(test.data = test.flight, 
                                       mod.obj = fit5),
                      HW.Exp = acc.fun(test.data = test.flight, 
                                       mod.obj = fit6),
                      HW.Add.Damp = acc.fun(test.data = test.flight, 
                                            mod.obj = fit7),
                      HW.Exp.Damp = acc.fun(test.data = test.flight, 
                                            mod.obj = fit8))
kable(pred.accuracy, caption = "Accuracy Measures of the Various Exponential Smoothing Models Based on the Testing Data Set")
Accuracy Measures of the Various Exponential Smoothing Models Based on the Testing Data Set
MSE MAPE
SES 4.496124e+13 7.346870
Holt.Add 3.748179e+13 6.273833
Holt.Add.Damp 4.493119e+13 7.343571
Holt.Exp 4.454601e+13 7.246437
HW.Add 3.049452e+13 5.195657
HW.Exp 4.372064e+13 7.284188
HW.Add.Damp 3.366352e+13 5.255700
HW.Exp.Damp 3.954300e+13 5.948060

Looking at the accuracy measures based on the testing data set, we can see that the Holt-Winter’s Additive model has both the lowest MSE and MAPE. This is ideal because we want our model to reduce the errors as much as possible. Also, this matches up with what we found earlier, because once again the Holt-Winter’s Additive model has the best performance in terms of accuracy. So, it looks like the Holt-Winter’s Additive model is the ideal choice, and this has been verified with the testing data set.

8 Final Model

Based upon our findings, it looks like the Holt-Winter’s Additive model is the ideal choice out of the many possible candidate models we created. This model showed its ideal performance by reducing the errors more significantly than any of the other models did. This Holt-Winter’s Additive model also showed the best performance in terms of the accuracy measures, because it had the lowest MSE and MAPE values of out any of the models. So, the Holt-Winter’s Additive model is the ideal model out of all the candidate models we created and tried out.

8.1 Smoothing Parameters

We will take a look at the smoothing parameters for the Holt-Winter’s Additive model.

final.model <- hw(flight.ts1, h = 12, seasonal = "additive")
smoothing.parameter = final.model$model$par[1:3]
kable(smoothing.parameter, caption="The Estimated Values of the Smoothing Parameters in the Holt-Winters Additive Model")
The Estimated Values of the Smoothing Parameters in the Holt-Winters Additive Model
x
alpha 0.9998907
beta 0.0003555
gamma 0.0001081

The smoothing parameters for our final model, the Holt-Winter’s Additive model, include an alpha of 0.9999, a beta of 0.0004, and a gamma of 0.0001.

9 Conclusion

Overall, we looked at a data set which was representative of the total number of air travel passengers in the U.S. through monthly averages. This data was collected over the course of the years from 2003 to 2023. We created a monthly time series which showed the trends and seasonality of the average number of air travel passengers by month over the span of this 20 year period.

We split the time series data up into a training and a testing data set. After looking at several potenetial smoothing models, the one which was found to provide the best performance was the Holt-Winter’s Additive model. This model was the best choice because it did the best job out of decreasing the errors, and provided the best accuracy overall. So, the Holt-Winter’s Additive model was the best choice out of the several candidate, smoothing models we looked at.

9.1 Recommendations

Some recommendations I would make for future projects include:

  • Consider expanding the time series data to include the average air travel passengers for all countries, not just the U.S. in order to see if this still results in the Holt-Winter’s Additive model being the ideal model choice, or if another model seems to provide better performance in this scenario.

  • Further look into the seasonality in this time series data set, since there definitely appeared to be seasonality occurring with more passengers travelling in certain times of the year, like summer, than during other times of the year which saw fewer passengers.

10 References

The data set I used for my time series was found on kaggle.com. Included below is the citation of the web page where I found this data set.

YYXian. (2024, January 15). U.S. airline traffic data (2003-2023). Kaggle. https://www.kaggle.com/datasets/yyxian/u-s-airline-traffic-data?select=air%2Btraffic.csv

---
title: "Monthly Average Air Travel Passengers: Time Series with Exponential Smoothing"
author: "Josie Gallop"
date: "2024-12-11"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("lubridate")) {
   install.packages("lubridate")
   library(lubridate)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
if (!require("forecast")) {
   install.packages("forecast")
   library(forecast)
}
knitr::opts_chunk$set(echo = TRUE,  
                   warning = FALSE,   
                   message = FALSE,  
                   results = TRUE,  
                   comment = NA   
                      )   
```


# Introduction

For this project, we will be creating a Time Series. This time series looks at the average number of air travel passengers on all U.S. flights. The time series we will look at will include the number of passengers on both domestic and international flights. We will create a monthly time series which looks at the average number of air travel passengers by month. This time series data was collected from 2003 to 2023, with the first observation in January of 2003, and the final observation in September 2023.

We will split the time series data up into a training and a testng data set. And then we create several candidate, exponential smoothing models for forecasting. These will include a simple exponential model, several Holt models, and several Holt-Winter's models. We will see which of these candidate models provides the best utility for an exponential smoothing model of our time series. We will determine the best model by seeing which one reduces the errors and provides the best overall accuracy. 


## Data Description

I found this data set on kaggle.com on the following webpage:
https://www.kaggle.com/datasets/yyxian/u-s-airline-traffic-data?select=air+traffic.csv

This data set looks at the average number of air travel passengers by month from 2003 to 2023. The first observation in this data set was collected in January 2003, and the final observation in this data set was collected in September 2023. This data set includes variables which represent the total number of passengers on both domestic and international flights by month. However, for this time series, we will look specifically at the variable which represents the total number of passengers on all flights, both domestic and international. This data is representative of passengers on all U.S. flights by month. 

We will read in the data set from Github and we will call it "flight". 

```{r}
flight <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/air%20traffic.csv")
```

This data set contains many variables regarding air travel information, but for this time series we are just interested in looking at the number of the total air travel passengers. This is represented by the variable Pax. We want to create a time series of the total air travel passengers by the date, so we will only keep the time specific variables, Year and Month, and the Pax variable. We will call the new data set with only these specific variables we are interested in "flight1".

```{r}
flight1 <- flight %>%
  select(Year, Month, Pax)
```

Now, we have a new data set, flight1, with only the variables we are interested in for our time series project.


## Variables

There are three variables from the original data set which we will be interested in for this time series project.

* Year: The year of which the observation is from. This data set spans from 2003 to 2023. 

* Month: The month of which the observation is from. Given as a two digit number representing the month. For example, 01 represents January, 02 represents February, and so on.

* Pax: The total number of air travel passengers for a given observations. This includes air passengers on both domestic and international flights. 

During our exploratory data analysis, we will create a new variable called Date which will combine the year and the month variables to get just one, single variable to represent the date of which an observation occurred. The date variable we will create will be given in the format of Year-Month. 




# Exploratory Data Analysis

First, let's ensure that there are not any missing variables in this data set. 
```{r}
colSums(is.na(flight1))
```

As we can see, there are zero missing values in our data set, so that means we do not have to worry about filling in any missing observations. We can proceed with our time series project.  

Next, we will combine the Year and Month variables into one variable called Date which is given as year and month combined, in the format of Year-Month. We will create a new data set with the combined Date variable and we will call this new data set "flight2". 

```{r}
flight2 <- flight1 %>%
  mutate(Date = paste(Year, sprintf("%02d", Month), sep = "-"))
flight2 <- select(flight2, -1,-2)
```

Now, we have a data set "flight2" with just the Date, given in a Year-Month format, and the Pax variable representing the total number of air travel passengers. 

One thing I noticed is that the Pax variable is incorrectly given as a character variable when it should really be a numeric variable since it represents the total number of air travel passengers. We will change the variable type of the Pax variable from a character to a numeric variable in order to fix this issue. We will remove the commas from the Pax variable before converting it to a numeric variable to avoid any problems with the conversion of the variable type.

```{r}
flight2$Pax <- gsub(",", "", flight2$Pax) 
flight2$Pax <- as.numeric(flight2$Pax)
```

Now we have the proper data set we can use to create the time series. This data set provides monthly averages for the number of air travel passengers for all domestic and international flights. 


## Time Series Plot

Now, we will create a time series object from our flight data and plot it to help visualize any potential trends that may be occurring within this time series data set. We will call this time series object "flight.ts". Since we are creating a monthly time series looking at the average number of air travel passengers by month, our frequency will be 12. 

```{r}
flight.ts <- ts(flight2$Pax, start = c(2003, 1), frequency = 12)
plot(flight.ts, main = "Monthly Average Air Travel Passengers from 2003 to 2023", xlab = "Year", ylab = "Air Travel Passengers")
```

Looking at the time series plot, we can see some notable occurrences within our data. There is certainly some sort of seasonality going on within our time series which can be seen by how the plot has a pattern which repeats after a certain amount of time. It looks like this time series might have an annual seasonality, with this repetition occurring after a year has gone by. This would make sense that the number of air travel passengers would have an annual seasonality, because more people likely travel during the summer months due to the ideal weather and for families this would mean their children would be on summer break which would make having a vacation during the summer an optimal timing. It would make sense that the number of air travel passengers would peak during the summer months, and then drop back down until there is another slight rise around the holiday months in the winter time due to people travelling to spend the holidays with their family. Overall, an annual seasonality would make sense for a time series on the number of air travel passengers. 

Another notable occurrence within this time series is that there appears to be a trend of slight overall increase in the number of air travel passengers from the start of the time series up until the end. There is an especially noticeable trend of overall increase from around 2010 up until 2019. We can see that there is an incredibly massive decrease in the number of air travel passengers around 2020. This can be attributed to the travel restrictions which were put in place during the early days of the covid pandemic. This would explain why there is such a drastic decline in the number of air travel passengers seen in the year 2020. After this decline in 2020, we can see that the number of air travel passengers begins to gradually increase over the following years, and it appears that the numbers at the end of the time series in 2023 are fairly similar to what they were right before the major decline seen in 2020. 




# Data Split

We will split our time series data into two data sets, one for training, and one for testing. Since we have a monthly time series, we will use the 12 most recent observations for the testing data set. The rest of the observations will be used for the training data set. We will also create a time series object for just the training data set. We will call this time series object "flight.ts1".

```{r}
test.flight <- flight2$Pax[238:249]
train.flight <- flight2$Pax[1:238]
flight.ts1 <- ts(flight2$Pax[1:238], start = c(2003, 1), frequency = 12)
```

Now we have split our time series data set up into a training and a testing data set. We will use the training data set to train the smoothing models we will create. We will use the testing data set to identify which of these smoothing models we will create is the best one to use. So, we will next begin with creating these smoothing models and we will use the training and testing data sets to help us with building each of these models and then determining which model is the best one to use. 




# Smoothing Models

We will now begin with building the smoothing models so we can ultimately determine which one does the best and would be the best to use. We will create three different types of smoothing models. These different models will include a simple exponential smoothing model, Holt models (additive and with a possible damped pattern), and Holt-Winter's models (additive and multiplicative seasonality with a possible damped pattern).


## Simple Exponential Smoothing Model

First, we will begin with creating the simple exponential smoothing model. We will call this simple exponential smoothing model "fit1".

```{r}
# Simple Exponential Smoothing Model
fit1 <- ses(flight.ts1, h = 12)
```

Now we have created our simple exponential smoothing model.


## Holt Models

Next, we will create the Holt models. We will use three different kinds of Holt models. The first of our Holt models will be a linear Holt model, and we will call this "fit2". Our next Holt model will be an additive Holt model with a possible damped pattern, and this will be called "fit3". Our last Holt model will be an exponential Holt model with a possible damped pattern, and this will be called "fit4". 

```{r}
# Holt Linear Model
fit2 <- holt(flight.ts1, initial = "optimal", h = 12)       

# Holt Additive Model with a Possible Damped Pattern 
fit3 <- holt(flight.ts1, damped = TRUE, h = 12)                  

# Holt Exponential Model with a Possible Damped Pattern
fit4 <- holt(flight.ts1, exponential = TRUE, damped = TRUE, h = 12) 
```

Now we have successfully created our Holt models. 


## Holt-Winter's Models

Lastly, we will create our Holt Winter's models. We will use four different kinds of Holt-Winter's models. The first of these will be a Holt-Winter's additive model, and this will be called "fit5". The next will be a Holt-Winter's multiplicative model, and this will be called "fit6". The next one after will be a Holt-Winter's additive model with a possible damped pattern, and this will be called "fit7". The last of these models will be a Holt-Winter's multiplicative model with a possible damped pattern, and this will be called "fit8".

```{r}
# Holt-Winter's Additive Model
fit5 <- hw(flight.ts1, h = 12, seasonal = "additive")     

# Holt-Winter's Multiplicative Model
fit6 <- hw(flight.ts1, h = 12, seasonal = "multiplicative")

# Holt-Winter's Additive Model with a Possible Damped Pattern
fit7 <- hw(flight.ts1, h = 12, seasonal = "additive", damped=TRUE)

# Holt-Winter's Multiplicative Model with a Possible Damped Pattern
fit8 <- hw(flight.ts1, h = 12, seasonal = "multiplicative", damped=TRUE)
```

Now we have successfully created all of our Holt-Winter's models. 




# Accuracy Measures

Now that we have created various models to try out, including a simple exponential model, Holt models, and Holt-Winter's models, we will use accuracy measures to compare all of these different models. These accuracy measures will be based upon the training data set that we created from our time series data. 

```{r}
accuracy.table = round(rbind(accuracy(fit1), accuracy(fit2), accuracy(fit3), accuracy(fit4), accuracy(fit5), accuracy(fit6), accuracy(fit7), accuracy(fit8)), 4)
row.names(accuracy.table)=c("SES","Holt Linear","Holt Add. Damped", "Holt Mult. Damped", "HW Add.","HW Mult.","HW Add. Damped", "HW Mult. Damped")

kable(accuracy.table, caption = "Acccuracy Measures of the Various Smoothing Models Based upon the Training Data Set")
```

To interpret our accuracy table, we have several measures of accuracy we looked at. The ME, or mean errors, represents how far off the predictions are from their actual values. The Holt Add. Damped and Holt Mult. Damped have the smallest absolute value of their ME. Next, we have RMSE, MAE, and MAPE, all of which a lower value is ideal because it indicates the least errors. From these accuracy measurements, the HW Add., and the HW. Add. Damped stand out for having notably lower values than any of the other models do. For the MASE measurement, a value closest to zero is ideal. The HW. Add model has the closest MASE to zero, making this once again seem like a strong model. Lastly, for the ACF1 measurements, a value closest to zero is ideal for accuracy. From our models, the Holt Linear model has an ACF1 that is closest to zero.

Overall, the Holt-Winter's Additive model stood out for showing ideal performance in the most accuracy measures of all of the models. This Holt-Winter's Additive model had low RMSE, MAE, and MAPE, as well as a MASE closest to zero. Since the Holt-Winter's Additive model stood out in more accuracy measurements than any of the other models, we will choose this smoothing model as the ideal model in terms of performance. 




# Serial Plot of the Data

We will create a serial plot of the historical data, which is the data that was used for the training data set. This will allow us to forecast the values and visualize this forecast.

We will create two serial plots, one will be for the non-seasonal smoothing models, which will include the simple exponential smoothing model, the Holt linear, the Holt Additive Damped, and the Holt Multiplicative Damped models. The other serial plot will be for the trend and seasonal smoothing models, which will include the HW Additive, the HW Multiplicative, the HW Additive Damped, and the HW Multiplicative Damped models. 

First, we will create the serial plot for the non-seasonal smoothing models. 

```{r}
par(mfrow = c(2,1), mar = c(3,4,3,1))
pred.id = 238:249
plot(1:238, flight.ts1, lwd = 2,type = "o", ylab = "Air Travel Passengers", xlab = "", 
     xlim = c(1, 249), ylim=c(0, 80000000), cex = 0.3,
     main = "Non-seasonal Smoothing Models")
lines(pred.id, fit1$mean, col = "pink")
lines(pred.id, fit2$mean, col = "blue")
lines(pred.id, fit3$mean, col = "purple")
lines(pred.id, fit4$mean, col = "navy")

points(pred.id, fit1$mean, pch = 16, col = "pink", cex = 0.5)
points(pred.id, fit2$mean, pch = 17, col = "blue", cex = 0.5)
points(pred.id, fit3$mean, pch = 19, col = "purple", cex = 0.5)
points(pred.id, fit4$mean, pch = 21, col = "navy", cex = 0.5)

legend("bottomright", lty = 1, col = c("pink","blue","purple","navy"), pch = c(16, 17, 19, 21),
   c("SES","Holt Linear","Holt Additive Damped", "Holt Multiplicative Damped"), 
   cex = 0.5, bty = "n")
```

We can see from the serial plot that this visualized the forecasts for the non-seasonal smoothing models which we created.

Next, we will create the serial plot for the Holt-Winter's trend and seasonal smoothing models.

```{r}
plot(1:238, flight.ts1, lwd = 2, type = "o", ylab = "Air Travel Passengers", xlab = "", 
     xlim=c(1,249), ylim=c(0, 80000000), cex = 0.3,
     main="Holt-Winters Trend and Seasonal Smoothing Models")
lines(pred.id, fit5$mean, col = "pink")
lines(pred.id, fit6$mean, col = "blue")
lines(pred.id, fit7$mean, col = "purple")
lines(pred.id, fit8$mean, col = "navy")

points(pred.id, fit5$mean, pch = 16, col = "pink", cex = 0.5)
points(pred.id, fit6$mean, pch = 17, col = "blue", cex = 0.5)
points(pred.id, fit7$mean, pch = 19, col = "purple", cex = 0.5)
points(pred.id, fit8$mean, pch = 21, col = "navy", cex = 0.5)

legend("bottomright", lty=1, col = c("pink","blue","purple", "navy"), pch = c(16, 17, 19, 21),
   c("HW Additive","HW Multiplicative","HW Additive Damped", "HW Multiplicative Damped"), 
   cex = 0.5, bty = "n")
```

We can see from this serial plot the forecasts for the trend and seasonal smoothing models.

From the two serial plots we created, it looks like the Holt-Winter's Additive model once again seems like the best of the models we created. This matches up with what was found in the accuracy table. It looks like the Holt Winter's Additive Model provides a quality forecast in the serial plot for the Holt-Winter's trend and seasonal smoothing models.




# Forecasting with the Testing Data

Now that we have determined which model looks like the ideal choice, the Holt-Winter's Additive Model, we will use the testing data set which we created earlier to verify our best model selection.

We previously used the training data set to train the various smoothing models we looked at. To create our real-world forecast, we will refit the model using the testing data as well. This will allow us to verify our findings and if the model which looked to show the best performance in forecasting truly looks like the ideal choice.

First, we will refit the model using the entire data set to update the smoothing parameters.

```{r}
acc.fun = function(test.data, mod.obj){
  PE = 100*(test.data-mod.obj$mean)/mod.obj$mean
  MAPE = mean(abs(PE))
  E = test.data-mod.obj$mean
  MSE = mean(E^2)
  accuracy.metric = c(MSE = MSE, MAPE = MAPE)
  accuracy.metric
}
```

Now, we will find the updated accuracy prediction measures for each of the smoothing models we created. 

```{r}
pred.accuracy = rbind(SES = acc.fun(test.data = test.flight, mod.obj = fit1),
                      Holt.Add = acc.fun(test.data=test.flight, 
                                         mod.obj = fit2),
                      Holt.Add.Damp = acc.fun(test.data = test.flight,    
                                              mod.obj=fit3),
                      Holt.Exp = acc.fun(test.data = test.flight, 
                                         mod.obj = fit4),
                      HW.Add = acc.fun(test.data = test.flight, 
                                       mod.obj = fit5),
                      HW.Exp = acc.fun(test.data = test.flight, 
                                       mod.obj = fit6),
                      HW.Add.Damp = acc.fun(test.data = test.flight, 
                                            mod.obj = fit7),
                      HW.Exp.Damp = acc.fun(test.data = test.flight, 
                                            mod.obj = fit8))
kable(pred.accuracy, caption = "Accuracy Measures of the Various Exponential Smoothing Models Based on the Testing Data Set")
```

Looking at the accuracy measures based on the testing data set, we can see that the Holt-Winter's Additive model has both the lowest MSE and MAPE. This is ideal because we want our model to reduce the errors as much as possible. Also, this matches up with what we found earlier, because once again the Holt-Winter's Additive model has the best performance in terms of accuracy. So, it looks like the Holt-Winter's Additive model is the ideal choice, and this has been verified with the testing data set. 




# Final Model

Based upon our findings, it looks like the Holt-Winter's Additive model is the ideal choice out of the many possible candidate models we created. This model showed its ideal performance by reducing the errors more significantly than any of the other models did. This Holt-Winter's Additive model also showed the best performance in terms of the accuracy measures, because it had the lowest MSE and MAPE values of out any of the models. So, the Holt-Winter's Additive model is the ideal model out of all the candidate models we created and tried out.


## Smoothing Parameters

We will take a look at the smoothing parameters for the Holt-Winter's Additive model. 

```{r}
final.model <- hw(flight.ts1, h = 12, seasonal = "additive")
smoothing.parameter = final.model$model$par[1:3]
kable(smoothing.parameter, caption="The Estimated Values of the Smoothing Parameters in the Holt-Winters Additive Model")
```

The smoothing parameters for our final model, the Holt-Winter's Additive model, include an alpha of 0.9999, a beta of 0.0004, and a gamma of 0.0001. 
  



# Conclusion

Overall, we looked at a data set which was representative of the total number of air travel passengers in the U.S. through monthly averages. This data was collected over the course of the years from 2003 to 2023. We created a monthly time series which showed the trends and seasonality of the average number of air travel passengers by month over the span of this 20 year period. 

We split the time series data up into a training and a testing data set. After looking at several potenetial smoothing models, the one which was found to provide the best performance was the Holt-Winter's Additive model. This model was the best choice because it did the best job out of decreasing the errors, and provided the best accuracy overall. So, the Holt-Winter's Additive model was the best choice out of the several candidate, smoothing models we looked at. 


## Recommendations

Some recommendations I would make for future projects include:

* Consider expanding the time series data to include the average air travel passengers for all countries, not just the U.S. in order to see if this still results in the Holt-Winter's Additive model being the ideal model choice, or if another model seems to provide better performance in this scenario.

* Further look into the seasonality in this time series data set, since there definitely appeared to be seasonality occurring with more passengers travelling in certain times of the year, like summer, than during other times of the year which saw fewer passengers. 




# References

The data set I used for my time series was found on kaggle.com. Included below is the citation of the web page where I found this data set. 

YYXian. (2024, January 15). U.S. airline traffic data (2003-2023). Kaggle. https://www.kaggle.com/datasets/yyxian/u-s-airline-traffic-data?select=air%2Btraffic.csv 
