Forecasting yearly and monthly passenger numbers is vital for an airline company. A time series analysis of the airline's monthly passengers from 1949 to 1960 showed a strong, positive, roughly linear trend together with a repeating yearly cycle. Because the series was not stationary, we log-transformed it and took first differences before modeling. The log transform was chosen because the growth appeared exponential, and it stabilized the increasing variance. Differencing removed the trend, and a seasonal term with a period of 12 months accounted for the yearly cycle. The data was modeled with an Autoregressive Integrated Moving Average (ARIMA) model. The ADF test indicated that the transformed series was stationary, and the ACF and PACF plots guided the choice of the model orders, which were AR=0, d=1, and MA=1. The model predicted passengers out to 1970 with a 95% prediction interval. As recent events have shown, the state of the world significantly affects corporations, and airline passenger numbers are affected by global pandemics and world conflicts. The airline must adapt its strategy to keep up with global trends. We recommend comparing the actual number of passengers with the predicted values and refitting the model every year. We also recommend that the company respond proactively to world events and adjust promotions and offerings to maintain passenger growth. Finally, we recommend studying the effect of airline marketing on passenger numbers. That analysis would show how effective the advertising is and could support more targeted advertising to increase the number of passengers.
The airline industry is a competitive market. Airline companies stay competitive by offering deals on flights, minimizing layovers, providing amenities to customers, or offering a frequent flyer program. Airlines offer these perks in the hope of gaining a larger share of the market and beating their competitors. Each airline brands itself either as cost effective or as a premium flying experience. An established airline wants to predict how many passengers will use the airline in the future. The objective of this analysis is to predict the year-to-year growth in passengers and the seasonal trends, so that decision makers can plan for both.
The key measure of this analysis is the number of passengers we expect the company to carry each month from 1961 to 1970. The data used for this analysis is the AirPassengers dataset that ships with R, which covers passenger use from 1949 to 1960. The time series records the number of passengers that flew on the airline each month, measured in thousands of passengers.
We use a time series model for this analysis. The first step is to determine whether the time series is stationary. A time series is stationary if its mean and variance are constant over time. We test for stationarity with the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the time series is non-stationary. We then predict future passenger use with an Autoregressive Integrated Moving Average (ARIMA) model, which extends the Autoregressive Moving Average (ARMA) model with a differencing step. We go into further detail on the data transformation and the model mechanics later in this analysis. All statistical tests are carried out at the 0.05 significance level.
Before we create the model, we explore the data by plotting the time series and calculating the correlation coefficient between the number of passengers and time. The correlation coefficient tells us how strong the linear trend is, and the plot helps us judge whether that trend is visible in the data.
We plot the time series in the first plot below. The blue line shows the monthly passengers, measured in thousands, as a function of time from 1949 to 1960. The orange line shows the average for each year. As you can see, there is a strong, positive linear relationship. A correlation coefficient of 0.9239 confirms the positive linear trend between monthly passengers and time.
library(tidyverse)
library(ggfortify)
library(aTSA)
library(forecast)
sum(is.na(AirPassengers))
## [1] 0
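# Yearly average of the monthly counts
dt <- aggregate(AirPassengers, FUN = mean)
# Convert both series to data frames with Index and Data columns
df <- fortify(AirPassengers)
df2 <- fortify(dt)
# Anchor each yearly average at mid-year so the date does not depend on the day the code is run
df2$Index <- as.Date(paste0(as.character(df2$Index), "-07-01"))
colors <- c("Monthly Passengers" = "#56B4E9", "Yearly Average" = "#E69F00")
ggplot(mapping = aes(x = Index, y = Data)) +
  geom_line(data = df, aes(color = "Monthly Passengers")) +
  geom_line(data = df2, aes(color = "Yearly Average")) +
  labs(title = "Monthly Passengers from 1949 to 1960",
       x = "Time (Months)",
       y = "Number of Passengers",
       color = "Legend") +
  scale_color_manual(values = colors)
# Correlation between the passenger count and the month index (1 to 144)
ind <- seq_along(AirPassengers)
cor(ind, df$Data)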
## [1] 0.9239254
We now draw a boxplot of the passenger counts for each calendar month. Each box summarizes the twelve yearly observations for that month. As you can see, there is a seasonal pattern in the number of passengers. The black line within each box is the median value for that month. The number of passengers slowly increases from January to July, then decreases until November, and the year ends with a rise in passengers. The summer peak is explained by more customers flying while children are off from school. The rise in December is explained by the holidays at the end of the month.
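The seasonal pattern described above can also be read off numerically. The following is a minimal sketch using only base R on the built-in AirPassengers series; the names `month_id`, `monthly_median`, and `peak_month` are ours and do not appear elsewhere in the analysis.

```r
# Median passenger count (in thousands) for each calendar month,
# pooled across all years of the built-in AirPassengers series.
month_id <- as.integer(cycle(AirPassengers))   # 1 = Jan, ..., 12 = Dec
monthly_median <- as.vector(tapply(as.numeric(AirPassengers), month_id, median))
names(monthly_median) <- month.abb

print(round(monthly_median, 1))

# Calendar month with the highest median passenger count
peak_month <- names(monthly_median)[which.max(monthly_median)]
print(peak_month)
```

These medians correspond to the black lines inside the boxes of the boxplot, so the table and the plot should tell the same story.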
boxplot(AirPassengers~cycle(AirPassengers),
xlab="Month",ylab="Number of Passengers",main="Boxplot for each Month",
col="#0072B2",
names=c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"))
From the first plot, we see that the yearly average number of passengers increases steadily and that there is a seasonal cycle within each year. The variance also increases from year to year. This implies our data is not stationary and must be transformed, since the model assumes the series is stationary once it has been differenced. Common choices are a square root, a logarithm, or another variance-stabilizing transformation. We choose the logarithm because the growth in the data looks exponential, and we then take the first difference of the logged series. The difference is calculated by subtracting the previous value from the current value. The log transform stabilizes the variance, and differencing stabilizes the mean by removing the trend. The remaining yearly seasonality is handled by the seasonal term of the model fitted later. We perform the ADF test on the logged, differenced time series to see whether the data is now stationary. A p-value below 0.05 means we reject the null hypothesis and conclude the data is stationary.
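Before running the test, the effect of each transformation step can be checked numerically. The sketch below is an illustration under our own naming (`spread_raw`, `spread_log`, `d_manual`, and `d_builtin` are not part of the original analysis): it compares how much the within-year standard deviation varies before and after the log transform, and confirms that `diff()` computes the current value minus the previous value.

```r
x <- AirPassengers

# Year label for every observation (the series covers whole calendar years)
year <- rep(start(x)[1]:end(x)[1], each = frequency(x))

# Within-year standard deviation on the raw and on the log scale
sd_raw <- tapply(as.numeric(x), year, sd)
sd_log <- tapply(as.numeric(log(x)), year, sd)

# Ratio of the largest to the smallest yearly standard deviation;
# a ratio closer to 1 means a more stable variance
spread_raw <- max(sd_raw) / min(sd_raw)
spread_log <- max(sd_log) / min(sd_log)
print(round(c(raw = spread_raw, log = spread_log), 2))

# First difference by hand: y_t - y_(t-1)
y <- as.numeric(log(x))
d_manual <- y[-1] - y[-length(y)]
d_builtin <- as.numeric(diff(log(x)))
print(all.equal(d_manual, d_builtin))
```

A smaller spread on the log scale is what we would expect if the logarithm is doing its job of stabilizing the variance.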
adf.test(diff(log(AirPassengers)))
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -9.61 0.01
## [2,] 1 -8.82 0.01
## [3,] 2 -7.63 0.01
## [4,] 3 -8.75 0.01
## [5,] 4 -6.79 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -9.63 0.01
## [2,] 1 -8.86 0.01
## [3,] 2 -7.71 0.01
## [4,] 3 -8.94 0.01
## [5,] 4 -6.98 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -9.60 0.01
## [2,] 1 -8.83 0.01
## [3,] 2 -7.69 0.01
## [4,] 3 -8.92 0.01
## [5,] 4 -6.95 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
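To make the test statistic less of a black box, here is a minimal sketch of the regression behind a Dickey-Fuller test with drift and no lagged difference terms, written in base R. It is an illustration rather than a reimplementation of `adf.test`; the names `dy`, `y_lag`, `df_fit`, and `df_stat` are ours. The statistic is the t value on the lagged level, and it is compared with Dickey-Fuller critical values (roughly -2.89 at the 5% level for a sample of about 100 observations) rather than with the usual t distribution.

```r
# Logged, differenced series used in the ADF test above
y <- as.numeric(diff(log(AirPassengers)))
n <- length(y)

# Dickey-Fuller regression with drift: (y_t - y_(t-1)) = a + g * y_(t-1) + e_t
dy <- diff(y)      # y_t - y_(t-1) for t = 2, ..., n
y_lag <- y[-n]     # y_(t-1)
df_fit <- lm(dy ~ y_lag)

# The test statistic is the t value of the coefficient on y_(t-1)
df_stat <- summary(df_fit)$coefficients["y_lag", "t value"]
print(round(df_stat, 2))
```

A strongly negative statistic, well below the critical value, is evidence against the null hypothesis of non-stationarity. We would expect it to land near the lag-0 value in the "with drift no trend" block of the output above.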
Since one difference was required to make the data stationary, the differencing order (d) of the model is 1. We now determine the autoregressive (AR) and moving average (MA) orders. We start with the autocorrelation function (ACF) plot. The blue dashed lines on the plot mark the significance bounds: spikes that extend beyond them are significantly different from zero. As seen below, the autocorrelation drops off sharply after the first lag instead of decaying gradually. A gradual decay would point to an autoregressive component, so we set the AR order to 0.
acf(diff(log(AirPassengers)))
We now look at the partial autocorrelation function (PACF). The PACF plot shows notable spikes at the first and second lags. By the usual convention the PACF guides the AR order and the ACF guides the MA order, so these plots are only a rough guide here. We take them to suggest a moving average order of 1 or 2 and continue with 1, since the spike at lag 1 is slightly larger in magnitude than the spike at lag 2.
pacf(diff(log(AirPassengers)))
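Because reading orders off ACF and PACF plots involves judgment, one way to cross-check the choice is to fit the candidate models and compare their AIC values. The sketch below is not part of the original analysis: the seasonal specification is copied from the `arima()` call used for the final model, and the set of candidates and the names `fit_ma1`, `fit_ma2`, `fit_ar1`, and `aic_table` are our own assumptions.

```r
y <- log(AirPassengers)

# Seasonal part taken from the final model: (0, 1, 1) with a period of 12 months
seasonal_part <- list(order = c(0, 1, 1), period = 12)

# Candidate non-seasonal orders (p, d, q), all with one difference
fit_ma1 <- arima(y, order = c(0, 1, 1), seasonal = seasonal_part)
fit_ma2 <- arima(y, order = c(0, 1, 2), seasonal = seasonal_part)
fit_ar1 <- arima(y, order = c(1, 1, 0), seasonal = seasonal_part)

# Lower AIC indicates a better trade-off between fit and complexity
aic_table <- c(ma1 = AIC(fit_ma1), ma2 = AIC(fit_ma2), ar1 = AIC(fit_ar1))
print(round(aic_table, 1))
```

If the AIC values are close, the simpler model is usually preferred.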
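Using the AR, differencing, and MA orders from the previous plots and transformations, we now create the model to predict the number of passengers. The non-seasonal orders are AR=0, d=1, and MA=1. To capture the yearly cycle, the model also includes a seasonal component with the same orders and a period of 12 months, which makes it a seasonal ARIMA(0,1,1)(0,1,1) model fitted to the logged series. We use this model to predict passenger numbers from 1961 to 1970: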
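# Seasonal ARIMA(0,1,1)(0,1,1) with period 12, fitted to the log-transformed series
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
# Point forecasts for 1961 to 1970; exp() is the exact inverse of log()
pred <- predict(fit, n.ahead = 10 * 12)
pred <- exp(pred$pred)
# 95% interval from the forecast package, back-transformed to passengers
pm <- forecast(fit, h = 10 * 12, level = 95)
pml <- exp(pm$lower)
pmu <- exp(pm$upper)
ts.plot(AirPassengers, pred, pml, pmu,
        col = c("#009E73", "#E69F00", "#0072B2", "#0072B2"),
        log = "y", lty = c(1, 3, 1, 1),
        xlab = "Year", ylab = "Number of Passengers",
        main = "Predicted Passengers from 1961 to 1970")
# Legend line types match the plotted series (solid, dotted, solid)
legend("topleft", bty = "n", lty = c(1, 3, 1),
       col = c("#009E73", "#E69F00", "#0072B2"),
       legend = c("Observed Passengers", "Predicted Passengers",
                  "Upper and Lower 95% Interval"))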
The plot above shows the prediction for the next ten years in orange. The blue lines are the 95% prediction interval for the forecasts. As you can see, the predictions follow the seasonal and yearly growth patterns. The interval also follows the seasonality, but it widens as time continues because uncertainty accumulates the further the prediction is from the observed data. The portion in green is the number of observed passengers the company flew from 1949 to 1960.
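The widening of the interval can be quantified directly from the forecast standard errors. The sketch below refits the same model with base R so that it is self-contained and builds the interval by hand under a normal approximation; the names `fc`, `z`, `rel_width`, and `horizons` are ours, and the hand-built interval is an assumption that may differ slightly from the one returned by the forecast package.

```r
# Same model as above, refitted so this block runs on its own
fit <- arima(log(AirPassengers), order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
fc <- predict(fit, n.ahead = 10 * 12)

# Approximate 95% interval on the log scale, back-transformed with exp()
z <- qnorm(0.975)
lower <- exp(fc$pred - z * fc$se)
upper <- exp(fc$pred + z * fc$se)

# Ratio of upper to lower bound: how many times wider the top of the band is
rel_width <- as.numeric(upper / lower)

# Compare a few forecast horizons (in months ahead)
horizons <- c(1, 12, 60, 120)
print(data.frame(months_ahead = horizons,
                 se_log = round(as.numeric(fc$se)[horizons], 3),
                 upper_over_lower = round(rel_width[horizons], 2)))
```

The standard error on the log scale grows with the horizon, so the band on the passenger scale grows multiplicatively, which is why the interval fans out in the plot.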
Forecasting yearly and monthly passenger numbers is vital for an airline company. The time series analysis of the airline's passengers from 1949 to 1960 showed a positive, linear trend. Because the data was not stationary, the predictions were made after logarithmically transforming the series and taking first differences. The logarithm was chosen because the growth appeared exponential, and it stabilized the increasing variance. Differencing removed the trend, and a seasonal term with a period of 12 months accounted for the yearly cycle. The data was modeled with an Autoregressive Integrated Moving Average (ARIMA) model. The ADF test indicated that the transformed data was stationary, and the ACF and PACF plots aided us in choosing the autoregressive and moving average orders. The values for the model were AR=0, d=1, and MA=1. The model predicted passengers out to 1970 and yielded a 95% prediction interval.
As recent events have shown, the state of the world significantly affects corporations. The number of passengers using the airline is affected by global pandemics and world conflicts. The airline company must adapt its strategy to keep up with global trends. We recommend comparing the actual number of passengers with the predicted values and refitting the model every year. We also recommend that the company respond proactively to world events and adjust promotions and offerings to maintain passenger growth. Finally, we recommend studying the effect that airline marketing has on passenger numbers. This analysis would give insight into the effectiveness of the advertising. The company could then create more targeted advertising and increase the number of passengers.
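The recommendation to compare actual and predicted values each year can be sketched as a holdout check. The example below is an illustration under our own assumptions, not part of the original analysis: it treats 1960 as if it were unseen, fits the same model on 1949 to 1959, and measures the mean absolute percentage error (MAPE) of the 1960 forecast. The names `train`, `test`, `fit_train`, `pred_1960`, and `mape` are ours.

```r
# Hold out the final year and fit on everything before it
train <- window(AirPassengers, end = c(1959, 12))
test <- window(AirPassengers, start = c(1960, 1))

fit_train <- arima(log(train), order = c(0, 1, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the 12 held-out months and back-transform to thousands of passengers
pred_1960 <- as.numeric(exp(predict(fit_train, n.ahead = 12)$pred))
actual_1960 <- as.numeric(test)

# Mean absolute percentage error of the forecast
mape <- mean(abs((actual_1960 - pred_1960) / actual_1960)) * 100

print(data.frame(month = month.abb,
                 actual = actual_1960,
                 predicted = round(pred_1960, 1)))
print(round(mape, 2))
```

Repeating this check as each new year of data arrives, and refitting the model afterwards, gives a running measure of how far the forecasts drift from reality.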
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forecast_8.16 aTSA_3.1.2 ggfortify_0.4.14 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.8 purrr_0.3.4 readr_2.1.2
## [9] tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] tseries_0.10-49 httr_1.4.2 sass_0.4.0 jsonlite_1.8.0
## [5] modelr_0.1.8 bslib_0.3.1 assertthat_0.2.1 TTR_0.24.3
## [9] highr_0.9 cellranger_1.1.0 yaml_2.3.5 pillar_1.7.0
## [13] backports_1.4.1 lattice_0.20-45 glue_1.6.2 quadprog_1.5-8
## [17] digest_0.6.29 rvest_1.0.2 colorspace_2.0-3 htmltools_0.5.2
## [21] timeDate_3043.102 pkgconfig_2.0.3 broom_0.7.12 haven_2.4.3
## [25] scales_1.1.1 tzdb_0.2.0 farver_2.1.0 generics_0.1.2
## [29] ellipsis_0.3.2 withr_2.5.0 urca_1.3-0 nnet_7.3-17
## [33] cli_3.2.0 quantmod_0.4.18 magrittr_2.0.2 crayon_1.5.0
## [37] readxl_1.3.1 evaluate_0.15 fs_1.5.2 fansi_1.0.2
## [41] nlme_3.1-155 xts_0.12.1 xml2_1.3.3 tools_4.1.1
## [45] hms_1.1.1 lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.1
## [49] compiler_4.1.1 jquerylib_0.1.4 rlang_1.0.2 grid_4.1.1
## [53] rstudioapi_0.13 labeling_0.4.2 rmarkdown_2.12 gtable_0.3.0
## [57] fracdiff_1.5-1 DBI_1.1.2 curl_4.3.2 R6_2.5.1
## [61] gridExtra_2.3 zoo_1.8-9 lubridate_1.8.0 knitr_1.37
## [65] fastmap_1.1.0 utf8_1.2.2 stringi_1.7.6 parallel_4.1.1
## [69] Rcpp_1.0.8 vctrs_0.3.8 dbplyr_2.1.1 tidyselect_1.1.2
## [73] xfun_0.30 lmtest_0.9-39