The purpose of this project is to determine the impact of COVID-19 to U.S. Airline industries. I use the time series to forecast the predicted total revenue from 2020, and the model is based on the total operating revenue before the pandemic started. Therefore, it is helpful to find out the impact of COVID-19 to the U.S Airline industry by taking the difference between the forecasted predicted operating revenue and the actual operating revenue that is releasing on June 15th. The value in the data is in thousands of dollars.
The data is from https://www.transtats.bts.gov/Data_Elements_Financial.aspx?Data=7, the Bureau of Transportation Airline Financial Data. The data is from 2000 to 2019, quaterly recorded, and I chose the total operating revenue of overall U.S. carriers. The data is structured as character, so it has to be converted into numeric value without any “,”.
# import data
library(forecast)
## Warning: package 'forecast' was built under R version 3.6.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
airline <- read.csv("airline.csv",header = F,stringsAsFactors=FALSE)
names(airline)<-c("year","quater","domestic","latin_america","atlantic","pacific","International","total")
# featuring
airline<-airline[3:102,]
airline<-airline[,c(1,2,8)]
airline<-subset(airline, airline$quater!="TOTAL")
str(airline)
## 'data.frame': 80 obs. of 3 variables:
## $ year : chr "2000" "2000" "2000" "2000" ...
## $ quater: chr "1" "2" "3" "4" ...
## $ total : chr "30,202,987" "33,441,689" "34,058,686" "32,544,858" ...
airline<-as.numeric(gsub(",", "", airline$total))
# summary
Airline<-ts(airline, start=c(2000,1), end=c(2019,4), frequency=4)
summary(Airline)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23889187 34711078 44768324 44175232 51470877 64449431
start(Airline)
## [1] 2000 1
end(Airline)
## [1] 2019 4
frequency(Airline)
## [1] 4
cycle(Airline)
## Qtr1 Qtr2 Qtr3 Qtr4
## 2000 1 2 3 4
## 2001 1 2 3 4
## 2002 1 2 3 4
## 2003 1 2 3 4
## 2004 1 2 3 4
## 2005 1 2 3 4
## 2006 1 2 3 4
## 2007 1 2 3 4
## 2008 1 2 3 4
## 2009 1 2 3 4
## 2010 1 2 3 4
## 2011 1 2 3 4
## 2012 1 2 3 4
## 2013 1 2 3 4
## 2014 1 2 3 4
## 2015 1 2 3 4
## 2016 1 2 3 4
## 2017 1 2 3 4
## 2018 1 2 3 4
## 2019 1 2 3 4
The operating revenue is increasing and it has decreased during the 2008 recession period, and the highest operating revenue is from the third quarter.
# plot
plot(aggregate(Airline,FUN=mean))
boxplot(Airline~cycle(Airline)) # find the max passanger:July
plot(diff(Airline))
According to the Dickey Fuller Test, p-value of 0.01 which is less than 0.05, I can reject the null hypothesis of there is unit root or non-station exists.
#dickey fuller test for stationary
options(warn=-1)
library(tseries)
adf.test(Airline,alternative="stationary",k=1)
##
## Augmented Dickey-Fuller Test
##
## data: Airline
## Dickey-Fuller = -5.0071, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
If acf dies down, and pacf cuts off at lag p, then use AR model.
If acf cuts off at lag q, and pacf dies down, then use MA model.
If acf dies down , pacf dies down , then use ARMA model.
According to the pacf and acf, I can found that the acf dies down slowly, and the pacf cuts off at lag of p. Therefore, I can use AR model with p= 5, d=1, q=0.
pacf((diff(Airline)),lag.max = 100) #auto regression p=5
acf((diff(Airline)),lag = 100) #moving average q=0, d=1
The model is built with based on the original data of the period from 2000 to 2013 for forecasting 2014 to 2019. The purpose of doing it is for verifying the accuracy of the model, and see how correctly the model is forcasting. If the forecasting is satisfied, use the same model, but with the complete original dataset to forecast 2020. The mean absolute percentage error show 3.59 which tells our model is fairly accurated.
test_data<-ts(airline, start=c(2000,1), end=c(2013,4), frequency=4)
test_fit <- arima(test_data, c(5, 1, 0),seasonal = list(order = c(5, 1, 0), period = 4))
test_forecast <- predict(test_fit, n.ahead=6*4) #predict 6 years by quarterly.
test_forecast$pred
## Qtr1 Qtr2 Qtr3 Qtr4
## 2014 48109711 51976537 53202323 49803675
## 2015 47508702 51154373 52448139 50429712
## 2016 49967695 55205391 56483027 53276177
## 2017 52469728 57360598 57954365 54681948
## 2018 53817959 58257202 58964817 55883776
## 2019 54852313 59112030 60407471 57240506
Tail<-tail(Airline,24) # actual value from 2014-2019.
Tail
## Qtr1 Qtr2 Qtr3 Qtr4
## 2014 47689693 54246424 54916168 50861603
## 2015 48139484 53200172 54187425 49833811
## 2016 47929718 53259470 55299314 52806834
## 2017 50904764 57624998 57050446 56662524
## 2018 55232127 62012174 62000772 60515342
## 2019 57152302 64449431 63895840 62140198
accuracy(test_forecast$pred,Tail)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 1310119 2507410 2070265 2.118783 3.591304 0.6447802 0.6193072
The result of test model for predicting 2014 to 2019 is fairly accurate, so use the same model with the complete data set(included actual data of 2014 to 2019 in the model).
Airline<-ts(airline, start=c(2000,1), end=c(2019,4), frequency=4)
airline_fit <- arima(Airline, c(5, 1, 0),seasonal = list(order = c(5, 1, 0), period = 4))
airline_forecast <- predict(airline_fit, n.ahead=1*4) #predict 1 year by quarterly.
airline_forecast$pred
## Qtr1 Qtr2 Qtr3 Qtr4
## 2020 58766336 65846449 65335735 63522037
ts.plot(Airline, airline_forecast$pred, lty = c(1,3),col="blue",main="prediction of 2020")
The lack of correlation shows that the forecasts are good. The Ljung-box test shows p-value of 0.58 which is greater than 0.05. Therefore, there is no statistical evidence to reject the null hypothesis of independency of residuals.
checkresiduals(airline_fit)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(5,1,0)(5,1,0)[4]
## Q* = 1.9565, df = 3, p-value = 0.5815
##
## Model df: 10. Total lags used: 13
```