Air Passengers Forecast

Execute by Neha Raut

Introduction:

The dataset consists of monthly totals of international airline passengers, 1949 to 1960.Main aim is to predict next ten years.

Data Information:

Month:- Month of the year

Passengers:- Total number of passengers travelled on that particular month.

Step 1: Load Data

loading library dplyr

library(tseries)
library(forecast)

data(AirPassengers)
AirPassengers

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432

#The dataset shows the number of passengers travelling on a flight for all the months in a year.

class(AirPassengers)

## [1] "ts"

end(AirPassengers)

## [1] 1960   12

Check for missing values

sum(is.na(AirPassengers))

## [1] 0

cycle of this time series is 12 months in a year

frequency(AirPassengers)

## [1] 12

Summary

summary(AirPassengers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0

Setp 2: Test the stationarity of time series

In order to test the stationarity of the time series, let’s run the Augmented Dickey-Fuller Test using the adf.test function from the tseries R package.

First set the hypothesis test:

The null hypothesis : time series is non stationary

The alternative hypothesis : time series is stationary

adf.test(AirPassengers, alternative ="stationary", k=12)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  AirPassengers
## Dickey-Fuller = -1.5094, Lag order = 12, p-value = 0.7807
## alternative hypothesis: stationary

As a rule of thumb, where the p-value is less than 0.05, we strong evidence against the null hypothesis, so we reject the null hypothesis. From the above p-value, we concluded that the time series is non-stationary .

#Below plot shows how they vary with time.
plot(AirPassengers,main="Air Passenger numbers from 1949 to 1961") #show sparkine

#This will fit in a line(will give you a mean)
#data is depend on time only. There is no independet var 
#here time is independent var( no of passanger depend on a time)
plot(AirPassengers)+abline(reg=lm(AirPassengers~time(AirPassengers)))

## integer(0)

from graph you can see at every point of time variance is changing.Variance means you can see difference between trendline and sparkline.Here variance and mean is also increasing, hence it is not stationary.

Step 3: Make it stationary

stationary means there should be consitant mean and variance

We need to address two issue before we test stationary series 1.Remove unequal variance using log of time series 2.Address the trend componant, do this by taking difference

plot(log(AirPassengers))

plot(diff(log(AirPassengers)))

#Now you can see mean and variance both are stationary

#Check general trend.
plot(aggregate(AirPassengers,FUN = mean))

#It is upword trend

Let’s use the boxplot function to see any seasonal effects.

#Show seasonality
#This boxplot will show seasonality
boxplot(AirPassengers~cycle(AirPassengers))

In the boxplot you can see in month july and aug no of passangers traveling are more. The rationale for this could be more people taking holidays and fly over the summer months

Step : Time Series Decomposition

#Decomposition break data into trend, seasonal, regular and random
plot(decompose(AirPassengers)) # time series decomposition

The above figure shows the time series decomposition into trend, seasonal and random (noise) . It is clear that the time series is non-stationary (has random walks) because of seasonal effects and a trend (linear trend).

Step 5: Model Identification and Estimation

tsdata <- ts(log(AirPassengers),frequency = 12)

Calculate p d q value

AR(Autoregreesive) :- by seeing the past value, predict own value

I Integration

MA(Moving Average) :- you take diff intervals and calculate the average

Autocorrelation function and partial autocorrelation function to determine value of p and q

Autocorrelation is the linear dependence of a variable with itself at two points in time Auto correlation function, as the word suggests, auto-correlation, means it is really correlation on itself.With time series we just have a single stream of values, or in other words there is just the X no Y.

So suppose your time series is like this X = 3,5,6,6,7,4,5,6,7,2,3,4,. correlation between 4 and 3, 3 and 2, 2 and 7 (lag 1) will be say y1; correlation between 4 and 2, 3 and 7, 2 and 6 (lag 2) will be say y2; and so on at lags 3 = y3, lag 4 = y4.

acf(AirPassengers)

acf(diff(log(AirPassengers)))

#It determine value of q(value we got as 1)

Partial auto correlation function, as the word suggests, is partial not complete. Here again we are plot the correlations at various lags 1,2,3 BUT after adjusting for the effects of intermediate numbers.

pacf(diff(log(AirPassengers)))

#It determine value of p (value we got as 0)

#d is number of time you do the differentiations to make the mean
#We do diff only one time so value of d is 1
plot(diff(log(AirPassengers)))

Stepn 6: ARIMA Model Prediction

fit <- arima(log(AirPassengers),c(0,1,1),seasonal = list(order=c(0,1,1),period=12))
fit

## 
## Call:
## arima(x = log(AirPassengers), order = c(0, 1, 1), seasonal = list(order = c(0, 
##     1, 1), period = 12))
## 
## Coefficients:
##           ma1     sma1
##       -0.4018  -0.5569
## s.e.   0.0896   0.0731
## 
## sigma^2 estimated as 0.001348:  log likelihood = 244.7,  aic = -483.4

Predict for next 10 years

pred <- predict(fit,n.ahead=10*12) #10 years * 12 months 
pred

## $pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 1961 6.110186 6.053775 6.171715 6.199300 6.232556 6.368779 6.507294
## 1962 6.206435 6.150025 6.267964 6.295550 6.328805 6.465028 6.603543
## 1963 6.302684 6.246274 6.364213 6.391799 6.425054 6.561277 6.699792
## 1964 6.398934 6.342523 6.460463 6.488048 6.521304 6.657526 6.796042
## 1965 6.495183 6.438772 6.556712 6.584297 6.617553 6.753776 6.892291
## 1966 6.591432 6.535022 6.652961 6.680547 6.713802 6.850025 6.988540
## 1967 6.687681 6.631271 6.749210 6.776796 6.810052 6.946274 7.084789
## 1968 6.783931 6.727520 6.845460 6.873045 6.906301 7.042523 7.181039
## 1969 6.880180 6.823769 6.941709 6.969295 7.002550 7.138773 7.277288
## 1970 6.976429 6.920019 7.037958 7.065544 7.098799 7.235022 7.373537
##           Aug      Sep      Oct      Nov      Dec
## 1961 6.502906 6.324698 6.209008 6.063487 6.168025
## 1962 6.599156 6.420948 6.305257 6.159737 6.264274
## 1963 6.695405 6.517197 6.401507 6.255986 6.360523
## 1964 6.791654 6.613446 6.497756 6.352235 6.456773
## 1965 6.887903 6.709695 6.594005 6.448484 6.553022
## 1966 6.984153 6.805945 6.690254 6.544734 6.649271
## 1967 7.080402 6.902194 6.786504 6.640983 6.745520
## 1968 7.176651 6.998443 6.882753 6.737232 6.841770
## 1969 7.272900 7.094692 6.979002 6.833482 6.938019
## 1970 7.369150 7.190942 7.075251 6.929731 7.034268
## 
## $se
##             Jan        Feb        Mar        Apr        May        Jun
## 1961 0.03671562 0.04278290 0.04809071 0.05286829 0.05724854 0.06131668
## 1962 0.09008470 0.09549702 0.10061863 0.10549188 0.11014974 0.11461847
## 1963 0.14650633 0.15224974 0.15778424 0.16313107 0.16830813 0.17333063
## 1964 0.20896641 0.21513637 0.22113424 0.22697368 0.23266660 0.23822352
## 1965 0.27748187 0.28408285 0.29053390 0.29684478 0.30302426 0.30908021
## 1966 0.35174446 0.35876258 0.36564602 0.37240224 0.37903807 0.38555969
## 1967 0.43142004 0.43883777 0.44613217 0.45330922 0.46037439 0.46733277
## 1968 0.51620328 0.52400328 0.53168886 0.53926491 0.54673600 0.55410636
## 1969 0.60582527 0.61399145 0.62205044 0.63000634 0.63786302 0.64562410
## 1970 0.70005065 0.70856839 0.71698494 0.72530383 0.73352839 0.74166174
##             Jul        Aug        Sep        Oct        Nov        Dec
## 1961 0.06513121 0.06873437 0.07215784 0.07542608 0.07855847 0.08157066
## 1962 0.11891939 0.12307010 0.12708531 0.13097750 0.13475731 0.13843396
## 1963 0.17821163 0.18296247 0.18759304 0.19211202 0.19652712 0.20084519
## 1964 0.24365373 0.24896554 0.25416635 0.25926286 0.26426110 0.26916654
## 1965 0.31501977 0.32084940 0.32657497 0.33220188 0.33773506 0.34317903
## 1966 0.39197283 0.39828272 0.40449419 0.41061170 0.41663940 0.42258114
## 1967 0.47418904 0.48094758 0.48761246 0.49418746 0.50067612 0.50708176
## 1968 0.56137996 0.56856052 0.57565152 0.58265623 0.58957772 0.59641889
## 1969 0.65329298 0.66087288 0.66836682 0.67577766 0.68310811 0.69036073
## 1970 0.74970687 0.75766657 0.76554352 0.77334024 0.78105913 0.78870249

The above output prediction value are in logarithemic part,convert them to original form we need to transform them.

#2.718 is e value and round them to 0 decimal 

pred1<-round(2.718^pred$pred,0)  
pred1  #give op of 1960 to 1970

##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1961  450  425  479  492  509  583  670  667  558  497  430  477
## 1962  496  468  527  542  560  642  737  734  614  547  473  525
## 1963  546  516  580  597  617  707  812  808  676  602  521  578
## 1964  601  568  639  657  679  778  894  890  745  663  573  637
## 1965  661  625  703  723  748  857  984  980  820  730  631  701
## 1966  728  688  775  796  823  943 1083 1079  903  804  695  772
## 1967  802  758  853  877  906 1039 1193 1188  994  885  765  850
## 1968  883  834  939  965  998 1143 1313 1308 1094  975  843  935
## 1969  972  919 1034 1063 1099 1259 1446 1440 1205 1073  928 1030
## 1970 1070 1012 1138 1170 1210 1386 1592 1585 1326 1181 1021 1134

plot this model

#line type (lty) can be specified using either text ("blank", "solid", "dashed", "dotted", "dotdash", "longdash", "twodash") or number (0, 1, 2, 3, 4, 5, 6). Note that lty = "solid" is identical to lty=1.

ts.plot(AirPassengers,pred1,log="y",lty=c(1,3))

In above graph, dark(solid) line is original values and dotted are predicted values

Compare predicted values with original values

#Get only 1961 values
data1<-head(pred1,12)
data1

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1961 450 425 479 492 509 583 670 667 558 497 430 477

We can see you predict almost close values to Original values

#Predicted Values

predicted_1960 <- round(data1)#head of Predicted
predicted_1960

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1961 450 425 479 492 509 583 670 667 558 497 430 477

#Original
original_1960 <- tail(AirPassengers,12) #tail of original
original_1960

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1960 417 391 419 461 472 535 622 606 508 461 390 432

Lets Test this Model we are going to take a dataset till 1959, and then we predict value of 1960, then validate that 1960 from alredy existing value we have it in dataset

#Recreate model till 1959
datawide <- ts(AirPassengers, frequency = 12, start=c(1949,1), end=c(1959,12))
datawide

##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405

#Create model
fit1 <- arima(log(datawide),c(0,1,1),seasonal = list(order=c(0,1,1),period=12))
pred <- predict(fit1,n.ahead=10*12)  # predictfor now 1960 to 1970
pred1<-2.718^pred$pred  
pred1  #give op of 1960 to 1970

##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1960  419.0628  398.6732  466.2820  454.1188  472.9611  546.7614  621.8017
## 1961  469.5742  446.7270  522.4849  508.8558  529.9692  612.6649  696.7502
## 1962  526.1740  500.5730  585.4623  570.1903  593.8487  686.5121  780.7325
## 1963  589.5961  560.9092  656.0306  638.9179  665.4278  769.2603  874.8376
## 1964  660.6627  628.5180  735.1048  715.9294  745.6347  861.9826  980.2855
## 1965  740.2952  704.2761  823.7102  802.2235  835.5093  965.8811 1098.4436
## 1966  829.5262  789.1655  922.9956  898.9190  936.2169 1082.3030 1230.8438
## 1967  929.5126  884.2870 1034.2482 1007.2696 1049.0631 1212.7576 1379.2027
## 1968 1041.5507  990.8740 1158.9107 1128.6801 1175.5113 1358.9366 1545.4440
## 1969 1167.0934 1110.3083 1298.5992 1264.7248 1317.2007 1522.7351 1731.7230
##            Aug       Sep       Oct       Nov       Dec
## 1960  629.7291  526.4044  461.9958  406.3747  452.0098
## 1961  705.6331  589.8542  517.6821  455.3568  506.4925
## 1962  790.6861  660.9519  580.0806  510.2429  567.5423
## 1963  885.9909  740.6193  650.0003  571.7447  635.9506
## 1964  992.7833  829.8893  728.3476  640.6596  712.6045
## 1965 1112.4477  929.9195  816.1385  717.8810  798.4978
## 1966 1246.5359 1042.0067  914.5112  804.4104  894.7442
## 1967 1396.7863 1167.6043 1024.7412  901.3694 1002.5916
## 1968 1565.1470 1308.3407 1148.2577 1010.0154 1123.4383
## 1969 1753.8009 1466.0406 1286.6622 1131.7569 1258.8512

data11=round(head(pred1,12),0) #head of Predicted
data22=round(tail(AirPassengers,12),0) #tail of original 

plot(data11,col="red", type="l")
lines(data22,col="blue")

Step 7: Check normality using Q-Q plot

qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.

qqnorm(residuals(fit))
qqline(residuals(fit))

The linearity of the points suggests that the data are normally distributed with mean = 0

Check for stationarity

To cross check the stationarity of timeseries, use Augemented Dickey-Fuller test (from tseries package). Rejecting the null hypothesis suggests that a time series is stationary

adf.test(fit$residuals, alternative ="stationary")

## Warning in adf.test(fit$residuals, alternative = "stationary"): p-value
## smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  fit$residuals
## Dickey-Fuller = -4.7907, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

From the above p-value, we can conclude that the residuals of our ARIMA prediction model is stationary.