Execute by Neha Raut
The dataset consists of monthly totals of international airline passengers, 1949 to 1960.Main aim is to predict next ten years.
Month:- Month of the year
Passengers:- Total number of passengers travelled on that particular month.
loading library dplyr
library(tseries)
library(forecast)
data(AirPassengers)
AirPassengers
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
#The dataset shows the number of passengers travelling on a flight for all the months in a year.
class(AirPassengers)
## [1] "ts"
end(AirPassengers)
## [1] 1960 12
Check for missing values
sum(is.na(AirPassengers))
## [1] 0
cycle of this time series is 12 months in a year
frequency(AirPassengers)
## [1] 12
Summary
summary(AirPassengers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.0 180.0 265.5 280.3 360.5 622.0
In order to test the stationarity of the time series, let’s run the Augmented Dickey-Fuller Test using the adf.test function from the tseries R package.
First set the hypothesis test:
The null hypothesis : time series is non stationary
The alternative hypothesis : time series is stationary
adf.test(AirPassengers, alternative ="stationary", k=12)
##
## Augmented Dickey-Fuller Test
##
## data: AirPassengers
## Dickey-Fuller = -1.5094, Lag order = 12, p-value = 0.7807
## alternative hypothesis: stationary
As a rule of thumb, where the p-value is less than 0.05, we strong evidence against the null hypothesis, so we reject the null hypothesis. From the above p-value, we concluded that the time series is non-stationary .
#Below plot shows how they vary with time.
plot(AirPassengers,main="Air Passenger numbers from 1949 to 1961") #show sparkine
#This will fit in a line(will give you a mean)
#data is depend on time only. There is no independet var
#here time is independent var( no of passanger depend on a time)
plot(AirPassengers)+abline(reg=lm(AirPassengers~time(AirPassengers)))
## integer(0)
from graph you can see at every point of time variance is changing.Variance means you can see difference between trendline and sparkline.Here variance and mean is also increasing, hence it is not stationary.
stationary means there should be consitant mean and variance
We need to address two issue before we test stationary series 1.Remove unequal variance using log of time series 2.Address the trend componant, do this by taking difference
plot(log(AirPassengers))
plot(diff(log(AirPassengers)))
#Now you can see mean and variance both are stationary
#Check general trend.
plot(aggregate(AirPassengers,FUN = mean))
#It is upword trend
Let’s use the boxplot function to see any seasonal effects.
#Show seasonality
#This boxplot will show seasonality
boxplot(AirPassengers~cycle(AirPassengers))
In the boxplot you can see in month july and aug no of passangers traveling are more. The rationale for this could be more people taking holidays and fly over the summer months
#Decomposition break data into trend, seasonal, regular and random
plot(decompose(AirPassengers)) # time series decomposition
The above figure shows the time series decomposition into trend, seasonal and random (noise) . It is clear that the time series is non-stationary (has random walks) because of seasonal effects and a trend (linear trend).
tsdata <- ts(log(AirPassengers),frequency = 12)
Calculate p d q value
AR(Autoregreesive) :- by seeing the past value, predict own value
I Integration
MA(Moving Average) :- you take diff intervals and calculate the average
Autocorrelation function and partial autocorrelation function to determine value of p and q
Autocorrelation is the linear dependence of a variable with itself at two points in time Auto correlation function, as the word suggests, auto-correlation, means it is really correlation on itself.With time series we just have a single stream of values, or in other words there is just the X no Y.
So suppose your time series is like this X = 3,5,6,6,7,4,5,6,7,2,3,4,. correlation between 4 and 3, 3 and 2, 2 and 7 (lag 1) will be say y1; correlation between 4 and 2, 3 and 7, 2 and 6 (lag 2) will be say y2; and so on at lags 3 = y3, lag 4 = y4.
acf(AirPassengers)
acf(diff(log(AirPassengers)))
#It determine value of q(value we got as 1)
Partial auto correlation function, as the word suggests, is partial not complete. Here again we are plot the correlations at various lags 1,2,3 BUT after adjusting for the effects of intermediate numbers.
pacf(diff(log(AirPassengers)))
#It determine value of p (value we got as 0)
#d is number of time you do the differentiations to make the mean
#We do diff only one time so value of d is 1
plot(diff(log(AirPassengers)))
fit <- arima(log(AirPassengers),c(0,1,1),seasonal = list(order=c(0,1,1),period=12))
fit
##
## Call:
## arima(x = log(AirPassengers), order = c(0, 1, 1), seasonal = list(order = c(0,
## 1, 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -483.4
Predict for next 10 years
pred <- predict(fit,n.ahead=10*12) #10 years * 12 months
pred
## $pred
## Jan Feb Mar Apr May Jun Jul
## 1961 6.110186 6.053775 6.171715 6.199300 6.232556 6.368779 6.507294
## 1962 6.206435 6.150025 6.267964 6.295550 6.328805 6.465028 6.603543
## 1963 6.302684 6.246274 6.364213 6.391799 6.425054 6.561277 6.699792
## 1964 6.398934 6.342523 6.460463 6.488048 6.521304 6.657526 6.796042
## 1965 6.495183 6.438772 6.556712 6.584297 6.617553 6.753776 6.892291
## 1966 6.591432 6.535022 6.652961 6.680547 6.713802 6.850025 6.988540
## 1967 6.687681 6.631271 6.749210 6.776796 6.810052 6.946274 7.084789
## 1968 6.783931 6.727520 6.845460 6.873045 6.906301 7.042523 7.181039
## 1969 6.880180 6.823769 6.941709 6.969295 7.002550 7.138773 7.277288
## 1970 6.976429 6.920019 7.037958 7.065544 7.098799 7.235022 7.373537
## Aug Sep Oct Nov Dec
## 1961 6.502906 6.324698 6.209008 6.063487 6.168025
## 1962 6.599156 6.420948 6.305257 6.159737 6.264274
## 1963 6.695405 6.517197 6.401507 6.255986 6.360523
## 1964 6.791654 6.613446 6.497756 6.352235 6.456773
## 1965 6.887903 6.709695 6.594005 6.448484 6.553022
## 1966 6.984153 6.805945 6.690254 6.544734 6.649271
## 1967 7.080402 6.902194 6.786504 6.640983 6.745520
## 1968 7.176651 6.998443 6.882753 6.737232 6.841770
## 1969 7.272900 7.094692 6.979002 6.833482 6.938019
## 1970 7.369150 7.190942 7.075251 6.929731 7.034268
##
## $se
## Jan Feb Mar Apr May Jun
## 1961 0.03671562 0.04278290 0.04809071 0.05286829 0.05724854 0.06131668
## 1962 0.09008470 0.09549702 0.10061863 0.10549188 0.11014974 0.11461847
## 1963 0.14650633 0.15224974 0.15778424 0.16313107 0.16830813 0.17333063
## 1964 0.20896641 0.21513637 0.22113424 0.22697368 0.23266660 0.23822352
## 1965 0.27748187 0.28408285 0.29053390 0.29684478 0.30302426 0.30908021
## 1966 0.35174446 0.35876258 0.36564602 0.37240224 0.37903807 0.38555969
## 1967 0.43142004 0.43883777 0.44613217 0.45330922 0.46037439 0.46733277
## 1968 0.51620328 0.52400328 0.53168886 0.53926491 0.54673600 0.55410636
## 1969 0.60582527 0.61399145 0.62205044 0.63000634 0.63786302 0.64562410
## 1970 0.70005065 0.70856839 0.71698494 0.72530383 0.73352839 0.74166174
## Jul Aug Sep Oct Nov Dec
## 1961 0.06513121 0.06873437 0.07215784 0.07542608 0.07855847 0.08157066
## 1962 0.11891939 0.12307010 0.12708531 0.13097750 0.13475731 0.13843396
## 1963 0.17821163 0.18296247 0.18759304 0.19211202 0.19652712 0.20084519
## 1964 0.24365373 0.24896554 0.25416635 0.25926286 0.26426110 0.26916654
## 1965 0.31501977 0.32084940 0.32657497 0.33220188 0.33773506 0.34317903
## 1966 0.39197283 0.39828272 0.40449419 0.41061170 0.41663940 0.42258114
## 1967 0.47418904 0.48094758 0.48761246 0.49418746 0.50067612 0.50708176
## 1968 0.56137996 0.56856052 0.57565152 0.58265623 0.58957772 0.59641889
## 1969 0.65329298 0.66087288 0.66836682 0.67577766 0.68310811 0.69036073
## 1970 0.74970687 0.75766657 0.76554352 0.77334024 0.78105913 0.78870249
The above output prediction value are in logarithemic part,convert them to original form we need to transform them.
#2.718 is e value and round them to 0 decimal
pred1<-round(2.718^pred$pred,0)
pred1 #give op of 1960 to 1970
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1961 450 425 479 492 509 583 670 667 558 497 430 477
## 1962 496 468 527 542 560 642 737 734 614 547 473 525
## 1963 546 516 580 597 617 707 812 808 676 602 521 578
## 1964 601 568 639 657 679 778 894 890 745 663 573 637
## 1965 661 625 703 723 748 857 984 980 820 730 631 701
## 1966 728 688 775 796 823 943 1083 1079 903 804 695 772
## 1967 802 758 853 877 906 1039 1193 1188 994 885 765 850
## 1968 883 834 939 965 998 1143 1313 1308 1094 975 843 935
## 1969 972 919 1034 1063 1099 1259 1446 1440 1205 1073 928 1030
## 1970 1070 1012 1138 1170 1210 1386 1592 1585 1326 1181 1021 1134
plot this model
#line type (lty) can be specified using either text ("blank", "solid", "dashed", "dotted", "dotdash", "longdash", "twodash") or number (0, 1, 2, 3, 4, 5, 6). Note that lty = "solid" is identical to lty=1.
ts.plot(AirPassengers,pred1,log="y",lty=c(1,3))
In above graph, dark(solid) line is original values and dotted are predicted values
Compare predicted values with original values
#Get only 1961 values
data1<-head(pred1,12)
data1
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1961 450 425 479 492 509 583 670 667 558 497 430 477
We can see you predict almost close values to Original values
#Predicted Values
predicted_1960 <- round(data1)#head of Predicted
predicted_1960
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1961 450 425 479 492 509 583 670 667 558 497 430 477
#Original
original_1960 <- tail(AirPassengers,12) #tail of original
original_1960
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
Lets Test this Model we are going to take a dataset till 1959, and then we predict value of 1960, then validate that 1960 from alredy existing value we have it in dataset
#Recreate model till 1959
datawide <- ts(AirPassengers, frequency = 12, start=c(1949,1), end=c(1959,12))
datawide
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
#Create model
fit1 <- arima(log(datawide),c(0,1,1),seasonal = list(order=c(0,1,1),period=12))
pred <- predict(fit1,n.ahead=10*12) # predictfor now 1960 to 1970
pred1<-2.718^pred$pred
pred1 #give op of 1960 to 1970
## Jan Feb Mar Apr May Jun Jul
## 1960 419.0628 398.6732 466.2820 454.1188 472.9611 546.7614 621.8017
## 1961 469.5742 446.7270 522.4849 508.8558 529.9692 612.6649 696.7502
## 1962 526.1740 500.5730 585.4623 570.1903 593.8487 686.5121 780.7325
## 1963 589.5961 560.9092 656.0306 638.9179 665.4278 769.2603 874.8376
## 1964 660.6627 628.5180 735.1048 715.9294 745.6347 861.9826 980.2855
## 1965 740.2952 704.2761 823.7102 802.2235 835.5093 965.8811 1098.4436
## 1966 829.5262 789.1655 922.9956 898.9190 936.2169 1082.3030 1230.8438
## 1967 929.5126 884.2870 1034.2482 1007.2696 1049.0631 1212.7576 1379.2027
## 1968 1041.5507 990.8740 1158.9107 1128.6801 1175.5113 1358.9366 1545.4440
## 1969 1167.0934 1110.3083 1298.5992 1264.7248 1317.2007 1522.7351 1731.7230
## Aug Sep Oct Nov Dec
## 1960 629.7291 526.4044 461.9958 406.3747 452.0098
## 1961 705.6331 589.8542 517.6821 455.3568 506.4925
## 1962 790.6861 660.9519 580.0806 510.2429 567.5423
## 1963 885.9909 740.6193 650.0003 571.7447 635.9506
## 1964 992.7833 829.8893 728.3476 640.6596 712.6045
## 1965 1112.4477 929.9195 816.1385 717.8810 798.4978
## 1966 1246.5359 1042.0067 914.5112 804.4104 894.7442
## 1967 1396.7863 1167.6043 1024.7412 901.3694 1002.5916
## 1968 1565.1470 1308.3407 1148.2577 1010.0154 1123.4383
## 1969 1753.8009 1466.0406 1286.6622 1131.7569 1258.8512
data11=round(head(pred1,12),0) #head of Predicted
data22=round(tail(AirPassengers,12),0) #tail of original
plot(data11,col="red", type="l")
lines(data22,col="blue")
qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a “theoretical”, by default normal, quantile-quantile plot which passes through the probs quantiles, by default the first and third quartiles.
qqnorm(residuals(fit))
qqline(residuals(fit))
The linearity of the points suggests that the data are normally distributed with mean = 0
To cross check the stationarity of timeseries, use Augemented Dickey-Fuller test (from tseries package). Rejecting the null hypothesis suggests that a time series is stationary
adf.test(fit$residuals, alternative ="stationary")
## Warning in adf.test(fit$residuals, alternative = "stationary"): p-value
## smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: fit$residuals
## Dickey-Fuller = -4.7907, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
From the above p-value, we can conclude that the residuals of our ARIMA prediction model is stationary.