The data is been taken from the kaggle website. the link for the data source is as follows https://www.kaggle.com/limkongkong/airpassengers .
The main aim of doing the analysis and modelling for the dataset is to review the time series theory and also doing some experiments with current R packages. ARIMA model has been used for the dataset of Air Passengers and we have done exploratory data analysis, we have tried to decompose the dataset so as to test the stationarity. After identifying whether data is stationarity we have fitted a model by using augmented algorithm. Finally, we have forecasted the trend for next 10 years Setup.
data(AirPassengers)
air <- AirPassengers
class(air)
## [1] "ts"
To do the analysis of data first we need to look at the data review it with necessary functions like summary statistics and later doing it with plots.
air
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
sum(is.na(air))
## [1] 0
frequency(air)
## [1] 12
cycle(air)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 1 2 3 4 5 6 7 8 9 10 11 12
## 1950 1 2 3 4 5 6 7 8 9 10 11 12
## 1951 1 2 3 4 5 6 7 8 9 10 11 12
## 1952 1 2 3 4 5 6 7 8 9 10 11 12
## 1953 1 2 3 4 5 6 7 8 9 10 11 12
## 1954 1 2 3 4 5 6 7 8 9 10 11 12
## 1955 1 2 3 4 5 6 7 8 9 10 11 12
## 1956 1 2 3 4 5 6 7 8 9 10 11 12
## 1957 1 2 3 4 5 6 7 8 9 10 11 12
## 1958 1 2 3 4 5 6 7 8 9 10 11 12
## 1959 1 2 3 4 5 6 7 8 9 10 11 12
## 1960 1 2 3 4 5 6 7 8 9 10 11 12
summary(air)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.0 180.0 265.5 280.3 360.5 622.0
plot(air,xlab="Date", ylab = "Number of Passengers",main="Passenger Count from the year 1949 till 1961")
There is a R package known as ggfortify which is used as extension from the R package ggplot2 as an alternative for base function to plot directly with the help of time series.
After doing this to see any seasonal effects we will use the function of boxplot to show the necessary plot.
boxplot(air~cycle(air),xlab="Date", ylab = "Passenger Numbers (1000's)" ,main ="Monthly Passenger Count from the year 1949 till 1961")
So according to the boxplot we can say that there are some place where we can make some initial inferences. We can say that due to the growth in the demand for the air flights to travel because of these the number of passengers are increasing day by day yearly which can be an indication for the linear trend. In the year of 12 months we can say there is a year of season in which we can say that most of the people travel at that time so in this we can say that between the period 6 to 9 we can see that with the higher variances and mean compare to other months. So, according to the plot we can see that we cannot see any missing values and also not any outliers so the next step of data cleaning is not required.
Using the method Moving average we will do the estimates of the trend, seasonal and also the random components with the decomposition of the time series.
The formula for the Multiplicative we can say is : Y[t]=T[t]???S[t]???e[t]
Where . Y(t) stands for no. of the travelers at t time. . T(t) stands for trend constituent at t time. . S(t) stands for seasonal factor for t time. . e(t) stands for the random error factor for t time.
With the help of this we will be doing the function of decomposition to use the R functions for making plots, autoplot for the further analysis.
decomposeair <- decompose(air,"multiplicative")
autoplot(decomposeair)
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Removed 24 rows containing missing values (geom_path).
So In the plot of decomposition we get to see that the seasonality and the trend line are previously inferred, but for the random factor it is under the “remainder”.
Mean, covariance and variance are not elements of time according to the condition of stationary time series. The time arrangement is required to be stationary so as to fit ARIMA models and for that purpose we have utilized two techniques: 1.Augmented Dickey-Fuller Test for checking stationarity adf.test function is used from tseries package of R to test the stationarity of the time series and for that purpose we have applied Augmented Dickey-Fuller Test Firstly, we will check the hypothesis: H0 : time series is not stationary HA : time series is stationary
adf.test(diff(log(air)), alternative="stationary", k=0)
## Warning in adf.test(diff(log(air)), alternative = "stationary", k = 0): p-
## value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(log(air))
## Dickey-Fuller = -9.6003, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
As per the results mentioned above, we will reject the null hypothesis, as p-value is less than 5% and accept the alternative hypothesis and conclude that the time series is stationary. 2. Auto-correlation function acf function has been used from package of R and with the help of this function we will plot the correlation between the time series and its lags. A specific lag is meaningfully interrelated with series, if the auto-correlation surpass the dashed blue line.
autoplot(acf(log(air))) + labs(title="Correlogram of Air Passengers from 1949 to 1961")
From the above figure it can be predicted that there is a positive relation with the cycle of 12 months, which is indicated by maximum at lag 1 or 12 months. We have now plotted the acf, list object with a arbitrary component has previously been formed.
We can see the decay of ACF chart is too slow, which shows us that the population is not stationary. Now let’s check how the ACF and PACF curve will result after we test regression on the differences.
acf(diff(log(air)))
pacf(diff(log(air)))
We can see the ACF plot disconnects after first interval. We get to know ACF curves gets a interval which means p value should be 0. We can see (0,1,1) as (p,d,q) combination with lowest AIC and BIC.
As we can see that as per the findings this may not be the best fitted model for the dataset as because it do not see the seasonality and effects over time.
(arimaair <- arima(log(air), c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12)))
##
## Call:
## arima(x = log(air), order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1),
## period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -485.4
arimaair
##
## Call:
## arima(x = log(air), order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1),
## period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -485.4
The ARIMA(2,1,1) (0,1,0)[12] parameters model which are defined as lag 1 with differencing (d), by which an autoreverting term of second lag (p) which has a changing average for the model of order 1(q).
ARIMA model:
Y^=0.5960Yt2+0.2143Yt120.9819et1+E
In which E stands for error.
For doing the model diagnostics for the residuals and the acf. We will use the package ggfortify R package in which the ggtsdiag function which will include a autocovariance plot.
ggtsdiag(arimaair)
We can say that the ARIMA model is good fit as the plots can been seen around 0 centered as noise which includes with no pattern.
Finally, using the Forecast function we will now fit the Arima Model with the time series and later the we will calculate the future 10 years.
(arimaair <- arima(log(air), c(0, 1, 1),seasonal = list(order = c(0, 1, 1), period = 12)))
##
## Call:
## arima(x = log(air), order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1),
## period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -485.4
arimaair
##
## Call:
## arima(x = log(air), order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1),
## period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -485.4
pred <- predict(arimaair, n.ahead = 10*12)
ts.plot(air,2.718^pred$pred, log = "y", lty = c(1,3))
To conclude, we have used some time series packages in R which are tseries, ggfortify and forecast. This was a good basic ARIMA modeling project now we can move to more complex comparisons, new time series dataset and models in R.