Manipur is a tiny state in the Northeast region of India. I will forecast the next census which is going to happen in 2021. The census happens every 10 years. The data is collected from a govt. published document. The data consist of two variables - Year and the correspoding population. I will load the data first.
census
## Year Population
## 1 1901 284465
## 2 1911 346222
## 3 1921 384016
## 4 1931 445606
## 5 1941 512069
## 6 1951 577635
## 7 1961 780037
## 8 1971 1072753
## 9 1981 1420953
## 10 1991 1837149
## 11 2001 2293896
## 12 2011 2721756
class(census)
## [1] "data.frame"
I will change the data frame into time series.
census.ts <- ts(census$Population, start = 1901, deltat = 10)
class(census.ts)
## [1] "ts"
census.ts
## Time Series:
## Start = 1901
## End = 2011
## Frequency = 0.1
## [1] 284465 346222 384016 445606 512069 577635 780037 1072753
## [9] 1420953 1837149 2293896 2721756
plot(census.ts, main = "Population trend of Manipur", xlab = "Years", ylab = "Population")
Before we go ahead and build the time series model, we will check ACF and PACF. So that we can see what order we can have for ARIMA model.
acf(census.ts)
pacf(census.ts)
From ACF we can see the only significant lag is 2nd lag. From PACF only 1st lag is significant.
We can also see what auto arima function suggests from forecast package.
require(forecast)
auto.arima(census.ts)
## Series: census.ts
## ARIMA(1,2,0)
##
## Coefficients:
## ar1
## 0.548
## s.e. 0.241
##
## sigma^2 estimated as 2.5e+09: log likelihood=-122.6
## AIC=249.1 AICc=250.8 BIC=249.7
We will build our first model with auto.arima()’s suggestion, that is, (1, 2, 0)
model1 <- arima(census.ts, order = c(1,2,0))
model1
## Series: census.ts
## ARIMA(1,2,0)
##
## Coefficients:
## ar1
## 0.548
## s.e. 0.241
##
## sigma^2 estimated as 2.5e+09: log likelihood=-122.6
## AIC=249.1 AICc=250.8 BIC=249.7
Lets validate whether the coefficients are significant or not with confint(). Also check whether the residuals are normally distributed with shapiro.test() and check whether the residuals are significant with acf(). We will plot the residuals as well.
confint(model1)
## 2.5 % 97.5 %
## ar1 0.07672 1.02
model1.res <- residuals(model1)
shapiro.test(model1.res)
##
## Shapiro-Wilk normality test
##
## data: model1.res
## W = 0.8092, p-value = 0.01193
acf(model1.res)
plot(model1.res)
We will do forecast from this model and plot the forecast.
model1.prediction <- forecast.Arima(model1, h = 1)
model1.prediction
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2021 3133778 3069688 3197868 3035761 3231795
plot.forecast(model1.prediction, xlab = "Years", ylab = "Population", main = "Expected population in 2021")
We will try another model and see if there is any difference. The 2nd model order is (1,2,1).
model2 <- arima(census.ts, order = c(1,2,1))
model2
## Series: census.ts
## ARIMA(1,2,1)
##
## Coefficients:
## ar1 ma1
## 0.202 0.487
## s.e. 0.607 0.646
##
## sigma^2 estimated as 2.46e+09: log likelihood=-122.5
## AIC=251.1 AICc=255.1 BIC=252
We will validate the 2nd model as well.
confint(model2)
## 2.5 % 97.5 %
## ar1 -0.9875 1.392
## ma1 -0.7790 1.754
model2.res <- residuals(model2)
shapiro.test(model2.res)
##
## Shapiro-Wilk normality test
##
## data: model2.res
## W = 0.8321, p-value = 0.0222
acf(model2.res)
plot(model2.res)
And the forecast from the 2nd model is
model2.prediction <- forecast.Arima(model2, h = 1)
model2.prediction
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2021 3123737 3060200 3187274 3026565 3220909
plot.forecast(model2.prediction, xlab = "Years", ylab = "Population", main = "Expected population in 2021")
Looking at the Log likelihood, AIC and confidence interval values between the two models, it is better to go ahead with the first model. Moreover, it is not limited to just these two models, we can always try few more orders.