Manipur is a tiny state in the Northeast region of India. I will forecast the next census which is going to happen in 2021. The census happens every 10 years. The data is collected from a govt. published document. The data consist of two variables - Year and the correspoding population. I will load the data first.

census
##    Year Population
## 1  1901     284465
## 2  1911     346222
## 3  1921     384016
## 4  1931     445606
## 5  1941     512069
## 6  1951     577635
## 7  1961     780037
## 8  1971    1072753
## 9  1981    1420953
## 10 1991    1837149
## 11 2001    2293896
## 12 2011    2721756
class(census)
## [1] "data.frame"

I will change the data frame into time series.

census.ts <- ts(census$Population, start = 1901, deltat = 10)
class(census.ts)
## [1] "ts"
census.ts
## Time Series:
## Start = 1901 
## End = 2011 
## Frequency = 0.1 
##  [1]  284465  346222  384016  445606  512069  577635  780037 1072753
##  [9] 1420953 1837149 2293896 2721756
plot(census.ts, main = "Population trend of Manipur", xlab = "Years", ylab = "Population")

plot of chunk unnamed-chunk-3

Before we go ahead and build the time series model, we will check ACF and PACF. So that we can see what order we can have for ARIMA model.

acf(census.ts)

plot of chunk unnamed-chunk-4

pacf(census.ts)

plot of chunk unnamed-chunk-4

From ACF we can see the only significant lag is 2nd lag. From PACF only 1st lag is significant.

We can also see what auto arima function suggests from forecast package.

require(forecast)
auto.arima(census.ts)
## Series: census.ts 
## ARIMA(1,2,0)                    
## 
## Coefficients:
##         ar1
##       0.548
## s.e.  0.241
## 
## sigma^2 estimated as 2.5e+09:  log likelihood=-122.6
## AIC=249.1   AICc=250.8   BIC=249.7

We will build our first model with auto.arima()’s suggestion, that is, (1, 2, 0)

model1 <- arima(census.ts, order = c(1,2,0))
model1
## Series: census.ts 
## ARIMA(1,2,0)                    
## 
## Coefficients:
##         ar1
##       0.548
## s.e.  0.241
## 
## sigma^2 estimated as 2.5e+09:  log likelihood=-122.6
## AIC=249.1   AICc=250.8   BIC=249.7

Lets validate whether the coefficients are significant or not with confint(). Also check whether the residuals are normally distributed with shapiro.test() and check whether the residuals are significant with acf(). We will plot the residuals as well.

confint(model1)
##       2.5 % 97.5 %
## ar1 0.07672   1.02
model1.res <- residuals(model1)
shapiro.test(model1.res)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1.res
## W = 0.8092, p-value = 0.01193
acf(model1.res)

plot of chunk unnamed-chunk-8

plot(model1.res)

plot of chunk unnamed-chunk-8

We will do forecast from this model and plot the forecast.

model1.prediction <- forecast.Arima(model1, h = 1)
model1.prediction
##      Point Forecast   Lo 80   Hi 80   Lo 95   Hi 95
## 2021        3133778 3069688 3197868 3035761 3231795
plot.forecast(model1.prediction, xlab = "Years", ylab = "Population", main = "Expected population in 2021")

plot of chunk unnamed-chunk-9

We will try another model and see if there is any difference. The 2nd model order is (1,2,1).

model2 <- arima(census.ts, order = c(1,2,1))
model2
## Series: census.ts 
## ARIMA(1,2,1)                    
## 
## Coefficients:
##         ar1    ma1
##       0.202  0.487
## s.e.  0.607  0.646
## 
## sigma^2 estimated as 2.46e+09:  log likelihood=-122.5
## AIC=251.1   AICc=255.1   BIC=252

We will validate the 2nd model as well.

confint(model2)
##       2.5 % 97.5 %
## ar1 -0.9875  1.392
## ma1 -0.7790  1.754
model2.res <- residuals(model2)
shapiro.test(model2.res)
## 
##  Shapiro-Wilk normality test
## 
## data:  model2.res
## W = 0.8321, p-value = 0.0222
acf(model2.res)

plot of chunk unnamed-chunk-11

plot(model2.res)

plot of chunk unnamed-chunk-11

And the forecast from the 2nd model is

model2.prediction <- forecast.Arima(model2, h = 1)
model2.prediction
##      Point Forecast   Lo 80   Hi 80   Lo 95   Hi 95
## 2021        3123737 3060200 3187274 3026565 3220909
plot.forecast(model2.prediction, xlab = "Years", ylab = "Population", main = "Expected population in 2021")

plot of chunk unnamed-chunk-12

Looking at the Log likelihood, AIC and confidence interval values between the two models, it is better to go ahead with the first model. Moreover, it is not limited to just these two models, we can always try few more orders.