Time Series

About Data

U.S. Census Bureau hosted by the Federal Reserve Economic Database (FRED),they update their information according the amount of data that is brought in.

A population is defined as a group of individuals of the same species living and interbreeding within a given area. Members of a population often rely on the same resources, are subject to similar environmental constraints, and depend on the availability of other members to persist over time.

and every time in all places there will certainly continue to experience an increase in population, therefore we will try to make a time series model to see the population in a few years

Prepare Dataset and library

import Dataset

dataset = read.csv("POP.csv")
knitr::kable(head(dataset))
realtime_start value date realtime_end
2019-12-06 156309 1952-01-01 2019-12-06
2019-12-06 156527 1952-02-01 2019-12-06
2019-12-06 156731 1952-03-01 2019-12-06
2019-12-06 156943 1952-04-01 2019-12-06
2019-12-06 157140 1952-05-01 2019-12-06
2019-12-06 157343 1952-06-01 2019-12-06

import Library

library(dplyr)
library(tidyverse) #data manipulation
library(lubridate) # date manipulation
library(forecast) # time series library
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(fpp) # usconsumtion
library(padr) # for padding
library(ggplot2) # visualize
library(gridExtra) # visualize

data processing

we will change a date column as date type and delete a realtime_start and realtime_end column in our dataset

dataset = dataset %>% 
  select(value, date) %>% 
  mutate(date = as.Date(date)) 
## [1] "1952-01-01" "2019-12-01"

we know this data is start from 1952 to end 2019 year

check missing value or date

dataset %>% 
  padr::pad() %>% 
  is.na() %>% 
  colSums()
## pad applied on the interval: month
## value  date 
##     0     0

we have a good data, becouse we dont have any lost value or date

Time Series

About Time Series

Time series is a method of analyzing and processing data which the values are affected by time. The action of predicting future values based on its value in the previous period of time is called forecasting. the boject in time series have 3 component these are trend, seasonal and error

data_ts = ts(dataset$value , start = c(1952,1), frequency = 12)

Cross Validation

we will split data test and train, for test we will use the data from 2010-01-01 to 2019-12-01 and for train set we use 1952-01-01 to 2009-12-01

test = tail(data_ts,12*9)

train = head(data_ts, length(data_ts)-length(test))

EDA

in time series that have 2 type data, that is additive and multiplicative, we will check out out data time series is additive or multiplicative

train %>% 
  autoplot()


as a plot in below we know this time series is additive

train %>% 
  decompose(type = "multiplicative") %>% 
  autoplot()


and we know to this time series has seasonal

Modeling Forcast

There are few ways of forecasting in time series:
1. Simple Moving Avarage
Predict the value of a data point by calculating the average of n numbers of data before the data point. This method is more appropriate for a data with no trend and seasonal and used for smoothing error of the data.

2. Exponential Smoothing
Predict future values by smoothing on error, trend, and seasonal, where it gives different weights on each data point used for prediction. Exponential Smoothing divided into three that is simpel exponentional smoothing, Holt and Holt Winter

3. ARIMA
ARIMA is an acronyms for Auto Regresive Integreted Moving Averange. ARIMA is great for predicting a stationary time series data. Stationary time series data has values which move around its mean (no trend nor seasonality)

becouse in our data time series we dont have seasonal but have a trend, we will try Holt, Holtwinter and ARIMA forcasting

using HoltWinter

becouse in out data time series have a seasonality we will set up gamma is TRUE and we will create 2 model the one of is manual tuning model and second one is auto tuning model

hw1 = HoltWinters(x = train, gamma = T, beta = 0.05, alpha = 0.8)
hw2 = HoltWinters(x = train)

we create 2 model HoltWinter hw1 with tuning and hw2 is auto tuning

forecast

forecast model 9 years a head

hw1_forecast = forecast(hw1, h = 12*9)
hw2_forecast = forecast(hw2, h = 12*9)

evaluation

train %>% 
  autoplot()+
  autolayer(hw1_forecast, series = "model with tuning")+
  autolayer(test, series = "test")+
  autolayer(hw2_forecast, series = "model non-tuning")

a = test %>% 
  autoplot()+
  autolayer(hw2_forecast, series = "model non-tuning")


b = test %>% 
  autoplot()+
   autolayer(hw1_forecast, series = "model with tuning")

grid.arrange(a,b)

accuracy(hw1_forecast$mean, test)
##                 ME     RMSE      MAE        MPE      MAPE      ACF1 Theil's U
## Test set -1008.179 1208.741 1008.495 -0.3107722 0.3108736 0.9654125  6.529574
accuracy(hw2_forecast$mean, test)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -803.0506 955.7368 803.0506 -0.247659 0.247659 0.9644099  5.162915

From the analysis above, we can conclude that we have sucsesfully forecast population increase with Holtwinter and found model wiht auto tunning as the best model with the lowest error (MAPE 0.247659).

Using Arima

adf test

train %>% 
  adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -1.3632, Lag order = 8, p-value = 0.8478
## alternative hypothesis: stationary

Based on Augmented Dickey-Fuller Test (adf.test) result, the p-value > alpha and therefore we need to perform differencing on the data first.

train %>% 
  diff(lag = 12) %>% 
  diff() %>% 
  adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -5.1044, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

and after differencing the p-value is < alpha and therefore the data is already stationary.

train %>% 
  diff(lag = 12) %>% 
  diff() %>% 
  tsdisplay()

create arima model

we will create 2 model with arima. model the one of is manual tuning model and second one is auto tuning model

arim = Arima(train, order = c(0,1,0), seasonal = c(0,1,0)) # 
auto_arim = auto.arima(train)
summary(auto_arim)
## Series: train 
## ARIMA(1,1,1)(1,1,1)[12] 
## 
## Coefficients:
##          ar1      ma1    sar1     sma1
##       0.9738  -0.6237  0.0720  -0.7699
## s.e.  0.0102   0.0328  0.0667   0.0460
## 
## sigma^2 estimated as 692.5:  log likelihood=-2896.91
## AIC=5803.83   AICc=5803.91   BIC=5826.55
## 
## Training set error measures:
##                      ME     RMSE    MAE           MPE        MAPE        MASE
## Training set -0.6654741 25.99739 10.148 -0.0003652596 0.004764247 0.003866058
##                  ACF1
## Training set 0.105954

we can see arima with auto tuning have order = c(1,1,1) and seasonal = c(1,1,1) and that is diferent from with model manual tuning

check AIC

arim$aic
## [1] 6324.178
auto_arim$aic
## [1] 5803.826

forecast

forecast model 9 years a head

arim_fc = forecast(arim, 12*9)
auto_arim_fc = forecast(arim, 12*9)

evaluation

train %>% 
  autoplot()+
  autolayer(arim_fc, series = "Arima Forcast")+
  autolayer(auto_arim_fc, series = "Auto Arima Forcast")

c = test %>% 
  autoplot()+
  autolayer(arim_fc, series = "model non-tuning")


d = test %>% 
  autoplot()+
   autolayer(auto_arim_fc, series = "model with tuning")

grid.arrange(c,d)

### check accuracy MAPE

accuracy(arim_fc, test)
##                      ME      RMSE       MAE           MPE        MAPE
## Training set  -1.242852  30.86293  16.51099 -0.0005525104 0.007544183
## Test set     323.717833 377.08267 327.21109  0.1002121992 0.101334354
##                     MASE      ACF1 Theil's U
## Training set 0.006290153 0.4086018        NA
## Test set     0.124656817 0.9389123   2.05404
accuracy(auto_arim_fc, test)
##                      ME      RMSE       MAE           MPE        MAPE
## Training set  -1.242852  30.86293  16.51099 -0.0005525104 0.007544183
## Test set     323.717833 377.08267 327.21109  0.1002121992 0.101334354
##                     MASE      ACF1 Theil's U
## Training set 0.006290153 0.4086018        NA
## Test set     0.124656817 0.9389123   2.05404

from the analys above, we can conclude that we have sucsesfully forecast population increase and MAPE score is better then use HoltWinter

Assumption Check

shapiro

with shapiro test we can check a residuals is normaly distributed or not, if p-value < 0.05 that mean the forecast residuals are not normaly distributed and then if p-value > 0.05 that mean residuals are normally distributed

shapiro.test(arim_fc$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  arim_fc$residuals
## W = 0.60971, p-value < 2.2e-16
hist(arim_fc$residuals, breaks = 10)

shapiro.test(auto_arim_fc$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  auto_arim_fc$residuals
## W = 0.60971, p-value < 2.2e-16
hist(auto_arim_fc$residuals, breaks = 10)

we can assumption the forecast residuals are not normally distributed

Ljung-Box

with Ljung-Box we can determine forecast have autocorrelation in the forecast errors or not, if p-value < 0.05 that mean there is an autocorrelation in the forecast errors and if p-value > 0.05 that mean No autocorrelation in the forecast errors

Box.test(arim_fc$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  arim_fc$residuals
## X-squared = 118.71, df = 1, p-value < 2.2e-16
Box.test(auto_arim_fc$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  auto_arim_fc$residuals
## X-squared = 118.71, df = 1, p-value < 2.2e-16

Based on the assumption check, in 2 model arima, there is an autocorrelation on our forecast residuals (p-value < 0.05) and our forecast’s residuals are not distributed normally (p-value < 0.05).therefore it’s residuals may not be appeared around its mean as seen in the histogram, But, if we inspect the distribution of residuals through a line plot, it is actually resembles the error plot from our time series object decomposition.
In a time series, such errors might emerge from various unpredictable events and is actually quite unavoidable. One strategy to overcome it is to analyze what kinds of unpredictable events that might occur and occurs frequently. This can be done by time series analysis using seasonality adjustment. in population growth mybe sometime we dont know how much birthrate at some place and the causes