Time Series
About Data
U.S. Census Bureau hosted by the Federal Reserve Economic Database (FRED),they update their information according the amount of data that is brought in.
A population is defined as a group of individuals of the same species living and interbreeding within a given area. Members of a population often rely on the same resources, are subject to similar environmental constraints, and depend on the availability of other members to persist over time.
and every time in all places there will certainly continue to experience an increase in population, therefore we will try to make a time series model to see the population in a few years
Prepare Dataset and library
import Dataset
dataset = read.csv("POP.csv")
knitr::kable(head(dataset))| realtime_start | value | date | realtime_end |
|---|---|---|---|
| 2019-12-06 | 156309 | 1952-01-01 | 2019-12-06 |
| 2019-12-06 | 156527 | 1952-02-01 | 2019-12-06 |
| 2019-12-06 | 156731 | 1952-03-01 | 2019-12-06 |
| 2019-12-06 | 156943 | 1952-04-01 | 2019-12-06 |
| 2019-12-06 | 157140 | 1952-05-01 | 2019-12-06 |
| 2019-12-06 | 157343 | 1952-06-01 | 2019-12-06 |
import Library
library(dplyr)
library(tidyverse) #data manipulation
library(lubridate) # date manipulation
library(forecast) # time series library
library(MLmetrics) # calculate error
library(tseries) # adf.test
library(fpp) # usconsumtion
library(padr) # for padding
library(ggplot2) # visualize
library(gridExtra) # visualizedata processing
we will change a date column as date type and delete a realtime_start and realtime_end column in our dataset
dataset = dataset %>%
select(value, date) %>%
mutate(date = as.Date(date)) ## [1] "1952-01-01" "2019-12-01"
we know this data is start from 1952 to end 2019 year
check missing value or date
dataset %>%
padr::pad() %>%
is.na() %>%
colSums()## pad applied on the interval: month
## value date
## 0 0
we have a good data, becouse we dont have any lost value or date
Time Series
About Time Series
Time series is a method of analyzing and processing data which the values are affected by time. The action of predicting future values based on its value in the previous period of time is called forecasting. the boject in time series have 3 component these are trend, seasonal and error
data_ts = ts(dataset$value , start = c(1952,1), frequency = 12)Cross Validation
we will split data test and train, for test we will use the data from 2010-01-01 to 2019-12-01 and for train set we use 1952-01-01 to 2009-12-01
test = tail(data_ts,12*9)
train = head(data_ts, length(data_ts)-length(test))EDA
in time series that have 2 type data, that is additive and multiplicative, we will check out out data time series is additive or multiplicative
train %>%
autoplot()
as a plot in below we know this time series is additive
train %>%
decompose(type = "multiplicative") %>%
autoplot()
and we know to this time series has seasonal
Modeling Forcast
There are few ways of forecasting in time series:
1. Simple Moving Avarage
Predict the value of a data point by calculating the average of n numbers of data before the data point. This method is more appropriate for a data with no trend and seasonal and used for smoothing error of the data.
2. Exponential Smoothing
Predict future values by smoothing on error, trend, and seasonal, where it gives different weights on each data point used for prediction. Exponential Smoothing divided into three that is simpel exponentional smoothing, Holt and Holt Winter
3. ARIMA
ARIMA is an acronyms for Auto Regresive Integreted Moving Averange. ARIMA is great for predicting a stationary time series data. Stationary time series data has values which move around its mean (no trend nor seasonality)
becouse in our data time series we dont have seasonal but have a trend, we will try Holt, Holtwinter and ARIMA forcasting
using HoltWinter
becouse in out data time series have a seasonality we will set up gamma is TRUE and we will create 2 model the one of is manual tuning model and second one is auto tuning model
hw1 = HoltWinters(x = train, gamma = T, beta = 0.05, alpha = 0.8)
hw2 = HoltWinters(x = train)we create 2 model HoltWinter hw1 with tuning and hw2 is auto tuning
forecast
forecast model 9 years a head
hw1_forecast = forecast(hw1, h = 12*9)
hw2_forecast = forecast(hw2, h = 12*9)evaluation
train %>%
autoplot()+
autolayer(hw1_forecast, series = "model with tuning")+
autolayer(test, series = "test")+
autolayer(hw2_forecast, series = "model non-tuning")a = test %>%
autoplot()+
autolayer(hw2_forecast, series = "model non-tuning")
b = test %>%
autoplot()+
autolayer(hw1_forecast, series = "model with tuning")
grid.arrange(a,b)accuracy(hw1_forecast$mean, test)## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -1008.179 1208.741 1008.495 -0.3107722 0.3108736 0.9654125 6.529574
accuracy(hw2_forecast$mean, test)## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -803.0506 955.7368 803.0506 -0.247659 0.247659 0.9644099 5.162915
From the analysis above, we can conclude that we have sucsesfully forecast population increase with Holtwinter and found model wiht auto tunning as the best model with the lowest error (MAPE 0.247659).
Using Arima
adf test
train %>%
adf.test()##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -1.3632, Lag order = 8, p-value = 0.8478
## alternative hypothesis: stationary
Based on Augmented Dickey-Fuller Test (adf.test) result, the p-value > alpha and therefore we need to perform differencing on the data first.
train %>%
diff(lag = 12) %>%
diff() %>%
adf.test()## Warning in adf.test(.): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -5.1044, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
and after differencing the p-value is < alpha and therefore the data is already stationary.
train %>%
diff(lag = 12) %>%
diff() %>%
tsdisplay()create arima model
we will create 2 model with arima. model the one of is manual tuning model and second one is auto tuning model
arim = Arima(train, order = c(0,1,0), seasonal = c(0,1,0)) #
auto_arim = auto.arima(train)
summary(auto_arim)## Series: train
## ARIMA(1,1,1)(1,1,1)[12]
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.9738 -0.6237 0.0720 -0.7699
## s.e. 0.0102 0.0328 0.0667 0.0460
##
## sigma^2 estimated as 692.5: log likelihood=-2896.91
## AIC=5803.83 AICc=5803.91 BIC=5826.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.6654741 25.99739 10.148 -0.0003652596 0.004764247 0.003866058
## ACF1
## Training set 0.105954
we can see arima with auto tuning have order = c(1,1,1) and seasonal = c(1,1,1) and that is diferent from with model manual tuning
check AIC
arim$aic## [1] 6324.178
auto_arim$aic## [1] 5803.826
forecast
forecast model 9 years a head
arim_fc = forecast(arim, 12*9)
auto_arim_fc = forecast(arim, 12*9)evaluation
train %>%
autoplot()+
autolayer(arim_fc, series = "Arima Forcast")+
autolayer(auto_arim_fc, series = "Auto Arima Forcast")c = test %>%
autoplot()+
autolayer(arim_fc, series = "model non-tuning")
d = test %>%
autoplot()+
autolayer(auto_arim_fc, series = "model with tuning")
grid.arrange(c,d) ### check accuracy MAPE
accuracy(arim_fc, test)## ME RMSE MAE MPE MAPE
## Training set -1.242852 30.86293 16.51099 -0.0005525104 0.007544183
## Test set 323.717833 377.08267 327.21109 0.1002121992 0.101334354
## MASE ACF1 Theil's U
## Training set 0.006290153 0.4086018 NA
## Test set 0.124656817 0.9389123 2.05404
accuracy(auto_arim_fc, test)## ME RMSE MAE MPE MAPE
## Training set -1.242852 30.86293 16.51099 -0.0005525104 0.007544183
## Test set 323.717833 377.08267 327.21109 0.1002121992 0.101334354
## MASE ACF1 Theil's U
## Training set 0.006290153 0.4086018 NA
## Test set 0.124656817 0.9389123 2.05404
from the analys above, we can conclude that we have sucsesfully forecast population increase and MAPE score is better then use HoltWinter
Assumption Check
shapiro
with shapiro test we can check a residuals is normaly distributed or not, if p-value < 0.05 that mean the forecast residuals are not normaly distributed and then if p-value > 0.05 that mean residuals are normally distributed
shapiro.test(arim_fc$residuals)##
## Shapiro-Wilk normality test
##
## data: arim_fc$residuals
## W = 0.60971, p-value < 2.2e-16
hist(arim_fc$residuals, breaks = 10)shapiro.test(auto_arim_fc$residuals)##
## Shapiro-Wilk normality test
##
## data: auto_arim_fc$residuals
## W = 0.60971, p-value < 2.2e-16
hist(auto_arim_fc$residuals, breaks = 10)we can assumption the forecast residuals are not normally distributed
Ljung-Box
with Ljung-Box we can determine forecast have autocorrelation in the forecast errors or not, if p-value < 0.05 that mean there is an autocorrelation in the forecast errors and if p-value > 0.05 that mean No autocorrelation in the forecast errors
Box.test(arim_fc$residuals, type = "Ljung-Box")##
## Box-Ljung test
##
## data: arim_fc$residuals
## X-squared = 118.71, df = 1, p-value < 2.2e-16
Box.test(auto_arim_fc$residuals, type = "Ljung-Box")##
## Box-Ljung test
##
## data: auto_arim_fc$residuals
## X-squared = 118.71, df = 1, p-value < 2.2e-16
Based on the assumption check, in 2 model arima, there is an autocorrelation on our forecast residuals (p-value < 0.05) and our forecast’s residuals are not distributed normally (p-value < 0.05).therefore it’s residuals may not be appeared around its mean as seen in the histogram, But, if we inspect the distribution of residuals through a line plot, it is actually resembles the error plot from our time series object decomposition.
In a time series, such errors might emerge from various unpredictable events and is actually quite unavoidable. One strategy to overcome it is to analyze what kinds of unpredictable events that might occur and occurs frequently. This can be done by time series analysis using seasonality adjustment. in population growth mybe sometime we dont know how much birthrate at some place and the causes