First, we will read the data from the xls file into a dataframe.
# read.xlsx2() comes from the xlsx package; the other packages used below (dplyr, kableExtra,
# ggplot2, gridExtra, forecast, urca) are assumed to be loaded in a setup chunk
library(xlsx)
data <- read.xlsx2("/Users/devinteran/MSinDS/DATA624/DATA624/Project1/Data Set for Class.xls",
                   sheetIndex = 1,
                   colClasses = c('integer','character','double','double','double','double','double'),
                   stringsAsFactors = FALSE)
Our data represents six different groups (S01 through S06), and we will forecast data within each of these groups using specified variables (a quick check of this structure is sketched after the column descriptions). We have 1622 data points within each group and will be forecasting 140 series points. Here is a description of each column:
SeriesInd - this provides the time element in our data. It increases in steps of 1 (a frequency of 1).
group - the series identifier; forecasts are produced separately for each group.
Var01, Var02, Var03, Var05, Var07 - these are used in different combinations in order to forecast 140 series points within each group.
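A quick way to confirm this structure (a sketch using dplyr, which the later code already relies on) is to count the observations per group:
# Sketch: confirm the number of groups and the number of observations in each
library(dplyr)
data %>%
  count(group)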
data[1:15,] %>%
kbl() %>%
kable_minimal()
| SeriesInd | group | Var01 | Var02 | Var03 | Var05 | Var07 |
|---|---|---|---|---|---|---|
| 40669 | S03 | 30.64286 | 123432400 | 30.34000 | 30.49000 | 30.57286 |
| 40669 | S02 | 10.28000 | 60855800 | 10.05000 | 10.17000 | 10.28000 |
| 40669 | S01 | 26.61000 | 10369300 | 25.89000 | 26.20000 | 26.01000 |
| 40669 | S06 | 27.48000 | 39335700 | 26.82000 | 27.02000 | 27.32000 |
| 40669 | S05 | 69.26000 | 27809100 | 68.19000 | 68.72000 | 69.15000 |
| 40669 | S04 | 17.20000 | 16587400 | 16.88000 | 16.94000 | 17.10000 |
| 40670 | S03 | 30.79857 | 150476200 | 30.46428 | 30.65714 | 30.62571 |
| 40670 | S02 | 11.24000 | 215620200 | 10.40000 | 10.45000 | 10.96000 |
| 40670 | S01 | 26.30000 | 10943800 | 25.70000 | 25.95000 | 25.86000 |
| 40670 | S06 | 28.24000 | 55416000 | 27.24000 | 27.27000 | 28.07000 |
| 40670 | S05 | 69.45000 | 30174700 | 68.80000 | 69.19000 | 69.42000 |
| 40670 | S04 | 17.23000 | 11718100 | 17.00000 | 17.22000 | 17.23000 |
| 40671 | S03 | 30.74714 | 138040000 | 30.10714 | 30.62571 | 30.13857 |
| 40671 | S02 | 11.46000 | 200070600 | 11.13000 | 11.21000 | 11.37000 |
| 40671 | S01 | 26.03000 | 8933800 | 25.56000 | 25.90000 | 25.67000 |
S01 <- data %>%
mutate(date = as.Date(SeriesInd, origin = '1898-08-30')) %>%
filter(SeriesInd < 43022 & group == 'S01') %>%
select (SeriesInd,date,Var01,Var02)
S01[1:15,] %>%
kbl() %>%
kable_minimal()
| SeriesInd | date | Var01 | Var02 |
|---|---|---|---|
| 40669 | 2010-01-04 | 26.61 | 10369300 |
| 40670 | 2010-01-05 | 26.30 | 10943800 |
| 40671 | 2010-01-06 | 26.03 | 8933800 |
| 40672 | 2010-01-07 | 25.84 | 10775400 |
| 40673 | 2010-01-08 | 26.34 | 12875600 |
| 40676 | 2010-01-11 | 26.49 | 11677000 |
| 40677 | 2010-01-12 | 26.03 | 21165300 |
| 40678 | 2010-01-13 | 25.16 | 18809200 |
| 40679 | 2010-01-14 | 25.00 | 22908400 |
| 40680 | 2010-01-15 | 24.77 | 20359100 |
| 40684 | 2010-01-19 | 24.78 | 11783800 |
| 40685 | 2010-01-20 | 24.61 | 13253500 |
| 40686 | 2010-01-21 | 24.88 | 18407800 |
| 40687 | 2010-01-22 | 24.17 | 21806400 |
| 40690 | 2010-01-25 | 23.82 | 20192100 |
There are 2 rows within group S01 that are missing a Var01 value.
library(visdat)
vis_dat(S01)   # visualize column types and missing values
S01[is.na(S01$Var01) | is.na(S01$Var02),]   # show the rows with missing values
# Fill the missing Var01 values with the last observed value before the gap (row 1536)
prev_value <- S01[c(1536),'Var01']
S01_clean <- S01
S01_clean[is.na(S01_clean$Var01) | is.na(S01_clean$Var02),]$Var01 <- prev_value
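A more general alternative (a sketch only, not the approach used above) is last-observation-carried-forward imputation with tidyr::fill, which would handle any number of interior gaps:
# Sketch: carry the previous observed value forward into any NA gaps
library(tidyr)
S01_locf <- S01 %>%
  fill(Var01, Var02, .direction = "down")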
require(scales)
## Loading required package: scales
S01_Var01 <- ggplot(S01_clean, aes(x=Var01)) +
geom_histogram(bins=50,color='black',fill="#7fc97f") +
ggtitle("Group S01 Data") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45))
S01_Var01_boxplot <- ggplot(S01_clean, aes(x=Var01)) +
geom_boxplot(color='black',fill="#7fc97f") +
ggtitle("") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45))
S01_Var02 <- ggplot(S01_clean, aes(x=Var02)) +
geom_histogram(bins=30,color='black',fill='#fdc086') +
scale_x_continuous(labels = comma) +
theme(axis.text.x = element_text(angle = 45))
S01_Var02_boxplot <- ggplot(S01_clean, aes(x=Var02)) +
geom_boxplot(color='black',fill="#fdc086") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(labels = comma) +
theme(axis.text.x = element_text(angle = 45))
grid.arrange(S01_Var01,S01_Var01_boxplot,S01_Var02,S01_Var02_boxplot,nrow=2,widths=2:1)
# Treat Var01 as a daily series (frequency 365) and apply a classical multiplicative decomposition
t <- ts(S01$Var01, start = 2010, end = 2014, frequency = 365)
t %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition of S01 data - Variable 01")
# Repeat the decomposition for Var02
t <- ts(S01$Var02, start = 2010, end = 2014, frequency = 365)
t %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition of S01 data - Variable 02")
ggtsdisplay(S01_clean$Var01)
From our previous plots we can see a clear trend. Let's run a Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test in order to confirm this. The null hypothesis of this test is that the Variable 01 data is stationary, and we are looking to reject it. To reject the null hypothesis, our test statistic must be greater than the 5% critical value of 0.463.
S01_clean$Var01 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 16.0591
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Here we see a test statistic of 16.0591, which is well above 0.463, so we reject the null hypothesis that the Variable 01 data is stationary.
Since the KPSS test confirmed that the Variable 01 data is non-stationary, we need to difference it in an attempt to make it stationary. Here we difference the data once and run the KPSS test again. The test statistic is now much lower (0.1003 < 0.463), so we fail to reject the hypothesis that the once-differenced Variable 01 data is stationary.
# Difference Var01 once; pad with a leading 0 so the result has the same length as the original
Var01_diff <- c(0, S01_clean$Var01 %>% diff())
S01_clean$Var01_diff <- Var01_diff
S01_clean$Var01_diff %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 0.1003
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
To check whether seasonal differencing is necessary we will run the function nsdiffs(). It returns the number of seasonal differences required: internally it estimates the strength of seasonality and suggests a seasonal difference when that strength exceeds 0.64. A returned value of 0 means no seasonal differencing is needed.
t <- ts(S01_clean$Var01,frequency=365)
nsdiffs(t)
## [1] 0
No seasonal differencing is needed since the returned value is 0. This is interesting, since visually the decomposition plot above appeared to show seasonality. The number of regular differences can be checked in the same way, as sketched below.
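The complementary check for regular (non-seasonal) differencing is ndiffs(); a short sketch, which we would expect to return 1 given the KPSS results above:
# Sketch: estimate how many regular differences are needed (expected to return 1 for Var01)
ndiffs(ts(S01_clean$Var01))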
ggtsdisplay(S01_clean$Var02)
Just like with Variable 01, we run the KPSS test on Variable 02, looking to reject the null hypothesis that the data is stationary. To do so the test statistic must exceed the 5% critical value of 0.463, which it does (10.3302). We therefore reject the null hypothesis that the Variable 02 data is stationary, which matches our earlier observation of a trend.
S01_clean$Var02 %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 10.3302
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The KPSS test confirmed that the Variable 02 data is also non-stationary, so we need to difference it as well. Here we difference the data once and run the KPSS test again. The test statistic is now much lower (0.0042 < 0.463), so we fail to reject the hypothesis that the once-differenced Variable 02 data is stationary.
# Difference Var02 once; pad with a leading 0 so the result has the same length as the original
Var02_diff <- c(0, S01_clean$Var02 %>% diff())
S01_clean$Var02_diff <- Var02_diff
S01_clean$Var02_diff %>% ur.kpss() %>% summary()
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 8 lags.
##
## Value of test-statistic is: 0.0042
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
As with Variable 01, we check whether seasonal differencing is necessary by running nsdiffs(); a returned value greater than 0 would indicate that a seasonal difference is needed.
t <- ts(S01_clean$Var02,frequency=365)
nsdiffs(t)
## [1] 0
No seasonal differencing is needed since the value is equal to 0.
Here we will use 80% of our data as training data and will reserve 20% of our data for testing.
break_num <- floor(nrow(S01_clean) * 0.8)            # index of the last training observation
train <- S01_clean[1:break_num, ]
test  <- S01_clean[(break_num + 1):nrow(S01_clean), ]
The first model we will test for Variable 01 is the auto.arima model with 1 difference.
S01_train_Var01 <- ts(select(train,Var01))
fit_var01_auto <- auto.arima(S01_train_Var01,d=1)
summary(fit_var01_auto)
## Series: S01_train_Var01
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## 0.0845 0.0254
## s.e. 0.0286 0.0130
##
## sigma^2 estimated as 0.1859: log likelihood=-747.66
## AIC=1501.32 AICc=1501.34 BIC=1516.83
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2.312112e-05 0.4306647 0.3040085 -0.01941244 0.8958063 0.9926807
## ACF1
## Training set -0.002772543
Second, we will fit an exponential smoothing (ETS) model.
fit_var01_ets <- ets(S01_train_Var01)
summary(fit_var01_ets)
## ETS(A,A,N)
##
## Call:
## ets(y = S01_train_Var01)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 1e-04
##
## Initial states:
## l = 26.4127
## b = 0.0254
##
## sigma: 0.4328
##
## AIC AICc BIC
## 7130.325 7130.371 7156.164
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0007689551 0.4321495 0.3061459 -0.01885709 0.9020201 0.99966
## ACF1
## Training set 0.07834262
forecast_auto_var01 <- fit_var01_auto %>% forecast(nrow(test))   # forecast over the test horizon
accuracy(forecast_auto_var01,S01_clean[1298:nrow(S01_clean),c('Var01')])
## ME RMSE MAE MPE MAPE
## Training set 2.312112e-05 0.4306647 0.3040085 -0.01941244 0.8958063
## Test set -8.699160e+00 9.5792251 8.7037779 -16.35612988 16.3638362
## MASE ACF1
## Training set 0.9926807 -0.002772543
## Test set 28.4204991 NA
forecast_ets_var01 <- fit_var01_ets %>% forecast(nrow(test))   # forecast over the test horizon
accuracy(forecast_ets_var01,S01_clean[1298:nrow(S01_clean),c('Var01')])
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0007689551 0.4321495 0.3061459 -0.01885709 0.9020201 0.99966
## Test set -8.7295616713 9.6092604 8.7341364 -16.41130797 16.4189421 28.51963
## ACF1
## Training set 0.07834262
## Test set NA
The first model we will test for Variable 02 is the auto.arima model with 1 difference.
S01_train_Var02 <- ts(select(train,Var02))
fit_var02_auto <- auto.arima(S01_train_Var02,d=1)
summary(fit_var02_auto)
## Series: S01_train_Var02
## ARIMA(1,1,3)
##
## Coefficients:
## ar1 ma1 ma2 ma3
## 0.8498 -1.4789 0.3779 0.1086
## s.e. 0.0587 0.0667 0.0537 0.0417
##
## sigma^2 estimated as 1.197e+13: log likelihood=-21351.15
## AIC=42712.31 AICc=42712.35 BIC=42738.14
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -128388.4 3452637 2293964 -12.30189 27.04777 0.8620093 0.003851951
Second, we will fit an exponential smoothing (ETS) model.
fit_var02_ets <- ets(S01_train_Var02)
summary(fit_var02_ets)
## ETS(M,A,N)
##
## Call:
## ets(y = S01_train_Var02)
##
## Smoothing parameters:
## alpha = 0.8174
## beta = 0.0011
##
## Initial states:
## l = 7092879.6943
## b = 1533385.7915
##
## sigma: 0.3611
##
## AIC AICc BIC
## 48309.00 48309.04 48334.84
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -901614.9 3993393 2691723 -17.35243 30.70294 1.011477 -0.2385468
forecast_auto_var02 <- fit_var02_auto %>% forecast(nrow(test))
accuracy(forecast_auto_var02,S01_clean[1298:nrow(S01_clean),c('Var02')])
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -128388.4 3452637 2293964 -12.30189 27.04777 0.8620093 0.003851951
## Test set 530138.9 3208081 2161143 -10.00262 34.96505 0.8120989 NA
forecast_ets_var02 <- fit_var02_ets %>% forecast(nrow(test))
accuracy(forecast_ets_var02,S01_clean[1298:nrow(S01_clean),c('Var02')])
## ME RMSE MAE MPE MAPE MASE
## Training set -901614.9 3993393 2691723 -17.35243 30.70294 1.011477
## Test set -44859673.6 51987522 44898015 -885.66149 885.82828 16.871455
## ACF1
## Training set -0.2385468
## Test set NA
checkresiduals(fit_var01_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 11.952, df = 8, p-value = 0.1534
##
## Model df: 2. Total lags used: 10
checkresiduals(fit_var02_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,3)
## Q* = 20.061, df = 6, p-value = 0.002702
##
## Model df: 4. Total lags used: 10
Both auto.arima models produced the lowest test-set MAPE values. For Variable 01 the model that was produced is ARIMA(0,1,1) with drift, and for Variable 02 it is ARIMA(1,1,3). Note that the Ljung-Box test on the Variable 02 residuals (p = 0.0027) suggests some autocorrelation remains, but the ARIMA model still outperforms ETS on the test set. The test-set MAPEs are collected side by side in the sketch below.
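As a quick summary (a sketch only, re-using the forecast objects and the same test values as the accuracy calls above):
# Sketch: gather the test-set MAPE of each candidate model for a side-by-side comparison
test_var01 <- S01_clean[1298:nrow(S01_clean), 'Var01']
test_var02 <- S01_clean[1298:nrow(S01_clean), 'Var02']
data.frame(
  model = c("ARIMA Var01", "ETS Var01", "ARIMA Var02", "ETS Var02"),
  test_MAPE = c(
    accuracy(forecast_auto_var01, test_var01)["Test set", "MAPE"],
    accuracy(forecast_ets_var01,  test_var01)["Test set", "MAPE"],
    accuracy(forecast_auto_var02, test_var02)["Test set", "MAPE"],
    accuracy(forecast_ets_var02,  test_var02)["Test set", "MAPE"]
  )
)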
Now we will forecast out 140 periods using our selected models for Variables 01 and 02.
final_forecast_auto_var01 <- fit_var01_auto %>% forecast(140)
autoplot(final_forecast_auto_var01)
final_forecast_auto_var02 <- fit_var02_auto %>% forecast(140)
autoplot(final_forecast_auto_var02)
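Finally, a sketch of how the 140-period point forecasts could be collected for submission (the output file name is hypothetical, not part of the project specification):
# Sketch: combine the 140-period point forecasts into one data frame and write them out
S01_forecasts <- data.frame(
  period = 1:140,
  Var01  = as.numeric(final_forecast_auto_var01$mean),
  Var02  = as.numeric(final_forecast_auto_var02$mean)
)
write.csv(S01_forecasts, "S01_forecasts.csv", row.names = FALSE)   # hypothetical file name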