Getting Data

First, we read the data from the Excel file into a data frame using read.xlsx2() from the xlsx package.

library(xlsx)  # provides read.xlsx2()

# Read the first sheet, forcing the column types so SeriesInd stays integer and group stays character
data <- read.xlsx2("/Users/devinteran/MSinDS/DATA624/DATA624/Project1/Data Set for Class.xls",sheetIndex = 1,colClasses = c('integer','character','double','double','double','double','double'),stringsAsFactors=FALSE)

Exploratory Data Analysis

Our data represents 6 different groups (S01 through S06); within each group we are going to forecast future values using specified variables from the remaining columns. We have 1622 data points within each group and will be forecasting 140 series points; the group counts are verified below. Here is a description of each column:

SeriesInd - this provides the time element in our data. It increments with a frequency of 1.
group - this identifies which series an observation belongs to; we will produce a forecast for each group.
Var01, Var02, Var03, Var05, Var07 - these are to be used in different combinations in order to forecast 140 future values within each group
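
As a quick sanity check on the group structure (a minimal sketch using the data frame loaded above), we can count the observations per group:

# Each of the groups S01-S06 should contain the same number of rows
table(data$group)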

data[1:15,] %>%
  kbl() %>%
  kable_minimal()
SeriesInd  group  Var01     Var02      Var03     Var05     Var07
40669      S03    30.64286  123432400  30.34000  30.49000  30.57286
40669      S02    10.28000  60855800   10.05000  10.17000  10.28000
40669      S01    26.61000  10369300   25.89000  26.20000  26.01000
40669      S06    27.48000  39335700   26.82000  27.02000  27.32000
40669      S05    69.26000  27809100   68.19000  68.72000  69.15000
40669      S04    17.20000  16587400   16.88000  16.94000  17.10000
40670      S03    30.79857  150476200  30.46428  30.65714  30.62571
40670      S02    11.24000  215620200  10.40000  10.45000  10.96000
40670      S01    26.30000  10943800   25.70000  25.95000  25.86000
40670      S06    28.24000  55416000   27.24000  27.27000  28.07000
40670      S05    69.45000  30174700   68.80000  69.19000  69.42000
40670      S04    17.23000  11718100   17.00000  17.22000  17.23000
40671      S03    30.74714  138040000  30.10714  30.62571  30.13857
40671      S02    11.46000  200070600  11.13000  11.21000  11.37000
40671      S01    26.03000  8933800    25.56000  25.90000  25.67000

Cleaning Data

# The custom origin maps SeriesInd onto calendar dates (40669 -> 2010-01-04, a Monday)
S01 <- data %>% 
  mutate(date = as.Date(SeriesInd, origin = '1898-08-30')) %>%
  filter(SeriesInd < 43022 & group == 'S01') %>%   # keep the historical rows; later SeriesInd values are the periods to forecast
  select(SeriesInd, date, Var01, Var02)
S01[1:15,] %>%
  kbl() %>%
  kable_minimal()
SeriesInd  date        Var01  Var02
40669      2010-01-04  26.61  10369300
40670      2010-01-05  26.30  10943800
40671      2010-01-06  26.03  8933800
40672      2010-01-07  25.84  10775400
40673      2010-01-08  26.34  12875600
40676      2010-01-11  26.49  11677000
40677      2010-01-12  26.03  21165300
40678      2010-01-13  25.16  18809200
40679      2010-01-14  25.00  22908400
40680      2010-01-15  24.77  20359100
40684      2010-01-19  24.78  11783800
40685      2010-01-20  24.61  13253500
40686      2010-01-21  24.88  18407800
40687      2010-01-22  24.17  21806400
40690      2010-01-25  23.82  20192100
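
The non-standard origin passed to as.Date() above was chosen so that the SeriesInd values land on the trading dates shown in the table; a one-line check confirms the mapping:

# The first SeriesInd value should map to the first date in the table above
as.Date(40669, origin = '1898-08-30')  # "2010-01-04"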

Missing Data

There are 2 rows within group S01 that are missing a Var01 value.

library(visdat)
vis_dat(S01)

# Inspect the rows with missing values
S01[is.na(S01$Var01) | is.na(S01$Var02),]

# Impute with a nearby observed value (row 1536) - a simple last-observation-carried-forward fix
prev_value <- S01[1536, 'Var01']
S01_clean <- S01
S01_clean[is.na(S01_clean$Var01) | is.na(S01_clean$Var02),]$Var01 <- prev_value
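
The hard-coded row index works here because the two missing rows sit together, but a more general approach is last-observation-carried-forward imputation (a sketch using the zoo package, which this report does not otherwise load):

library(zoo)

# na.locf() replaces each NA with the most recent non-missing value
S01_locf <- S01
S01_locf$Var01 <- na.locf(S01_locf$Var01, na.rm = FALSE)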

Visualize Data

require(scales)
## Loading required package: scales
S01_Var01 <- ggplot(S01_clean, aes(x=Var01)) +
  geom_histogram(bins=50,color='black',fill="#7fc97f") + 
  ggtitle("Group S01 Data") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45))
S01_Var01_boxplot <- ggplot(S01_clean, aes(x=Var01)) +
  geom_boxplot(color='black',fill="#7fc97f") + 
  ggtitle("") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme(axis.text.x = element_text(angle = 45))
S01_Var02 <- ggplot(S01_clean, aes(x=Var02)) +
  geom_histogram(bins=30,color='black',fill='#fdc086') + 
  scale_x_continuous(labels = comma) + 
  theme(axis.text.x = element_text(angle = 45))
S01_Var02_boxplot <- ggplot(S01_clean, aes(x=Var02)) +
  geom_boxplot(color='black',fill="#fdc086") + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_x_continuous(labels = comma) + 
  theme(axis.text.x = element_text(angle = 45))
library(gridExtra)  # grid.arrange() comes from gridExtra
grid.arrange(S01_Var01, S01_Var01_boxplot, S01_Var02, S01_Var02_boxplot, nrow = 2, widths = 2:1)

Decomposition - Trend, Seasonal, Error

# Use the cleaned series so the two imputed gaps do not produce NAs in the moving-average trend.
# Note: supplying both start and end truncates the series to 4 years (1461 points at frequency 365).
t <- ts(S01_clean$Var01, start = 2010, end = 2014, frequency = 365)
t %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition of S01 data - Variable 01")

t <- ts(S01_clean$Var02, start = 2010, end = 2014, frequency = 365)
t %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition of S01 data - Variable 02")

Time, ACF and PACF Plots

Variable 01

ggtsdisplay(S01_clean$Var01)

  • Variable 01 has a clear trend and is non-stationary
  • The ACF plot shows an initial large autocorrelation value that gradually decreases as the lag increases, which indicates a trend
  • The PACF shows a single significant spike at lag 1, meaning the first lag explains essentially all of the correlation at higher lags. This is consistent with a persistent trend (see the ndiffs() check below)
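
These observations suggest the series needs at least one difference. The forecast package's ndiffs() automates this decision using the same KPSS test we run next (a quick sketch, assuming the cleaned series from above):

ndiffs(S01_clean$Var01)  # returns the number of first differences required; we expect 1 here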

Differencing Variable 01 Data

Trend

From our previous plots we can see a clear trend. Let’s run a Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test in order to confirm this. The KPSS null hypothesis is that the Variable 01 data is stationary, and we are looking to reject it. To reject the null, the test statistic must be greater than the 5% critical value of 0.463.

S01_clean$Var01 %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 8 lags. 
## 
## Value of test-statistic is: 16.0591 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Here we see a test statistic of 16.0591, which is far above 0.463, so we reject the null hypothesis that the Variable 01 data is stationary.

The KPSS test confirmed that the Variable 01 data is non-stationary, so we difference it in an attempt to make it stationary. Here we difference the data and run the KPSS test again. The test statistic is now far below the critical value, so we fail to reject the null hypothesis: the once-differenced Variable 01 data can be treated as stationary.

# Pad with a leading 0 so the differenced series keeps the same length as the original
Var01_diff <- c(0, S01_clean$Var01 %>% diff())
S01_clean$Var01_diff <- Var01_diff

S01_clean$Var01_diff %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 8 lags. 
## 
## Value of test-statistic is: 0.1003 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Seasonal

To check whether seasonal differencing is necessary we will run the function nsdiffs(). It returns the number of seasonal differences required; a value of 0 means no seasonal differencing is needed. (Internally, the default test recommends a seasonal difference only when the measured seasonal strength of the series exceeds 0.64.)

t <- ts(S01_clean$Var01,frequency=365)
nsdiffs(t)
## [1] 0

No seasonal differencing is needed since the returned value is 0. This is interesting, since visually it looked like there was seasonality in the decomposition plot above; however, classical decomposition always extracts a seasonal component at the chosen frequency, whether or not that component is strong.
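
We can approximate the check nsdiffs() performs by computing the seasonal strength ourselves (a sketch using an STL decomposition; the decomposition nsdiffs() uses internally may differ slightly):

# Seasonal strength = max(0, 1 - Var(remainder) / Var(seasonal + remainder));
# nsdiffs() suggests a seasonal difference only when this exceeds 0.64
t <- ts(S01_clean$Var01, frequency = 365)
components <- stl(t, s.window = "periodic")$time.series
remainder <- components[, "remainder"]
seasonal  <- components[, "seasonal"]
max(0, 1 - var(remainder) / var(seasonal + remainder))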

Variable 02

ggtsdisplay(S01_clean$Var02)

  • Variable 02 is a little less clear but appears to have a negative trend and is non-stationary
  • The ACF plot shows an initial large autocorrelation value that gradually decreases as the lag increases, which indicates a trend
  • The PACF shows a significant spike at lag 1. There may also be some seasonality, since autocorrelations rise slightly at higher lags

Differencing Variable 02 Data

Trend

Just like we did with Variable 01, we run the KPSS test, looking to reject the null hypothesis that Variable 02 is stationary. To do this the KPSS test statistic must be larger than 0.463, which it is. We reject the hypothesis that the Variable 02 data is stationary, which matches our earlier reading of the plots.

S01_clean$Var02 %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 8 lags. 
## 
## Value of test-statistic is: 10.3302 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS test confirmed that the Variable 02 data is also non-stationary, so we need to difference it as well. Here we difference the data and run the KPSS test again. The test statistic is now far below the critical value, so we fail to reject the null hypothesis: the once-differenced Variable 02 data can be treated as stationary.

# Pad with a leading 0 so the differenced series keeps the same length as the original
Var02_diff <- c(0, S01_clean$Var02 %>% diff())
S01_clean$Var02_diff <- Var02_diff

S01_clean$Var02_diff %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 8 lags. 
## 
## Value of test-statistic is: 0.0042 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Seasonal

As with Variable 01, we check whether seasonal differencing is necessary by running nsdiffs(); it returns the number of seasonal differences required, so a value of 0 means none is needed.

t <- ts(S01_clean$Var02,frequency=365)
nsdiffs(t)
## [1] 0

No seasonal differencing is needed since the value is equal to 0.

Splitting Data into Test & Training

Here we will use 80% of our data as training data and will reserve 20% of our data for testing.

# Chronological 80/20 split - no shuffling, since observation order matters for time series
break_num <- floor(nrow(S01_clean) * 0.8)
train <- S01_clean[1:break_num, ]
test  <- S01_clean[(break_num + 1):nrow(S01_clean), ]
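
A quick look at the resulting partition sizes (a trivial check, included so the forecast horizons used below are explicit):

# Number of rows reserved for fitting vs. evaluation
c(train = nrow(train), test = nrow(test))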

Model Fitting

Auto Arima - Variable 01

The first model we will test for Variable 01 is auto.arima() with the order of differencing fixed at d = 1, as supported by the KPSS results above.

S01_train_Var01 <- ts(select(train,Var01))

fit_var01_auto <- auto.arima(S01_train_Var01,d=1)
summary(fit_var01_auto)
## Series: S01_train_Var01 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.0845  0.0254
## s.e.  0.0286  0.0130
## 
## sigma^2 estimated as 0.1859:  log likelihood=-747.66
## AIC=1501.32   AICc=1501.34   BIC=1516.83
## 
## Training set error measures:
##                        ME      RMSE       MAE         MPE      MAPE      MASE
## Training set 2.312112e-05 0.4306647 0.3040085 -0.01941244 0.8958063 0.9926807
##                      ACF1
## Training set -0.002772543

ETS - Variable 01

Second we will use the exponential smoothing (ETS) model.

fit_var01_ets <- ets(S01_train_Var01)
summary(fit_var01_ets)
## ETS(A,A,N) 
## 
## Call:
##  ets(y = S01_train_Var01) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 26.4127 
##     b = 0.0254 
## 
##   sigma:  0.4328
## 
##      AIC     AICc      BIC 
## 7130.325 7130.371 7156.164 
## 
## Training set error measures:
##                        ME      RMSE       MAE         MPE      MAPE    MASE
## Training set 0.0007689551 0.4321495 0.3061459 -0.01885709 0.9020201 0.99966
##                    ACF1
## Training set 0.07834262

Forecast on Test Data - Variable 01

forecast_auto_var01 <- fit_var01_auto %>% forecast(nrow(test))  # forecast horizon = size of the test set
accuracy(forecast_auto_var01, test$Var01)
##                         ME      RMSE       MAE          MPE       MAPE
## Training set  2.312112e-05 0.4306647 0.3040085  -0.01941244  0.8958063
## Test set     -8.699160e+00 9.5792251 8.7037779 -16.35612988 16.3638362
##                    MASE         ACF1
## Training set  0.9926807 -0.002772543
## Test set     28.4204991           NA
forecast_ets_var01 <- fit_var01_ets %>% forecast(nrow(test))
accuracy(forecast_ets_var01, test$Var01)
##                         ME      RMSE       MAE          MPE       MAPE     MASE
## Training set  0.0007689551 0.4321495 0.3061459  -0.01885709  0.9020201  0.99966
## Test set     -8.7295616713 9.6092604 8.7341364 -16.41130797 16.4189421 28.51963
##                    ACF1
## Training set 0.07834262
## Test set             NA

Auto Arima - Variable 02

The first model we will test for Variable 02 is auto.arima() with the order of differencing fixed at d = 1.

S01_train_Var02 <- ts(select(train,Var02))

fit_var02_auto <- auto.arima(S01_train_Var02,d=1)
summary(fit_var02_auto)
## Series: S01_train_Var02 
## ARIMA(1,1,3) 
## 
## Coefficients:
##          ar1      ma1     ma2     ma3
##       0.8498  -1.4789  0.3779  0.1086
## s.e.  0.0587   0.0667  0.0537  0.0417
## 
## sigma^2 estimated as 1.197e+13:  log likelihood=-21351.15
## AIC=42712.31   AICc=42712.35   BIC=42738.14
## 
## Training set error measures:
##                     ME    RMSE     MAE       MPE     MAPE      MASE        ACF1
## Training set -128388.4 3452637 2293964 -12.30189 27.04777 0.8620093 0.003851951

ETS - Variable 02

Second we will use the exponential smoothing (ETS) model.

fit_var02_ets <- ets(S01_train_Var02)
summary(fit_var02_ets)
## ETS(M,A,N) 
## 
## Call:
##  ets(y = S01_train_Var02) 
## 
##   Smoothing parameters:
##     alpha = 0.8174 
##     beta  = 0.0011 
## 
##   Initial states:
##     l = 7092879.6943 
##     b = 1533385.7915 
## 
##   sigma:  0.3611
## 
##      AIC     AICc      BIC 
## 48309.00 48309.04 48334.84 
## 
## Training set error measures:
##                     ME    RMSE     MAE       MPE     MAPE     MASE       ACF1
## Training set -901614.9 3993393 2691723 -17.35243 30.70294 1.011477 -0.2385468

Forecast on Test Data - Variable 02

forecast_auto_var02 <- fit_var02_auto %>% forecast(nrow(test))
accuracy(forecast_auto_var02, test$Var02)
##                     ME    RMSE     MAE       MPE     MAPE      MASE        ACF1
## Training set -128388.4 3452637 2293964 -12.30189 27.04777 0.8620093 0.003851951
## Test set      530138.9 3208081 2161143 -10.00262 34.96505 0.8120989          NA
forecast_ets_var02 <- fit_var02_ets %>% forecast(nrow(test))
accuracy(forecast_ets_var02, test$Var02)
##                       ME     RMSE      MAE        MPE      MAPE      MASE
## Training set   -901614.9  3993393  2691723  -17.35243  30.70294  1.011477
## Test set     -44859673.6 51987522 44898015 -885.66149 885.82828 16.871455
##                    ACF1
## Training set -0.2385468
## Test set             NA

Residuals of Models

checkresiduals(fit_var01_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 11.952, df = 8, p-value = 0.1534
## 
## Model df: 2.   Total lags used: 10
checkresiduals(fit_var02_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,3)
## Q* = 20.061, df = 6, p-value = 0.002702
## 
## Model df: 4.   Total lags used: 10

For the Variable 01 model, the Ljung-Box p-value of 0.1534 gives no evidence against the residuals being white noise, so the ARIMA(0,1,1) model has captured the autocorrelation in the series. For the Variable 02 model, the p-value of 0.002702 indicates that some autocorrelation remains in the residuals, so its forecasts should be interpreted with more caution.

Forecasting out 140 Periods

Both auto.arima models produced the lowest test-set MAPE values. For Variable 01 the selected model was ARIMA(0,1,1) with drift; for Variable 02 it was ARIMA(1,1,3).
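
To make this comparison explicit, we can collect the test-set MAPE from each accuracy() call above (a small sketch reusing the forecast objects already defined):

# Gather test-set MAPE for each fitted model; lower is better
mape <- c(
  ARIMA_Var01 = accuracy(forecast_auto_var01, test$Var01)["Test set", "MAPE"],
  ETS_Var01   = accuracy(forecast_ets_var01,  test$Var01)["Test set", "MAPE"],
  ARIMA_Var02 = accuracy(forecast_auto_var02, test$Var02)["Test set", "MAPE"],
  ETS_Var02   = accuracy(forecast_ets_var02,  test$Var02)["Test set", "MAPE"]
)
sort(mape)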

Now we will forecast out 140 periods using our selected models for Variable 01 and 02.

final_forecast_auto_var01 <- fit_var01_auto %>% forecast(140)
autoplot(final_forecast_auto_var01)

final_forecast_auto_var02 <- fit_var02_auto %>% forecast(140)
autoplot(final_forecast_auto_var02)