Getting Data

First, we read the data from the Excel file into a data frame using read.xlsx2() from the xlsx package.

library(xlsx)  # provides read.xlsx2()

# Read the first sheet, forcing the column types so SeriesInd stays integer and group stays character
data <- read.xlsx2("/Users/devinteran/MSinDS/DATA624/DATA624/Project1/Data Set for Class.xls",sheetIndex = 1,colClasses = c('integer','character','double','double','double','double','double'),stringsAsFactors=FALSE)

Exploratory Data Analysis

Our data represents 6 different groups (S01 through S06); within each group we are going to forecast future values using specified variables from the remaining columns. We have 1622 data points within each group and will be forecasting 140 series points; the group counts are verified below. Here is a description of each column:

SeriesInd - this provides the time element in our data. It increments with a frequency of 1.
group - this identifies which series an observation belongs to; we will produce a forecast for each group.
Var01, Var02, Var03, Var05, Var07 - these are to be used in different combinations in order to forecast 140 future values within each group
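
As a quick sanity check on the group structure (a minimal sketch using the data frame loaded above), we can count the observations per group:

# Each of the groups S01-S06 should contain the same number of rows
table(data$group)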

data[1:15,] %>%
  kbl() %>%
  kable_minimal()
SeriesInd  group  Var01     Var02      Var03     Var05     Var07
40669      S03    30.64286  123432400  30.34000  30.49000  30.57286
40669      S02    10.28000  60855800   10.05000  10.17000  10.28000
40669      S01    26.61000  10369300   25.89000  26.20000  26.01000
40669      S06    27.48000  39335700   26.82000  27.02000  27.32000
40669      S05    69.26000  27809100   68.19000  68.72000  69.15000
40669      S04    17.20000  16587400   16.88000  16.94000  17.10000
40670      S03    30.79857  150476200  30.46428  30.65714  30.62571
40670      S02    11.24000  215620200  10.40000  10.45000  10.96000
40670      S01    26.30000  10943800   25.70000  25.95000  25.86000
40670      S06    28.24000  55416000   27.24000  27.27000  28.07000
40670      S05    69.45000  30174700   68.80000  69.19000  69.42000
40670      S04    17.23000  11718100   17.00000  17.22000  17.23000
40671      S03    30.74714  138040000  30.10714  30.62571  30.13857
40671      S02    11.46000  200070600  11.13000  11.21000  11.37000
40671      S01    26.03000  8933800    25.56000  25.90000  25.67000

Cleaning Data

# The custom origin maps SeriesInd onto calendar dates (40669 -> 2010-01-04, a Monday)
S01 <- data %>% 
  mutate(date = as.Date(SeriesInd, origin = '1898-08-30')) %>%
  filter(SeriesInd < 43022 & group == 'S01') %>%   # keep the historical rows; later SeriesInd values are the periods to forecast
  select(SeriesInd, date, Var01, Var02)
S01[1:15,] %>%
  kbl() %>%
  kable_minimal()
SeriesInd  date        Var01  Var02
40669      2010-01-04  26.61  10369300
40670      2010-01-05  26.30  10943800
40671      2010-01-06  26.03  8933800
40672      2010-01-07  25.84  10775400
40673      2010-01-08  26.34  12875600
40676      2010-01-11  26.49  11677000
40677      2010-01-12  26.03  21165300
40678      2010-01-13  25.16  18809200
40679      2010-01-14  25.00  22908400
40680      2010-01-15  24.77  20359100
40684      2010-01-19  24.78  11783800
40685      2010-01-20  24.61  13253500
40686      2010-01-21  24.88  18407800
40687      2010-01-22  24.17  21806400
40690      2010-01-25  23.82  20192100
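
The non-standard origin passed to as.Date() above was chosen so that the SeriesInd values land on the trading dates shown in the table; a one-line check confirms the mapping:

# The first SeriesInd value should map to the first date in the table above
as.Date(40669, origin = '1898-08-30')  # "2010-01-04"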

Missing Data

There are 2 rows within group S01 that are missing a Var01 value.

library(visdat)
vis_dat(S01)

# Inspect the rows with missing values
S01[is.na(S01$Var01) | is.na(S01$Var02),]

# Impute with a nearby observed value (row 1536) - a simple last-observation-carried-forward fix
prev_value <- S01[1536, 'Var01']
S01_clean <- S01
S01_clean[is.na(S01_clean$Var01) | is.na(S01_clean$Var02),]$Var01 <- prev_value
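
The hard-coded row index works here because the two missing rows sit together, but a more general approach is last-observation-carried-forward imputation (a sketch using the zoo package, which this report does not otherwise load):

library(zoo)

# na.locf() replaces each NA with the most recent non-missing value
S01_locf <- S01
S01_locf$Var01 <- na.locf(S01_locf$Var01, na.rm = FALSE)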

Visualize Data

require(scales)
## Loading required package: scales
S01_Var01 <- ggplot(S01_clean, aes(x=Var01)) +
  geom_histogram(bins=50,color='black',fill="#7fc97f") + 
  ggtitle("Group S01 Data") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45))
S01_Var01_boxplot <- ggplot(S01_clean, aes(x=Var01)) +
  geom_boxplot(color='black',fill="#7fc97f") + 
  ggtitle("") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  theme(axis.text.x = element_text(angle = 45))
S01_Var02 <- ggplot(S01_clean, aes(x=Var02)) +
  geom_histogram(bins=30,color='black',fill='#fdc086') + 
  scale_x_continuous(labels = comma) + 
  theme(axis.text.x = element_text(angle = 45))
S01_Var02_boxplot <- ggplot(S01_clean, aes(x=Var02)) +
  geom_boxplot(color='black',fill="#fdc086") + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_x_continuous(labels = comma) + 
  theme(axis.text.x = element_text(angle = 45))
library(gridExtra)  # grid.arrange() comes from gridExtra
grid.arrange(S01_Var01, S01_Var01_boxplot, S01_Var02, S01_Var02_boxplot, nrow = 2, widths = 2:1)

Decomposition - Trend, Seasonal, Error

# Use the cleaned series so the two imputed gaps do not produce NAs in the moving-average trend.
# Note: supplying both start and end truncates the series to 4 years (1461 points at frequency 365).
t <- ts(S01_clean$Var01, start = 2010, end = 2014, frequency = 365)
t %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition of S01 data - Variable 01")

t <- ts(S01_clean$Var02, start = 2010, end = 2014, frequency = 365)
t %>% decompose(type="multiplicative") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical multiplicative decomposition of S01 data - Variable 02")

Time, ACF and PACF Plots

Variable 01

ggtsdisplay(S01_clean$Var01)

  • Variable 01 has a clear trend and is non-stationary
  • The ACF plot shows an initial large autocorrelation value that gradually decreases as the lag increases, which indicates a trend
  • The PACF shows a single significant spike at lag 1, meaning the first lag explains essentially all of the correlation at higher lags. This is consistent with a persistent trend (see the ndiffs() check below)
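
These observations suggest the series needs at least one difference. The forecast package's ndiffs() automates this decision using the same KPSS test we run next (a quick sketch, assuming the cleaned series from above):

ndiffs(S01_clean$Var01)  # returns the number of first differences required; we expect 1 here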

Differencing Variable 01 Data

Trend

From our previous plots we can see a clear trend. Let’s run a Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test in order to confirm this. The KPSS null hypothesis is that the Variable 01 data is stationary, and we are looking to reject it. To reject the null, the test statistic must be greater than the 5% critical value of 0.463.

S01_clean$Var01 %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 8 lags. 
## 
## Value of test-statistic is: 16.0591 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Here we see a test statistic of 16.0591, which is far above 0.463, so we reject the null hypothesis that the Variable 01 data is stationary.

The KPSS test confirmed that the Variable 01 data is non-stationary, so we difference it in an attempt to make it stationary. Here we difference the data and run the KPSS test again. The test statistic is now far below the critical value, so we fail to reject the null hypothesis: the once-differenced Variable 01 data can be treated as stationary.

# Pad with a leading 0 so the differenced series keeps the same length as the original
Var01_diff <- c(0, S01_clean$Var01 %>% diff())
S01_clean$Var01_diff <- Var01_diff

S01_clean$Var01_diff %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 8 lags. 
## 
## Value of test-statistic is: 0.1003 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Seasonal

To check whether seasonal differencing is necessary we will run the function nsdiffs(). It returns the number of seasonal differences required; a value of 0 means no seasonal differencing is needed. (Internally, the default test recommends a seasonal difference only when the measured seasonal strength of the series exceeds 0.64.)

t <- ts(S01_clean$Var01,frequency=365)
nsdiffs(t)
## [1] 0

No seasonal differencing is needed since the returned value is 0. This is interesting, since visually it looked like there was seasonality in the decomposition plot above; however, classical decomposition always extracts a seasonal component at the chosen frequency, whether or not that component is strong.
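
We can approximate the check nsdiffs() performs by computing the seasonal strength ourselves (a sketch using an STL decomposition; the decomposition nsdiffs() uses internally may differ slightly):

# Seasonal strength = max(0, 1 - Var(remainder) / Var(seasonal + remainder));
# nsdiffs() suggests a seasonal difference only when this exceeds 0.64
t <- ts(S01_clean$Var01, frequency = 365)
components <- stl(t, s.window = "periodic")$time.series
remainder <- components[, "remainder"]
seasonal  <- components[, "seasonal"]
max(0, 1 - var(remainder) / var(seasonal + remainder))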

Variable 02

ggtsdisplay(S01_clean$Var02)

  • Variable 02 is a little less clear but appears to have a negative trend and is non-stationary
  • The ACF plot shows an initial large autocorrelation value that gradually decreases as the lag increases, which indicates a trend
  • The PACF shows a significant spike at lag 1. There may also be some seasonality, since autocorrelations rise slightly at higher lags

Differencing Variable 02 Data

Trend

Just like we did with Variable 01, we run the KPSS test, looking to reject the null hypothesis that Variable 02 is stationary. To do this the KPSS test statistic must be larger than 0.463, which it is. We reject the hypothesis that the Variable 02 data is stationary, which matches our earlier reading of the plots.

S01_clean$Var02 %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 8 lags. 
## 
## Value of test-statistic is: 10.3302 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS test confirmed that the Variable 02 data is also non-stationary, so we need to difference it as well. Here we difference the data and run the KPSS test again. The test statistic is now far below the critical value, so we fail to reject the null hypothesis: the once-differenced Variable 02 data can be treated as stationary.

# Pad with a leading 0 so the differenced series keeps the same length as the original
Var02_diff <- c(0, S01_clean$Var02 %>% diff())
S01_clean$Var02_diff <- Var02_diff

S01_clean$Var02_diff %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 8 lags. 
## 
## Value of test-statistic is: 0.0042 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Seasonal

As with Variable 01, we check whether seasonal differencing is necessary by running nsdiffs(); it returns the number of seasonal differences required, so a value of 0 means none is needed.

t <- ts(S01_clean$Var02,frequency=365)
nsdiffs(t)
## [1] 0

No seasonal differencing is needed since the value is equal to 0.

Splitting Data into Test & Training

Here we will use 80% of our data as training data and will reserve 20% of our data for testing.

# Chronological 80/20 split - no shuffling, since observation order matters for time series
break_num <- floor(nrow(S01_clean) * 0.8)
train <- S01_clean[1:break_num, ]
test  <- S01_clean[(break_num + 1):nrow(S01_clean), ]
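
A quick look at the resulting partition sizes (a trivial check, included so the forecast horizons used below are explicit):

# Number of rows reserved for fitting vs. evaluation
c(train = nrow(train), test = nrow(test))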

Model Fitting

Auto Arima - Variable 01

The first model we will test for Variable 01 is auto.arima() with the order of differencing fixed at d = 1, as supported by the KPSS results above.

S01_train_Var01 <- ts(select(train,Var01))

fit_var01_auto <- auto.arima(S01_train_Var01,d=1)
summary(fit_var01_auto)
## Series: S01_train_Var01 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.0845  0.0254
## s.e.  0.0286  0.0130
## 
## sigma^2 estimated as 0.1859:  log likelihood=-747.66
## AIC=1501.32   AICc=1501.34   BIC=1516.83
## 
## Training set error measures:
##                        ME      RMSE       MAE         MPE      MAPE      MASE
## Training set 2.312112e-05 0.4306647 0.3040085 -0.01941244 0.8958063 0.9926807
##                      ACF1
## Training set -0.002772543

ETS - Variable 01

Second we will use the exponential smoothing (ETS) model.

fit_var01_ets <- ets(S01_train_Var01)
summary(fit_var01_ets)
## ETS(A,A,N) 
## 
## Call:
##  ets(y = S01_train_Var01) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 26.4127 
##     b = 0.0254 
## 
##   sigma:  0.4328
## 
##      AIC     AICc      BIC 
## 7130.325 7130.371 7156.164 
## 
## Training set error measures:
##                        ME      RMSE       MAE         MPE      MAPE    MASE
## Training set 0.0007689551 0.4321495 0.3061459 -0.01885709 0.9020201 0.99966
##                    ACF1
## Training set 0.07834262

Forecast on Test Data - Variable 01

forecast_auto_var01 <- fit_var01_auto %>% forecast(nrow(test))  # forecast horizon = size of the test set
accuracy(forecast_auto_var01, test$Var01)
##                         ME      RMSE       MAE          MPE       MAPE
## Training set  2.312112e-05 0.4306647 0.3040085  -0.01941244  0.8958063
## Test set     -8.699160e+00 9.5792251 8.7037779 -16.35612988 16.3638362
##                    MASE         ACF1
## Training set  0.9926807 -0.002772543
## Test set     28.4204991           NA
forecast_ets_var01 <- fit_var01_ets %>% forecast(nrow(test))
accuracy(forecast_ets_var01, test$Var01)
##                         ME      RMSE       MAE          MPE       MAPE     MASE
## Training set  0.0007689551 0.4321495 0.3061459  -0.01885709  0.9020201  0.99966
## Test set     -8.7295616713 9.6092604 8.7341364 -16.41130797 16.4189421 28.51963
##                    ACF1
## Training set 0.07834262
## Test set             NA

Auto Arima - Variable 02

The first model we will test for Variable 02 is auto.arima() with the order of differencing fixed at d = 1.

S01_train_Var02 <- ts(select(train,Var02))

fit_var02_auto <- auto.arima(S01_train_Var02,d=1)
summary(fit_var02_auto)
## Series: S01_train_Var02 
## ARIMA(1,1,3) 
## 
## Coefficients:
##          ar1      ma1     ma2     ma3
##       0.8498  -1.4789  0.3779  0.1086
## s.e.  0.0587   0.0667  0.0537  0.0417
## 
## sigma^2 estimated as 1.197e+13:  log likelihood=-21351.15
## AIC=42712.31   AICc=42712.35   BIC=42738.14
## 
## Training set error measures:
##                     ME    RMSE     MAE       MPE     MAPE      MASE        ACF1
## Training set -128388.4 3452637 2293964 -12.30189 27.04777 0.8620093 0.003851951

ETS - Variable 02

Second we will use the exponential smoothing (ETS) model.

fit_var02_ets <- ets(S01_train_Var02)
summary(fit_var02_ets)
## ETS(M,A,N) 
## 
## Call:
##  ets(y = S01_train_Var02) 
## 
##   Smoothing parameters:
##     alpha = 0.8174 
##     beta  = 0.0011 
## 
##   Initial states:
##     l = 7092879.6943 
##     b = 1533385.7915 
## 
##   sigma:  0.3611
## 
##      AIC     AICc      BIC 
## 48309.00 48309.04 48334.84 
## 
## Training set error measures:
##                     ME    RMSE     MAE       MPE     MAPE     MASE       ACF1
## Training set -901614.9 3993393 2691723 -17.35243 30.70294 1.011477 -0.2385468

Forecast on Test Data - Variable 02

forecast_auto_var02 <- fit_var02_auto %>% forecast(nrow(test))
accuracy(forecast_auto_var02, test$Var02)
##                     ME    RMSE     MAE       MPE     MAPE      MASE        ACF1
## Training set -128388.4 3452637 2293964 -12.30189 27.04777 0.8620093 0.003851951
## Test set      530138.9 3208081 2161143 -10.00262 34.96505 0.8120989          NA
forecast_ets_var02 <- fit_var02_ets %>% forecast(nrow(test))
accuracy(forecast_ets_var02, test$Var02)
##                       ME     RMSE      MAE        MPE      MAPE      MASE
## Training set   -901614.9  3993393  2691723  -17.35243  30.70294  1.011477
## Test set     -44859673.6 51987522 44898015 -885.66149 885.82828 16.871455
##                    ACF1
## Training set -0.2385468
## Test set             NA

Residuals of Models

checkresiduals(fit_var01_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 11.952, df = 8, p-value = 0.1534
## 
## Model df: 2.   Total lags used: 10
checkresiduals(fit_var02_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,3)
## Q* = 20.061, df = 6, p-value = 0.002702
## 
## Model df: 4.   Total lags used: 10

For the Variable 01 model, the Ljung-Box p-value of 0.1534 gives no evidence against the residuals being white noise, so the ARIMA(0,1,1) model has captured the autocorrelation in the series. For the Variable 02 model, the p-value of 0.002702 indicates that some autocorrelation remains in the residuals, so its forecasts should be interpreted with more caution.

Forecasting out 140 Periods

Both auto.arima models produced the lowest test-set MAPE values. For Variable 01 the selected model was ARIMA(0,1,1) with drift; for Variable 02 it was ARIMA(1,1,3).
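
To make this comparison explicit, we can collect the test-set MAPE from each accuracy() call above (a small sketch reusing the forecast objects already defined):

# Gather test-set MAPE for each fitted model; lower is better
mape <- c(
  ARIMA_Var01 = accuracy(forecast_auto_var01, test$Var01)["Test set", "MAPE"],
  ETS_Var01   = accuracy(forecast_ets_var01,  test$Var01)["Test set", "MAPE"],
  ARIMA_Var02 = accuracy(forecast_auto_var02, test$Var02)["Test set", "MAPE"],
  ETS_Var02   = accuracy(forecast_ets_var02,  test$Var02)["Test set", "MAPE"]
)
sort(mape)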

Now we will forecast out 140 periods using our selected models for Variable 01 and 02.

final_forecast_auto_var01 <- fit_var01_auto %>% forecast(140)
autoplot(final_forecast_auto_var01)

final_forecast_auto_var02 <- fit_var02_auto %>% forecast(140)
autoplot(final_forecast_auto_var02)