Project 1: S05 - Forecast Var02, Var03

Load Libraries and Data

library(readxl) # read excel
library(dplyr)
library(PerformanceAnalytics) #  correlation and histogram
library(ggplot2) # ggplot
library(tidyverse)
library(tidyr) # drop_na()
library(forecast) # autoplot
library(fpp3)
library(zoo)
library(seasonal)
library(tidymodels)
data <- readxl::read_excel('./Data Set for Class.xls')
data <- data %>% arrange(SeriesInd)

glimpse(data)
## Rows: 10,572
## Columns: 7
## $ SeriesInd <dbl> 40669, 40669, 40669, 40669, 40669, 40669, 40670, 40670, 4...
## $ group     <chr> "S03", "S02", "S01", "S06", "S05", "S04", "S03", "S02", "...
## $ Var01     <dbl> 30.64286, 10.28000, 26.61000, 27.48000, 69.26000, 17.2000...
## $ Var02     <dbl> 123432400, 60855800, 10369300, 39335700, 27809100, 165874...
## $ Var03     <dbl> 30.34000, 10.05000, 25.89000, 26.82000, 68.19000, 16.8800...
## $ Var05     <dbl> 30.49000, 10.17000, 26.20000, 27.02000, 68.72000, 16.9400...
## $ Var07     <dbl> 30.57286, 10.28000, 26.01000, 27.32000, 69.15000, 17.1000...

S05

S05 contains two variables that we’ll be creating forecasts for, Var02 and Var03. We’ll begin by cleaning up the data a little bit. First, we’ll create a date column then select the variables we want to keep as there are some unneeded columns in the data. Next we’ll transform our dataset into a tsibble, which is a data structure in R that has time series properties. We’ll also split our data into our training and test sets.

S05 <- data %>% 
  mutate(date = as.Date(SeriesInd, origin = '1898-08-30')) %>%
  filter(group == 'S05') %>%
  select(date, SeriesInd, group,  Var02, Var03) %>% 
  as_tsibble(index = date)

train <- S05 %>% 
  filter(SeriesInd < 43022)

test <- S05 %>% 
  filter(SeriesInd >= 43022)

Before going too far, when performing time series analysis, its important to understand if you’ve got gaps in your data and what those represent. We’ll use count_gaps() to analyze this:

head(train %>% 
       count_gaps())
## # A tibble: 6 x 3
##   .from      .to           .n
##   <date>     <date>     <int>
## 1 2010-01-09 2010-01-10     2
## 2 2010-01-16 2010-01-18     3
## 3 2010-01-23 2010-01-24     2
## 4 2010-01-30 2010-01-31     2
## 5 2010-02-06 2010-02-07     2
## 6 2010-02-13 2010-02-15     3

We can see that there are implicit gaps in our data, which we would expect as this is stock data and the market is closed on weekends and holidays.

Variable Analysis - Var02 & Var03

Var02 - Analysis

Having cleaned and transformed the data a little bit, lets plot Var02 from our dataset. Var02 is the daily stock volume for Exxon Mobile (XOM).

train %>% 
  autoplot(Var02) + 
  labs(title = 'Exxon Mobile Daily Stock Volume')

Our first glimpse of the data shows a pretty noisy dataset with a lot of movement. It’s hard to tell if this is white noise or if there is some seasonality. We’ll need to do some additional analysis to see if we can tease this information from the data. Further, we note that there are several large spikes within the data. While these do appear to be abnormal or outliers, we have no reason to remove or change them as they represent actual data that is not erroneous.

Var02 Distribution

Let’s take a look at the distribution of Var02.

ggplot(train) + 
  aes(x = Var02) + 
  geom_histogram()

The distribution appears to be right skewed. This indicates that perhaps a transformation might be helpful. We’ll keep this in mind for our modeling section.

Var02 Seasonality

Because our data has some implicit data gaps, we can’t use the built in gg_season function. We’ll need to build our seasonal plot using ggplot2.

train %>%
  mutate(year = as.factor(year(date))) %>%
  mutate(new_date = as.Date(paste0('2010-', month(date),'-', day(date)))) %>% 
  ggplot() + 
  aes(x = new_date, y = Var02, color = year) + 
  geom_line() +
  labs(title = "Seasonal Analysis of Exxon Stock Volume", x = '') + 
  scale_x_date(date_labels = '%b')

In looking at the plot above, there does look like there may be some seasonality to this data. While the data is noisy, it does appear that many peaks and valleys occur during the same periods every year. Let’s look at this data as a grid instead of stacked:

train %>%
  mutate(year = as.factor(year(date))) %>%
  mutate(new_date = as.Date(paste0('2010-', month(date),'-', day(date)))) %>% 
  ggplot() + 
  aes(x = new_date, y = Var02, color = year) + 
  geom_line() +
  labs(title = "Seasonal Analysis of Exxon Stock Volume", x = '') + 
  scale_x_date(date_labels = '%b') + 
  facet_wrap(~year)

Again, the data is super noisy, however, looking at the data like this, it appears that there may be some seasonality at certain times of the year, meaning the cycle would be yearly. With a yearly cycle, we would say the frequency is 365.

Let’s further decompose this data to get a better view of the seasonality and trend.

Var02 Decomposition

tn <- train[complete.cases(train),]


t <- ts(tn$Var02,  frequency = 365.25, start = 2010)

t %>% 
  decompose('multiplicative') %>% 
  autoplot()

Looking at the above decomposition, the trend was somewhat recognizable from our initial plot, however, it looks like some seasonality exits at the yearly level. If we look specifically at the seasonal plot below (truncated view), we can see that a full cycle takes about a year (10 to 375).

s <- t %>% decompose('multiplicative')
plot(s$seasonal[c(250:800)], type = "line")
abline(v = 10, col = "red")
abline(v = 375, col = "red")

Var02 Dealing with Missing Values

Let’s check our dataset for missing values:

train[!complete.cases(train$Var02),]
## # A tsibble: 1 x 5 [1D]
##   date       SeriesInd group Var02 Var03
##   <date>         <dbl> <chr> <dbl> <dbl>
## 1 2013-03-01     41821 S05      NA    NA

In looking at the above, it looks like there is only one row where we are missing a value. Let’s investigate the values around it and see what imputation method makes the most sense:

train %>% 
  filter(date >= '2013-02-15' & date < '2013-03-15') %>% 
  ggplot() + 
  aes(x = date, y = Var02) + 
  geom_line()

As we can see, the values are all over the place. That being said, I’m not sure we could argue for one method over another one. We could fill the missing value with the average over the last several days before and after the point or perhaps the easiest thing to do is just to fill that value with the proceeding value. We’ll move forward with the second option and then visualize the imputation:

train <- train %>% 
  tidyr::fill(Var02, .direction = "down") 

train%>% 
  filter(date >= '2013-02-15' & date < '2013-03-15') %>% 
  ggplot() + 
  aes(x = date, y = Var02) + 
  geom_line()

## Var02 ACF & PACF

Now that we’ve investigated our data, lets begin our model building process. If you’re not using an automated method, you need too investigate differencing and the ACF and PACF plots in order to build the appropriate fitting model. I’ll include these below, but will opt to use the auto.arima function, which will find the optimal parameters for our function.

ggtsdisplay(train$Var02)

Looking at this plot we can tell that our data is highly correlated with itself. We can also see there is a need for differencing. I’ll difference the data here to show that differencing once provides a stationary dataset.

difference(train$Var02) %>% 
  ggtsdisplay()

Looking at our differenced data above, we can see that our dataset now looks like white noise. Looking at the first lag of our ACF and PACF plot, we can see that our initial model will probably include an MA variable. Additionally, we we see the PACF plot drops off after 4 lags, which would most likely mean the model should include this as an autoregressive parameter for the model.

Var02 Model Building

We’ll use auto.arima to build our initial model. We’ll first need to split our training set into train and test sets so we can run our model on ‘unseen’ data.

Var02 <- train$Var02

break_num <- floor(length(Var02)*0.8)
ts_train <- ts(Var02[1:break_num],frequency = 365)
ts_test  <- ts(Var02[(break_num+1):length(Var02)],frequency = 365)
fit <- auto.arima(ts_train, D=1)
summary(fit)
## Series: ts_train 
## ARIMA(4,1,1)(0,1,0)[365] 
## 
## Coefficients:
##          ar1     ar2     ar3     ar4      ma1
##       0.4004  0.1751  0.0046  0.0717  -0.9852
## s.e.  0.0347  0.0359  0.0358  0.0343   0.0116
## 
## sigma^2 estimated as 6.399e+13:  log likelihood=-16121.08
## AIC=32254.16   AICc=32254.25   BIC=32283.18
## 
## Training set error measures:
##                    ME    RMSE     MAE       MPE     MAPE    MASE         ACF1
## Training set 149508.5 6759298 3641778 -2.892572 24.96967 0.47959 -0.007167943

Here we see our MAPE is ~24.97%, which isn’t bad. On average, about 25% of our forecast is incorrect. We’ll now build our forecast.

prediction <- fit %>% forecast(h = length(ts_test))
predictions <- ts(prediction$mean, start = c(1, 1), 
end = c(1,325), 
frequency = 365) 

Having built our forecast, lets look at it plotted against the actuals.

ts_test %>% 
  autoplot(series = "actuals") + 
  autolayer(predictions)  + 
  labs(title = "Actuals vs Predictions", y = "Volume")

We can see that while our forecast captures much of the variability within the data, it is not an incredibly accurate forecast. With stock data, we expect that our forecast will be less than desirable since movements in stock and volume are mostly a random walk.

forecast::accuracy(as.numeric((prediction$mean)),as.numeric((ts_test)))
##                ME    RMSE     MAE       MPE     MAPE
## Test set 147140.9 7547637 5574233 -13.69634 42.36631

Looking at our MAPE on the test set, we see that on average, 42% of our forecast is is incorrect. Now we’ll look at the residuals.

fit %>% residuals () %>%
  ggtsdisplay()

While the residuals do look like white noise, we see in our ACF plot that we have many spikes that cross the critical boundary. This is concerning because it means that our confidence interval calculations won’t be reliable. As our main focus in this exercise is to produce a forecast, we’ll continue with producing a forecast since these checks won’t affect the reliability of the forecasted values.

Var02 Final Test

We’ll now train on the entire training set and then generate our final predictions.

final_train <- ts(train$Var02,frequency = 365)

final_fit <- auto.arima(final_train, D=1)

final_predictions <- final_fit %>% forecast(h = dim(test)[1])
summary(final_fit)
## Series: final_train 
## ARIMA(4,1,1)(0,1,0)[365] 
## 
## Coefficients:
##          ar1     ar2     ar3     ar4      ma1
##       0.3927  0.1786  0.0021  0.0543  -0.9707
## s.e.  0.0346  0.0325  0.0321  0.0325   0.0195
## 
## sigma^2 estimated as 5.454e+13:  log likelihood=-21647.4
## AIC=43306.81   AICc=43306.87   BIC=43337.62
## 
## Training set error measures:
##                    ME    RMSE     MAE       MPE     MAPE      MASE         ACF1
## Training set 34831.57 6485557 3694718 -4.352891 26.08539 0.5180987 -0.003548442

Looks like our final fit is VERY similar to our previous fit. We can view our final forecasted values in the plot below:

final_predictions$mean %>% autoplot() + 
  ggtitle('Final Forecast')

Var02 Prep to Export

df <- as.data.frame(final_predictions$mean)
names(df) <- 'Var02'
df
##        Var02
## 1   11522516
## 2    9581666
## 3   17221153
## 4   15102475
## 5   11660625
## 6   13317940
## 7   12124644
## 8   12677154
## 9   14861834
## 10  20489410
## 11  16010854
## 12  14918303
## 13  13931086
## 14  11419993
## 15  10682474
## 16  11814566
## 17   7769236
## 18   9389510
## 19  14911098
## 20  19759393
## 21  19752501
## 22  17376030
## 23  18968456
## 24  16698109
## 25   9873662
## 26   9394918
## 27   8423703
## 28  10015361
## 29   9302651
## 30  11904740
## 31   9122505
## 32   8820027
## 33  16239092
## 34  16411689
## 35  13020912
## 36   9832753
## 37   7051609
## 38   6986575
## 39  11044850
## 40  10151431
## 41  10854516
## 42  10554605
## 43  10528397
## 44   9588390
## 45  13131286
## 46  11564582
## 47  14256279
## 48  10065477
## 49   7853676
## 50  13860674
## 51  10576973
## 52  10040473
## 53  15806472
## 54   9799272
## 55  33144272
## 56  14511471
## 57   7235071
## 58   8995471
## 59  11461471
## 60   8640071
## 61  12713971
## 62  10414771
## 63  11583071
## 64   8358171
## 65   7977171
## 66   8896871
## 67  13317871
## 68  10342571
## 69  10433171
## 70   6941271
## 71  12754571
## 72  12393571
## 73   6631571
## 74  11490871
## 75   8654371
## 76   7161271
## 77   5724371
## 78   6926871
## 79   5466471
## 80   7595371
## 81   6694571
## 82   9594871
## 83  11008471
## 84   9317871
## 85   7404471
## 86   8955771
## 87   8334271
## 88   7053271
## 89   8451071
## 90   9058571
## 91   6849671
## 92   5690171
## 93   5852071
## 94   5482971
## 95   3456171
## 96   6837871
## 97   4337371
## 98   5418171
## 99   4659171
## 100  8349771
## 101  7839071
## 102  5099671
## 103 11692071
## 104  6138271
## 105  6389071
## 106  5230871
## 107  9668071
## 108  8449271
## 109  9305471
## 110  5166271
## 111  7207771
## 112  4820071
## 113  5408071
## 114  6624671
## 115  6160871
## 116  6150971
## 117  7308571
## 118 13190371
## 119  6525871
## 120  7284671
## 121  8860871
## 122  7320171
## 123 16084071
## 124  8799971
## 125 12041871
## 126 10589871
## 127  6248771
## 128  8755171
## 129 12702271
## 130  7117771
## 131  8546371
## 132  5689471
## 133  5087471
## 134  6370971
## 135  6349971
## 136  5328371
## 137  9216771
## 138 10017871
## 139  5737871
## 140 10242271

Var03 - Analysis

Lets now plot Var03 from our dataset. Var03 is the daily low stock price for Exxon Mobile (XOM).

train %>% 
  autoplot(Var03) + 
  labs(title = 'Exxon Mobile Daily Low Stock Price')

Our first glimpse of the data shows a a lot of movement. It appears that there is an upward trend for several years and then a decline around mid-2014. It’s hard to tell, but it doesn’t look like there is seasonality in this data.

Var03 Distribution

Let’s take a look at the distribution of Var03.

ggplot(train) + 
  aes(x = Var03) + 
  geom_histogram()

The distribution appears to be strongly left skewed. This indicates that perhaps a transformation might be helpful. We’ll keep this in mind for our modeling section.

Var03 Seasonality

Because our data has some implicit data gaps, we can’t use the built in gg_season function. We’ll need to build our seasonal plot using ggplot2.

train %>%
  mutate(year = as.factor(year(date))) %>%
  mutate(new_date = as.Date(paste0('2010-', month(date),'-', day(date)))) %>% 
  ggplot() + 
  aes(x = new_date, y = Var03, color = year) + 
  geom_line() +
  labs(title = "Seasonal Analysis of Exxon Low Stock Price", x = '', y = 'Price') + 
  scale_x_date(date_labels = '%b')

In looking at the plot above, as we mentioned above, it doesn’t look like there is a seasonal component to this series. It looks fairly random from year to year.

Let’s further decompose this data to get a better view of the seasonality and trend.

Var03 Decomposition

tn <- train[complete.cases(train),]


t <- ts(tn$Var03,  frequency = 365.25, start = 2010)

t %>% 
  decompose('multiplicative') %>% 
  autoplot()

Looking at the above decomposition, the trend was recognizable from our initial plot, however, it looks like some seasonality exits at the yearly level. If we look specifically at the seasonal plot below (truncated view), we can see that a full cycle takes about a year (25 to 390).

s <- t %>% decompose('multiplicative')
plot(s$seasonal[c(250:800)], type = "line")
abline(v = 25, col = "red")
abline(v = 390, col = "red")

Var03 Dealing with Missing Values

Let’s check our dataset for missing values:

train[!complete.cases(train$Var03),]
## # A tsibble: 5 x 5 [1D]
##   date       SeriesInd group    Var02 Var03
##   <date>         <dbl> <chr>    <dbl> <dbl>
## 1 2013-03-01     41821 S05   14552000    NA
## 2 2016-02-10     42897 S05   16610900    NA
## 3 2016-02-11     42898 S05   19331600    NA
## 4 2016-05-20     42997 S05   13191900    NA
## 5 2016-05-23     43000 S05   11766100    NA

In looking at the above, it looks like there are 5 rows where we are missing a value. As we did with Var02, we’ll use the previous value to fill these nulls.

train <- train %>% 
  tidyr::fill(Var03, .direction = "down") 

train[!complete.cases(train$Var03),]
## # A tsibble: 0 x 5 [?]
## # ... with 5 variables: date <date>, SeriesInd <dbl>, group <chr>, Var02 <dbl>,
## #   Var03 <dbl>

We can see that our missing values have been filled.

Var03 ACF & PACF

Now that we’ve investigated our data, lets begin our model building process. If you’re not using an automated method, you need too investigate differencing and the ACF and PACF plots in order to build the appropriate fitting model. I’ll include these below, but will opt to use the auto.arima function, which will find the optimal parameters for our function.

ggtsdisplay(train$Var03) 

Looking at this plot we can tell that our data is highly correlated with itself. We can also see there is a need for differencing. I’ll difference the data here to show that differencing once provides a stationary dataset.

difference(train$Var03) %>% 
  ggtsdisplay()

Looking at our differenced data above, we can see that our dataset now looks like white noise. Looking at the first lag of our ACF and PACF plot, we can see that our initial model will probably include an MA variable.

Var03 Model Building

We’ll use auto.arima to build our initial model. We’ll first need to split our training set into train and test sets so we can run our model on ‘unseen’ data.

Var03 <- train$Var03

break_num <- floor(length(Var03)*0.8)
ts_train <- ts(Var03[1:break_num],frequency = 365)
ts_test  <- ts(Var03[(break_num+1):length(Var03)],frequency = 365)
fit <- auto.arima(ts_train, D=1)
summary(fit)
## Series: ts_train 
## ARIMA(0,1,1)(0,1,0)[365] 
## 
## Coefficients:
##          ma1
##       0.0965
## s.e.  0.0328
## 
## sigma^2 estimated as 1.592:  log likelihood=-1540.45
## AIC=3084.89   AICc=3084.9   BIC=3094.56
## 
## Training set error measures:
##                        ME     RMSE       MAE         MPE      MAPE       MASE
## Training set -0.006863079 1.068413 0.6776212 -0.01101442 0.7795547 0.06793487
##                      ACF1
## Training set 0.0001804782

Here we see our MAPE is ~77.95%, which is pretty bad bad. On average, about 78% of our forecast is incorrect. When talking about stock price, I think we may expect to see something like this because stock price really is impossible to forecasts, especially without using exogenous variables. We’ll now build our forecast.

prediction <- fit %>% forecast(h = length(ts_test))
predictions <- ts(prediction$mean, start = c(1, 1), 
end = c(1,325), 
frequency = 365) 

Having built our forecast, lets look at it plotted against the actuals.

ts_test %>% 
  autoplot(series = "actuals") + 
  autolayer(predictions)  + 
  labs(title = "Actuals vs Predictions", y = "Volume")

We can see that while our forecast captures much of the variability within the data, it is incredibly inaccurate. With stock data, we expect that our forecast will be less than desirable since movements in stock price are a random walk.

forecast::accuracy(as.numeric((prediction$mean)),as.numeric((ts_test)))
##                 ME     RMSE      MAE       MPE     MAPE
## Test set -13.84391 16.18008 13.91837 -17.63641 17.72001

Looking at our MAPE on the test set, we see that on average, 18% of our forecast is is incorrect. Now we’ll look at the residuals.

fit %>% residuals () %>%
  ggtsdisplay()

While the residuals do look like white noise, we see in our ACF & PACF plot that we have many spikes that cross the critical boundary. This is concerning because it means that our confidence interval calculations won’t be reliable. As our main focus in this exercise is to produce a forecast, we’ll continue with producing a forecast since these checks won’t affect the reliability of the forecasted values.

Var03 Final Test

We’ll now train on the entire training set and then generate our final predictions.

final_train <- ts(train$Var03,frequency = 365)

final_fit <- auto.arima(final_train, D=1)

final_predictions <- final_fit %>% forecast(h = dim(test)[1])
summary(final_fit)
## Series: final_train 
## ARIMA(0,1,1)(0,1,0)[365] 
## 
## Coefficients:
##          ma1
##       0.0965
## s.e.  0.0285
## 
## sigma^2 estimated as 1.628:  log likelihood=-2091.17
## AIC=4186.35   AICc=4186.35   BIC=4196.62
## 
## Training set error measures:
##                        ME     RMSE       MAE         MPE      MAPE       MASE
## Training set -0.006966902 1.122301 0.7408814 -0.01239113 0.8708925 0.06716763
##                      ACF1
## Training set -0.001013758

Looks like our final fit is VERY similar to our previous fit. We can view our final forecasted values in the plot below:

final_predictions$mean %>% autoplot() + 
  ggtitle('Final Forecast')

Var02 & Var03 Export Data to CSV

df1 <- as.data.frame(final_predictions$mean)
names(df1) <- 'Var03'

final_df <- cbind(df,df1)
final_df <- cbind(test$SeriesInd, final_df) %>% 
  rename('SeriesInd' = `test$SeriesInd`)
write_csv(final_df, "S05 predictions.csv" )