library(readxl) # read excel
library(dplyr)
library(PerformanceAnalytics) # correlation and histogram
library(ggplot2) # ggplot
library(tidyverse)
library(tidyr) # drop_na()
library(forecast) # autoplot
library(fpp3)
library(zoo)
library(seasonal)
library(tidymodels)
data <- readxl::read_excel('./Data Set for Class.xls')
data <- data %>% arrange(SeriesInd)
glimpse(data)
## Rows: 10,572
## Columns: 7
## $ SeriesInd <dbl> 40669, 40669, 40669, 40669, 40669, 40669, 40670, 40670, 4...
## $ group <chr> "S03", "S02", "S01", "S06", "S05", "S04", "S03", "S02", "...
## $ Var01 <dbl> 30.64286, 10.28000, 26.61000, 27.48000, 69.26000, 17.2000...
## $ Var02 <dbl> 123432400, 60855800, 10369300, 39335700, 27809100, 165874...
## $ Var03 <dbl> 30.34000, 10.05000, 25.89000, 26.82000, 68.19000, 16.8800...
## $ Var05 <dbl> 30.49000, 10.17000, 26.20000, 27.02000, 68.72000, 16.9400...
## $ Var07 <dbl> 30.57286, 10.28000, 26.01000, 27.32000, 69.15000, 17.1000...
S05 contains two variables that we’ll be creating forecasts for, Var02 and Var03. We’ll begin by cleaning up the data a little bit. First, we’ll create a date column then select the variables we want to keep as there are some unneeded columns in the data. Next we’ll transform our dataset into a tsibble, which is a data structure in R that has time series properties. We’ll also split our data into our training and test sets.
S05 <- data %>%
mutate(date = as.Date(SeriesInd, origin = '1898-08-30')) %>%
filter(group == 'S05') %>%
select(date, SeriesInd, group, Var02, Var03) %>%
as_tsibble(index = date)
train <- S05 %>%
filter(SeriesInd < 43022)
test <- S05 %>%
filter(SeriesInd >= 43022)
Before going too far, when performing time series analysis, its important to understand if you’ve got gaps in your data and what those represent. We’ll use count_gaps() to analyze this:
head(train %>%
count_gaps())
## # A tibble: 6 x 3
## .from .to .n
## <date> <date> <int>
## 1 2010-01-09 2010-01-10 2
## 2 2010-01-16 2010-01-18 3
## 3 2010-01-23 2010-01-24 2
## 4 2010-01-30 2010-01-31 2
## 5 2010-02-06 2010-02-07 2
## 6 2010-02-13 2010-02-15 3
We can see that there are implicit gaps in our data, which we would expect as this is stock data and the market is closed on weekends and holidays.
Having cleaned and transformed the data a little bit, lets plot Var02 from our dataset. Var02 is the daily stock volume for Exxon Mobile (XOM).
train %>%
autoplot(Var02) +
labs(title = 'Exxon Mobile Daily Stock Volume')
Our first glimpse of the data shows a pretty noisy dataset with a lot of movement. It’s hard to tell if this is white noise or if there is some seasonality. We’ll need to do some additional analysis to see if we can tease this information from the data. Further, we note that there are several large spikes within the data. While these do appear to be abnormal or outliers, we have no reason to remove or change them as they represent actual data that is not erroneous.
Let’s take a look at the distribution of Var02.
ggplot(train) +
aes(x = Var02) +
geom_histogram()
The distribution appears to be right skewed. This indicates that perhaps a transformation might be helpful. We’ll keep this in mind for our modeling section.
Because our data has some implicit data gaps, we can’t use the built in gg_season function. We’ll need to build our seasonal plot using ggplot2.
train %>%
mutate(year = as.factor(year(date))) %>%
mutate(new_date = as.Date(paste0('2010-', month(date),'-', day(date)))) %>%
ggplot() +
aes(x = new_date, y = Var02, color = year) +
geom_line() +
labs(title = "Seasonal Analysis of Exxon Stock Volume", x = '') +
scale_x_date(date_labels = '%b')
In looking at the plot above, there does look like there may be some seasonality to this data. While the data is noisy, it does appear that many peaks and valleys occur during the same periods every year. Let’s look at this data as a grid instead of stacked:
train %>%
mutate(year = as.factor(year(date))) %>%
mutate(new_date = as.Date(paste0('2010-', month(date),'-', day(date)))) %>%
ggplot() +
aes(x = new_date, y = Var02, color = year) +
geom_line() +
labs(title = "Seasonal Analysis of Exxon Stock Volume", x = '') +
scale_x_date(date_labels = '%b') +
facet_wrap(~year)
Again, the data is super noisy, however, looking at the data like this, it appears that there may be some seasonality at certain times of the year, meaning the cycle would be yearly. With a yearly cycle, we would say the frequency is 365.
Let’s further decompose this data to get a better view of the seasonality and trend.
tn <- train[complete.cases(train),]
t <- ts(tn$Var02, frequency = 365.25, start = 2010)
t %>%
decompose('multiplicative') %>%
autoplot()
Looking at the above decomposition, the trend was somewhat recognizable from our initial plot, however, it looks like some seasonality exits at the yearly level. If we look specifically at the seasonal plot below (truncated view), we can see that a full cycle takes about a year (10 to 375).
s <- t %>% decompose('multiplicative')
plot(s$seasonal[c(250:800)], type = "line")
abline(v = 10, col = "red")
abline(v = 375, col = "red")
Let’s check our dataset for missing values:
train[!complete.cases(train$Var02),]
## # A tsibble: 1 x 5 [1D]
## date SeriesInd group Var02 Var03
## <date> <dbl> <chr> <dbl> <dbl>
## 1 2013-03-01 41821 S05 NA NA
In looking at the above, it looks like there is only one row where we are missing a value. Let’s investigate the values around it and see what imputation method makes the most sense:
train %>%
filter(date >= '2013-02-15' & date < '2013-03-15') %>%
ggplot() +
aes(x = date, y = Var02) +
geom_line()
As we can see, the values are all over the place. That being said, I’m not sure we could argue for one method over another one. We could fill the missing value with the average over the last several days before and after the point or perhaps the easiest thing to do is just to fill that value with the proceeding value. We’ll move forward with the second option and then visualize the imputation:
train <- train %>%
tidyr::fill(Var02, .direction = "down")
train%>%
filter(date >= '2013-02-15' & date < '2013-03-15') %>%
ggplot() +
aes(x = date, y = Var02) +
geom_line()
## Var02 ACF & PACF
Now that we’ve investigated our data, lets begin our model building process. If you’re not using an automated method, you need too investigate differencing and the ACF and PACF plots in order to build the appropriate fitting model. I’ll include these below, but will opt to use the auto.arima function, which will find the optimal parameters for our function.
ggtsdisplay(train$Var02)
Looking at this plot we can tell that our data is highly correlated with itself. We can also see there is a need for differencing. I’ll difference the data here to show that differencing once provides a stationary dataset.
difference(train$Var02) %>%
ggtsdisplay()
Looking at our differenced data above, we can see that our dataset now looks like white noise. Looking at the first lag of our ACF and PACF plot, we can see that our initial model will probably include an MA variable. Additionally, we we see the PACF plot drops off after 4 lags, which would most likely mean the model should include this as an autoregressive parameter for the model.
We’ll use auto.arima to build our initial model. We’ll first need to split our training set into train and test sets so we can run our model on ‘unseen’ data.
Var02 <- train$Var02
break_num <- floor(length(Var02)*0.8)
ts_train <- ts(Var02[1:break_num],frequency = 365)
ts_test <- ts(Var02[(break_num+1):length(Var02)],frequency = 365)
fit <- auto.arima(ts_train, D=1)
summary(fit)
## Series: ts_train
## ARIMA(4,1,1)(0,1,0)[365]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1
## 0.4004 0.1751 0.0046 0.0717 -0.9852
## s.e. 0.0347 0.0359 0.0358 0.0343 0.0116
##
## sigma^2 estimated as 6.399e+13: log likelihood=-16121.08
## AIC=32254.16 AICc=32254.25 BIC=32283.18
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 149508.5 6759298 3641778 -2.892572 24.96967 0.47959 -0.007167943
Here we see our MAPE is ~24.97%, which isn’t bad. On average, about 25% of our forecast is incorrect. We’ll now build our forecast.
prediction <- fit %>% forecast(h = length(ts_test))
predictions <- ts(prediction$mean, start = c(1, 1),
end = c(1,325),
frequency = 365)
Having built our forecast, lets look at it plotted against the actuals.
ts_test %>%
autoplot(series = "actuals") +
autolayer(predictions) +
labs(title = "Actuals vs Predictions", y = "Volume")
We can see that while our forecast captures much of the variability within the data, it is not an incredibly accurate forecast. With stock data, we expect that our forecast will be less than desirable since movements in stock and volume are mostly a random walk.
forecast::accuracy(as.numeric((prediction$mean)),as.numeric((ts_test)))
## ME RMSE MAE MPE MAPE
## Test set 147140.9 7547637 5574233 -13.69634 42.36631
Looking at our MAPE on the test set, we see that on average, 42% of our forecast is is incorrect. Now we’ll look at the residuals.
fit %>% residuals () %>%
ggtsdisplay()
While the residuals do look like white noise, we see in our ACF plot that we have many spikes that cross the critical boundary. This is concerning because it means that our confidence interval calculations won’t be reliable. As our main focus in this exercise is to produce a forecast, we’ll continue with producing a forecast since these checks won’t affect the reliability of the forecasted values.
We’ll now train on the entire training set and then generate our final predictions.
final_train <- ts(train$Var02,frequency = 365)
final_fit <- auto.arima(final_train, D=1)
final_predictions <- final_fit %>% forecast(h = dim(test)[1])
summary(final_fit)
## Series: final_train
## ARIMA(4,1,1)(0,1,0)[365]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1
## 0.3927 0.1786 0.0021 0.0543 -0.9707
## s.e. 0.0346 0.0325 0.0321 0.0325 0.0195
##
## sigma^2 estimated as 5.454e+13: log likelihood=-21647.4
## AIC=43306.81 AICc=43306.87 BIC=43337.62
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 34831.57 6485557 3694718 -4.352891 26.08539 0.5180987 -0.003548442
Looks like our final fit is VERY similar to our previous fit. We can view our final forecasted values in the plot below:
final_predictions$mean %>% autoplot() +
ggtitle('Final Forecast')
df <- as.data.frame(final_predictions$mean)
names(df) <- 'Var02'
df
## Var02
## 1 11522516
## 2 9581666
## 3 17221153
## 4 15102475
## 5 11660625
## 6 13317940
## 7 12124644
## 8 12677154
## 9 14861834
## 10 20489410
## 11 16010854
## 12 14918303
## 13 13931086
## 14 11419993
## 15 10682474
## 16 11814566
## 17 7769236
## 18 9389510
## 19 14911098
## 20 19759393
## 21 19752501
## 22 17376030
## 23 18968456
## 24 16698109
## 25 9873662
## 26 9394918
## 27 8423703
## 28 10015361
## 29 9302651
## 30 11904740
## 31 9122505
## 32 8820027
## 33 16239092
## 34 16411689
## 35 13020912
## 36 9832753
## 37 7051609
## 38 6986575
## 39 11044850
## 40 10151431
## 41 10854516
## 42 10554605
## 43 10528397
## 44 9588390
## 45 13131286
## 46 11564582
## 47 14256279
## 48 10065477
## 49 7853676
## 50 13860674
## 51 10576973
## 52 10040473
## 53 15806472
## 54 9799272
## 55 33144272
## 56 14511471
## 57 7235071
## 58 8995471
## 59 11461471
## 60 8640071
## 61 12713971
## 62 10414771
## 63 11583071
## 64 8358171
## 65 7977171
## 66 8896871
## 67 13317871
## 68 10342571
## 69 10433171
## 70 6941271
## 71 12754571
## 72 12393571
## 73 6631571
## 74 11490871
## 75 8654371
## 76 7161271
## 77 5724371
## 78 6926871
## 79 5466471
## 80 7595371
## 81 6694571
## 82 9594871
## 83 11008471
## 84 9317871
## 85 7404471
## 86 8955771
## 87 8334271
## 88 7053271
## 89 8451071
## 90 9058571
## 91 6849671
## 92 5690171
## 93 5852071
## 94 5482971
## 95 3456171
## 96 6837871
## 97 4337371
## 98 5418171
## 99 4659171
## 100 8349771
## 101 7839071
## 102 5099671
## 103 11692071
## 104 6138271
## 105 6389071
## 106 5230871
## 107 9668071
## 108 8449271
## 109 9305471
## 110 5166271
## 111 7207771
## 112 4820071
## 113 5408071
## 114 6624671
## 115 6160871
## 116 6150971
## 117 7308571
## 118 13190371
## 119 6525871
## 120 7284671
## 121 8860871
## 122 7320171
## 123 16084071
## 124 8799971
## 125 12041871
## 126 10589871
## 127 6248771
## 128 8755171
## 129 12702271
## 130 7117771
## 131 8546371
## 132 5689471
## 133 5087471
## 134 6370971
## 135 6349971
## 136 5328371
## 137 9216771
## 138 10017871
## 139 5737871
## 140 10242271
Lets now plot Var03 from our dataset. Var03 is the daily low stock price for Exxon Mobile (XOM).
train %>%
autoplot(Var03) +
labs(title = 'Exxon Mobile Daily Low Stock Price')
Our first glimpse of the data shows a a lot of movement. It appears that there is an upward trend for several years and then a decline around mid-2014. It’s hard to tell, but it doesn’t look like there is seasonality in this data.
Let’s take a look at the distribution of Var03.
ggplot(train) +
aes(x = Var03) +
geom_histogram()
The distribution appears to be strongly left skewed. This indicates that perhaps a transformation might be helpful. We’ll keep this in mind for our modeling section.
Because our data has some implicit data gaps, we can’t use the built in gg_season function. We’ll need to build our seasonal plot using ggplot2.
train %>%
mutate(year = as.factor(year(date))) %>%
mutate(new_date = as.Date(paste0('2010-', month(date),'-', day(date)))) %>%
ggplot() +
aes(x = new_date, y = Var03, color = year) +
geom_line() +
labs(title = "Seasonal Analysis of Exxon Low Stock Price", x = '', y = 'Price') +
scale_x_date(date_labels = '%b')
In looking at the plot above, as we mentioned above, it doesn’t look like there is a seasonal component to this series. It looks fairly random from year to year.
Let’s further decompose this data to get a better view of the seasonality and trend.
tn <- train[complete.cases(train),]
t <- ts(tn$Var03, frequency = 365.25, start = 2010)
t %>%
decompose('multiplicative') %>%
autoplot()
Looking at the above decomposition, the trend was recognizable from our initial plot, however, it looks like some seasonality exits at the yearly level. If we look specifically at the seasonal plot below (truncated view), we can see that a full cycle takes about a year (25 to 390).
s <- t %>% decompose('multiplicative')
plot(s$seasonal[c(250:800)], type = "line")
abline(v = 25, col = "red")
abline(v = 390, col = "red")
Let’s check our dataset for missing values:
train[!complete.cases(train$Var03),]
## # A tsibble: 5 x 5 [1D]
## date SeriesInd group Var02 Var03
## <date> <dbl> <chr> <dbl> <dbl>
## 1 2013-03-01 41821 S05 14552000 NA
## 2 2016-02-10 42897 S05 16610900 NA
## 3 2016-02-11 42898 S05 19331600 NA
## 4 2016-05-20 42997 S05 13191900 NA
## 5 2016-05-23 43000 S05 11766100 NA
In looking at the above, it looks like there are 5 rows where we are missing a value. As we did with Var02, we’ll use the previous value to fill these nulls.
train <- train %>%
tidyr::fill(Var03, .direction = "down")
train[!complete.cases(train$Var03),]
## # A tsibble: 0 x 5 [?]
## # ... with 5 variables: date <date>, SeriesInd <dbl>, group <chr>, Var02 <dbl>,
## # Var03 <dbl>
We can see that our missing values have been filled.
Now that we’ve investigated our data, lets begin our model building process. If you’re not using an automated method, you need too investigate differencing and the ACF and PACF plots in order to build the appropriate fitting model. I’ll include these below, but will opt to use the auto.arima function, which will find the optimal parameters for our function.
ggtsdisplay(train$Var03)
Looking at this plot we can tell that our data is highly correlated with itself. We can also see there is a need for differencing. I’ll difference the data here to show that differencing once provides a stationary dataset.
difference(train$Var03) %>%
ggtsdisplay()
Looking at our differenced data above, we can see that our dataset now looks like white noise. Looking at the first lag of our ACF and PACF plot, we can see that our initial model will probably include an MA variable.
We’ll use auto.arima to build our initial model. We’ll first need to split our training set into train and test sets so we can run our model on ‘unseen’ data.
Var03 <- train$Var03
break_num <- floor(length(Var03)*0.8)
ts_train <- ts(Var03[1:break_num],frequency = 365)
ts_test <- ts(Var03[(break_num+1):length(Var03)],frequency = 365)
fit <- auto.arima(ts_train, D=1)
summary(fit)
## Series: ts_train
## ARIMA(0,1,1)(0,1,0)[365]
##
## Coefficients:
## ma1
## 0.0965
## s.e. 0.0328
##
## sigma^2 estimated as 1.592: log likelihood=-1540.45
## AIC=3084.89 AICc=3084.9 BIC=3094.56
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.006863079 1.068413 0.6776212 -0.01101442 0.7795547 0.06793487
## ACF1
## Training set 0.0001804782
Here we see our MAPE is ~77.95%, which is pretty bad bad. On average, about 78% of our forecast is incorrect. When talking about stock price, I think we may expect to see something like this because stock price really is impossible to forecasts, especially without using exogenous variables. We’ll now build our forecast.
prediction <- fit %>% forecast(h = length(ts_test))
predictions <- ts(prediction$mean, start = c(1, 1),
end = c(1,325),
frequency = 365)
Having built our forecast, lets look at it plotted against the actuals.
ts_test %>%
autoplot(series = "actuals") +
autolayer(predictions) +
labs(title = "Actuals vs Predictions", y = "Volume")
We can see that while our forecast captures much of the variability within the data, it is incredibly inaccurate. With stock data, we expect that our forecast will be less than desirable since movements in stock price are a random walk.
forecast::accuracy(as.numeric((prediction$mean)),as.numeric((ts_test)))
## ME RMSE MAE MPE MAPE
## Test set -13.84391 16.18008 13.91837 -17.63641 17.72001
Looking at our MAPE on the test set, we see that on average, 18% of our forecast is is incorrect. Now we’ll look at the residuals.
fit %>% residuals () %>%
ggtsdisplay()
While the residuals do look like white noise, we see in our ACF & PACF plot that we have many spikes that cross the critical boundary. This is concerning because it means that our confidence interval calculations won’t be reliable. As our main focus in this exercise is to produce a forecast, we’ll continue with producing a forecast since these checks won’t affect the reliability of the forecasted values.
We’ll now train on the entire training set and then generate our final predictions.
final_train <- ts(train$Var03,frequency = 365)
final_fit <- auto.arima(final_train, D=1)
final_predictions <- final_fit %>% forecast(h = dim(test)[1])
summary(final_fit)
## Series: final_train
## ARIMA(0,1,1)(0,1,0)[365]
##
## Coefficients:
## ma1
## 0.0965
## s.e. 0.0285
##
## sigma^2 estimated as 1.628: log likelihood=-2091.17
## AIC=4186.35 AICc=4186.35 BIC=4196.62
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.006966902 1.122301 0.7408814 -0.01239113 0.8708925 0.06716763
## ACF1
## Training set -0.001013758
Looks like our final fit is VERY similar to our previous fit. We can view our final forecasted values in the plot below:
final_predictions$mean %>% autoplot() +
ggtitle('Final Forecast')
df1 <- as.data.frame(final_predictions$mean)
names(df1) <- 'Var03'
final_df <- cbind(df,df1)
final_df <- cbind(test$SeriesInd, final_df) %>%
rename('SeriesInd' = `test$SeriesInd`)
write_csv(final_df, "S05 predictions.csv" )