In this project, we will perform Time Series analysis and forecast Microsoft Stock.
In this dataset, we have the following data:
Date: DateOpen: Opening priceHigh: Highest value of the dayLow: Lowest value of the dayClose: Closing priceVolume: Number of shares traded on that dayData Source: https://www.kaggle.com/datasets/vijayvvenkitesh/microsoft-stock-time-series-analysis
# Data Wrangling
library(dplyr)
library(lubridate)
# Padding
library(padr)
library(zoo)
# Visualization
library(ggplot2)
# Time Series
library(forecast)
library(MLmetrics) #MAPE
We want to do time series analysis on the opening price of Microsoft
Stock. Therefore, we will only take the Date and
Open columns from the data provided.
stock_raw <- read.csv("Microsoft_Stock.csv") %>% select(Date, Open)
# Quick overview on data
glimpse(stock_raw)
#> Rows: 1,511
#> Columns: 2
#> $ Date <chr> "4/1/2015 16:00:00", "4/2/2015 16:00:00", "4/6/2015 16:00:00", "4…
#> $ Open <dbl> 40.60, 40.66, 40.34, 41.61, 41.48, 41.25, 41.63, 41.40, 41.80, 41…
We need to convert our date data to datetime using
library lubridate.
stock_raw$Date %>% head()
#> [1] "4/1/2015 16:00:00" "4/2/2015 16:00:00" "4/6/2015 16:00:00"
#> [4] "4/7/2015 16:00:00" "4/8/2015 16:00:00" "4/9/2015 16:00:00"
stock_raw$Date %>% tail()
#> [1] "3/24/2021 16:00:00" "3/25/2021 16:00:00" "3/26/2021 16:00:00"
#> [4] "3/29/2021 16:00:00" "3/30/2021 16:00:00" "3/31/2021 16:00:00"
The date format is mm/dd/yyyy hh:mm:ss, so we use
mdy_hms() function from lubridate.
stock <- stock_raw %>% mutate(Date = mdy_hms(Date))
stock %>% head()
#> Date Open
#> 1 2015-04-01 16:00:00 40.60
#> 2 2015-04-02 16:00:00 40.66
#> 3 2015-04-06 16:00:00 40.34
#> 4 2015-04-07 16:00:00 41.61
#> 5 2015-04-08 16:00:00 41.48
#> 6 2015-04-09 16:00:00 41.25
Now our data is in their proper datatype.
There are some requirements that our data needs to fulfill before we do a time series analysis.
We must check and make adjustments so our data fulfills these requirements.
From the data above it seems like all the data is recorded at hour 16. Let’s check just to make sure.
# Get unique values of hour component from Date column
stock$Date %>% hour %>% unique
#> [1] 16 13
stock %>% filter(Date %>% hour == 13) %>% arrange(Date)
#> Date Open
#> 1 2017-11-24 13:00:00 83.01
#> 2 2018-07-03 13:00:00 100.48
#> 3 2018-11-23 13:00:00 102.17
#> 4 2018-12-24 13:00:00 97.68
#> 5 2019-07-03 13:00:00 136.80
#> 6 2019-11-29 13:00:00 152.10
#> 7 2019-12-24 13:00:00 157.48
#> 8 2020-11-27 13:00:00 214.85
#> 9 2020-12-24 13:00:00 221.42
There seem to be multiple data recorded at hour 13. To make the interval consistent, we will floor the date to daily and remove the hour data.
stock <- stock %>% mutate(Date = floor_date(Date, unit="day")) %>% arrange(Date)
Now our data has consistent time interval.
# Check if our data have any missing time data
complete_time_data <-
seq(from = min(stock$Date),
to = max(stock$Date),
by = "day")
all(complete_time_data == stock$Date)
#> [1] FALSE
Our data has missing time data. This is most likely because the stock data does not record on weekends and holidays. To fix it, we must do padding.
stock %>% arrange(Date) %>% head()
#> Date Open
#> 1 2015-04-01 40.60
#> 2 2015-04-02 40.66
#> 3 2015-04-06 40.34
#> 4 2015-04-07 41.61
#> 5 2015-04-08 41.48
#> 6 2015-04-09 41.25
stock_pad <- pad(stock) %>% arrange(Date)
stock_pad %>% head()
#> Date Open
#> 1 2015-04-01 40.60
#> 2 2015-04-02 40.66
#> 3 2015-04-03 NA
#> 4 2015-04-04 NA
#> 5 2015-04-05 NA
#> 6 2015-04-06 40.34
Because we padded the time gap, we now have NA values in our data. We
will pad it with the values of the previous day using
na.locf() from zoo package.
stock_pad <- stock_pad %>% mutate(Open = na.locf(Open))
Now our data has fulfilled all time series requirement.
We convert our data into time series object and then check whether it has trend and seasonality.
# Create TS Object
stock_ts <- ts(stock_pad$Open, start = c(2015, 4), frequency = 365)
stock_ts %>% autoplot()
# Decompose and plot again
stock_ts %>% decompose() %>% autoplot()
We can see that our data has both trend and seasonality.
We will split our time series to train and test data for model evaluation. We will use the first 75% of our data for training and the last 25% for validation test.
train_end_index <- length(stock_ts) * 0.75
stock_train <- stock_ts %>% head(train_end_index) # 1644 data
stock_test <- stock_ts %>% tail(-train_end_index) # 548 data
For the model, we will use ETS and ARIMA; using STLM for both models, ETS, and auto ARIMA.
stock_stlmets <- stlm(stock_train, s.window = "period", method = "ets")
stock_stlmarima <- stlm(stock_train, s.window = "period", method = "arima")
stock_autoarima <- auto.arima(stock_train, stepwise = T)
stock_ets <- ets(stock_train)
Forecast Microsoft Stock opening price for the next 548 days using both models
stock_stlmets_forecast <- forecast(stock_stlmets, h = 548)
stock_stlmarima_forecast <- forecast(stock_stlmarima, h = 548)
stock_autoarima_forecast <- forecast(stock_autoarima, h = 548)
stock_ets_forecast <- forecast(stock_ets, h = 548)
Visualizing forecast result
stock_train %>%
autoplot()+
autolayer(stock_stlmets_forecast$mean, series = "STLM ETS forecast result") +
autolayer(stock_stlmarima_forecast$mean, series = "STLM ARIMA forecast result") +
autolayer(stock_autoarima_forecast$mean, series = "Auto ARIMA forecast result") +
autolayer(stock_ets_forecast$mean, series = "ETS forecast result") +
autolayer(stock_test, series = "Actual result") +
theme_minimal()
All models that we have predicted an upward trend, but overall the predicted values are not close to the real values. It is also difficult to determine the better model just by visualization, so we will do model evaluation.
We will compare the AIC, MAPE, and RMSE of each model.
# Create a data frame with row index as ETS and ARIMA, and columns as AIC and RMSE
stock_eval <- data.frame(
AIC = c(stock_stlmets$model$aic,
stock_stlmarima$model$aic,
stock_ets_forecast$model$aic,
stock_autoarima$aic),
MAPE = c(MAPE(y_pred = stock_stlmets_forecast$mean, stock_test),
MAPE(y_pred = stock_stlmarima_forecast$mean, stock_test),
MAPE(y_pred = stock_ets_forecast$mean, stock_test),
MAPE(y_pred = stock_autoarima_forecast$mean, stock_test)
),
RMSE = c(RMSE(y_pred = stock_stlmets_forecast$mean, stock_test),
RMSE(y_pred = stock_stlmarima_forecast$mean, stock_test),
RMSE(y_pred = stock_ets_forecast$mean, stock_test),
RMSE(y_pred = stock_autoarima_forecast$mean, stock_test)
),
row.names = c("STLM_ETS", "STLM_ARIMA","Auto_ARIMA", "ETS")
) %>% arrange(MAPE)
stock_eval
#> AIC MAPE RMSE
#> ETS 4584.099 0.1794395 43.08072
#> STLM_ARIMA 4166.714 0.1818048 43.62158
#> Auto_ARIMA 11763.376 0.1837024 44.11558
#> STLM_ETS 11565.769 0.1906486 45.67941
We will pick STLM_ARIMA as our best model as it has the smallest information loss (AIC) and the error is not much bigger compared to ETS model.
A good forecasting model should have:
We will check if these assumptions are true.
We will check the autocorrelation with Ljung-box test. If the p-value > 0.05, then residual has no-autocorrelation and assumption is True.
Box.test(stock_stlmarima$residuals, type="Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: stock_stlmarima$residuals
#> X-squared = 0.000053041, df = 1, p-value = 0.9942
Seems like our model fulfills the no autocorrelation assumption.
We will check if residual is normal using shapiro test. If p-value > 0.05, then assumption is True.
shapiro.test(stock_stlmarima$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: stock_stlmarima$residuals
#> W = 0.93715, p-value < 0.00000000000000022
Based on the result, all of our model didn’t fulfill the normality assumption for the residuals. This may be caused by outliers that have extreme values. The result suggests that we can improve the model further in order to get better performance.