Overview

In this project, we will perform Time Series analysis and forecast Microsoft Stock.

In this dataset, we have the following data:

  • Date: Date
  • Open: Opening price
  • High: Highest value of the day
  • Low: Lowest value of the day
  • Close: Closing price
  • Volume: Number of shares traded on that day

Data Source: https://www.kaggle.com/datasets/vijayvvenkitesh/microsoft-stock-time-series-analysis

Import Libraries

# Data Wrangling
library(dplyr)
library(lubridate)

# Padding
library(padr) 
library(zoo)

# Visualization
library(ggplot2)

# Time Series
library(forecast)
library(MLmetrics) #MAPE

Data Input

We want to do time series analysis on the opening price of Microsoft Stock. Therefore, we will only take the Date and Open columns from the data provided.

stock_raw <- read.csv("Microsoft_Stock.csv") %>% select(Date, Open)
# Quick overview on data
glimpse(stock_raw)
#> Rows: 1,511
#> Columns: 2
#> $ Date <chr> "4/1/2015 16:00:00", "4/2/2015 16:00:00", "4/6/2015 16:00:00", "4…
#> $ Open <dbl> 40.60, 40.66, 40.34, 41.61, 41.48, 41.25, 41.63, 41.40, 41.80, 41…

Data Type Conversion

We need to convert our date data to datetime using library lubridate.

stock_raw$Date %>% head()
#> [1] "4/1/2015 16:00:00" "4/2/2015 16:00:00" "4/6/2015 16:00:00"
#> [4] "4/7/2015 16:00:00" "4/8/2015 16:00:00" "4/9/2015 16:00:00"
stock_raw$Date %>% tail()
#> [1] "3/24/2021 16:00:00" "3/25/2021 16:00:00" "3/26/2021 16:00:00"
#> [4] "3/29/2021 16:00:00" "3/30/2021 16:00:00" "3/31/2021 16:00:00"

The date format is mm/dd/yyyy hh:mm:ss, so we use mdy_hms() function from lubridate.

stock <- stock_raw %>% mutate(Date = mdy_hms(Date))
stock %>% head()
#>                  Date  Open
#> 1 2015-04-01 16:00:00 40.60
#> 2 2015-04-02 16:00:00 40.66
#> 3 2015-04-06 16:00:00 40.34
#> 4 2015-04-07 16:00:00 41.61
#> 5 2015-04-08 16:00:00 41.48
#> 6 2015-04-09 16:00:00 41.25

Now our data is in their proper datatype.

Data Preprocessing

There are some requirements that our data needs to fulfill before we do a time series analysis.

  1. Data must be ordered by time data
  2. Data must have consistent time interval
  3. Time data must have no gap
  4. Data must have no missing values

We must check and make adjustments so our data fulfills these requirements.

Check time interval

From the data above it seems like all the data is recorded at hour 16. Let’s check just to make sure.

# Get unique values of hour component from Date column
stock$Date %>% hour %>% unique
#> [1] 16 13
stock %>% filter(Date %>% hour == 13) %>% arrange(Date) 
#>                  Date   Open
#> 1 2017-11-24 13:00:00  83.01
#> 2 2018-07-03 13:00:00 100.48
#> 3 2018-11-23 13:00:00 102.17
#> 4 2018-12-24 13:00:00  97.68
#> 5 2019-07-03 13:00:00 136.80
#> 6 2019-11-29 13:00:00 152.10
#> 7 2019-12-24 13:00:00 157.48
#> 8 2020-11-27 13:00:00 214.85
#> 9 2020-12-24 13:00:00 221.42

There seem to be multiple data recorded at hour 13. To make the interval consistent, we will floor the date to daily and remove the hour data.

stock <- stock %>% mutate(Date = floor_date(Date, unit="day")) %>% arrange(Date)

Now our data has consistent time interval.

Check missing time data

# Check if our data have any missing time data
complete_time_data <- 
  seq(from = min(stock$Date), 
      to = max(stock$Date), 
      by = "day")

all(complete_time_data == stock$Date)
#> [1] FALSE

Our data has missing time data. This is most likely because the stock data does not record on weekends and holidays. To fix it, we must do padding.

stock %>% arrange(Date) %>% head()
#>         Date  Open
#> 1 2015-04-01 40.60
#> 2 2015-04-02 40.66
#> 3 2015-04-06 40.34
#> 4 2015-04-07 41.61
#> 5 2015-04-08 41.48
#> 6 2015-04-09 41.25
stock_pad <- pad(stock) %>% arrange(Date)
stock_pad %>% head()
#>         Date  Open
#> 1 2015-04-01 40.60
#> 2 2015-04-02 40.66
#> 3 2015-04-03    NA
#> 4 2015-04-04    NA
#> 5 2015-04-05    NA
#> 6 2015-04-06 40.34

Fill in missing values

Because we padded the time gap, we now have NA values in our data. We will pad it with the values of the previous day using na.locf() from zoo package.

stock_pad <- stock_pad %>% mutate(Open = na.locf(Open))

Now our data has fulfilled all time series requirement.

Time Series Object & EDA

We convert our data into time series object and then check whether it has trend and seasonality.

# Create TS Object
stock_ts <- ts(stock_pad$Open, start = c(2015, 4), frequency = 365)
stock_ts %>% autoplot()

# Decompose and plot again
stock_ts %>% decompose() %>% autoplot()

We can see that our data has both trend and seasonality.

Cross Validation

We will split our time series to train and test data for model evaluation. We will use the first 75% of our data for training and the last 25% for validation test.

train_end_index <- length(stock_ts) * 0.75

stock_train <- stock_ts %>% head(train_end_index) # 1644 data
stock_test <- stock_ts %>% tail(-train_end_index) # 548 data

Modeling

For the model, we will use ETS and ARIMA; using STLM for both models, ETS, and auto ARIMA.

stock_stlmets <- stlm(stock_train, s.window = "period", method = "ets")
stock_stlmarima <- stlm(stock_train, s.window = "period", method = "arima")
stock_autoarima <- auto.arima(stock_train, stepwise = T)
stock_ets <- ets(stock_train)

Forecasting

Forecast Microsoft Stock opening price for the next 548 days using both models

stock_stlmets_forecast <- forecast(stock_stlmets, h = 548)
stock_stlmarima_forecast <- forecast(stock_stlmarima, h = 548)
stock_autoarima_forecast <- forecast(stock_autoarima, h = 548)
stock_ets_forecast <- forecast(stock_ets, h = 548)

Visualizing forecast result

stock_train %>% 
  autoplot()+
  autolayer(stock_stlmets_forecast$mean, series = "STLM ETS forecast result") +
  autolayer(stock_stlmarima_forecast$mean, series = "STLM ARIMA forecast result") +
  autolayer(stock_autoarima_forecast$mean, series = "Auto ARIMA forecast result") +
  autolayer(stock_ets_forecast$mean, series = "ETS forecast result") +
  autolayer(stock_test, series = "Actual result") +
  theme_minimal()

All models that we have predicted an upward trend, but overall the predicted values are not close to the real values. It is also difficult to determine the better model just by visualization, so we will do model evaluation.

Evaluation

We will compare the AIC, MAPE, and RMSE of each model.

# Create a data frame with row index as ETS and ARIMA, and columns as AIC and RMSE
stock_eval <- data.frame(
  AIC = c(stock_stlmets$model$aic, 
          stock_stlmarima$model$aic, 
          stock_ets_forecast$model$aic,
          stock_autoarima$aic),
  MAPE = c(MAPE(y_pred = stock_stlmets_forecast$mean, stock_test),
           MAPE(y_pred = stock_stlmarima_forecast$mean, stock_test),
           MAPE(y_pred = stock_ets_forecast$mean, stock_test),
           MAPE(y_pred = stock_autoarima_forecast$mean, stock_test)
           ),
  RMSE = c(RMSE(y_pred = stock_stlmets_forecast$mean, stock_test),
           RMSE(y_pred = stock_stlmarima_forecast$mean, stock_test),
           RMSE(y_pred = stock_ets_forecast$mean, stock_test),
           RMSE(y_pred = stock_autoarima_forecast$mean, stock_test)
           ),
  row.names = c("STLM_ETS", "STLM_ARIMA","Auto_ARIMA", "ETS")
) %>% arrange(MAPE)

stock_eval 
#>                  AIC      MAPE     RMSE
#> ETS         4584.099 0.1794395 43.08072
#> STLM_ARIMA  4166.714 0.1818048 43.62158
#> Auto_ARIMA 11763.376 0.1837024 44.11558
#> STLM_ETS   11565.769 0.1906486 45.67941
  • Based on the AIC value, the best model is STLM_ARIMA followed by ETS.
  • Based on the MAPE and RMSE value, the best model is ETS followed by STLM_ARIMA.

We will pick STLM_ARIMA as our best model as it has the smallest information loss (AIC) and the error is not much bigger compared to ETS model.

Assumption test

A good forecasting model should have:

  • No-autocorrelation residual
  • Normality of residual

We will check if these assumptions are true.

No autocorrelation residuals

We will check the autocorrelation with Ljung-box test. If the p-value > 0.05, then residual has no-autocorrelation and assumption is True.

Box.test(stock_stlmarima$residuals, type="Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  stock_stlmarima$residuals
#> X-squared = 0.000053041, df = 1, p-value = 0.9942

Seems like our model fulfills the no autocorrelation assumption.

Normality of residuals

We will check if residual is normal using shapiro test. If p-value > 0.05, then assumption is True.

shapiro.test(stock_stlmarima$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  stock_stlmarima$residuals
#> W = 0.93715, p-value < 0.00000000000000022

Based on the result, all of our model didn’t fulfill the normality assumption for the residuals. This may be caused by outliers that have extreme values. The result suggests that we can improve the model further in order to get better performance.