The stock market is a public, organized environment for trading securities such as stocks, stock options, and real estate funds. Transactions can take place on stock exchanges or in so-called over-the-counter (OTC) markets, where securities not listed on an exchange are traded directly between parties, within the rules set by law and regulation.
The motivation for this study is to explore the nature of financial data. I therefore chose to analyze the stock price of a well-known company whose shares are affected by broad market trends. The goal of this project is to uncover interesting characteristics of the company's stock price and to develop a suitable forecasting model.
In the stock market, purchase and sale transactions are carried out for profit. In this analysis, we begin by fitting a linear regression model to predict the stock's closing price over a period of time.
First, it is important to establish what we are aiming to solve. Broadly, stock market analysis is divided into two parts: fundamental analysis and technical analysis.
Our focus will be on technical analysis, for which we will use a dataset of Microsoft stock prices from April 2015 to April 2021 to build a model capable of estimating the closing stock price.
The data originally comes from an open dataset on Kaggle, which was gathered using Google Sheets. It is OHLC (Open, High, Low, Close) data, recording the opening price, highest price, lowest price, and closing price for each trading day, along with the traded volume.
library(ggplot2)
library(mltools)
library(forecast)
library(tseries)
library(xts)
library(prophet)
library(tidyverse)
library(caret)
data <- read.csv("E:/Forcasting/MS_Stock_data/Microsoft_Stock.csv")
#Viewing the structure of data
str(data)
## 'data.frame': 1511 obs. of 6 variables:
## $ Date : chr "4/1/2015 16:00:00" "4/2/2015 16:00:00" "4/6/2015 16:00:00" "4/7/2015 16:00:00" ...
## $ Open : num 40.6 40.7 40.3 41.6 41.5 ...
## $ High : num 40.8 40.7 41.8 41.9 41.7 ...
## $ Low : num 40.3 40.1 40.2 41.3 41 ...
## $ Close : num 40.7 40.3 41.5 41.5 41.4 ...
## $ Volume: int 36865322 37487476 39223692 28809375 24753438 25723861 28022002 30276692 24244382 27343581 ...
#Converting date column to date format
data['Date'] <- as.Date(data$Date, "%m/%d/%Y")
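#Optional sanity check (a minimal sketch): confirm the conversion produced Date
#objects and that the series spans roughly April 2015 to April 2021
class(data$Date)
range(data$Date)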
#Checking for missing values
colSums(is.na(data))
## Date Open High Low Close Volume
## 0 0 0 0 0 0
There is no missing data in the dataset.
ggplot(data = data, aes(x = Date, y = Close)) +
geom_line(colour = "purple") +
labs(title = "Stock Price of Microsoft over the Years",
subtitle = "2015-2021",
x = "Date", y = "Stock Price ($)")
The time series plot depicts the price of Microsoft stock. It shows that the price increases over time, although with considerable volatility. The series is therefore not stationary, since its mean and variance change over time.
The histogram below shows the distribution of the closing price, which is clearly not normal; the wide spread of values reflects the changing mean and variance over the period. Because of this violation of normality, using a linear model to predict the closing price is not a good choice (as the later analysis will confirm).
hist(data$Close, prob = T, col ="purple", breaks = 100,
xlim = c(min(data$Close) - 1 ,max(data$Close) + 1),
main = "Histogram of Stock Closing Price", xlab = "Closing Price")
To check normality we can also draw a Q-Q plot, shown below. It confirms that the distribution of the closing price is not normal.
ggplot(data=data, aes(sample = Close),) +
stat_qq() +
stat_qq_line(col='purple') + ggtitle('QQ plot of Closing Price')
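If a formal check is preferred in addition to the visual ones, a Shapiro-Wilk test can be applied to the closing prices. This is only a minimal sketch; with about 1,500 observations even small departures from normality will be flagged as significant.
shapiro.test(data$Close)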
Box plots of the numerical variables are used to check for the presence of outliers. The plots below suggest there are none.
par(mfrow = c(3,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(data$Open,main="Open")
boxplot(data$High,main="High")
boxplot(data$Low,main="Low")
boxplot(data$Close,main="Close")
boxplot(data$Volume,main="Volume")
sample_index <- 1:(nrow(data)*0.70)
train_data <- data[sample_index,]
test_data <- data[-sample_index,]
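#A quick check of the time-ordered 70/30 split (a minimal sketch): sizes and
#date ranges of the two partitions
nrow(train_data); range(train_data$Date)
nrow(test_data); range(test_data$Date)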
close.lm <- lm(Close ~ Date ,data = train_data)
summary(close.lm)
##
## Call:
## lm(formula = Close ~ Date, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4561 -5.8235 -0.3349 5.4332 16.1337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.616e+02 7.537e+00 -114.3 <2e-16 ***
## Date 5.418e-02 4.357e-04 124.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.272 on 1055 degrees of freedom
## Multiple R-squared: 0.9361, Adjusted R-squared: 0.936
## F-statistic: 1.546e+04 on 1 and 1055 DF, p-value: < 2.2e-16
Using 70% of the data as the training set, we fit a regression of closing price on date and obtain the results above. The p-value for the Date coefficient is below the significance level α = 0.05, so we have enough evidence to reject the null hypothesis and conclude that there is a significant linear relationship between date and closing price.
Using the test set (the remaining 30% of the data), we predict the closing price as follows.
predict_close <- predict(close.lm,newdata = test_data[c('Date')] )
test_data_pred <- test_data
test_data_pred['predict_close'] <- predict_close
Using the rmse() function from mltools, we compute the RMSE of the linear model on the test set; the value is about 54.
rmse_close_lm <- rmse(actuals = test_data_pred$Close,
preds = test_data_pred$predict_close)
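As a cross-check, the same RMSE can be computed directly from its definition; it should agree with the mltools value up to rounding.
sqrt(mean((test_data_pred$Close - test_data_pred$predict_close)^2))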
Below we plot the training data (purple), the actual test values (orange), and the predicted values from the regression (green). The plot shows that a linear model is not a good choice for this time series. Note also the drop in the actual prices in early 2020, when the Covid-19 pandemic began to affect the stock, which a straight regression line cannot capture. To improve our predictions we will try other modelling techniques, namely auto ARIMA and Prophet.
ggplot(data = train_data, aes(x = Date, y = Close)) +
geom_line(colour = "purple") +
geom_line(data = test_data, aes(x = Date, y = Close),colour = "orange") +
geom_line(data = test_data_pred, aes(x = Date, y = predict_close),colour = "green") +
labs(title = "Stock Price of Microsoft over the Years(Predicted)",
subtitle = "2015-2021",
x = "Date", y = "Stock Price ($)")
Before starting with ARIMA modeling, we check whether the closing price series is stationary. A series is stationary when its mean and variance are constant over time and its autocovariance depends only on the lag, not on time. From the "Stock Price of Microsoft over the Years" plot above, the closing price increases over time despite the volatility, so the series is not stationary: its mean and variance change over time.
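A quick visual check of the changing mean and variance is to plot rolling statistics of the closing price. The sketch below assumes the zoo package (a dependency of xts, which is loaded above) is available; the 60-day window is an arbitrary choice.
library(zoo)
roll_mean <- rollmean(train_data$Close, k = 60, fill = NA) #60-day rolling mean
roll_sd <- rollapply(train_data$Close, width = 60, FUN = sd, fill = NA) #60-day rolling SD
plot(train_data$Date, roll_mean, type = "l", main = "60-Day Rolling Mean of Closing Price", xlab = "Date", ylab = "Mean")
plot(train_data$Date, roll_sd, type = "l", main = "60-Day Rolling SD of Closing Price", xlab = "Date", ylab = "SD")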
We shall run ADF & KPSS tests to check for stationarity.
#Augmented Dickey-Fuller
adf.test(train_data$Close)
##
## Augmented Dickey-Fuller Test
##
## data: train_data$Close
## Dickey-Fuller = -1.5753, Lag order = 10, p-value = 0.7581
## alternative hypothesis: stationary
From the results of the ADF test, the p-value is higher than the 0.05 significance level, so we cannot reject the null hypothesis of a unit root and conclude that the closing price is not stationary.
#KPSS test
kpss.test(train_data$Close)
##
## KPSS Test for Level Stationarity
##
## data: train_data$Close
## KPSS Level = 12.768, Truncation lag parameter = 7, p-value = 0.01
The KPSS test rejects its null hypothesis of level stationarity (p = 0.01), again indicating that the closing price is not stationary. We therefore apply first-order differencing to remove the non-stationarity and verify the result by plotting the differenced closing prices. (The differenced series looks stationary for most of the period, with some instability near the end due to the effect of the Covid-19 pandemic.)
plot(diff(train_data$Close), type="l",main="Microsoft Stock Closing price",ylab="Close Differences",xlab="Days")
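As a cross-check on the amount of differencing required, the forecast package provides ndiffs(), which estimates the number of differences needed for stationarity; for this series we would expect it to return 1. (A minimal sketch.)
ndiffs(train_data$Close)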
Now, we will perform stationarity tests on differenced data.
#Augmented Dickey-Fuller
adf.test(diff(train_data$Close))
##
## Augmented Dickey-Fuller Test
##
## data: diff(train_data$Close)
## Dickey-Fuller = -11.6, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
#KPSS
kpss.test(diff(train_data$Close))
##
## KPSS Test for Level Stationarity
##
## data: diff(train_data$Close)
## KPSS Level = 0.16595, Truncation lag parameter = 7, p-value = 0.1
On the differenced data the ADF test rejects the unit-root null (p = 0.01) and the KPSS test fails to reject level stationarity (p ≥ 0.1), so both tests point to stationarity. As an additional confirmation, we run the Phillips-Perron test on the differenced data.
pp.test(diff(train_data$Close),alternative="stationary")
## Warning in pp.test(diff(train_data$Close), alternative = "stationary"): p-value
## smaller than printed p-value
##
## Phillips-Perron Unit Root Test
##
## data: diff(train_data$Close)
## Dickey-Fuller Z(alpha) = -1072.1, Truncation lag parameter = 7, p-value
## = 0.01
## alternative hypothesis: stationary
The differenced series is stationary, so we now plot the ACF and PACF to study its behavior. ARIMA models use past values to predict future values, and they have three important parameters:
- p: the order of the autoregressive (AR) part
- d: the degree of differencing
- q: the order of the moving average (MA) part
From the below ACF & PACF plots we can estimate the parameters p & q.
acf(diff(train_data$Close))
pacf(diff(train_data$Close))
From the ACF plot we can also infer that there is no seasonality in the data, which is what we would expect for Microsoft's stock series. The ACF pattern suggests a moving average component, and together the ACF and PACF plots let us make a rough estimate of the order of the series. We therefore build the following ARIMA models; the one with the lowest AIC and BIC values will be our preferred choice.
#ARIMA(0,1,1)
model1 <- Arima(train_data$Close,order=c(0,1,1),include.constant=T)
model1
## Series: train_data$Close
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## -0.1400 0.0860
## s.e. 0.0329 0.0311
##
## sigma^2 = 1.381: log likelihood = -1668.01
## AIC=3342.01 AICc=3342.03 BIC=3356.9
checkresiduals(model1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1) with drift
## Q* = 35.845, df = 8, p-value = 1.874e-05
##
## Model df: 2. Total lags used: 10
#ARIMA(0,1,2)
model2 <- Arima(train_data$Close,order=c(0,1,2),include.constant=T)
model2
## Series: train_data$Close
## ARIMA(0,1,2) with drift
##
## Coefficients:
## ma1 ma2 drift
## -0.1191 -0.0945 0.0860
## s.e. 0.0311 0.0331 0.0283
##
## sigma^2 = 1.372: log likelihood = -1664.01
## AIC=3336.03 AICc=3336.07 BIC=3355.88
checkresiduals(model2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,2) with drift
## Q* = 27.397, df = 7, p-value = 0.0002827
##
## Model df: 3. Total lags used: 10
#ARIMA(0,1,3)
model3 <- Arima(train_data$Close,order=c(0,1,3),include.constant=T)
model3
## Series: train_data$Close
## ARIMA(0,1,3) with drift
##
## Coefficients:
## ma1 ma2 ma3 drift
## -0.1184 -0.0926 0.0483 0.0862
## s.e. 0.0307 0.0327 0.0313 0.0301
##
## sigma^2 = 1.37: log likelihood = -1662.83
## AIC=3335.65 AICc=3335.71 BIC=3360.46
checkresiduals(model3)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,3) with drift
## Q* = 22.948, df = 6, p-value = 0.0008142
##
## Model df: 4. Total lags used: 10
#ARIMA(0,1,4)
model4 <- Arima(train_data$Close,order=c(0,1,4),include.constant=T)
model4
## Series: train_data$Close
## ARIMA(0,1,4) with drift
##
## Coefficients:
## ma1 ma2 ma3 ma4 drift
## -0.1313 -0.0867 0.0540 -0.0730 0.0858
## s.e. 0.0312 0.0312 0.0312 0.0355 0.0274
##
## sigma^2 = 1.366: log likelihood = -1660.74
## AIC=3333.47 AICc=3333.55 BIC=3363.25
checkresiduals(model4)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,4) with drift
## Q* = 19.534, df = 5, p-value = 0.001528
##
## Model df: 5. Total lags used: 10
Tuning the ARIMA parameters by hand is time-consuming, so we also use auto.arima, which automatically searches for the combination of (p, d, q) that gives the best fit (lowest AICc by default).
#Auto ARIMA
model5<-auto.arima((train_data$Close), seasonal=FALSE)
model5
## Series: (train_data$Close)
## ARIMA(3,1,3) with drift
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 drift
## 0.1546 -0.3164 0.6935 -0.2780 0.2418 -0.6800 0.0846
## s.e. 0.1104 0.0788 0.0706 0.1165 0.0857 0.0766 0.0216
##
## sigma^2 = 1.346: log likelihood = -1651.89
## AIC=3319.78 AICc=3319.91 BIC=3359.47
checkresiduals(model5)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,3) with drift
## Q* = 3.4724, df = 3, p-value = 0.3244
##
## Model df: 7. Total lags used: 10
For the ARIMA(3,1,3) model the Ljung-Box test is not significant (p = 0.32), so its residuals are indistinguishable from white noise; the simpler MA models above all show significant residual autocorrelation.
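The Ljung-Box statistic reported by checkresiduals() can also be computed directly with the base Box.test() function; this minimal sketch uses lag = 10 and fitdf = 7 to match the lags and the seven estimated coefficients above.
Box.test(residuals(model5), lag = 10, type = "Ljung-Box", fitdf = 7)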
The table below lists the AIC and BIC values of all the ARIMA models built (a sketch for assembling such a table from the fitted model objects follows it). The ARIMA(3,1,3) model chosen by auto.arima has the lowest AIC, although its BIC is slightly higher than that of the simpler MA models, and it is the only model whose residuals pass the Ljung-Box test. We therefore proceed with ARIMA(3,1,3).
| models_name | models_AIC | models_BIC |
|---|---|---|
| ARIMA(0,1,1) | 3342.01 | 3356.90 |
| ARIMA(0,1,2) | 3336.03 | 3355.88 |
| ARIMA(0,1,3) | 3335.65 | 3360.46 |
| ARIMA(0,1,4) | 3333.47 | 3363.25 |
| Auto ARIMA(3,1,3) | 3319.78 | 3359.47 |
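The comparison table can also be assembled programmatically from the fitted model objects. A minimal sketch, assuming the Arima fits expose their aic and bic components (as forecast's Arima objects normally do):
models <- list("ARIMA(0,1,1)" = model1, "ARIMA(0,1,2)" = model2,
               "ARIMA(0,1,3)" = model3, "ARIMA(0,1,4)" = model4,
               "Auto ARIMA(3,1,3)" = model5)
data.frame(models_name = names(models),
           models_AIC = sapply(models, function(m) m$aic),
           models_BIC = sapply(models, function(m) m$bic))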
The multi-step forecast below projects the next 150 trading days using only the training data, without reference to the test set.
plot(forecast(model5,h=150),main="ARIMA(3,1,3) Forecast",ylab="Closing Price",xlab="Date")
The accuracy of the model can be seen below. Applying the fitted ARIMA(3,1,3) model to the test data (one-step errors with the parameters held fixed) gives an RMSE of about 3.9.
test_pred <- Arima(test_data$Close, model=model5)
accuracy(test_pred)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.2346034 3.856331 2.640417 0.09741845 1.456463 0.9915436
## ACF1
## Training set -0.2150405
Prophet, developed and open-sourced by Facebook, is a time series forecasting library that requires very little data preprocessing and is extremely simple to use. Its input is a dataframe with two columns, the date and the target, named ds and y. Prophet tries to capture the seasonality in the historical data and works well when the dataset is large.
prophet_data = train_data %>%
rename(ds = Date, y = Close)
p_model = prophet(prophet_data)
future = make_future_dataframe(p_model,periods = 365)
forecast = predict(p_model,future)
plot(p_model,forecast) +
ylab("Closing Price") + xlab("Date") + theme_bw()
dyplot.prophet(p_model,forecast)
Time series data can exhibit a variety of patterns, and it is often helpful to split a series into several components, each representing an underlying pattern category. The plots below visualize the components of the fitted Prophet model.
prophet_plot_components(p_model,forecast)
plot(p_model,forecast) +
add_changepoints_to_plot(p_model) +
theme_bw() +
ylab("Closing Price") +
xlab("Date")
There are essentially two ways to model the seasonality of a time series: additive and multiplicative. In the additive model the seasonal effect is constant over time: the series changes by the same absolute amount in each cycle, on top of the trend.
The model fitted above uses the additive seasonality assumption, since the default value of the seasonality.mode parameter is 'additive'.
In the multiplicative model the seasonal effect scales with the level of the series: the trend and seasonal components are multiplied before the error component is added, so the seasonal swings grow (or shrink) with the trend and the pattern follows a curved rather than a straight line.
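As a schematic illustration only (not Prophet's internal model), the sketch below shows how a hypothetical trend and seasonal component combine under the two assumptions:
#Additive: y(t) = trend(t) + seasonal(t) + error(t)
#Multiplicative: y(t) = trend(t) * (1 + seasonal(t)) + error(t)
day <- 1:730
trend <- 40 + 0.05 * day #hypothetical linear trend
seasonal <- sin(2 * pi * day / 365) #hypothetical yearly cycle
y_add <- trend + 5 * seasonal #seasonal swing stays a constant amount
y_mult <- trend * (1 + 0.1 * seasonal) #seasonal swing grows with the trend level
matplot(day, cbind(y_add, y_mult), type = "l", lty = 1, xlab = "Day", ylab = "Value", main = "Additive vs Multiplicative Seasonality")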
Let's fit a multiplicative model and perform forecasting below:
multi = prophet(prophet_data,seasonality.mode = "multiplicative")
multi_forecast = predict(multi,future)
dyplot.prophet(multi,multi_forecast)
prophet_plot_components(multi,multi_forecast)
We can see that the multiplicative model does not forecast well here: it produces negative stock prices, which is impossible.
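A quick way to check this is to count how many of the forecasted values (or their lower bounds) fall below zero (a minimal sketch):
sum(multi_forecast$yhat < 0) #negative point forecasts
sum(multi_forecast$yhat_lower < 0) #negative lower prediction bounds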
model_holidays = prophet(prophet_data,fit = FALSE)
model_holidays = add_country_holidays(model_holidays,country_name = "US")
model_holidays = fit.prophet(model_holidays,prophet_data)
forecast_holidays = predict(model_holidays,future)
prophet_plot_components(model_holidays,forecast_holidays)
forecast_holidays %>%
filter(holidays != 0) %>%
dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
pivot_longer(
cols = `Christmas Day`:`Washington's Birthday`,
names_to = 'holiday',
values_to = 'effect'
) %>%
ggplot() +
geom_col(aes(effect,holiday))+
theme_bw()
From the plot above, Veterans Day and Columbus Day appear to be the holidays with the largest estimated effect on Microsoft's stock price.
We will calculate RMSE, MAE, and MAPE to assess the model's performance on the test data. Before that, we extend the forecasting horizon so that it covers all the dates available in the test dataset.
test_dates = test_data[,1]
future_2 = make_future_dataframe(p_model,periods = 700)
forecast_2 = predict(p_model,future_2)
forecast_metric_data = forecast_2 %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds %in% (test_dates))
RMSE = sqrt(mean((test_data$Close - forecast_metric_data$yhat)^2))
MAE = mean(abs(test_data$Close - forecast_metric_data$yhat))
MAPE = mean(abs((test_data$Close - forecast_metric_data$yhat)/test_data$Close))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 39.93"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 32.91"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 0.16"
df.cv <- cross_validation(p_model, initial = 730, period = 30, horizon = 30, units = 'days')
head(df.cv)
## y ds yhat yhat_lower yhat_upper cutoff
## 1 67.53 2017-04-24 65.29656 63.91355 66.66080 2017-04-23
## 2 67.92 2017-04-25 65.18932 63.79759 66.56335 2017-04-23
## 3 67.83 2017-04-26 65.18328 63.84720 66.61522 2017-04-23
## 4 68.27 2017-04-27 65.05404 63.60974 66.42490 2017-04-23
## 5 68.46 2017-04-28 64.97823 63.72417 66.29444 2017-04-23
## 6 69.41 2017-05-01 64.66387 63.20180 66.14533 2017-04-23
unique(df.cv$cutoff)
## [1] "2017-04-23 GMT" "2017-05-23 GMT" "2017-06-22 GMT" "2017-07-22 GMT"
## [5] "2017-08-21 GMT" "2017-09-20 GMT" "2017-10-20 GMT" "2017-11-19 GMT"
## [9] "2017-12-19 GMT" "2018-01-18 GMT" "2018-02-17 GMT" "2018-03-19 GMT"
## [13] "2018-04-18 GMT" "2018-05-18 GMT" "2018-06-17 GMT" "2018-07-17 GMT"
## [17] "2018-08-16 GMT" "2018-09-15 GMT" "2018-10-15 GMT" "2018-11-14 GMT"
## [21] "2018-12-14 GMT" "2019-01-13 GMT" "2019-02-12 GMT" "2019-03-14 GMT"
## [25] "2019-04-13 GMT" "2019-05-13 GMT"
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color = factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Closing Price")+
scale_color_discrete(name = 'Cutoff')
We can generate a table of performance metrics for different forecast horizons.
performance_metrics(df.cv)
## horizon mse rmse mae mape mdape smape
## 1 3 days 9.258266 3.042740 2.246496 0.02292406 0.01577315 0.02288683
## 2 4 days 13.499756 3.674201 2.657491 0.02682402 0.01419638 0.02672558
## 3 5 days 17.746545 4.212665 3.090287 0.03099727 0.02163692 0.03090786
## 4 6 days 21.677617 4.655923 3.385956 0.03396338 0.02526102 0.03376395
## 5 7 days 26.101180 5.108931 3.674707 0.03698395 0.02805215 0.03666708
## 6 8 days 25.519320 5.051665 3.633358 0.03712603 0.02959705 0.03686874
## 7 9 days 26.443754 5.142349 3.683692 0.03807638 0.02959705 0.03779097
## 8 10 days 28.072759 5.298373 3.786991 0.03976921 0.02828739 0.03941697
## 9 11 days 29.256258 5.408905 3.823282 0.03981966 0.02828739 0.03951853
## 10 12 days 30.269546 5.501777 3.886161 0.03981611 0.02735668 0.03975067
## 11 13 days 30.061244 5.482814 3.972235 0.03963228 0.02560659 0.03992493
## 12 14 days 35.477348 5.956286 4.468172 0.04467092 0.03237950 0.04471008
## 13 15 days 35.122291 5.926406 4.453522 0.04536710 0.03547567 0.04527155
## 14 16 days 32.560974 5.706222 4.229467 0.04372185 0.03151711 0.04353098
## 15 17 days 31.532093 5.615344 4.173718 0.04302798 0.03317428 0.04301686
## 16 18 days 33.609581 5.797377 4.332762 0.04409168 0.03501626 0.04411983
## 17 19 days 32.769370 5.724454 4.252051 0.04316524 0.03141626 0.04323926
## 18 20 days 32.326262 5.685619 4.034710 0.04032537 0.02604089 0.04044021
## 19 21 days 32.759297 5.723574 4.112874 0.04092266 0.02893690 0.04090769
## 20 22 days 33.423048 5.781267 4.192185 0.04166204 0.03006686 0.04161086
## 21 23 days 29.822982 5.461042 4.175911 0.04219432 0.03565251 0.04217370
## 22 24 days 24.251086 4.924539 3.869859 0.03975605 0.03399120 0.03984401
## 23 25 days 25.797846 5.079158 4.003124 0.04078213 0.03414902 0.04081244
## 24 26 days 26.294472 5.127814 3.935535 0.03949667 0.03065525 0.03954483
## 25 27 days 29.319786 5.414775 4.100798 0.04018234 0.03414902 0.04036263
## 26 28 days 30.885771 5.557497 4.215261 0.04106138 0.03406208 0.04122067
## 27 29 days 33.044848 5.748465 4.385580 0.04316954 0.03406208 0.04343926
## 28 30 days 34.334708 5.859583 4.534927 0.04582468 0.03406208 0.04586384
## coverage
## 1 0.5332671
## 2 0.5241090
## 3 0.4558093
## 4 0.4011917
## 5 0.3333333
## 6 0.3878407
## 7 0.4276730
## 8 0.4039956
## 9 0.3704071
## 10 0.3843098
## 11 0.3878407
## 12 0.3096559
## 13 0.2907880
## 14 0.3360849
## 15 0.3187686
## 16 0.2939424
## 17 0.2830189
## 18 0.3584906
## 19 0.3584906
## 20 0.3700210
## 21 0.2907880
## 22 0.2889772
## 23 0.2899702
## 24 0.3783515
## 25 0.3892751
## 26 0.3396226
## 27 0.3140954
## 28 0.2704403
Now, we can plot visualizations for metrics like RMSE, MAE, MAPE, MDAPE, SMAPE.
plot_cross_validation_metric(df.cv, metric = 'rmse')
plot_cross_validation_metric(df.cv, metric = 'mae')
plot_cross_validation_metric(df.cv, metric = 'mape')
plot_cross_validation_metric(df.cv, metric = 'mdape')
plot_cross_validation_metric(df.cv, metric = 'smape')