Predicting Stock Returns in the Dow Jones Index

Background

When looking to invest, many people turn to the stock market. Knowing which industries and companies to invest in is critical before making an investment. In this study, we will analyze weekly stock prices of the Dow Jones Industrial Average, one of the most widely known stock indexes in the world. The Dow Jones Index consists of 30 price-weighted stocks that represent different industrial sectors of the US economy. It includes companies like Disney, Pfizer, General Electric, and Microsoft.

Our goal in this study is to determine which stock will produce the greatest rate of return in the following week. To accomplish this, we will explore both parametric and nonparametric techniques, including linear regression, random forests, and support vector machines. Since there are 30 stocks in the data set, we will fit individual models for each stock and compare the overall model performance using the Root Mean Squared Error (RMSE), which represents the average difference between the true value of the stock return and the predicted value. Additionally, we would like to quantify the risk of a potential stock investment in comparison to the market. We will use a Capital Asset Pricing Model to calculate the risks of the different stocks in the Dow Jones Index.

Data

The Dow Jones data set can be imported from the UCI Machine Learning Repository and comprises weekly stock information for the first two quarters of 2011. It contains data related to each of the different stocks, like the open and close prices, the number of shares traded, and the percentage change in the stock price throughout the week. To identify which stock will produce the greatest rate of return, the variable we would like to predict is the percent_change_next_weeks_price, which is a quantitative variable indicating the stock’s return in the following week.

library(tidyverse)
library(lubridate)
library(forecast)
library(caret)
library(tree)
library(Hmisc)
library(quantmod)
library(GGally)
library(kableExtra)
theme_set(theme_minimal())
dow <- read_csv("/Users/Kelli/Documents/Hw/DA6813/dow_jones_index.data")

Cleaning the data

Before beginning our analysis, we first cleaned the input variables by removing the $ in the price variables, factoring the categorical features, and changing the date format. We then checked for missing data.

dow$quarter <- as_factor(dow$quarter)
dow$stock <- as_factor(dow$stock)
dow$date <- as.Date(dow$date, "%m/%d/%Y")
dow$open <- as.numeric(str_remove(dow$open, "[$]"))
dow$high <- as.numeric(str_remove(dow$high, "[$]"))
dow$low <- as.numeric(str_remove(dow$low, "[$]"))
dow$close <- as.numeric(str_remove(dow$close, "[$]"))
dow$next_weeks_open <- as.numeric(str_remove(dow$next_weeks_open, "[$]"))
dow$next_weeks_close <- as.numeric(str_remove(dow$next_weeks_close, "[$]"))
str(dow[])

## tibble[,16] [750 x 16] (S3: tbl_df/tbl/data.frame)
##  $ quarter                           : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stock                             : Factor w/ 30 levels "AA","AXP","BA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ date                              : Date[1:750], format: "2011-01-07" "2011-01-14" ...
##  $ open                              : num [1:750] 15.8 16.7 16.2 15.9 16.2 ...
##  $ high                              : num [1:750] 16.7 16.7 16.4 16.6 17.4 ...
##  $ low                               : num [1:750] 15.8 15.6 15.6 15.8 16.2 ...
##  $ close                             : num [1:750] 16.4 16 15.8 16.1 17.1 ...
##  $ volume                            : num [1:750] 2.40e+08 2.43e+08 1.38e+08 1.51e+08 1.54e+08 ...
##  $ percent_change_price              : num [1:750] 3.79 -4.43 -2.47 1.64 5.93 ...
##  $ percent_change_volume_over_last_wk: num [1:750] NA 1.38 -43.02 9.36 1.99 ...
##  $ previous_weeks_volume             : num [1:750] NA 2.40e+08 2.43e+08 1.38e+08 1.51e+08 ...
##  $ next_weeks_open                   : num [1:750] 16.7 16.2 15.9 16.2 17.3 ...
##  $ next_weeks_close                  : num [1:750] 16 15.8 16.1 17.1 17.4 ...
##  $ percent_change_next_weeks_price   : num [1:750] -4.428 -2.471 1.638 5.933 0.231 ...
##  $ days_to_next_dividend             : num [1:750] 26 19 12 5 97 90 83 76 69 62 ...
##  $ percent_return_next_dividend      : num [1:750] 0.183 0.188 0.19 0.186 0.175 ...

Missing data

There are 30 missing observations in percent_change_volume_over_last_wk and previous_weeks_volume. For these variables, we grouped the data by stock and assigned the missing observation with the variable’s mean.

colSums(is.na(dow))

##                            quarter                              stock 
##                                  0                                  0 
##                               date                               open 
##                                  0                                  0 
##                               high                                low 
##                                  0                                  0 
##                              close                             volume 
##                                  0                                  0 
##               percent_change_price percent_change_volume_over_last_wk 
##                                  0                                 30 
##              previous_weeks_volume                    next_weeks_open 
##                                 30                                  0 
##                   next_weeks_close    percent_change_next_weeks_price 
##                                  0                                  0 
##              days_to_next_dividend       percent_return_next_dividend 
##                                  0                                  0

dow <- dow %>% 
  group_by(stock) %>%
  mutate(percent_change_volume_over_last_wk = ifelse(is.na(percent_change_volume_over_last_wk),
                                                     mean(percent_change_volume_over_last_wk,
                                                     na.rm=TRUE),
                                                     percent_change_volume_over_last_wk),
         previous_weeks_volume = ifelse(is.na(previous_weeks_volume),
                                        mean(previous_weeks_volume, na.rm=TRUE), 
                                        previous_weeks_volume)) %>% 
  ungroup()

Lag Plots

Since stock prices often depend on past values, we selected a small subset of lagged variables to be used as our predictors. A lagged variable is a time series version of itself going back a given number of time periods. To determine the number of time periods to go back, we looked at the lag plots for each variable. Below are the time series plots for Close and Volume. Both appear to have the strongest linear relationship at Lag 1.

close.ts <- ts(dow$close, freq = 365.25/7, start = decimal_date(ymd("2011-01-07")))
gglagplot(close.ts, lags = 4, do.lines = F) +
  labs(title = "Closing Stock Price Lag Plots") +
  scale_color_viridis_c() + 
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none")

volume.ts <- ts(dow$volume, freq = 365.25/7, start = decimal_date(ymd("2011-01-07")))
gglagplot(volume.ts, lags = 4, do.lines = FALSE) +
  labs(title = "Number of Shares Traded Lag Plots") +
  scale_color_viridis_c() + 
  scale_x_continuous(labels = function(x) format(x, big.mark = ",", scientific = F)) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = F)) +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "none", 
        axis.text.x = element_text(angle = 45, hjust = 1))

We then created three new variables in our data to be used as our predictors, the first lag of volume, close, and percent_change_next_weeks_price. The value of the first time period in each lagged variables is NA, so we assigned the column mean to these observations grouping the data by stock.

dow.lag <- dow %>% 
  group_by(stock) %>% 
  mutate(close.lag = lag(close, n = 1), 
         volume.lag = lag(volume, n = 1),
         pct_chnge_nxt_wk_price.lag = lag(percent_change_next_weeks_price, n = 1)) %>% 
  ungroup()

# New variables
head(dow.lag[,(ncol(dow.lag)-2):ncol(dow.lag)])

## # A tibble: 6 x 3
##   close.lag volume.lag pct_chnge_nxt_wk_price.lag
##       <dbl>      <dbl>                      <dbl>
## 1      NA           NA                     NA    
## 2      16.4  239655616                     -4.43 
## 3      16.0  242963398                     -2.47 
## 4      15.8  138428495                      1.64 
## 5      16.1  151379173                      5.93 
## 6      17.1  154387761                      0.231

dow.lag <- dow.lag %>% 
  group_by(stock) %>%
  mutate(close.lag = ifelse(is.na(close.lag), mean(close.lag, na.rm=T), close.lag),
         volume.lag = ifelse(is.na(volume.lag), mean(volume.lag, na.rm=T), volume.lag),
         pct_chnge_nxt_wk_price.lag = ifelse(is.na(pct_chnge_nxt_wk_price.lag),
                                             mean(pct_chnge_nxt_wk_price.lag, na.rm=T),
                                             pct_chnge_nxt_wk_price.lag)) %>% 
  ungroup()

Exploratory Analysis

Based on the scatterplots and correlations below, there does not appear to be any multicollinearity issues between our predictors. There is a significant negative relationship between the lags of the closing stock price and the number of shares traded. Overall, the linear relationships between the variables are relatively weak, which may affect the performance of the linear model.

dow.lag %>% 
  select(percent_change_next_weeks_price, pct_chnge_nxt_wk_price.lag, 
         close.lag, volume.lag, percent_change_volume_over_last_wk, quarter) %>%
  ggpairs(ggplot2::aes(color=quarter),
          columnLabels = c("Stock Return", "Stock Return Lag", "Close Lag", 
                           "Volume Lag", "Percent Change in Volume", "Quarter"), 
          title = "Stock Price Distributions of the Dow Jones Index\n",
          upper = list(combo=wrap("box_no_facet", alpha=0.7), discrete=wrap("barDiag", alpha=0.7)),
          diag = list(continuous=wrap("densityDiag", alpha=0.5), discrete=wrap("barDiag", alpha=0.7)),
          lower = list(combo=wrap("box_no_facet", alpha=0.7), continuous=wrap("smooth", alpha=0.5),
                       discrete=wrap("barDiag", alpha=0.7))) + 
  theme(strip.text.x = element_text(angle = 30, color = "black"))

Methodology

To compare our models on new, unseen data, we will train each model using data from January through May and assess their performance using June stock returns. This leaves 630 observations (21 weeks of data for each stock) in our training set and 120 observations in our test set.

dow.train <- dow.lag %>% 
  filter(date <= "2011-05-31") %>% 
  select(percent_change_next_weeks_price, pct_chnge_nxt_wk_price.lag, 
         close.lag, volume.lag, percent_change_volume_over_last_wk, stock)

dow.test <- dow.lag %>% 
  filter(date >= "2011-06-01") %>% 
  select(percent_change_next_weeks_price, pct_chnge_nxt_wk_price.lag, 
         close.lag, volume.lag, percent_change_volume_over_last_wk, stock)

Linear Regression

To predict which stocks will produce the greatest return, the first modeling technique we used was a Linear Regression model. Overall, the average test error of the model is 3.17 and the test R$^2$ is 28.53%, which means the model can explain about 29% of the variation in the stock returns on average.

stocks <- as_factor(unique(dow.lag$stock))
RMSE <- rep(NA, length(stocks))
RSquared <- rep(NA, length(stocks))
lm.stock.predictions <- data.frame(Stock = stocks, RMSE = RMSE, RSquared = RSquared)

for(i in 1:length(stocks)){
  
  stock.train <- subset(dow.train, stock == stocks[i])
  stock.test <- subset(dow.test, stock == stocks[i])
  
  lm.fit <- lm(percent_change_next_weeks_price ~ pct_chnge_nxt_wk_price.lag +
               close.lag + volume.lag + percent_change_volume_over_last_wk, 
               data = stock.train) 
  
  lm.preds <- lm.fit %>% 
    predict(stock.test) 
  
  lm.rmse <- RMSE(stock.test$percent_change_next_weeks_price, lm.preds)
  lm.stock.predictions[i, c("RMSE")] <- lm.rmse
  
  lm.r2 <- R2(stock.test$percent_change_next_weeks_price, lm.preds)
  lm.stock.predictions[i, c("RSquared")] <- lm.r2
}

# Overall accuracy
lm.rmse.avg <- mean(lm.stock.predictions$RMSE, na.rm = TRUE)
lm.r2.avg <- mean(lm.stock.predictions$RSquared, na.rm = TRUE)
data.frame(lm.rmse.avg, lm.r2.avg) %>% 
  mutate(lm.r2.avg = scales::percent(lm.r2.avg, accuracy = 0.01)) %>% 
  rename("Average Test Error" = lm.rmse.avg,
         "Average Test RSquared" = lm.r2.avg) %>% 
  kable(align = c("c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = "hover")

Average Test Error	Average Test RSquared
3.17	28.53%

# Individual stock models
lm.stock.predictions %>% 
  arrange(RMSE) %>% 
  mutate(RSquared = scales::percent(RSquared, accuracy = 0.01)) %>% 
  kable(align = c("l", "c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "400px")

Stock	RMSE	RSquared
KRFT	1.45	0.25%
JNJ	1.91	7.81%
T	1.94	0.00%
WMT	1.95	65.75%
MRK	1.97	7.86%
PG	2.13	24.51%
KO	2.52	1.58%
TRV	2.54	74.10%
DD	2.55	55.82%
IBM	2.55	0.21%
VZ	2.70	1.27%
CVX	2.77	84.18%
MCD	2.88	16.07%
PFE	2.90	2.11%
MMM	2.91	44.88%
BA	2.92	59.19%
INTC	2.95	61.26%
GE	3.20	30.91%
HPQ	3.21	0.45%
BAC	3.46	10.62%
JPM	3.52	4.13%
UTX	3.69	12.25%
DIS	3.88	44.76%
XOM	3.95	44.35%
CAT	4.05	66.48%
AXP	4.15	20.37%
MSFT	4.27	35.69%
HD	4.45	2.75%
CSCO	5.29	44.15%
AA	6.35	32.05%

Decision Trees

The next method we compared were decision trees. Decision trees are a type of nonparametric method that work by dividing the predictor space into distinct regions. When predicting a continuous outcome variable, regression trees partition the training data based on the measure that minimizes the Residual Sum of Squares (RSS). The data is further subdivided through a process called recursive binary splitting until there are too few observations to continue. Decision trees are very easy to interpret and can identify the most important variables based on where they are used to split the data. Overall, our regression tree has an average test error of 3.13, a slight improvement from the linear model. However, for some of the stocks, the decision tree is predicting the same return for each week, which creates NA values for the R$^2$ in these stocks.

stocks <- as_factor(unique(dow.lag$stock))
RMSE <- rep(NA, length(stocks))
RSquared <- rep(0, length(stocks))
tree.stock.predictions <- data.frame(Stock = stocks, RMSE = RMSE, RSquared = RSquared)

for(i in 1:length(stocks)){
  
  stock.train <- subset(dow.train, stock == stocks[i])
  stock.test <- subset(dow.test, stock == stocks[i])
  
  tree.fit <- tree(percent_change_next_weeks_price ~ pct_chnge_nxt_wk_price.lag +
                   close.lag + volume.lag + percent_change_volume_over_last_wk, 
                   data = stock.train)
  
  tree.preds <- tree.fit %>% 
    predict(stock.test) 
  
  tree.rmse <- RMSE(stock.test$percent_change_next_weeks_price, tree.preds)
  tree.stock.predictions[i, c("RMSE")] <- tree.rmse
  
  tree.r2 <- R2(stock.test$percent_change_next_weeks_price, tree.preds)
  tree.stock.predictions[i, c("RSquared")] <- tree.r2
}

# Overall accuracy
tree.rmse.avg <- mean(tree.stock.predictions$RMSE, na.rm = TRUE)
tree.r2.avg <- mean(tree.stock.predictions$RSquared, na.rm = TRUE)
data.frame(tree.rmse.avg, tree.r2.avg) %>% 
  mutate(tree.r2.avg = scales::percent(tree.r2.avg, accuracy = 0.01)) %>% 
  rename("Average Test Error" = tree.rmse.avg,
         "Average Test RSquared" = tree.r2.avg) %>% 
  kable(align = c("c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = "hover")

Average Test Error	Average Test RSquared
3.13	50.97%

# Individual stock models
tree.stock.predictions %>% 
  arrange(RMSE) %>% 
  mutate(RSquared = scales::percent(RSquared, accuracy = 0.01)) %>% 
  kable(align = c("l", "c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "400px")

Stock	RMSE	RSquared
VZ	1.80	50.52%
KRFT	1.92	NA
T	1.96	NA
JNJ	1.99	NA
MCD	2.07	85.45%
PG	2.37	68.99%
IBM	2.38	54.68%
WMT	2.44	78.51%
KO	2.53	94.38%
BAC	2.57	91.65%
GE	2.79	76.61%
MRK	2.80	68.64%
TRV	2.86	NA
MMM	2.86	37.13%
JPM	2.97	7.43%
HD	2.98	NA
DD	3.04	NA
UTX	3.36	22.83%
BA	3.42	40.09%
CVX	3.45	NA
DIS	3.52	16.74%
MSFT	3.54	NA
INTC	3.63	95.45%
HPQ	3.86	2.14%
AXP	3.92	6.40%
XOM	4.19	NA
PFE	4.23	35.63%
AA	4.64	33.31%
CSCO	4.80	58.28%
CAT	5.08	45.49%

Random Forests

To improve on our results, we then tried a random forest model. Random forests are built on many decision trees using different subsets of predictors randomly selected at each split. This produces a combination of trees that are relatively uncorrelated with each other. Using a technique called bootstrapping, the random forest uses a sample of the data set in fitting the model and evaluates its performance on the sample left out, known as the Out-Of-Bag (OOB) data set. In regression settings, the final predictions are calculated by taking the average prediction of each individual tree. Combining the results of multiple uncorrelated trees often improves the prediction accuracy. In this case, the random forest decreased the overall test error to 3.02, lower than both the linear model and the decision tree.

stocks <- as_factor(unique(dow.lag$stock))
RMSE <- rep(NA, length(stocks))
RSquared <- rep(NA, length(stocks))
rf.stock.predictions <- data.frame(Stock = stocks, RMSE = RMSE, RSquared = RSquared)

for(i in 1:length(stocks)){
  
  stock.train <- subset(dow.train, stock == stocks[i])
  stock.test <- subset(dow.test, stock == stocks[i])
  
  set.seed(1)
  rf.fit <- train(percent_change_next_weeks_price ~ pct_chnge_nxt_wk_price.lag +
                   close.lag + volume.lag + percent_change_volume_over_last_wk, 
                   data = stock.train,  method = "rf",
                   metric = "RMSE", preProcess = c("center","scale"), 
                   trControl = trainControl("cv", number = 10))
  
  rf.preds <- rf.fit %>% 
    predict(stock.test) 
  
  rf.rmse <- RMSE(stock.test$percent_change_next_weeks_price, rf.preds)
  rf.stock.predictions[i, c("RMSE")] <- rf.rmse

  rf.r2 <- R2(stock.test$percent_change_next_weeks_price, rf.preds)
  rf.stock.predictions[i, c("RSquared")] <- rf.r2
}

# Overall accuracy
rf.rmse.avg <- mean(rf.stock.predictions$RMSE, na.rm = TRUE)
rf.r2.avg <- mean(rf.stock.predictions$RSquared, na.rm = TRUE)
data.frame(rf.rmse.avg, rf.r2.avg) %>% 
  mutate(rf.r2.avg = scales::percent(rf.r2.avg, accuracy = 0.01)) %>% 
  rename("Average Test Error" = rf.rmse.avg,
         "Average Test RSquared" = rf.r2.avg) %>% 
  kable(align = c("c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = "hover")

Average Test Error	Average Test RSquared
3.02	40.47%

# Individual stock models
rf.stock.predictions %>% 
  arrange(RMSE) %>% 
  mutate(RSquared = scales::percent(RSquared, accuracy = 0.01)) %>% 
  kable(align = c("l", "c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "400px")

Stock	RMSE	RSquared
KRFT	1.20	17.94%
JNJ	1.79	82.09%
VZ	1.84	63.34%
T	1.96	0.24%
WMT	2.02	19.15%
KO	2.28	41.81%
MRK	2.31	54.87%
MCD	2.49	33.45%
IBM	2.59	7.44%
BA	2.68	76.45%
PG	2.70	57.99%
TRV	2.72	81.63%
INTC	2.85	53.87%
MMM	2.87	3.88%
DD	3.00	18.86%
BAC	3.14	17.68%
JPM	3.14	7.87%
DIS	3.24	3.26%
GE	3.28	35.58%
CVX	3.30	19.94%
HD	3.35	16.22%
AXP	3.43	63.99%
HPQ	3.73	87.16%
PFE	3.84	37.39%
MSFT	3.91	27.04%
XOM	3.91	36.81%
CAT	3.95	96.43%
UTX	4.02	50.92%
AA	4.17	36.72%
CSCO	4.96	64.20%

Support Vector Regression

For our final method, we used a Support Vector Regression model with a polynomial kernel. The goal in Support Vector Regression is to find a function, $f(x)$ that deviates from the observed response by a value no greater than the margin of tolerance (epsilon) for each training point. Our polynomial SVR model was able to minimize the test error to 2.95, lower than the previous models. Based on the overall RMSE, the model leads to test predictions that are within 2.95% of the true rate of return. On average, the SVR can explain around 42% of the variation in the stock return on the test set.

stocks <- as_factor(unique(dow.train$stock))
RMSE <- rep(NA, length(stocks))
RSquared <- rep(NA, length(stocks))
svm.stock.predictions <- data.frame(Stock = stocks, RMSE = RMSE, RSquared = RSquared)

final.pred <- rep(NA, length(stocks))
stock.predictions <- data.frame(Stock = stocks, Prediction = final.pred)

for(i in 1:length(stocks)){
  
  stock.train <- subset(dow.train, stock == stocks[i])
  stock.test <- subset(dow.test, stock == stocks[i])
  
  set.seed(1)
  svm.fit <- train(percent_change_next_weeks_price ~ pct_chnge_nxt_wk_price.lag +
                   close.lag + volume.lag + percent_change_volume_over_last_wk, 
                   data = stock.train,  method = "svmPoly",
                   metric = "RMSE", preProcess = c("center","scale"), 
                   trControl = trainControl(method = "cv"))
                    
  svm.preds <- svm.fit %>% 
    predict(stock.test) 
  stock.predictions[i, c("Prediction")] <- svm.preds[4] # Final prediction
  
  svm.rmse <- RMSE(stock.test$percent_change_next_weeks_price, svm.preds)
  svm.stock.predictions[i, c("RMSE")] <- svm.rmse
  
  svm.r2 <- R2(stock.test$percent_change_next_weeks_price, svm.preds)
  svm.stock.predictions[i, c("RSquared")] <- svm.r2
}

svm.rmse.avg <- mean(svm.stock.predictions$RMSE, na.rm = TRUE)
svm.r2.avg <- mean(svm.stock.predictions$RSquared, na.rm = TRUE)
data.frame(svm.rmse.avg, svm.r2.avg) %>% 
  mutate(svm.r2.avg = scales::percent(svm.r2.avg, accuracy = 0.01)) %>% 
  rename("Average Test Error" = svm.rmse.avg,
         "Average Test RSquared" = svm.r2.avg) %>% 
  kable(align = c("c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = "hover")

Average Test Error	Average Test RSquared
2.95	41.73%

svm.stock.predictions %>% 
  arrange(RMSE) %>% 
  mutate(RSquared = scales::percent(RSquared, accuracy = 0.01)) %>% 
  kable(align = c("l", "c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "400px")

Stock	RMSE	RSquared
KRFT	1.28	63.44%
WMT	1.65	76.59%
JNJ	1.72	23.18%
T	1.89	5.87%
MRK	1.93	0.04%
PG	2.21	16.84%
DD	2.26	65.74%
KO	2.39	49.26%
IBM	2.45	28.49%
PFE	2.57	1.18%
MCD	2.62	15.35%
TRV	2.65	77.15%
VZ	2.79	34.94%
MMM	2.86	13.09%
DIS	2.89	44.36%
BA	2.96	46.57%
HD	3.06	76.16%
JPM	3.09	28.81%
HPQ	3.22	63.28%
CVX	3.22	95.09%
INTC	3.34	13.23%
GE	3.40	12.04%
BAC	3.42	72.68%
XOM	3.52	30.39%
CAT	3.74	88.89%
UTX	3.75	9.81%
AXP	4.11	4.46%
MSFT	4.31	67.80%
CSCO	4.32	35.71%
AA	4.92	91.59%

Model Comparison

Overall, the Support Vector Regression model with a polynomial kernel produced the lowest test error with a fairly high R$^2$. Therefore, we decided to use this model to make our final predictions on next week’s stock returns and compared them to the risk of each stock using a Capital Asset Pricing Model.

data.frame(Model = c("Linear Model", "Regression Tree", "Random Forest", "Support Vector Regression"),
           RMSE = c(lm.rmse.avg, tree.rmse.avg, rf.rmse.avg, svm.rmse.avg),
           RSquared = c(lm.r2.avg, tree.r2.avg, rf.r2.avg, svm.r2.avg)) %>% 
  arrange(RMSE) %>% 
  mutate(RSquared = scales::percent(RSquared, accuracy = 0.01)) %>% 
  kable(align = c("l", "c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))

Model	RMSE	RSquared
Support Vector Regression	2.95	41.73%
Random Forest	3.02	40.47%
Regression Tree	3.13	50.97%
Linear Model	3.17	28.53%

Capital Asset Pricing Model

To determine the risk of each stock, we first created a Dow market indicator to compare each stock. The Dow is calculated by adding all of the stock prices of its 30 components and dividing the sum by a divisor, which in 2011 was about 0.132. Using this formula, we aggregated the data by date and calculated the sum of the stock’s closing prices and divided it by 0.132. We then used the delt() function to calculate the percentage change for each week’s stock return.

Once we had calculated our returns, we used a Capital Asset Pricing Model (CAPM) to calculate the betas, ${\beta}$, of each stock. ${\beta}$ is a statistical measure that captures the relationship between the returns of a security and the returns of the market. If a stock is riskier than the market, it will have a ${\beta}$ greater than 1 and if ${\beta}$ is less than 1, that stock is expected to reduce the risk of a portfolio. The model assumes there are no transaction costs, investments are infinitely divisible, and investors can access all information and are equally informed.

Our results show that 12 out of 30 stocks have a ${\beta}$ greater than 1, which means that about 40% of the stocks are more risky than the market. The stock of Boeing has the highest ${\beta}$ of 1.63, which indicates that it contains the greatest investment risk. The remaining 18 stocks are expected to reduce the risk of a portfolio. Overall, Kraft is the least risky stock in the Dow Jones Index with a ${\beta}$ of 0.19.

# Calculate Dow
dow.agg <- aggregate(dow$close, by = list(dow$date), FUN = function(x) sum(x)/0.132)
return.DOW <- na.omit(Delt(dow.agg[,2]))

stocks <- as_factor(unique(dow$stock))
stock.returns <- data.frame(matrix(0.0, ncol = 30, nrow = 24))
colnames(stock.returns) <- stocks

# Calculate stock returns
for(i in 1:length(stocks)){
  dow.sub <- subset(dow, stock == stocks[i])
  stock.returns[i] <- na.omit(Delt(dow.sub$close))
}

stock.returns <- cbind(stock.returns, return.DOW) %>% 
  rename(DOW = Delt.1.arithmetic)

# Calculate betas
beta.AA <- lm(AA ~ DOW, data = stock.returns)$coef[2]
beta.AXP <- lm(AXP ~ DOW, data = stock.returns)$coef[2]
beta.BA <- lm(BA ~ DOW, data = stock.returns)$coef[2]
beta.BAC <- lm(BAC ~ DOW, data = stock.returns)$coef[2]
beta.CAT <- lm(CAT ~ DOW, data = stock.returns)$coef[2]
beta.CSCO <- lm(CSCO ~ DOW, data = stock.returns)$coef[2]
beta.CVX <- lm(CVX ~ DOW, data = stock.returns)$coef[2]
beta.DD <- lm(DD ~ DOW, data = stock.returns)$coef[2]
beta.DIS <- lm(DIS ~ DOW, data = stock.returns)$coef[2]
beta.GE <- lm(GE ~ DOW, data = stock.returns)$coef[2]
beta.HD <- lm(HD ~ DOW, data = stock.returns)$coef[2]
beta.HPQ <- lm(HPQ ~ DOW, data = stock.returns)$coef[2]
beta.IBM <- lm(IBM ~ DOW, data = stock.returns)$coef[2]
beta.INTC <- lm(INTC ~ DOW, data = stock.returns)$coef[2]
beta.JNJ <- lm(JNJ ~ DOW, data = stock.returns)$coef[2]
beta.JPM <- lm(JPM ~ DOW, data = stock.returns)$coef[2]
beta.KRFT <- lm(KRFT ~ DOW, data = stock.returns)$coef[2]
beta.KO <- lm(KO ~ DOW, data = stock.returns)$coef[2]
beta.MCD <- lm(MCD ~ DOW, data = stock.returns)$coef[2]
beta.MMM <- lm(MMM ~ DOW, data = stock.returns)$coef[2]
beta.MRK <- lm(MRK ~ DOW, data = stock.returns)$coef[2]
beta.MSFT <- lm(MSFT ~ DOW, data = stock.returns)$coef[2]
beta.PFE <- lm(PFE ~ DOW, data = stock.returns)$coef[2]
beta.PG <- lm(PG ~ DOW, data = stock.returns)$coef[2]
beta.T <- lm(`T` ~ DOW, data = stock.returns)$coef[2]
beta.TRV <- lm(TRV ~ DOW, data = stock.returns)$coef[2]
beta.UTX <- lm(UTX ~ DOW, data = stock.returns)$coef[2]
beta.VZ <- lm(VZ ~ DOW, data = stock.returns)$coef[2]
beta.WMT <- lm(WMT ~ DOW, data = stock.returns)$coef[2]
beta.XOM <- lm(XOM ~ DOW, data = stock.returns)$coef[2]

df <- data.frame(Stock = c("Alcoa Corporation", "American Express", "Boeing", "Bank of America", 
                           "Caterpillar Inc.", "Cisco Systems, Inc.", "Chevron", "DuPont", 
                           "Disney", "General Electric", "Home Depot", "HP Inc.", "IBM", 
                           "Intel", "Johnson & Johnson", "J.P. Morgan Chase & Co.", "Kraft",
                           "Coca Cola", "McDonald's", "3M", "Merck & Co.", "Microsoft", 
                           "Pfizer", "Procter & Gamble", "AT&T", "Travelers", "United Technologies",
                           "Verizon", "Walmart", "Exxon Mobil"),
                 Beta = c(beta.AA, beta.AXP, beta.BA, beta.BAC, beta.CAT, beta.CSCO,
                          beta.CVX, beta.DD, beta.DIS, beta.GE, beta.HD, beta.HPQ, beta.IBM,
                          beta.INTC, beta.JNJ, beta.JPM, beta.KRFT, beta.KO, beta.MCD,
                          beta.MMM, beta.MRK, beta.MSFT, beta.PFE, beta.PG, beta.T, beta.TRV,
                          beta.UTX, beta.VZ, beta.WMT, beta.XOM)) 
df <- cbind(df, return = stock.predictions[,2])
df %>% 
  arrange(desc(Beta)) %>%
  transmute(Stock = Stock, Beta = Beta, 
            "Predicted Return" = scales::percent(return, scale = 1, accuracy = 0.01)) %>% 
  kable(align = c("l", "c", "c"), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  scroll_box(height = "400px")

Stock	Beta	Predicted Return
Boeing	1.63	2.67%
Caterpillar Inc.	1.49	2.68%
Disney	1.49	1.89%
HP Inc.	1.43	-0.48%
General Electric	1.36	1.57%
Alcoa Corporation	1.30	-0.18%
Intel	1.25	0.13%
3M	1.24	0.54%
United Technologies	1.20	0.05%
Exxon Mobil	1.20	1.70%
DuPont	1.13	3.52%
American Express	1.09	0.46%
Travelers	1.00	0.38%
Home Depot	0.99	0.91%
IBM	0.92	0.56%
J.P. Morgan Chase & Co.	0.87	2.28%
Chevron	0.84	1.43%
AT&T	0.81	0.09%
Verizon	0.76	-0.08%
Johnson & Johnson	0.76	0.49%
McDonald’s	0.74	0.23%
Pfizer	0.72	0.67%
Microsoft	0.71	-0.72%
Walmart	0.69	0.44%
Coca Cola	0.68	0.38%
Cisco Systems, Inc.	0.65	-0.87%
Bank of America	0.65	-0.82%
Merck & Co.	0.54	-0.01%
Procter & Gamble	0.49	-0.97%
Kraft	0.19	0.64%

Our predictions indicate that DuPont and Caterpillar will have the greatest rate of return in the following week. Based on the table above, some of the more risky stocks are associated with higher returns. For instance, DuPont is expected to have about a 3.5% increase in return in the following week, followed by Caterpillar which is expected to have around a 2.7% increase in return. Additionally, based on the boxplots below, some of the more risky stocks are associated with more variance in their returns. For example, Alcoa Corporation (AA), Caterpillar (CAT), and Exxon Mobil (XOM) have wider spreads in their distributions and have a $\beta$ of 1.3, 1.49, and 1.2, respectively.

stock.returns %>% 
  pivot_longer(cols = everything(), names_to = "stock", values_to = "return") %>% 
  ggplot(aes(x = return, y = fct_rev(stock))) +
  geom_boxplot(aes(fill = stock), alpha = 0.5, show.legend = FALSE) + 
  scale_x_continuous(labels = scales::percent_format()) +
  coord_cartesian(xlim = c(-0.1, 0.1)) +
  labs(title = "Dow Jones Stock Return Distributions \n", x = "\n Expected Return", y = "Stock\n")

Conclusion

To predict which stocks will have the greatest return in the following week, we compared four different modeling techniques: linear regression, decision trees, random forests, and support vector machines. Overall, the Support Vector Regression model with a polynomial kernel produced the lowest test error compared to the other methods. Based on our predictions, we believe that DuPont and Caterpillar will have the highest returns. In considering the risks of the different stocks, our CAPM suggests that Boeing’s stock has the highest volatility in comparison to the Dow Jones Index with a $\beta$ of 1.63 while Kraft’s stock is less volatile than the overall market with a $\beta$ of 0.19.

References

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer.