In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. There, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we'll use cluster-then-predict to predict future stock prices using historical stock data.
When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.
For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.
Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.
This dataset contains the following variables:
ReturnJan = the return for the company’s stock during January (in the year of the observation).
ReturnFeb = the return for the company’s stock during February (in the year of the observation).
ReturnMar = the return for the company’s stock during March (in the year of the observation).
ReturnApr = the return for the company’s stock during April (in the year of the observation).
ReturnMay = the return for the company’s stock during May (in the year of the observation).
ReturnJune = the return for the company’s stock during June (in the year of the observation).
ReturnJuly = the return for the company’s stock during July (in the year of the observation).
ReturnAug = the return for the company’s stock during August (in the year of the observation).
ReturnSep = the return for the company’s stock during September (in the year of the observation).
ReturnOct = the return for the company’s stock during October (in the year of the observation).
ReturnNov = the return for the company’s stock during November (in the year of the observation).
PositiveDec = whether or not the company’s stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.
For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_6_Clustering")
stocks <- read.csv("StocksCluster.csv")
str(stocks)
## 'data.frame': 11580 obs. of 12 variables:
## $ ReturnJan : num 0.0807 -0.0107 0.0477 -0.074 -0.031 ...
## $ ReturnFeb : num 0.0663 0.1021 0.036 -0.0482 -0.2127 ...
## $ ReturnMar : num 0.0329 0.1455 0.0397 0.0182 0.0915 ...
## $ ReturnApr : num 0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
## $ ReturnMay : num 0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
## $ ReturnJune : num -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
## $ ReturnJuly : num -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
## $ ReturnAug : num 0.0247 0.2113 0.0334 0.0953 0.0568 ...
## $ ReturnSep : num -0.0204 -0.58 0 0.0567 0.0336 ...
## $ ReturnOct : num -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
## $ ReturnNov : num -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
## $ PositiveDec: int 0 0 0 1 1 1 1 0 0 0 ...
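A quick check of the class balance — the proportion of observations with a positive December return — gives useful context for the accuracies computed below:

mean(stocks$PositiveDec)   # proportion of observations with PositiveDec = 1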
library(caTools)
set.seed(144)
spl <- sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain <- subset(stocks, spl == TRUE)
stocksTest <- subset(stocks, spl == FALSE)
stocksModel <- glm(PositiveDec ~ ., data = stocksTrain, family = binomial)
stocksPred <- predict(stocksModel, type = "response")  # predicted probabilities on the training set
table(stocksTrain$PositiveDec, stocksPred >= 0.5)
##
## FALSE TRUE
## 0 990 2689
## 1 787 3640
Note the type = "response" argument in the predict call, which makes predict return predicted probabilities. Had we typed method = "response" instead, the argument would have been ignored (predict has no method argument), type would have defaulted to "link", and predict would have returned log-odds rather than probabilities, so thresholding at 0.5 would have produced the wrong table.
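The training-set accuracy implied by the table above is (990 + 3640) / 8106, roughly 0.571. A quick way to compute it directly from the objects already defined:

mean((stocksPred >= 0.5) == stocksTrain$PositiveDec)   # training-set accuracy, about 0.571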
Evaluate the model on the test set
stocksPred_test <- predict(stocksModel, newdata = stocksTest, type = "response")
table(stocksTest$PositiveDec, stocksPred_test >= 0.5)
##
## FALSE TRUE
## 0 417 1160
## 1 344 1553
The baseline model predicts the most frequent outcome (PositiveDec = 1) for every test-set observation, giving a baseline accuracy of 1897/3474 = 0.5460564:
table(stocksTest$PositiveDec)
##
## 0 1
## 1577 1897
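Both numbers can be checked directly from the objects above:

mean((stocksPred_test >= 0.5) == stocksTest$PositiveDec)   # test-set accuracy, (417 + 1553)/3474, about 0.567
max(table(stocksTest$PositiveDec)) / nrow(stocksTest)      # baseline accuracy, 1897/3474, about 0.546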
First, remove the dependent variable. In cluster-then-predict the outcome is unknown at prediction time, so it cannot be used to form the clusters.
limitedTrain <- stocksTrain
limitedTrain$PositiveDec <- NULL
limitedTest <- stocksTest
limitedTest$PositiveDec <- NULL
Normalize the data, so that each return variable has mean 0 and standard deviation 1; otherwise, months with larger variances would dominate the distance calculations in k-means.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(limitedTrain)
normTrain <- predict(preproc, limitedTrain)
normTest <- predict(preproc, limitedTest)
Note that preProcess computes the centering and scaling parameters (each variable's mean and standard deviation) from the training set only, and those same parameters are then applied to the test set. Consequently, the normalized test-set variables do not have exactly mean 0 and standard deviation 1:
summary(normTest)
## ReturnJan ReturnFeb ReturnMar
## Min. :-3.743836 Min. :-3.251044 Min. :-4.07731
## 1st Qu.:-0.485690 1st Qu.:-0.348951 1st Qu.:-0.40662
## Median :-0.066856 Median :-0.006860 Median :-0.05674
## Mean :-0.000419 Mean :-0.003862 Mean : 0.00583
## 3rd Qu.: 0.357729 3rd Qu.: 0.264647 3rd Qu.: 0.35653
## Max. : 8.412973 Max. : 9.552365 Max. : 9.00982
## ReturnApr ReturnMay ReturnJune
## Min. :-4.47865 Min. :-5.84445 Min. :-4.73628
## 1st Qu.:-0.51121 1st Qu.:-0.43819 1st Qu.:-0.44968
## Median :-0.11414 Median :-0.05346 Median :-0.02678
## Mean :-0.03638 Mean : 0.02651 Mean : 0.04315
## 3rd Qu.: 0.32742 3rd Qu.: 0.42290 3rd Qu.: 0.43010
## Max. : 6.84589 Max. : 7.21362 Max. :29.00534
## ReturnJuly ReturnAug ReturnSep
## Min. :-5.201454 Min. :-4.62097 Min. :-3.57222
## 1st Qu.:-0.512039 1st Qu.:-0.51546 1st Qu.:-0.38067
## Median :-0.026576 Median :-0.10277 Median : 0.08215
## Mean : 0.006016 Mean :-0.04973 Mean : 0.02939
## 3rd Qu.: 0.457193 3rd Qu.: 0.38781 3rd Qu.: 0.45847
## Max. :12.790901 Max. : 6.66889 Max. : 7.09106
## ReturnOct ReturnNov
## Min. :-3.807577 Min. :-4.881463
## 1st Qu.:-0.393856 1st Qu.:-0.396764
## Median : 0.006783 Median :-0.002337
## Mean : 0.029672 Mean : 0.017128
## 3rd Qu.: 0.419005 3rd Qu.: 0.424617
## Max. : 7.428466 Max. :21.007786
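To confirm, compare the means of ReturnJan in the two normalized sets: the training-set mean is essentially zero by construction, while the test-set mean is slightly negative (about -0.0004, matching the summary above), because the training-set mean and standard deviation were used to normalize the test set:

mean(normTrain$ReturnJan)   # essentially zero (up to floating-point error)
mean(normTest$ReturnJan)    # about -0.0004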
set.seed(144)
km <- kmeans(normTrain, centers = 3)
str(km)
## List of 9
## $ cluster : Named int [1:8106] 1 1 1 3 1 3 2 2 1 3 ...
## ..- attr(*, "names")= chr [1:8106] "1" "2" "4" "6" ...
## $ centers : num [1:3, 1:11] -0.4523 0.2641 0.7425 -0.1497 -0.0415 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:11] "ReturnJan" "ReturnFeb" "ReturnMar" "ReturnApr" ...
## $ totss : num 89155
## $ withinss : num [1:3] 31204 38032 9937
## $ tot.withinss: num 79173
## $ betweenss : num 9982
## $ size : int [1:3] 3157 4696 253
## $ iter : int 5
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
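The cluster sizes are also reported directly by the kmeans object; from the output above, the third cluster is by far the smallest:

km$size   # 3157 4696 253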
Next, obtain the cluster assignments for the training and test sets. The test observations are assigned to the clusters that were built on the training set; we never re-cluster the test data. The flexclust package provides the predict method needed to do this.
library(flexclust)
## Loading required package: grid
## Loading required package: modeltools
## Loading required package: stats4
km.kcca <- as.kcca(km, normTrain)                     # convert the kmeans object to a flexclust kcca object
clusterTrain <- predict(km.kcca)                      # cluster assignments for the training set
clusterTest <- predict(km.kcca, newdata = normTest)   # assign test observations to the nearest training-set centroid
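As a sanity check, the training-set assignments returned by predict should reproduce the k-means cluster sizes shown in km$size above:

table(clusterTrain)   # expect 3157, 4696, 253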
How many test-set observations were assigned to Cluster 2?
table(clusterTest)
## clusterTest
## 1 2 3
## 1298 2080 96
Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.
stocksTrain1 <- subset(stocksTrain, clusterTrain == 1)
stocksTrain2 <- subset(stocksTrain, clusterTrain == 2)
stocksTrain3 <- subset(stocksTrain, clusterTrain == 3)
stocksTest1 <- subset(stocksTest, clusterTest == 1)
stocksTest2 <- subset(stocksTest, clusterTest == 2)
stocksTest3 <- subset(stocksTest, clusterTest == 3)
Which training set data frame has the highest average value of the dependent variable?
tapply(stocksTrain$PositiveDec, clusterTrain, mean)
## 1 2 3
## 0.6024707 0.5140545 0.4387352
Build a separate logistic regression model for each cluster, using PositiveDec as the dependent variable and all other variables as independent variables.
stocksModel1 <- glm(PositiveDec ~ ., data = stocksTrain1, family = binomial)
stocksModel2 <- glm(PositiveDec ~ ., data = stocksTrain2, family = binomial)
stocksModel3 <- glm(PositiveDec ~ ., data = stocksTrain3, family = binomial)
Using stocksModel1, make test-set predictions called predictTest1 on the data frame stocksTest1. Similarly, use stocksModel2 to make predictions predictTest2 on stocksTest2, and stocksModel3 to make predictions predictTest3 on stocksTest3.
predictTest1 <- predict(stocksModel1, newdata = stocksTest1, type = "response")
predictTest2 <- predict(stocksModel2, newdata = stocksTest2, type = "response")
predictTest3 <- predict(stocksModel3, newdata = stocksTest3, type = "response")
Evaluate the test-set accuracy of each cluster-specific model.
table(stocksTest1$PositiveDec, predictTest1 >= 0.5)
##
## FALSE TRUE
## 0 30 471
## 1 23 774
table(stocksTest2$PositiveDec, predictTest2 >= 0.5)
##
## FALSE TRUE
## 0 388 626
## 1 309 757
table(stocksTest3$PositiveDec, predictTest3 >= 0.5)
##
## FALSE TRUE
## 0 49 13
## 1 21 13
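From these three tables, the test-set accuracies of the cluster-specific models are approximately 0.619, 0.550, and 0.646 for clusters 1, 2, and 3, respectively. A minimal sketch to compute them from the objects above:

mean((predictTest1 >= 0.5) == stocksTest1$PositiveDec)   # about 0.619
mean((predictTest2 >= 0.5) == stocksTest2$PositiveDec)   # about 0.550
mean((predictTest3 >= 0.5) == stocksTest3$PositiveDec)   # about 0.646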
To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:
allPredictions <- c(predictTest1, predictTest2, predictTest3)
allOutcomes <- c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
Evaluate the overall accuracy.
table(allOutcomes, allPredictions>= 0.5)
##
## allOutcomes FALSE TRUE
## 0 467 1110
## 1 353 1544
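From this table, the overall test-set accuracy of the cluster-then-predict approach is (467 + 1544)/3474, about 0.579, compared with about 0.567 for the single logistic regression model. A direct computation:

mean((allPredictions >= 0.5) == allOutcomes)   # overall accuracy, about 0.579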
We see a modest improvement over the original logistic regression model. Since predicting stock returns is a notoriously hard problem, this is a worthwhile increase in accuracy. By investing in the stocks we are most confident will have positive returns (those with the highest predicted probabilities), the cluster-then-predict model can give us an edge over the original logistic regression model.