Source: Analytics Edge Unit 6 Homework

Techniques involved: Normalization, cluster-then-predict

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we’ll use cluster-then-predict to predict future stock prices using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.

For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

ReturnJan = the return for the company’s stock during January (in the year of the observation).

ReturnFeb = the return for the company’s stock during February (in the year of the observation).

ReturnMar = the return for the company’s stock during March (in the year of the observation).

ReturnApr = the return for the company’s stock during April (in the year of the observation).

ReturnMay = the return for the company’s stock during May (in the year of the observation).

ReturnJune = the return for the company’s stock during June (in the year of the observation).

ReturnJuly = the return for the company’s stock during July (in the year of the observation).

ReturnAug = the return for the company’s stock during August (in the year of the observation).

ReturnSep = the return for the company’s stock during September (in the year of the observation).

ReturnOct = the return for the company’s stock during October (in the year of the observation).

ReturnNov = the return for the company’s stock during November (in the year of the observation).

PositiveDec = whether or not the company’s stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.

Load the data

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_6_Clustering")
stocks <- read.csv("StocksCluster.csv")
str(stocks)
## 'data.frame':    11580 obs. of  12 variables:
##  $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
##  $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
##  $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
##  $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
##  $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
##  $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
##  $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
##  $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
##  $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
##  $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
##  $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
##  $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...
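
As a quick check of how balanced the dependent variable is, we can compute the proportion of observations with a positive December return (since sample.split below stratifies on the outcome, this matches the test-set proportion of about 0.546):

mean(stocks$PositiveDec)  # about 0.546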

INITIAL LOGISTIC REGRESSION MODEL

library(caTools)
set.seed(144)
# sample.split stratifies on the outcome, so the training and test sets
# have the same proportion of positive December returns
spl <- sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain <- subset(stocks, spl == TRUE)
stocksTest <- subset(stocks, spl == FALSE)
stocksModel <- glm(PositiveDec ~ ., data = stocksTrain, family = binomial)
# Without a newdata argument, predict returns training-set probabilities
stocksPred <- predict(stocksModel, type = "response")
table(stocksTrain$PositiveDec, stocksPred >= 0.5)
##    
##     FALSE TRUE
##   0   990 2689
##   1   787 3640
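
The training-set accuracy at a 0.5 threshold follows directly from this confusion matrix:

# Correct predictions / total training observations
(990 + 3640) / nrow(stocksTrain)  # = 4630/8106, about 0.5712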

Note the type argument in the predict command. Had we typed method = "response" instead, predict would have silently ignored the unknown argument and returned predictions on the link (log-odds) scale rather than probabilities.

Evaluate the model on the test set

stocksPred_test <- predict(stocksModel, newdata = stocksTest, type = "response")
table(stocksTest$PositiveDec, stocksPred_test >= 0.5)
##    
##     FALSE TRUE
##   0   417 1160
##   1   344 1553

The logistic regression model's test-set accuracy is (417 + 1553)/3474 ≈ 0.5671. The baseline of always predicting the most common outcome (PositiveDec = 1) achieves accuracy 1897/3474 ≈ 0.5460564:

table(stocksTest$PositiveDec)
## 
##    0    1 
## 1577 1897
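
Both accuracies follow directly from the two tables above:

# Logistic regression test-set accuracy at a 0.5 threshold
(417 + 1553) / nrow(stocksTest)  # = 1970/3474, about 0.5671
# Baseline: always predict the most common outcome (PositiveDec = 1)
1897 / (1577 + 1897)  # about 0.5461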

Clustering the stocks

First, remove the dependent variable: the clusters must be built without using the outcome we are trying to predict.

limitedTrain <- stocksTrain
limitedTrain$PositiveDec <- NULL
limitedTest <- stocksTest
limitedTest$PositiveDec <- NULL

Normalize the data

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(limitedTrain)
normTrain <- predict(preproc, limitedTrain)
normTest <- predict(preproc, limitedTest)
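
With no method argument, preProcess defaults to centering and scaling each variable. As a minimal sketch of what that means here (trainMeans, trainSDs, and the *Manual names are hypothetical, for illustration only), the training-set statistics are used to transform both data frames:

# Manual equivalent of preProcess's default center-and-scale
trainMeans <- colMeans(limitedTrain)
trainSDs   <- apply(limitedTrain, 2, sd)
normTrainManual <- as.data.frame(scale(limitedTrain, center = trainMeans, scale = trainSDs))
normTestManual  <- as.data.frame(scale(limitedTest,  center = trainMeans, scale = trainSDs))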

Note that preProcess learns the centering and scaling parameters from the training set and applies them to both sets, so the test set is not normalized to exactly zero mean and unit standard deviation:

summary(normTest)
##    ReturnJan           ReturnFeb           ReturnMar       
##  Min.   :-3.743836   Min.   :-3.251044   Min.   :-4.07731  
##  1st Qu.:-0.485690   1st Qu.:-0.348951   1st Qu.:-0.40662  
##  Median :-0.066856   Median :-0.006860   Median :-0.05674  
##  Mean   :-0.000419   Mean   :-0.003862   Mean   : 0.00583  
##  3rd Qu.: 0.357729   3rd Qu.: 0.264647   3rd Qu.: 0.35653  
##  Max.   : 8.412973   Max.   : 9.552365   Max.   : 9.00982  
##    ReturnApr          ReturnMay          ReturnJune      
##  Min.   :-4.47865   Min.   :-5.84445   Min.   :-4.73628  
##  1st Qu.:-0.51121   1st Qu.:-0.43819   1st Qu.:-0.44968  
##  Median :-0.11414   Median :-0.05346   Median :-0.02678  
##  Mean   :-0.03638   Mean   : 0.02651   Mean   : 0.04315  
##  3rd Qu.: 0.32742   3rd Qu.: 0.42290   3rd Qu.: 0.43010  
##  Max.   : 6.84589   Max.   : 7.21362   Max.   :29.00534  
##    ReturnJuly          ReturnAug          ReturnSep       
##  Min.   :-5.201454   Min.   :-4.62097   Min.   :-3.57222  
##  1st Qu.:-0.512039   1st Qu.:-0.51546   1st Qu.:-0.38067  
##  Median :-0.026576   Median :-0.10277   Median : 0.08215  
##  Mean   : 0.006016   Mean   :-0.04973   Mean   : 0.02939  
##  3rd Qu.: 0.457193   3rd Qu.: 0.38781   3rd Qu.: 0.45847  
##  Max.   :12.790901   Max.   : 6.66889   Max.   : 7.09106  
##    ReturnOct           ReturnNov        
##  Min.   :-3.807577   Min.   :-4.881463  
##  1st Qu.:-0.393856   1st Qu.:-0.396764  
##  Median : 0.006783   Median :-0.002337  
##  Mean   : 0.029672   Mean   : 0.017128  
##  3rd Qu.: 0.419005   3rd Qu.: 0.424617  
##  Max.   : 7.428466   Max.   :21.007786

K-means clustering

set.seed(144)
km <- kmeans(normTrain, centers = 3)
str(km)
## List of 9
##  $ cluster     : Named int [1:8106] 1 1 1 3 1 3 2 2 1 3 ...
##   ..- attr(*, "names")= chr [1:8106] "1" "2" "4" "6" ...
##  $ centers     : num [1:3, 1:11] -0.4523 0.2641 0.7425 -0.1497 -0.0415 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:11] "ReturnJan" "ReturnFeb" "ReturnMar" "ReturnApr" ...
##  $ totss       : num 89155
##  $ withinss    : num [1:3] 31204 38032 9937
##  $ tot.withinss: num 79173
##  $ betweenss   : num 9982
##  $ size        : int [1:3] 3157 4696 253
##  $ iter        : int 5
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
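
The cluster sizes are also available directly from the fitted object; as the str output shows, cluster 2 contains the most training observations:

km$size  # 3157 4696 253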

Obtain the training-set and test-set cluster assignments. Base R's kmeans object has no predict method, so we convert it to a kcca object using the flexclust package, which lets us assign each test observation to the cluster with the nearest training-set centroid.

library(flexclust)
## Loading required package: grid
## Loading required package: modeltools
## Loading required package: stats4
km.kcca <- as.kcca(km, normTrain)
clusterTrain <- predict(km.kcca)
clusterTest <- predict(km.kcca, newdata = normTest)
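
For intuition, the assignment rule applied here can be sketched manually: each normalized test observation gets the cluster of its nearest training centroid in Euclidean distance (nearestTestCluster is a hypothetical name; its values should agree with clusterTest):

# Squared Euclidean distance from each test row to each k-means centroid;
# keep the index of the closest centroid
nearestTestCluster <- apply(as.matrix(normTest), 1, function(x) {
  which.min(colSums((t(km$centers) - x)^2))
})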

How many test-set observations were assigned to Cluster 2?

table(clusterTest)
## clusterTest
##    1    2    3 
## 1298 2080   96

Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

stocksTrain1 <- subset(stocksTrain, clusterTrain == 1)
stocksTrain2 <- subset(stocksTrain, clusterTrain == 2)
stocksTrain3 <- subset(stocksTrain, clusterTrain == 3)

stocksTest1 <- subset(stocksTest, clusterTest == 1)
stocksTest2 <- subset(stocksTest, clusterTest == 2)
stocksTest3 <- subset(stocksTest, clusterTest == 3)
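
A quick sanity check that the three cluster subsets partition the test set (1298 + 2080 + 96 = 3474 observations):

nrow(stocksTest1) + nrow(stocksTest2) + nrow(stocksTest3) == nrow(stocksTest)  # TRUE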

Which training set data frame has the highest average value of the dependent variable?

tapply(stocksTrain$PositiveDec, clusterTrain, mean)
##         1         2         3 
## 0.6024707 0.5140545 0.4387352

stocksTrain1 has the highest average value of the dependent variable (about 0.602), and stocksTrain3 the lowest (about 0.439).

CLUSTER-SPECIFIC PREDICTIONS

Build a logistic regression model for each cluster, using all eleven monthly returns as predictors

stocksModel1 <- glm(PositiveDec ~ ., data = stocksTrain1, family = binomial)
stocksModel2 <- glm(PositiveDec ~ ., data = stocksTrain2, family = binomial)
stocksModel3 <- glm(PositiveDec ~ ., data = stocksTrain3, family = binomial)

Using stocksModel1, make test-set predictions called predictTest1 on the data frame stocksTest1. Using stocksModel2, make test-set predictions called predictTest2 on stocksTest2. Using stocksModel3, make test-set predictions called predictTest3 on stocksTest3.

predictTest1 <- predict(stocksModel1, newdata = stocksTest1, type = "response")
predictTest2 <- predict(stocksModel2, newdata = stocksTest2, type = "response")
predictTest3 <- predict(stocksModel3, newdata = stocksTest3, type = "response")

Evaluate each model's accuracy on its corresponding test-set cluster

table(stocksTest1$PositiveDec, predictTest1 >= 0.5)
##    
##     FALSE TRUE
##   0    30  471
##   1    23  774
table(stocksTest2$PositiveDec, predictTest2 >= 0.5)
##    
##     FALSE TRUE
##   0   388  626
##   1   309  757
table(stocksTest3$PositiveDec, predictTest3 >= 0.5)
##    
##     FALSE TRUE
##   0    49   13
##   1    21   13
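
From the three confusion matrices, the cluster-specific test-set accuracies at a 0.5 threshold are:

# Test-set accuracy within each cluster
(30 + 774) / nrow(stocksTest1)   # = 804/1298, about 0.6194
(388 + 757) / nrow(stocksTest2)  # = 1145/2080, about 0.5505
(49 + 13) / nrow(stocksTest3)    # = 62/96, about 0.6458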

To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:

allPredictions <- c(predictTest1, predictTest2, predictTest3)
allOutcomes <- c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

Evaluate the overall accuracy

table(allOutcomes, allPredictions >= 0.5)
##            
## allOutcomes FALSE TRUE
##           0   467 1110
##           1   353 1544
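
The overall accuracy of the combined predictions:

# Overall cluster-then-predict test-set accuracy at a 0.5 threshold
(467 + 1544) / length(allOutcomes)  # = 2011/3474, about 0.5789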

We see a modest improvement over the original logistic regression model: the overall test-set accuracy rises from about 0.5671 to about 0.5789. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in the stocks we are most confident will have positive returns (those with the highest predicted probabilities), the cluster-then-predict model can give us an edge over the original logistic regression model.