Homework assignment for MITx “The Analytics Edge”

Cluster-then-predict is a methodology in which you first cluster observations and then build cluster-specific prediction models. For example, this methodology can help improve the prediction of heart attack risk. We’ll use cluster-then-predict to predict future stock prices using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. We will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(knitr)
library(kableExtra)
library(formattable)
library(caTools) # training/test set creation tools
library(caret) # has preProcess command to normalize data
library(flexclust) # obtain training set and testing sut cluster assignments for observations

data

For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

stocks <- read.csv("StocksCluster.csv")
str(stocks)
## 'data.frame':    11580 obs. of  12 variables:
##  $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
##  $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
##  $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
##  $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
##  $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
##  $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
##  $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
##  $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
##  $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
##  $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
##  $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
##  $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...

This dataset contains the following variables:

ReturnJan = the return for the company's stock during January (in the year of the observation). 
ReturnFeb = the return for the company's stock during February (in the year of the observation). 
ReturnMar = the return for the company's stock during March (in the year of the observation). 
ReturnApr = the return for the company's stock during April (in the year of the observation). 
ReturnMay = the return for the company's stock during May (in the year of the observation). 
ReturnJune = the return for the company's stock during June (in the year of the observation). 
ReturnJuly = the return for the company's stock during July (in the year of the observation). 
ReturnAug = the return for the company's stock during August (in the year of the observation). 
ReturnSep = the return for the company's stock during September (in the year of the observation). 
ReturnOct = the return for the company's stock during October (in the year of the observation). 
ReturnNov = the return for the company's stock during November (in the year of the observation). 
PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.

summary(stocks)
##    ReturnJan            ReturnFeb           ReturnMar        
##  Min.   :-0.7616205   Min.   :-0.690000   Min.   :-0.712994  
##  1st Qu.:-0.0691663   1st Qu.:-0.077748   1st Qu.:-0.046389  
##  Median : 0.0009965   Median :-0.010626   Median : 0.009878  
##  Mean   : 0.0126316   Mean   :-0.007605   Mean   : 0.019402  
##  3rd Qu.: 0.0732606   3rd Qu.: 0.043600   3rd Qu.: 0.077066  
##  Max.   : 3.0683060   Max.   : 6.943694   Max.   : 4.008621  
##    ReturnApr           ReturnMay          ReturnJune       
##  Min.   :-0.826503   Min.   :-0.92207   Min.   :-0.717920  
##  1st Qu.:-0.054468   1st Qu.:-0.04640   1st Qu.:-0.063966  
##  Median : 0.009059   Median : 0.01293   Median :-0.000880  
##  Mean   : 0.026308   Mean   : 0.02474   Mean   : 0.005938  
##  3rd Qu.: 0.085338   3rd Qu.: 0.08396   3rd Qu.: 0.061566  
##  Max.   : 2.528827   Max.   : 6.93013   Max.   : 4.339713  
##    ReturnJuly           ReturnAug           ReturnSep        
##  Min.   :-0.7613096   Min.   :-0.726800   Min.   :-0.839730  
##  1st Qu.:-0.0731917   1st Qu.:-0.046272   1st Qu.:-0.074648  
##  Median :-0.0008047   Median : 0.007205   Median :-0.007616  
##  Mean   : 0.0030509   Mean   : 0.016198   Mean   :-0.014721  
##  3rd Qu.: 0.0718205   3rd Qu.: 0.070783   3rd Qu.: 0.049476  
##  Max.   : 2.5500000   Max.   : 3.626609   Max.   : 5.863980  
##    ReturnOct           ReturnNov          PositiveDec    
##  Min.   :-0.685504   Min.   :-0.747171   Min.   :0.0000  
##  1st Qu.:-0.070915   1st Qu.:-0.054890   1st Qu.:0.0000  
##  Median : 0.002115   Median : 0.008522   Median :1.0000  
##  Mean   : 0.005651   Mean   : 0.011387   Mean   :0.5461  
##  3rd Qu.: 0.074542   3rd Qu.: 0.076576   3rd Qu.:1.0000  
##  Max.   : 5.665138   Max.   : 3.271676   Max.   :1.0000

54.61% of observations had positive returns in December.

April has the largest mean value (0.026308), and September has the smallest mean value (-0.014721).

correlations

data.frame(cor(stocks)) %>%
  round(4) %>%
  formattable(list(
    area(col = 1:12, row = 1:12) ~
      color_tile("white", "red")
    ))
ReturnJan ReturnFeb ReturnMar ReturnApr ReturnMay ReturnJune ReturnJuly ReturnAug ReturnSep ReturnOct ReturnNov PositiveDec
ReturnJan 1.0000 0.0668 -0.0905 -0.0377 -0.0444 0.0922 -0.0814 -0.0228 -0.0264 0.1430 0.0676 0.0047
ReturnFeb 0.0668 1.0000 -0.1560 -0.1914 -0.0955 0.1700 -0.0618 0.1316 0.0435 -0.0873 -0.1547 -0.0382
ReturnMar -0.0905 -0.1560 1.0000 0.0097 -0.0039 -0.0859 0.0034 -0.0220 0.0765 -0.0119 0.0373 0.0224
ReturnApr -0.0377 -0.1914 0.0097 1.0000 0.0638 -0.0110 0.0806 -0.0518 -0.0289 0.0485 0.0318 0.0944
ReturnMay -0.0444 -0.0955 -0.0039 0.0638 1.0000 -0.0211 0.0909 -0.0331 0.0220 0.0172 0.0480 0.0582
ReturnJune 0.0922 0.1700 -0.0859 -0.0110 -0.0211 1.0000 -0.0292 0.0107 0.0447 -0.0226 -0.0653 0.0234
ReturnJuly -0.0814 -0.0618 0.0034 0.0806 0.0909 -0.0292 1.0000 0.0007 0.0689 -0.0547 -0.0484 0.0744
ReturnAug -0.0228 0.1316 -0.0220 -0.0518 -0.0331 0.0107 0.0007 1.0000 0.0007 -0.0756 -0.1165 0.0042
ReturnSep -0.0264 0.0435 0.0765 -0.0289 0.0220 0.0447 0.0689 0.0007 1.0000 -0.0581 -0.0197 0.0416
ReturnOct 0.1430 -0.0873 -0.0119 0.0485 0.0172 -0.0226 -0.0547 -0.0756 -0.0581 1.0000 0.1917 -0.0526
ReturnNov 0.0676 -0.1547 0.0373 0.0318 0.0480 -0.0653 -0.0484 -0.1165 -0.0197 0.1917 1.0000 -0.0623
PositiveDec 0.0047 -0.0382 0.0224 0.0944 0.0582 0.0234 0.0744 0.0042 0.0416 -0.0526 -0.0623 1.0000

The largest correlation coefficient between two return variables is 0.19167279, between ReturnOct and ReturnNov.

create training and testing sets

set.seed(144)

spl <- sample.split(stocks$PositiveDec, SplitRatio = 0.7) # 70% in training set
stocksTrain <- subset(stocks, spl == TRUE)
stocksTest <- subset(stocks, spl == FALSE)

logistic regression model

StocksModel <- glm(PositiveDec ~ .,
                   data = stocksTrain, family = "binomial")

summary(StocksModel)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4333  -1.2265   0.9102   1.1006   2.2611  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.14878    0.02384   6.240 4.37e-10 ***
## ReturnJan    0.31742    0.13906   2.283  0.02246 *  
## ReturnFeb   -0.29349    0.13113  -2.238  0.02521 *  
## ReturnMar    0.28716    0.14890   1.928  0.05380 .  
## ReturnApr    1.05849    0.14527   7.286 3.18e-13 ***
## ReturnMay    0.75472    0.16438   4.591 4.40e-06 ***
## ReturnJune   0.49435    0.15937   3.102  0.00192 ** 
## ReturnJuly   0.75114    0.16110   4.662 3.12e-06 ***
## ReturnAug    0.09395    0.17503   0.537  0.59142    
## ReturnSep    0.72669    0.17083   4.254 2.10e-05 ***
## ReturnOct   -0.60645    0.14452  -4.196 2.71e-05 ***
## ReturnNov   -0.84449    0.15698  -5.380 7.46e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11168  on 8105  degrees of freedom
## Residual deviance: 10951  on 8094  degrees of freedom
## AIC: 10975
## 
## Number of Fisher Scoring iterations: 4

logistic model prediction

Accuracy of the model on the training set.

predictlog <- predict(StocksModel, newdata = stocksTrain, type = "response")

confusionmatrix <- table(predictlog > .5, stocksTrain$PositiveDec)
confusionmatrix
##        
##            0    1
##   FALSE  990  787
##   TRUE  2689 3640
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(stocksTrain))
## 
## Accuracy 0.5711818

Accuracy of the model on the test set.

predictlog <- predict(StocksModel, newdata = stocksTest, type = "response")

confusionmatrix <- table(predictlog > .5, stocksTest$PositiveDec)
confusionmatrix
##        
##            0    1
##   FALSE  417  344
##   TRUE  1160 1553
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(stocksTest))
## 
## Accuracy 0.5670697

Accuracy on the test set of the baseline model that always predicts the most common outcome (PositiveDec = 1)

mean(stocksTest$PositiveDec)
## [1] 0.5460564

clustering stocks

In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we need to know the outcome value to perform the clustering, the methodology is no longer useful for prediction of an unknown outcome value.

This is an important point that is sometimes mistakenly overlooked. If you use the outcome value to cluster, you might conclude your method strongly outperforms a non-clustering alternative. However, this is because it is using the outcome to determine the clusters, which is not valid.

Remove the dependent variable before clustering.

limitedTrain <- stocksTrain
limitedTrain$PositiveDec <- NULL
limitedTest <- stocksTest
limitedTest$PositiveDec <- NULL

Normalize data.

preproc <- preProcess(limitedTrain)

normTrain <- predict(preproc, limitedTrain)
normTest <- predict(preproc, limitedTest)
summary(normTrain)
##    ReturnJan          ReturnFeb          ReturnMar       
##  Min.   :-4.57682   Min.   :-3.43004   Min.   :-4.54609  
##  1st Qu.:-0.48271   1st Qu.:-0.35589   1st Qu.:-0.40758  
##  Median :-0.07055   Median :-0.01875   Median :-0.05778  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.35898   3rd Qu.: 0.25337   3rd Qu.: 0.36106  
##  Max.   :18.06234   Max.   :34.92751   Max.   :24.77296  
##    ReturnApr         ReturnMay          ReturnJune      
##  Min.   :-5.0227   Min.   :-4.96759   Min.   :-4.82957  
##  1st Qu.:-0.4757   1st Qu.:-0.43045   1st Qu.:-0.45602  
##  Median :-0.1104   Median :-0.06983   Median :-0.04354  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.3400   3rd Qu.: 0.35906   3rd Qu.: 0.37273  
##  Max.   :14.6959   Max.   :42.69158   Max.   :10.84515  
##    ReturnJuly         ReturnAug          ReturnSep       
##  Min.   :-5.19139   Min.   :-5.60378   Min.   :-5.47078  
##  1st Qu.:-0.51832   1st Qu.:-0.47163   1st Qu.:-0.39604  
##  Median :-0.02372   Median :-0.07393   Median : 0.04767  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.47735   3rd Qu.: 0.39967   3rd Qu.: 0.42287  
##  Max.   :17.33975   Max.   :27.14273   Max.   :39.05435  
##    ReturnOct          ReturnNov       
##  Min.   :-3.53719   Min.   :-4.31684  
##  1st Qu.:-0.42176   1st Qu.:-0.43564  
##  Median :-0.01891   Median :-0.01878  
##  Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.37451   3rd Qu.: 0.42560  
##  Max.   :31.25996   Max.   :17.18255
summary(normTest)
##    ReturnJan           ReturnFeb           ReturnMar       
##  Min.   :-3.743836   Min.   :-3.251044   Min.   :-4.07731  
##  1st Qu.:-0.485690   1st Qu.:-0.348951   1st Qu.:-0.40662  
##  Median :-0.066856   Median :-0.006860   Median :-0.05674  
##  Mean   :-0.000419   Mean   :-0.003862   Mean   : 0.00583  
##  3rd Qu.: 0.357729   3rd Qu.: 0.264647   3rd Qu.: 0.35653  
##  Max.   : 8.412973   Max.   : 9.552365   Max.   : 9.00982  
##    ReturnApr          ReturnMay          ReturnJune      
##  Min.   :-4.47865   Min.   :-5.84445   Min.   :-4.73628  
##  1st Qu.:-0.51121   1st Qu.:-0.43819   1st Qu.:-0.44968  
##  Median :-0.11414   Median :-0.05346   Median :-0.02678  
##  Mean   :-0.03638   Mean   : 0.02651   Mean   : 0.04315  
##  3rd Qu.: 0.32742   3rd Qu.: 0.42290   3rd Qu.: 0.43010  
##  Max.   : 6.84589   Max.   : 7.21362   Max.   :29.00534  
##    ReturnJuly          ReturnAug          ReturnSep       
##  Min.   :-5.201454   Min.   :-4.62097   Min.   :-3.57222  
##  1st Qu.:-0.512039   1st Qu.:-0.51546   1st Qu.:-0.38067  
##  Median :-0.026576   Median :-0.10277   Median : 0.08215  
##  Mean   : 0.006016   Mean   :-0.04973   Mean   : 0.02939  
##  3rd Qu.: 0.457193   3rd Qu.: 0.38781   3rd Qu.: 0.45847  
##  Max.   :12.790901   Max.   : 6.66889   Max.   : 7.09106  
##    ReturnOct           ReturnNov        
##  Min.   :-3.807577   Min.   :-4.881463  
##  1st Qu.:-0.393856   1st Qu.:-0.396764  
##  Median : 0.006783   Median :-0.002337  
##  Mean   : 0.029672   Mean   : 0.017128  
##  3rd Qu.: 0.419005   3rd Qu.: 0.424617  
##  Max.   : 7.428466   Max.   :21.007786

We see that the average (mean) return in January is slightly higher in the training set than in the testing set. Since normTest was constructed by subtracting by the mean ReturnJan value from the training set, this explains why the mean value of ReturnJan is slightly negative in normTest.

The distribution of the ReturnJan variable is different in the training and testing set.

k-means clustering, three (3) clusters

set.seed(144)

k <- 3
km <- kmeans(normTrain, centers = k)
str(km)
## List of 9
##  $ cluster     : Named int [1:8106] 1 1 1 3 1 3 2 2 1 3 ...
##   ..- attr(*, "names")= chr [1:8106] "1" "2" "4" "6" ...
##  $ centers     : num [1:3, 1:11] -0.4523 0.2641 0.7425 -0.1497 -0.0415 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:11] "ReturnJan" "ReturnFeb" "ReturnMar" "ReturnApr" ...
##  $ totss       : num 89155
##  $ withinss    : num [1:3] 31204 38032 9937
##  $ tot.withinss: num 79173
##  $ betweenss   : num 9982
##  $ size        : int [1:3] 3157 4696 253
##  $ iter        : int 5
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

Cluster sizes

table(km$cluster)
## 
##    1    2    3 
## 3157 4696  253

Clustering stocks

flexclust package contains the object class KCCA, which stands for K-Centroids Cluster Analysis.

Need to convert the information from the clustering algorithm to an object of the class KCCA. This conversion is needed before we can use the predict function on the test set tumorVector.

km.kcca = as.kcca(km, normTrain)
clusterTrain = predict(km.kcca)
clusterTest = predict(km.kcca, newdata=normTest)

Test-set observations cluster assignment

table(clusterTest)
## clusterTest
##    1    2    3 
## 1298 2080   96

Using the subset function, building data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively

Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

stocksTrain1 <- subset(stocksTrain, clusterTrain == 1)
stocksTrain2 <- subset(stocksTrain, clusterTrain == 2)
stocksTrain3 <- subset(stocksTrain, clusterTrain == 3)

stocksTest1 <- subset(stocksTest, clusterTest == 1)
stocksTest2 <- subset(stocksTest, clusterTest == 2)
stocksTest3 <- subset(stocksTest, clusterTest == 3)
stocksTrain %>%
  mutate(cluster = clusterTrain) %>%
  group_by(cluster) %>%
  summarise(mean(PositiveDec))
## # A tibble: 3 x 2
##   cluster `mean(PositiveDec)`
##     <int>               <dbl>
## 1       1               0.602
## 2       2               0.514
## 3       3               0.439

stocksTrain1 has the observations with the highest average value of the dependent variable.

cluster-specific logistic regression models

Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables.

StocksModel1 <- glm(PositiveDec ~ .,
                    data = stocksTrain1, family = "binomial")

summary(StocksModel1)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7307  -1.2910   0.8878   1.0280   1.5023  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.17224    0.06302   2.733  0.00628 ** 
## ReturnJan    0.02498    0.29306   0.085  0.93206    
## ReturnFeb   -0.37207    0.29123  -1.278  0.20139    
## ReturnMar    0.59555    0.23325   2.553  0.01067 *  
## ReturnApr    1.19048    0.22439   5.305 1.12e-07 ***
## ReturnMay    0.30421    0.22845   1.332  0.18298    
## ReturnJune  -0.01165    0.29993  -0.039  0.96901    
## ReturnJuly   0.19769    0.27790   0.711  0.47685    
## ReturnAug    0.51273    0.30858   1.662  0.09660 .  
## ReturnSep    0.58833    0.28133   2.091  0.03651 *  
## ReturnOct   -1.02254    0.26007  -3.932 8.43e-05 ***
## ReturnNov   -0.74847    0.28280  -2.647  0.00813 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4243.0  on 3156  degrees of freedom
## Residual deviance: 4172.9  on 3145  degrees of freedom
## AIC: 4196.9
## 
## Number of Fisher Scoring iterations: 4
StocksModel2 <- glm(PositiveDec ~ .,
                    data = stocksTrain2, family = "binomial")

summary(StocksModel2)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2012  -1.1941   0.8583   1.1334   1.9424  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.10293    0.03785   2.719 0.006540 ** 
## ReturnJan    0.88451    0.20276   4.362 1.29e-05 ***
## ReturnFeb    0.31762    0.26624   1.193 0.232878    
## ReturnMar   -0.37978    0.24045  -1.579 0.114231    
## ReturnApr    0.49291    0.22460   2.195 0.028189 *  
## ReturnMay    0.89655    0.25492   3.517 0.000436 ***
## ReturnJune   1.50088    0.26014   5.770 7.95e-09 ***
## ReturnJuly   0.78315    0.26864   2.915 0.003554 ** 
## ReturnAug   -0.24486    0.27080  -0.904 0.365876    
## ReturnSep    0.73685    0.24820   2.969 0.002989 ** 
## ReturnOct   -0.27756    0.18400  -1.509 0.131419    
## ReturnNov   -0.78747    0.22458  -3.506 0.000454 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6506.3  on 4695  degrees of freedom
## Residual deviance: 6362.2  on 4684  degrees of freedom
## AIC: 6386.2
## 
## Number of Fisher Scoring iterations: 4
StocksModel3 <- glm(PositiveDec ~ .,
                    data = stocksTrain3, family = "binomial")

summary(StocksModel3)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9146  -1.0393  -0.7689   1.1921   1.6939  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.181896   0.325182  -0.559   0.5759  
## ReturnJan   -0.009789   0.448943  -0.022   0.9826  
## ReturnFeb   -0.046883   0.213432  -0.220   0.8261  
## ReturnMar    0.674179   0.564790   1.194   0.2326  
## ReturnApr    1.281466   0.602672   2.126   0.0335 *
## ReturnMay    0.762512   0.647783   1.177   0.2392  
## ReturnJune   0.329434   0.408038   0.807   0.4195  
## ReturnJuly   0.774164   0.729360   1.061   0.2885  
## ReturnAug    0.982605   0.533158   1.843   0.0653 .
## ReturnSep    0.363807   0.627774   0.580   0.5622  
## ReturnOct    0.782242   0.733123   1.067   0.2860  
## ReturnNov   -0.873752   0.738480  -1.183   0.2367  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 346.92  on 252  degrees of freedom
## Residual deviance: 328.29  on 241  degrees of freedom
## AIC: 352.29
## 
## Number of Fisher Scoring iterations: 4

ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct differ in sign between the models.

cluster-specific predictions

predictTest1 <- predict(StocksModel1, newdata = stocksTest1, type = "response") > .5

confusionmatrix1 <- table(predictTest1, stocksTest1$PositiveDec)
confusionmatrix1
##             
## predictTest1   0   1
##        FALSE  30  23
##        TRUE  471 774
cat("\nAccuracy", sum(diag(confusionmatrix1))/nrow(stocksTest1))
## 
## Accuracy 0.6194145
predictTest2 <- predict(StocksModel2, newdata = stocksTest2, type = "response") > .5

confusionmatrix2 <- table(predictTest2, stocksTest2$PositiveDec)
confusionmatrix2
##             
## predictTest2   0   1
##        FALSE 388 309
##        TRUE  626 757
cat("\nAccuracy", sum(diag(confusionmatrix2))/nrow(stocksTest2))
## 
## Accuracy 0.5504808
predictTest3 <- predict(StocksModel3, newdata = stocksTest3, type = "response") > .5

confusionmatrix3 <- table(predictTest3, stocksTest3$PositiveDec)
confusionmatrix3
##             
## predictTest3  0  1
##        FALSE 49 21
##        TRUE  13 13
cat("\nAccuracy", sum(diag(confusionmatrix3))/nrow(stocksTest3))
## 
## Accuracy 0.6458333

overall cluster-specific prediction accuracy

AllPredictions <- c(predictTest1, predictTest2, predictTest3)
AllOutcomes <- c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

confusionmatrix_all <- table(AllPredictions, AllOutcomes)
confusionmatrix_all
##               AllOutcomes
## AllPredictions    0    1
##          FALSE  467  353
##          TRUE  1110 1544
cat("\nAccuracy", sum(diag(confusionmatrix_all))/nrow(stocksTest))
## 
## Accuracy 0.5788716

We see a modest improvement over the original logistic regression model. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in stocks for which we are more confident that they will have positive returns (by selecting the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.