Homework assignment for MITx “The Analytics Edge”
Cluster-then-predict is a methodology in which you first cluster observations and then build cluster-specific prediction models. For example, this methodology can help improve the prediction of heart attack risk. We’ll use cluster-then-predict to predict future stock prices using historical stock data.
When selecting which stocks to invest in, investors seek to obtain good future returns. We will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(knitr)
library(kableExtra)
library(formattable)
library(caTools) # training/test set creation tools
library(caret) # has preProcess command to normalize data
library(flexclust) # obtain training set and testing sut cluster assignments for observations
For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.
Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.
stocks <- read.csv("StocksCluster.csv")
str(stocks)
## 'data.frame': 11580 obs. of 12 variables:
## $ ReturnJan : num 0.0807 -0.0107 0.0477 -0.074 -0.031 ...
## $ ReturnFeb : num 0.0663 0.1021 0.036 -0.0482 -0.2127 ...
## $ ReturnMar : num 0.0329 0.1455 0.0397 0.0182 0.0915 ...
## $ ReturnApr : num 0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
## $ ReturnMay : num 0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
## $ ReturnJune : num -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
## $ ReturnJuly : num -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
## $ ReturnAug : num 0.0247 0.2113 0.0334 0.0953 0.0568 ...
## $ ReturnSep : num -0.0204 -0.58 0 0.0567 0.0336 ...
## $ ReturnOct : num -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
## $ ReturnNov : num -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
## $ PositiveDec: int 0 0 0 1 1 1 1 0 0 0 ...
This dataset contains the following variables:
ReturnJan = the return for the company's stock during January (in the year of the observation).
ReturnFeb = the return for the company's stock during February (in the year of the observation).
ReturnMar = the return for the company's stock during March (in the year of the observation).
ReturnApr = the return for the company's stock during April (in the year of the observation).
ReturnMay = the return for the company's stock during May (in the year of the observation).
ReturnJune = the return for the company's stock during June (in the year of the observation).
ReturnJuly = the return for the company's stock during July (in the year of the observation).
ReturnAug = the return for the company's stock during August (in the year of the observation).
ReturnSep = the return for the company's stock during September (in the year of the observation).
ReturnOct = the return for the company's stock during October (in the year of the observation).
ReturnNov = the return for the company's stock during November (in the year of the observation).
PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.
For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
summary(stocks)
## ReturnJan ReturnFeb ReturnMar
## Min. :-0.7616205 Min. :-0.690000 Min. :-0.712994
## 1st Qu.:-0.0691663 1st Qu.:-0.077748 1st Qu.:-0.046389
## Median : 0.0009965 Median :-0.010626 Median : 0.009878
## Mean : 0.0126316 Mean :-0.007605 Mean : 0.019402
## 3rd Qu.: 0.0732606 3rd Qu.: 0.043600 3rd Qu.: 0.077066
## Max. : 3.0683060 Max. : 6.943694 Max. : 4.008621
## ReturnApr ReturnMay ReturnJune
## Min. :-0.826503 Min. :-0.92207 Min. :-0.717920
## 1st Qu.:-0.054468 1st Qu.:-0.04640 1st Qu.:-0.063966
## Median : 0.009059 Median : 0.01293 Median :-0.000880
## Mean : 0.026308 Mean : 0.02474 Mean : 0.005938
## 3rd Qu.: 0.085338 3rd Qu.: 0.08396 3rd Qu.: 0.061566
## Max. : 2.528827 Max. : 6.93013 Max. : 4.339713
## ReturnJuly ReturnAug ReturnSep
## Min. :-0.7613096 Min. :-0.726800 Min. :-0.839730
## 1st Qu.:-0.0731917 1st Qu.:-0.046272 1st Qu.:-0.074648
## Median :-0.0008047 Median : 0.007205 Median :-0.007616
## Mean : 0.0030509 Mean : 0.016198 Mean :-0.014721
## 3rd Qu.: 0.0718205 3rd Qu.: 0.070783 3rd Qu.: 0.049476
## Max. : 2.5500000 Max. : 3.626609 Max. : 5.863980
## ReturnOct ReturnNov PositiveDec
## Min. :-0.685504 Min. :-0.747171 Min. :0.0000
## 1st Qu.:-0.070915 1st Qu.:-0.054890 1st Qu.:0.0000
## Median : 0.002115 Median : 0.008522 Median :1.0000
## Mean : 0.005651 Mean : 0.011387 Mean :0.5461
## 3rd Qu.: 0.074542 3rd Qu.: 0.076576 3rd Qu.:1.0000
## Max. : 5.665138 Max. : 3.271676 Max. :1.0000
54.61% of observations had positive returns in December.
April has the largest mean value (0.026308), and September has the smallest mean value (-0.014721).
data.frame(cor(stocks)) %>%
round(4) %>%
formattable(list(
area(col = 1:12, row = 1:12) ~
color_tile("white", "red")
))
| ReturnJan | ReturnFeb | ReturnMar | ReturnApr | ReturnMay | ReturnJune | ReturnJuly | ReturnAug | ReturnSep | ReturnOct | ReturnNov | PositiveDec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReturnJan | 1.0000 | 0.0668 | -0.0905 | -0.0377 | -0.0444 | 0.0922 | -0.0814 | -0.0228 | -0.0264 | 0.1430 | 0.0676 | 0.0047 |
| ReturnFeb | 0.0668 | 1.0000 | -0.1560 | -0.1914 | -0.0955 | 0.1700 | -0.0618 | 0.1316 | 0.0435 | -0.0873 | -0.1547 | -0.0382 |
| ReturnMar | -0.0905 | -0.1560 | 1.0000 | 0.0097 | -0.0039 | -0.0859 | 0.0034 | -0.0220 | 0.0765 | -0.0119 | 0.0373 | 0.0224 |
| ReturnApr | -0.0377 | -0.1914 | 0.0097 | 1.0000 | 0.0638 | -0.0110 | 0.0806 | -0.0518 | -0.0289 | 0.0485 | 0.0318 | 0.0944 |
| ReturnMay | -0.0444 | -0.0955 | -0.0039 | 0.0638 | 1.0000 | -0.0211 | 0.0909 | -0.0331 | 0.0220 | 0.0172 | 0.0480 | 0.0582 |
| ReturnJune | 0.0922 | 0.1700 | -0.0859 | -0.0110 | -0.0211 | 1.0000 | -0.0292 | 0.0107 | 0.0447 | -0.0226 | -0.0653 | 0.0234 |
| ReturnJuly | -0.0814 | -0.0618 | 0.0034 | 0.0806 | 0.0909 | -0.0292 | 1.0000 | 0.0007 | 0.0689 | -0.0547 | -0.0484 | 0.0744 |
| ReturnAug | -0.0228 | 0.1316 | -0.0220 | -0.0518 | -0.0331 | 0.0107 | 0.0007 | 1.0000 | 0.0007 | -0.0756 | -0.1165 | 0.0042 |
| ReturnSep | -0.0264 | 0.0435 | 0.0765 | -0.0289 | 0.0220 | 0.0447 | 0.0689 | 0.0007 | 1.0000 | -0.0581 | -0.0197 | 0.0416 |
| ReturnOct | 0.1430 | -0.0873 | -0.0119 | 0.0485 | 0.0172 | -0.0226 | -0.0547 | -0.0756 | -0.0581 | 1.0000 | 0.1917 | -0.0526 |
| ReturnNov | 0.0676 | -0.1547 | 0.0373 | 0.0318 | 0.0480 | -0.0653 | -0.0484 | -0.1165 | -0.0197 | 0.1917 | 1.0000 | -0.0623 |
| PositiveDec | 0.0047 | -0.0382 | 0.0224 | 0.0944 | 0.0582 | 0.0234 | 0.0744 | 0.0042 | 0.0416 | -0.0526 | -0.0623 | 1.0000 |
The largest correlation coefficient between two return variables is 0.19167279, between ReturnOct and ReturnNov.
set.seed(144)
spl <- sample.split(stocks$PositiveDec, SplitRatio = 0.7) # 70% in training set
stocksTrain <- subset(stocks, spl == TRUE)
stocksTest <- subset(stocks, spl == FALSE)
StocksModel <- glm(PositiveDec ~ .,
data = stocksTrain, family = "binomial")
summary(StocksModel)
##
## Call:
## glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4333 -1.2265 0.9102 1.1006 2.2611
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.14878 0.02384 6.240 4.37e-10 ***
## ReturnJan 0.31742 0.13906 2.283 0.02246 *
## ReturnFeb -0.29349 0.13113 -2.238 0.02521 *
## ReturnMar 0.28716 0.14890 1.928 0.05380 .
## ReturnApr 1.05849 0.14527 7.286 3.18e-13 ***
## ReturnMay 0.75472 0.16438 4.591 4.40e-06 ***
## ReturnJune 0.49435 0.15937 3.102 0.00192 **
## ReturnJuly 0.75114 0.16110 4.662 3.12e-06 ***
## ReturnAug 0.09395 0.17503 0.537 0.59142
## ReturnSep 0.72669 0.17083 4.254 2.10e-05 ***
## ReturnOct -0.60645 0.14452 -4.196 2.71e-05 ***
## ReturnNov -0.84449 0.15698 -5.380 7.46e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11168 on 8105 degrees of freedom
## Residual deviance: 10951 on 8094 degrees of freedom
## AIC: 10975
##
## Number of Fisher Scoring iterations: 4
Accuracy of the model on the training set.
predictlog <- predict(StocksModel, newdata = stocksTrain, type = "response")
confusionmatrix <- table(predictlog > .5, stocksTrain$PositiveDec)
confusionmatrix
##
## 0 1
## FALSE 990 787
## TRUE 2689 3640
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(stocksTrain))
##
## Accuracy 0.5711818
Accuracy of the model on the test set.
predictlog <- predict(StocksModel, newdata = stocksTest, type = "response")
confusionmatrix <- table(predictlog > .5, stocksTest$PositiveDec)
confusionmatrix
##
## 0 1
## FALSE 417 344
## TRUE 1160 1553
cat("\nAccuracy", sum(diag(confusionmatrix))/nrow(stocksTest))
##
## Accuracy 0.5670697
Accuracy on the test set of the baseline model that always predicts the most common outcome (PositiveDec = 1)
mean(stocksTest$PositiveDec)
## [1] 0.5460564
In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we need to know the outcome value to perform the clustering, the methodology is no longer useful for prediction of an unknown outcome value.
This is an important point that is sometimes mistakenly overlooked. If you use the outcome value to cluster, you might conclude your method strongly outperforms a non-clustering alternative. However, this is because it is using the outcome to determine the clusters, which is not valid.
Remove the dependent variable before clustering.
limitedTrain <- stocksTrain
limitedTrain$PositiveDec <- NULL
limitedTest <- stocksTest
limitedTest$PositiveDec <- NULL
Normalize data.
preproc <- preProcess(limitedTrain)
normTrain <- predict(preproc, limitedTrain)
normTest <- predict(preproc, limitedTest)
summary(normTrain)
## ReturnJan ReturnFeb ReturnMar
## Min. :-4.57682 Min. :-3.43004 Min. :-4.54609
## 1st Qu.:-0.48271 1st Qu.:-0.35589 1st Qu.:-0.40758
## Median :-0.07055 Median :-0.01875 Median :-0.05778
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.35898 3rd Qu.: 0.25337 3rd Qu.: 0.36106
## Max. :18.06234 Max. :34.92751 Max. :24.77296
## ReturnApr ReturnMay ReturnJune
## Min. :-5.0227 Min. :-4.96759 Min. :-4.82957
## 1st Qu.:-0.4757 1st Qu.:-0.43045 1st Qu.:-0.45602
## Median :-0.1104 Median :-0.06983 Median :-0.04354
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.3400 3rd Qu.: 0.35906 3rd Qu.: 0.37273
## Max. :14.6959 Max. :42.69158 Max. :10.84515
## ReturnJuly ReturnAug ReturnSep
## Min. :-5.19139 Min. :-5.60378 Min. :-5.47078
## 1st Qu.:-0.51832 1st Qu.:-0.47163 1st Qu.:-0.39604
## Median :-0.02372 Median :-0.07393 Median : 0.04767
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.47735 3rd Qu.: 0.39967 3rd Qu.: 0.42287
## Max. :17.33975 Max. :27.14273 Max. :39.05435
## ReturnOct ReturnNov
## Min. :-3.53719 Min. :-4.31684
## 1st Qu.:-0.42176 1st Qu.:-0.43564
## Median :-0.01891 Median :-0.01878
## Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.37451 3rd Qu.: 0.42560
## Max. :31.25996 Max. :17.18255
summary(normTest)
## ReturnJan ReturnFeb ReturnMar
## Min. :-3.743836 Min. :-3.251044 Min. :-4.07731
## 1st Qu.:-0.485690 1st Qu.:-0.348951 1st Qu.:-0.40662
## Median :-0.066856 Median :-0.006860 Median :-0.05674
## Mean :-0.000419 Mean :-0.003862 Mean : 0.00583
## 3rd Qu.: 0.357729 3rd Qu.: 0.264647 3rd Qu.: 0.35653
## Max. : 8.412973 Max. : 9.552365 Max. : 9.00982
## ReturnApr ReturnMay ReturnJune
## Min. :-4.47865 Min. :-5.84445 Min. :-4.73628
## 1st Qu.:-0.51121 1st Qu.:-0.43819 1st Qu.:-0.44968
## Median :-0.11414 Median :-0.05346 Median :-0.02678
## Mean :-0.03638 Mean : 0.02651 Mean : 0.04315
## 3rd Qu.: 0.32742 3rd Qu.: 0.42290 3rd Qu.: 0.43010
## Max. : 6.84589 Max. : 7.21362 Max. :29.00534
## ReturnJuly ReturnAug ReturnSep
## Min. :-5.201454 Min. :-4.62097 Min. :-3.57222
## 1st Qu.:-0.512039 1st Qu.:-0.51546 1st Qu.:-0.38067
## Median :-0.026576 Median :-0.10277 Median : 0.08215
## Mean : 0.006016 Mean :-0.04973 Mean : 0.02939
## 3rd Qu.: 0.457193 3rd Qu.: 0.38781 3rd Qu.: 0.45847
## Max. :12.790901 Max. : 6.66889 Max. : 7.09106
## ReturnOct ReturnNov
## Min. :-3.807577 Min. :-4.881463
## 1st Qu.:-0.393856 1st Qu.:-0.396764
## Median : 0.006783 Median :-0.002337
## Mean : 0.029672 Mean : 0.017128
## 3rd Qu.: 0.419005 3rd Qu.: 0.424617
## Max. : 7.428466 Max. :21.007786
We see that the average (mean) return in January is slightly higher in the training set than in the testing set. Since normTest was constructed by subtracting by the mean ReturnJan value from the training set, this explains why the mean value of ReturnJan is slightly negative in normTest.
The distribution of the ReturnJan variable is different in the training and testing set.
set.seed(144)
k <- 3
km <- kmeans(normTrain, centers = k)
str(km)
## List of 9
## $ cluster : Named int [1:8106] 1 1 1 3 1 3 2 2 1 3 ...
## ..- attr(*, "names")= chr [1:8106] "1" "2" "4" "6" ...
## $ centers : num [1:3, 1:11] -0.4523 0.2641 0.7425 -0.1497 -0.0415 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:11] "ReturnJan" "ReturnFeb" "ReturnMar" "ReturnApr" ...
## $ totss : num 89155
## $ withinss : num [1:3] 31204 38032 9937
## $ tot.withinss: num 79173
## $ betweenss : num 9982
## $ size : int [1:3] 3157 4696 253
## $ iter : int 5
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
Cluster sizes
table(km$cluster)
##
## 1 2 3
## 3157 4696 253
flexclust package contains the object class KCCA, which stands for K-Centroids Cluster Analysis.
Need to convert the information from the clustering algorithm to an object of the class KCCA. This conversion is needed before we can use the predict function on the test set tumorVector.
km.kcca = as.kcca(km, normTrain)
clusterTrain = predict(km.kcca)
clusterTest = predict(km.kcca, newdata=normTest)
Test-set observations cluster assignment
table(clusterTest)
## clusterTest
## 1 2 3
## 1298 2080 96
Using the subset function, building data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively
Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.
stocksTrain1 <- subset(stocksTrain, clusterTrain == 1)
stocksTrain2 <- subset(stocksTrain, clusterTrain == 2)
stocksTrain3 <- subset(stocksTrain, clusterTrain == 3)
stocksTest1 <- subset(stocksTest, clusterTest == 1)
stocksTest2 <- subset(stocksTest, clusterTest == 2)
stocksTest3 <- subset(stocksTest, clusterTest == 3)
stocksTrain %>%
mutate(cluster = clusterTrain) %>%
group_by(cluster) %>%
summarise(mean(PositiveDec))
## # A tibble: 3 x 2
## cluster `mean(PositiveDec)`
## <int> <dbl>
## 1 1 0.602
## 2 2 0.514
## 3 3 0.439
stocksTrain1 has the observations with the highest average value of the dependent variable.
Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables.
StocksModel1 <- glm(PositiveDec ~ .,
data = stocksTrain1, family = "binomial")
summary(StocksModel1)
##
## Call:
## glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7307 -1.2910 0.8878 1.0280 1.5023
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17224 0.06302 2.733 0.00628 **
## ReturnJan 0.02498 0.29306 0.085 0.93206
## ReturnFeb -0.37207 0.29123 -1.278 0.20139
## ReturnMar 0.59555 0.23325 2.553 0.01067 *
## ReturnApr 1.19048 0.22439 5.305 1.12e-07 ***
## ReturnMay 0.30421 0.22845 1.332 0.18298
## ReturnJune -0.01165 0.29993 -0.039 0.96901
## ReturnJuly 0.19769 0.27790 0.711 0.47685
## ReturnAug 0.51273 0.30858 1.662 0.09660 .
## ReturnSep 0.58833 0.28133 2.091 0.03651 *
## ReturnOct -1.02254 0.26007 -3.932 8.43e-05 ***
## ReturnNov -0.74847 0.28280 -2.647 0.00813 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4243.0 on 3156 degrees of freedom
## Residual deviance: 4172.9 on 3145 degrees of freedom
## AIC: 4196.9
##
## Number of Fisher Scoring iterations: 4
StocksModel2 <- glm(PositiveDec ~ .,
data = stocksTrain2, family = "binomial")
summary(StocksModel2)
##
## Call:
## glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2012 -1.1941 0.8583 1.1334 1.9424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.10293 0.03785 2.719 0.006540 **
## ReturnJan 0.88451 0.20276 4.362 1.29e-05 ***
## ReturnFeb 0.31762 0.26624 1.193 0.232878
## ReturnMar -0.37978 0.24045 -1.579 0.114231
## ReturnApr 0.49291 0.22460 2.195 0.028189 *
## ReturnMay 0.89655 0.25492 3.517 0.000436 ***
## ReturnJune 1.50088 0.26014 5.770 7.95e-09 ***
## ReturnJuly 0.78315 0.26864 2.915 0.003554 **
## ReturnAug -0.24486 0.27080 -0.904 0.365876
## ReturnSep 0.73685 0.24820 2.969 0.002989 **
## ReturnOct -0.27756 0.18400 -1.509 0.131419
## ReturnNov -0.78747 0.22458 -3.506 0.000454 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6506.3 on 4695 degrees of freedom
## Residual deviance: 6362.2 on 4684 degrees of freedom
## AIC: 6386.2
##
## Number of Fisher Scoring iterations: 4
StocksModel3 <- glm(PositiveDec ~ .,
data = stocksTrain3, family = "binomial")
summary(StocksModel3)
##
## Call:
## glm(formula = PositiveDec ~ ., family = "binomial", data = stocksTrain3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9146 -1.0393 -0.7689 1.1921 1.6939
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.181896 0.325182 -0.559 0.5759
## ReturnJan -0.009789 0.448943 -0.022 0.9826
## ReturnFeb -0.046883 0.213432 -0.220 0.8261
## ReturnMar 0.674179 0.564790 1.194 0.2326
## ReturnApr 1.281466 0.602672 2.126 0.0335 *
## ReturnMay 0.762512 0.647783 1.177 0.2392
## ReturnJune 0.329434 0.408038 0.807 0.4195
## ReturnJuly 0.774164 0.729360 1.061 0.2885
## ReturnAug 0.982605 0.533158 1.843 0.0653 .
## ReturnSep 0.363807 0.627774 0.580 0.5622
## ReturnOct 0.782242 0.733123 1.067 0.2860
## ReturnNov -0.873752 0.738480 -1.183 0.2367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 346.92 on 252 degrees of freedom
## Residual deviance: 328.29 on 241 degrees of freedom
## AIC: 352.29
##
## Number of Fisher Scoring iterations: 4
ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct differ in sign between the models.
predictTest1 <- predict(StocksModel1, newdata = stocksTest1, type = "response") > .5
confusionmatrix1 <- table(predictTest1, stocksTest1$PositiveDec)
confusionmatrix1
##
## predictTest1 0 1
## FALSE 30 23
## TRUE 471 774
cat("\nAccuracy", sum(diag(confusionmatrix1))/nrow(stocksTest1))
##
## Accuracy 0.6194145
predictTest2 <- predict(StocksModel2, newdata = stocksTest2, type = "response") > .5
confusionmatrix2 <- table(predictTest2, stocksTest2$PositiveDec)
confusionmatrix2
##
## predictTest2 0 1
## FALSE 388 309
## TRUE 626 757
cat("\nAccuracy", sum(diag(confusionmatrix2))/nrow(stocksTest2))
##
## Accuracy 0.5504808
predictTest3 <- predict(StocksModel3, newdata = stocksTest3, type = "response") > .5
confusionmatrix3 <- table(predictTest3, stocksTest3$PositiveDec)
confusionmatrix3
##
## predictTest3 0 1
## FALSE 49 21
## TRUE 13 13
cat("\nAccuracy", sum(diag(confusionmatrix3))/nrow(stocksTest3))
##
## Accuracy 0.6458333
AllPredictions <- c(predictTest1, predictTest2, predictTest3)
AllOutcomes <- c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
confusionmatrix_all <- table(AllPredictions, AllOutcomes)
confusionmatrix_all
## AllOutcomes
## AllPredictions 0 1
## FALSE 467 353
## TRUE 1110 1544
cat("\nAccuracy", sum(diag(confusionmatrix_all))/nrow(stocksTest))
##
## Accuracy 0.5788716
We see a modest improvement over the original logistic regression model. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in stocks for which we are more confident that they will have positive returns (by selecting the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.