In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we’ll use cluster-then-predict to predict future stock prices using historical stock data.
When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.
For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.
Each observation in the dataset contains the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the NASDAQ for the entire period 2000-2009 and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.
This dataset contains the following variables:
ReturnJan = the return for the company’s stock during January (in the year of the observation).
ReturnFeb = the return for the company’s stock during February (in the year of the observation).
ReturnMar = the return for the company’s stock during March (in the year of the observation).
ReturnApr = the return for the company’s stock during April (in the year of the observation).
ReturnMay = the return for the company’s stock during May (in the year of the observation).
ReturnJune = the return for the company’s stock during June (in the year of the observation).
ReturnJuly = the return for the company’s stock during July (in the year of the observation).
ReturnAug = the return for the company’s stock during August (in the year of the observation).
ReturnSep = the return for the company’s stock during September (in the year of the observation).
ReturnOct = the return for the company’s stock during October (in the year of the observation).
ReturnNov = the return for the company’s stock during November (in the year of the observation).
PositiveDec = whether or not the company’s stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.
For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
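To make the encoding concrete, here is a minimal sketch of how these proportional changes compound across months. The vector of returns is hypothetical, not taken from the dataset:
# Hypothetical monthly returns: +5%, -2%, +3%
monthlyReturns = c(0.05, -0.02, 0.03)
# Each month multiplies the stock value by (1 + return),
# so the cumulative return over the three months is:
prod(1 + monthlyReturns) - 1
## [1] 0.05987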
Load StocksCluster.csv into a data frame called “stocks”.
# Read the dataset
stocks = read.csv("StocksCluster.csv")

# Report the number of rows in stocks
nrow(stocks)
## [1] 11580

There are 11580 observations in this dataset.
# Tabulate the positive December returns
z = table(stocks$PositiveDec)
kable(z)

| Var1 | Freq |
|---|---|
| 0 | 5256 |
| 1 | 6324 |
z[2]/(sum(z))
## 1
## 0.546114

54.6% of the observations have positive returns in December.
# Obtain a correlation matrix of the data
z = cor(stocks)
kable(z)

| | ReturnJan | ReturnFeb | ReturnMar | ReturnApr | ReturnMay | ReturnJune | ReturnJuly | ReturnAug | ReturnSep | ReturnOct | ReturnNov | PositiveDec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReturnJan | 1.0000000 | 0.0667746 | -0.0904968 | -0.0376780 | -0.0444114 | 0.0922383 | -0.0814298 | -0.0227920 | -0.0264372 | 0.1429772 | 0.0676323 | 0.0047285 |
| ReturnFeb | 0.0667746 | 1.0000000 | -0.1559833 | -0.1913519 | -0.0955209 | 0.1699945 | -0.0617785 | 0.1315598 | 0.0435018 | -0.0873243 | -0.1546583 | -0.0381732 |
| ReturnMar | -0.0904968 | -0.1559833 | 1.0000000 | 0.0097263 | -0.0038928 | -0.0859055 | 0.0033742 | -0.0220054 | 0.0765183 | -0.0119238 | 0.0373235 | 0.0224087 |
| ReturnApr | -0.0376780 | -0.1913519 | 0.0097263 | 1.0000000 | 0.0638225 | -0.0110278 | 0.0806319 | -0.0517561 | -0.0289210 | 0.0485400 | 0.0317618 | 0.0943535 |
| ReturnMay | -0.0444114 | -0.0955209 | -0.0038928 | 0.0638225 | 1.0000000 | -0.0210745 | 0.0908503 | -0.0331257 | 0.0219629 | 0.0171667 | 0.0480466 | 0.0582019 |
| ReturnJune | 0.0922383 | 0.1699945 | -0.0859055 | -0.0110278 | -0.0210745 | 1.0000000 | -0.0291526 | 0.0107105 | 0.0447473 | -0.0226360 | -0.0652705 | 0.0234097 |
| ReturnJuly | -0.0814298 | -0.0617785 | 0.0033742 | 0.0806319 | 0.0908503 | -0.0291526 | 1.0000000 | 0.0007138 | 0.0689478 | -0.0547089 | -0.0483738 | 0.0743642 |
| ReturnAug | -0.0227920 | 0.1315598 | -0.0220054 | -0.0517561 | -0.0331257 | 0.0107105 | 0.0007138 | 1.0000000 | 0.0007407 | -0.0755946 | -0.1164890 | 0.0041670 |
| ReturnSep | -0.0264372 | 0.0435018 | 0.0765183 | -0.0289210 | 0.0219629 | 0.0447473 | 0.0689478 | 0.0007407 | 1.0000000 | -0.0580792 | -0.0197198 | 0.0416303 |
| ReturnOct | 0.1429772 | -0.0873243 | -0.0119238 | 0.0485400 | 0.0171667 | -0.0226360 | -0.0547089 | -0.0755946 | -0.0580792 | 1.0000000 | 0.1916728 | -0.0525750 |
| ReturnNov | 0.0676323 | -0.1546583 | 0.0373235 | 0.0317618 | 0.0480466 | -0.0652705 | -0.0483738 | -0.1164890 | -0.0197198 | 0.1916728 | 1.0000000 | -0.0623466 |
| PositiveDec | 0.0047285 | -0.0381732 | 0.0224087 | 0.0943535 | 0.0582019 | 0.0234097 | 0.0743642 | 0.0041670 | 0.0416303 | -0.0525750 | -0.0623466 | 1.0000000 |
The largest correlation between two return variables is 0.1916728, between ReturnOct and ReturnNov.

# Obtain a summary of the data
z = summary(stocks)
kable(z)

| ReturnJan | ReturnFeb | ReturnMar | ReturnApr | ReturnMay | ReturnJune | ReturnJuly | ReturnAug | ReturnSep | ReturnOct | ReturnNov | PositiveDec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :-0.7616205 | Min. :-0.690000 | Min. :-0.712994 | Min. :-0.826503 | Min. :-0.92207 | Min. :-0.717920 | Min. :-0.7613096 | Min. :-0.726800 | Min. :-0.839730 | Min. :-0.685504 | Min. :-0.747171 | Min. :0.0000 |
| 1st Qu.:-0.0691663 | 1st Qu.:-0.077748 | 1st Qu.:-0.046389 | 1st Qu.:-0.054468 | 1st Qu.:-0.04640 | 1st Qu.:-0.063966 | 1st Qu.:-0.0731917 | 1st Qu.:-0.046272 | 1st Qu.:-0.074648 | 1st Qu.:-0.070915 | 1st Qu.:-0.054890 | 1st Qu.:0.0000 |
| Median : 0.0009965 | Median :-0.010626 | Median : 0.009878 | Median : 0.009059 | Median : 0.01293 | Median :-0.000880 | Median :-0.0008047 | Median : 0.007205 | Median :-0.007616 | Median : 0.002115 | Median : 0.008522 | Median :1.0000 |
| Mean : 0.0126316 | Mean :-0.007605 | Mean : 0.019402 | Mean : 0.026308 | Mean : 0.02474 | Mean : 0.005938 | Mean : 0.0030509 | Mean : 0.016198 | Mean :-0.014721 | Mean : 0.005651 | Mean : 0.011387 | Mean :0.5461 |
| 3rd Qu.: 0.0732606 | 3rd Qu.: 0.043600 | 3rd Qu.: 0.077066 | 3rd Qu.: 0.085338 | 3rd Qu.: 0.08396 | 3rd Qu.: 0.061566 | 3rd Qu.: 0.0718205 | 3rd Qu.: 0.070783 | 3rd Qu.: 0.049476 | 3rd Qu.: 0.074542 | 3rd Qu.: 0.076576 | 3rd Qu.:1.0000 |
| Max. : 3.0683060 | Max. : 6.943694 | Max. : 4.008621 | Max. : 2.528827 | Max. : 6.93013 | Max. : 4.339713 | Max. : 2.5500000 | Max. : 3.626609 | Max. : 5.863980 | Max. : 5.665138 | Max. : 3.271676 | Max. :1.0000 |
ReturnApr has the largest mean value (0.026308).
ReturnSep has the smallest mean value (-0.014721).
Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:
library(caTools)
set.seed(144)
# Split the dataset into training and testing sets
spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain = subset(stocks, spl == TRUE)
stocksTest = subset(stocks, spl == FALSE)

Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don't forget to add the argument family=binomial to your glm command.
# Create Logistic Regression Model
StocksModel = glm(PositiveDec ~ ., data = stocksTrain, family=binomial)
# Predict on the training set
predictTrain = predict(StocksModel, type="response")
# Confusion matrix: actual outcomes vs. predictions on the training set
z = table(stocksTrain$PositiveDec, predictTrain > 0.5)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 990 | 2689 |
| 1 | 787 | 3640 |
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5711818

The overall accuracy of the model on the training set is 0.5712.
# Predict on the test set
predictTest = predict(StocksModel, type="response", newdata=stocksTest)
# Confusion matrix: actual outcomes vs. predictions on the test set
z = table(stocksTest$PositiveDec, predictTest > 0.5)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 417 | 1160 |
| 1 | 344 | 1553 |
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5670697

The overall test-set accuracy of StocksModel is 0.5671.

# Tabulate the test-set positive December returns
z = table(stocksTest$PositiveDec)
kable(z)

| Var1 | Freq |
|---|---|
| 0 | 1577 |
| 1 | 1897 |
# Compute the accuracy of the baseline model
z[2]/(sum(z))
## 1
## 0.5460564

The baseline model, which always predicts the most frequent outcome (a positive December return), has a test-set accuracy of 0.5461, so StocksModel only narrowly outperforms it.
Now, let’s cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:
#Remove Dependent Variable
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL

In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we needed to know the outcome value to perform the clustering, the methodology would no longer be useful for predicting an unknown outcome value.
This is an important point that is sometimes overlooked. If you use the outcome value to cluster, you might conclude that your method strongly outperforms a non-clustering alternative. However, that apparent advantage comes from using the outcome to determine the clusters, which is not valid when the outcome is unknown at prediction time.
In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.
In cases where we have a training and testing set, we’ll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:
# Preprocess the data
library(caret)
preproc = preProcess(limitedTrain)
normTrain = predict(preproc, limitedTrain)
normTest = predict(preproc, limitedTest)

# Obtain the mean of the ReturnJan variable in normTrain
mean(normTrain$ReturnJan)
## [1] 2.100586e-17

The mean is 2.100586 × 10^-17, which is zero up to floating-point error, as expected after centering on the training-set mean.
# Obtain the mean of the ReturnJan variable in normTest
mean(normTest$ReturnJan)
## [1] -0.0004185886

The mean is -0.0004185886.
From mean(stocksTrain$ReturnJan) and mean(stocksTest$ReturnJan), we see that the average return in January is slightly higher in the training set than in the testing set. Since normTest was constructed by subtracting the training-set mean of ReturnJan, this explains why the mean value of ReturnJan is slightly negative in normTest.
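As a quick sanity check (a sketch, not required by the assignment), normalizing the test set by hand with the training set's statistics should reproduce normTest:
# Manually normalize ReturnJan in the test set using training-set mean and sd
manualNormJan = (limitedTest$ReturnJan - mean(limitedTrain$ReturnJan)) / sd(limitedTrain$ReturnJan)
# Should return TRUE, confirming preProcess used the training-set statistics
all.equal(manualNormJan, normTest$ReturnJan)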
# Implement k-means clustering with three clusters
set.seed(144)
kmc = kmeans(normTrain, centers=3)
# Subset normTrain into the three clusters
KmeansCluster1 = subset(normTrain, kmc$cluster == 1)
KmeansCluster2 = subset(normTrain, kmc$cluster == 2)
KmeansCluster3 = subset(normTrain, kmc$cluster == 3)

# Number of observations in each cluster
z = table(kmc$cluster)
kable(z)

| Var1 | Freq |
|---|---|
| 1 | 3157 |
| 2 | 4696 |
| 3 | 253 |
# Cluster with the largest number of observations
which.max(z)
## 2
## 2

Cluster 2 has the largest number of observations (4696).
Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):
# Flexclust package
library(flexclust)
kmc.kcca = as.kcca(kmc, normTrain)
# Predict the Training set
clusterTrain = predict(kmc.kcca)
# Predict the Test set
clusterTest = predict(kmc.kcca, newdata=normTest)

# Number of observations in each test-set cluster
z = table(clusterTest)
kable(z)

| clusterTest | Freq |
|---|---|
| 1 | 1298 |
| 2 | 2080 |
| 3 | 96 |
# Number of observations in cluster 2
z[2]
## 2
## 2080

Test-set cluster 2 has 2080 observations.
Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.
# Subsetting stocksTrain into 1, 2, and 3 from the respective clusters
stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)
# Subsetting stocksTest into 1, 2, and 3 from the respective clusters
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)
# Compute the average value of PositiveDec in each training-set cluster
mean(stocksTrain1$PositiveDec)
## [1] 0.6024707
mean(stocksTrain2$PositiveDec)
## [1] 0.5140545
mean(stocksTrain3$PositiveDec)
## [1] 0.4387352

We see that stocksTrain1 has the observations with the highest average value of the dependent variable (0.6025).
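Equivalently, all three cluster means can be computed at once with tapply; this is just a more compact alternative to the three calls above:
# Average PositiveDec by training-set cluster assignment
tapply(stocksTrain$PositiveDec, clusterTrain, mean)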
Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.
Which variables have a positive coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3, and a negative coefficient in at least one of them? Select all that apply.
# Create Logistic Regression Model
StocksModel1 = glm(PositiveDec ~ ., data = stocksTrain1, family=binomial)
StocksModel2 = glm(PositiveDec ~ ., data = stocksTrain2, family=binomial)
StocksModel3 = glm(PositiveDec ~ ., data = stocksTrain3, family=binomial)

# Examine the logistic regression models
summary(StocksModel1)
##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7307 -1.2910 0.8878 1.0280 1.5023
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17224 0.06302 2.733 0.00628 **
## ReturnJan 0.02498 0.29306 0.085 0.93206
## ReturnFeb -0.37207 0.29123 -1.278 0.20139
## ReturnMar 0.59555 0.23325 2.553 0.01067 *
## ReturnApr 1.19048 0.22439 5.305 0.000000112 ***
## ReturnMay 0.30421 0.22845 1.332 0.18298
## ReturnJune -0.01165 0.29993 -0.039 0.96901
## ReturnJuly 0.19769 0.27790 0.711 0.47685
## ReturnAug 0.51273 0.30858 1.662 0.09660 .
## ReturnSep 0.58833 0.28133 2.091 0.03651 *
## ReturnOct -1.02254 0.26007 -3.932 0.000084343 ***
## ReturnNov -0.74847 0.28280 -2.647 0.00813 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4243.0 on 3156 degrees of freedom
## Residual deviance: 4172.9 on 3145 degrees of freedom
## AIC: 4196.9
##
## Number of Fisher Scoring iterations: 4
summary(StocksModel2)
##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2012 -1.1941 0.8583 1.1334 1.9424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.10293 0.03785 2.719 0.006540 **
## ReturnJan 0.88451 0.20276 4.362 0.00001286261 ***
## ReturnFeb 0.31762 0.26624 1.193 0.232878
## ReturnMar -0.37978 0.24045 -1.579 0.114231
## ReturnApr 0.49291 0.22460 2.195 0.028189 *
## ReturnMay 0.89655 0.25492 3.517 0.000436 ***
## ReturnJune 1.50088 0.26014 5.770 0.00000000795 ***
## ReturnJuly 0.78315 0.26864 2.915 0.003554 **
## ReturnAug -0.24486 0.27080 -0.904 0.365876
## ReturnSep 0.73685 0.24820 2.969 0.002989 **
## ReturnOct -0.27756 0.18400 -1.509 0.131419
## ReturnNov -0.78747 0.22458 -3.506 0.000454 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6506.3 on 4695 degrees of freedom
## Residual deviance: 6362.2 on 4684 degrees of freedom
## AIC: 6386.2
##
## Number of Fisher Scoring iterations: 4
summary(StocksModel3)
##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9146 -1.0393 -0.7689 1.1921 1.6939
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.181896 0.325182 -0.559 0.5759
## ReturnJan -0.009789 0.448943 -0.022 0.9826
## ReturnFeb -0.046883 0.213432 -0.220 0.8261
## ReturnMar 0.674179 0.564790 1.194 0.2326
## ReturnApr 1.281466 0.602672 2.126 0.0335 *
## ReturnMay 0.762512 0.647783 1.177 0.2392
## ReturnJune 0.329434 0.408038 0.807 0.4195
## ReturnJuly 0.774164 0.729360 1.061 0.2885
## ReturnAug 0.982605 0.533158 1.843 0.0653 .
## ReturnSep 0.363807 0.627774 0.580 0.5622
## ReturnOct 0.782242 0.733123 1.067 0.2860
## ReturnNov -0.873752 0.738480 -1.183 0.2367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 346.92 on 252 degrees of freedom
## Residual deviance: 328.29 on 241 degrees of freedom
## AIC: 352.29
##
## Number of Fisher Scoring iterations: 4

ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct have a positive coefficient in at least one of the three models and a negative coefficient in at least one other.
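Rather than reading the signs off the three summaries by eye, the comparison can be done programmatically; here is one sketch:
# Collect the eleven return coefficients (dropping the intercept) from each model
coefSigns = sign(cbind(coef(StocksModel1)[-1],
                       coef(StocksModel2)[-1],
                       coef(StocksModel3)[-1]))
# A variable qualifies if its row contains both a positive and a negative sign
rownames(coefSigns)[apply(coefSigns, 1, function(s) any(s > 0) & any(s < 0))]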
# Predict on the test set
predictTest1 = predict(StocksModel1, type="response", newdata=stocksTest1)
# Confusion matrix: actual outcomes vs. predictions on stocksTest1
z = table(stocksTest1$PositiveDec, predictTest1 > 0.5)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 30 | 471 |
| 1 | 23 | 774 |
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.6194145

The test-set accuracy of StocksModel1 is 0.6194145.
# Predict on the test set
predictTest2 = predict(StocksModel2, type="response", newdata=stocksTest2)
# Confusion matrix: actual outcomes vs. predictions on stocksTest2
z = table(stocksTest2$PositiveDec, predictTest2 > 0.5)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 388 | 626 |
| 1 | 309 | 757 |
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5504808

The test-set accuracy of StocksModel2 is 0.5504808.
# Predict on the test set
predictTest3 = predict(StocksModel3, type="response", newdata=stocksTest3)
# Confusion matrix: actual outcomes vs. predictions on stocksTest3
z = table(stocksTest3$PositiveDec, predictTest3 > 0.5)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 49 | 13 |
| 1 | 21 | 13 |
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.6458333

The test-set accuracy of StocksModel3 is 0.6458333.
# Combine all test-set predictions and outcomes into single vectors
AllPredictions = c(predictTest1, predictTest2, predictTest3)
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
z = table(AllOutcomes, AllPredictions > 0.5)
kable(z)

| | FALSE | TRUE |
|---|---|---|
| 0 | 467 | 1110 |
| 1 | 353 | 1544 |
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5788716

After combining the predictions and outcomes, the overall test-set accuracy of the cluster-then-predict approach is 0.5789, a modest improvement over the 0.5671 accuracy of the single logistic regression model. Since predicting stock returns is a notoriously hard problem, even a small gain in accuracy is meaningful. By investing in stocks for which we are more confident of positive returns (those with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.
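As one illustration of that idea (a sketch only; the 0.7 cutoff is an arbitrary choice for illustration, not part of the assignment), we could restrict attention to the test observations where the combined model is most confident and check how often those stocks actually went up in December:
# Keep only test observations with a high predicted probability of a positive return
confident = AllPredictions > 0.7
# Fraction of these high-confidence picks that actually had a positive December return
mean(AllOutcomes[confident])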