Background Information on the Dataset

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we’ll use cluster-then-predict to predict future stock prices using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.

For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

- ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, ReturnNov: the stock's return in each of the first 11 months of the year
- PositiveDec: whether or not the stock had a positive return in December (1 if the return was positive, 0 otherwise)

For the first 11 variables, the value stored is the proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value by 5% during the month, while a value of -0.02 means the stock decreased in value by 2% during the month.
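
As a quick illustration of this encoding (the monthly values below are made up, not taken from the dataset), proportional changes compound multiplicatively:

# Hypothetical monthly proportional changes (illustrative values only)
monthly = c(0.05, -0.02, 0.01)
# Cumulative proportional change over the three months
prod(1 + monthly) - 1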

R Exercises

Exploring the Dataset

Load StocksCluster.csv into a data frame called “stocks”.

# Read the dataset
stocks = read.csv("StocksCluster.csv")
# kable (used below to display tables) comes from the knitr package
library(knitr)

How many observations are in the dataset?

# Report the number of rows in stocks
nrow(stocks)
## [1] 11580

There are 11580 observations in this dataset.

What proportion of the observations have positive returns in December?

# Tabulate the December outcomes (0 = negative return, 1 = positive return)
z = table(stocks$PositiveDec)
kable(z)
|Var1 | Freq|
|:----|----:|
|0    | 5256|
|1    | 6324|
z[2]/(sum(z))
##        1 
## 0.546114

A proportion of 0.5461 of the observations have positive returns in December.

What is the maximum correlation between any two return variables in the dataset? You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.

# Obtain a correlation matrix of all variables in the dataset
z = cor(stocks)
kable(z)

|            | ReturnJan| ReturnFeb| ReturnMar| ReturnApr| ReturnMay| ReturnJune| ReturnJuly| ReturnAug| ReturnSep| ReturnOct| ReturnNov| PositiveDec|
|:-----------|---------:|---------:|---------:|---------:|---------:|----------:|----------:|---------:|---------:|---------:|---------:|-----------:|
|ReturnJan   | 1.0000000| 0.0667746| -0.0904968| -0.0376780| -0.0444114| 0.0922383| -0.0814298| -0.0227920| -0.0264372| 0.1429772| 0.0676323| 0.0047285|
|ReturnFeb   | 0.0667746| 1.0000000| -0.1559833| -0.1913519| -0.0955209| 0.1699945| -0.0617785| 0.1315598| 0.0435018| -0.0873243| -0.1546583| -0.0381732|
|ReturnMar   | -0.0904968| -0.1559833| 1.0000000| 0.0097263| -0.0038928| -0.0859055| 0.0033742| -0.0220054| 0.0765183| -0.0119238| 0.0373235| 0.0224087|
|ReturnApr   | -0.0376780| -0.1913519| 0.0097263| 1.0000000| 0.0638225| -0.0110278| 0.0806319| -0.0517561| -0.0289210| 0.0485400| 0.0317618| 0.0943535|
|ReturnMay   | -0.0444114| -0.0955209| -0.0038928| 0.0638225| 1.0000000| -0.0210745| 0.0908503| -0.0331257| 0.0219629| 0.0171667| 0.0480466| 0.0582019|
|ReturnJune  | 0.0922383| 0.1699945| -0.0859055| -0.0110278| -0.0210745| 1.0000000| -0.0291526| 0.0107105| 0.0447473| -0.0226360| -0.0652705| 0.0234097|
|ReturnJuly  | -0.0814298| -0.0617785| 0.0033742| 0.0806319| 0.0908503| -0.0291526| 1.0000000| 0.0007138| 0.0689478| -0.0547089| -0.0483738| 0.0743642|
|ReturnAug   | -0.0227920| 0.1315598| -0.0220054| -0.0517561| -0.0331257| 0.0107105| 0.0007138| 1.0000000| 0.0007407| -0.0755946| -0.1164890| 0.0041670|
|ReturnSep   | -0.0264372| 0.0435018| 0.0765183| -0.0289210| 0.0219629| 0.0447473| 0.0689478| 0.0007407| 1.0000000| -0.0580792| -0.0197198| 0.0416303|
|ReturnOct   | 0.1429772| -0.0873243| -0.0119238| 0.0485400| 0.0171667| -0.0226360| -0.0547089| -0.0755946| -0.0580792| 1.0000000| 0.1916728| -0.0525750|
|ReturnNov   | 0.0676323| -0.1546583| 0.0373235| 0.0317618| 0.0480466| -0.0652705| -0.0483738| -0.1164890| -0.0197198| 0.1916728| 1.0000000| -0.0623466|
|PositiveDec | 0.0047285| -0.0381732| 0.0224087| 0.0943535| 0.0582019| 0.0234097| 0.0743642| 0.0041670| 0.0416303| -0.0525750| -0.0623466| 1.0000000|

The maximum correlation between any two return variables is 0.1916728, between ReturnOct and ReturnNov.
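
Rather than scanning the matrix by eye, we can also extract the largest pairwise correlation programmatically. This is a sketch that assumes the 11 return variables occupy the first 11 columns of stocks:

# Correlation matrix of the 11 return variables only
returnCors = cor(stocks[, 1:11])
# Blank out the diagonal so the trivial self-correlations are ignored
diag(returnCors) = NA
# Largest remaining (off-diagonal) correlation
max(returnCors, na.rm = TRUE)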

Summary

# Obtain a summary of the data
z = summary(stocks)
kable(z)
|        | ReturnJan| ReturnFeb| ReturnMar| ReturnApr| ReturnMay| ReturnJune| ReturnJuly| ReturnAug| ReturnSep| ReturnOct| ReturnNov| PositiveDec|
|:-------|---------:|---------:|---------:|---------:|---------:|----------:|----------:|---------:|---------:|---------:|---------:|-----------:|
|Min.    | -0.7616205| -0.690000| -0.712994| -0.826503| -0.92207| -0.717920| -0.7613096| -0.726800| -0.839730| -0.685504| -0.747171| 0.0000|
|1st Qu. | -0.0691663| -0.077748| -0.046389| -0.054468| -0.04640| -0.063966| -0.0731917| -0.046272| -0.074648| -0.070915| -0.054890| 0.0000|
|Median  | 0.0009965| -0.010626| 0.009878| 0.009059| 0.01293| -0.000880| -0.0008047| 0.007205| -0.007616| 0.002115| 0.008522| 1.0000|
|Mean    | 0.0126316| -0.007605| 0.019402| 0.026308| 0.02474| 0.005938| 0.0030509| 0.016198| -0.014721| 0.005651| 0.011387| 0.5461|
|3rd Qu. | 0.0732606| 0.043600| 0.077066| 0.085338| 0.08396| 0.061566| 0.0718205| 0.070783| 0.049476| 0.074542| 0.076576| 1.0000|
|Max.    | 3.0683060| 6.943694| 4.008621| 2.528827| 6.93013| 4.339713| 2.5500000| 3.626609| 5.863980| 5.665138| 3.271676| 1.0000|

Which month (from January through November) has the largest mean return across all observations in the dataset?

April has the largest mean return (0.026308).

Which month (from January through November) has the smallest mean return across all observations in the dataset?

September has the smallest mean return (-0.014721).
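
The same answers can be read off programmatically; a small sketch, again assuming the return variables are columns 1 through 11 of stocks:

# Mean return for each of the 11 months
monthMeans = colMeans(stocks[, 1:11])
# Months with the largest and smallest mean return
which.max(monthMeans)
which.min(monthMeans)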

Initial Logistic Regression Model

Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:

library(caTools)

set.seed(144)
# Split the dataset -> Training and Testing 
spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain = subset(stocks, spl == TRUE)

stocksTest = subset(stocks, spl == FALSE)

Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don’t forget to add the argument family=binomial to your glm command.

What is the overall accuracy on the training set, using a threshold of 0.5?

# Create Logistic Regression Model
StocksModel = glm(PositiveDec ~ ., data = stocksTrain, family = binomial)
# Predict on the training set
predictTrain = predict(StocksModel, type="response")
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTrain$PositiveDec, predictTrain > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |   990| 2689|
|1 |   787| 3640|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5711818

The overall accuracy of the model on the training set is 0.5712.
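
Since we will compute this confusion-matrix accuracy several more times below, a small helper function would keep the later steps tidy (confusionAccuracy is a name introduced here for convenience, not part of the assignment):

# Accuracy of threshold-based predictions against actual 0/1 outcomes
confusionAccuracy = function(actual, predictedProb, threshold = 0.5) {
  z = table(actual, predictedProb > threshold)
  sum(diag(z)) / sum(z)
}
# For example: confusionAccuracy(stocksTrain$PositiveDec, predictTrain)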

Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test set, again using a threshold of 0.5?

# Predict on the test set
predictTest = predict(StocksModel, type="response", newdata=stocksTest)
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTest$PositiveDec, predictTest > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |   417| 1160|
|1 |   344| 1553|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5670697

The overall accuracy of the model on the test set is 0.5671.

What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?

# Tabulate the test set December outcomes
z = table(stocksTest$PositiveDec)
kable(z)
|Var1 | Freq|
|:----|----:|
|0    | 1577|
|1    | 1897|
# Compute the accuracy of the baseline model
z[2]/(sum(z))
##         1 
## 0.5460564

The baseline model has an accuracy of 0.5461 on the test set.

Clustering Stocks

Now, let’s cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:

# Remove the dependent variable before clustering
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL

Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?

In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we need to know the outcome value to perform the clustering, the methodology is no longer useful for prediction of an unknown outcome value.

This is an important point that is sometimes overlooked. If you use the outcome value to cluster, you might conclude that your method strongly outperforms a non-clustering alternative, but only because the clusters were built with information that would not be available at prediction time, which is not valid.

Preprocessing the Data

In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.

In cases where we have a training and testing set, we’ll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:

# Preprocess the data: center and scale using training-set statistics
library(caret)
preproc = preProcess(limitedTrain)
normTrain = predict(preproc, limitedTrain)
normTest = predict(preproc, limitedTest)

What is the mean of the ReturnJan variable in normTrain?

# Obtain the mean of the ReturnJan variable in normTrain
mean(normTrain$ReturnJan)
## [1] 2.100586e-17

The mean is 2.100586 × 10^-17, which is zero up to floating-point precision.

What is the mean of the ReturnJan variable in normTest?

# Obtain the mean of the ReturnJan variable in normTest
mean(normTest$ReturnJan)
## [1] -0.0004185886

The mean is -0.0004185886.

Why is the mean of the ReturnJan variable much closer to 0 in normTrain than in normTest?

From mean(stocksTrain$ReturnJan) and mean(stocksTest$ReturnJan), we see that the average January return is slightly higher in the training set than in the testing set. Since normTest was normalized by subtracting the mean ReturnJan value of the training set, the mean of ReturnJan in normTest is slightly negative.
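
To make the mechanics concrete, here is a manual version of the normalization for a single column; it should match what preProcess does, since preProcess stores the training-set means and standard deviations and reuses them on new data:

# Normalize ReturnJan in the test set using training-set statistics
trainMean = mean(limitedTrain$ReturnJan)
trainSD = sd(limitedTrain$ReturnJan)
manualNormTest = (limitedTest$ReturnJan - trainMean) / trainSD
# This should agree with normTest$ReturnJan up to floating-point error,
# e.g. all.equal(manualNormTest, normTest$ReturnJan)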

K-Means Clustering

# Run k-means clustering with three clusters
set.seed(144)
kmc = kmeans(normTrain, centers=3)
# Subset normTrain by cluster assignment
KmeansCluster1 = subset(normTrain, kmc$cluster == 1)
KmeansCluster2 = subset(normTrain, kmc$cluster == 2)
KmeansCluster3 = subset(normTrain, kmc$cluster == 3)

Which cluster has the largest number of observations?

# Number of observations in each cluster
z = table(kmc$cluster)
kable(z)
|Var1 | Freq|
|:----|----:|
|1    | 3157|
|2    | 4696|
|3    |  253|
# Cluster with the largest number of observations
which.max(z)
## 2 
## 2

Cluster 2 has the largest number of observations (4696).
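
To get a feel for what distinguishes the clusters, one can also inspect the centroids; each row of kmc$centers is a cluster center in the normalized return space:

# Cluster centroids, rounded for readability
round(kmc$centers, 3)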

Flexclust

Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):

# Convert the k-means output to a kcca object
library(flexclust)
kmc.kcca = as.kcca(kmc, normTrain)
# Obtain training set cluster assignments
clusterTrain = predict(kmc.kcca)
# Obtain test set cluster assignments
clusterTest = predict(kmc.kcca, newdata=normTest)

How many test-set observations were assigned to Cluster 2?

# Number of test-set observations in each cluster
z = table(clusterTest)
kable(z)
|clusterTest | Freq|
|:-----------|----:|
|1           | 1298|
|2           | 2080|
|3           |   96|
# Number of observations in cluster 2
z[2]
##    2 
## 2080

2080 test-set observations were assigned to Cluster 2.
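
As a sanity check, the converted kcca object should reproduce k-means' own training-set assignments; the cross-tabulation below should be essentially diagonal, though points near cluster boundaries can occasionally differ:

# Cross-tabulate flexclust's training assignments against kmeans' assignments
table(clusterTrain, kmc$cluster)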

Cluster-Specific Predictions

Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

Which training set data frame has the highest average value of the dependent variable?

# Subsetting stocksTrain into 1, 2, and 3 from the respective clusters
stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)
# Subsetting stocksTest into 1, 2, and 3 from the respective clusters
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)
# Compute the average value of PositiveDec in each training-set cluster
mean(stocksTrain1$PositiveDec)
## [1] 0.6024707
mean(stocksTrain2$PositiveDec)
## [1] 0.5140545
mean(stocksTrain3$PositiveDec)
## [1] 0.4387352

We see that stocksTrain1 has the highest average value of the dependent variable (0.6025).
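
The same three means can be computed in a single line with tapply, using the training-set cluster assignments directly:

# Average PositiveDec within each training-set cluster
tapply(stocksTrain$PositiveDec, clusterTrain, mean)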

Logistic regression models

Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

Which variables have a positive sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3 and a negative sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3? Select all that apply.

# Create Logistic Regression Model
StocksModel1 = glm(PositiveDec ~ ., data = stocksTrain1, family = binomial)
StocksModel2 = glm(PositiveDec ~ ., data = stocksTrain2, family = binomial)
StocksModel3 = glm(PositiveDec ~ ., data = stocksTrain3, family = binomial)
# Examine the coefficients of the three models
summary(StocksModel1)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7307  -1.2910   0.8878   1.0280   1.5023  
## 
## Coefficients:
##             Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)  0.17224    0.06302   2.733     0.00628 ** 
## ReturnJan    0.02498    0.29306   0.085     0.93206    
## ReturnFeb   -0.37207    0.29123  -1.278     0.20139    
## ReturnMar    0.59555    0.23325   2.553     0.01067 *  
## ReturnApr    1.19048    0.22439   5.305 0.000000112 ***
## ReturnMay    0.30421    0.22845   1.332     0.18298    
## ReturnJune  -0.01165    0.29993  -0.039     0.96901    
## ReturnJuly   0.19769    0.27790   0.711     0.47685    
## ReturnAug    0.51273    0.30858   1.662     0.09660 .  
## ReturnSep    0.58833    0.28133   2.091     0.03651 *  
## ReturnOct   -1.02254    0.26007  -3.932 0.000084343 ***
## ReturnNov   -0.74847    0.28280  -2.647     0.00813 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4243.0  on 3156  degrees of freedom
## Residual deviance: 4172.9  on 3145  degrees of freedom
## AIC: 4196.9
## 
## Number of Fisher Scoring iterations: 4
summary(StocksModel2)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2012  -1.1941   0.8583   1.1334   1.9424  
## 
## Coefficients:
##             Estimate Std. Error z value      Pr(>|z|)    
## (Intercept)  0.10293    0.03785   2.719      0.006540 ** 
## ReturnJan    0.88451    0.20276   4.362 0.00001286261 ***
## ReturnFeb    0.31762    0.26624   1.193      0.232878    
## ReturnMar   -0.37978    0.24045  -1.579      0.114231    
## ReturnApr    0.49291    0.22460   2.195      0.028189 *  
## ReturnMay    0.89655    0.25492   3.517      0.000436 ***
## ReturnJune   1.50088    0.26014   5.770 0.00000000795 ***
## ReturnJuly   0.78315    0.26864   2.915      0.003554 ** 
## ReturnAug   -0.24486    0.27080  -0.904      0.365876    
## ReturnSep    0.73685    0.24820   2.969      0.002989 ** 
## ReturnOct   -0.27756    0.18400  -1.509      0.131419    
## ReturnNov   -0.78747    0.22458  -3.506      0.000454 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6506.3  on 4695  degrees of freedom
## Residual deviance: 6362.2  on 4684  degrees of freedom
## AIC: 6386.2
## 
## Number of Fisher Scoring iterations: 4
summary(StocksModel3)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9146  -1.0393  -0.7689   1.1921   1.6939  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.181896   0.325182  -0.559   0.5759  
## ReturnJan   -0.009789   0.448943  -0.022   0.9826  
## ReturnFeb   -0.046883   0.213432  -0.220   0.8261  
## ReturnMar    0.674179   0.564790   1.194   0.2326  
## ReturnApr    1.281466   0.602672   2.126   0.0335 *
## ReturnMay    0.762512   0.647783   1.177   0.2392  
## ReturnJune   0.329434   0.408038   0.807   0.4195  
## ReturnJuly   0.774164   0.729360   1.061   0.2885  
## ReturnAug    0.982605   0.533158   1.843   0.0653 .
## ReturnSep    0.363807   0.627774   0.580   0.5622  
## ReturnOct    0.782242   0.733123   1.067   0.2860  
## ReturnNov   -0.873752   0.738480  -1.183   0.2367  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 346.92  on 252  degrees of freedom
## Residual deviance: 328.29  on 241  degrees of freedom
## AIC: 352.29
## 
## Number of Fisher Scoring iterations: 4

ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct differ in sign between the models.
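
This comparison can also be done programmatically; a sketch that tabulates the coefficient signs across the three models and keeps the variables whose sign is not constant:

# Signs of the coefficients in each of the three models
signs = sign(cbind(coef(StocksModel1), coef(StocksModel2), coef(StocksModel3)))
# Drop the intercept row, then keep variables whose sign differs across models
signs = signs[-1, ]
rownames(signs)[apply(signs, 1, function(s) length(unique(s)) > 1)]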

What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?

# Predict on the test set stocksTest1
predictTest1 = predict(StocksModel1, type="response", newdata=stocksTest1)
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTest1$PositiveDec, predictTest1 > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |    30|  471|
|1 |    23|  774|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.6194145

The overall accuracy of StocksModel1 is 0.6194145.

What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?

# Predict on the test set stocksTest2
predictTest2 = predict(StocksModel2, type="response", newdata=stocksTest2)
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTest2$PositiveDec, predictTest2 > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |   388|  626|
|1 |   309|  757|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5504808

The overall accuracy of StocksModel2 is 0.5504808.

What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?

# Predict on the test set stocksTest3
predictTest3 = predict(StocksModel3, type="response", newdata=stocksTest3)
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTest3$PositiveDec, predictTest3 > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |    49|   13|
|1 |    21|   13|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.6458333

The overall accuracy of StocksModel3 is 0.6458333.

To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:

# Combine all test-set predictions into one vector and all outcomes into another
AllPredictions = c(predictTest1, predictTest2, predictTest3)
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
# Confusion matrix: actual outcomes vs. combined predictions at a 0.5 threshold
z = table(AllOutcomes, AllPredictions > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |   467| 1110|
|1 |   353| 1544|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5788716

The overall test-set accuracy of the cluster-then-predict approach is 0.5789, a modest improvement over the 0.5671 accuracy of the original logistic regression model on the same test set. Since predicting stock returns is a notoriously hard problem, even this small gain is meaningful. By investing in the stocks we are most confident will have positive returns (those with the highest predicted probabilities), the cluster-then-predict model can give us an edge over the original logistic regression model.
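
Finally, a small sketch of the investment idea in that last sentence: raising the probability threshold selects fewer stocks, but the selected ones should be right more often (0.7 below is an arbitrary illustrative cutoff, not from the assignment):

# Restrict attention to the stock-years the model is most confident about
confident = AllPredictions > 0.7
# Number of stock-years selected, and the fraction that actually went up in December
sum(confident)
mean(AllOutcomes[confident])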