Background Information on the Dataset

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we’ll use cluster-then-predict to predict future stock prices using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.

For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

- ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, ReturnNov: the stock's return in each of the first 11 months of the year
- PositiveDec: whether or not the stock had a positive return in December (1 if the return was positive, 0 otherwise)

For the first 11 variables, the value stored is the proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value by 5% during the month, while a value of -0.02 means the stock decreased in value by 2% during the month.
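
As a quick illustration of this encoding (the monthly values below are made up, not taken from the dataset), proportional changes compound multiplicatively:

# Hypothetical monthly proportional changes (illustrative values only)
monthly = c(0.05, -0.02, 0.01)
# Cumulative proportional change over the three months
prod(1 + monthly) - 1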

R Exercises

Exploring the Dataset

Load StocksCluster.csv into a data frame called “stocks”.

# Read the dataset
stocks = read.csv("StocksCluster.csv")
# kable (used below to display tables) comes from the knitr package
library(knitr)

How many observations are in the dataset?

# Report the number of rows in stocks
nrow(stocks)
## [1] 11580

There are 11580 observations in this dataset.

What proportion of the observations have positive returns in December?

# Tabulate the December outcomes (0 = negative return, 1 = positive return)
z = table(stocks$PositiveDec)
kable(z)
|Var1 | Freq|
|:----|----:|
|0    | 5256|
|1    | 6324|
z[2]/(sum(z))
##        1 
## 0.546114

A proportion of 0.5461 of the observations have positive returns in December.

What is the maximum correlation between any two return variables in the dataset? You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.

# Obtain a correlation matrix of all variables in the dataset
z = cor(stocks)
kable(z)

|            | ReturnJan| ReturnFeb| ReturnMar| ReturnApr| ReturnMay| ReturnJune| ReturnJuly| ReturnAug| ReturnSep| ReturnOct| ReturnNov| PositiveDec|
|:-----------|---------:|---------:|---------:|---------:|---------:|----------:|----------:|---------:|---------:|---------:|---------:|-----------:|
|ReturnJan   | 1.0000000| 0.0667746| -0.0904968| -0.0376780| -0.0444114| 0.0922383| -0.0814298| -0.0227920| -0.0264372| 0.1429772| 0.0676323| 0.0047285|
|ReturnFeb   | 0.0667746| 1.0000000| -0.1559833| -0.1913519| -0.0955209| 0.1699945| -0.0617785| 0.1315598| 0.0435018| -0.0873243| -0.1546583| -0.0381732|
|ReturnMar   | -0.0904968| -0.1559833| 1.0000000| 0.0097263| -0.0038928| -0.0859055| 0.0033742| -0.0220054| 0.0765183| -0.0119238| 0.0373235| 0.0224087|
|ReturnApr   | -0.0376780| -0.1913519| 0.0097263| 1.0000000| 0.0638225| -0.0110278| 0.0806319| -0.0517561| -0.0289210| 0.0485400| 0.0317618| 0.0943535|
|ReturnMay   | -0.0444114| -0.0955209| -0.0038928| 0.0638225| 1.0000000| -0.0210745| 0.0908503| -0.0331257| 0.0219629| 0.0171667| 0.0480466| 0.0582019|
|ReturnJune  | 0.0922383| 0.1699945| -0.0859055| -0.0110278| -0.0210745| 1.0000000| -0.0291526| 0.0107105| 0.0447473| -0.0226360| -0.0652705| 0.0234097|
|ReturnJuly  | -0.0814298| -0.0617785| 0.0033742| 0.0806319| 0.0908503| -0.0291526| 1.0000000| 0.0007138| 0.0689478| -0.0547089| -0.0483738| 0.0743642|
|ReturnAug   | -0.0227920| 0.1315598| -0.0220054| -0.0517561| -0.0331257| 0.0107105| 0.0007138| 1.0000000| 0.0007407| -0.0755946| -0.1164890| 0.0041670|
|ReturnSep   | -0.0264372| 0.0435018| 0.0765183| -0.0289210| 0.0219629| 0.0447473| 0.0689478| 0.0007407| 1.0000000| -0.0580792| -0.0197198| 0.0416303|
|ReturnOct   | 0.1429772| -0.0873243| -0.0119238| 0.0485400| 0.0171667| -0.0226360| -0.0547089| -0.0755946| -0.0580792| 1.0000000| 0.1916728| -0.0525750|
|ReturnNov   | 0.0676323| -0.1546583| 0.0373235| 0.0317618| 0.0480466| -0.0652705| -0.0483738| -0.1164890| -0.0197198| 0.1916728| 1.0000000| -0.0623466|
|PositiveDec | 0.0047285| -0.0381732| 0.0224087| 0.0943535| 0.0582019| 0.0234097| 0.0743642| 0.0041670| 0.0416303| -0.0525750| -0.0623466| 1.0000000|

The maximum correlation between any two return variables is 0.1916728, between ReturnOct and ReturnNov.
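
Rather than scanning the matrix by eye, we can also extract the largest pairwise correlation programmatically. This is a sketch that assumes the 11 return variables occupy the first 11 columns of stocks:

# Correlation matrix of the 11 return variables only
returnCors = cor(stocks[, 1:11])
# Blank out the diagonal so the trivial self-correlations are ignored
diag(returnCors) = NA
# Largest remaining (off-diagonal) correlation
max(returnCors, na.rm = TRUE)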

Summary

# Obtain a summary of the data
z = summary(stocks)
kable(z)
|        | ReturnJan| ReturnFeb| ReturnMar| ReturnApr| ReturnMay| ReturnJune| ReturnJuly| ReturnAug| ReturnSep| ReturnOct| ReturnNov| PositiveDec|
|:-------|---------:|---------:|---------:|---------:|---------:|----------:|----------:|---------:|---------:|---------:|---------:|-----------:|
|Min.    | -0.7616205| -0.690000| -0.712994| -0.826503| -0.92207| -0.717920| -0.7613096| -0.726800| -0.839730| -0.685504| -0.747171| 0.0000|
|1st Qu. | -0.0691663| -0.077748| -0.046389| -0.054468| -0.04640| -0.063966| -0.0731917| -0.046272| -0.074648| -0.070915| -0.054890| 0.0000|
|Median  | 0.0009965| -0.010626| 0.009878| 0.009059| 0.01293| -0.000880| -0.0008047| 0.007205| -0.007616| 0.002115| 0.008522| 1.0000|
|Mean    | 0.0126316| -0.007605| 0.019402| 0.026308| 0.02474| 0.005938| 0.0030509| 0.016198| -0.014721| 0.005651| 0.011387| 0.5461|
|3rd Qu. | 0.0732606| 0.043600| 0.077066| 0.085338| 0.08396| 0.061566| 0.0718205| 0.070783| 0.049476| 0.074542| 0.076576| 1.0000|
|Max.    | 3.0683060| 6.943694| 4.008621| 2.528827| 6.93013| 4.339713| 2.5500000| 3.626609| 5.863980| 5.665138| 3.271676| 1.0000|

Which month (from January through November) has the largest mean return across all observations in the dataset?

April has the largest mean return (0.026308).

Which month (from January through November) has the smallest mean return across all observations in the dataset?

September has the smallest mean return (-0.014721).
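
The same answers can be read off programmatically; a small sketch, again assuming the return variables are columns 1 through 11 of stocks:

# Mean return for each of the 11 months
monthMeans = colMeans(stocks[, 1:11])
# Months with the largest and smallest mean return
which.max(monthMeans)
which.min(monthMeans)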

Initial Logistic Regression Model

Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:

library(caTools)

set.seed(144)
# Split the dataset -> Training and Testing 
spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain = subset(stocks, spl == TRUE)

stocksTest = subset(stocks, spl == FALSE)

Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don’t forget to add the argument family=binomial to your glm command.

What is the overall accuracy on the training set, using a threshold of 0.5?

# Create Logistic Regression Model
StocksModel = glm(PositiveDec ~ ., data = stocksTrain, family = binomial)
# Predict on the training set
predictTrain = predict(StocksModel, type="response")
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTrain$PositiveDec, predictTrain > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |   990| 2689|
|1 |   787| 3640|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5711818

The overall accuracy of the model on the training set is 0.5712.
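
Since we will compute this confusion-matrix accuracy several more times below, a small helper function would keep the later steps tidy (confusionAccuracy is a name introduced here for convenience, not part of the assignment):

# Accuracy of threshold-based predictions against actual 0/1 outcomes
confusionAccuracy = function(actual, predictedProb, threshold = 0.5) {
  z = table(actual, predictedProb > threshold)
  sum(diag(z)) / sum(z)
}
# For example: confusionAccuracy(stocksTrain$PositiveDec, predictTrain)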

Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test set, again using a threshold of 0.5?

# Predict on the test set
predictTest = predict(StocksModel, type="response", newdata=stocksTest)
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTest$PositiveDec, predictTest > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |   417| 1160|
|1 |   344| 1553|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5670697

The overall accuracy of the model on the test set is 0.5671.

What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?

# Tabulate the test set December outcomes
z = table(stocksTest$PositiveDec)
kable(z)
|Var1 | Freq|
|:----|----:|
|0    | 1577|
|1    | 1897|
# Compute the accuracy of the baseline model
z[2]/(sum(z))
##         1 
## 0.5460564

The baseline model has an accuracy of 0.5461 on the test set.

Clustering Stocks

Now, let’s cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:

# Remove the dependent variable before clustering
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL

Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?

In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we need to know the outcome value to perform the clustering, the methodology is no longer useful for prediction of an unknown outcome value.

This is an important point that is sometimes overlooked. If you use the outcome value to cluster, you might conclude that your method strongly outperforms a non-clustering alternative, but only because the clusters were built with information that would not be available at prediction time, which is not valid.

Preprocessing the Data

In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.

In cases where we have a training and testing set, we’ll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:

# Preprocess the data: center and scale using training-set statistics
library(caret)
preproc = preProcess(limitedTrain)
normTrain = predict(preproc, limitedTrain)
normTest = predict(preproc, limitedTest)

What is the mean of the ReturnJan variable in normTrain?

# Obtain the mean of the ReturnJan variable in normTrain
mean(normTrain$ReturnJan)
## [1] 2.100586e-17

The mean is 2.100586 × 10^-17, which is zero up to floating-point precision.

What is the mean of the ReturnJan variable in normTest?

# Obtain the mean of the ReturnJan variable in normTest
mean(normTest$ReturnJan)
## [1] -0.0004185886

The mean is -0.0004185886.

Why is the mean of the ReturnJan variable much closer to 0 in normTrain than in normTest?

From mean(stocksTrain$ReturnJan) and mean(stocksTest$ReturnJan), we see that the average January return is slightly higher in the training set than in the testing set. Since normTest was normalized by subtracting the mean ReturnJan value of the training set, the mean of ReturnJan in normTest is slightly negative.
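
To make the mechanics concrete, here is a manual version of the normalization for a single column; it should match what preProcess does, since preProcess stores the training-set means and standard deviations and reuses them on new data:

# Normalize ReturnJan in the test set using training-set statistics
trainMean = mean(limitedTrain$ReturnJan)
trainSD = sd(limitedTrain$ReturnJan)
manualNormTest = (limitedTest$ReturnJan - trainMean) / trainSD
# This should agree with normTest$ReturnJan up to floating-point error,
# e.g. all.equal(manualNormTest, normTest$ReturnJan)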

K-Means Clustering

# Run k-means clustering with three clusters
set.seed(144)
kmc = kmeans(normTrain, centers=3)
# Subset normTrain by cluster assignment
KmeansCluster1 = subset(normTrain, kmc$cluster == 1)
KmeansCluster2 = subset(normTrain, kmc$cluster == 2)
KmeansCluster3 = subset(normTrain, kmc$cluster == 3)

Which cluster has the largest number of observations?

# Number of observations in each cluster
z = table(kmc$cluster)
kable(z)
|Var1 | Freq|
|:----|----:|
|1    | 3157|
|2    | 4696|
|3    |  253|
# Cluster with the largest number of observations
which.max(z)
## 2 
## 2

Cluster 2 has the largest number of observations (4696).
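
To get a feel for what distinguishes the clusters, one can also inspect the centroids; each row of kmc$centers is a cluster center in the normalized return space:

# Cluster centroids, rounded for readability
round(kmc$centers, 3)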

Flexclust

Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):

# Convert the k-means output to a kcca object
library(flexclust)
kmc.kcca = as.kcca(kmc, normTrain)
# Obtain training set cluster assignments
clusterTrain = predict(kmc.kcca)
# Obtain test set cluster assignments
clusterTest = predict(kmc.kcca, newdata=normTest)

How many test-set observations were assigned to Cluster 2?

# Number of test-set observations in each cluster
z = table(clusterTest)
kable(z)
|clusterTest | Freq|
|:-----------|----:|
|1           | 1298|
|2           | 2080|
|3           |   96|
# Number of observations in cluster 2
z[2]
##    2 
## 2080

2080 test-set observations were assigned to Cluster 2.
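
As a sanity check, the converted kcca object should reproduce k-means' own training-set assignments; the cross-tabulation below should be essentially diagonal, though points near cluster boundaries can occasionally differ:

# Cross-tabulate flexclust's training assignments against kmeans' assignments
table(clusterTrain, kmc$cluster)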

Cluster-Specific Predictions

Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

Which training set data frame has the highest average value of the dependent variable?

# Subsetting stocksTrain into 1, 2, and 3 from the respective clusters
stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)
# Subsetting stocksTest into 1, 2, and 3 from the respective clusters
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)
# Compute the average value of PositiveDec in each training-set cluster
mean(stocksTrain1$PositiveDec)
## [1] 0.6024707
mean(stocksTrain2$PositiveDec)
## [1] 0.5140545
mean(stocksTrain3$PositiveDec)
## [1] 0.4387352

We see that stocksTrain1 has the highest average value of the dependent variable (0.6025).
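
The same three means can be computed in a single line with tapply, using the training-set cluster assignments directly:

# Average PositiveDec within each training-set cluster
tapply(stocksTrain$PositiveDec, clusterTrain, mean)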

Logistic regression models

Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

Which variables have a positive sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3 and a negative sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3? Select all that apply.

# Create Logistic Regression Model
StocksModel1 = glm(PositiveDec ~ ., data = stocksTrain1, family = binomial)
StocksModel2 = glm(PositiveDec ~ ., data = stocksTrain2, family = binomial)
StocksModel3 = glm(PositiveDec ~ ., data = stocksTrain3, family = binomial)
# Examine the coefficients of the three models
summary(StocksModel1)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7307  -1.2910   0.8878   1.0280   1.5023  
## 
## Coefficients:
##             Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)  0.17224    0.06302   2.733     0.00628 ** 
## ReturnJan    0.02498    0.29306   0.085     0.93206    
## ReturnFeb   -0.37207    0.29123  -1.278     0.20139    
## ReturnMar    0.59555    0.23325   2.553     0.01067 *  
## ReturnApr    1.19048    0.22439   5.305 0.000000112 ***
## ReturnMay    0.30421    0.22845   1.332     0.18298    
## ReturnJune  -0.01165    0.29993  -0.039     0.96901    
## ReturnJuly   0.19769    0.27790   0.711     0.47685    
## ReturnAug    0.51273    0.30858   1.662     0.09660 .  
## ReturnSep    0.58833    0.28133   2.091     0.03651 *  
## ReturnOct   -1.02254    0.26007  -3.932 0.000084343 ***
## ReturnNov   -0.74847    0.28280  -2.647     0.00813 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4243.0  on 3156  degrees of freedom
## Residual deviance: 4172.9  on 3145  degrees of freedom
## AIC: 4196.9
## 
## Number of Fisher Scoring iterations: 4
summary(StocksModel2)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2012  -1.1941   0.8583   1.1334   1.9424  
## 
## Coefficients:
##             Estimate Std. Error z value      Pr(>|z|)    
## (Intercept)  0.10293    0.03785   2.719      0.006540 ** 
## ReturnJan    0.88451    0.20276   4.362 0.00001286261 ***
## ReturnFeb    0.31762    0.26624   1.193      0.232878    
## ReturnMar   -0.37978    0.24045  -1.579      0.114231    
## ReturnApr    0.49291    0.22460   2.195      0.028189 *  
## ReturnMay    0.89655    0.25492   3.517      0.000436 ***
## ReturnJune   1.50088    0.26014   5.770 0.00000000795 ***
## ReturnJuly   0.78315    0.26864   2.915      0.003554 ** 
## ReturnAug   -0.24486    0.27080  -0.904      0.365876    
## ReturnSep    0.73685    0.24820   2.969      0.002989 ** 
## ReturnOct   -0.27756    0.18400  -1.509      0.131419    
## ReturnNov   -0.78747    0.22458  -3.506      0.000454 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6506.3  on 4695  degrees of freedom
## Residual deviance: 6362.2  on 4684  degrees of freedom
## AIC: 6386.2
## 
## Number of Fisher Scoring iterations: 4
summary(StocksModel3)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9146  -1.0393  -0.7689   1.1921   1.6939  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.181896   0.325182  -0.559   0.5759  
## ReturnJan   -0.009789   0.448943  -0.022   0.9826  
## ReturnFeb   -0.046883   0.213432  -0.220   0.8261  
## ReturnMar    0.674179   0.564790   1.194   0.2326  
## ReturnApr    1.281466   0.602672   2.126   0.0335 *
## ReturnMay    0.762512   0.647783   1.177   0.2392  
## ReturnJune   0.329434   0.408038   0.807   0.4195  
## ReturnJuly   0.774164   0.729360   1.061   0.2885  
## ReturnAug    0.982605   0.533158   1.843   0.0653 .
## ReturnSep    0.363807   0.627774   0.580   0.5622  
## ReturnOct    0.782242   0.733123   1.067   0.2860  
## ReturnNov   -0.873752   0.738480  -1.183   0.2367  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 346.92  on 252  degrees of freedom
## Residual deviance: 328.29  on 241  degrees of freedom
## AIC: 352.29
## 
## Number of Fisher Scoring iterations: 4

ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct differ in sign between the models.
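
This comparison can also be done programmatically; a sketch that tabulates the coefficient signs across the three models and keeps the variables whose sign is not constant:

# Signs of the coefficients in each of the three models
signs = sign(cbind(coef(StocksModel1), coef(StocksModel2), coef(StocksModel3)))
# Drop the intercept row, then keep variables whose sign differs across models
signs = signs[-1, ]
rownames(signs)[apply(signs, 1, function(s) length(unique(s)) > 1)]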

What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?

# Predict on the test set stocksTest1
predictTest1 = predict(StocksModel1, type="response", newdata=stocksTest1)
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTest1$PositiveDec, predictTest1 > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |    30|  471|
|1 |    23|  774|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.6194145

The overall accuracy of StocksModel1 is 0.6194145.

What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?

# Predict on the test set stocksTest2
predictTest2 = predict(StocksModel2, type="response", newdata=stocksTest2)
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTest2$PositiveDec, predictTest2 > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |   388|  626|
|1 |   309|  757|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5504808

The overall accuracy of StocksModel2 is 0.5504808.

What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?

# Predict on the test set stocksTest3
predictTest3 = predict(StocksModel3, type="response", newdata=stocksTest3)
# Confusion matrix: actual outcomes vs. predictions at a 0.5 threshold
z = table(stocksTest3$PositiveDec, predictTest3 > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |    49|   13|
|1 |    21|   13|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.6458333

The overall accuracy of StocksModel3 is 0.6458333.

To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:

# Combine all test-set predictions into one vector and all outcomes into another
AllPredictions = c(predictTest1, predictTest2, predictTest3)
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
# Confusion matrix: actual outcomes vs. combined predictions at a 0.5 threshold
z = table(AllOutcomes, AllPredictions > 0.5)
kable(z)
|  | FALSE| TRUE|
|:-|-----:|----:|
|0 |   467| 1110|
|1 |   353| 1544|
# Compute Accuracy
sum(diag(z))/(sum(z))
## [1] 0.5788716

The overall test-set accuracy of the cluster-then-predict approach is 0.5789, a modest improvement over the 0.5671 accuracy of the original logistic regression model on the same test set. Since predicting stock returns is a notoriously hard problem, even this small gain is meaningful. By investing in the stocks we are most confident will have positive returns (those with the highest predicted probabilities), the cluster-then-predict model can give us an edge over the original logistic regression model.
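
Finally, a small sketch of the investment idea in that last sentence: raising the probability threshold selects fewer stocks, but the selected ones should be right more often (0.7 below is an arbitrary illustrative cutoff, not from the assignment):

# Restrict attention to the stock-years the model is most confident about
confident = AllPredictions > 0.7
# Number of stock-years selected, and the fraction that actually went up in December
sum(confident)
mean(AllOutcomes[confident])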