Main topic: predicting stock investment returns
Learning focus:
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=5, scipen=12)
library(dplyr)
library(caTools)
library(caret)
library(flexclust)
Load StocksCluster.csv into a data frame called “stocks” (this notebook names it A).
A = read.csv('data/StocksCluster.csv')
nrow(A)
[1] 11580
How many observations are in the dataset?
mean(A$PositiveDec)  # proportion of observations with a positive December return
[1] 0.54611
What proportion of the observations have positive returns in December?
cor(A[1:11]) %>% sort %>% unique %>% tail %>% round(2)
[1] 0.09 0.13 0.14 0.17 0.19 1.00
What is the maximum correlation between any two return variables in the dataset? You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.
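Rather than scanning the sorted values and skipping the 1.00 from the diagonal, one option is to mask the diagonal and the duplicated lower triangle before taking the maximum. The sketch below runs on a small made-up data frame `toy` (an assumption for illustration, not the stocks data); with the real data the same steps would start from `cor(A[1:11])`.

```r
# Toy stand-in for the return columns (hypothetical data, not StocksCluster.csv)
set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
toy$b <- toy$a + 0.5 * toy$b          # make a and b clearly correlated

cm <- cor(toy)
cm[!upper.tri(cm)] <- NA              # keep each pair once and drop the diagonal

# Locate the largest remaining correlation and the pair it belongs to
idx <- which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)
best_pair  <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
best_value <- cm[idx[1, "row"], idx[1, "col"]]

print(best_pair)
print(round(best_value, 2))
```

This also reports which two variables produce the maximum, which the sort-and-tail approach does not.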
colMeans(A[,1:11]) %>% sort %>% barplot(las=2, cex.names=0.8, cex.axis=0.8)
Which month (from January through November) has the largest mean return across all observations in the dataset?
Which month (from January through November) has the smallest mean return across all observations in the dataset?
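The bar plot answers both questions visually; the month names can also be pulled out directly with `which.max` and `which.min`. A minimal sketch, using three made-up columns (the values are hypothetical, not taken from the dataset):

```r
# Hypothetical monthly returns for two stocks (illustration only)
toy <- data.frame(ReturnJan = c(0.02, 0.04),
                  ReturnFeb = c(-0.03, -0.01),
                  ReturnMar = c(0.00, 0.01))

m <- colMeans(toy)
largest  <- names(which.max(m))   # column with the largest mean return
smallest <- names(which.min(m))   # column with the smallest mean return

print(c(largest = largest, smallest = smallest))
```

On the real data the same two calls would be applied to `colMeans(A[,1:11])`.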
Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:
set.seed(144)
spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain = subset(stocks, spl == TRUE)
stocksTest = subset(stocks, spl == FALSE)
library(caTools)
set.seed(144)
spl = sample.split(A$PositiveDec, 0.7)
TR = subset(A, spl)
TS = subset(A, !spl)
sapply(list(A, TR, TS), function(x) mean(x$PositiveDec))
[1] 0.54611 0.54614 0.54606
Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don’t forget to add the argument family=binomial to your glm command.
glm1 = glm(PositiveDec ~ ., TR, family=binomial)
pred = predict(glm1, type='response')
# training-set accuracy at a 0.5 threshold
table(TR$PositiveDec, pred > 0.5) %>% {sum(diag(.))/sum(.)}
[1] 0.57118
What is the overall accuracy on the training set, using a threshold of 0.5?
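The `table(...) %>% {sum(diag(.))/sum(.)}` idiom used in this notebook assumes a 2x2 table. If every prediction fell on the same side of the threshold, the table would have a single column and its diagonal would no longer line up with the correct predictions. A more defensive way to compute the same quantity is sketched below; `accuracy_at` is a hypothetical helper name and the vectors are toy values, not model output.

```r
# Hypothetical helper: share of cases where the thresholded prediction
# matches the actual 0/1 outcome
accuracy_at <- function(actual, prob, threshold = 0.5) {
  mean((prob > threshold) == (actual == 1))
}

# Toy values (not from the stocks data)
actual <- c(1, 0, 1, 1, 0)
prob   <- c(0.9, 0.4, 0.3, 0.6, 0.7)

acc <- accuracy_at(actual, prob)                  # 3 of 5 predictions are correct
acc_one_sided <- accuracy_at(actual, rep(0.9, 5)) # still defined when all predictions are TRUE

print(c(acc, acc_one_sided))
```

With the notebook's objects the call would look like `accuracy_at(TR$PositiveDec, pred)`.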
pred = predict(glm1, TS, type='response')
# test-set accuracy at a 0.5 threshold
table(TS$PositiveDec, pred > 0.5) %>% {sum(diag(.))/sum(.)}
[1] 0.56707
Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test, again using a threshold of 0.5?
mean(TS$PositiveDec)
[1] 0.54606
What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?
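`mean(TS$PositiveDec)` equals the baseline accuracy here only because 1 is the more common outcome. A form that holds whichever class is in the majority is sketched below on a made-up outcome vector `y` (hypothetical values):

```r
# Toy 0/1 outcomes (illustration only)
y <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Accuracy of always predicting the majority class, whichever class that is
baseline_acc <- max(mean(y), 1 - mean(y))

print(baseline_acc)
```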
Now, let’s cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:
LTR = TR[,1:11]
LTS = TS[,1:11]
Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?
+Including the target variable in the cluster analysis would cause overfitting, so the prediction model would not generalize.
In the market segmentation assignment in this week’s homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.
In cases where we have a training and testing set, we’ll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:
library(caret)
preproc = preProcess(LTR)
NTR = predict(preproc, LTR)
NTS = predict(preproc, LTS)
mean(NTR$ReturnJan)
[1] 2.1006e-17
What is the mean of the ReturnJan variable in normTrain?
mean(NTS$ReturnJan)
[1] -0.00041859
What is the mean of the ReturnJan variable in normTest?
Why is the mean ReturnJan variable much closer to 0 in normTrain than in normTest?
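The difference follows from how the normalization is defined: both sets are centered and scaled with the training-set mean and standard deviation, so the training mean becomes zero by construction, while the test mean is only approximately zero. The base-R sketch below reproduces this on simulated data; `train` and `test` are made-up stand-ins, and `scale()` is used here in place of caret's `preProcess`.

```r
set.seed(144)
# Simulated stand-ins for one return column (not the stocks data)
train <- data.frame(x = rnorm(200, mean = 5, sd = 2))
test  <- data.frame(x = rnorm(100, mean = 5, sd = 2))

# Statistics are estimated on the training set only
mu    <- colMeans(train)
sigma <- sapply(train, sd)

# The same training statistics are applied to both sets
norm_train <- scale(train, center = mu, scale = sigma)
norm_test  <- scale(test,  center = mu, scale = sigma)

train_mean <- mean(norm_train[, "x"])   # zero up to floating-point error
test_mean  <- mean(norm_test[, "x"])    # near zero, but not exactly zero

print(c(train_mean, test_mean))
```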
Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.
set.seed(144)
km <- kmeans(NTR, 3)
table(km$cluster)
   1    2    3
3157 4696  253
Which cluster has the largest number of observations?
Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):
library(flexclust)
km.kcca = as.kcca(km, NTR)
CTR = predict(km.kcca)
CTS = predict(km.kcca, newdata=NTS)
table(CTS)
CTS
   1    2    3
1298 2080   96
How many test-set observations were assigned to Cluster 2?
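Conceptually, assigning a new observation to a k-means cluster means finding the centroid closest to it in Euclidean distance. The sketch below shows that idea in base R on simulated data; `nearest_centroid` is a hypothetical helper written for illustration and is not part of flexclust, whose `predict` method is what this notebook actually uses.

```r
set.seed(144)
# Toy training data: two well-separated blobs (hypothetical, not the stocks data)
train <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(train, centers = 2)

# Hypothetical helper: index of the closest centroid for each row of x
nearest_centroid <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)), function(k)
    rowSums(sweep(x, 2, centers[k, ])^2))   # squared distance to centroid k
  d <- matrix(d, nrow = nrow(x))            # keep a matrix even for a single row
  max.col(-d, ties.method = "first")        # column with the smallest distance
}

train_assign <- nearest_centroid(train, km$centers)
agree <- all(train_assign == km$cluster)    # compare with kmeans' own labels

new_points <- rbind(c(0, 0), c(5, 5))       # stand-in for a test set
new_assign <- nearest_centroid(new_points, km$centers)

print(agree)
print(new_assign)
```

Because the centroids come from the training set only, the test set never influences where the cluster boundaries fall.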
Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.
tapply(TR$PositiveDec, CTR, mean)
      1       2       3
0.60247 0.51405 0.43874
Which training set data frame has the highest average value of the dependent variable?
+stocksTrain1
Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.
M = lapply(split(TR, CTR), function(x)  # split TR into 3 groups by CTR and fit a logistic regression on each
  glm(PositiveDec~., data=x, family=binomial) )
sapply(M, function(x) coef(summary(x))[,1])
                    1        2          3
(Intercept)  0.172240  0.10293 -0.1818958
ReturnJan    0.024984  0.88451 -0.0097893
ReturnFeb   -0.372074  0.31762 -0.0468833
ReturnMar    0.595550 -0.37978  0.6741795
ReturnApr    1.190478  0.49291  1.2814662
ReturnMay    0.304209  0.89655  0.7625116
ReturnJune  -0.011654  1.50088  0.3294339
ReturnJuly   0.197692  0.78315  0.7741644
ReturnAug    0.512729 -0.24486  0.9826054
ReturnSep    0.588327  0.73685  0.3638068
ReturnOct   -1.022535 -0.27756  0.7822421
ReturnNov   -0.748472 -0.78747 -0.8737521
Which variables have a positive sign for the coefficient in at least one model and a negative sign for the coefficient in at least one model? Select all that apply.
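The sign check can be done programmatically instead of by eye. The sketch below retypes the coefficient estimates from the table above into a matrix named `coefs` (the name is an assumption; with the fitted models available, `sapply(M, coef)` would give the same layout) and keeps the variables whose coefficient is positive in at least one model and negative in at least one other.

```r
# Coefficient estimates copied from the table above (intercept omitted)
coefs <- rbind(
  ReturnJan  = c( 0.024984,  0.88451, -0.0097893),
  ReturnFeb  = c(-0.372074,  0.31762, -0.0468833),
  ReturnMar  = c( 0.595550, -0.37978,  0.6741795),
  ReturnApr  = c( 1.190478,  0.49291,  1.2814662),
  ReturnMay  = c( 0.304209,  0.89655,  0.7625116),
  ReturnJune = c(-0.011654,  1.50088,  0.3294339),
  ReturnJuly = c( 0.197692,  0.78315,  0.7741644),
  ReturnAug  = c( 0.512729, -0.24486,  0.9826054),
  ReturnSep  = c( 0.588327,  0.73685,  0.3638068),
  ReturnOct  = c(-1.022535, -0.27756,  0.7822421),
  ReturnNov  = c(-0.748472, -0.78747, -0.8737521)
)
colnames(coefs) <- c("1", "2", "3")

# A variable qualifies if its row has both a positive and a negative entry
mixed_sign <- apply(coefs, 1, function(r) any(r > 0) && any(r < 0))
mixed <- rownames(coefs)[mixed_sign]

print(mixed)
```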
Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.
Pred = lapply(1:3, function(i)
predict(M[[i]], TS[CTS==i,], type='response') )
sapply(1:3, function(i)  # test-set accuracy of each cluster's model at a 0.5 threshold
  table(TS$PositiveDec[CTS==i], Pred[[i]] > 0.5) %>% {sum(diag(.))/sum(.)} )
[1] 0.61941 0.55048 0.64583
What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?
What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?
What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?
To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:
table( do.call(c, split(TS$PositiveDec, CTS)), do.call(c, Pred) > 0.5 ) %>%
  {sum(diag(.))/sum(.)}  # split() returns a list, so do.call(c, ...) concatenates the groups into one vector that table() can use
[1] 0.57887
What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?
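As a cross-check, the overall accuracy should equal the average of the three per-cluster accuracies weighted by the number of test observations in each cluster. The short calculation below uses only the cluster sizes and accuracies printed earlier in this notebook; the variable names are chosen for illustration.

```r
# Test-set cluster sizes (from table(CTS)) and per-cluster accuracies reported above
n_test <- c(1298, 2080, 96)
acc    <- c(0.61941, 0.55048, 0.64583)

# Size-weighted average of the per-cluster accuracies
overall <- sum(n_test * acc) / sum(n_test)

print(round(overall, 5))
```

The result agrees with the combined accuracy of 0.57887. Cluster 2 contributes the most weight, so its lower accuracy pulls the overall figure well below the accuracies of clusters 1 and 3.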
We see a modest improvement over the original logistic regression model: test-set accuracy rises from 0.56707 to 0.57887. Since predicting stock returns is a notoriously hard problem, this is a worthwhile increase in accuracy. By investing only in the stocks with the highest predicted probabilities of a positive return, this cluster-then-predict model can give us an edge over the original logistic regression model.
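The idea of acting only on the most confident predictions can be illustrated by comparing the hit rate among the highest predicted probabilities with the hit rate overall. The sketch below uses simulated probabilities and outcomes (an assumption for illustration; the outcomes are drawn so that the probabilities are informative by construction), so the size of the gap says nothing about the stock models themselves.

```r
set.seed(144)
# Simulated predicted probabilities and outcomes (not model output)
prob   <- runif(1000)
actual <- rbinom(1000, 1, prob)       # outcome is 1 with probability prob

# Keep only the top 10% most confident positive predictions
top <- prob >= quantile(prob, 0.9)

hit_rate_top <- mean(actual[top] == 1)   # share of positives among selected cases
hit_rate_all <- mean(actual == 1)        # share of positives overall

print(c(top = hit_rate_top, all = hit_rate_all))
```

Applied to this notebook, `prob` would be the combined test-set predictions and `actual` the matching true outcomes, kept in the same order.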