Main topic: Predicting stock returns

Learning objectives:

Group discussion:

rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=5, scipen=12)
library(dplyr)
library(caTools)
package 'caTools' was built under R version 3.4.4
library(caret)
package 'caret' was built under R version 3.4.4
library(flexclust)



1. Data exploration

1.1

Load StocksCluster.csv into a data frame called “stocks”.

A = read.csv('data/StocksCluster.csv')
nrow(A)
[1] 11580

How many observations are in the dataset?

11580

1.2
mean(A$PositiveDec)
[1] 0.54611

What proportion of the observations have positive returns in December?

0.54611

1.3
cor(A[1:11]) %>% sort %>% unique %>% tail %>% round(2)
[1] 0.09 0.13 0.14 0.17 0.19 1.00

What is the maximum correlation between any two return variables in the dataset?
0.19

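1.3 Group discussion:

The value 1 is each variable's correlation with itself, so it is excluded. The largest correlation between two distinct variables, 0.19, is between ReturnNov and ReturnOct.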

1.4
colMeans(A[,1:11]) %>% sort %>% barplot(las=2, cex.names=0.8, cex.axis=0.8)

Which month (from January through November) has the largest mean return across all observations in the dataset?

April

Which month (from January through November) has the smallest mean return across all observations in the dataset?

September



2. Logistic regression, single model

Splitting the data into training and test sets

Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:

set.seed(144)

spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain = subset(stocks, spl == TRUE)

stocksTest = subset(stocks, spl == FALSE)

library(caTools)
set.seed(144)
spl = sample.split(A$PositiveDec,0.7)
TR = subset(A, spl)
TS = subset(A, !spl)
sapply(list(A, TR, TS), function(x) mean(x$PositiveDec))
[1] 0.54611 0.54614 0.54606
2.1 Single model: training accuracy, \(\text{acc}_{train}\)

Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don’t forget to add the argument family=binomial to your glm command.

glm1 = glm(PositiveDec ~ .,  TR, family=binomial)
pred = predict(glm1, type='response')
table(TR$Pos, pred > 0.5) %>% {sum(diag(.))/sum(.)} 
[1] 0.57118

What is the overall accuracy on the training set, using a threshold of 0.5?

0.57118

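2.1 Group discussion:

Accuracy is the number of correct predictions, that is, the confusion-matrix cells (actual 0, predicted FALSE) and (actual 1, predicted TRUE), divided by the total number of observations.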

2.2 Single model: test accuracy, \(\text{acc}_{test}\)
pred = predict(glm1, TS, type='response')
table(TS$Pos, pred > 0.5) %>% {sum(diag(.))/sum(.)} 
[1] 0.56707

Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test, again using a threshold of 0.5?

0.56707

2.3 Single model: baseline accuracy, \(\text{acc}_{baseline}\)
mean(TS$PositiveDec)
[1] 0.54606

What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?

0.54606



3. Cluster analysis

3.1 Removing the target variable

Now, let’s cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:

LTR = TR[,1:11]
LTS = TS[,1:11]

Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?

Needing to know the dependent variable value to assign an observation to a cluster defeats the purpose of the methodology

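3.1 Group discussion:


We use cluster analysis to group similar observations and then predict the unknown variable Y within each group. If Y stayed in the data, it would become one of the clustering features and influence the clusters. Y is unknown for new observations, so they could not be assigned to a cluster, and the clustering would serve no purpose.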

3.2 Normalizing the clustering variables

In the market segmentation assignment in this week’s homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting by the mean and dividing by the standard deviation.

In cases where we have a training and testing set, we’ll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:

library(caret)
preproc = preProcess(LTR)
NTR = predict(preproc, LTR)
NTS = predict(preproc, LTS)
mean(NTR$ReturnJan)
[1] 2.1006e-17

What is the mean of the ReturnJan variable in normTrain?

2.1006e-17

mean(NTS$ReturnJan)
[1] -0.00041859

What is the mean of the ReturnJan variable in normTest?

-0.00041859

3.3 Normalization results on the test data

Why is the mean ReturnJan variable much closer to 0 in normTrain than in normTest?
The distribution of the ReturnJan variable is different in the training and testing set

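3.3 Group discussion:

The observations are distributed slightly differently in the training and test sets, and normalization uses the mean and standard deviation of the training set only. After normalization the training mean is therefore zero up to floating-point error, while the test mean is only close to zero.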

3.4 K-means clustering

Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.

set.seed(144)
km <- kmeans(NTR, 3)
table(km$cluster)

   1    2    3 
3157 4696  253 

Which cluster has the largest number of observations?

Cluster 2

3.5

Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):

library(flexclust)
km.kcca = as.kcca(km, NTR)
Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'
Also defined by 'kernlab'
CTR = predict(km.kcca)
CTS = predict(km.kcca, newdata=NTS)
table(CTS)
CTS
   1    2    3 
1298 2080   96 

How many test-set observations were assigned to Cluster 2?

2080

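3.5 Group discussion

This code turns the clustering result into a model (a kcca object) and then uses that model to assign clusters. CTR holds the cluster assignments of the training observations; CTS assigns each test observation to a cluster using the clusters learned from the training data.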


4. Logistic regression, cluster-specific models

4.1 Splitting the data by cluster assignment

Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

tapply(TR$PositiveDec, CTR, mean)
      1       2       3 
0.60247 0.51405 0.43874 

Which training set data frame has the highest average value of the dependent variable?

The training set data frame for Cluster 1 (stocksTrain1)

4.2 Cluster-specific models: model coefficients

Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

M = lapply(split(TR, CTR), function(x) 
  glm(PositiveDec~., data=x, family=binomial) )
sapply(M, function(x) coef(summary(x))[,1])
                    1        2          3
(Intercept)  0.172240  0.10293 -0.1818958
ReturnJan    0.024984  0.88451 -0.0097893
ReturnFeb   -0.372074  0.31762 -0.0468833
ReturnMar    0.595550 -0.37978  0.6741795
ReturnApr    1.190478  0.49291  1.2814662
ReturnMay    0.304209  0.89655  0.7625116
ReturnJune  -0.011654  1.50088  0.3294339
ReturnJuly   0.197692  0.78315  0.7741644
ReturnAug    0.512729 -0.24486  0.9826054
ReturnSep    0.588327  0.73685  0.3638068
ReturnOct   -1.022535 -0.27756  0.7822421
ReturnNov   -0.748472 -0.78747 -0.8737521

Which variables have a positive sign for the coefficient in at least one model and a negative sign for the coefficient in at least one model? Select all that apply.

ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, ReturnOct

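4.2 Group discussion

M splits TR into three parts according to CTR and fits one logistic regression model to each part.
The sapply call below it runs summary on each of the three models and uses coef(summary(x))[,1] to extract each model's coefficient estimates.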

4.3 Cluster-specific models: test accuracy per cluster, \(\text{acc}_{test}^{1,2,3}\)

Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.

Pred = lapply(1:3, function(i) 
  predict(M[[i]], TS[CTS==i,], type='response') )
sapply(1:3, function(i) 
  table(TS$Pos[CTS==i], Pred[[i]] > 0.5) %>% {sum(diag(.))/sum(.)}  )
[1] 0.61941 0.55048 0.64583

What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?

0.61941

What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?

0.55048

What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?

0.64583

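4.3 Group discussion

Pred splits TS into three parts according to CTS and predicts each part with the corresponding model in M.
The sapply call below it splits the actual outcomes by CTS in the same way, builds a confusion matrix from each part and the matching predictions in Pred, and computes the accuracy.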

4.4 Cluster-specific models: overall test accuracy, \(\text{acc}_{test}^{1+2+3}\)

To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:

table( do.call(c, split(TS$Pos,CTS)), do.call(c, Pred) > 0.5 ) %>%
  {sum(diag(.))/sum(.)}
[1] 0.57887

What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?

0.57887

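4.4 Group discussion

Here do.call(c, Pred) concatenates the three groups of predictions into a single vector. TS$Pos is likewise split into three parts by CTS and then concatenated into a vector in the same cluster order.

Finally, the two vectors are cross-tabulated into a confusion matrix to compute the overall accuracy.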

We see a modest improvement over the original logistic regression model: test accuracy rises from 0.56707 to 0.57887. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in stocks for which we are more confident that they will have positive returns (by selecting the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.








---
title: "AS6-3 Predicting Stock Returns with Cluster-Then-Predict"
author: "Group1 2018/07/30 "
output: html_notebook
---

<br>

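**Main topic: Predicting stock returns**

**Learning objectives:**

+ Cluster first, then build predictive models
+ Models and prediction methods for cluster analysis

**Group discussion:**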
<br> 

+ <a href='#D1.3'>1.3</a>
+ <a href='#D2.1'>2.1</a>
+ <a href='#D3.1'>3.1</a>
+ <a href='#D3.3'>3.3</a>
+ <a href='#D3.5'>3.5</a>
+ <a href='#D4.2'>4.2</a>
+ <a href='#D4.3'>4.3</a>
+ <a href='#D4.4'>4.4</a>


```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
options(digits=5, scipen=12)
library(dplyr)
library(caTools)
library(caret)
library(flexclust)
```
<br>


- - -

### 1. Data exploration

##### 1.1 
Load StocksCluster.csv into a data frame called "stocks".
```{r}
A = read.csv('data/StocksCluster.csv')
nrow(A)
```
_How many observations are in the dataset?_

11580

##### 1.2 
```{r}
mean(A$PositiveDec)
```
_What proportion of the observations have positive returns in December?_

0.54611


##### 1.3
```{r}
cor(A[1:11]) %>% sort %>% unique %>% tail %>% round(2)
```
_What is the maximum correlation between any two return variables in the dataset?_ 
<br>
0.19
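
The sorted-and-deduplicated correlations give the maximum value but not which pair produced it. A minimal sketch of one way to recover the pair, using a small synthetic data frame (the columns `a`, `b`, `c` are made up for illustration and are not part of StocksCluster.csv):

```r
# Synthetic stand-in for the monthly return columns (names are illustrative).
set.seed(1)
X <- data.frame(a = rnorm(200))
X$b <- X$a + rnorm(200, sd = 0.5)   # built to correlate strongly with a
X$c <- rnorm(200)                   # independent noise

cm <- cor(X)
diag(cm) <- NA                      # drop each variable's correlation with itself
max_cor <- max(cm, na.rm = TRUE)    # largest correlation between distinct variables
best <- which(cm == max_cor, arr.ind = TRUE)[1, ]
pair <- rownames(cm)[best]          # names of the most correlated pair
print(pair)
print(round(max_cor, 2))
```

Applied to the real data, the same steps would presumably be run on `cor(A[1:11])`.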


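###### <span id='D1.3'>1.3 Group discussion: </span>
The value 1 is each variable's correlation with itself, so it is excluded. The largest correlation between two distinct variables, 0.19, is between ReturnNov and ReturnOct.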


##### 1.4
```{r fig.height=3, fig.width=6.4}
colMeans(A[,1:11]) %>% sort %>% barplot(las=2, cex.names=0.8, cex.axis=0.8)
```
_Which month (from January through November) has the largest mean return across all observations in the dataset?_

April


_Which month (from January through November) has the smallest mean return across all observations in the dataset?_

September


<br>

- - -

### 2. Logistic regression, single model

##### Splitting the data into training and test sets
Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:

set.seed(144)

spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain = subset(stocks, spl == TRUE)

stocksTest = subset(stocks, spl == FALSE)

```{r}
library(caTools)
set.seed(144)
spl = sample.split(A$PositiveDec,0.7)
TR = subset(A, spl)
TS = subset(A, !spl)
sapply(list(A, TR, TS), function(x) mean(x$PositiveDec))
```

##### 2.1 Single model: training accuracy, $\text{acc}_{train}$
Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don't forget to add the argument family=binomial to your glm command.

```{r}
glm1 = glm(PositiveDec ~ .,  TR, family=binomial)
pred = predict(glm1, type='response')
table(TR$Pos, pred > 0.5) %>% {sum(diag(.))/sum(.)} 
```
_What is the overall accuracy on the training set, using a threshold of 0.5?_

0.57118


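###### <span id='D2.1'>2.1 Group discussion:</span>
Accuracy is the number of correct predictions, that is, the confusion-matrix cells (actual 0, predicted FALSE) and (actual 1, predicted TRUE), divided by the total number of observations.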
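
As a small worked example of this calculation, the sketch below uses hypothetical labels and predicted probabilities (not taken from the stock data) and computes the accuracy two ways: from the diagonal of the confusion matrix, and directly as the share of matching predictions.

```r
# Hypothetical labels and predicted probabilities, for illustration only.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

cm <- table(actual, prob > 0.5)        # rows: actual 0/1, columns: FALSE/TRUE
acc_diag <- sum(diag(cm)) / sum(cm)    # correct predictions / all predictions

# Same quantity without building the table; this form also works when
# every prediction falls on one side of the threshold.
acc_mean <- mean((prob > 0.5) == (actual == 1))
print(c(acc_diag, acc_mean))
```

The `diag()` form relies on the table being 2 x 2. If no prediction exceeded the threshold, the table would have a single column, so the `mean()` form is the more robust of the two.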

##### 2.2 Single model: test accuracy, $\text{acc}_{test}$
```{r}
pred = predict(glm1, TS, type='response')
table(TS$Pos, pred > 0.5) %>% {sum(diag(.))/sum(.)} 
```
_Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test, again using a threshold of 0.5?_

0.56707


##### 2.3 Single model: baseline accuracy, $\text{acc}_{baseline}$
```{r}
mean(TS$PositiveDec)
```
_What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?_

0.54606
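
Taking the mean of a 0/1 outcome gives the baseline accuracy only when the positive class is the majority, as it is here. A minimal sketch with a hypothetical outcome vector shows a form that holds in either case:

```r
y <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)   # hypothetical 0/1 outcomes
p <- mean(y)                           # share of positives
baseline_acc <- max(p, 1 - p)          # accuracy of always predicting the majority class
print(baseline_acc)
```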


<br>

- - -

### 3. Cluster analysis

##### 3.1 Removing the target variable
Now, let's cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:
```{r}
LTR = TR[,1:11]
LTS = TS[,1:11]
```
_Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?_

Needing to know the dependent variable value to assign an observation to a cluster defeats the purpose of the methodology

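###### <span id='D3.1'>3.1 Group discussion:</span>
<br>
We use cluster analysis to group similar observations and then predict the unknown variable Y within each group. If Y stayed in the data, it would become one of the clustering features and influence the clusters. Y is unknown for new observations, so they could not be assigned to a cluster, and the clustering would serve no purpose.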

##### 3.2 Normalizing the clustering variables
In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting by the mean and dividing by the standard deviation.

In cases where we have a training and testing set, we'll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:
```{r}
library(caret)
preproc = preProcess(LTR)
NTR = predict(preproc, LTR)
NTS = predict(preproc, LTS)
```

```{r}
mean(NTR$ReturnJan)
```
_What is the mean of the ReturnJan variable in normTrain?_

2.1006e-17


```{r}
mean(NTS$ReturnJan)
```
_What is the mean of the ReturnJan variable in normTest?_

-0.00041859


##### 3.3 Normalization results on the test data
_Why is the mean ReturnJan variable much closer to 0 in normTrain than in normTest?_
<br>
The distribution of the ReturnJan variable is different in the training and testing set

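###### <span id='D3.3'>3.3 Group discussion:</span>
The observations are distributed slightly differently in the training and test sets, and normalization uses the mean and standard deviation of the training set only. After normalization the training mean is therefore zero up to floating-point error, while the test mean is only close to zero.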
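
The effect can be reproduced by hand. The sketch below centers and scales a synthetic variable with base R, using statistics from the training part only; the data and the names `train`, `test`, `mu`, and `sigma` are illustrative and not from the assignment.

```r
set.seed(1)
train <- data.frame(x = rnorm(700, mean = 0.05, sd = 0.2))
test  <- data.frame(x = rnorm(300, mean = 0.05, sd = 0.2))

mu    <- mean(train$x)                 # statistics come from the training set only
sigma <- sd(train$x)

norm_train <- (train$x - mu) / sigma
norm_test  <- (test$x  - mu) / sigma   # test set reuses the training mean and sd

print(mean(norm_train))   # zero up to floating-point error
print(mean(norm_test))    # small, but generally not zero
```

Normalizing the test set with its own mean and standard deviation would make its mean zero too, but the two sets would then be on different scales, which defeats the purpose of a shared transformation.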


##### 3.4 K-means clustering
Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.
```{r}
set.seed(144)
km <- kmeans(NTR, 3)
```

```{r}
table(km$cluster)
```
_Which cluster has the largest number of observations?_

Cluster 2

##### 3.5
Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):
```{r}
library(flexclust)
km.kcca = as.kcca(km, NTR)
CTR = predict(km.kcca)
CTS = predict(km.kcca, newdata=NTS)
```

```{r}
table(CTS)
```
_How many test-set observations were assigned to Cluster 2?_

2080

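###### <span id='D3.5'>3.5 Group discussion</span>
This code turns the clustering result into a model (a kcca object) and then uses that model to assign clusters. CTR holds the cluster assignments of the training observations; CTS assigns each test observation to a cluster using the clusters learned from the training data.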
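
Conceptually, assigning a new observation to a k-means cluster means picking the nearest training centroid. The sketch below does this with base R only, on synthetic data; `nearest_centroid` is a hypothetical helper written for this illustration, and treating it as equivalent to what flexclust does internally is an assumption.

```r
set.seed(1)
# Two well-separated synthetic groups standing in for normalized returns.
train <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
               matrix(rnorm(100, mean = 10), ncol = 2))
test  <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
               matrix(rnorm(20, mean = 10), ncol = 2))

km <- kmeans(train, centers = 2, nstart = 5)

# Hypothetical helper: index of the closest centroid for each row of x.
nearest_centroid <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)), function(k)
    rowSums(sweep(x, 2, centers[k, ])^2))   # squared Euclidean distance
  max.col(-d, ties.method = "first")        # column with the smallest distance
}

train_cluster <- nearest_centroid(train, km$centers)
test_cluster  <- nearest_centroid(test,  km$centers)
print(table(test_cluster))
```

The centroids come from the training data alone, so the test observations never influence where the clusters lie.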

- - -

### 4. Logistic regression, cluster-specific models

##### 4.1 Splitting the data by cluster assignment
Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

```{r}
tapply(TR$PositiveDec, CTR, mean)
```
_Which training set data frame has the highest average value of the dependent variable?_

The training set data frame for Cluster 1 (stocksTrain1)


##### 4.2 Cluster-specific models: model coefficients
Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.
```{r}
M = lapply(split(TR, CTR), function(x) 
  glm(PositiveDec~., data=x, family=binomial) )
sapply(M, function(x) coef(summary(x))[,1])
```

_Which variables have a positive sign for the coefficient in at least one model and a negative sign for the coefficient in at least one model?_ Select all that apply.

ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, ReturnOct

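###### <span id='D4.2'>4.2 Group discussion</span>
M splits TR into three parts according to CTR and fits one logistic regression model to each part.
<br>
The sapply call below it runs summary on each of the three models and uses coef(summary(x))[,1] to extract each model's coefficient estimates.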
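
The sign comparison can be checked mechanically instead of by eye. The sketch below copies the coefficient estimates from this section's model output (rounded to three decimals, intercepts omitted) into a matrix and flags the variables whose sign differs across models; the names `coefs` and `mixed` are chosen for this illustration.

```r
# Coefficient estimates copied from the three cluster models in this section
# (rounded to three decimals; intercepts omitted).
coefs <- rbind(
  ReturnJan  = c( 0.025,  0.885, -0.010),
  ReturnFeb  = c(-0.372,  0.318, -0.047),
  ReturnMar  = c( 0.596, -0.380,  0.674),
  ReturnApr  = c( 1.190,  0.493,  1.281),
  ReturnMay  = c( 0.304,  0.897,  0.763),
  ReturnJune = c(-0.012,  1.501,  0.329),
  ReturnJuly = c( 0.198,  0.783,  0.774),
  ReturnAug  = c( 0.513, -0.245,  0.983),
  ReturnSep  = c( 0.588,  0.737,  0.364),
  ReturnOct  = c(-1.023, -0.278,  0.782),
  ReturnNov  = c(-0.748, -0.787, -0.874)
)
colnames(coefs) <- c("1", "2", "3")

# TRUE when a variable is positive in at least one model and negative in another.
mixed <- apply(coefs, 1, function(r) any(r > 0) && any(r < 0))
print(names(mixed)[mixed])
```

With the fitted models available, the same check could presumably be applied to `sapply(M, coef)` after dropping the intercept row.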


##### 4.3 Cluster-specific models: test accuracy per cluster, $\text{acc}_{test}^{1,2,3}$
Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.
```{r}
Pred = lapply(1:3, function(i) 
  predict(M[[i]], TS[CTS==i,], type='response') )
sapply(1:3, function(i) 
  table(TS$Pos[CTS==i], Pred[[i]] > 0.5) %>% {sum(diag(.))/sum(.)}  )
```
_What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?_

0.61941


_What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?_

0.55048


_What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?_

0.64583


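###### <span id='D4.3'>4.3 Group discussion</span>
Pred splits TS into three parts according to CTS and predicts each part with the corresponding model in M.
<br>
The sapply call below it splits the actual outcomes by CTS in the same way, builds a confusion matrix from each part and the matching predictions in Pred, and computes the accuracy.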
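
The key requirement is that each model is scored on the test rows of its own cluster. A minimal sketch of that pairing on synthetic data follows; the grouping column `g`, the predictor `x`, and the other names are invented for the example and do not come from the stock data.

```r
set.seed(1)
n <- 300
d <- data.frame(x = rnorm(n), g = sample(1:3, n, replace = TRUE))
# The sign of the effect of x depends on the group.
d$y <- rbinom(n, 1, plogis(ifelse(d$g == 2, -1, 1) * d$x))
tr <- d[1:200, ]
ts <- d[201:300, ]

# One logistic regression per group, fitted on the training rows of that group.
models <- lapply(split(tr, tr$g), function(part)
  glm(y ~ x, data = part, family = binomial))
parts <- split(ts, ts$g)               # test rows, in the same group order

# Score each model on the test rows of its own group.
acc <- mapply(function(m, part)
  mean((predict(m, part, type = "response") > 0.5) == (part$y == 1)),
  models, parts)
print(round(acc, 3))
```

Because both lists come from `split()` on the same grouping values, their names line up, which makes a mismatch between model and test subset easy to detect.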



##### 4.4 Cluster-specific models: overall test accuracy, $\text{acc}_{test}^{1+2+3}$
To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:
```{r}
table( do.call(c, split(TS$Pos,CTS)), do.call(c, Pred) > 0.5 ) %>%
  {sum(diag(.))/sum(.)}
```

_What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?_

0.57887


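###### <span id='D4.4'>4.4 Group discussion</span>
Here do.call(c, Pred) concatenates the three groups of predictions into a single vector. TS$Pos is likewise split into three parts by CTS and then concatenated into a vector in the same cluster order.

Finally, the two vectors are cross-tabulated into a confusion matrix to compute the overall accuracy.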
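
As a cross-check, the overall accuracy should equal the average of the per-cluster accuracies weighted by cluster size. The sketch below uses the test-set cluster sizes and per-cluster accuracies reported in this document's output; the variable names are chosen for the illustration.

```r
n   <- c(1298, 2080, 96)              # test-set cluster sizes from table(CTS)
acc <- c(0.61941, 0.55048, 0.64583)   # per-cluster test accuracies from 4.3

overall <- sum(n * acc) / sum(n)      # size-weighted average accuracy
print(round(overall, 5))
```

Cluster 3 has the highest accuracy but only 96 test observations, so it contributes little to the overall figure, while cluster 2 dominates it.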












We see a modest improvement over the original logistic regression model: test accuracy rises from 0.56707 to 0.57887. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in stocks for which we are more confident that they will have positive returns (by selecting the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.

<br>

- - -

<br><br><br><br><br>

<style>
.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
  line-height: 1;
}
pre code {
  white-space: inherit;
}
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

.r{
  line-height: 1.2;
}

title{
  color: #cc0000;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
  color: #008800;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h3{
  color: #b36b00;
  background: #ffe0b3;
  line-height: 2;
  font-weight: bold;
}

h5{
  color: #006000;
  background: #ffffe0;
  line-height: 2;
  font-weight: bold;
}

h6{
  color: #006000;
  background: #00ffff;
  line-height: 2;
  font-weight: bold;
}

em{
  color: #0000c0;
  background: #f0f0f0;
  }
</style>

