
---
title: "AS6-3 預測股票的投資報酬"
author: "GROUP5——施采彣、陳怡安、楊凱倫、唐思琪、凌偉誠"
output: html_notebook
---

<br>

**Topic: Predicting Stock Returns**

**Learning Objectives:**

+ Cluster the data first, then fit a predictive model within each cluster (cluster-then-predict)
+ Building clustering models and using them for prediction


```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
options(digits=5, scipen=12)
library(dplyr)
library(caTools)
library(caret)
library(flexclust)
```
<br>


- - -

### 1. Data Exploration

##### 1.1 
Load StocksCluster.csv into a data frame called "stocks".
```{r}
stocks = read.csv("StocksCluster.csv")
str(stocks)  # 11580 obs. of 12 variables

```
_How many observations are in the dataset?_

+ 11580

##### 1.2 
```{r}
mean(stocks$PositiveDec)

```
_What proportion of the observations have positive returns in December?_

+ 0.54611

##### 1.3
```{r}

cor(stocks[, 1:11])  # pairwise correlations among the January-November return variables

```
_What is the maximum correlation between any two return variables in the dataset?_ You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.

+ ReturnOct and ReturnNov
+ 0.191673 
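
Rather than scanning the matrix by eye, the maximum off-diagonal correlation can also be extracted programmatically. A minimal sketch, assuming `stocks` is loaded as above:

```{r}
# Blank out the trivial self-correlations, then locate the largest entry
cm = cor(stocks[, 1:11])
diag(cm) = NA
max(cm, na.rm = TRUE)                               # 0.191673
which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)  # the ReturnOct/ReturnNov pair
```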


##### 1.4
```{r fig.height=3, fig.width=6.4}

colMeans(stocks[, 1:11])  # column-wise mean of each monthly return

```
_Which month (from January through November) has the largest mean return across all observations in the dataset?_

+ April, 0.0263081

_Which month (from January through November) has the smallest mean return across all observations in the dataset?_

+ September, -0.0147208
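
The extreme months can also be picked out directly with `which.max()` and `which.min()`; a quick sketch:

```{r}
# Names of the columns with the largest and smallest mean return
m = colMeans(stocks[, 1:11])
names(which.max(m))  # "ReturnApr"
names(which.min(m))  # "ReturnSep"
```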

<br>

- - -

### 2. Logistic Regression, Single Model

##### Splitting the Data into Training and Testing Sets
Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:

set.seed(144)

spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain = subset(stocks, spl == TRUE)

stocksTest = subset(stocks, spl == FALSE)

```{r}

library(caTools)
set.seed(144)
spl = sample.split(stocks$PositiveDec , SplitRatio = 0.7)
stocksTrain = subset(stocks , spl == TRUE)
stocksTest = subset(stocks , spl == FALSE)


```
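
Note that `sample.split()` stratifies on the outcome variable, so both sets keep roughly the same proportion of PositiveDec = 1 in addition to the requested 70/30 split. A quick sanity check (a sketch, not part of the assignment):

```{r}
# Verify the split ratio and the outcome balance between the two sets
mean(spl)                                  # ~0.7 of rows marked TRUE (training)
c(train = mean(stocksTrain$PositiveDec),   # outcome proportion in training
  test  = mean(stocksTest$PositiveDec))    # outcome proportion in test
```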

##### 2.1 Single Model: Training Accuracy, $\text{acc}_{train}$
Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don't forget to add the argument family=binomial to your glm command.

```{r}

StocksModel = glm(PositiveDec ~ . , data = stocksTrain , family = binomial)
pred = predict(StocksModel, type = "response")  # fitted probabilities on the training set
table(stocksTrain$PositiveDec, pred > 0.5) %>% {sum(diag(.))/sum(.)}  # training accuracy
```
_What is the overall accuracy on the training set, using a threshold of 0.5?_

+ 0.57118
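
The accuracy idiom `table(...) %>% {sum(diag(.))/sum(.)}` recurs throughout this notebook, so it can help readability to wrap it in a small helper. A minimal sketch (`acc` is a name introduced here, not part of the assignment):

```{r}
# Accuracy from actual labels, predicted probabilities, and a cutoff
acc = function(actual, prob, cutoff = 0.5) {
  cm = table(actual, prob > cutoff)  # 2x2 confusion matrix
  sum(diag(cm)) / sum(cm)            # (TN + TP) / total
}
acc(stocksTrain$PositiveDec, pred)   # reproduces the training accuracy above
```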


##### 2.2 Single Model: Test Accuracy, $\text{acc}_{test}$
```{r}

pred = predict(StocksModel, newdata = stocksTest, type = "response")  # test-set probabilities
table(stocksTest$PositiveDec, pred > 0.5) %>% {sum(diag(.))/sum(.)}   # test accuracy

```
_Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test set, again using a threshold of 0.5?_

+ 0.56707

##### 2.3 Single Model: Baseline Accuracy, $\text{acc}_{baseline}$
```{r}

mean(stocksTest$PositiveDec)  # proportion of PositiveDec = 1, i.e., accuracy of always predicting 1

```
_What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?_

+ 0.54606

<br>

- - -

### 3. Cluster Analysis

##### 3.1 Removing the Target Variable
Now, let's cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:
```{r}

l_train = stocksTrain[,1:11]  # monthly returns only; PositiveDec dropped
l_test = stocksTest[,1:11]    # monthly returns only; PositiveDec dropped


```
_Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?_

+ In cluster-then-predict, cluster assignments must be computable for new observations whose outcome is still unknown. If PositiveDec were used in the clustering, we could not place a new stock into a cluster without already knowing the very value we are trying to predict, so the outcome would leak into the model.

##### 3.2 區隔變數常態化
In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.

In cases where we have a training and testing set, we'll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:
```{r}

library(caret)
preproc = preProcess(l_train)          # means and sds estimated from the training set only
normTrain = predict(preproc, l_train)  # transform the training data
normTest = predict(preproc, l_test)    # transform the test data with the SAME training statistics



```

```{r}

mean(normTrain$ReturnJan)
```
_What is the mean of the ReturnJan variable in normTrain?_

+ 2.1006e-17

```{r}

mean(normTest$ReturnJan)
```
_What is the mean of the ReturnJan variable in normTest?_

+ -0.00041859
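
With its default method, `preProcess()` simply centers and scales by the training-set statistics, which can be verified by hand. A sketch that should print TRUE, assuming the objects above:

```{r}
# Manual equivalent of preProcess(): center/scale l_test by l_train's statistics
mu  = colMeans(l_train)
sdv = apply(l_train, 2, sd)
manualTest = scale(l_test, center = mu, scale = sdv)
all.equal(as.numeric(manualTest[, "ReturnJan"]), normTest$ReturnJan)
```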

##### 3.3 Normalization Results on the Test Set
_Why is the mean ReturnJan variable much closer to 0 in normTrain than in normTest?_

+ Both normTrain and normTest are produced by the same preProcess(l_train) object, so both subtract the training-set means and divide by the training-set standard deviations. normTrain is therefore centered at (numerically) exactly zero, while normTest is only approximately centered: its ReturnJan mean comes out slightly negative because the test-set January returns are on average a bit lower than the training-set January returns.

##### 3.4 K-Means Clustering
Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.
```{r}

set.seed(144)              # set the random seed (k-means depends on its random initialization)
km = kmeans(normTrain, 3)  # k-means clustering of normTrain into 3 clusters

```

```{r}

table(km$cluster)  # size of each cluster

```
_Which cluster has the largest number of observations?_

+ Cluster 2
+ 4696 observations
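
The same counts are stored on the kmeans object itself. Also, since k-means depends on its random start, the `nstart` argument can keep the best of several runs; that is optional here because the assignment fixes the seed. A sketch:

```{r}
km$size                                   # same counts as table(km$cluster)
# km = kmeans(normTrain, 3, nstart = 25)  # optional: best of 25 random starts
```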

##### 3.5
Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):
```{r}
library(flexclust)  # provides kcca objects, which support predict() for cluster assignments
# as.kcca() converts the kmeans result (paired with its training data) into a kcca object
km.kcca = as.kcca(km, normTrain)
clusterTrain = predict(km.kcca)                   # training-set cluster assignments
clusterTest = predict(km.kcca, newdata=normTest)  # test-set cluster assignments

```

```{r}
table(clusterTest)
```
_How many test-set observations were assigned to Cluster 2?_

+ 2080 observations
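
As a sanity check, the kcca training assignments should reproduce the original kmeans assignments (a sketch; expect all TRUE):

```{r}
# The converted kcca object should agree with kmeans on the training data
table(clusterTrain == km$cluster)
```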


<br>

- - -

### 4. Logistic Regression, One Model per Cluster

##### 4.1 Splitting the Data by Cluster
Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

```{r}
# Compact version: tapply() averages the dependent variable within each training cluster
tapply(stocksTrain$PositiveDec , clusterTrain , mean)

# Below: the step-by-step solution as described in the assignment
# split the training set by cluster
stocksTrain1 = subset(stocksTrain , clusterTrain == 1)
stocksTrain2 = subset(stocksTrain , clusterTrain == 2)
stocksTrain3 = subset(stocksTrain , clusterTrain == 3)

# split the test set by cluster
stocksTest1 = subset(stocksTest , clusterTest == 1)
stocksTest2 = subset(stocksTest , clusterTest == 2)
stocksTest3 = subset(stocksTest , clusterTest == 3)

# mean of PositiveDec within each training cluster
mean(stocksTrain1$PositiveDec)
mean(stocksTrain2$PositiveDec)
mean(stocksTrain3$PositiveDec)
```
_Which training set data frame has the highest average value of the dependent variable?_

+ Cluster 1
+ 0.60247

##### 4.2 Cluster Models: Model Coefficients
Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.
```{r}
# Compact version: split() the training set by cluster, then lapply() fits one
# logistic regression per cluster
lapply_train = lapply(split(stocksTrain , clusterTrain) , function(x) glm(PositiveDec ~ . , data = x , family=binomial))
# sapply(): collect the coefficient estimates (column 1 of each summary) side by side
sapply(lapply_train , function(x) coef(summary(x))[,1])

# Below: the step-by-step solution as described in the assignment
# fit one logistic regression per cluster subset
StocksModel1 = glm(PositiveDec ~ . , data = stocksTrain1 , family=binomial)
StocksModel2 = glm(PositiveDec ~ . , data = stocksTrain2 , family=binomial)
StocksModel3 = glm(PositiveDec ~ . , data = stocksTrain3 , family=binomial)
# full model summaries
summary(StocksModel1)
summary(StocksModel2)
summary(StocksModel3)
```

_Which variables have a positive sign for the coefficient in at least one model and a negative sign for the coefficient in at least one model?_ Select all that apply.

+ ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct each have a positive coefficient in at least one model and a negative coefficient in at least one model (see the programmatic check below).
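
The sign pattern can be checked programmatically from the `lapply_train` list built in the compact version above; a sketch:

```{r}
# Which coefficients (excluding the intercept) change sign across the three models?
B = sapply(lapply_train, coef)[-1, ]  # rows = variables, columns = cluster models
rownames(B)[apply(B, 1, function(b) any(b > 0) && any(b < 0))]
```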


##### 4.3 Cluster Models: Per-Cluster Test Accuracy, $\text{acc}_{test}^{1,2,3}$
Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.
```{r}
# Compact version: for each cluster, predict with the matching model on the
# matching slice of the test set
Pred = lapply(1:3, function(x) predict(lapply_train[[x]], stocksTest[clusterTest == x ,], type='response') )
# lapply returns a list (a list can hold objects of any type)
# sapply(): confusion matrix and accuracy for each cluster
sapply(1:3, function(x) table(stocksTest$PositiveDec[clusterTest==x], Pred[[x]] > 0.5) %>% {sum(diag(.))/sum(.)}  )
# sapply simplifies the result (here, to a numeric vector)

# Below: the step-by-step solution as described in the assignment
# predict on each cluster's test subset with the corresponding model
pred1 = predict(StocksModel1 , newdata = stocksTest1 , type = "response")
pred2 = predict(StocksModel2 , newdata = stocksTest2 , type = "response")
pred3 = predict(StocksModel3 , newdata = stocksTest3 , type = "response")
# confusion matrix and accuracy for each cluster
table(stocksTest1$PositiveDec , pred1 > 0.5) %>% {sum(diag(.))/sum(.)}
table(stocksTest2$PositiveDec , pred2 > 0.5) %>% {sum(diag(.))/sum(.)}
table(stocksTest3$PositiveDec , pred3 > 0.5) %>% {sum(diag(.))/sum(.)}

```
_What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?_

+ 0.61941


_What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?_

+ 0.55048


_What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?_

+ 0.64583


##### 4.4 Cluster Models: Overall Test Accuracy, $\text{acc}_{test}^{1+2+3}$
To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:
```{r}
# combine the three prediction vectors into one
AllPredictions = c(pred1, pred2, pred3)
# combine the corresponding true outcomes into one vector, in the same order
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
# overall test-set accuracy of the cluster-then-predict approach
table(AllOutcomes , AllPredictions > 0.5) %>% {sum(diag(.))/sum(.)}
```

_What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?_

+ 0.57887
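
This overall figure is simply the size-weighted average of the three per-cluster accuracies from 4.3, which can be confirmed directly (a sketch):

```{r}
# Overall accuracy = per-cluster accuracies weighted by test-cluster sizes
n = as.numeric(table(clusterTest))  # 1298, 2080, 96
a = c(0.61941, 0.55048, 0.64583)    # per-cluster accuracies from 4.3
sum(a * n) / sum(n)                 # ~0.57887
```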



We see a modest improvement over the original logistic regression model. Since predicting stock returns is a notoriously hard problem, this is a good increase in accuracy. By investing in stocks for which we are more confident that they will have positive returns (by selecting the ones with higher predicted probabilities), this cluster-then-predict model can give us an edge over the original logistic regression model.

<br>

- - -

<br><br><br><br><br>

<style>
.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
  line-height: 1;
}
pre code {
  white-space: inherit;
}
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

.r{
  line-height: 1.2;
}

title{
  color: #cc0000;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
  color: #008800;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h3{
  color: #b36b00;
  background: #ffe0b3;
  line-height: 2;
  font-weight: bold;
}

h5{
  color: #006000;
  background: #ffffe0;
  line-height: 2;
  font-weight: bold;
}

em{
  color: #0000c0;
  background: #f0f0f0;
  }
</style>

