目錄

1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Ensambling


1. 邏輯式迴歸分析(Logistic Regression)


Figure 1 - The Flowcart

Figure 1 - The Flowcart



1.1 Prepare the Dataset


【檢查缺項】

summary(X)
is.na(X$a)


【分析缺項:移除 or 填補】

missing_data = subset(is.na(x$a|x$c|x$e)) 
#X裡具至少一個以上NA的子項目(a,c,e)挑出來
#不僅是缺項單位格,該缺項所在一整列資料(row)
summary(missing_data) #包含原先X的所有自變數(a,b,c,d,e)欄位 #即沒有NA值的欄位(b,d)也在
nrow(missing_data)
table(missing$Y) #找出應變數Y,檢查要移除還是要填補NA項
table中是NA且Y=1的值/全部有幾個NA


【補缺項工具】

library(mice)
set.seed(144) #Note that the imputation results are not the same even if you set the random seed.
vars.for.imputation = setdiff(names(X), "Y") #設定vars.for.imputation來移除Y
imputed = complete(mice(X[vars.for.imputation])) #只用所有自變數來估算NA值


Imputation predicts missing variable values for a given observation using the variable values that are reported. We called the imputation on a data frame with the dependent Y removed, so we predicted the missing values using only other independent variables.


1.2 Random Split

library(caTools)
set.seed()
sample.split(X$Y, SplitRatio=)
TR = subset(X, spl==TRUE)
TS = subset(X, spl==FALSE)


Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem).


【隨機種子的功用】

  • Different seed -> get different splits.
  • Different splitRatio -> get different splits.
  • Re-ran all steps -> get the same result: set a random seed, split, set the seed again to the same value, and then split again, you will get the same split.
  • Re-ran step3~5 -> get different result: set the seed and then split twice, you will get different splits.


【從迴歸係數估計邊際效用】

  • ln(oddA) = ln(oddB) + 1.61
  • oddA = exp(1.61) * oddB
  • Rule: e^(a+b) = e^a * e^b.

Q: Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710. What is the value of Logit(A) - Logit(B)? What is the value of O(A)/O(B)?

# the difference of logits
logitA = -0.00931679*700 #係數相同,x值不同
logitB = -0.00931679*710
logitA - logitB #0.09317
# the ratio of odds
oddA = exp(logitA) #法1
oddB = exp(logitB)
oddA/oddB #1.098
exp(logitA - logitB) or exp(0.09317) #法2

1.3 Build the Model

modelLog = glm(Y~ x1+x2+..., data=TR, family=binomial)
pred = predict(modelLog, data=, type="response") #response:回傳機率


【訓練模型的AUC】

library(ROCR)
modelTR = glm(Y~ x1+x2+..., data=TR, family=binomial)
PredTR = predict(modelTR, type = "response")
PredROCR = prediction(PredTR, TR$Y)
PerfROCR = performance(PredROCR, "tpr","fpr") #為了畫AUC圖
par(cex=0.8)
plot(PerfROCR, colorize=TRUE, print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,1,7)) #AUC圖 #text.adj加了美觀,讓ROC線上的值右移一點,不重疊在線上
as.numeric(performance(PredROCR, "auc")@y.values) #AUC實際值


【預測模型的AUC】

library(ROCR)
modelTR = glm(Y~ x1+x2+..., data=TR, family=binomial)
predTS = predict(modelTR, newdata=TS, type="response")
predROCR = prediction(predTS, TS$Y)
as.numeric(performance(predROCR, "auc")@y.values) #AUC實際值


【Confusion Matrix】
※ 只有拿訓練模型去適應Testing Data(即預測模型)後才會做

table(TS$Y, predTS >= 0.5) #threshold=0.5 

1.4 Verify the Model


【準確性:訓練模型 V.S. Baseline】
※ ACC從Confusion Matrix得來

predTS = predict(modelTR, newdata=TS, type="response")
table(TS$Y, predTS >= 0.45) #Confusion Matrix(由pred和TS data得來)
ACC = (TN+TP)/(TN+FP+FN+TP)
#baseline ACC
table(TS$Y)
ACC = (TN+TP)/(TN+FP+FN+TP)


【敏感性 & 明確性】

#Using the values from Confusion Matrix: table(TS$Y, predTS >= 0.45)
sensitivity = TP/(TP+FN)
specificity = TN/(TN+FP)


在不同的threshold條件下,會產生出不同Confusion Matrix結果;結果中包含許多資訊,如:準確率(Accuracy)、辨識率(AUC)、敏滿性(Sensitivity)與明確性(Specificity)。
依照所欲達到的目的,或是一件事情發生結果的嚴重性,都可以有不同的決策結果,故即使模型不能夠大幅度增加ACC,但能夠以不同的角度來解釋問題並且對於決策有所助益,該模型就是一個好的模型。


1.5 Make the Decision

risk=-100
action=-60
cost=-10
ExpPayoff = TN*0 + TP*action + FN*risk + Total*cost
Figure 2 - The Excepted Payoff

Figure 2 - The Excepted Payoff






2. 決策樹分析(Decision Tree)


2.1 Build the Model

library(rpart)
cart -> rpart(Y~ x1+x2+..., data=TR, method="class", minbucket=25)  #class:讓結果輸出為類別 #prob:讓結果輸出為機率,都不寫也是機率(做regression時用) #minbucket:數小樹大;數大樹小
prp(cart) #看樹長什麼樣,簡化版
library(rpart.plot)
rpart.plot(caret, cex=0.6) #看樹長什麼樣,進階版


【預測模型的ACC】

predCart = predict(cart, newdata=TS, type = "class")
table(predCart, TS$Y)
ACC = (TN+TP)/(TN+FP+FN+TP)


【預測模型的AUC】

library(ROCR)
predCart = predict(cart, newdata=TS) #把type拿掉改顯示成機率
predCart #看一下此時呈現的機率會分成Y=0及Y=1兩行
predCart = predCart[,2] #我們只需要Y=1的機率, 將Y=0得捨去
predROCR = prediction(predCart, TS$Y)
predROCR = prediction(predCart[,2], TS$Y) #法2(簡化版)
plot(perf)
as.numeric(performance(predROCR, "auc")@y.values)

2.2 Cross Validation & Parameter Tuning


Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds.

library(caret)
library(e1071)
numFolds = trainControl(method="cv", number=10) #10 folds cv
cpGrid = expand.grid(.cp=seq(0.01,0.5,0.01)) #parameter
cp = train(Y~., data=TR, method="rpart",
            trControl=numFolds,tuneGrid=cpGrid)
cp #selecting cp by cross-validation

2.3 The Final Model

modelCV = rpart(Y~., data=TR, method="class", cp=0.18) #cp從上式得來
predCV = predict(modelCV, newdata=TS, type="class")
table(TS$Y, predCV)
ACC = (TN+TP)/(TN+FP+FN+TP)
prp(modelCV)


Compared to the original accuracy using the default value of cp, this new CART model is an improvement, and so we should clearly favor this new model over the old one – or should we? Plot the CART tree for this model. How many splits are there?

  • In some applications, such an improvement in accuracy would be worth the loss in interpretability.
  • In others, we may prefer a less accurate model that is simpler to understand and describe over a more accurate (but more complicated) model.






3. 隨機森林分析(Random Forest)


3.1 Build the Model

※ 通常順序:建立邏輯式迴歸模型→ CART模型→ 隨機森林模型→ 交叉驗證(CART2)→ 比較上述模型並選擇其中之一。

library(randomForest)
Forest = randomForest(Y~ x1+x2+..., data = TR, nodesize=25, ntree= 200)
predForest = predict(Forest, newdata = TS)
table(TS$Y,predForest)
ACC = (TN+TP)/(TN+FP+FN+TP)


【AUC和ACC精簡公式】

colAUC(pred, TS$a)
table(TS$a, pred > 0.5) %>% {sum(diag(.))/sum(.)} 






4. 組合模型(Ensambling)


4.1 Compare ACC

px = cbind(glm=predglm, cart1=pred1, cart2=pred2, cart3=pred3, cart4 = pred4, cart5 = pred5, cart6 = pred6, cart7 = pred7, rf = PredictForest)
apply(px, 2, function(x) {
  table(TS$buy, x > 0.5) %>% {sum(diag(.))/sum(.)} 
  }) %>% sort


4.2 Compare AUC

par(cex=0.7,mar=c(6,6,5,3))
colAUC(px, TS$buy, T)


4.3 Correlation & Ensambling

  • 兩個相關性高的模型不見得是最好的組合
  • 兩個極端相反的模型組合起來反而能幫忙解釋更多不同的角度
cor(px)
glm_cart = (px[,"glm"] + px[,"cart6"])/2
glm_rf = (px[,"glm"] + px[,"rf"])/2


4.4 ACC & AUC Table

px2 = cbind(px, glm_cart, glm_rf)
rbind(apply(px2, 2, function(x) {
        table(TS$buy, x > 0.5) %>% {sum(diag(.))/sum(.)} }),
      colAUC(px2, TS$buy)) %>% t %>% 
      data.frame %>% setNames(c("Accuracy", "AUC"))        
px3 = cbind(lm=RSquarelm, cart9=RSquarerp)
rbind(px3) %>% data.frame %>% setNames(c("lm","cart"))
---
title: "商業數據分析 模型製作筆記"
author: "唐思琪 B041010004"
output: html_notebook
---
<br>

### <b>目錄</b>

[1. Logistic Regression](#n1) <br>
[2. Decision Tree](#n2) <br>
[3. Random Forest](#n3) <br>
[4. Ensambling](#n4) <br>

- - -

### <a id="n1"></a> <b>1. 邏輯式迴歸分析(Logistic Regression)</b>
<br>

![Figure 1 - The Flowcart](solution.jpg)

<br><br>

#### <b>1.1 Prepare the Dataset</b>
<br>
【檢查缺項】
```{r}
summary(X)
is.na(X$a)
```
<br>
【分析缺項：移除 or 填補】
```{r}
missing_data = subset(is.na(x$a|x$c|x$e)) 
#X裡具至少一個以上NA的子項目(a,c,e)挑出來
#不僅是缺項單位格,該缺項所在一整列資料(row)
summary(missing_data) #包含原先X的所有自變數(a,b,c,d,e)欄位 #即沒有NA值的欄位(b,d)也在
nrow(missing_data)
table(missing$Y) #找出應變數Y,檢查要移除還是要填補NA項
table中是NA且Y=1的值/全部有幾個NA
```
<br>
【補缺項工具】
```{r}
library(mice)
set.seed(144) #Note that the imputation results are not the same even if you set the random seed.
vars.for.imputation = setdiff(names(X), "Y") #設定vars.for.imputation來移除Y
imputed = complete(mice(X[vars.for.imputation])) #只用所有自變數來估算NA值
```
<br>
Imputation predicts missing variable values for a given observation using the variable values that are reported. We called the imputation on a data frame with the dependent Y removed, so we predicted the missing values using only other independent variables.

- - -

#### <b>1.2 Random Split</b>
```{r}
library(caTools)
set.seed()
sample.split(X$Y, SplitRatio=)
TR = subset(X, spl==TRUE)
TS = subset(X, spl==FALSE)
```
<br>
Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the random seed to 144 (even though you already did so earlier in the problem).

<br>
【隨機種子的功用】

+ Different seed -> get different splits.
+ Different splitRatio -> get different splits.
+ Re-ran all steps -> get the same result: set a random seed, split, set the seed again to the same value, and then split again, you will get the same split.
+ Re-ran step3~5 -> get different result: set the seed and then split twice, you will get different splits.

<br>
【從迴歸係數估計邊際效用】

+ ln(oddA) = ln(oddB) + 1.61
+ oddA = exp(1.61) * oddB
+ Rule: e^(a+b) = e^a * e^b.

Q: Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710. What is the value of Logit(A) - Logit(B)? What is the value of O(A)/O(B)?
```{r}
# the difference of logits
logitA = -0.00931679*700 #係數相同,x值不同
logitB = -0.00931679*710
logitA - logitB #0.09317
# the ratio of odds
oddA = exp(logitA) #法1
oddB = exp(logitB)
oddA/oddB #1.098
exp(logitA - logitB) or exp(0.09317) #法2
```

- - -

#### <b>1.3 Build the Model</b>

```{r}
modelLog = glm(Y~ x1+x2+..., data=TR, family=binomial)
pred = predict(modelLog, data=, type="response") #response:回傳機率
```
<br>
【訓練模型的AUC】
```{r}
library(ROCR)
modelTR = glm(Y~ x1+x2+..., data=TR, family=binomial)
PredTR = predict(modelTR, type = "response")
PredROCR = prediction(PredTR, TR$Y)
PerfROCR = performance(PredROCR, "tpr","fpr") #為了畫AUC圖
par(cex=0.8)
plot(PerfROCR, colorize=TRUE, print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,1,7)) #AUC圖 #text.adj加了美觀,讓ROC線上的值右移一點,不重疊在線上
as.numeric(performance(PredROCR, "auc")@y.values) #AUC實際值
```
<br>
【預測模型的AUC】
```{r}
library(ROCR)
modelTR = glm(Y~ x1+x2+..., data=TR, family=binomial)
predTS = predict(modelTR, newdata=TS, type="response")
predROCR = prediction(predTS, TS$Y)
as.numeric(performance(predROCR, "auc")@y.values) #AUC實際值
```
<br>
【Confusion Matrix】<br>
※ 只有拿訓練模型去適應Testing Data（即預測模型）後才會做
```{r}
table(TS$Y, predTS >= 0.5) #threshold=0.5 
```

- - -

#### <b>1.4 Verify the Model</b>
<br>
【準確性:訓練模型 V.S. Baseline】<br>
※ ACC從Confusion Matrix得來<br>
```{r}
predTS = predict(modelTR, newdata=TS, type="response")
table(TS$Y, predTS >= 0.45) #Confusion Matrix(由pred和TS data得來)
ACC = (TN+TP)/(TN+FP+FN+TP)
#baseline ACC
table(TS$Y)
ACC = (TN+TP)/(TN+FP+FN+TP)
```
<br>
【敏感性 & 明確性】
```{r}
#Using the values from Confusion Matrix: table(TS$Y, predTS >= 0.45)
sensitivity = TP/(TP+FN)
specificity = TN/(TN+FP)
```
<br>
<font face="微軟正黑體">
在不同的threshold條件下，會產生出不同Confusion Matrix結果；結果中包含許多資訊，如：準確率（Accuracy）、辨識率（AUC）、敏滿性（Sensitivity）與明確性（Specificity）。<br>
依照所欲達到的目的，或是一件事情發生結果的嚴重性，都可以有不同的決策結果，故即使模型不能夠大幅度增加ACC,但能夠以不同的角度來解釋問題並且對於決策有所助益，該模型就是一個好的模型。
</font>

- - -

#### <b>1.5 Make the Decision</b>

```{r}
risk=-100
action=-60
cost=-10
ExpPayoff = TN*0 + TP*action + FN*risk + Total*cost
```
![Figure 2 - The Excepted Payoff](payoff.jpg)

<br><br>

- - -

<br><br>


### <a id="n2"></a> <b>2. 決策樹分析(Decision Tree)</b>

<br>

#### <b>2.1 Build the Model</b>
```{r}
library(rpart)
cart -> rpart(Y~ x1+x2+..., data=TR, method="class", minbucket=25)  #class:讓結果輸出為類別 #prob:讓結果輸出為機率,都不寫也是機率(做regression時用) #minbucket:數小樹大;數大樹小
prp(cart) #看樹長什麼樣,簡化版
library(rpart.plot)
rpart.plot(caret, cex=0.6) #看樹長什麼樣,進階版
```
<br>
【預測模型的ACC】
```{r}
predCart = predict(cart, newdata=TS, type = "class")
table(predCart, TS$Y)
ACC = (TN+TP)/(TN+FP+FN+TP)
```
<br>
【預測模型的AUC】
```{r}
library(ROCR)
predCart = predict(cart, newdata=TS) #把type拿掉改顯示成機率
predCart #看一下此時呈現的機率會分成Y=0及Y=1兩行
predCart = predCart[,2] #我們只需要Y=1的機率, 將Y=0得捨去
predROCR = prediction(predCart, TS$Y)
predROCR = prediction(predCart[,2], TS$Y) #法2(簡化版)
plot(perf)
as.numeric(performance(predROCR, "auc")@y.values)
```

- - -

#### <b>2.2 Cross Validation & Parameter Tuning</b>
<br>
Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds.
```{r}
library(caret)
library(e1071)
numFolds = trainControl(method="cv", number=10) #10 folds cv
cpGrid = expand.grid(.cp=seq(0.01,0.5,0.01)) #parameter
cp = train(Y~., data=TR, method="rpart",
            trControl=numFolds,tuneGrid=cpGrid)
cp #selecting cp by cross-validation
```

- - -

#### <b>2.3 The Final Model</b>

```{r}
modelCV = rpart(Y~., data=TR, method="class", cp=0.18) #cp從上式得來
predCV = predict(modelCV, newdata=TS, type="class")
table(TS$Y, predCV)
ACC = (TN+TP)/(TN+FP+FN+TP)
prp(modelCV)
```
<br>
<font face="微軟正黑體">
Compared to the original accuracy using the default value of cp, this new CART model is an improvement, and so we should clearly favor this new model over the old one -- or should we? Plot the CART tree for this model. How many splits are there?

+ In some applications, such an improvement in accuracy would be worth the loss in interpretability.
+ In others, we may prefer a less accurate model that is simpler to understand and describe over a more accurate (but more complicated) model.
</font>

<br><br>

- - -

<br><br>

### <a id="n3"></a> <b>3. 隨機森林分析(Random Forest)</b>

<br>

#### <b>3.1 Build the Model</b>
※ 通常順序：建立邏輯式迴歸模型→ CART模型→ 隨機森林模型→ 交叉驗證（CART2）→ 比較上述模型並選擇其中之一。
```{r}
library(randomForest)
Forest = randomForest(Y~ x1+x2+..., data = TR, nodesize=25, ntree= 200)
predForest = predict(Forest, newdata = TS)
table(TS$Y,predForest)
ACC = (TN+TP)/(TN+FP+FN+TP)
```
<br>
【AUC和ACC精簡公式】
```{r}
colAUC(pred, TS$a)
table(TS$a, pred > 0.5) %>% {sum(diag(.))/sum(.)} 
```

<br><br>

- - -

<br><br>

### <a id="n4"></a> <b>4. 組合模型(Ensambling)</b>

<br>

#### <b>4.1 Compare ACC</b>

```{r}
px = cbind(glm=predglm, cart1=pred1, cart2=pred2, cart3=pred3, cart4 = pred4, cart5 = pred5, cart6 = pred6, cart7 = pred7, rf = PredictForest)
apply(px, 2, function(x) {
  table(TS$buy, x > 0.5) %>% {sum(diag(.))/sum(.)} 
  }) %>% sort
```


<br>

#### <b>4.2 Compare AUC</b>

```{r}
par(cex=0.7,mar=c(6,6,5,3))
colAUC(px, TS$buy, T)
```
<br>

#### <b>4.3 Correlation & Ensambling</b>

+ 兩個相關性高的模型不見得是最好的組合
+ 兩個極端相反的模型組合起來反而能幫忙解釋更多不同的角度

```{r}
cor(px)
glm_cart = (px[,"glm"] + px[,"cart6"])/2
glm_rf = (px[,"glm"] + px[,"rf"])/2
```
<br>

#### <b>4.4 ACC & AUC Table</b>

```{r}
px2 = cbind(px, glm_cart, glm_rf)
rbind(apply(px2, 2, function(x) {
        table(TS$buy, x > 0.5) %>% {sum(diag(.))/sum(.)} }),
      colAUC(px2, TS$buy)) %>% t %>% 
      data.frame %>% setNames(c("Accuracy", "AUC"))        
px3 = cbind(lm=RSquarelm, cart9=RSquarerp)
rbind(px3) %>% data.frame %>% setNames(c("lm","cart"))
```
