Introduction
議題:用申請假釋者的屬性,預測他會不會違反假釋規定
學習重點:
- 設定隨機種子set.seed(),依比例分割資料
- 從邏輯式回歸的係數推算自變數的邊際效果
- 勝率和機率、勝率比和機率差 之間的換算
- 臨界機率對混淆矩陣(期望報酬)的影響payoff = matrix(c(0,-100,-10,-50),2,2); payoff
- AUC的實質意義payoff = matrix(c(100, -80, -20, 100),2,2); payoff
- 如何(從報酬矩陣)決定臨界機率
- 什麼是抽樣偏差,如何避免、如何修正
1 資料處理 Loading the Dataset
【1.1 讀進資料】How many parolees are contained in the dataset?
parole=read.csv("data/parole.csv")
nrow(parole)
[1] 675
【1.2 底線機率】How many of the parolees in the dataset violated the terms of their parole?
sum(parole$violator==1)
[1] 78
2 整理資料框 Creating Our Prediction Model
【2.1 類別變數】Which variables in this dataset are unordered factors with at least three levels?
parole$state = factor(parole$state)
parole$crime = factor(parole$crime)
summary(parole)
male race age state time.served max.sentence multiple.offenses crime
Min. :0.000 Min. :1.00 Min. :18.4 1:143 Min. :0.00 Min. : 1.0 Min. :0.000 1:315
1st Qu.:1.000 1st Qu.:1.00 1st Qu.:25.4 2:120 1st Qu.:3.25 1st Qu.:12.0 1st Qu.:0.000 2:106
Median :1.000 Median :1.00 Median :33.7 3: 82 Median :4.40 Median :12.0 Median :1.000 3:153
Mean :0.807 Mean :1.42 Mean :34.5 4:330 Mean :4.20 Mean :13.1 Mean :0.536 4:101
3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:42.5 3rd Qu.:5.20 3rd Qu.:15.0 3rd Qu.:1.000
Max. :1.000 Max. :2.00 Max. :67.0 Max. :6.00 Max. :18.0 Max. :1.000
violator
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.116
3rd Qu.:0.000
Max. :1.000
【2.2 資料框摘要】How does the output of summary() change for a factor variable as compared to a numerical variable?
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
"0.7 training 0.3 testing"
[1] "0.7 training 0.3 testing"
3 資料分割 Splitting into a Training and Testing Set
【3.1 指定隨機種子、依比例分割資料】Roughly what proportion of parolees have been allocated to the training and testing sets?
paroleglm=glm(violator~.,train,family="binomial")
summary(paroleglm)
Call:
glm(formula = violator ~ ., family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.704 -0.424 -0.272 -0.169 2.837
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.241157 1.293885 -3.28 0.001 **
male 0.386990 0.437961 0.88 0.377
race 0.886719 0.395066 2.24 0.025 *
age -0.000176 0.016085 -0.01 0.991
state2 0.443301 0.481662 0.92 0.357
state3 0.834980 0.556270 1.50 0.133
state4 -3.396788 0.611586 -5.55 0.000000028 ***
time.served -0.123887 0.120423 -1.03 0.304
max.sentence 0.080295 0.055375 1.45 0.147
multiple.offenses 1.611992 0.385305 4.18 0.000028683 ***
crime2 0.683714 0.500355 1.37 0.172
crime3 -0.278105 0.432836 -0.64 0.521
crime4 -0.011763 0.571304 -0.02 0.984
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 340.04 on 472 degrees of freedom
Residual deviance: 251.48 on 460 degrees of freedom
AIC: 277.5
Number of Fisher Scoring iterations: 6
"race state4 multiple.offenses are significant"
[1] "race state4 multiple.offenses are significant"
【3.2 隨機種子的功用】Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect? If you instead ONLY re-ran lines [3]-[5], what would you expect? If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
odd=-4.241157+0.386990+0.886719+50*(-0.000176)+3*(-0.123887)+12*0.080295+0.683714
exp(odd)
[1] 0.1826
1/(1+exp(-odd))
[1] 0.1544
4 建立模型 Building a Logistic Regression Model
【4.1 顯著性】What variables are significant in this model?
x=table(actual = test$violator, predict = predictmd>=0.5)
x
predict
actual FALSE TRUE
0 167 12
1 11 12
sensitivity=12/(12+11)
specificity=167/(167+12)
accuracy=(12+167)/(12+167+12+11)
sensitivity
[1] 0.5217
specificity
[1] 0.933
accuracy
[1] 0.8861
【4.2 從回歸係數估計邊際效用】What can we say based on the coefficient of the multiple.offenses variable?
table(test$violator)
0 1
179 23
179/(179+23)
[1] 0.8861
【4.3 從預測值估計勝率和機率】According to the model, what are the odds this individual is a violator? What is the probability this individual is a violator?
library(caTools)
colAUC(predictmd, test$violator)
[,1]
0 vs. 1 0.8946
5 驗證模型 Evaluating the Model on the Testing Set
【5.1 從測試資料預測機率】What is the maximum predicted probability of a violation?
max(predict(model1,newdata = test, type = "response"))
[1] 0.9073
【5.2 從混淆矩陣計算敏感性、明確性、正確率】What is the model’s sensitivity, specificity, accuracy?
preResult <- predict(model1,newdata = test, type = "response")
table(test$violator, as.numeric(preResult >= 0.5))
0 1
0 167 12
1 11 12
12/(11+12)
[1] 0.5217
167/(167+12)
[1] 0.933
(167+12)/(167+12+11+12)
[1] 0.8861
# 利用 table() 計算出 confusion matrix
# 並利用 confusion matrix 計算 Sensitivity、Specificity 以及 Accuracy
【5.3 底線機率】What is the accuracy of a simple model that predicts that every parolee is a non-violator?
table(test$violator)
0 1
179 23
179/(179+23)
[1] 0.8861
# 底線機率單純看其 violater 變數項是否為 1
【5.4 根據報償矩陣調整臨界機率】Which of the following most likely describes their preferences and best course of action?
print("The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5. ", quote = FALSE)
[1] The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5.
# 放錯人比關錯人的成本要大得多,因此評估該假釋犯是否會違反規定的標準應該要嚴格一些
【5.5 正確率 vs 辨識率】Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model’s accuracy as compared to the simple baseline model?
print("The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value. ", quote = FALSE)
[1] The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value.
# 可以將低放錯人的次數
【5.6 計算辨識率】Using the ROCR package, what is the AUC value for the model?
library(ROCR)
pred <- prediction(preResult, test$violator)
as.numeric(performance(pred, "auc")@y.values)
[1] 0.8946
【5.7 辨識率的定義】Describe the meaning of AUC in this context.
print("The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.", quote = FALSE)
[1] The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
# 模型能夠猜對假釋犯不會違反規定與會違反規定的概率
6 抽樣偏差 Identifying Bias in Observational Data
【6.1 如何避免、診斷、修正抽樣偏差】How could we improve our dataset to best address selection bias?
print("We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.", quote = FALSE)
[1] We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
# 面對缺漏值
# 若是將該假釋犯的 violator 自動補為 0
# 會造成模型偏向放大不會犯罪的可能性
# 若是將該假釋犯的 violator 自動補為 NA
# R 會自動在建立模型的時候剔除這些觀測值
# 最好的方式就是去追蹤這些假釋犯至其假釋期間結束
---
title: "AS3-2 Predicting Parole Violators"
author: "Group 2"
output: html_notebook
---

```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
options(digits=4, scipen=12)
library(magrittr)
```

- - -

### Introduction

**議題：用申請假釋者的屬性，預測他會不會違反假釋規定**

**學習重點：**

+ 設定隨機種子set.seed()，依比例分割資料
+ 從邏輯式回歸的係數推算自變數的邊際效果
+ 勝率和機率、勝率比和機率差 之間的換算 
+ 臨界機率對混淆矩陣(期望報酬)的影響payoff = matrix(c(0,-100,-10,-50),2,2); payoff
+ AUC的實質意義payoff = matrix(c(100, -80, -20, 100),2,2); payoff
+ 如何(從報酬矩陣)決定臨界機率 
+ 什麼是抽樣偏差,如何避免、如何修正

<br>

- - -

#### 1 資料處理 Loading the Dataset

【**1.1 讀進資料**】How many parolees are contained in the dataset?
```{r}

parole <- read.csv("data/parole.csv")
nrow(parole)

```

【**1.2 底線機率**】How many of the parolees in the dataset violated the terms of their parole?
```{r}

nrow(parole[parole$violator == 1,])

```
<br>

- - -

#### 2 整理資料框 Creating Our Prediction Model

【**2.1 類別變數**】Which variables in this dataset are unordered factors with at least three levels? 
```{r}

parole2 <- data.frame(sapply(parole,as.factor))
summary(parole2)
print("state",quote = FALSE)
print("crime",quote = FALSE)

parole$male <- parole2$male
parole$race <- parole2$race
parole$state <- parole2$state
parole$multiple.offenses <- parole2$multiple.offenses
parole$crime <- parole2$crime
parole$violator <- parole2$violator

# 將原始資料先全部轉成 factor 存到另一個 data.frame
# 觀察這個 data.frame summary() 的結果
# 挑出是類別的變數並更新原始 data.frame 中那幾個變數為 factor 資料

```

【**2.2 資料框摘要**】How does the output of `summary()` change for a factor variable as compared to a numerical variable? 
```{r}

print("The output becomes similar to that of the table() function applied to that variable .",quote = FALSE)

# factor 變數的 summary 會是各個類別的出現次數

```
<br>

- - -

#### 3 資料分割 Splitting into a Training and Testing Set

【**3.1 指定隨機種子、依比例分割資料**】Roughly what proportion of parolees have been allocated to the training and testing sets?
```{r}

set.seed(144)
library(caTools)
split <- sample.split(parole$violator, SplitRatio = 0.7)
train <- subset(parole, split == TRUE)
test <- subset(parole, split == FALSE)
nrow(train)/(nrow(train)+nrow(test))
print("70% to the training set, 30% to the testing set . ",quote = FALSE)

# 指定隨機種子並指定比例來分割資料
# 七成為 training data、三成為 testing data

```

【**3.2 隨機種子的功用**】Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect? If you instead ONLY re-ran lines [3]-[5], what would you expect? If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
```{r}

print("The exact same training/testing set split as the first execution of [1]-[5] .",quote = FALSE)
print("A different training/testing set split from the first execution of [1]-[5] . ",quote = FALSE)
print("A different training/testing set split from the first execution of [1]-[5] . ",quote = FALSE)

# 若是有指定好隨機種子
# 接下來在使用相關函式的時候都會得到相同的結果
# 因次若是沒有設定隨機種子的數字、或是指定了不同的種子
# 同樣的函式將會有不同的產出

```
<br>

- - -

#### 4 建立模型 Building a Logistic Regression Model

【**4.1 顯著性**】What variables are significant in this model?
```{r}

model1 <- glm(violator ~ . , data = train , family = "binomial")
summary(model1)
print("race , state4 , multiple.offenses",quote = FALSE)

# 利用 glm 函式搭配 family 指定為 binomial 來建立邏輯斯回歸模型
# 這邊建立的模型為 violator 為應變數
# 其他 data.frame 中的變數通通為自變數

```

【**4.2 從回歸係數估計邊際效用**】What can we say based on the coefficient of the `multiple.offenses` variable?
```{r}

exp(1.611992)
print("Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical. ", quote = FALSE)

# 擁有 multiple offenses 紀錄的假釋者
# 有 5.01 倍的機率違反假釋期間的規定

```

【**4.3 從預測值估計勝率和機率**】According to the model, what are the odds this individual is a violator?  What is the probability this individual is a violator?
```{r}

logit <- predict(model1,newdata = data.frame(male = "1",race = "1",age = 50, state = "1", time.served = 3 , max.sentence = 12, multiple.offenses = "0",crime = "2"))
names(logit) <- NULL
exp(logit)
1/(1+exp(-logit))

# 將該假釋者的資訊丟入模型預測後
# 把結果之 logit 計算成 odds 以及 probability

```
<br>

- - -

#### 5 驗證模型 Evaluating the Model on the Testing Set

【**5.1 從測試資料預測機率**】What is the maximum predicted probability of a violation?
```{r}

max(predict(model1,newdata = test, type = "response"))

```

【**5.2 從混淆矩陣計算敏感性、明確性、正確率**】What is the model's `sensitivity`, `specificity`, `accuracy`?
```{r}

preResult <- predict(model1,newdata = test, type = "response")
table(test$violator, as.numeric(preResult >= 0.5))
12/(11+12)
167/(167+12)
(167+12)/(167+12+11+12)

# 利用 table() 計算出 confusion matrix
# 並利用 confusion matrix 計算 Sensitivity、Specificity 以及 Accuracy

```

【**5.3 底線機率**】What is the accuracy of a simple model that predicts that every parolee is a non-violator?
```{r}

table(test$violator)
179/(179+23)

# 底線機率單純看其 violater 變數項是否為 1

```

【**5.4 根據報償矩陣調整臨界機率**】Which of the following most likely describes their preferences and best course of action?
```{r}

print("The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5. ", quote = FALSE)

# 放錯人比關錯人的成本要大得多，因此評估該假釋犯是否會違反規定的標準應該要嚴格一些

```

【**5.5 正確率 vs 辨識率**】Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model's accuracy as compared to the simple baseline model?
```{r}

print("The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value. ", quote = FALSE)

# 可以將低放錯人的次數

```

【**5.6 計算辨識率**】Using the `ROCR` package, what is the AUC value for the model?
```{r}

library(ROCR)
pred <- prediction(preResult, test$violator)
as.numeric(performance(pred, "auc")@y.values)

```

【**5.7 辨識率的定義**】Describe the meaning of AUC in this context.
```{r}

print("The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.", quote = FALSE)

# 模型能夠猜對假釋犯不會違反規定與會違反規定的概率

```
<br>

- - -

#### 6 抽樣偏差 Identifying Bias in Observational Data

【**6.1 如何避免、診斷、修正抽樣偏差**】How could we improve our dataset to best address selection bias?
```{r}

print("We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.", quote = FALSE)

# 面對缺漏值
# 若是將該假釋犯的 violator 自動補為 0
# 會造成模型偏向放大不會犯罪的可能性
# 若是將該假釋犯的 violator 自動補為 NA
# R 會自動在建立模型的時候剔除這些觀測值

# 最好的方式就是去追蹤這些假釋犯至其假釋期間結束

```
<br>

- - -

<br><br><br>
