Introduction
議題:用申請假釋者的屬性,預測他會不會違反假釋規定
學習重點:
- 設定隨機種子set.seed(),依比例分割資料
- 從邏輯式回歸的係數推算自變數的邊際效果
- 勝率和機率、勝率比和機率差 之間的換算
- 臨界機率對混淆矩陣(期望報酬)的影響payoff = matrix(c(0,-100,-10,-50),2,2); payoff
- AUC的實質意義payoff = matrix(c(100, -80, -20, 100),2,2); payoff
- 如何(從報酬矩陣)決定臨界機率
- 什麼是抽樣偏差,如何避免、如何修正
1 資料處理 Loading the Dataset
【1.1 讀進資料】How many parolees are contained in the dataset?
parole <- read.csv("parole.csv")
str(parole)
'data.frame': 675 obs. of 9 variables:
$ male : int 1 0 1 1 1 1 1 0 0 1 ...
$ race : int 1 1 2 1 2 2 1 1 1 2 ...
$ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
$ state : int 1 1 1 1 1 1 1 1 1 1 ...
$ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
$ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
$ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
$ crime : int 4 3 3 1 1 4 3 1 3 2 ...
$ violator : int 0 0 0 0 0 0 0 0 0 0 ...
summary(parole)
male race age state time.served
Min. :0.000 Min. :1.00 Min. :18.4 Min. :1.00 Min. :0.00
1st Qu.:1.000 1st Qu.:1.00 1st Qu.:25.4 1st Qu.:2.00 1st Qu.:3.25
Median :1.000 Median :1.00 Median :33.7 Median :3.00 Median :4.40
Mean :0.807 Mean :1.42 Mean :34.5 Mean :2.89 Mean :4.20
3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:42.5 3rd Qu.:4.00 3rd Qu.:5.20
Max. :1.000 Max. :2.00 Max. :67.0 Max. :4.00 Max. :6.00
max.sentence multiple.offenses crime violator
Min. : 1.0 Min. :0.000 Min. :1.00 Min. :0.000
1st Qu.:12.0 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.000
Median :12.0 Median :1.000 Median :2.00 Median :0.000
Mean :13.1 Mean :0.536 Mean :2.06 Mean :0.116
3rd Qu.:15.0 3rd Qu.:1.000 3rd Qu.:3.00 3rd Qu.:0.000
Max. :18.0 Max. :1.000 Max. :4.00 Max. :1.000
【1.2 底線機率】How many of the parolees in the dataset violated the terms of their parole?
table(parole$violator)
0 1
597 78
2 整理資料框 Creating Our Prediction Model
【2.1 類別變數】Which variables in this dataset are unordered factors with at least three levels?
### (4) state
### (8) crime
【2.2 資料框摘要】How does the output of summary() change for a factor variable as compared to a numerical variable?
### (1) The output becomes similar to that of the table() function applied to that variable
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)
class(parole$state)
[1] "factor"
class(parole$crime)
[1] "factor"
3 資料分割 Splitting into a Training and Testing Set
【3.1 指定隨機種子、依比例分割資料】Roughly what proportion of parolees have been allocated to the training and testing sets?
### (1) 70% to the training set, 30% to the testing set
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
【3.2 隨機種子的功用】Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect? If you instead ONLY re-ran lines [3]-[5], what would you expect? If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
### (2) A different training/testing set split from the first execution of [1]-[5]
### (1) The exact same training/testing set split as the first execution of [1]-[5]
### (2) A different training/testing set split from the first execution of [1]-[5]
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
4 建立模型 Building a Logistic Regression Model
【4.1 顯著性】What variables are significant in this model?
### race/ state4/ multiple.offenses
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
paroleLog <- glm(violator ~., data = parole, family = "binomial")
【4.2 從回歸係數估計邊際效用】What can we say based on the coefficient of the multiple.offenses variable?
# (4) Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.
### exp(ln(oddsA)) = exp(ln(oddsB) + 1.61)
### exp(ln(oddsA)) = exp(ln(oddsB)) * exp(1.61)
### oddsA = exp(1.61) * oddsB
### oddsA= 5.01 * oddsB
【4.3 從預測值估計勝率和機率】According to the model, what are the odds this individual is a violator? What is the probability this individual is a violator?
odds <- glm(violator ~ male + race + age + state + time.served + max.sentence + multiple.offenses + crime, data = parole, family = binomial)
odds
Call: glm(formula = violator ~ male + race + age + state + time.served +
max.sentence + multiple.offenses + crime, family = binomial,
data = parole)
Coefficients:
(Intercept) male race age
-4.05238 0.27062 0.75725 0.00655
state2 state3 state4 time.served
0.20840 0.89381 -3.28084 -0.07655
max.sentence multiple.offenses crime2 crime3
0.05329 1.53155 0.33689 -0.28099
crime4
-0.15781
Degrees of Freedom: 674 Total (i.e. Null); 662 Residual
Null Deviance: 483
Residual Deviance: 349 AIC: 375
logodds <- -4.05238 + 0.27062*male + 0.75725*race - 0.00655*age + 0.20840*state2 + 0.89381*state3 - 3.28084*state4 - 0.07655*time.served + 0.05329*max.sentence + 1.53155*multiple.offenses + 0.33689*crime2 - 0.28099*crime3 - 0.15781*crime4
male <- 1
race <- 1
age <- 50
state <- 1
time.served <- 3
max.sentence <- 12
multiple.offenses <- 0
crime <- 2
logodds
[1] -2.605
odds <- exp(logodds)
odds
[1] 0.07388
probViolation <- 1 / (1 + odds)
probViolation
[1] 0.9312
5 驗證模型 Evaluating the Model on the Testing Set
【5.1 從測試資料預測機率】What is the maximum predicted probability of a violation?
prediction <- predict(paroleLog, type = "response", newdata = test)
summary(prediction)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0024 0.0212 0.0529 0.1360 0.1254 0.8125
【5.2 從混淆矩陣計算敏感性、明確性、正確率】What is the model’s sensitivity, specificity, accuracy?
table(test$violator, prediction > 0.5)
FALSE TRUE
0 169 10
1 11 12
sensitivity <- 12 / (11 + 12)
specificity <- 169 / (169 +10)
accuracy <- (169 + 12) / (169 + 11 + 10 +12)
sensitivity
[1] 0.5217
specificity
[1] 0.9441
accuracy
[1] 0.896
【5.3 底線機率】What is the accuracy of a simple model that predicts that every parolee is a non-violator?
table(test$violator)
0 1
179 23
accuracy <- 179 / (179 + 23)
accuracy
[1] 0.8861
【5.4 根據報償矩陣調整臨界機率】Which of the following most likely describes their preferences and best course of action?
pred <- predict(paroleLog, type = "response", newdata = test)
summary(pred)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0024 0.0212 0.0529 0.1360 0.1254 0.8125
【5.5 正確率 vs 辨識率】Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model’s accuracy as compared to the simple baseline model?
table(pred$violator, pred > 0.5)
Error in pred$violator : $ operator is invalid for atomic vectors
【5.6 計算辨識率】Using the ROCR package, what is the AUC value for the model?
library(ROCR)
ROCRpred <- prediction(prediction, test$violator)
as.numeric(performance(ROCRpred, "auc")@y.values)
[1] 0.9031
【5.7 辨識率的定義】Describe the meaning of AUC in this context.
# (1) The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
# AUC(Area Under Curve, 辨識力)
6 抽樣偏差 Identifying Bias in Observational Data
【6.1 如何避免、診斷、修正抽樣偏差】How could we improve our dataset to best address selection bias?
# (4) We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
---
title: "AS3-2 Predicting Parole Violators"
author: "楊凱倫 M064610021"
output: html_notebook
---

```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
options(digits=4, scipen=12)
library(magrittr)
```

- - -

### Introduction

**議題：用申請假釋者的屬性，預測他會不會違反假釋規定**

**學習重點：**

+ 設定隨機種子set.seed()，依比例分割資料
+ 從邏輯式回歸的係數推算自變數的邊際效果
+ 勝率和機率、勝率比和機率差 之間的換算 
+ 臨界機率對混淆矩陣(期望報酬)的影響payoff = matrix(c(0,-100,-10,-50),2,2); payoff
+ AUC的實質意義payoff = matrix(c(100, -80, -20, 100),2,2); payoff
+ 如何(從報酬矩陣)決定臨界機率 
+ 什麼是抽樣偏差,如何避免、如何修正

<br>

- - -

#### 1 資料處理 Loading the Dataset

【**1.1 讀進資料**】How many parolees are contained in the dataset?
```{r}
parole <- read.csv("parole.csv")
str(parole)
summary(parole)
```

【**1.2 底線機率**】How many of the parolees in the dataset violated the terms of their parole?
```{r}
table(parole$violator)
```
<br>

- - -

#### 2 整理資料框 Creating Our Prediction Model

【**2.1 類別變數**】Which variables in this dataset are unordered factors with at least three levels? 
```{r}
### (4) state
### (8) crime
```

【**2.2 資料框摘要**】How does the output of `summary()` change for a factor variable as compared to a numerical variable? 
```{r}
### (1) The output becomes similar to that of the table() function applied to that variable
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)
class(parole$state)
class(parole$crime)
```
<br>

- - -

#### 3 資料分割 Splitting into a Training and Testing Set

【**3.1 指定隨機種子、依比例分割資料**】Roughly what proportion of parolees have been allocated to the training and testing sets?
```{r}
### (1) 70% to the training set, 30% to the testing set
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
```

【**3.2 隨機種子的功用**】Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect? If you instead ONLY re-ran lines [3]-[5], what would you expect? If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
```{r}
### (2) A different training/testing set split from the first execution of [1]-[5]
### (1) The exact same training/testing set split as the first execution of [1]-[5]
### (2) A different training/testing set split from the first execution of [1]-[5]

split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)

```
<br>

- - -

#### 4 建立模型 Building a Logistic Regression Model

【**4.1 顯著性**】What variables are significant in this model?
```{r}
### race/ state4/ multiple.offenses
set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)

paroleLog <- glm(violator ~., data = parole, family = "binomial")
```

【**4.2 從回歸係數估計邊際效用**】What can we say based on the coefficient of the `multiple.offenses` variable?
```{r}
# (4) Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.
### exp(ln(oddsA)) = exp(ln(oddsB) + 1.61)
### exp(ln(oddsA)) = exp(ln(oddsB)) * exp(1.61)
### oddsA = exp(1.61) * oddsB
### oddsA= 5.01 * oddsB
```

【**4.3 從預測值估計勝率和機率**】According to the model, what are the odds this individual is a violator?  What is the probability this individual is a violator?
```{r}
odds <- glm(violator ~ male + race + age + state + time.served + max.sentence + multiple.offenses + crime, data = parole, family = binomial)
odds

logodds <- -4.05238 + 0.27062*male + 0.75725*race - 0.00655*age + 0.20840*state2 + 0.89381*state3 - 3.28084*state4 - 0.07655*time.served + 0.05329*max.sentence + 1.53155*multiple.offenses + 0.33689*crime2 - 0.28099*crime3 - 0.15781*crime4
male <- 1
race <- 1
age <- 50
state <- 1
time.served <- 3
max.sentence <- 12
multiple.offenses <- 0
crime <- 2
logodds
odds <- exp(logodds)
odds
probViolation <- 1 / (1 + odds)
probViolation
```
<br>

- - -

#### 5 驗證模型 Evaluating the Model on the Testing Set

【**5.1 從測試資料預測機率**】What is the maximum predicted probability of a violation?
```{r}
prediction <- predict(paroleLog, type = "response", newdata = test)
summary(prediction)
```

【**5.2 從混淆矩陣計算敏感性、明確性、正確率**】What is the model's `sensitivity`, `specificity`, `accuracy`?
```{r}
table(test$violator, prediction > 0.5)
sensitivity <- 12 / (11 + 12)
specificity <- 169 / (169 +10)
accuracy <- (169 + 12) / (169 + 11 + 10 +12)
sensitivity
specificity
accuracy
```

【**5.3 底線機率**】What is the accuracy of a simple model that predicts that every parolee is a non-violator?
```{r}
table(test$violator)
accuracy <- 179 / (179 + 23)
accuracy
```

【**5.4 根據報償矩陣調整臨界機率**】Which of the following most likely describes their preferences and best course of action?
```{r}
# (2) The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5.
### FN：被冤枉，要盡量避免此一情況發生，故臨界值必須調低，以降低型二錯誤發生率

pred <- predict(paroleLog, type = "response", newdata = test) 
summary(pred)
```

【**5.5 正確率 vs 辨識率**】Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model's accuracy as compared to the simple baseline model?
```{r}
# (4) The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value.
### baseline 和 新建出來的 model 都具有解釋效力，然而根據testing data 所建置出來的新模型解釋效力更佳 

```

【**5.6 計算辨識率**】Using the `ROCR` package, what is the AUC value for the model?
```{r}
library(ROCR)
ROCRpred <- prediction(prediction, test$violator)
as.numeric(performance(ROCRpred, "auc")@y.values)
```

【**5.7 辨識率的定義**】Describe the meaning of AUC in this context.
```{r}
# (1) The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.

# AUC(Area Under Curve, 辨識力)
```
<br>

- - -

#### 6 抽樣偏差 Identifying Bias in Observational Data

【**6.1 如何避免、診斷、修正抽樣偏差**】How could we improve our dataset to best address selection bias?
```{r}
# (4) We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
```
<br>

- - -

<br><br><br>
