rm(list=ls(all=T))
options(digits=4, scipen=12)
library(magrittr)

Introduction

議題:用申請假釋者的屬性,預測他會不會違反假釋規定

學習重點:



1 資料處理 Loading the Dataset

1.1 讀進資料】How many parolees are contained in the dataset?

parole=read.csv("data/parole.csv")
nrow(parole)
[1] 675

1.2 底線機率】How many of the parolees in the dataset violated the terms of their parole?

sum(parole$violator==1)
[1] 78



2 整理資料框 Creating Our Prediction Model

2.1 類別變數】Which variables in this dataset are unordered factors with at least three levels?

parole$state = factor(parole$state)
parole$crime = factor(parole$crime)
summary(parole)
      male            race           age       state    time.served    max.sentence  multiple.offenses crime  
 Min.   :0.000   Min.   :1.00   Min.   :18.4   1:143   Min.   :0.00   Min.   : 1.0   Min.   :0.000     1:315  
 1st Qu.:1.000   1st Qu.:1.00   1st Qu.:25.4   2:120   1st Qu.:3.25   1st Qu.:12.0   1st Qu.:0.000     2:106  
 Median :1.000   Median :1.00   Median :33.7   3: 82   Median :4.40   Median :12.0   Median :1.000     3:153  
 Mean   :0.807   Mean   :1.42   Mean   :34.5   4:330   Mean   :4.20   Mean   :13.1   Mean   :0.536     4:101  
 3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:42.5           3rd Qu.:5.20   3rd Qu.:15.0   3rd Qu.:1.000            
 Max.   :1.000   Max.   :2.00   Max.   :67.0           Max.   :6.00   Max.   :18.0   Max.   :1.000            
    violator    
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.116  
 3rd Qu.:0.000  
 Max.   :1.000  

2.2 資料框摘要】How does the output of summary() change for a factor variable as compared to a numerical variable?

set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
"0.7 training 0.3 testing"
[1] "0.7 training 0.3 testing"



3 資料分割 Splitting into a Training and Testing Set

3.1 指定隨機種子、依比例分割資料】Roughly what proportion of parolees have been allocated to the training and testing sets?

paroleglm=glm(violator~.,train,family="binomial")
summary(paroleglm)

Call:
glm(formula = violator ~ ., family = "binomial", data = train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.704  -0.424  -0.272  -0.169   2.837  

Coefficients:
                   Estimate Std. Error z value    Pr(>|z|)    
(Intercept)       -4.241157   1.293885   -3.28       0.001 ** 
male               0.386990   0.437961    0.88       0.377    
race               0.886719   0.395066    2.24       0.025 *  
age               -0.000176   0.016085   -0.01       0.991    
state2             0.443301   0.481662    0.92       0.357    
state3             0.834980   0.556270    1.50       0.133    
state4            -3.396788   0.611586   -5.55 0.000000028 ***
time.served       -0.123887   0.120423   -1.03       0.304    
max.sentence       0.080295   0.055375    1.45       0.147    
multiple.offenses  1.611992   0.385305    4.18 0.000028683 ***
crime2             0.683714   0.500355    1.37       0.172    
crime3            -0.278105   0.432836   -0.64       0.521    
crime4            -0.011763   0.571304   -0.02       0.984    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 340.04  on 472  degrees of freedom
Residual deviance: 251.48  on 460  degrees of freedom
AIC: 277.5

Number of Fisher Scoring iterations: 6
"race state4 multiple.offenses are significant"
[1] "race state4 multiple.offenses are significant"

3.2 隨機種子的功用】Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect? If you instead ONLY re-ran lines [3]-[5], what would you expect? If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?

odd=-4.241157+0.386990+0.886719+50*(-0.000176)+3*(-0.123887)+12*0.080295+0.683714
exp(odd)
[1] 0.1826
1/(1+exp(-odd))
[1] 0.1544



4 建立模型 Building a Logistic Regression Model

4.1 顯著性】What variables are significant in this model?

x=table(actual = test$violator, predict = predictmd>=0.5)
x
      predict
actual FALSE TRUE
     0   167   12
     1    11   12
sensitivity=12/(12+11)
specificity=167/(167+12)
accuracy=(12+167)/(12+167+12+11)
sensitivity
[1] 0.5217
specificity
[1] 0.933
accuracy
[1] 0.8861

4.2 從回歸係數估計邊際效用】What can we say based on the coefficient of the multiple.offenses variable?

table(test$violator)

  0   1 
179  23 
179/(179+23)
[1] 0.8861

4.3 從預測值估計勝率和機率】According to the model, what are the odds this individual is a violator? What is the probability this individual is a violator?

library(caTools)
colAUC(predictmd, test$violator)
          [,1]
0 vs. 1 0.8946



5 驗證模型 Evaluating the Model on the Testing Set

5.1 從測試資料預測機率】What is the maximum predicted probability of a violation?

max(predict(model1,newdata = test, type = "response"))
[1] 0.9073

5.2 從混淆矩陣計算敏感性、明確性、正確率】What is the model’s sensitivity, specificity, accuracy?

preResult <- predict(model1,newdata = test, type = "response")
table(test$violator, as.numeric(preResult >= 0.5))
   
      0   1
  0 167  12
  1  11  12
12/(11+12)
[1] 0.5217
167/(167+12)
[1] 0.933
(167+12)/(167+12+11+12)
[1] 0.8861
# 利用 table() 計算出 confusion matrix
# 並利用 confusion matrix 計算 Sensitivity、Specificity 以及 Accuracy

5.3 底線機率】What is the accuracy of a simple model that predicts that every parolee is a non-violator?

table(test$violator)

  0   1 
179  23 
179/(179+23)
[1] 0.8861
# 底線機率單純看其 violater 變數項是否為 1

5.4 根據報償矩陣調整臨界機率】Which of the following most likely describes their preferences and best course of action?

print("The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5. ", quote = FALSE)
[1] The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5. 
# 放錯人比關錯人的成本要大得多,因此評估該假釋犯是否會違反規定的標準應該要嚴格一些

5.5 正確率 vs 辨識率】Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model’s accuracy as compared to the simple baseline model?

print("The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value. ", quote = FALSE)
[1] The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value. 
# 可以將低放錯人的次數

5.6 計算辨識率】Using the ROCR package, what is the AUC value for the model?

library(ROCR)
pred <- prediction(preResult, test$violator)
as.numeric(performance(pred, "auc")@y.values)
[1] 0.8946

5.7 辨識率的定義】Describe the meaning of AUC in this context.

print("The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.", quote = FALSE)
[1] The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
# 模型能夠猜對假釋犯不會違反規定與會違反規定的概率



6 抽樣偏差 Identifying Bias in Observational Data

6.1 如何避免、診斷、修正抽樣偏差】How could we improve our dataset to best address selection bias?

print("We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.", quote = FALSE)
[1] We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
# 面對缺漏值
# 若是將該假釋犯的 violator 自動補為 0
# 會造成模型偏向放大不會犯罪的可能性
# 若是將該假釋犯的 violator 自動補為 NA
# R 會自動在建立模型的時候剔除這些觀測值
# 最好的方式就是去追蹤這些假釋犯至其假釋期間結束






---
title: "AS3-2 Predicting Parole Violators"
author: "Group 2"
output: html_notebook
---

```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
options(digits=4, scipen=12)
library(magrittr)
```

- - -

### Introduction

**議題：用申請假釋者的屬性，預測他會不會違反假釋規定**

**學習重點：**

+ 設定隨機種子set.seed()，依比例分割資料
+ 從邏輯式回歸的係數推算自變數的邊際效果
+ 勝率和機率、勝率比和機率差 之間的換算 
+ 臨界機率對混淆矩陣(期望報酬)的影響payoff = matrix(c(0,-100,-10,-50),2,2); payoff
+ AUC的實質意義payoff = matrix(c(100, -80, -20, 100),2,2); payoff
+ 如何(從報酬矩陣)決定臨界機率 
+ 什麼是抽樣偏差,如何避免、如何修正

<br>

- - -

#### 1 資料處理 Loading the Dataset

【**1.1 讀進資料**】How many parolees are contained in the dataset?
```{r}

parole <- read.csv("data/parole.csv")
nrow(parole)

```

【**1.2 底線機率**】How many of the parolees in the dataset violated the terms of their parole?
```{r}

nrow(parole[parole$violator == 1,])

```
<br>

- - -

#### 2 整理資料框 Creating Our Prediction Model

【**2.1 類別變數**】Which variables in this dataset are unordered factors with at least three levels? 
```{r}

parole2 <- data.frame(sapply(parole,as.factor))
summary(parole2)
print("state",quote = FALSE)
print("crime",quote = FALSE)

parole$male <- parole2$male
parole$race <- parole2$race
parole$state <- parole2$state
parole$multiple.offenses <- parole2$multiple.offenses
parole$crime <- parole2$crime
parole$violator <- parole2$violator

# 將原始資料先全部轉成 factor 存到另一個 data.frame
# 觀察這個 data.frame summary() 的結果
# 挑出是類別的變數並更新原始 data.frame 中那幾個變數為 factor 資料

```

【**2.2 資料框摘要**】How does the output of `summary()` change for a factor variable as compared to a numerical variable? 
```{r}

print("The output becomes similar to that of the table() function applied to that variable .",quote = FALSE)

# factor 變數的 summary 會是各個類別的出現次數

```
<br>

- - -

#### 3 資料分割 Splitting into a Training and Testing Set

【**3.1 指定隨機種子、依比例分割資料**】Roughly what proportion of parolees have been allocated to the training and testing sets?
```{r}

set.seed(144)
library(caTools)
split <- sample.split(parole$violator, SplitRatio = 0.7)
train <- subset(parole, split == TRUE)
test <- subset(parole, split == FALSE)
nrow(train)/(nrow(train)+nrow(test))
print("70% to the training set, 30% to the testing set . ",quote = FALSE)

# 指定隨機種子並指定比例來分割資料
# 七成為 training data、三成為 testing data

```

【**3.2 隨機種子的功用**】Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect? If you instead ONLY re-ran lines [3]-[5], what would you expect? If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
```{r}

print("The exact same training/testing set split as the first execution of [1]-[5] .",quote = FALSE)
print("A different training/testing set split from the first execution of [1]-[5] . ",quote = FALSE)
print("A different training/testing set split from the first execution of [1]-[5] . ",quote = FALSE)

# 若是有指定好隨機種子
# 接下來在使用相關函式的時候都會得到相同的結果
# 因次若是沒有設定隨機種子的數字、或是指定了不同的種子
# 同樣的函式將會有不同的產出

```
<br>

- - -

#### 4 建立模型 Building a Logistic Regression Model

【**4.1 顯著性**】What variables are significant in this model?
```{r}

model1 <- glm(violator ~ . , data = train , family = "binomial")
summary(model1)
print("race , state4 , multiple.offenses",quote = FALSE)

# 利用 glm 函式搭配 family 指定為 binomial 來建立邏輯斯回歸模型
# 這邊建立的模型為 violator 為應變數
# 其他 data.frame 中的變數通通為自變數

```

【**4.2 從回歸係數估計邊際效用**】What can we say based on the coefficient of the `multiple.offenses` variable?
```{r}

exp(1.611992)
print("Our model predicts that a parolee who committed multiple offenses has 5.01 times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical. ", quote = FALSE)

# 擁有 multiple offenses 紀錄的假釋者
# 有 5.01 倍的機率違反假釋期間的規定

```

【**4.3 從預測值估計勝率和機率**】According to the model, what are the odds this individual is a violator?  What is the probability this individual is a violator?
```{r}

logit <- predict(model1,newdata = data.frame(male = "1",race = "1",age = 50, state = "1", time.served = 3 , max.sentence = 12, multiple.offenses = "0",crime = "2"))
names(logit) <- NULL
exp(logit)
1/(1+exp(-logit))

# 將該假釋者的資訊丟入模型預測後
# 把結果之 logit 計算成 odds 以及 probability

```
<br>

- - -

#### 5 驗證模型 Evaluating the Model on the Testing Set

【**5.1 從測試資料預測機率**】What is the maximum predicted probability of a violation?
```{r}

max(predict(model1,newdata = test, type = "response"))

```

【**5.2 從混淆矩陣計算敏感性、明確性、正確率**】What is the model's `sensitivity`, `specificity`, `accuracy`?
```{r}

preResult <- predict(model1,newdata = test, type = "response")
table(test$violator, as.numeric(preResult >= 0.5))
12/(11+12)
167/(167+12)
(167+12)/(167+12+11+12)

# 利用 table() 計算出 confusion matrix
# 並利用 confusion matrix 計算 Sensitivity、Specificity 以及 Accuracy

```

【**5.3 底線機率**】What is the accuracy of a simple model that predicts that every parolee is a non-violator?
```{r}

table(test$violator)
179/(179+23)

# 底線機率單純看其 violater 變數項是否為 1

```

【**5.4 根據報償矩陣調整臨界機率**】Which of the following most likely describes their preferences and best course of action?
```{r}

print("The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5. ", quote = FALSE)

# 放錯人比關錯人的成本要大得多，因此評估該假釋犯是否會違反規定的標準應該要嚴格一些

```

【**5.5 正確率 vs 辨識率**】Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model's accuracy as compared to the simple baseline model?
```{r}

print("The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value. ", quote = FALSE)

# 可以將低放錯人的次數

```

【**5.6 計算辨識率**】Using the `ROCR` package, what is the AUC value for the model?
```{r}

library(ROCR)
pred <- prediction(preResult, test$violator)
as.numeric(performance(pred, "auc")@y.values)

```

【**5.7 辨識率的定義**】Describe the meaning of AUC in this context.
```{r}

print("The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.", quote = FALSE)

# 模型能夠猜對假釋犯不會違反規定與會違反規定的概率

```
<br>

- - -

#### 6 抽樣偏差 Identifying Bias in Observational Data

【**6.1 如何避免、診斷、修正抽樣偏差**】How could we improve our dataset to best address selection bias?
```{r}

print("We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.", quote = FALSE)

# 面對缺漏值
# 若是將該假釋犯的 violator 自動補為 0
# 會造成模型偏向放大不會犯罪的可能性
# 若是將該假釋犯的 violator 自動補為 NA
# R 會自動在建立模型的時候剔除這些觀測值

# 最好的方式就是去追蹤這些假釋犯至其假釋期間結束

```
<br>

- - -

<br><br><br>
