---
title: "AS3-2 Predicting Parole Violators"
author: "卓雍然 D994010001"
output: html_notebook
---

```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
options(digits=4, scipen=12)
library(magrittr)
```

- - -

### Introduction

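**Topic: using the attributes of parole applicants to predict whether they will violate the terms of their parole**

**Learning points:**

+ Setting the random seed with `set.seed()` and splitting the data by a given ratio
+ Deriving the marginal effect of a predictor from logistic regression coefficients
+ Converting between odds and probabilities, and between odds ratios and probability differences
+ How the cutoff probability affects the confusion matrix (and the expected payoff), e.g. `payoff = matrix(c(0,-100,-10,-50),2,2); payoff`
+ The practical meaning of AUC, e.g. with `payoff = matrix(c(100, -80, -20, 100),2,2); payoff`
+ How to choose the cutoff probability (from a payoff matrix)
+ What sampling bias is, and how to avoid and correct it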

<br>

- - -

#### 1 Data Processing: Loading the Dataset

【**1.1 Reading the data**】How many parolees are contained in the dataset?
```{r}
parole = read.csv("Unit3/parole.csv")
nrow(parole) # 675 parolees
```

【**1.2 Baseline probability**】How many of the parolees in the dataset violated the terms of their parole?
```{r}
table(parole$violator) # 78 parolees violated their parole
```
<br>

- - -

#### 2 Preparing the Data Frame: Creating Our Prediction Model

【**2.1 Categorical variables**】Which variables in this dataset are unordered factors with at least three levels?
```{r}
# state and crime. The other variables, even if treated as factors, have only two levels.
```

【**2.2 Data frame summary**】How does the output of `summary()` change for a factor variable as compared to a numerical variable?
```{r}
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)
# For a factor, summary() reports the count of each level, much like table() applied to that variable
summary(parole$state)
```
<br>

- - -

#### 3 Data Splitting: Splitting into a Training and Testing Set

【**3.1 Setting the random seed and splitting the data by ratio**】Roughly what proportion of parolees have been allocated to the training and testing sets?
```{r}
set.seed(144)    # set the random seed so the split is reproducible
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
tr = subset(parole, split == TRUE)   # training set
ts = subset(parole, split == FALSE)  # testing set
# About 70% of the parolees go to the training set and 30% to the testing set
```

【**3.2 What the random seed does**】Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect? If you instead ONLY re-ran lines [3]-[5], what would you expect? If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?
```{r}
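# Re-running lines [1]-[5] reproduces exactly the same split, because the seed is reset first.
# Re-running only lines [3]-[5] gives a different split, because the random number stream has already advanced.
# Calling set.seed() with a different number and then re-running lines [3]-[5] also gives a different split.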
```
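The three cases can be checked without the parole data. The following is a minimal sketch using only base R's `sample()`; the helper `draw_split` and the objects `s1` to `s4` are hypothetical names introduced for illustration and are not part of the assignment.

```{r}
# Hypothetical helper: draw a sorted 70% training index from 1..n
draw_split = function(n) sort(sample(n, size = round(0.7 * n)))

set.seed(144); s1 = draw_split(100)   # seed set, then draw
s2 = draw_split(100)                  # drawn again without resetting the seed
set.seed(144); s3 = draw_split(100)   # same seed reset, then draw
set.seed(145); s4 = draw_split(100)   # a different seed

identical(s1, s3)  # TRUE: same seed, same split
identical(s1, s2)  # almost surely FALSE: the random stream has advanced
identical(s1, s4)  # almost surely FALSE: a different seed starts a different stream
```

`sample.split()` from `caTools` draws from the same random number generator, which is why resetting the seed immediately before the call is what makes the split reproducible.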
<br>

- - -

#### 4 Model Building: Building a Logistic Regression Model

【**4.1 Significance**】What variables are significant in this model?
```{r}
model1 = glm(violator ~ ., data = tr, family = "binomial")
summary(model1)
# Significant variables: race2, state4, multiple.offenses1
```

【**4.2 Estimating marginal effects from regression coefficients**】What can we say based on the coefficient of the `multiple.offenses` variable?
```{r}
# For parolees A and B who are identical except that A committed multiple offenses, the predicted log odds for A are 1.61 higher than for B.
# The coefficient is 1.6119919, so A's logit exceeds B's by about 1.612. The logit is the log of the odds, log(p/(1-p)), not a probability.
# In general, a coefficient c means the log odds (logit) increase by c for a one-unit increase in the variable.
# Answer: exp(1.6119919) = 5.01, so the model predicts that a parolee who committed multiple offenses has odds of being a violator 5.01 times those of a parolee who did not commit multiple offenses but is otherwise identical.
```
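The odds-ratio reading of the coefficient can be verified numerically. This sketch uses the coefficient value quoted above; the baseline probability `p_B` is a hypothetical value chosen only to show that the same odds ratio implies a probability difference that depends on the starting point.

```{r}
b = 1.6119919                      # coefficient of multiple.offenses quoted above
odds_ratio = exp(b)                # multiplicative change in the odds
odds_ratio                         # about 5.01

p_B = 0.10                         # hypothetical baseline probability for parolee B
odds_B = p_B / (1 - p_B)           # probability -> odds
odds_A = odds_B * odds_ratio       # A's odds are odds_ratio times B's
p_A = odds_A / (1 + odds_A)        # odds -> probability
c(p_B = p_B, p_A = p_A, difference = p_A - p_B)
```

Trying other values of `p_B` shows that the odds ratio stays fixed while the probability difference changes, which is why the coefficient is interpreted on the odds scale.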

【**4.3 Estimating odds and probability from the predicted value**】Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny. Answer the following questions based on the model's predictions for this individual.
According to the model, what are the odds this individual is a violator? What is the probability this individual is a violator?
```{r}
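# Coefficients taken from summary(model1):
# male          0.3869904
# race          0.8867192
# age          -0.0001756
# time.served  -0.1238867
# max.sentence  0.0802954
# crime2        0.6837143
# Linear predictor (logit) for this individual
A = -4.2411574 + 0.3869904*1 + 0.8867192*1 + 50*-0.0001756 + 3*-0.1238867 + 12*0.0802954 + 1*0.6837143
# odds = p/(1-p) = exp(logit), so p = odds/(1+odds)
A; exp(A); exp(A)/(1+exp(A))
# The logit is -1.700629, so the predicted odds of a violation are exp(-1.700629) = 0.1826 and the probability is 0.1543832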
```
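The conversion chain logit, odds, probability can also be written with base R's `plogis()` (inverse logit) and `qlogis()` (logit). A short sketch, starting from the logit value reported above; the variable names are chosen here for illustration.

```{r}
logit = -1.700629            # linear predictor reported above
odds = exp(logit)            # odds = exp(logit)
prob = odds / (1 + odds)     # probability = odds / (1 + odds)
c(logit = logit, odds = odds, prob = prob)

all.equal(prob, plogis(logit))   # plogis() maps a logit to a probability
all.equal(logit, qlogis(prob))   # qlogis() maps a probability back to the logit
```

A negative logit corresponds to odds below 1 and a probability below 0.5, which is a quick sanity check on hand calculations like this one.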
<br>

- - -

#### 5 Model Validation: Evaluating the Model on the Testing Set

【**5.1 Predicting probabilities on the test data**】What is the maximum predicted probability of a violation?
```{r}
pred1 = predict(model1, newdata = ts, type = "response")
summary(pred1)
# Maximum predicted probability: 0.9073
```

【**5.2 Sensitivity, specificity, and accuracy from the confusion matrix**】What are the model's `sensitivity`, `specificity`, and `accuracy`?
```{r}
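table(ts$violator, as.numeric(pred1 >= 0.5)) # confusion matrix: rows = actual, columns = predicted
sens = 12/(12+11)              # sensitivity = TP/(TP+FN)
spec = 167/(167+12)            # specificity = TN/(TN+FP)
ACC = (167+12)/(167+12+11+12)  # accuracy = (TN+TP)/N
sens; spec; ACC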
```
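Hard-coding the counts works for a single table, but the same metrics can be read off the matrix by name. A minimal sketch that rebuilds the confusion matrix from the counts used above; the object name `cm` and its dimension labels are assumptions made for illustration.

```{r}
# Rows = actual class, columns = predicted class (0 = non-violator, 1 = violator)
cm = matrix(c(167, 11, 12, 12), nrow = 2,
            dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
TN = cm["0", "0"]; FP = cm["0", "1"]
FN = cm["1", "0"]; TP = cm["1", "1"]

c(sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP),
  accuracy    = (TP + TN) / sum(cm))
```

With the real objects, `table(ts$violator, as.numeric(pred1 >= 0.5))` can be indexed the same way, provided both predicted classes actually occur at the chosen cutoff.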

【**5.3 Baseline accuracy**】What is the accuracy of a simple model that predicts that every parolee is a non-violator?
```{r}
table(ts$violator)
179/(179+23) # 179 of the 202 test-set parolees are non-violators
```

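【**5.4 Adjusting the cutoff probability according to the payoff matrix**】Consider a parole board using the model to predict whether parolees will be violators. Which of the following most likely describes the board's preferences and best course of action?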
```{r}
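table(ts$violator, as.numeric(pred1 >= 0.5)) # confusion matrix at the 0.5 cutoff, for reference
# The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5.
# Lowering the cutoff produces more positive predictions, so false positives increase and false negatives decrease.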
```
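One way to turn this preference into a concrete cutoff is to score each candidate cutoff against a payoff matrix and pick the best. The sketch below runs on simulated data, not the parole data, and assumes the payoff matrix listed in the learning points is laid out like `table(actual, predicted)`, so that a false negative costs 100 and a false positive costs 10. The names `score`, `actual`, and `total_payoff` are hypothetical.

```{r}
set.seed(1)                            # simulated data, not the parole data
n = 500
score = runif(n)                       # hypothetical predicted probabilities
actual = rbinom(n, 1, score)           # outcomes drawn so the scores are calibrated

# Payoff matrix from the learning points; rows = actual, columns = predicted (assumed layout)
payoff = matrix(c(0, -100, -10, -50), 2, 2)

total_payoff = function(cutoff) {
  predicted = as.numeric(score >= cutoff)
  # Fix the factor levels so the table is 2x2 even if one predicted class is empty
  cm = table(factor(actual, levels = 0:1), factor(predicted, levels = 0:1))
  sum(cm * payoff)
}

cutoffs = seq(0, 1, by = 0.05)
payoffs = sapply(cutoffs, total_payoff)
best = cutoffs[which.max(payoffs)]
best   # cutoff with the highest total payoff on the simulated data
```

Because a false negative is penalized more heavily than a false positive in this matrix, the selected cutoff should tend to fall below 0.5. Replacing `score` and `actual` with `pred1` and `ts$violator` would apply the same search to the fitted model.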

【**5.5 Accuracy vs. discriminative ability**】Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model's accuracy as compared to the simple baseline model?
```{r}
table(ts$violator, as.numeric(pred1 >= 0.5))
#The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value.
```

【**5.6 Computing the AUC**】Using the `ROCR` package, what is the AUC value for the model?
```{r}
library(ROCR)
ROCRpred1= prediction(pred1, ts$violator)
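as.numeric(performance(ROCRpred1, "auc")@y.values)
# AUC measures the model's ability to discriminate across all cutoff probabilities: it is the probability that the model correctly ranks one randomly chosen case from each of the two classes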
```

【**5.7 The definition of AUC**】Describe the meaning of AUC in this context.
```{r}
#The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator
```
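This pairwise interpretation can be checked directly in base R. The sketch below uses simulated scores rather than the parole predictions; `pos` and `neg` are hypothetical score vectors for violators and non-violators. It compares every violator with every non-violator and confirms that the share of correctly ordered pairs matches the rank-sum (Mann-Whitney) formula for AUC.

```{r}
set.seed(2)                        # simulated scores, not the parole data
pos = rnorm(40, mean = 1)          # hypothetical scores for violators
neg = rnorm(160, mean = 0)         # hypothetical scores for non-violators

# Compare every violator with every non-violator; ties count as half
cmp = outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_pairs = mean(cmp)

# Same quantity from the rank-sum (Mann-Whitney) formula
r = rank(c(pos, neg))
n_pos = length(pos); n_neg = length(neg)
auc_rank = (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(auc_pairs = auc_pairs, auc_rank = auc_rank)
```

Applying ROCR's `performance(..., "auc")` to the same scores and labels should agree with both values, since all three compute the same ranking statistic.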
<br>

- - -

#### 6 Sampling Bias: Identifying Bias in Observational Data

【**6.1 Avoiding, diagnosing, and correcting sampling bias**】How could we improve our dataset to best address selection bias?
```{r}
#We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
```
<br>

- - -

<br><br><br>
