rm(list=ls(all=TRUE))
options(digits=4, scipen=12)
library(magrittr)

Introduction

Topic: use the attributes of parole applicants to predict whether they will violate the terms of their parole

Learning objectives:



1 Data Processing: Loading the Dataset

1.1 [Reading the data] How many parolees are contained in the dataset?

parole=read.csv("data/parole.csv")
nrow(parole)
[1] 675

1.2 [Baseline rate] How many of the parolees in the dataset violated the terms of their parole?

sum(parole$violator==1)
[1] 78



2 Preparing the Data Frame: Creating Our Prediction Model

2.1 [Categorical variables] Which variables in this dataset are unordered factors with at least three levels?

parole$state = factor(parole$state)
parole$crime = factor(parole$crime)
summary(parole)
      male            race           age       state    time.served    max.sentence  multiple.offenses crime  
 Min.   :0.000   Min.   :1.00   Min.   :18.4   1:143   Min.   :0.00   Min.   : 1.0   Min.   :0.000     1:315  
 1st Qu.:1.000   1st Qu.:1.00   1st Qu.:25.4   2:120   1st Qu.:3.25   1st Qu.:12.0   1st Qu.:0.000     2:106  
 Median :1.000   Median :1.00   Median :33.7   3: 82   Median :4.40   Median :12.0   Median :1.000     3:153  
 Mean   :0.807   Mean   :1.42   Mean   :34.5   4:330   Mean   :4.20   Mean   :13.1   Mean   :0.536     4:101  
 3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:42.5           3rd Qu.:5.20   3rd Qu.:15.0   3rd Qu.:1.000            
 Max.   :1.000   Max.   :2.00   Max.   :67.0           Max.   :6.00   Max.   :18.0   Max.   :1.000            
    violator    
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.116  
 3rd Qu.:0.000  
 Max.   :1.000  
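
To see how summary() treats a factor differently from a numeric column, here is a minimal sketch on a small made-up data frame. The `toy` data frame and its values are assumptions for illustration and are not taken from parole.csv.

```r
# Toy data frame; values are made up for illustration
toy <- data.frame(
  age   = c(21.5, 34.0, 46.2, 58.1),
  state = c(1, 4, 4, 2)
)

# Numeric column: summary() reports min, quartiles, mean, and max
num_summary <- summary(toy$state)

# Factor column: summary() reports a count per level instead
toy$state <- factor(toy$state)
fac_summary <- summary(toy$state)

print(num_summary)
print(fac_summary)
```

This mirrors the state and crime columns in the output above, which show level counts after the factor() conversion.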

2.2 [Data frame summary] How does the output of summary() change for a factor variable as compared to a numerical variable?

set.seed(144)
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
"0.7 training 0.3 testing"
[1] "0.7 training 0.3 testing"
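
The role of set.seed() in the split can be sketched with base R alone. Here sample() stands in for sample.split(), which is an assumption made so the example runs without caTools or the parole data.

```r
# Base-R illustration of seed behaviour; sample() stands in for sample.split()
set.seed(144)
first <- sample(1:10, 5)

set.seed(144)                 # same seed again
second <- sample(1:10, 5)

third <- sample(1:10, 5)      # no reseed: the generator has moved on

identical(first, second)      # same seed, same draw
print(rbind(first, second, third))
```

Resetting the seed immediately before the random call reproduces the draw, while drawing again without resetting it continues from the generator's current state.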



3 Splitting the Data: Splitting into a Training and Testing Set

3.1 [Setting the random seed and splitting by ratio] Roughly what proportion of parolees have been allocated to the training and testing sets?

paroleglm=glm(violator~.,train,family="binomial")
summary(paroleglm)

Call:
glm(formula = violator ~ ., family = "binomial", data = train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.704  -0.424  -0.272  -0.169   2.837  

Coefficients:
                   Estimate Std. Error z value    Pr(>|z|)    
(Intercept)       -4.241157   1.293885   -3.28       0.001 ** 
male               0.386990   0.437961    0.88       0.377    
race               0.886719   0.395066    2.24       0.025 *  
age               -0.000176   0.016085   -0.01       0.991    
state2             0.443301   0.481662    0.92       0.357    
state3             0.834980   0.556270    1.50       0.133    
state4            -3.396788   0.611586   -5.55 0.000000028 ***
time.served       -0.123887   0.120423   -1.03       0.304    
max.sentence       0.080295   0.055375    1.45       0.147    
multiple.offenses  1.611992   0.385305    4.18 0.000028683 ***
crime2             0.683714   0.500355    1.37       0.172    
crime3            -0.278105   0.432836   -0.64       0.521    
crime4            -0.011763   0.571304   -0.02       0.984    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 340.04  on 472  degrees of freedom
Residual deviance: 251.48  on 460  degrees of freedom
AIC: 277.5

Number of Fisher Scoring iterations: 6
"race state4 multiple.offenses are significant"
[1] "race state4 multiple.offenses are significant"
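
One way to read the multiple.offenses coefficient is to exponentiate it, which turns a change in log-odds into a multiplicative change in odds. The sketch below copies the estimate from the summary output; the variable names are illustrative.

```r
# Coefficient copied from the summary(paroleglm) output
beta_multiple <- 1.611992

# exp(coefficient) is the multiplicative change in the odds of violation
odds_ratio <- exp(beta_multiple)
round(odds_ratio, 2)
```

Under this model, a parolee with multiple offenses has odds of violation about 5.01 times those of an otherwise identical parolee without multiple offenses.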

3.2 [What the random seed does] Now, suppose you re-ran lines [1]-[5] of Problem 3.1. What would you expect? If you instead ONLY re-ran lines [3]-[5], what would you expect? If you instead called set.seed() with a different number and then re-ran lines [3]-[5] of Problem 3.1, what would you expect?

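# Linear predictor (log-odds): male = 1, race = 1, age = 50, time.served = 3,
# max.sentence = 12, crime = 2; state and multiple.offenses at their baseline
log_odds=-4.241157+0.386990+0.886719+50*(-0.000176)+3*(-0.123887)+12*0.080295+0.683714
exp(log_odds)  # odds of being a violator
[1] 0.1826
1/(1+exp(-log_odds))  # probability of being a violator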
[1] 0.1544
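
The same hand calculation can be written as a dot product of coefficients and a profile vector, which is less error-prone than a long sum. This is a sketch: the coefficients are copied from the summary output, and the names `coefs` and `profile` are illustrative.

```r
# Coefficients copied from the summary(paroleglm) output
coefs <- c(intercept = -4.241157, male = 0.386990, race = 0.886719,
           age = -0.000176, time.served = -0.123887,
           max.sentence = 0.080295, crime2 = 0.683714)

# Profile used in the hand calculation (terms at baseline are omitted)
profile <- c(intercept = 1, male = 1, race = 1, age = 50,
             time.served = 3, max.sentence = 12, crime2 = 1)

log_odds <- sum(coefs * profile)  # linear predictor
odds <- exp(log_odds)             # odds = exp(log-odds)
prob <- plogis(log_odds)          # same as 1 / (1 + exp(-log_odds))

round(c(odds = odds, probability = prob), 4)
```

With a fitted model object available, predict() with a one-row newdata frame would give the same probability without copying coefficients by hand.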



4 Building the Model: Building a Logistic Regression Model

4.1 [Significance] What variables are significant in this model?

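predictmd = predict(paroleglm, newdata = test, type = "response")  # predicted probabilities on the test set
x=table(actual = test$violator, predict = predictmd>=0.5)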
x
      predict
actual FALSE TRUE
     0   167   12
     1    11   12
sensitivity=12/(12+11)
specificity=167/(167+12)
accuracy=(12+167)/(12+167+12+11)
sensitivity
[1] 0.5217
specificity
[1] 0.933
accuracy
[1] 0.8861
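
Rather than retyping the counts, the three metrics can be read from the confusion matrix by name. The helper `classification_metrics` below is hypothetical and not part of the original analysis; the matrix is rebuilt from the counts shown above so the sketch runs on its own.

```r
# Counts copied from the confusion matrix above (rows = actual, cols = predicted)
cm <- matrix(c(167, 11, 12, 12), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predict = c("FALSE", "TRUE")))

# Hypothetical helper: derive the metrics from a 2x2 confusion matrix
classification_metrics <- function(cm) {
  tn <- cm["0", "FALSE"]; fp <- cm["0", "TRUE"]
  fn <- cm["1", "FALSE"]; tp <- cm["1", "TRUE"]
  c(sensitivity = tp / (tp + fn),        # true positive rate
    specificity = tn / (tn + fp),        # true negative rate
    accuracy    = (tp + tn) / sum(cm))   # share of correct predictions
}

metrics <- classification_metrics(cm)
round(metrics, 4)
```

Indexing by dimension name assumes both classes appear in the predictions; if the model never predicts TRUE, the table has only one column and the lookup fails.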

4.2 [Estimating marginal effects from regression coefficients] What can we say based on the coefficient of the multiple.offenses variable?

table(test$violator)

  0   1 
179  23 
179/(179+23)
[1] 0.8861

4.3 [Estimating odds and probability from the predicted value] According to the model, what are the odds this individual is a violator? What is the probability this individual is a violator?

library(caTools)
colAUC(predictmd, test$violator)
          [,1]
0 vs. 1 0.8946



5 Validating the Model: Evaluating the Model on the Testing Set

5.1 [Predicting probabilities on the test data] What is the maximum predicted probability of a violation?

max(predict(paroleglm,newdata = test, type = "response"))
[1] 0.9073

5.2 [Computing sensitivity, specificity, and accuracy from the confusion matrix] What is the model’s sensitivity, specificity, accuracy?

preResult <- predict(paroleglm,newdata = test, type = "response")
table(test$violator, as.numeric(preResult >= 0.5))
   
      0   1
  0 167  12
  1  11  12
12/(11+12)
[1] 0.5217
167/(167+12)
[1] 0.933
(167+12)/(167+12+11+12)
[1] 0.8861
# Use table() to build the confusion matrix,
# then compute sensitivity, specificity, and accuracy from it

5.3 [Baseline rate] What is the accuracy of a simple model that predicts that every parolee is a non-violator?

table(test$violator)

  0   1 
179  23 
179/(179+23)
[1] 0.8861
# The baseline looks only at the violator variable: predicting 0 for everyone is correct for 179 of the 202 test cases

5.4 [Adjusting the cutoff probability according to the payoff matrix] Which of the following most likely describes their preferences and best course of action?

print("The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5. ", quote = FALSE)
[1] The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5. 
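# Releasing the wrong person costs far more than detaining the wrong person, so the standard for judging whether a parolee will violate should be stricter (a lower cutoff)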
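
The effect of lowering the cutoff can be sketched on synthetic scores. Everything below is simulated for illustration (the labels, the scores, and the helper `errors_at` are assumptions), so the counts say nothing about the parole model itself; only the direction of the trade-off carries over.

```r
# Synthetic labels and scores, for illustration only (not the parole data)
set.seed(1)
actual <- rbinom(200, 1, 0.15)
score  <- plogis(rnorm(200, mean = ifelse(actual == 1, 0, -2)))

# Hypothetical helper: count both error types at a given cutoff
errors_at <- function(cutoff) {
  pred <- score >= cutoff
  c(false_negative = sum(actual == 1 & !pred),   # violators predicted safe
    false_positive = sum(actual == 0 & pred))    # non-violators flagged
}

rbind(cutoff_0.5 = errors_at(0.5), cutoff_0.3 = errors_at(0.3))
```

Lowering the cutoff can never increase the number of false negatives, but it can only keep or raise the number of false positives, which is the trade-off the board has to weigh.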

5.5 [Accuracy vs. AUC] Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model’s accuracy as compared to the simple baseline model?

print("The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value. ", quote = FALSE)
[1] The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value. 
# A different cutoff can reduce the number of violators who are wrongly released

5.6 [Computing the AUC] Using the ROCR package, what is the AUC value for the model?

library(ROCR)
pred <- prediction(preResult, test$violator)
as.numeric(performance(pred, "auc")@y.values)
[1] 0.8946

5.7 [Definition of AUC] Describe the meaning of AUC in this context.

print("The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.", quote = FALSE)
[1] The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
# The probability that the model correctly tells apart a parolee who will violate from one who will not
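
This pairwise reading of AUC can be checked directly on a toy example by comparing every violator's score with every non-violator's score. The scores below are made up for illustration, and ties are counted as half a correct ranking, which is a common convention.

```r
# Toy predicted probabilities, made up for illustration
violator_scores     <- c(0.9, 0.6, 0.4)
non_violator_scores <- c(0.7, 0.3, 0.2, 0.1)

# Compare every violator with every non-violator
wins <- outer(violator_scores, non_violator_scores, ">")
ties <- outer(violator_scores, non_violator_scores, "==")

# AUC = share of pairs in which the violator gets the higher score
auc <- (sum(wins) + 0.5 * sum(ties)) / length(wins)
auc
```

Here 10 of the 12 pairs are ranked correctly; the only misses are the two violators scored below the non-violator at 0.7.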



6 Sampling Bias: Identifying Bias in Observational Data

6.1 [How to avoid, diagnose, and correct sampling bias] How could we improve our dataset to best address selection bias?

print("We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.", quote = FALSE)
[1] We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term.
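# Handling missing outcomes:
# If a parolee's violator value is automatically filled in as 0,
# the model is biased toward overstating the chance of not violating
# If a parolee's violator value is automatically filled in as NA,
# R drops those observations automatically when fitting the model
# The best approach is to track these parolees until their parole period ends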
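
The point about NA outcomes can be checked on simulated data. The sketch below assumes R's default `na.action` setting (`na.omit`); the data frame `d` is synthetic and unrelated to the parole data.

```r
# Synthetic data, for illustration only
set.seed(2)
d <- data.frame(x = rnorm(50))
d$y <- rbinom(50, 1, plogis(d$x))
d$y[1:5] <- NA                     # pretend five outcomes are still unknown

# With the default na.action, rows with a missing outcome are dropped
fit <- glm(y ~ x, data = d, family = "binomial")
nobs(fit)                          # number of rows actually used in the fit
```

Dropping these rows avoids mislabeling them as non-violators, but the remaining sample may still be unrepresentative, which is why following parolees to the end of their term is preferable.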






