議題:議題:使用貸款人的資料,預測他會不會還款
【1.1 基礎機率】What proportion of the loans in the dataset were not paid in full?
and 'data.frame': 9578 obs. of 14 variables:
$ credit.policy : int 1 1 1 1 1 1 1 1 1 1 ...
$ purpose : chr "debt_consolidation" "credit_card" "debt_consolidation" "debt_consolidation" ...
$ int.rate : num 0.119 0.107 0.136 0.101 0.143 ...
$ installment : num 829 228 367 162 103 ...
$ log.annual.inc : num 11.4 11.1 10.4 11.4 11.3 ...
$ dti : num 19.5 14.3 11.6 8.1 15 ...
$ fico : int 737 707 682 712 667 727 667 722 682 707 ...
$ days.with.cr.line: num 5640 2760 4710 2700 4066 ...
$ revol.bal : int 28854 33623 3511 33667 4740 50807 3839 24220 69909 5630 ...
$ revol.util : num 52.1 76.7 25.6 73.2 39.5 51 76.8 68.6 51.1 23 ...
$ inq.last.6mths : int 0 0 1 1 0 0 0 0 1 1 ...
$ delinq.2yrs : int 0 0 0 0 1 0 0 0 0 0 ...
$ pub.rec : int 0 0 0 0 0 0 1 0 0 0 ...
$ not.fully.paid : int 0 0 0 0 0 0 1 1 0 0 ...
- attr(*, "spec")=List of 2
..$ cols :List of 14
.. ..$ credit.policy : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ purpose : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ int.rate : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ installment : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ log.annual.inc : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ dti : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ fico : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ days.with.cr.line: list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ revol.bal : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ revol.util : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ inq.last.6mths : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ delinq.2yrs : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ pub.rec : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ not.fully.paid : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
summary(loans)
credit.policy purpose int.rate installment
Min. :0.000 Length:9578 Min. :0.060 Min. : 15.7
1st Qu.:1.000 Class :character 1st Qu.:0.104 1st Qu.:163.8
Median :1.000 Mode :character Median :0.122 Median :268.9
Mean :0.805 Mean :0.123 Mean :319.1
3rd Qu.:1.000 3rd Qu.:0.141 3rd Qu.:432.8
Max. :1.000 Max. :0.216 Max. :940.1
log.annual.inc dti fico days.with.cr.line
Min. : 7.55 Min. : 0.00 Min. :612 Min. : 179
1st Qu.:10.56 1st Qu.: 7.21 1st Qu.:682 1st Qu.: 2820
Median :10.93 Median :12.66 Median :707 Median : 4140
Mean :10.93 Mean :12.61 Mean :711 Mean : 4562
3rd Qu.:11.29 3rd Qu.:17.95 3rd Qu.:737 3rd Qu.: 5730
Max. :14.53 Max. :29.96 Max. :827 Max. :17640
revol.bal revol.util inq.last.6mths delinq.2yrs
Min. : 0 Min. : 0.0 Min. : 0.00 Min. : 0.000
1st Qu.: 3187 1st Qu.: 22.6 1st Qu.: 0.00 1st Qu.: 0.000
Median : 8596 Median : 46.3 Median : 1.00 Median : 0.000
Mean : 16914 Mean : 46.8 Mean : 1.58 Mean : 0.164
3rd Qu.: 18250 3rd Qu.: 70.9 3rd Qu.: 2.00 3rd Qu.: 0.000
Max. :1207359 Max. :119.0 Max. :33.00 Max. :13.000
pub.rec not.fully.paid
Min. :0.000 Min. :0.00
1st Qu.:0.000 1st Qu.:0.00
Median :0.000 Median :0.00
Mean :0.062 Mean :0.16
3rd Qu.:0.000 3rd Qu.:0.00
Max. :5.000 Max. :1.00
table(loans$not.fully.paid)
0 1
8045 1533
1533/(8045+1533) #0.1600543
[1] 0.1601
【1.2 檢查缺項】Which of the following variables has at least one missing observation?
summary(loans)
credit.policy purpose int.rate installment
Min. :0.000 Length:9578 Min. :0.060 Min. : 15.7
1st Qu.:1.000 Class :character 1st Qu.:0.104 1st Qu.:163.8
Median :1.000 Mode :character Median :0.122 Median :268.9
Mean :0.805 Mean :0.123 Mean :319.1
3rd Qu.:1.000 3rd Qu.:0.141 3rd Qu.:432.8
Max. :1.000 Max. :0.216 Max. :940.1
log.annual.inc dti fico days.with.cr.line
Min. : 7.55 Min. : 0.00 Min. :612 Min. : 179
1st Qu.:10.56 1st Qu.: 7.21 1st Qu.:682 1st Qu.: 2820
Median :10.93 Median :12.66 Median :707 Median : 4140
Mean :10.93 Mean :12.61 Mean :711 Mean : 4562
3rd Qu.:11.29 3rd Qu.:17.95 3rd Qu.:737 3rd Qu.: 5730
Max. :14.53 Max. :29.96 Max. :827 Max. :17640
revol.bal revol.util inq.last.6mths delinq.2yrs
Min. : 0 Min. : 0.0 Min. : 0.00 Min. : 0.000
1st Qu.: 3187 1st Qu.: 22.6 1st Qu.: 0.00 1st Qu.: 0.000
Median : 8596 Median : 46.3 Median : 1.00 Median : 0.000
Mean : 16914 Mean : 46.8 Mean : 1.58 Mean : 0.164
3rd Qu.: 18250 3rd Qu.: 70.9 3rd Qu.: 2.00 3rd Qu.: 0.000
Max. :1207359 Max. :119.0 Max. :33.00 Max. :13.000
pub.rec not.fully.paid
Min. :0.000 Min. :0.00
1st Qu.:0.000 1st Qu.:0.00
Median :0.000 Median :0.00
Mean :0.062 Mean :0.16
3rd Qu.:0.000 3rd Qu.:0.00
Max. :5.000 Max. :1.00
#log.annual.inc,days.with.cr.line,revol.until,inq.last.6mths,delinq.2yrs,pub.rec
【1.3 決定是否要補缺項】Which of the following is the best reason to fill in the missing values for these variables instead of removing observations with missing data?
missing = subset(loans,is.na(log.annual.inc)| is.na(days.with.cr.line)| is.na(revol.util)| is.na(inq.last.6mths)| is.na(delinq.2yrs)| is.na(pub.rec))
nrow(missing)
[1] 0
table(missing$not.fully.paid)
< table of extent 0 >
12/62 #0.19
[1] 0.1935
#We want to be able to predict risk for all borrowers, instead of just the ones with all data reported
【1.4 補缺項工具】What best describes the process we just used to handle missing values?
library(mice)
set.seed(144)
vars.for.imputation = setdiff(names(loans), "not.fully.paid")
imputed = complete(mice(loans[vars.for.imputation]))
iter imp variable
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
3 1
3 2
3 3
3 4
3 5
4 1
4 2
4 3
4 4
4 5
5 1
5 2
5 3
5 4
5 5
Number of logged events: 1
loans[vars.for.imputation] = imputed
#We predicted missing variable values using the available independent variables for each observation.
【2.1 顯著性】Which independent variables are significant in our model?
library(caTools)
set.seed(144)
split=sample.split(loans$not.fully.paid,SplitRatio=0.7)
train=subset(loans,split==TRUE)
test=subset(loans,split==FALSE)
model1=glm(not.fully.paid~.,data=train,family=binomial)
summary(model1)
Call:
glm(formula = not.fully.paid ~ ., family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.172 -0.620 -0.495 -0.361 2.635
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.23868531 1.55340639 5.95 0.0000000027 ***
credit.policy -0.35962512 0.10049091 -3.58 0.00035 ***
purposecredit_card -0.61018465 0.13434286 -4.54 0.0000055724 ***
purposedebt_consolidation -0.31801924 0.09181777 -3.46 0.00053 ***
purposeeducational 0.13933851 0.17520994 0.80 0.42646
purposehome_improvement 0.17762530 0.14786550 1.20 0.22965
purposemajor_purchase -0.47785935 0.20067564 -2.38 0.01725 *
purposesmall_business 0.41640011 0.14181047 2.94 0.00332 **
int.rate 0.69224856 2.08421219 0.33 0.73978
installment 0.00126980 0.00020911 6.07 0.0000000013 ***
log.annual.inc -0.43103508 0.07138624 -6.04 0.0000000016 ***
dti 0.00449540 0.00549797 0.82 0.41356
fico -0.00940183 0.00170766 -5.51 0.0000000368 ***
days.with.cr.line 0.00000413 0.00001582 0.26 0.79406
revol.bal 0.00000300 0.00000116 2.58 0.01001 *
revol.util 0.00176946 0.00152968 1.16 0.24737
inq.last.6mths 0.07630349 0.01552192 4.92 0.0000008840 ***
delinq.2yrs -0.08756530 0.06557839 -1.34 0.18179
pub.rec 0.31888334 0.11367874 2.81 0.00503 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5896.6 on 6704 degrees of freedom
Residual deviance: 5489.5 on 6686 degrees of freedom
AIC: 5527
Number of Fisher Scoring iterations: 5
【2.2 從回歸係數估計邊際效用】Consider two loan applications, which are identical other than the fact that the borrower in Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710. What is the value of Logit(A) - Logit(B)? What is the value of O(A)/O(B)?
# the difference of logits
-0.00929*(-10) #0.0929
[1] 0.0929
# the ratio of odds
exp(0.0929) #1.097352
[1] 1.097
【2.3 混淆矩陣、正確率 vs 底線機率】What is the accuracy of the logistic regression model? What is the accuracy of the baseline model?
# test accuracy
test$predicted.risk=predict(model1,type="response",newdata=test)
table(test$not.fully.paid,test$predicted.risk>0.5)
FALSE TRUE
0 2401 12
1 457 3
(2400+3)/(2400+13+457+3) #0.8364
[1] 0.8364
# baseline accuracy
table(test$not.fully.paid)
0 1
2413 460
(2413)/(2413+460) #0.8398886
[1] 0.8399
【2.4 ROC & AUC】Use the ROCR package to compute the test set AUC.
# test
# baseline accuracy
library(ROCR)
ROCRpredTest=prediction(test$predicted.risk,test$not.fully.paid)
auc=as.numeric(performance(ROCRpredTest,"auc")@y.values);auc #0.6718878
[1] 0.6718
【3.1 高底線模型】The variable int.rate is highly significant in the bivariate model, but it is not significant at the 0.05 level in the model trained with all the independent variables. What is the most likely explanation for this difference?
summary(model1)
Call:
glm(formula = not.fully.paid ~ ., family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.172 -0.620 -0.495 -0.361 2.635
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.23868531 1.55340639 5.95 0.0000000027 ***
credit.policy -0.35962512 0.10049091 -3.58 0.00035 ***
purposecredit_card -0.61018465 0.13434286 -4.54 0.0000055724 ***
purposedebt_consolidation -0.31801924 0.09181777 -3.46 0.00053 ***
purposeeducational 0.13933851 0.17520994 0.80 0.42646
purposehome_improvement 0.17762530 0.14786550 1.20 0.22965
purposemajor_purchase -0.47785935 0.20067564 -2.38 0.01725 *
purposesmall_business 0.41640011 0.14181047 2.94 0.00332 **
int.rate 0.69224856 2.08421219 0.33 0.73978
installment 0.00126980 0.00020911 6.07 0.0000000013 ***
log.annual.inc -0.43103508 0.07138624 -6.04 0.0000000016 ***
dti 0.00449540 0.00549797 0.82 0.41356
fico -0.00940183 0.00170766 -5.51 0.0000000368 ***
days.with.cr.line 0.00000413 0.00001582 0.26 0.79406
revol.bal 0.00000300 0.00000116 2.58 0.01001 *
revol.util 0.00176946 0.00152968 1.16 0.24737
inq.last.6mths 0.07630349 0.01552192 4.92 0.0000008840 ***
delinq.2yrs -0.08756530 0.06557839 -1.34 0.18179
pub.rec 0.31888334 0.11367874 2.81 0.00503 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5896.6 on 6704 degrees of freedom
Residual deviance: 5489.5 on 6686 degrees of freedom
AIC: 5527
Number of Fisher Scoring iterations: 5
model2=glm(not.fully.paid~int.rate,data=train,family=binomial)
summary(model2)
Call:
glm(formula = not.fully.paid ~ int.rate, family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.055 -0.627 -0.544 -0.436 2.291
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.673 0.169 -21.8 <2e-16 ***
int.rate 15.921 1.270 12.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 5896.6 on 6704 degrees of freedom
Residual deviance: 5734.8 on 6703 degrees of freedom
AIC: 5739
Number of Fisher Scoring iterations: 4
#int.rate is correlated with other risk-related variables, and therefore does not incrementally improve the model when those other variables are included
【3.2 高底線模型的預測值】What is the highest predicted probability of a loan not being paid in full on the testing set? With a logistic regression cutoff of 0.5, how many loans would be predicted as not being paid in full on the testing set?
predictTest=predict(model2,type="response",newdata=test)
summary(predictTest) #0.42662 #0
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.062 0.116 0.151 0.160 0.189 0.427
【3.3 高底線模型的辨識率】What is the test set AUC of the bivariate model?
ROCRpredTest2=prediction(predictTest,test$not.fully.paid)
as.numeric(performance(ROCRpredTest2,"auc")@y.values)
[1] 0.6239
#0.6239081
【4.1 投資價值的算法】How much does a $10 investment with an annual interest rate of 6% pay back after 3 years, using continuous compounding of interest?
#c=10,r=0.06,t=3
10*exp(0.06*3) #11.97
[1] 11.97
【4.2 投資獲利的算法,合約完成】While the investment has value c * exp(rt) dollars after collecting interest, the investor had to pay $c for the investment. What is the profit to the investor if the investment is paid back in full?
#c * exp(rt) - c
【4.3 投資獲利的算法,違約】Now, consider the case where the investor made a $c investment, but it was not paid back in full. Assume, conservatively, that no money was received from the borrower (often a lender will receive some but not all of the value of the loan, making this a pessimistic assumption of how much is received). What is the profit to the investor in this scenario?
# -c
【5.1 計算測試資料的實際投報率】What is the maximum profit of a $10 investment in any loan in the testing set?
test$profit = exp(test$int.rate*3) - 1
test$profit[test$not.fully.paid == 1] = -1
summary(test$profit)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.000 0.286 0.411 0.209 0.498 0.889
#0.8895*10=8.895
A simple investment strategy of equally investing in all the loans would yield profit $20.94 for a $100 investment. But this simple investment strategy does not leverage the prediction model we built earlier in this problem.
【6.1 高利率、高風險】What is the average profit of a $1 investment in one of these high-interest loans (do not include the $ sign in your answer)? What proportion of the high-interest loans were not paid back in full?
highinterest=subset(test,int.rate>=0.15)
summary(highinterest$profit) #0.2251
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.000 -1.000 0.599 0.225 0.638 0.889
table(highinterest$not.fully.paid)
0 1
327 110
110/(327+110) #0.2517162
[1] 0.2517
【6.2 高利率之中的低風險】What is the profit of the investor, who invested $1 in each of these 100 loans? How many of 100 selected loans were not paid back in full?
cutoff = sort(highinterest$predicted.risk, decreasing=FALSE)[100]
selectedLoans=subset(highinterest,predicted.risk<=cutoff)
sum(selectedLoans$profit) #31.27825
[1] 31.23
table(selectedLoans$not.fully.paid) #19
0 1
81 19
【Q】利用我們建好的模型,你可以設計出比上述的方法獲利更高的投資方法嗎?請詳述你的作法?
highinterest1=subset(test,int.rate>=0.14)
summary(highinterest1)
cutoff1 = sort(highinterest1$predicted.risk, decreasing=FALSE)[100]
selectedLoans1=subset(highinterest1,predicted.risk<=cutoff1)
sum(selectedLoans1$profit) #34.67
table(selectedLoans1$not.fully.paid) #14
# 一開始, 我們認為選擇高的期望利率,會有高的報酬,但發現,提高利率之後,報酬不增反減,因為沒有履約償還的人數變多了,於是,我們嘗試將利率降低,雖然期望利率變低了,但發現不履約的人減少,因而獲利增加