A digital arm of a bank faces challenges with lead conversions. The primary objective of this division is to increase customer acquisition through digital channels. The division was set up a few years back and the primary focus of the division over these years has been to increase the number of leads getting into the conversion funnel.
They source leads through various channels like search, display, email campaigns and via affiliate partners. As expected, they see differential conversion depending on the sources and the quality of these leads.
They now want to identify the leads’ segments having a higher conversion ratio (lead to buying a product) so that they can specifically target these potential customers through additional channels and re-marketing. They have provided a partial data set for salaried customers from the last 3 months. They also capture basic details about customers. We need to identify the segment of customers with a high probability of conversion in the next 30 days.
| ID | Unique ID (can not be used for predictions) |
|---|---|
| Gender | Sex of the applicant |
| DOB | Date of Birth of the applicant |
| Lead_Creation_Date | Date on which Lead was created |
| City_Code | Anonymised Code for the City |
| City_Category | Anonymised City Feature |
| Employer_Code | Anonymised Code for the Employer |
| Employer_Category1 | Anonymised Employer Feature |
| Employer_Category2 | Anonymised Employer Feature |
| Monthly_Income | Monthly Income in Dollars |
| Customer_Existing_Primary_Bank_Code | Anonymised Customer Bank Code |
| Primary_Bank_Type | Anonymised Bank Feature |
| Contacted | Contact Verified (Y/N) |
| Source | Categorical Variable representing source of lead |
| Source_Category | Type of Source |
| Existing_EMI | EMI of Existing Loans in Dollars |
| Loan_Amount | Loan Amount Requested |
| Loan_Period | Loan Period (Years) |
| Interest_Rate | Interest Rate of Submitted Loan Amount |
| EMI | EMI of Requested Loan Amount in dollars |
| Var1 | Categorical variable with multiple levels |
| Approved | (Target) Whether a loan is Approved or not (0/1) |
## Approved Rejected
## 1020 68693
## [1] 0.01463142
From the Approval ratio, we notice that the target is imbalanced.
##train number of records:
nrow(trainset)
## [1] 34856
summary(trainset$Approved_level)
## Approved Rejected
## 515 34341
##test number of records:
nrow(testset)
## [1] 17428
summary(testset$Approved_level)
## Approved Rejected
## 269 17159
##holdout number of records:
nrow(holdout)
## [1] 17429
summary(holdout$Approved_level)
## Approved Rejected
## 236 17193
As in Appendix A.
Loan_Amount, Loan_Period, Interest_Rate, and EMI have very high missing rate, so let’s look at these four variables first.
I used histogram, boxplot, bivariate chart and WOE binning for attribute analysis and transformation
WOE=ln(\(\frac{p(non-event)}{p(event)}\))
IV=\(\sum_{i=1}^n (DistributionGood-DistributionBad)\)*WOE
Loan Amount
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5000 20000 30000 39420 50000 300000 13828
## $WOE
## # A tibble: 4 x 16
## grp n_obs mean~ sum_~ sum_~ sum_w mean_y1 mean_~ Lowe~ Upper~ woe
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 12519 21226 199 185 12519 0.0159 0.0148 5000 36000 0.0731
## 2 2 2579 42155 27.0 38.1 2579 0.0105 0.0148 37000 49000 -0.345
## 3 3 5930 76649 197 87.6 5930 0.0332 0.0148 50000 300000 0.810
## 4 Missi~ 13828 NA 92.0 204 13828 0.00665 0.0148 NA NA -0.798
## # ... with 5 more variables: ks <dbl>, info <dbl>, grp.n <dbl>, grp.f
## # <fctr>, plot.width <dbl>
##
## $Stat
## MAX_KS Info.Value Trend.Estimate Trend.Pr(>|t|)
## 2.180771e+01 3.555042e-01 1.290376e-05 4.008000e-01
##
## $WOE_Code
## [1] "train$w_Loan_Amount = train$Loan_Amount"
## [2] "g_Loan_Amount = c(5000, 37000, 50000, 3e+05, NA)"
## [3] "train$w_Loan_Amount[findInterval(train$Loan_Amount, g_Loan_Amount, rightmost.closed=TRUE) == 1] = 0.073115722598372"
## [4] "train$w_Loan_Amount[findInterval(train$Loan_Amount, g_Loan_Amount, rightmost.closed=TRUE) == 2] = -0.344506472743474"
## [5] "train$w_Loan_Amount[findInterval(train$Loan_Amount, g_Loan_Amount, rightmost.closed=TRUE) == 3] = 0.810237903879759"
## [6] "train$w_Loan_Amount[findInterval(train$Loan_Amount, g_Loan_Amount, rightmost.closed=TRUE) == 4] = -0.797848556858933"
## [7] "train$w_Loan_Amount[is.na(train$Loan_Amount) | is.infinite(train$Loan_Amount) | is.nan(train$Loan_Amount)] = -0.797848556858933"
##
## $Plot
train$Loan_Amount_flag<-as.factor(ifelse(is.na(train$Loan_Amount),1,0))
train$Loan_Amount_Impute_F<-coalesce(as.numeric(train$Loan_Amount),-1)
train$M_Loan_Amount_F<-as.factor(ifelse(train$Loan_Amount_Impute_F<50000&train$Loan_Amount_Impute_F>0,0,ifelse(train$Loan_Amount_Impute_F>0,1,-1)))
train$Loan_Amount_Impute_C<-coalesce(as.numeric(train$Loan_Amount),0)
train%>%group_by(M_Loan_Amount_F)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 3 x 4
## M_Loan_Amount_F approval cnt ratio
## <fctr> <int> <int> <dbl>
## 1 -1 92 13828 0.00665
## 2 0 226 15098 0.0150
## 3 1 197 5930 0.0332
Loan_Period
train$Loan_Period_flag<-as.factor(ifelse(is.na(train$Loan_Period),1,0))
train$Loan_Period_Impute<-(coalesce(as.numeric(train$Loan_Period),-1))
train$M_Loan_Period_F<-as.factor(train$Loan_Period_Impute)
# woe.binning(train,'Approved','Loan_Period_Impute')
train%>%group_by(Loan_Period_flag)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 2 x 4
## Loan_Period_flag approval cnt ratio
## <fctr> <int> <int> <dbl>
## 1 0 423 21028 0.0201
## 2 1 92 13828 0.00665
Interest Rate
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 11.99 15.10 18.00 19.23 20.00 37.00 23737
## $WOE
## # A tibble: 5 x 16
## grp n_obs mean_x sum_y1 sum_~ sum_w mean_y1 mean_~ Lowe~ Uppe~ woe
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2781 14.1 141 41.1 2781 5.07e-2 0.0148 12.0 15.1 1.23
## 2 2 3629 16.6 74.0 53.6 3629 2.04e-2 0.0148 15.2 18.2 0.322
## 3 3 3543 21.4 18.0 52.3 3543 5.08e-3 0.0148 18.4 31.0 -1.07
## 4 4 1166 33.0 1.00 17.2 1166 8.58e-4 0.0148 31.5 37.0 -2.85
## 5 Missi~ 23737 NA 281 351 23737 1.18e-2 0.0148 NA NA -0.222
## # ... with 5 more variables: ks <dbl>, info <dbl>, grp.n <dbl>, grp.f
## # <fctr>, plot.width <dbl>
##
## $Stat
## MAX_KS Info.Value Trend.Estimate Trend.Pr(>|t|)
## 23.35762560 0.44285020 -0.22203655 0.02796948
##
## $WOE_Code
## [1] "train$w_Interest_Rate = train$Interest_Rate"
## [2] "g_Interest_Rate = c(11.99, 15.25, 18.4, 31.5, 37, NA)"
## [3] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 1] = 1.23300770327775"
## [4] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 2] = 0.322166353124642"
## [5] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 3] = -1.06754367983577"
## [6] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 4] = -2.8465206997769"
## [7] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 5] = -0.221621954713664"
## [8] "train$w_Interest_Rate[is.na(train$Interest_Rate) | is.infinite(train$Interest_Rate) | is.nan(train$Interest_Rate)] = -0.221621954713664"
##
## $Plot
train$Interest_Rate_flag<-as.factor(ifelse(is.na(train$Interest_Rate),1,0))
mean((train[!is.na(train$Interest_Rate),]$Interest_Rate))
## [1] 19.23383
quantile(train[!is.na(train$Interest_Rate),]$Interest_Rate,0.95)
## 95%
## 31.5
train$Interest_Rate_Impute<-coalesce(as.numeric(train$Interest_Rate),19.23383)
bivar.plot(train,'Interest_Rate_Impute','Approved')
train$Interest_Rate_Impute_grp1<-ifelse(train$Interest_Rate_Impute<=15,15-train$Interest_Rate_Impute,0)
train$Interest_Rate_Impute_grp2<-ifelse(train$Interest_Rate_Impute>15&train$Interest_Rate_Impute<=20,train$Interest_Rate_Impute-15,0)
train$Interest_Rate_Impute_grp3<-ifelse(train$Interest_Rate_Impute>20&train$Interest_Rate_Impute<=30,train$Interest_Rate_Impute-20,0)
train$Interest_Rate_Impute_grp4<-ifelse(train$Interest_Rate_Impute>30,train$Interest_Rate_Impute-30,0)
EMI (Equated Monthly Installment)
EMI=\(\frac{r*Lan_Amount}{1-(1+r)^(-n)}\)
EMI is highly correlated with Interest Rate and Loan Amount
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 119 654 937 1102 1297 13560 23737
## 99%
## 8.318254
## 1%
## 5.488938
train$EMI_flag<-as.factor(ifelse(is.na(train$EMI),1,0))
bivar.plot(train,'EMI_log_capped','Approved')
train$M_EMI<-coalesce(as.numeric(train$EMI_log_capped),0)
bivar.plot(train,'M_EMI','Approved')
train$M_EMI<-ifelse(train$M_EMI<6,1,
ifelse(train$M_EMI<7.5,2,3))
train%>%group_by(M_EMI)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 3 x 4
## M_EMI approval cnt ratio
## <dbl> <int> <int> <dbl>
## 1 1.00 293 24829 0.0118
## 2 2.00 157 8766 0.0179
## 3 3.00 65 1261 0.0515
For high missing rate attributes, I lean towards excluding them from regression model. However, I will keep them as potential candidates and compare models with or without these high missing rate variables.
Convert categorical variables to factors
train$M_City_Category_F<-as.factor(train$City_Category)
train$M_Employer_Category1_F<-as.factor(train$Employer_Category1)
train$M_Employer_Category2_F<-as.factor(train$Employer_Category2)
train$M_Primary_Bank_Type_F<-as.factor(train$Primary_Bank_Type)
train$M_Contacted_F<-as.factor(train$Contacted)
train$M_Gender_F<-as.factor(train$Gender)
train$M_Source_Category_F<-as.factor(train$Source_Category)
## # A tibble: 5 x 4
## M_Var1_F approval cnt ratio
## <fctr> <int> <int> <dbl>
## 1 0 57 11637 0.00490
## 2 2 21 6696 0.00314
## 3 4 33 3885 0.00849
## 4 7 113 5976 0.0189
## 5 10 291 6662 0.0437
Correlation
## 'data.frame': 34856 obs. of 9 variables:
## $ Monthly_Income : num 3500 2250 3500 7000 2500 ...
## $ Existing_EMI : num 0 0 0 0 0 0 255 1000 0 0 ...
## $ Interest_Rate : num 13.2 NA NA NA 20 ...
## $ Employer_Category2: int 1 4 4 4 4 4 4 4 4 4 ...
## $ Loan_Amount : int 20000 45000 92000 NA 66000 NA 30000 45000 74000 NA ...
## $ Loan_Period : int 2 4 5 NA 5 NA 5 5 5 NA ...
## $ EMI : int 953 NA NA NA 1749 NA 722 NA NA NA ...
## $ Var1 : int 10 0 7 0 7 0 0 10 10 10 ...
## $ Approved : int 0 0 0 0 0 0 0 0 0 0 ...
## [1] 3
Monthly Income
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1650 2500 6278 4000 38380000
## 99.5%
## 10.59666
## 5%
## 6.908755
train$Monthly_Income_flag<-as.factor(ifelse(is.na(train$Monthly_Income),1,0))
train$Monthly_Income_log_capped<-pmin(pmax(train$Monthly_Income_log,6.908755),10.59666 )
bivar.plot(train,'Monthly_Income_log_capped','Approved')
WOE_numeric_split(x='Monthly_Income_log_capped',y1='Approved',y0='one',data=train,group=20)
## $WOE
## # A tibble: 8 x 14
## grp n_obs mean_x sum_y1 sum_y0 sum_w mean_y1 mean~ Lowe~ Uppe~ woe
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 11906 7.24 3.00 11906 11906 2.52e-4 1.00 6.91 7.60 -4.07
## 2 2 2027 7.60 8.00 2027 2027 3.95e-3 1.00 7.60 7.63 -1.32
## 3 3 6691 7.79 55.0 6691 6691 8.22e-3 1.00 7.63 7.94 -0.586
## 4 4 3659 8.03 57.0 3659 3659 1.56e-2 1.00 7.94 8.16 0.0529
## 5 5 3379 8.26 79.0 3379 3379 2.34e-2 1.00 8.16 8.41 0.459
## 6 6 3675 8.55 143 3675 3675 3.89e-2 1.00 8.41 8.73 0.968
## 7 7 1770 8.91 99.0 1770 1770 5.59e-2 1.00 8.73 9.14 1.33
## 8 8 1749 9.66 71.0 1749 1749 4.06e-2 1.00 9.15 10.6 1.01
## # ... with 3 more variables: ks <dbl>, info <dbl>, plot.width <dbl>
##
## $Stat
## MAX_KS Info.Value Trend.Estimate Trend.Pr(>|t|)
## 46.353619106 1.943089527 2.802307536 0.004907191
##
## $WOE_Code
## [1] "train$w_Monthly_Income_log_capped = train$Monthly_Income_log_capped"
## [2] "g_Monthly_Income_log_capped = c(6.908755, 7.60140233458373, 7.63094658089046, 7.93773177526011, 8.16080392095467, 8.41205487329293, 8.73246584834988, 9.14857134557855, 10.59666)"
## [3] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 1] = -4.07137179791168"
## [4] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 2] = -1.32005693763862"
## [5] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 3] = -0.586371766416804"
## [6] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 4] = 0.0529197699488852"
## [7] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 5] = 0.458926432886273"
## [8] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 6] = 0.968349885746512"
## [9] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 7] = 1.33119869170026"
## [10] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 8] = 1.01069406915626"
##
## $Plot
train$M_Monthly_Income<-ifelse(train$Monthly_Income_log_capped<7.630655,-2.92930647,
ifelse(train$Monthly_Income_log_capped<7.937160,-0.58637177,
ifelse(train$Monthly_Income_log_capped<8.160261,0.05291977,
ifelse(train$Monthly_Income_log_capped<8.409808,0.45892643,1.08001423))))
train%>%group_by(M_Monthly_Income)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 5 x 4
## M_Monthly_Income approval cnt ratio
## <dbl> <int> <int> <dbl>
## 1 -2.93 11 13932 0.000790
## 2 -0.586 55 6691 0.00822
## 3 0.0529 57 3659 0.0156
## 4 0.459 79 3379 0.0234
## 5 1.08 313 7195 0.0435
Existing EMI
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 353.6 350.0 62630.0 30
## 99%
## 8.2943
## 1%
## 0
train$Existing_EMI_flag<-as.factor(ifelse(is.na(train$Existing_EMI),1,0))
bivar.plot(train,'Existing_EMI_log_capped','Approved')
WOE_numeric_split(x='Existing_EMI_log_capped',y1='Approved',y0='one',data=train,group=10)
## $WOE
## # A tibble: 3 x 14
## grp n_obs mean_x sum_y1 sum_y0 sum_w mean_~ mean~ Lower~ Upper~ woe
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 27350 0.788 288 27350 27350 0.0105 1.00 0 6.22 -0.339
## 2 2 3441 6.46 60.0 3441 3441 0.0174 1.00 6.22 6.91 0.166
## 3 3 4065 7.47 167 4065 4065 0.0411 1.00 6.91 8.31 1.02
## # ... with 3 more variables: ks <dbl>, info <dbl>, plot.width <dbl>
##
## $Stat
## MAX_KS Info.Value Trend.Estimate Trend.Pr(>|t|)
## 22.5433573 0.2916497 0.1615363 0.2360679
##
## $WOE_Code
## [1] "train$w_Existing_EMI_log_capped = train$Existing_EMI_log_capped"
## [2] "g_Existing_EMI_log_capped = c(0, 6.21660610108486, 6.90875477931522, 8.305953)"
## [3] "train$w_Existing_EMI_log_capped[findInterval(train$Existing_EMI_log_capped, g_Existing_EMI_log_capped, rightmost.closed=TRUE) == 1] = -0.338697660580639"
## [4] "train$w_Existing_EMI_log_capped[findInterval(train$Existing_EMI_log_capped, g_Existing_EMI_log_capped, rightmost.closed=TRUE) == 2] = 0.165640823558162"
## [5] "train$w_Existing_EMI_log_capped[findInterval(train$Existing_EMI_log_capped, g_Existing_EMI_log_capped, rightmost.closed=TRUE) == 3] = 1.02263845756839"
##
## $Plot
# woe.binning(train,'Approved','Existing_EMI_log_capped')
train$M_Existing_EMI<-ifelse(train$Existing_EMI_log_capped<6.215408,-0.3386977,
ifelse(train$Existing_EMI_log_capped<6.907455,0.1656408,1.0226385))
train%>%group_by(M_Existing_EMI)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 3 x 4
## M_Existing_EMI approval cnt ratio
## <dbl> <int> <int> <dbl>
## 1 -0.339 288 27350 0.0105
## 2 0.166 60 3440 0.0174
## 3 1.02 167 4066 0.0411
For Existing EMI, try continuous form as well.
Date Attributes: DOB and Lead_Date
train$DOB<- as.Date(train$DOB,
format = "%d/%m/%y")
train$Lead_Creation_Date<- as.Date(train$Lead_Creation_Date,
format = "%d/%m/%y")
train$M_DOB_Year<-year(train$DOB)
train$M_DOB_Year<-pmin(coalesce(train$M_DOB_Year,2016),2016-21)
train$M_DOB_Month<-month(train$DOB)
train$M_Lead_Year<-pmin(year(train$Lead_Creation_Date),2016)
train$M_Lead_Month<-month(train$Lead_Creation_Date)
train$M_Lead_Month<-as.factor(train$M_Lead_Month)
train$M_DOB_Month<-as.factor(train$M_DOB_Month)
As in Appendix B.
Decision tree cannot split trees on imbalanced data
## Call:
## rpart(formula = fm, data = train, method = "class")
## n= 34856
##
## CP nsplit rel error xerror xstd
## 1 0 0 1 0 0
##
## Node number 1: 34856 observations
## predicted class=0 expected loss=0.01477507 P(node) =1
## class counts: 34341 515
## probabilities: 0.985 0.015
library(ROSE)
fm<-as.formula(Approved ~ M_Gender_F+M_City_Category_F
+M_Employer_Category1_F+M_Employer_Category2_F
+M_Monthly_Income
+M_Primary_Bank_Type_F+M_Contacted_F+M_Source_Category_F
+M_Existing_EMI+Var1
+Interest_Rate
+EMI
+Loan_Amount
+Loan_Period
)
data.rose <- ROSE(fm, data = train, seed = 1)$data
model6<-as.formula(Approved~M_Gender_F
+M_City_Category_F.B
+M_Employer_Category1_F.A
+M_Primary_Bank_Type_F.
+M_Primary_Bank_Type_F.G
# +M_Existing_EMI
+Existing_EMI_log_capped
+Interest_Rate_Impute_grp2
# +Monthly_Income_log_capped
+M_Monthly_Income
+M_Var1_F.10
)
logi.model<-glm(model6, data=train3,
family = binomial(link = "logit"))
summary(logi.model)
##
## Call:
## glm(formula = model6, family = binomial(link = "logit"), data = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6286 -0.2002 -0.0977 -0.0440 3.9444
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.07798 0.14766 -27.616 < 2e-16 ***
## M_Gender_FMale 0.32288 0.11075 2.915 0.003551 **
## M_City_Category_F.B 0.46428 0.15156 3.063 0.002189 **
## M_Employer_Category1_F.A -0.34671 0.09768 -3.549 0.000386 ***
## M_Primary_Bank_Type_F. -1.83720 0.32367 -5.676 1.38e-08 ***
## M_Primary_Bank_Type_F.G -0.49143 0.12169 -4.038 5.38e-05 ***
## Existing_EMI_log_capped 0.07887 0.01322 5.965 2.45e-09 ***
## Interest_Rate_Impute_grp2 -0.11097 0.02434 -4.559 5.14e-06 ***
## M_Monthly_Income 0.81680 0.06238 13.093 < 2e-16 ***
## M_Var1_F.10 0.29632 0.10427 2.842 0.004486 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5363.6 on 34855 degrees of freedom
## Residual deviance: 4531.3 on 34846 degrees of freedom
## AIC: 4551.3
##
## Number of Fisher Scoring iterations: 9
test3$pred<-round(predict(logi.model,newdata=test3,type='response'))
confusionMatrix(test3$pred,test3$Approved)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17159 269
## 1 0 0
##
## Accuracy : 0.9846
## 95% CI : (0.9826, 0.9863)
## No Information Rate : 0.9846
## P-Value [Acc > NIR] : 0.5162
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9846
## Neg Pred Value : NaN
## Prevalence : 0.9846
## Detection Rate : 0.9846
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
library(dismo)
data(train4)
set.seed(12345)
model.fixed <- gbm.fixed(data=train4, gbm.x = (2:10), gbm.y = 1,family = "bernoulli", tree.complexity = 5, learning.rate = 0.01, bag.fraction = 0.5, n.trees = 500)
## [1] fitting gbm model with a fixed number of 500 trees for Approved
## [1] total deviance = 5363.61
## [1] residual deviance = 4424.44
names(model.fixed$gbm.call)[1] <- "dataframe"
find.int <- gbm.interactions(model.fixed)
# find.int$interactions
find.int$rank.list
## var1.index var1.names var2.index var2.names
## 1 8 M_Monthly_Income 4 M_Primary_Bank_Type_F.
## 2 9 M_Var1_F.10 6 Existing_EMI_log_capped
## 3 6 Existing_EMI_log_capped 3 M_Employer_Category1_F.A
## 4 6 Existing_EMI_log_capped 5 M_Primary_Bank_Type_F.G
## int.size
## 1 44.31
## 2 5.59
## 3 5.50
## 4 4.58
Stepwise forward selection to get interaction terms
model8<-as.formula(Approved ~ M_Monthly_Income + M_Primary_Bank_Type_F. + M_Primary_Bank_Type_F.G +
Interest_Rate_Impute_grp2 + Existing_EMI_log_capped + M_Employer_Category1_F.A +
M_Gender_F + M_City_Category_F.B + M_Var1_F.10 + M_Primary_Bank_Type_F.:Interest_Rate_Impute_grp2 +
M_Monthly_Income:M_Gender_F
)
logi.model<-glm(model8, data=train3,
family = binomial(link = "logit"))
summary(logi.model)
##
## Call:
## glm(formula = model8, family = binomial(link = "logit"), data = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6072 -0.2028 -0.0926 -0.0329 4.1191
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -4.22523 0.16561
## M_Monthly_Income 1.02022 0.11722
## M_Primary_Bank_Type_F. -0.73075 0.41516
## M_Primary_Bank_Type_F.G -0.48166 0.12165
## Interest_Rate_Impute_grp2 -0.10147 0.02456
## Existing_EMI_log_capped 0.07727 0.01322
## M_Employer_Category1_F.A -0.35839 0.09793
## M_Gender_FMale 0.50177 0.13944
## M_City_Category_F.B 0.45682 0.15154
## M_Var1_F.10 0.31156 0.10452
## M_Primary_Bank_Type_F.:Interest_Rate_Impute_grp2 -0.46718 0.17037
## M_Monthly_Income:M_Gender_FMale -0.29924 0.13262
## z value Pr(>|z|)
## (Intercept) -25.514 < 2e-16 ***
## M_Monthly_Income 8.703 < 2e-16 ***
## M_Primary_Bank_Type_F. -1.760 0.078380 .
## M_Primary_Bank_Type_F.G -3.959 7.51e-05 ***
## Interest_Rate_Impute_grp2 -4.131 3.61e-05 ***
## Existing_EMI_log_capped 5.845 5.06e-09 ***
## M_Employer_Category1_F.A -3.660 0.000253 ***
## M_Gender_FMale 3.598 0.000320 ***
## M_City_Category_F.B 3.014 0.002574 **
## M_Var1_F.10 2.981 0.002874 **
## M_Primary_Bank_Type_F.:Interest_Rate_Impute_grp2 -2.742 0.006104 **
## M_Monthly_Income:M_Gender_FMale -2.256 0.024045 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5363.6 on 34855 degrees of freedom
## Residual deviance: 4517.3 on 34844 degrees of freedom
## AIC: 4541.3
##
## Number of Fisher Scoring iterations: 10
test3$pred<-round(predict(logi.model,newdata=test3,type='response'))
confusionMatrix(test3$pred,test3$Approved)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17159 269
## 1 0 0
##
## Accuracy : 0.9846
## 95% CI : (0.9826, 0.9863)
## No Information Rate : 0.9846
## P-Value [Acc > NIR] : 0.5162
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9846
## Neg Pred Value : NaN
## Prevalence : 0.9846
## Detection Rate : 0.9846
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
Accuracy=ln(\(\frac{TP+TN}{P+N}\))
train3$predlink<-round(predict(logi.model,newdata=train3,type='link'))
summary(exp(train3$predlink))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000167 0.0009119 0.0067380 0.0160600 0.0183200 0.1353000
train3$pred<-ifelse(exp(train3$predlink)>0.045,1,0)
test3$predlink<-round(predict(logi.model,newdata=test3,type='link'))
test3$pred<-ifelse(exp(test3$predlink)>0.045,1,0)
pred_prob<-round(predict(logi.model,newdata=test3,type='link'))
library(car)
vif(lm(model6, data=train3))
## M_Gender_F M_City_Category_F.B
## 1.215335 1.022013
## M_Employer_Category1_F.A M_Primary_Bank_Type_F.
## 1.083169 1.111446
## M_Primary_Bank_Type_F.G Existing_EMI_log_capped
## 1.171321 1.080049
## Interest_Rate_Impute_grp2 M_Monthly_Income
## 1.140728 1.512113
## M_Var1_F.10
## 1.435848
GAINS.CHART(testset$Approved,exp(pred_prob),n.rank=10)
## $gainschart
## Rank Actual_Sum Actual_Mean Pred_Sum Pred_Mean Count
## 1 1 0 0.000000000 1.134770 0.0002801901 4050
## 2 2 6 0.002824859 1.936837 0.0009118820 2124
## 3 3 6 0.002700270 5.507787 0.0024787522 2222
## 4 4 27 0.011455240 15.881341 0.0067379470 2357
## 5 5 67 0.020483033 59.910455 0.0183156389 3271
## 6 6 122 0.041105121 147.768019 0.0497870684 2968
## 7 7 41 0.094036697 59.006183 0.1353352832 436
##
## $ks
## [1] 42.64
##
## $gini
## [1] 30.93
confusionMatrix(test3$pred,test3$Approved)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13918 106
## 1 3241 163
##
## Accuracy : 0.808
## 95% CI : (0.802, 0.8138)
## No Information Rate : 0.9846
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0619
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.81112
## Specificity : 0.60595
## Pos Pred Value : 0.99244
## Neg Pred Value : 0.04788
## Prevalence : 0.98457
## Detection Rate : 0.79860
## Detection Prevalence : 0.80468
## Balanced Accuracy : 0.70853
##
## 'Positive' Class : 0
##
Residual(train3$Approved,train3$pred,train3$Existing_EMI,Weight=train3$one,NBins=50)
Residual(train3$Approved,train3$pred,train3$Loan_Amount,Weight=train3$one,NBins=50)
Residual(train3$Approved,train3$pred,train3$Interest_Rate,Weight=train3$one,NBins=30)
ModelComparisonData %>% ggvis(~ValidationWeight, ~ValidationLoss) %>% layer_paths(stroke := "blue") %>%
layer_paths(data = ModelComparisonData, x = ~ValidationWeight, y = ~ValidationWeight, stroke := "black") %>%
layer_paths(data = ModelComparisonData, x = ~TrainWeight, y = ~TrainLoss, stroke := "red")
###smote gbm#################################
train$Approved<-as.factor(ifelse(train$Approved==1,'Approved','Rejected'))
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE)
set.seed(5627)
gbmGrid <- expand.grid(interaction.depth = c(1,2,3),
n.trees = (1:20)*50,
shrinkage = c(0.1,0.05),
n.minobsinnode = 50)
ctrl$sampling <- "smote"
# smote_fit <- train(model,
# data = train,
# method = "gbm",
# bag.fraction = 0.5,
# verbose = FALSE,
# metric = "ROC",
# trControl = ctrl,
# tuneGrid =gbmGrid
# # ,na.action=na.exclude
# )
#find best iteration
plot(smote_fit)
summary(smote_fit)
## var rel.inf
## M_Monthly_Income M_Monthly_Income 59.1756399
## M_Var1 M_Var1 10.1334777
## M_Interest_Rate M_Interest_Rate 8.4750497
## M_Existing_EMI M_Existing_EMI 7.8571167
## M_Loan_Amount M_Loan_Amount 7.7540301
## M_EMI M_EMI 3.0969584
## M_Employer_Category1A M_Employer_Category1A 1.5391777
## M_Employer_Category1C M_Employer_Category1C 1.3280068
## M_Employer_Category1B M_Employer_Category1B 0.4391393
## M_GenderMale M_GenderMale 0.2014038
## M_Loan_Periodnot_missing M_Loan_Periodnot_missing 0.0000000
pred<-predict(smote_fit,newdata=test,type='prob')
test$Approved<-as.factor(ifelse(test$Approved==1,'Approved','Rejected'))
pred1<-as.data.frame(predict(smote_fit,newdata=test,type='raw'))
pred2<-pred$Approved
testset$Approved_Aligned<-testset$Approved/mean(pred2)
GAINS.CHART(testset$Approved_Aligned,(pred2),n.rank=10)
## $gainschart
## Rank Actual_Sum Actual_Mean Pred_Sum Pred_Mean Count
## 1 1 0.000000 0.000000000 22.64243 0.01299049 1743
## 2 2 7.538538 0.004317605 29.56613 0.01693364 1746
## 3 3 0.000000 0.000000000 47.96860 0.02752071 1743
## 4 4 7.538538 0.004332493 108.86728 0.06256740 1740
## 5 5 45.231230 0.025905630 247.11831 0.14153397 1746
## 6 6 49.000499 0.028177400 433.56899 0.24932087 1739
## 7 7 109.308806 0.062749028 615.89203 0.35355455 1742
## 8 8 150.770767 0.086500727 803.14289 0.46078192 1743
## 9 9 207.309805 0.118938500 1015.57877 0.58266137 1743
## 10 10 437.235225 0.250852109 1299.36255 0.74547479 1743
##
## $ks
## [1] 49.22
##
## $gini
## [1] 63.09
confusionMatrix(pred1$`predict(smote_fit, newdata = test, type = "raw")`,test$Approved)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Approved Rejected
## Approved 176 3572
## Rejected 93 13587
##
## Accuracy : 0.7897
## 95% CI : (0.7836, 0.7957)
## No Information Rate : 0.9846
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0606
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.65428
## Specificity : 0.79183
## Pos Pred Value : 0.04696
## Neg Pred Value : 0.99320
## Prevalence : 0.01543
## Detection Rate : 0.01010
## Detection Prevalence : 0.21506
## Balanced Accuracy : 0.72305
##
## 'Positive' Class : Approved
##
Further tune gbm model
Try spline smoothing on continuous variables: Monthly Income, Existing EMI, and Interest Rate
Check glm/gbm overfitting
PCA
Segmentation, decision tree/random forest
lasso/ridge
Test on Holdout dataset.
Logistic regression did a fairly good job in predicting the approval probabilities with the unbalanced dataset, however, feature engineering takes a lot of time and analysis with modeler’s experience based judgements. GBM gave a better model with less feature engineering work, but GBM is a blackbox, need to tune parameters and more difficult to explain to regulators if filed as a pricing model.
Questions?.