Outline:

  1. Problem statement
  2. Exploratory data analysis
  3. Attribute transformation
  4. Logistic regression
  5. GBM and more
  6. Conclusion

Problem Statement

A digital arm of a bank faces challenges with lead conversions. The primary objective of this division is to increase customer acquisition through digital channels. The division was set up a few years back, and its primary focus since then has been to increase the number of leads entering the conversion funnel.

They source leads through various channels such as search, display, email campaigns, and affiliate partners. As expected, conversion differs by source and by the quality of the leads.

They now want to identify the lead segments with a higher conversion ratio (lead to product purchase) so that they can specifically target these potential customers through additional channels and re-marketing. They have provided a partial dataset for salaried customers from the last 3 months, along with basic customer details. We need to identify the segment of customers with a high probability of conversion in the next 30 days.

Attribute summary

ID: Unique ID (cannot be used for prediction)
Gender: Sex of the applicant
DOB: Date of birth of the applicant
Lead_Creation_Date: Date on which the lead was created
City_Code: Anonymised code for the city
City_Category: Anonymised city feature
Employer_Code: Anonymised code for the employer
Employer_Category1: Anonymised employer feature
Employer_Category2: Anonymised employer feature
Monthly_Income: Monthly income in dollars
Customer_Existing_Primary_Bank_Code: Anonymised customer bank code
Primary_Bank_Type: Anonymised bank feature
Contacted: Contact verified (Y/N)
Source: Categorical variable representing the source of the lead
Source_Category: Type of source
Existing_EMI: EMI of existing loans, in dollars
Loan_Amount: Loan amount requested
Loan_Period: Loan period (years)
Interest_Rate: Interest rate of the submitted loan amount
EMI: EMI of the requested loan amount, in dollars
Var1: Categorical variable with multiple levels
Approved (Target): Whether the loan was approved (0/1)

Examine target

## Approved Rejected 
##     1020    68693
## [1] 0.01463142

From the approval ratio (1,020 approved out of 69,713, about 1.5%), we see that the target is highly imbalanced.
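
A minimal sketch of how this breakdown might be produced (`fullset` is a hypothetical name for the combined data, with Approved coded 0/1):

approved_level<-factor(fullset$Approved,levels=c(1,0),labels=c('Approved','Rejected'))
summary(approved_level)
mean(fullset$Approved) # approval rate, about 1.5%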

Split Train/Test/Holdout: 50%/25%/25%
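
The split code is not shown; a sketch of a random 50/25/25 split of `fullset` that reproduces the record counts below:

set.seed(1)
n<-nrow(fullset)
idx<-sample(seq_len(n))
trainset<-fullset[idx[1:floor(0.5*n)],]
testset<-fullset[idx[(floor(0.5*n)+1):floor(0.75*n)],]
holdout<-fullset[idx[(floor(0.75*n)+1):n],]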

##train number of records:
nrow(trainset)
## [1] 34856
summary(trainset$Approved_level)
## Approved Rejected 
##      515    34341
##test number of records:
nrow(testset)
## [1] 17428
summary(testset$Approved_level)
## Approved Rejected 
##      269    17159
##holdout number of records:
nrow(holdout)
## [1] 17429
summary(holdout$Approved_level)
## Approved Rejected 
##      236    17193

Data Diagnostics

As in Appendix A.

Loan_Amount, Loan_Period, Interest_Rate, and EMI have very high missing rates, so let's look at these four variables first.

I used histograms, boxplots, bivariate charts, and WOE binning for attribute analysis and transformation.

WOE\(_i\)=\(\ln\left(\frac{\text{DistributionEvent}_i}{\text{DistributionNonEvent}_i}\right)\)

IV=\(\sum_{i=1}^{n}\left(\text{DistributionEvent}_i-\text{DistributionNonEvent}_i\right)\times WOE_i\)

where DistributionEvent\(_i\) is the share of all events falling in bin \(i\), and likewise for non-events.
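
In the WOE_numeric_split calls below, y0 is the unit column `one`, so the denominator becomes the share of all observations and the WOE reduces to the log of the bin event rate over the overall event rate. For example, Loan_Amount bin 3 has an approval rate of 0.0332 against an overall rate of 0.0148, so WOE = ln(0.0332/0.0148) ≈ 0.81, matching the output below.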

Loan Amount

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5000   20000   30000   39420   50000  300000   13828

## $WOE
## # A tibble: 4 x 16
##   grp    n_obs mean~ sum_~ sum_~ sum_w mean_y1 mean_~ Lowe~ Upper~     woe
##   <chr>  <int> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl>  <dbl>   <dbl>
## 1 1      12519 21226 199   185   12519 0.0159  0.0148  5000  36000  0.0731
## 2 2       2579 42155  27.0  38.1  2579 0.0105  0.0148 37000  49000 -0.345 
## 3 3       5930 76649 197    87.6  5930 0.0332  0.0148 50000 300000  0.810 
## 4 Missi~ 13828    NA  92.0 204   13828 0.00665 0.0148    NA     NA -0.798 
## # ... with 5 more variables: ks <dbl>, info <dbl>, grp.n <dbl>, grp.f
## #   <fctr>, plot.width <dbl>
## 
## $Stat
##         MAX_KS     Info.Value Trend.Estimate Trend.Pr(>|t|) 
##   2.180771e+01   3.555042e-01   1.290376e-05   4.008000e-01 
## 
## $WOE_Code
## [1] "train$w_Loan_Amount = train$Loan_Amount"                                                                                        
## [2] "g_Loan_Amount = c(5000, 37000, 50000, 3e+05, NA)"                                                                               
## [3] "train$w_Loan_Amount[findInterval(train$Loan_Amount, g_Loan_Amount, rightmost.closed=TRUE) == 1] = 0.073115722598372"            
## [4] "train$w_Loan_Amount[findInterval(train$Loan_Amount, g_Loan_Amount, rightmost.closed=TRUE) == 2] = -0.344506472743474"           
## [5] "train$w_Loan_Amount[findInterval(train$Loan_Amount, g_Loan_Amount, rightmost.closed=TRUE) == 3] = 0.810237903879759"            
## [6] "train$w_Loan_Amount[findInterval(train$Loan_Amount, g_Loan_Amount, rightmost.closed=TRUE) == 4] = -0.797848556858933"           
## [7] "train$w_Loan_Amount[is.na(train$Loan_Amount) | is.infinite(train$Loan_Amount) | is.nan(train$Loan_Amount)] = -0.797848556858933"
## 
## $Plot
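
Create a missing flag, a sentinel (-1) imputation, and a three-level factor (missing / under 50,000 / 50,000 and above):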

train$Loan_Amount_flag<-as.factor(ifelse(is.na(train$Loan_Amount),1,0))
train$Loan_Amount_Impute_F<-coalesce(as.numeric(train$Loan_Amount),-1)
train$M_Loan_Amount_F<-as.factor(ifelse(train$Loan_Amount_Impute_F<50000&train$Loan_Amount_Impute_F>0,0,ifelse(train$Loan_Amount_Impute_F>0,1,-1)))

train$Loan_Amount_Impute_C<-coalesce(as.numeric(train$Loan_Amount),0)
train%>%group_by(M_Loan_Amount_F)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 3 x 4
##   M_Loan_Amount_F approval   cnt   ratio
##   <fctr>             <int> <int>   <dbl>
## 1 -1                    92 13828 0.00665
## 2 0                    226 15098 0.0150 
## 3 1                    197  5930 0.0332

Loan_Period

train$Loan_Period_flag<-as.factor(ifelse(is.na(train$Loan_Period),1,0))
train$Loan_Period_Impute<-(coalesce(as.numeric(train$Loan_Period),-1))
train$M_Loan_Period_F<-as.factor(train$Loan_Period_Impute)

# woe.binning(train,'Approved','Loan_Period_Impute')
train%>%group_by(Loan_Period_flag)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 2 x 4
##   Loan_Period_flag approval   cnt   ratio
##   <fctr>              <int> <int>   <dbl>
## 1 0                     423 21028 0.0201 
## 2 1                      92 13828 0.00665

Interest Rate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   11.99   15.10   18.00   19.23   20.00   37.00   23737

## $WOE
## # A tibble: 5 x 16
##   grp    n_obs mean_x sum_y1 sum_~ sum_w mean_y1 mean_~ Lowe~ Uppe~    woe
##   <chr>  <int>  <dbl>  <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1 1       2781   14.1 141     41.1  2781 5.07e-2 0.0148  12.0  15.1  1.23 
## 2 2       3629   16.6  74.0   53.6  3629 2.04e-2 0.0148  15.2  18.2  0.322
## 3 3       3543   21.4  18.0   52.3  3543 5.08e-3 0.0148  18.4  31.0 -1.07 
## 4 4       1166   33.0   1.00  17.2  1166 8.58e-4 0.0148  31.5  37.0 -2.85 
## 5 Missi~ 23737   NA   281    351   23737 1.18e-2 0.0148  NA    NA   -0.222
## # ... with 5 more variables: ks <dbl>, info <dbl>, grp.n <dbl>, grp.f
## #   <fctr>, plot.width <dbl>
## 
## $Stat
##         MAX_KS     Info.Value Trend.Estimate Trend.Pr(>|t|) 
##    23.35762560     0.44285020    -0.22203655     0.02796948 
## 
## $WOE_Code
## [1] "train$w_Interest_Rate = train$Interest_Rate"                                                                                            
## [2] "g_Interest_Rate = c(11.99, 15.25, 18.4, 31.5, 37, NA)"                                                                                  
## [3] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 1] = 1.23300770327775"               
## [4] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 2] = 0.322166353124642"              
## [5] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 3] = -1.06754367983577"              
## [6] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 4] = -2.8465206997769"               
## [7] "train$w_Interest_Rate[findInterval(train$Interest_Rate, g_Interest_Rate, rightmost.closed=TRUE) == 5] = -0.221621954713664"             
## [8] "train$w_Interest_Rate[is.na(train$Interest_Rate) | is.infinite(train$Interest_Rate) | is.nan(train$Interest_Rate)] = -0.221621954713664"
## 
## $Plot

train$Interest_Rate_flag<-as.factor(ifelse(is.na(train$Interest_Rate),1,0))
mean((train[!is.na(train$Interest_Rate),]$Interest_Rate))
## [1] 19.23383
quantile(train[!is.na(train$Interest_Rate),]$Interest_Rate,0.95)
##  95% 
## 31.5
train$Interest_Rate_Impute<-coalesce(as.numeric(train$Interest_Rate),19.23383)
bivar.plot(train,'Interest_Rate_Impute','Approved')
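
Piecewise-linear (hinge) terms with knots at 15, 20, and 30 let the regression capture the nonlinear rate effect seen in the bivariate plot: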

train$Interest_Rate_Impute_grp1<-ifelse(train$Interest_Rate_Impute<=15,15-train$Interest_Rate_Impute,0)
train$Interest_Rate_Impute_grp2<-ifelse(train$Interest_Rate_Impute>15&train$Interest_Rate_Impute<=20,train$Interest_Rate_Impute-15,0)
train$Interest_Rate_Impute_grp3<-ifelse(train$Interest_Rate_Impute>20&train$Interest_Rate_Impute<=30,train$Interest_Rate_Impute-20,0)
train$Interest_Rate_Impute_grp4<-ifelse(train$Interest_Rate_Impute>30,train$Interest_Rate_Impute-30,0)

EMI (Equated Monthly Installment)

EMI=\(\frac{r\times\text{Loan\_Amount}}{1-(1+r)^{-n}}\), where \(r\) is the monthly interest rate and \(n\) is the number of monthly payments.

EMI is, by construction, highly correlated with Interest_Rate and Loan_Amount.
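
As a sanity check, a sketch of the formula in R (the loan figures are illustrative, not from the data):

r<-0.18/12; n<-3*12 # 18% annual rate, 3-year period, monthly payments
30000*r/(1-(1+r)^(-n)) # ~1085 for a 30,000 loan, in line with the summary below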

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     119     654     937    1102    1297   13560   23737

##      99% 
## 8.318254
##       1% 
## 5.488938
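
The 1% and 99% values above are the caps applied to EMI_log (assumed to be the log of EMI, created off-screen) to form EMI_log_capped: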
train$EMI_flag<-as.factor(ifelse(is.na(train$EMI),1,0))
bivar.plot(train,'EMI_log_capped','Approved')

train$M_EMI<-coalesce(as.numeric(train$EMI_log_capped),0)
bivar.plot(train,'M_EMI','Approved')

train$M_EMI<-ifelse(train$M_EMI<6,1,
                                 ifelse(train$M_EMI<7.5,2,3))
train%>%group_by(M_EMI)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 3 x 4
##   M_EMI approval   cnt  ratio
##   <dbl>    <int> <int>  <dbl>
## 1  1.00      293 24829 0.0118
## 2  2.00      157  8766 0.0179
## 3  3.00       65  1261 0.0515

For attributes with a high missing rate, I lean towards excluding them from the regression model. However, I will keep them as potential candidates and compare models with and without these variables.

Categorical variable analysis

Convert categorical variables to factors

train$M_City_Category_F<-as.factor(train$City_Category)
train$M_Employer_Category1_F<-as.factor(train$Employer_Category1)
train$M_Employer_Category2_F<-as.factor(train$Employer_Category2)
train$M_Primary_Bank_Type_F<-as.factor(train$Primary_Bank_Type)
train$M_Contacted_F<-as.factor(train$Contacted)
train$M_Gender_F<-as.factor(train$Gender)
train$M_Source_Category_F<-as.factor(train$Source_Category)
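
The Var1 conversion is not shown; presumably the same pattern, which yields the grouping below:

train$M_Var1_F<-as.factor(train$Var1)
train%>%group_by(M_Var1_F)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)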
## # A tibble: 5 x 4
##   M_Var1_F approval   cnt   ratio
##   <fctr>      <int> <int>   <dbl>
## 1 0              57 11637 0.00490
## 2 2              21  6696 0.00314
## 3 4              33  3885 0.00849
## 4 7             113  5976 0.0189 
## 5 10            291  6662 0.0437

Numeric variable analysis

Correlation

## 'data.frame':    34856 obs. of  9 variables:
##  $ Monthly_Income    : num  3500 2250 3500 7000 2500 ...
##  $ Existing_EMI      : num  0 0 0 0 0 0 255 1000 0 0 ...
##  $ Interest_Rate     : num  13.2 NA NA NA 20 ...
##  $ Employer_Category2: int  1 4 4 4 4 4 4 4 4 4 ...
##  $ Loan_Amount       : int  20000 45000 92000 NA 66000 NA 30000 45000 74000 NA ...
##  $ Loan_Period       : int  2 4 5 NA 5 NA 5 5 5 NA ...
##  $ EMI               : int  953 NA NA NA 1749 NA 722 NA NA NA ...
##  $ Var1              : int  10 0 7 0 7 0 0 10 10 10 ...
##  $ Approved          : int  0 0 0 0 0 0 0 0 0 0 ...

## [1] 3
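
The lone `3` above is presumably a count of strongly correlated pairs; a sketch of such a check (the 0.7 threshold is an assumption):

num_vars<-c('Monthly_Income','Existing_EMI','Interest_Rate','Employer_Category2','Loan_Amount','Loan_Period','EMI','Var1','Approved')
cor_mat<-cor(train[,num_vars],use='pairwise.complete.obs')
sum(abs(cor_mat[upper.tri(cor_mat)])>0.7)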

Monthly Income

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0     1650     2500     6278     4000 38380000

##    99.5% 
## 10.59666
##       5% 
## 6.908755
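
Monthly_Income_log (assumed to be the log of Monthly_Income, created off-screen) is capped at the 5th and 99.5th percentile values printed above: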

train$Monthly_Income_flag<-as.factor(ifelse(is.na(train$Monthly_Income),1,0))
train$Monthly_Income_log_capped<-pmin(pmax(train$Monthly_Income_log,6.908755),10.59666 )
bivar.plot(train,'Monthly_Income_log_capped','Approved')

WOE_numeric_split(x='Monthly_Income_log_capped',y1='Approved',y0='one',data=train,group=20)
## $WOE
## # A tibble: 8 x 14
##     grp n_obs mean_x sum_y1 sum_y0 sum_w mean_y1 mean~ Lowe~ Uppe~     woe
##   <int> <int>  <dbl>  <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1     1 11906   7.24   3.00  11906 11906 2.52e-4  1.00  6.91  7.60 -4.07  
## 2     2  2027   7.60   8.00   2027  2027 3.95e-3  1.00  7.60  7.63 -1.32  
## 3     3  6691   7.79  55.0    6691  6691 8.22e-3  1.00  7.63  7.94 -0.586 
## 4     4  3659   8.03  57.0    3659  3659 1.56e-2  1.00  7.94  8.16  0.0529
## 5     5  3379   8.26  79.0    3379  3379 2.34e-2  1.00  8.16  8.41  0.459 
## 6     6  3675   8.55 143      3675  3675 3.89e-2  1.00  8.41  8.73  0.968 
## 7     7  1770   8.91  99.0    1770  1770 5.59e-2  1.00  8.73  9.14  1.33  
## 8     8  1749   9.66  71.0    1749  1749 4.06e-2  1.00  9.15 10.6   1.01  
## # ... with 3 more variables: ks <dbl>, info <dbl>, plot.width <dbl>
## 
## $Stat
##         MAX_KS     Info.Value Trend.Estimate Trend.Pr(>|t|) 
##   46.353619106    1.943089527    2.802307536    0.004907191 
## 
## $WOE_Code
##  [1] "train$w_Monthly_Income_log_capped = train$Monthly_Income_log_capped"                                                                                                              
##  [2] "g_Monthly_Income_log_capped = c(6.908755, 7.60140233458373, 7.63094658089046, 7.93773177526011, 8.16080392095467, 8.41205487329293, 8.73246584834988, 9.14857134557855, 10.59666)"
##  [3] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 1] = -4.07137179791168"                    
##  [4] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 2] = -1.32005693763862"                    
##  [5] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 3] = -0.586371766416804"                   
##  [6] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 4] = 0.0529197699488852"                   
##  [7] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 5] = 0.458926432886273"                    
##  [8] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 6] = 0.968349885746512"                    
##  [9] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 7] = 1.33119869170026"                     
## [10] "train$w_Monthly_Income_log_capped[findInterval(train$Monthly_Income_log_capped, g_Monthly_Income_log_capped, rightmost.closed=TRUE) == 8] = 1.01069406915626"                     
## 
## $Plot

train$M_Monthly_Income<-ifelse(train$Monthly_Income_log_capped<7.630655,-2.92930647,
                                 ifelse(train$Monthly_Income_log_capped<7.937160,-0.58637177,
                                  ifelse(train$Monthly_Income_log_capped<8.160261,0.05291977,
                                   ifelse(train$Monthly_Income_log_capped<8.409808,0.45892643,1.08001423))))
train%>%group_by(M_Monthly_Income)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 5 x 4
##   M_Monthly_Income approval   cnt    ratio
##              <dbl>    <int> <int>    <dbl>
## 1          -2.93         11 13932 0.000790
## 2          -0.586        55  6691 0.00822 
## 3           0.0529       57  3659 0.0156  
## 4           0.459        79  3379 0.0234  
## 5           1.08        313  7195 0.0435

Existing EMI

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0     0.0   353.6   350.0 62630.0      30

##    99% 
## 8.2943
## 1% 
##  0
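
Existing_EMI_log (presumably log(Existing_EMI + 1), since the 1st percentile is 0) is capped at the percentile values printed above: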

train$Existing_EMI_flag<-as.factor(ifelse(is.na(train$Existing_EMI),1,0))
bivar.plot(train,'Existing_EMI_log_capped','Approved')

WOE_numeric_split(x='Existing_EMI_log_capped',y1='Approved',y0='one',data=train,group=10)
## $WOE
## # A tibble: 3 x 14
##     grp n_obs mean_x sum_y1 sum_y0 sum_w mean_~ mean~ Lower~ Upper~    woe
##   <int> <int>  <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     1 27350  0.788  288    27350 27350 0.0105  1.00   0      6.22 -0.339
## 2     2  3441  6.46    60.0   3441  3441 0.0174  1.00   6.22   6.91  0.166
## 3     3  4065  7.47   167     4065  4065 0.0411  1.00   6.91   8.31  1.02 
## # ... with 3 more variables: ks <dbl>, info <dbl>, plot.width <dbl>
## 
## $Stat
##         MAX_KS     Info.Value Trend.Estimate Trend.Pr(>|t|) 
##     22.5433573      0.2916497      0.1615363      0.2360679 
## 
## $WOE_Code
## [1] "train$w_Existing_EMI_log_capped = train$Existing_EMI_log_capped"                                                                                         
## [2] "g_Existing_EMI_log_capped = c(0, 6.21660610108486, 6.90875477931522, 8.305953)"                                                                          
## [3] "train$w_Existing_EMI_log_capped[findInterval(train$Existing_EMI_log_capped, g_Existing_EMI_log_capped, rightmost.closed=TRUE) == 1] = -0.338697660580639"
## [4] "train$w_Existing_EMI_log_capped[findInterval(train$Existing_EMI_log_capped, g_Existing_EMI_log_capped, rightmost.closed=TRUE) == 2] = 0.165640823558162" 
## [5] "train$w_Existing_EMI_log_capped[findInterval(train$Existing_EMI_log_capped, g_Existing_EMI_log_capped, rightmost.closed=TRUE) == 3] = 1.02263845756839"  
## 
## $Plot

# woe.binning(train,'Approved','Existing_EMI_log_capped')
train$M_Existing_EMI<-ifelse(train$Existing_EMI_log_capped<6.215408,-0.3386977,
                             ifelse(train$Existing_EMI_log_capped<6.907455,0.1656408,1.0226385))
train%>%group_by(M_Existing_EMI)%>%summarise(approval=sum(Approved),cnt=n(),ratio=approval/cnt)
## # A tibble: 3 x 4
##   M_Existing_EMI approval   cnt  ratio
##            <dbl>    <int> <int>  <dbl>
## 1         -0.339      288 27350 0.0105
## 2          0.166       60  3440 0.0174
## 3          1.02       167  4066 0.0411

For Existing EMI, I will also try the continuous (log-capped) form.

Date Attributes: DOB and Lead_Creation_Date

train$DOB<- as.Date(train$DOB,
                   format = "%d/%m/%y")
train$Lead_Creation_Date<- as.Date(train$Lead_Creation_Date,
                                  format = "%d/%m/%y")
train$M_DOB_Year<-year(train$DOB)
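# impute missing birth years to 2016, then cap at 1995 (= 2016 - 21) so every applicant is treated as at least 21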
train$M_DOB_Year<-pmin(coalesce(train$M_DOB_Year,2016),2016-21)
train$M_DOB_Month<-month(train$DOB)
train$M_Lead_Year<-pmin(year(train$Lead_Creation_Date),2016)
train$M_Lead_Month<-month(train$Lead_Creation_Date)

train$M_Lead_Month<-as.factor(train$M_Lead_Month)
train$M_DOB_Month<-as.factor(train$M_DOB_Month)

KS summary

As in Appendix B.

Decision tree

With the default settings, the decision tree cannot find any worthwhile split on data this imbalanced and returns a root-only tree:

## Call:
## rpart(formula = fm, data = train, method = "class")
##   n= 34856 
## 
##   CP nsplit rel error xerror xstd
## 1  0      0         1      0    0
## 
## Node number 1: 34856 observations
##   predicted class=0  expected loss=0.01477507  P(node) =1
##     class counts: 34341   515
##    probabilities: 0.985 0.015

Decision tree with ROSE-balanced data

ROSE (Random Over-Sampling Examples) draws a synthetic, roughly class-balanced sample from the training data:

library(ROSE)
fm<-as.formula(Approved ~ M_Gender_F+M_City_Category_F
             +M_Employer_Category1_F+M_Employer_Category2_F
             +M_Monthly_Income
             +M_Primary_Bank_Type_F+M_Contacted_F+M_Source_Category_F
             +M_Existing_EMI+Var1
             +Interest_Rate
             +EMI
             +Loan_Amount
             +Loan_Period
             )
data.rose <- ROSE(fm, data = train, seed = 1)$data
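
The balanced sample can then be used to refit the tree; a sketch (the refit is not shown above):

library(rpart)
tree.rose<-rpart(fm,data=data.rose,method='class')
printcp(tree.rose) # the balanced data now admits splits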

Logistic Regression

model6<-as.formula(Approved~M_Gender_F
+M_City_Category_F.B
+M_Employer_Category1_F.A               
+M_Primary_Bank_Type_F.
+M_Primary_Bank_Type_F.G
# +M_Existing_EMI
+Existing_EMI_log_capped
+Interest_Rate_Impute_grp2
# +Monthly_Income_log_capped
+M_Monthly_Income
+M_Var1_F.10
)


logi.model<-glm(model6, data=train3,
               family = binomial(link = "logit"))
summary(logi.model)
## 
## Call:
## glm(formula = model6, family = binomial(link = "logit"), data = train3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6286  -0.2002  -0.0977  -0.0440   3.9444  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -4.07798    0.14766 -27.616  < 2e-16 ***
## M_Gender_FMale             0.32288    0.11075   2.915 0.003551 ** 
## M_City_Category_F.B        0.46428    0.15156   3.063 0.002189 ** 
## M_Employer_Category1_F.A  -0.34671    0.09768  -3.549 0.000386 ***
## M_Primary_Bank_Type_F.    -1.83720    0.32367  -5.676 1.38e-08 ***
## M_Primary_Bank_Type_F.G   -0.49143    0.12169  -4.038 5.38e-05 ***
## Existing_EMI_log_capped    0.07887    0.01322   5.965 2.45e-09 ***
## Interest_Rate_Impute_grp2 -0.11097    0.02434  -4.559 5.14e-06 ***
## M_Monthly_Income           0.81680    0.06238  13.093  < 2e-16 ***
## M_Var1_F.10                0.29632    0.10427   2.842 0.004486 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5363.6  on 34855  degrees of freedom
## Residual deviance: 4531.3  on 34846  degrees of freedom
## AIC: 4551.3
## 
## Number of Fisher Scoring iterations: 9
test3$pred<-round(predict(logi.model,newdata=test3,type='response'))
confusionMatrix(test3$pred,test3$Approved)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17159   269
##          1     0     0
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9826, 0.9863)
##     No Information Rate : 0.9846          
##     P-Value [Acc > NIR] : 0.5162          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9846          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9846          
##          Detection Rate : 0.9846          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

Look for interaction terms

Fit a fixed-size GBM and rank pairwise interaction strengths with gbm.interactions() from dismo:

library(dismo) # provides gbm.fixed() and gbm.interactions()
library(gbm)   # dismo's gbm.* wrappers fit models via the gbm package
set.seed(12345)
model.fixed <- gbm.fixed(data=train4, gbm.x = (2:10), gbm.y = 1,family = "bernoulli", tree.complexity = 5, learning.rate = 0.01, bag.fraction = 0.5, n.trees = 500)
## [1] fitting gbm model with a fixed number of 500 trees for Approved
## [1] total deviance = 5363.61
## [1] residual deviance = 4424.44
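# rename the first element of gbm.call so gbm.interactions() can locate the training data frame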
names(model.fixed$gbm.call)[1] <- "dataframe"

find.int <- gbm.interactions(model.fixed)
# find.int$interactions
find.int$rank.list
##   var1.index              var1.names var2.index               var2.names
## 1          8        M_Monthly_Income          4   M_Primary_Bank_Type_F.
## 2          9             M_Var1_F.10          6  Existing_EMI_log_capped
## 3          6 Existing_EMI_log_capped          3 M_Employer_Category1_F.A
## 4          6 Existing_EMI_log_capped          5  M_Primary_Bank_Type_F.G
##   int.size
## 1    44.31
## 2     5.59
## 3     5.50
## 4     4.58

Stepwise forward selection to get interaction terms

model8<-as.formula(Approved ~ M_Monthly_Income + M_Primary_Bank_Type_F. + M_Primary_Bank_Type_F.G + 
    Interest_Rate_Impute_grp2 + Existing_EMI_log_capped + M_Employer_Category1_F.A + 
    M_Gender_F + M_City_Category_F.B + M_Var1_F.10 + M_Primary_Bank_Type_F.:Interest_Rate_Impute_grp2 + 
    M_Monthly_Income:M_Gender_F 
    )

logi.model<-glm(model8, data=train3,
               family = binomial(link = "logit"))
summary(logi.model)
## 
## Call:
## glm(formula = model8, family = binomial(link = "logit"), data = train3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6072  -0.2028  -0.0926  -0.0329   4.1191  
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                      -4.22523    0.16561
## M_Monthly_Income                                  1.02022    0.11722
## M_Primary_Bank_Type_F.                           -0.73075    0.41516
## M_Primary_Bank_Type_F.G                          -0.48166    0.12165
## Interest_Rate_Impute_grp2                        -0.10147    0.02456
## Existing_EMI_log_capped                           0.07727    0.01322
## M_Employer_Category1_F.A                         -0.35839    0.09793
## M_Gender_FMale                                    0.50177    0.13944
## M_City_Category_F.B                               0.45682    0.15154
## M_Var1_F.10                                       0.31156    0.10452
## M_Primary_Bank_Type_F.:Interest_Rate_Impute_grp2 -0.46718    0.17037
## M_Monthly_Income:M_Gender_FMale                  -0.29924    0.13262
##                                                  z value Pr(>|z|)    
## (Intercept)                                      -25.514  < 2e-16 ***
## M_Monthly_Income                                   8.703  < 2e-16 ***
## M_Primary_Bank_Type_F.                            -1.760 0.078380 .  
## M_Primary_Bank_Type_F.G                           -3.959 7.51e-05 ***
## Interest_Rate_Impute_grp2                         -4.131 3.61e-05 ***
## Existing_EMI_log_capped                            5.845 5.06e-09 ***
## M_Employer_Category1_F.A                          -3.660 0.000253 ***
## M_Gender_FMale                                     3.598 0.000320 ***
## M_City_Category_F.B                                3.014 0.002574 ** 
## M_Var1_F.10                                        2.981 0.002874 ** 
## M_Primary_Bank_Type_F.:Interest_Rate_Impute_grp2  -2.742 0.006104 ** 
## M_Monthly_Income:M_Gender_FMale                   -2.256 0.024045 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5363.6  on 34855  degrees of freedom
## Residual deviance: 4517.3  on 34844  degrees of freedom
## AIC: 4541.3
## 
## Number of Fisher Scoring iterations: 10
test3$pred<-round(predict(logi.model,newdata=test3,type='response'))
confusionMatrix(test3$pred,test3$Approved)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17159   269
##          1     0     0
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9826, 0.9863)
##     No Information Rate : 0.9846          
##     P-Value [Acc > NIR] : 0.5162          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9846          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9846          
##          Detection Rate : 0.9846          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

Accuracy=\(\frac{TP+TN}{P+N}\)

With 98.5% of leads rejected, the default 0.5 cutoff predicts everything as rejected, so accuracy alone is meaningless here.

Change cutoff value

With a logit link, exp(link) is the predicted odds, which for rare events closely approximates the probability, so the 0.045 threshold below acts on an approximate probability scale. Note that round() collapses the link scores to integers, which is why the gains chart further down has only 7 distinct ranks.

train3$predlink<-round(predict(logi.model,newdata=train3,type='link'))
summary(exp(train3$predlink))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000167 0.0009119 0.0067380 0.0160600 0.0183200 0.1353000
train3$pred<-ifelse(exp(train3$predlink)>0.045,1,0)
test3$predlink<-round(predict(logi.model,newdata=test3,type='link'))
test3$pred<-ifelse(exp(test3$predlink)>0.045,1,0)

pred_prob<-round(predict(logi.model,newdata=test3,type='link'))

library(car)
vif(lm(model6, data=train3))
##                M_Gender_F       M_City_Category_F.B 
##                  1.215335                  1.022013 
##  M_Employer_Category1_F.A    M_Primary_Bank_Type_F. 
##                  1.083169                  1.111446 
##   M_Primary_Bank_Type_F.G   Existing_EMI_log_capped 
##                  1.171321                  1.080049 
## Interest_Rate_Impute_grp2          M_Monthly_Income 
##                  1.140728                  1.512113 
##               M_Var1_F.10 
##                  1.435848
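
All VIFs are below 1.6, so multicollinearity among the selected terms is not a concern.
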
GAINS.CHART(testset$Approved,exp(pred_prob),n.rank=10)

## $gainschart
##   Rank Actual_Sum Actual_Mean   Pred_Sum    Pred_Mean Count
## 1    1          0 0.000000000   1.134770 0.0002801901  4050
## 2    2          6 0.002824859   1.936837 0.0009118820  2124
## 3    3          6 0.002700270   5.507787 0.0024787522  2222
## 4    4         27 0.011455240  15.881341 0.0067379470  2357
## 5    5         67 0.020483033  59.910455 0.0183156389  3271
## 6    6        122 0.041105121 147.768019 0.0497870684  2968
## 7    7         41 0.094036697  59.006183 0.1353352832   436
## 
## $ks
## [1] 42.64
## 
## $gini
## [1] 30.93
confusionMatrix(test3$pred,test3$Approved)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 13918   106
##          1  3241   163
##                                          
##                Accuracy : 0.808          
##                  95% CI : (0.802, 0.8138)
##     No Information Rate : 0.9846         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0619         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.81112        
##             Specificity : 0.60595        
##          Pos Pred Value : 0.99244        
##          Neg Pred Value : 0.04788        
##              Prevalence : 0.98457        
##          Detection Rate : 0.79860        
##    Detection Prevalence : 0.80468        
##       Balanced Accuracy : 0.70853        
##                                          
##        'Positive' Class : 0              
## 

Residual Analysis
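
Binned mean residuals against key attributes (Residual() is a custom helper; NBins controls the number of bins):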

Residual(train3$Approved,train3$pred,train3$Existing_EMI,Weight=train3$one,NBins=50)
Residual(train3$Approved,train3$pred,train3$Loan_Amount,Weight=train3$one,NBins=50)
Residual(train3$Approved,train3$pred,train3$Interest_Rate,Weight=train3$one,NBins=30)

Lorenz Curve, Train vs Test

ModelComparisonData %>% ggvis(~ValidationWeight, ~ValidationLoss) %>% layer_paths(stroke := "blue") %>% 
  layer_paths(data = ModelComparisonData, x = ~ValidationWeight, y = ~ValidationWeight, stroke := "black") %>%
  layer_paths(data = ModelComparisonData, x = ~TrainWeight, y = ~TrainLoss, stroke := "red")

SMOTE GBM

###smote gbm#################################
train$Approved<-as.factor(ifelse(train$Approved==1,'Approved','Rejected'))

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)


set.seed(5627)

gbmGrid <-  expand.grid(interaction.depth = c(1,2,3),
                        n.trees = (1:20)*50, 
                        shrinkage = c(0.1,0.05),
                        n.minobsinnode = 50) 

ctrl$sampling <- "smote"
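# "smote" resampling is applied inside each CV fold and requires the DMwR package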

# smote_fit <- train(model,
#                    data = train,
#                    method = "gbm",
#                    bag.fraction = 0.5,
#                    verbose = FALSE,
#                    metric = "ROC",
#                    trControl = ctrl,
#                    tuneGrid =gbmGrid
#                    # ,na.action=na.exclude
#                    )

#find best iteration
plot(smote_fit)

summary(smote_fit)

##                                               var    rel.inf
## M_Monthly_Income                 M_Monthly_Income 59.1756399
## M_Var1                                     M_Var1 10.1334777
## M_Interest_Rate                   M_Interest_Rate  8.4750497
## M_Existing_EMI                     M_Existing_EMI  7.8571167
## M_Loan_Amount                       M_Loan_Amount  7.7540301
## M_EMI                                       M_EMI  3.0969584
## M_Employer_Category1A       M_Employer_Category1A  1.5391777
## M_Employer_Category1C       M_Employer_Category1C  1.3280068
## M_Employer_Category1B       M_Employer_Category1B  0.4391393
## M_GenderMale                         M_GenderMale  0.2014038
## M_Loan_Periodnot_missing M_Loan_Periodnot_missing  0.0000000
pred<-predict(smote_fit,newdata=test,type='prob')
test$Approved<-as.factor(ifelse(test$Approved==1,'Approved','Rejected'))
pred1<-as.data.frame(predict(smote_fit,newdata=test,type='raw'))

pred2<-pred$Approved
testset$Approved_Aligned<-testset$Approved/mean(pred2)
GAINS.CHART(testset$Approved_Aligned,(pred2),n.rank=10)

## $gainschart
##    Rank Actual_Sum Actual_Mean   Pred_Sum  Pred_Mean Count
## 1     1   0.000000 0.000000000   22.64243 0.01299049  1743
## 2     2   7.538538 0.004317605   29.56613 0.01693364  1746
## 3     3   0.000000 0.000000000   47.96860 0.02752071  1743
## 4     4   7.538538 0.004332493  108.86728 0.06256740  1740
## 5     5  45.231230 0.025905630  247.11831 0.14153397  1746
## 6     6  49.000499 0.028177400  433.56899 0.24932087  1739
## 7     7 109.308806 0.062749028  615.89203 0.35355455  1742
## 8     8 150.770767 0.086500727  803.14289 0.46078192  1743
## 9     9 207.309805 0.118938500 1015.57877 0.58266137  1743
## 10   10 437.235225 0.250852109 1299.36255 0.74547479  1743
## 
## $ks
## [1] 49.22
## 
## $gini
## [1] 63.09
confusionMatrix(pred1$`predict(smote_fit, newdata = test, type = "raw")`,test$Approved)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Approved Rejected
##   Approved      176     3572
##   Rejected       93    13587
##                                           
##                Accuracy : 0.7897          
##                  95% CI : (0.7836, 0.7957)
##     No Information Rate : 0.9846          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0606          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.65428         
##             Specificity : 0.79183         
##          Pos Pred Value : 0.04696         
##          Neg Pred Value : 0.99320         
##              Prevalence : 0.01543         
##          Detection Rate : 0.01010         
##    Detection Prevalence : 0.21506         
##       Balanced Accuracy : 0.72305         
##                                           
##        'Positive' Class : Approved        
## 

Model Comparison on Holdout dataset

Next Steps

  1. Further tune the GBM model

  2. Try spline smoothing on continuous variables: Monthly Income, Existing EMI, and Interest Rate

  3. Check glm/gbm overfitting

  4. PCA

  5. Segmentation, decision tree/random forest

  6. Lasso/ridge regularization

  7. Test on Holdout dataset.

Conclusion

Logistic regression did a fairly good job of predicting approval probabilities on this unbalanced dataset; however, the feature engineering took substantial time and analysis and relied on the modeler's experience-based judgment. GBM gave a better model with less feature-engineering work, but it is a black box: it needs parameter tuning and is harder to explain to regulators if filed as a pricing model.

Questions?