(1) Overview

The following dataset was obtained from Kaggle and is originally published as part of the book “Econmetric Analysis” by William Greene. The data is collected from a lending institution and indicates whether a cardholder’s application got accepted (1) or denied (0).


Using this data, I will use a logistic regression model to predict whether cardholders in a test sample got approved or declined.

(2) Exploring the data

Note the dataset has 12 variables including the approval variable (“card”). Approximately 23% of total cardholders were denied a credit card. Variables such as age, income, and dependents have a wide range of values. The data set is a mix of integers, factor, and numerical values. See a summary of the content below.

## corrplot 0.84 loaded
##   card         reports             age              income      
##  no : 296   Min.   : 0.0000   Min.   : 0.1667   Min.   : 0.210  
##  yes:1023   1st Qu.: 0.0000   1st Qu.:25.4167   1st Qu.: 2.244  
##             Median : 0.0000   Median :31.2500   Median : 2.900  
##             Mean   : 0.4564   Mean   :33.2131   Mean   : 3.365  
##             3rd Qu.: 0.0000   3rd Qu.:39.4167   3rd Qu.: 4.000  
##             Max.   :14.0000   Max.   :83.5000   Max.   :13.500  
##      share            expenditure       owner     selfemp   
##  Min.   :0.0001091   Min.   :   0.000   no :738   no :1228  
##  1st Qu.:0.0023159   1st Qu.:   4.583   yes:581   yes:  91  
##  Median :0.0388272   Median : 101.298                       
##  Mean   :0.0687322   Mean   : 185.057                       
##  3rd Qu.:0.0936168   3rd Qu.: 249.036                       
##  Max.   :0.9063205   Max.   :3099.505                       
##    dependents         months         majorcards         active      
##  Min.   :0.0000   Min.   :  0.00   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.: 12.00   1st Qu.:1.0000   1st Qu.: 2.000  
##  Median :1.0000   Median : 30.00   Median :1.0000   Median : 6.000  
##  Mean   :0.9939   Mean   : 55.27   Mean   :0.8173   Mean   : 6.997  
##  3rd Qu.:2.0000   3rd Qu.: 72.00   3rd Qu.:1.0000   3rd Qu.:11.000  
##  Max.   :6.0000   Max.   :540.00   Max.   :1.0000   Max.   :46.000
## 'data.frame':    1319 obs. of  12 variables:
##  $ card       : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ reports    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age        : num  37.7 33.2 33.7 30.5 32.2 ...
##  $ income     : num  4.52 2.42 4.5 2.54 9.79 ...
##  $ share      : num  0.03327 0.00522 0.00416 0.06521 0.06705 ...
##  $ expenditure: num  124.98 9.85 15 137.87 546.5 ...
##  $ owner      : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 2 1 ...
##  $ selfemp    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dependents : int  3 3 4 0 2 0 2 0 0 0 ...
##  $ months     : int  54 34 58 25 64 54 7 77 97 65 ...
##  $ majorcards : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ active     : int  12 13 5 7 5 1 5 3 6 18 ...
## [1] 1319   12

Feature Engineering

In order to create the predictive model, we need our variables to be in a numeric format. To do this I needed to format the data accordingly using a series of loops. The affected parameters are dictated below.

data[,1:12]<- sapply(data[,1:12], as.character)
n<- nrow(data)
##To turn quantitive data into qualitive data
for (i in 1:n)
{if (data$card[i] == "yes"){data$card[i]<- 1} else {data$card[i]<- 0}}
for (i in 1:n)
{if (data$owner[i] == "yes"){data$owner[i]<- 1} else {data$owner[i]<- 0}}
for (i in 1:n)
{if (data$selfemp[i] == "yes"){data$selfemp[i]<- 1} else {data$selfemp[i]<- 0}}
data[,1:12]<- sapply(data[,1:12], as.numeric)

(2) Building the Model

Now that the data is in the appropriate format, I decided to create a train and test model based on a 80/20 ratio. I will use the rule of 10 in order to keep the risk of overfitting low. To handpick the 10 variables included in the model I created a correlation plot.

## 
## Call:
## glm(formula = card ~ reports + income + owner + selfemp + dependents + 
##     majorcards + active, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8635   0.1414   0.3772   0.6047   2.4323  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.09656    0.26951   0.358 0.720136    
## reports     -1.92571    0.17224 -11.180  < 2e-16 ***
## income       0.27226    0.07505   3.628 0.000286 ***
## owner        0.24965    0.21352   1.169 0.242326    
## selfemp     -0.99171    0.31939  -3.105 0.001903 ** 
## dependents  -0.18273    0.07910  -2.310 0.020873 *  
## majorcards   0.54040    0.21283   2.539 0.011114 *  
## active       0.15726    0.02223   7.072 1.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1096.61  on 1055  degrees of freedom
## Residual deviance:  752.97  on 1048  degrees of freedom
## AIC: 768.97
## 
## Number of Fisher Scoring iterations: 6

Figure 1: Age has the smallest correlation compared to the others. I decided to omit “share” and “expenditures” because the correlation between the two is too high and prevents me from creating an accurate model (fitted probabilities). Essentially all of the variables within the model are statistically significant.

(3) Conclusion

The model appears to predict approved cardholders with high accuracy. Though the accuracy of the model is acceptable, our largest variances appear to be predicting rejected applicants. This could be due to the fact that our model was based on a dataset that was heavily disproportionate and had signficantly more acceptances than rejections. The summary of our predictions are highlighted below.

##    
##       0   1
##   0  34  36
##   1   7 186
## [1] "The model predicted  84 % of the test dataset correctly"