The following dataset was obtained from Kaggle and was originally published as part of William Greene's book “Econometric Analysis.” The data comes from a lending institution and indicates whether a cardholder's application was accepted (1) or denied (0).
Using this data, I will fit a logistic regression model to predict whether cardholders in a test sample were approved or declined.
Note that the dataset has 12 variables, including the approval variable (“card”). Approximately 22% of cardholders (296 of 1,319) were denied a credit card. Variables such as age, income, and dependents have a wide range of values. The dataset is a mix of integer, factor, and numeric values. See a summary of the contents below.
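For reference, a minimal sketch of how the data could be loaded and inspected is shown here; the CSV file name is an assumption based on the Kaggle download, and the corrplot package is loaded now because it is used for the correlation plot later.

library(corrplot)                              # used later for the correlation plot
data <- read.csv("AER_credit_card_data.csv")   # assumed file name from the Kaggle download
summary(data)                                  # distributions and factor counts shown below
str(data)                                      # variable types
dim(data)                                      # number of rows and columns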
## card reports age income
## no : 296 Min. : 0.0000 Min. : 0.1667 Min. : 0.210
## yes:1023 1st Qu.: 0.0000 1st Qu.:25.4167 1st Qu.: 2.244
## Median : 0.0000 Median :31.2500 Median : 2.900
## Mean : 0.4564 Mean :33.2131 Mean : 3.365
## 3rd Qu.: 0.0000 3rd Qu.:39.4167 3rd Qu.: 4.000
## Max. :14.0000 Max. :83.5000 Max. :13.500
## share expenditure owner selfemp
## Min. :0.0001091 Min. : 0.000 no :738 no :1228
## 1st Qu.:0.0023159 1st Qu.: 4.583 yes:581 yes: 91
## Median :0.0388272 Median : 101.298
## Mean :0.0687322 Mean : 185.057
## 3rd Qu.:0.0936168 3rd Qu.: 249.036
## Max. :0.9063205 Max. :3099.505
## dependents months majorcards active
## Min. :0.0000 Min. : 0.00 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 12.00 1st Qu.:1.0000 1st Qu.: 2.000
## Median :1.0000 Median : 30.00 Median :1.0000 Median : 6.000
## Mean :0.9939 Mean : 55.27 Mean :0.8173 Mean : 6.997
## 3rd Qu.:2.0000 3rd Qu.: 72.00 3rd Qu.:1.0000 3rd Qu.:11.000
## Max. :6.0000 Max. :540.00 Max. :1.0000 Max. :46.000
## 'data.frame': 1319 obs. of 12 variables:
## $ card : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ reports : int 0 0 0 0 0 0 0 0 0 0 ...
## $ age : num 37.7 33.2 33.7 30.5 32.2 ...
## $ income : num 4.52 2.42 4.5 2.54 9.79 ...
## $ share : num 0.03327 0.00522 0.00416 0.06521 0.06705 ...
## $ expenditure: num 124.98 9.85 15 137.87 546.5 ...
## $ owner : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 2 1 ...
## $ selfemp : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ dependents : int 3 3 4 0 2 0 2 0 0 0 ...
## $ months : int 54 34 58 25 64 54 7 77 97 65 ...
## $ majorcards : int 1 1 1 1 1 1 1 1 1 1 ...
## $ active : int 12 13 5 7 5 1 5 3 6 18 ...
## [1] 1319 12
In order to create the predictive model, the variables need to be in a numeric format. To do this, I recoded the data using a series of loops; the affected variables are shown below.
data[,1:12] <- sapply(data[,1:12], as.character)   # work with character values while recoding
n <- nrow(data)
## Recode the qualitative (yes/no) variables as 1/0
for (i in 1:n)
{if (data$card[i] == "yes"){data$card[i] <- 1} else {data$card[i] <- 0}}
for (i in 1:n)
{if (data$owner[i] == "yes"){data$owner[i] <- 1} else {data$owner[i] <- 0}}
for (i in 1:n)
{if (data$selfemp[i] == "yes"){data$selfemp[i] <- 1} else {data$selfemp[i] <- 0}}
data[,1:12] <- sapply(data[,1:12], as.numeric)     # convert every column to numeric
Now that the data is in the appropriate format, I split it into training and test sets using an 80/20 ratio. To keep the risk of overfitting low, I followed the rule of 10 (roughly one predictor for every ten events). To hand-pick the variables included in the model, I created a correlation plot.
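For reference, a minimal sketch of the split and model fit that would produce a summary like the one below; the seed and the exact splitting mechanics are assumptions, but the formula matches the fitted model.

set.seed(123)                                       # assumed seed for reproducibility
train_idx <- sample(seq_len(nrow(data)), size = round(0.8 * nrow(data)))
train <- data[train_idx, ]                          # ~80% of the rows
test  <- data[-train_idx, ]                         # remaining ~20%

model <- glm(card ~ reports + income + owner + selfemp + dependents +
               majorcards + active,
             family = binomial, data = train)
summary(model)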
##
## Call:
## glm(formula = card ~ reports + income + owner + selfemp + dependents +
## majorcards + active, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8635 0.1414 0.3772 0.6047 2.4323
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.09656 0.26951 0.358 0.720136
## reports -1.92571 0.17224 -11.180 < 2e-16 ***
## income 0.27226 0.07505 3.628 0.000286 ***
## owner 0.24965 0.21352 1.169 0.242326
## selfemp -0.99171 0.31939 -3.105 0.001903 **
## dependents -0.18273 0.07910 -2.310 0.020873 *
## majorcards 0.54040 0.21283 2.539 0.011114 *
## active 0.15726 0.02223 7.072 1.52e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1096.61 on 1055 degrees of freedom
## Residual deviance: 752.97 on 1048 degrees of freedom
## AIC: 768.97
##
## Number of Fisher Scoring iterations: 6
Figure 1: Age has the smallest correlation compared to the other variables. I decided to omit “share” and “expenditure” because they are too highly correlated, which prevents the model from fitting properly (it produces fitted probabilities numerically equal to 0 or 1). All of the variables in the model except owner are statistically significant at the 5% level.
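A correlation plot like the one described in Figure 1 can be produced with the corrplot package once all columns are numeric; whether the full dataset or only the training set was plotted is not stated, so the full dataset is assumed here.

library(corrplot)
corr_matrix <- cor(data)                   # pairwise correlations between all 12 numeric variables
corrplot(corr_matrix, method = "circle")   # visualize which predictors to keep or drop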
The model appears to predict approved cardholders with high accuracy. Although the overall accuracy is acceptable, the largest errors occur when predicting rejected applicants. This is likely because the model was trained on a heavily imbalanced dataset with significantly more acceptances than rejections. The summary of the predictions is shown below.
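A sketch of how the test-set predictions, confusion matrix, and accuracy figure below could be computed; the 0.5 probability cutoff and the table orientation (actual in rows, predicted in columns) are assumptions.

probs <- predict(model, newdata = test, type = "response")   # predicted probability of approval
pred  <- ifelse(probs > 0.5, 1, 0)                           # classify at an assumed 0.5 cutoff
conf  <- table(test$card, pred)                              # rows: actual, columns: predicted
conf
accuracy <- round(100 * sum(diag(conf)) / sum(conf))         # share of correct predictions
print(paste("The model predicted", accuracy, "% of the test dataset correctly"))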
##
## 0 1
## 0 34 36
## 1 7 186
## [1] "The model predicted 84 % of the test dataset correctly"
Data: https://www.kaggle.com/dansbecker/aer-credit-card-data