Abstract

Credit scoring is an essential practice in financial institutions: lenders use statistical analysis to estimate the probability that a customer will turn out to be a good or a bad loan applicant. This study examines a classic credit dataset describing customers’ creditworthiness. Logistic regression models were fitted to determine whether a customer is “bad” (value of 1) or “good” (value of 0). Age, total income (the sum of the applicant’s and spouse’s income) and residential status are found to be important determinants. To assess the chosen classifier, cross-validation and a confusion matrix were used. Although the data are dated, the techniques used in this study and the importance of these determinants provide helpful insights for current practice.

Introduction

Financial institutions seek a more reliable approach to credit scoring than the traditional judgmental approach. Granting loans to customers is one of the core businesses of a bank, so banks must have adequate systems for deciding to whom to lend. Credit scoring is a crucial risk assessment technique used to analyze and quantify a potential customer’s credit risk: the credit score indicates the likelihood that the person will repay the debt. Over time, banks have gathered a great deal of information describing the default behavior of their customers. For example, when considering whether an applicant is “good” or “bad” at repaying debt, a bank will look at the customer’s historical information on age, income, residential status, employment status and so on to gain insight into the likelihood of repayment.

For many financial analysts, customer data can be challenging to access because financial institutions protect the confidentiality of their customers. Because of this lack of up-to-date credit data, this study examines credit data that were collected in 2000. The techniques applied to this dataset aim to identify the connection between the characteristics of a loan applicant and whether the applicant turned out to be “bad” or “good”, and most of the methods lead to a scorecard. Although the data are dated, the regression techniques used in this report remain relevant to current practice. Further, the determinants investigated in this study, such as age and income, are likely to remain vital because they indicate to the lender whether to accept or reject a loan applicant.

I will study each predictor variable in relation to the target variable and assess its usefulness. These characteristics are summarized in the Model Selection and Interpretation section, and the Discussion of Model section discusses the chosen model and its classifier further. Concluding remarks follow, and more details on the analysis are given in the appendix.

Data Characteristic

The data are cross-sectional and have been randomly generated to exhibit characteristics similar to the actual data. The dataset consists of background information on 900 applicants, of which 240 are bad loans and 660 are good loans. The target variable is BAD: an applicant is classified as a “bad” customer if the customer did not repay the debt as intended. Because BAD is the outcome, it is not used as a predictor variable.

The following table shows the variables available and their definitions.

Item  Variable  Definition
1     DOB       Year of Birth
2     NKID      Number of children
3     DEP       Number of other dependents
4     PHON      Is there a home phone, 1 = yes, 0 = no
5     SINC      Spouse Income
6     AES       Applicant’s employment status
7     DAINC     Applicant’s Income
8     RES       Residential Status
9     DHVAL     Value of Home
10    DMORT     Mortgage Balance Outstanding
11    DOUTM     Mortgage or rent expenses
12    DOUTL     Expenses on Loans
13    DOUTHP    Expenses on Hire Purchase
14    DOUTCC    Expenses on Credit Cards

The outcome of interest is whether a customer is “bad”, coded as 1, or “good”, coded as 0. Before applying modeling techniques, it is important to understand the attributes and distributions of the variables, as this helps determine which variables are significant in predicting the target variable. In this dataset, there are 5 categorical variables and 9 continuous variables, as described below.

Based on the output, 26.7% of the loan applicants in this dataset were classified as bad customers.

mean(db$BAD)
## [1] 0.2666667

Bootstrap estimate of variability of BAD Credit Rate

We first examine the dataset using the bootstrap method. The bootstrap is used to estimate the standard error, or variability, of the bad credit rate by constructing a confidence interval. In our dataset, 240 records defaulted (bad customers) and the remaining 660 records repaid their loans, which gives an overall bad rate of 26.7%. This estimate is subject to sampling variability: if we sample (with replacement) from this set of records, we get a slightly different answer each time. Repeating the resampling many times, the middle 95% of the bootstrap estimates fall between 23.7% and 29.7%, which gives a 95% interval for the bad credit rate.

##  2.5%  Mean 97.5% 
## 0.237 0.267 0.297
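
The resampling described above can be sketched as follows. This is a minimal illustration rather than the report's own code: the response vector is rebuilt from the reported counts (240 bad, 660 good), and the seed, the number of resamples `B` and the name `boot_rate` are assumptions.

```r
# Rebuild the 0/1 response from the counts reported in the text
# (240 bad, 660 good); the original data frame `db` is not needed here.
bad <- c(rep(1, 240), rep(0, 660))

set.seed(1)   # arbitrary seed, chosen only for reproducibility
B <- 2000     # number of bootstrap resamples (an assumption)

# Each replicate resamples the 900 records with replacement and
# recomputes the bad rate.
boot_rate <- replicate(B, mean(sample(bad, length(bad), replace = TRUE)))

# Percentile interval for the bad rate, with the sample rate in the middle
ci <- c(quantile(boot_rate, 0.025), Mean = mean(bad), quantile(boot_rate, 0.975))
round(ci, 3)
```

The exact end points vary slightly with the seed and with `B`, but they should land close to the interval reported above.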

Before we fit the models, we subjectively group the continuous numeric variables into binned categorical variables, with cut-off points chosen by visual inspection. This is called coarse binning: values with similar risk or similar characteristics are grouped into as few bins as possible in order to minimize the standard error. Once the new grouped variables are created, the original continuous variables are removed.
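
Coarse binning of a continuous variable can be done with base R's `cut()`. The sketch below is illustrative only: the ages and the break points are hypothetical and are not the cut-offs chosen in this report.

```r
# Hypothetical ages; the break points below are illustrative only and are
# not the cut-offs used in the report.
age <- c(19, 23, 31, 38, 45, 52, 67, 74)

# cut() maps each value to the interval that contains it; right = FALSE
# makes the intervals closed on the left, e.g. [30,40).
age_bin <- cut(age, breaks = c(0, 30, 40, 60, Inf), right = FALSE)

# Number of records per bin
table(age_bin)
```

Fewer, wider bins put more records in each group, which is what keeps the standard error of each group's bad rate small.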

BAD Credit Rate by Variable

Age

The line in the graph below shows a slight dip between ages 30 and 40. This indicates that applicants aged between 30 and 40 have a lower probability, roughly 25%, of being classified as a bad customer. Beyond that range, the older the applicant, the higher the likelihood of being classified as a bad customer, possibly because of limitations such as the period over which they can repay the loan.

Furthermore, a careful look at the graph reveals a few unusual points. These points are not credible; they are probably due to the nature of the data, where the ages of some applicants may be missing. Overall, age seems to be a useful predictor variable based on the pattern in the graph.


Total of Dependents

On average, applicants with 2 children or fewer have a bad rate of roughly 26%, which suggests they are more likely to be good customers because the cost of raising their children is lower than for applicants with more than 2 children. However, the table below shows that applicants with 4 children have a bad rate of only 6.7% (1 of 15), and the single applicant with 5 children is a good customer, even though having more children should logically increase the financial burden. Therefore, Number of Children does not seem to be a good predictor variable to include in the model.

Applicants with no other dependents and with 1 other dependent have lower bad rates, roughly 26% and 30% respectively. Although the output shows that applicants with 2 other dependents have a 60% chance of being classified as a bad customer, this figure rests on only 5 records and is not credible. Therefore, the number of other dependents is not a good predictor to include in the model.

Because NKID and DEP have similar characteristics, we can combine them into a new variable, DEPEND. Its output is similar to that of NKID: applicants with fewer dependents tend to have a higher chance of being good customers, yet applicants with 4 or more dependents again appear to have a lower bad rate. Hence, the number of dependents is not a reliable variable for predicting the customer’s class.

##       Number of Children
##                 0         1          2          3           4 5
##   BAD  165.000000 26.000000 35.0000000 13.0000000  1.00000000 0
##   GOOD 444.000000 80.000000 98.0000000 23.0000000 14.00000000 1
##   %BAD   0.270936  0.245283  0.2631579  0.3611111  0.06666667 0
##       Number of other Dependents
##                  0          1   2
##   BAD  229.0000000  8.0000000 3.0
##   GOOD 639.0000000 19.0000000 2.0
##   %BAD   0.2638249  0.2962963 0.6
##       Number of Total Dependent
##                  0          1           2      3           4 5
##   BAD  161.0000000 25.0000000  35.0000000 18.000  1.00000000 0
##   GOOD 439.0000000 76.0000000 100.0000000 30.000 12.00000000 3
##   %BAD   0.2683333  0.2475248   0.2592593  0.375  0.07692308 0
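
One way to see why the small groups are not credible is to attach a standard error to each bad rate. The sketch below uses the counts from the Number of Children table above; the formula sqrt(p(1 - p)/n) is the usual large-sample binomial approximation and is an addition here, not part of the original analysis.

```r
# Counts copied from the "Number of Children" table above (0 to 5 children)
bad  <- c(165, 26, 35, 13, 1, 0)
good <- c(444, 80, 98, 23, 14, 1)

n    <- bad + good                    # records per group
rate <- bad / n                       # bad rate per group
se   <- sqrt(rate * (1 - rate) / n)   # approximate binomial standard error

tbl <- rbind(n = n, rate = rate, se = se)
colnames(tbl) <- 0:5
round(tbl, 3)
```

The group with 4 children has a much larger standard error than the group with none, so its low bad rate carries little weight. The zero standard error for the single applicant with 5 children is an artifact of the approximation, not a sign of precision.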

Home Phone

This variable records whether the applicant has a home phone: a value of 1 means they do and a value of 0 means they do not. Of the applicants with no home phone, 34.8% are bad customers, compared with 25.7% of those who have one. However, the difference is not significant; thus, we will not use this predictor variable, as it does not seem useful in explaining the target variable.

## Is there a home phone
##         0         1 
## 0.3478261 0.2574257

All Income

Based on the graph of the applicant’s income, the higher the income, the lower the probability of being classified as a bad customer, which falls to 25% or less at higher incomes. Furthermore, customers with no income have roughly a 50% probability of being a bad customer.

The graph of spouse’s income shows that applicants whose spouse earns $5,000 or less are less likely to be good customers. However, this variable is not as credible as the applicant’s income because fewer observations are recorded in the dataset. We can therefore improve it by combining Spouse Income and Applicant Income into a single total income variable.

In the graph of All Income, the points are evenly spread out and the pattern is similar to that of Applicant Income. An applicant with a higher income has a higher probability of being a good customer than an applicant with a lower income. Overall, total income seems to be a good predictor of whether an applicant is a good or bad customer.


Employment Status

In terms of employment status, an unemployed applicant has a higher probability of being a bad customer because it is uncertain whether the customer will be able to repay the debt. Employed applicants have a higher chance of meeting their obligations because they have a stable source of income. We can improve the variable by binning the sparsely populated categories (public sector, military, unemployed, other and no response) into a single “Other” group. This reduces the variability of the employment status variable and makes it a more useful predictor of the target variable.

Category  Employment Status
V         Government
W         Housewife
M         Military
P         Private Sector
B         Public Sector
R         Retired
E         Self-employed
T         Student
U         Unemployed
N         Other
Z         No response
##               B          E          M N           P          R          T
## BAD   8.0000000 28.0000000  2.0000000 0  90.0000000 41.0000000 23.0000000
## GOOD 15.0000000 61.0000000 12.0000000 3 304.0000000 45.0000000 62.0000000
## %BAD  0.3478261  0.3146067  0.1428571 0   0.2284264  0.4767442  0.2705882
##              U           V          W   Z
## BAD  4.0000000  28.0000000 13.0000000 3.0
## GOOD 3.0000000 136.0000000 16.0000000 3.0
## %BAD 0.5714286   0.1707317  0.4482759 0.5
##           Other          E           P          R          T           V
## BAD  17.0000000 28.0000000  90.0000000 41.0000000 23.0000000  28.0000000
## GOOD 36.0000000 61.0000000 304.0000000 45.0000000 62.0000000 136.0000000
## %BAD  0.3207547  0.3146067   0.2284264  0.4767442  0.2705882   0.1707317
##               W
## BAD  13.0000000
## GOOD 16.0000000
## %BAD  0.4482759
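
The grouping of the sparse categories can be reproduced from the counts in the first table above. This is a sketch working from the printed counts rather than from the original data frame; the names `grp`, `bad_g` and `good_g` are assumptions.

```r
# Counts copied from the ungrouped employment status table above
bad  <- c(B = 8,  E = 28, M = 2,  N = 0, P = 90,  R = 41, T = 23,
          U = 4,  V = 28,  W = 13, Z = 3)
good <- c(B = 15, E = 61, M = 12, N = 3, P = 304, R = 45, T = 62,
          U = 3,  V = 136, W = 16, Z = 3)

# Categories merged into "Other": public sector, military, other,
# unemployed and no response
other <- c("B", "M", "N", "U", "Z")
grp   <- ifelse(names(bad) %in% other, "Other", names(bad))

# Sum the counts within each (possibly merged) category
bad_g  <- tapply(bad, grp, sum)
good_g <- tapply(good, grp, sum)

round(rbind(BAD = bad_g, GOOD = good_g, "%BAD" = bad_g / (bad_g + good_g)), 4)
```

The merged group pools 53 applicants, which gives a steadier bad rate than any of its five component categories on its own.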

Residential Status

The residential status variable has 6 coded categories (O = Owner, F = Furnished Rental, U = Unfurnished Rental, P = Living with Parents, N = Other, Z = No Response), of which 5 appear in the data. Based on the output, we can condense the categories with similar bad rates (furnished rental, unfurnished rental and living with parents) into a single “Tenant” group. Based on the summary, applicants with a stable living arrangement, whether owners or tenants, have bad rates of roughly 24% to 27%, compared with 47.8% for the “N = Other” category. Even though the “N = Other” category lacks a clear definition, this variable is still useful in predicting the target variable because it distinguishes applicants who own or rent a place from the rest.

##           Tenant          N           O
## BAD   96.0000000 22.0000000 122.0000000
## GOOD 308.0000000 24.0000000 328.0000000
## %BAD   0.2376238  0.4782609   0.2711111

The remaining variables (value of home, mortgage balance outstanding, and outgoings on mortgage or rent, loans, hire purchase and credit cards) were also checked individually. There is insufficient evidence for them because of the lack of records in the dataset; hence these variables are omitted.

Overall, this dataset has a few significant predictor variables that can be used to create a new score function for use in financial institutions.

Model Selection and Interpretation

The Data Characteristic section has established that there are real patterns between the bad credit rate and the variables, despite the great variability in the variables. This section summarizes these patterns using regression modelling. Following the statement of the model and its interpretation, this section describes the features of the data that drove the selection of the recommended model.

For this study, the most suitable method for describing the relationship between a binary dependent variable and one or more quantitative or categorical variables is logistic regression, fitted with glm using the logit link. Logistic regression provides parameters that we can use to estimate the probability (P) of belonging to each class. The probability ranges between 0 and 1. For this dataset, the probability that a loan applicant is bad is P(Y=1) and the probability that a loan applicant is good is P(Y=0).

The best-fitting model for this dataset uses age and income as predictor variables:

\[ \text{logit}(\pi) = -1.381 + 0.02055 \cdot \text{age} -0.00003173 \cdot\text{income} \]

\[ \pi = \frac{e^{ -1.381 + 0.02055 \cdot \text{age} -0.00003173 \cdot\text{income}}}{1 + e^{ -1.381 + 0.02055 \cdot \text{age} -0.00003173 \cdot\text{income}}} \]

## 
## Call:
## glm(formula = BAD ~ age + income, family = binomial(link = "logit"), 
##     data = db)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4092  -0.7944  -0.6645   1.0513   2.1489  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.381e+00  2.677e-01  -5.159 2.48e-07 ***
## age          2.055e-02  4.702e-03   4.371 1.24e-05 ***
## income      -3.173e-05  5.142e-06  -6.171 6.79e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1043.85  on 899  degrees of freedom
## Residual deviance:  983.64  on 897  degrees of freedom
## AIC: 989.64
## 
## Number of Fisher Scoring iterations: 4

Interpretation of coefficients

  1. age - The coefficient, or parameter estimate, for the variable age is \(0.02055\). This means that for a one-unit increase in age, the odds of the customer being a “bad” loan applicant are multiplied by \(\exp(0.02055) = 1.02076\), an increase of about 2.1%, holding the other variable constant.

  2. income - The coefficient, or parameter estimate, for the variable income is \(-3.173\times 10^{-5}\). This means that for a one-unit increase in income, the odds of the customer being a “bad” loan applicant are multiplied by \(\exp(-3.173\times 10^{-5}) = 0.99997\), a very small decrease, holding the other variable constant.
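
The odds multipliers quoted in the two items above can be checked directly from the fitted coefficients. This is a small sketch using the estimates printed in the glm() summary; the 10,000 step for income is an illustrative choice, not a figure from the report.

```r
# Coefficients copied from the glm() summary above
b <- c(intercept = -1.381, age = 0.02055, income = -3.173e-05)

# Multiplicative change in the odds of "bad" for a one-unit increase
odds_ratio <- exp(b[c("age", "income")])
round(odds_ratio, 5)

# Income is measured in single currency units, so a larger step is easier
# to read: multiplier on the odds for an extra 10,000 of income
or_income_10k <- exp(10000 * b[["income"]])
or_income_10k
```

Because the effect is multiplicative on the odds, a per-unit multiplier that looks negligible for income compounds into a substantial reduction over a realistic income difference.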

Scenario 1:

At age 50 with an annual income of 55,000, our model estimates the odds to be equal to \[ \exp(-1.381 + 0.02055 \times 50 - 0.00003173 \times 55,000) = 0.1226 \]

and hence the probability is equal to \[ \pi = \frac{0.1226}{1.1226} = 0.109 \]

Scenario 2:

At age 20 with an annual income of 20,000, our model estimates the odds to be equal to \[ \exp(-1.381 + 0.02055 \times 20 - 0.00003173 \times 20,000) = 0.2010 \]

and hence the probability is equal to \[ \pi = \frac{0.2010}{1.2010} = 0.167 \]

Comparing scenario 1 and scenario 2, an applicant aged 50 with an income of $55,000 has a lower probability of being considered a “bad” customer, 0.109, than an applicant aged 20 with an income of $20,000, 0.167. The difference is about 5.8 percentage points.

With this model, we can calculate the probability that a loan applicant is bad based on age and income. To turn this probability into an actual classification, 1 meaning the loan applicant is considered “bad” or 0 meaning the loan applicant is considered “good”, we need to pick a threshold \(\pi_0\). Our classifier returns “0” if the estimated probability \(\pi\) given the age and income is below the chosen threshold \(\pi_0\). Otherwise, it returns “1”, meaning the loan applicant is considered a bad customer.
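
The thresholding rule can be written as a short function. The sketch below uses the coefficients as printed in the glm() summary; the function names `p_bad` and `classify` and the two example thresholds are assumptions made for illustration.

```r
# Coefficients copied from the glm() summary above
b0    <- -1.381
b_age <- 0.02055
b_inc <- -3.173e-05

# Estimated probability of "bad": inverse logit of the linear predictor
p_bad <- function(age, income) plogis(b0 + b_age * age + b_inc * income)

# Classifier: 0 ("good") when the probability is below the threshold pi0,
# otherwise 1 ("bad")
classify <- function(age, income, pi0 = 0.5) {
  as.integer(p_bad(age, income) >= pi0)
}

# Two applicants: age 50 with income 55,000 and age 20 with income 20,000
p <- p_bad(age = c(50, 20), income = c(55000, 20000))
round(p, 3)

cls_50 <- classify(c(50, 20), c(55000, 20000), pi0 = 0.5)
cls_15 <- classify(c(50, 20), c(55000, 20000), pi0 = 0.15)
```

With a threshold of 0.5 both applicants are classified as good, while a lower threshold of 0.15 flags the younger, lower-income applicant as bad. The choice of \(\pi_0\) therefore changes the classification even though the fitted probabilities stay the same.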

Discussion of Model

Discussion on Model Selection

In this study, five different models were considered. In addition to the standard inclusion of variables, some variables were altered using feature engineering, such as binning. Feature engineering is the practice of deriving new variables from existing ones to obtain the best possible patterns between the predictor and target variables. For this dataset, the candidate predictor variables were, among the categorical variables, Employment Status, Residential Status and Outgoings on Credit Card, and, among the continuous variables, Age of the applicant and total income (the sum of the applicant’s and spouse’s income).

Goodness of Fit

A logistic regression model is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is assessed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Removing predictor variables from a model will almost always make the model fit less well, that is, give a lower log-likelihood, so it is necessary to test whether the observed difference in model fit is statistically significant.
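
As an illustration of the likelihood ratio test, the age + income model can be compared with the intercept-only model using the two deviances printed in the glm() summary earlier. This is a sketch based on those printed values, not a rerun of the original fit.

```r
# Deviances copied from the glm() summary for BAD ~ age + income
null_dev  <- 1043.85   # intercept-only model, 899 degrees of freedom
resid_dev <- 983.64    # age + income model, 897 degrees of freedom

# Likelihood ratio statistic: the drop in deviance between nested models
lr_stat <- null_dev - resid_dev
df      <- 899 - 897

# Upper-tail chi-squared probability of a drop at least this large
p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)
c(statistic = lr_stat, df = df, p.value = p_value)
```

The drop of about 60.2 in deviance on 2 degrees of freedom is far beyond what chance would produce, so age and income together improve on the null model. With the fitted model objects available, `anova(small, large, test = "Chisq")` performs the same comparison.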

Another approach to assessing goodness of fit is the Hosmer-Lemeshow statistic. Observations with similar predicted probabilities are split into groups, and the observed and expected counts are compared within each group. Small values of the statistic, with large p-values, indicate a good fit to the data, while large values with p-values below 0.05 indicate a poor fit. The null hypothesis holds that the model fits the data; in the example below, all five p-values exceed 0.05, so we would not reject \(H_0\) for any of the models.

We have chosen the model with the predictor variables age and income (f02). Comparing the statistics and p-values, this model best meets the goodness of fit criterion, as it has the lowest statistic, 4.987, and the highest p-value, 0.759. This indicates that the model is a good fit to the data.

rbind("f01" = HL(db$BAD, predict(f01, type = "response")),
      "f02" = HL(db$BAD, predict(f02, type = "response")),
      "f03" = HL(db$BAD, predict(f03, type = "response")),
      "f04" = HL(db$BAD, predict(f04, type = "response")),
      "f05" = HL(db$BAD, predict(f05, type = "response")))
##      HL Stat.    P-Value
## f01  8.543669 0.38223957
## f02  4.987032 0.75896127
## f03  7.822068 0.45104143
## f04  9.057220 0.33749197
## f05 14.378968 0.07240702
round(M * 100, 2)
##     Accuracy Precision Sensitivity Specificity
## f01    73.67     52.54       12.92       95.76
## f02    73.89     53.16       17.50       94.39
## f03    74.11     53.68       21.25       93.33
## f04    74.44     55.32       21.67       93.64
## f05    74.11     53.61       21.67       93.18
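
The `HL()` helper called above is not defined in the report. The sketch below shows what such a function might look like, assuming 10 groups formed from quantiles of the predicted probabilities; the name `hl_test`, the grouping rule and the simulated data are all assumptions.

```r
# Hypothetical helper; the report's own HL() is not shown.
hl_test <- function(y, p, g = 10) {
  # Group observations into g bins of similar predicted probability
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp <- cut(p, breaks = breaks, include.lowest = TRUE)

  obs <- tapply(y, grp, sum)      # observed number of bad cases per group
  ex  <- tapply(p, grp, sum)      # expected number of bad cases per group
  n   <- tapply(y, grp, length)   # group size

  # Hosmer-Lemeshow statistic and its chi-squared reference distribution
  stat <- sum((obs - ex)^2 / (ex * (1 - ex / n)))
  df   <- length(n) - 2
  c("HL Stat." = stat, "P-Value" = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data, used only to exercise the function
set.seed(2)
x   <- rnorm(900)
y   <- rbinom(900, 1, plogis(-1 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)
res <- hl_test(y, fitted(fit))
res
```

With 10 groups the reference distribution has 8 degrees of freedom, which is consistent with the values reported for f02: `pchisq(4.987032, 8, lower.tail = FALSE)` gives about 0.759.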

Our second best model included the following predictor variables:

  1. Age
  2. Income
  3. Residential Status

For the second-best model (f04), house.stat, which is Residential Status, was added. Residential status was a significant variable when we built the model. However, comparing the two graphs of deviance residuals against predicted probabilities below, its contribution to the reduction of deviance was minimal. The graph on the left shows the deviance residuals for the best-fit model and the graph on the right shows the deviance residuals for the second-best model.

grid.arrange(f02_plt, f04_plt, ncol=2,nrow=2)

Confusion Matrix

In this study, the ability of the model to discriminate between “good” and “bad” applicants is evaluated by Receiver Operating Characteristic (ROC) curve analysis. ROC curves can also be used to compare the performance of two or more classifiers. Before we explain the ROC curve, we first introduce the confusion matrix. When choosing among multiple classifiers for the final credit scoring model, an important criterion is classification accuracy, the most direct measure of model performance: the best model is the one that predicts the class of a new case most accurately. We can arrange the results of our classifier into a \(2 \times 2\) matrix called the confusion matrix.

The rows of the confusion matrix represent the predicted values of our classifier and the columns represent the true condition. Taking “bad” as the positive class, the cells are defined as:

  1. True Positive: The number of “bad” customers that are correctly predicted as “bad”.
  2. False Negative: The number of “bad” customers that are incorrectly predicted as “good”.
  3. False Positive: The number of “good” customers that are incorrectly predicted as “bad”.
  4. True Negative: The number of “good” customers that are correctly predicted as “good”.

Based on the confusion matrix, there are several metrics we can calculate:

  1. Accuracy
  2. Precision
  3. Sensitivity
  4. Specificity

The confusion matrix shows that 638 “good” applicants are correctly classified as “good”, while 208 “bad” applicants are wrongly classified as “good”.

##        tc.POS tc.NEG
## pc.POS     32     22
## pc.NEG    208    638

From the output, the accuracy of the model is 0.7444. This value is relatively high, partly because it is computed on the same data that was used to fit the parameters.

Sensitivity is the percentage of “bad” customers that were predicted correctly; for this classifier, the value is 0.1333. That is, our model correctly identified 13.33% of the bad loan applicants.

Specificity is the percentage of “good” customers that were predicted correctly; for this classifier, the value is 0.9667. That is, our model correctly identified 96.67% of the good loan applicants.

However, the model does far better on specificity than on sensitivity. Ideally all four metrics would be high; for this model, sensitivity is the lowest. For a bank or financial institution, the cost of failing to identify a loan that defaults is much higher than the cost of misclassifying a loan that does not default. Therefore, the selection of a classification model is a decision with multiple criteria; in our case, we pursue a more reliable classifier with higher accuracy.

##    Accuracy   Precision Sensitivity Specificity 
##      0.7444      0.5926      0.1333      0.9667
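
The four metrics can be recomputed from the counts in the confusion matrix above. This is a sketch from the printed counts; the object names `cm` and `metrics` are assumptions.

```r
# Counts copied from the confusion matrix above.
# Rows are the predicted class, columns the true class; "POS" means "bad".
cm <- matrix(c(32, 208, 22, 638), nrow = 2,
             dimnames = list(c("pc.POS", "pc.NEG"), c("tc.POS", "tc.NEG")))

TP <- cm["pc.POS", "tc.POS"]   # bad, predicted bad
FP <- cm["pc.POS", "tc.NEG"]   # good, predicted bad
FN <- cm["pc.NEG", "tc.POS"]   # bad, predicted good
TN <- cm["pc.NEG", "tc.NEG"]   # good, predicted good

metrics <- c(Accuracy    = (TP + TN) / sum(cm),
             Precision   = TP / (TP + FP),
             Sensitivity = TP / (TP + FN),
             Specificity = TN / (TN + FP))
round(metrics, 4)
```

Accuracy is dominated by the 660 good applicants, which is why it stays high even though only 32 of the 240 bad applicants are caught.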

ROC Curve and AUC

The ROC curve summarizes the performance of a classification model across all thresholds that could be used to turn the predicted probability into a class. Its overall discrimination is measured by the Area Under the Curve (AUC); in our case the value is 0.6515, meaning that a randomly chosen bad applicant receives a higher predicted probability than a randomly chosen good applicant about 65.2% of the time. In the plot, we aim to push the curve toward the upper-left corner and so maximize the area under the curve, up to a maximum of 1. The higher the curve, the better the model.

pred <- prediction(db$p, db$BAD)
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.6515499
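
The AUC can also be computed without a plotting package, using its interpretation as the probability that a bad applicant is scored higher than a good one. The sketch below is a base R illustration with a small made-up example; `auc_rank` is a hypothetical helper, not the function used in the report.

```r
# AUC as the probability that a randomly chosen "bad" applicant receives a
# higher score than a randomly chosen "good" one (ties count one half).
auc_rank <- function(score, label) {
  r  <- rank(score)          # mid-ranks handle tied scores
  n1 <- sum(label == 1)      # number of bad applicants
  n0 <- sum(label == 0)      # number of good applicants
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny hand-checkable example: 2 bad and 3 good applicants
score <- c(0.9, 0.4, 0.7, 0.3, 0.2)
label <- c(1,   1,   0,   0,   0)
auc_example <- auc_rank(score, label)
auc_example
```

In this example there are 6 bad-good pairs and the bad applicant has the higher score in 5 of them, so the function returns 5/6.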

Based on the table of threshold values, if we increase the threshold, for example from 0.5 to 0.7, the accuracy and sensitivity decrease.

Based on the ROC curve, a cutoff value of 0.2 gives a better tradeoff between specificity and sensitivity.
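
The effect of the cutoff on the metrics can be tabulated with a simple sweep. The sketch below runs on simulated scores, not on the credit data, so its numbers will differ from those in the report; `sweep_threshold` is a hypothetical helper.

```r
# Simulated labels and scores, used only for illustration
set.seed(3)
y <- rbinom(900, 1, 0.27)
p <- plogis(-1.2 + 0.8 * y + rnorm(900, sd = 0.8))

# One row of metrics per cutoff
sweep_threshold <- function(y, p, cutoffs) {
  t(sapply(cutoffs, function(c0) {
    pred <- as.integer(p >= c0)   # classify as bad at or above the cutoff
    c(cutoff      = c0,
      accuracy    = mean(pred == y),
      sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
      specificity = sum(pred == 0 & y == 0) / sum(y == 0))
  }))
}

res <- sweep_threshold(y, p, cutoffs = c(0.2, 0.3, 0.5, 0.7))
round(res, 3)
```

Raising the cutoff can only reduce the number of applicants flagged as bad, so sensitivity never increases and specificity never decreases as the cutoff goes up. Choosing a cutoff is a matter of picking the point on that tradeoff that suits the lender's costs.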

Cross Validation

The parameters are chosen to fit the data as well as possible. Therefore, measures of goodness of fit computed on the same data are not to be trusted entirely.

As noted in the Confusion Matrix section, accuracy calculated from the same sample that was used to estimate the model coefficients is too optimistic, because it is based on data that have already been used in the fitting process; this is known as apparent accuracy. Ideally we would estimate the accuracy on a new credit dataset, but none is available. The way to solve this problem is a technique called cross-validation, which gives a better estimate of the accuracy of the model.

For our study, we only have 900 observations; hence, the data are split into \(K = 10\) equal parts of 90 observations each.

One fold is held out for validation while the other \(K - 1 = 9\) folds are used to train the model, which is then used to predict the target variable in the held-out fold, that is, whether a loan applicant is a “good” or “bad” customer. This process is repeated ten times, so that each fold is held out once.

set.seed(314337)
db$fold <- sample(c(rep(0, 90), rep(1, 90), rep(2, 90),
                      rep(3, 90), rep(4, 90), rep(5, 90),
                      rep(6, 90), rep(7, 90), rep(8, 90), rep(9, 90)),
                    nrow(db),
                    replace = FALSE)
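
The fold assignment above is followed by a loop over the folds, which the report does not show. A minimal sketch of that loop is given below on simulated data; the data frame `sim`, its coefficients, the 0.5 cutoff and the seed are assumptions, and only the structure mirrors the procedure described.

```r
# Simulated stand-in for the credit data; values are hypothetical
set.seed(4)
n   <- 900
sim <- data.frame(age    = round(runif(n, 18, 80)),
                  income = round(rexp(n, 1 / 30000)))
sim$BAD <- rbinom(n, 1, plogis(-1.4 + 0.02 * sim$age - 3e-05 * sim$income))

# Same idea as the fold assignment above: 10 folds of 90 records each
sim$fold <- sample(rep(0:9, each = 90))

acc <- sapply(0:9, function(k) {
  train <- sim[sim$fold != k, ]   # nine folds for fitting
  test  <- sim[sim$fold == k, ]   # one fold held out
  fit   <- glm(BAD ~ age + income, family = binomial, data = train)
  p     <- predict(fit, newdata = test, type = "response")
  mean(as.integer(p >= 0.5) == test$BAD)   # 0.5 cutoff is an assumption
})

cv_accuracy <- mean(acc)   # average accuracy over the ten held-out folds
cv_accuracy
```

Each record is predicted exactly once by a model that never saw it, which is what makes the averaged accuracy a fairer estimate than the apparent accuracy.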

From the output, we observe that the cross-validated accuracy (0.7467) is slightly higher than the whole-sample accuracy (0.7411), a difference of 0.56 percentage points, while the cross-validated sensitivity (0.0957) is well below the whole-sample value (0.2125). Even though the difference in accuracy does not seem large, in a real-world application a change of even 1% in accuracy can make a huge difference.

Overall, on average, our model classifies roughly 74% to 75% of applicants correctly.

whole.sample <- M[3,]
tbl <- rbind("Whole Sample" = whole.sample,
      "Cross-Validated" = fld.means,
      "Difference" = whole.sample - fld.means)
dimnames(tbl)[[2]] <- c("Accuracy", "Precision", "Sensitivity", "Specificity")
round(tbl,4)
##                 Accuracy Precision Sensitivity Specificity
## Whole Sample      0.7411    0.5368      0.2125      0.9333
## Cross-Validated   0.7467    0.6143      0.0957      0.9811
## Difference       -0.0056   -0.0774      0.1168     -0.0478

Summary and Concluding Remarks

Customer credit scoring is a concern for numerous financial institutions, as it assesses the risk of providing a loan to a particular customer. The recommended regression model concludes that whether a loan applicant is classified as “good” or “bad” can be explained in terms of the applicant’s age and income. Several models were developed and tested for goodness of fit, from which we derived the best model containing significant predictor variables. Furthermore, methods such as cross-validation and the confusion matrix were used to evaluate the accuracy of the classifier. Evaluating the classifier in this way helps us arrive at a better predictive model by reducing the risk of misclassification.

This study was based on 900 loan applicants. This is a small dataset compared with the datasets used in current practice. One might consider any number of additional variables; the loan repayment period and existing active loans, such as a car loan, are good candidates.

Moreover, my analysis is based on credit data collected in early 2000. The lessons learned from this report will be good practice for future financial analysts seeking to assess the creditworthiness of a loan applicant, that is, whether the customer is able to repay the loan. The techniques explored in this report remain useful when combined with an appropriate set of current data and practices.

References

de Jong, Piet and Gillian Z. Heller, Generalized Linear Models for Insurance Data, 2008, Cambridge University Press.

Frees, Edward W., Regression Modeling with Actuarial and Financial Applications, 2010, Cambridge University Press.

Thomas, L. C., et al. Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics, 2002.

Appendix

Sequence of Models

Model Variables
f00 NULL (intercept only)
f01 age + age^2
f02 age + income
f03 age + income + emp.stat
f04 age + income + house.stat
f05 age + income + house.stat + DOUTCC

Score function

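score <- function(newdata){
  db <- newdata
  
  # DOB is a two-digit year of birth; age is measured as of 2000
  db$age <- 2000 - (1900 + db$DOB)
  # An age of 1 (DOB coded 99) is presumably a missing-value code; reset it to 0
  idx.age <- which(db$age == 1)
  db$age[idx.age] <- 0
  # Total income is the sum of the applicant's and spouse's income
  db$income <- db$DAINC + db$SINC
  
  # Predicted probability of "bad" from the age + income model (f02)
  p <- predict(f02, newdata = db, type = "response")
  # Classify as bad (1) when the probability exceeds the 0.55 cutoff
  ans <- ifelse(p > 0.55, 1, 0)
  return(ans)
}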
summary(score(db))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03889 0.00000 1.00000

qq <- quantile(hl, probs = c(0.025, 0.5, 0.975))
ggplot(data.frame("hl" = hl)) +
  aes(x = hl) +
  geom_histogram() +
  geom_vline(xintercept = qq, col = "red")

tbl <- rbind(tapply(db$BAD, db$RES, sum),
             tapply(1-db$BAD, db$RES, sum),
             tapply(db$BAD, db$RES, mean))
dimnames(tbl)[[1]] <- c("BAD", "GOOD", "%BAD")
tbl
##               F          N           O           P          U
## BAD  23.0000000 22.0000000 122.0000000  41.0000000 32.0000000
## GOOD 74.0000000 24.0000000 328.0000000 146.0000000 88.0000000
## %BAD  0.2371134  0.4782609   0.2711111   0.2192513  0.2666667