The end to end project is the third and final one in this portfolio. Similar to the previous projects, it will look to extract insights from data in order to present to others. However it will then build on this to create a customer-facing system that can be run multiple times with different pieces of data to generate different outputs. It will do this by applying a machine learning model to an original set of data and then creating a user-friendly application in order to input new data into the model. The following steps will be used to achieve the above:
Import and clean a data set
Create predictions using training and test data and relevant machine learning models
*Use the model to design a customer-facing application
Credit risk is the risk of a borrow defaulting on a debt and that the lender may lose the principal of the loan or associated interest. When a bank receives a loan application, it has to make a decision as to whether to approve the loan or not based on the applicant’s profile. If the bank deems the applicant to have bad credit risk, it means the applicant is not likely to repay the loan and approving the loan could result in financial loss to the bank.
The purpose of this project is to take a data set of loan applications and build a predictive model for making a decision as to whether to approve a loan based on the applicant’s profile. An application will then be built which is intended to provide guidance to a bank manager for making this decision.
The data set used will be the numeric data which is more suitable for some algorithms that cannot cope with categorical variables.
credit <- read.csv("C:/Users/shafi/OneDrive/Desktop/resume project/credit_app/German_Credit_Data.csv", header = TRUE)
str(credit)
## 'data.frame': 1000 obs. of 21 variables:
## $ checking_status : chr "<0" "0<=X<200" "no checking" "<0" ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : chr "critical/other existing credit" "existing paid" "critical/other existing credit" "existing paid" ...
## $ purpose : chr "radio/tv" "radio/tv" "education" "furniture/equipment" ...
## $ credit_amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_status : chr "no known savings" "<100" "<100" "<100" ...
## $ employment : chr ">=7" "1<=X<4" "4<=X<7" "4<=X<7" ...
## $ installment_commitment: int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status : chr "male single" "female div/dep/mar" "male single" "male single" ...
## $ other_parties : chr "none" "none" "none" "guarantor" ...
## $ residence_since : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property_magnitude : chr "real estate" "real estate" "real estate" "life insurance" ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_payment_plans : chr "none" "none" "none" "none" ...
## $ housing : chr "own" "own" "own" "for free" ...
## $ existing_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : chr "skilled" "skilled" "unskilled resident" "skilled" ...
## $ num_dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ own_telephone : chr "yes" "none" "none" "none" ...
## $ foreign_worker : chr "yes" "yes" "yes" "yes" ...
## $ class : chr "good" "bad" "good" "good" ...
# Find all character columns
char_cols <- sapply(credit, is.character)
# Convert each character column to numeric
credit_numeric <- credit
for (col in names(credit)[char_cols]) {
# Convert to factor first (creates groups)
# Then convert to numeric (gives 1, 2, 3... for each unique value)
credit_numeric[[col]] <- as.numeric(as.factor(credit[[col]]))
}
str(credit_numeric)
## 'data.frame': 1000 obs. of 21 variables:
## $ checking_status : num 1 3 4 1 1 4 4 3 4 3 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : num 2 4 2 4 3 4 4 4 4 2 ...
## $ purpose : num 7 7 3 4 5 3 4 10 7 5 ...
## $ credit_amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_status : num 5 1 1 1 1 5 4 1 2 1 ...
## $ employment : num 2 3 4 4 3 3 2 3 4 5 ...
## $ installment_commitment: int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status : num 4 1 4 4 4 4 4 4 2 3 ...
## $ other_parties : num 3 3 3 2 3 3 3 3 3 3 ...
## $ residence_since : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property_magnitude : num 4 4 4 2 3 3 2 1 4 1 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_payment_plans : num 2 2 2 2 2 2 2 2 2 2 ...
## $ housing : num 2 2 2 1 1 1 2 3 2 2 ...
## $ existing_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : num 2 2 4 2 2 4 2 1 4 1 ...
## $ num_dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ own_telephone : num 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : num 2 2 2 2 2 2 2 2 2 2 ...
## $ class : num 2 1 2 2 1 2 2 2 2 1 ...
One thing to note immediately is that three of the columns contain continuous variables rather than categorical data (duration of credit, credit amount and age). This is potentially important information in deciding credit risk and therefore one solution is to transform the data into categorical variables using the cut function.
# ============================================================================
# SIMPLER VERSION - Each Column Handled Individually
# ============================================================================
# Step 1: Convert continuous to categorical FIRST (before splitting)
credit$duration_cat <- cut(
credit$duration,
breaks = c(0, 12, 24, 36, Inf),
labels = c(1, 2, 3, 4),
include.lowest = TRUE
)
credit$credit_amount_cat <- cut(
credit$credit_amount,
breaks = c(0, 1000, 5000, 10000, Inf),
labels = c(1, 2, 3, 4),
include.lowest = TRUE
)
credit$age_cat <- cut(
credit$age,
breaks = c(0, 25, 40, 60, Inf),
labels = c(1, 2, 3, 4),
include.lowest = TRUE
)
head(credit[, c("duration_cat", "credit_amount_cat", "age_cat")], 10)
## duration_cat credit_amount_cat age_cat
## 1 1 2 4
## 2 4 3 1
## 3 1 2 3
## 4 4 3 3
## 5 2 2 3
## 6 3 3 2
## 7 2 2 3
## 8 3 3 2
## 9 1 2 4
## 10 3 3 2
The new structure of the three columns can be seen above. The categories have changed as follows.
Duration of Credit (month):
0 - 12 months 13 - 18 months 19 - 24 months Over 24 months
Credit Amount:
0 - 1,000 DM 1,001 - 5,000 DM 5,001 - 10,000 DM Over 10,000 DM
Age:
18 - 25 26 - 40 41 - 60 Over 60 Finally, the remaining columns can be converted to factors.
for(i in 1:21) credit_numeric[, i] <- as.factor(credit_numeric[, i])
Before starting the modeling phase, it is important to explore the data to get an idea of any patterns or areas of interest.
The first thing is to examine how many examples of good and bad credit risk there are.
g <- ggplot(credit_numeric, aes(class)) +
geom_bar(fill = "#4EB25A") +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
scale_y_continuous(breaks=seq(0,700,100)) +
scale_x_discrete(labels = c("Bad","Good")) +
ggtitle("Count of Good and Bad Credit Risks")
g
The plot shows 300 examples of bad credit risk applicants versus 700
good. This is something that should be noted later when splitting the
data set into training and test sets.
The next step is to explore some of the variables in the data. For example, it might be a fair assumption that amount of total savings is strongly linked to the credit risk of the applicant i.e. an applicant with little money in their account is a higher credit risk than one with a lot of savings? Another plot can be produced to confirm this.
g <- ggplot(credit_numeric, aes(savings_status, fill = class), stat="identity") +
geom_bar() +
scale_fill_manual(values = c("#D3D6D4", "#4EB25A"), labels=c("Bad","Good")) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
scale_y_continuous(breaks=seq(0,700,100)) +
scale_x_discrete(labels = c("< 100 DM", "100-500 DM", "500-1000 DM", "> 1000 DM", "Unknown")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(axis.text.y = element_text(size = 10)) +
theme(legend.text=element_text(size=10)) +
theme(legend.title=element_text(size=12)) +
ggtitle("Good and Bad Credit Risks by Credit History")
## Warning in fortify(data, ...): Arguments in `...` must be used.
## ✖ Problematic argument:
## • stat = "identity"
## ℹ Did you misspell an argument name?
g
The plot seems to back up the rationale. A higher percentage of
applicants with less savings are deemed as having bad credit risk.
Another area to explore is how credit risk relates to employment status. There are four statuses for employment in the data:
Unemployed/ Unskilled (non-resident) Unskilled (resident) Skilled employee / Official Management / Self-employed / Highly Qualified Employee / Officer
g <- ggplot(credit_numeric, aes(job, fill = class), stat="identity") +
geom_bar() +
scale_fill_manual(values = c("#D3D6D4", "#4EB25A"), labels=c("Bad","Good")) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
scale_y_continuous(breaks=seq(0,700,100)) +
scale_x_discrete(labels = c("Unemployed", "Unskilled", "Skilled", "Management")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(axis.text.y = element_text(size = 10)) +
theme(legend.text=element_text(size=10)) +
theme(legend.title=element_text(size=12)) +
ggtitle("Good and Bad Credit Risks by Occupation")
## Warning in fortify(data, ...): Arguments in `...` must be used.
## ✖ Problematic argument:
## • stat = "identity"
## ℹ Did you misspell an argument name?
g
There appears to be less of a link with occupation. Most of the
applicants come under ‘unskilled employee’ but the creditability of
unemployed and management/highly qualified employees does not appear
significantly different. However further statistical modeling is needed
to support this initial analysis.
Finally, some exploration can be performed on one of the new categorical variables created above. This example will look at age
g <- ggplot(credit, aes(age_cat, fill = class), stat="identity") +
geom_bar() +
scale_fill_manual(values = c("#D3D6D4", "#4EB25A"), labels=c("Bad","Good")) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
scale_y_continuous(breaks=seq(0,700,100)) +
scale_x_discrete(labels = c("18-25", "26-40", "41-60", "60+")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10)) +
theme(axis.text.y = element_text(size = 10)) +
theme(legend.text=element_text(size=10)) +
theme(legend.title=element_text(size=12)) +
ggtitle("Good and Bad Credit Risks by Age")
## Warning in fortify(data, ...): Arguments in `...` must be used.
## ✖ Problematic argument:
## • stat = "identity"
## ℹ Did you misspell an argument name?
g
This analysis indicates perhaps some decrease in credit risk with age.
However there is probably a lot of correlation between age and other
factors such as savings and property so further statistical analysis is
needed.
The first step before applying models is to create training and test data sets. The data will be split 70/30 and spread evenly between good and bad credit risks using the CreateDataPartition function in the caret package
# Step 2: Convert numeric column to factor
char_cols <- c("checking_status", "credit_history", "purpose",
"savings_status", "employment", "personal_status",
"other_parties", "property_magnitude", "other_payment_plans",
"housing", "job", "own_telephone", "foreign_worker", "class")
for (col in char_cols) {
credit[[col]] <- as.factor(credit[[col]])
}
# Step 3: NOW split into train/test
set.seed(123)
train_index <- sample(1:nrow(credit), 0.7 * nrow(credit))
train <- credit[train_index, ]
test <- credit[-train_index, ]
The first model is logistic regression using the glm() function.
# Step 4: Build model using categorical columns (NOT original numeric)
lmModel <- glm(class ~ duration_cat + credit_amount_cat + age_cat +
checking_status + credit_history + purpose,
data = train, family = "binomial")
# Step 5: Predict - this will work now!
predictions <- predict(lmModel, newdata = test, type = "response")
predicted_class <- ifelse(predictions > 0.5, "good", "bad")
# Check accuracy
accuracy <- mean(predicted_class == as.character(test$class))
cat("Model Accuracy:", round(accuracy * 100, 2), "%\n")
## Model Accuracy: 74 %
# Compare predictions to test set
lmPred <- prediction(predictions, test$class)
# Create Area Under the Curve (AUC) plot
plot(performance(lmPred, 'tpr', 'fpr'))
The AUC of the model is 0.74. This is a measure of the model’s
performance by evaluating the trade off between the true positive and
false positive rate i.e. how good is the model at identifying good
creditability risk without falsely identifying bad risks as good?
This is a fairly good score but the next sections will look at classification trees and random forests to try and improve on this.
set.seed(123)
dtModel <- rpart(class ~ ., data=train)
rpart.plot(dtModel)
As before, the model is fit to the test data to analyse the
performance.
dtFit <- predict(dtModel, test, type = 'prob')[, 2]
dtPred <- prediction(dtFit, test$class)
plot(performance(dtPred, 'tpr', 'fpr'))
performance(dtPred, measure = 'auc')@y.values[[1]]
## [1] 0.7200266
The final model is Random Forest. Random forests operate by constructing a number of decision trees on the training data set and outputting the class that is the mode of the classes across the decision trees.
set.seed(2828)
rfModel <- randomForest(class ~ ., data=train)
rfFit <- predict(rfModel, test, type = 'prob')[,2]
rfPred <- prediction(rfFit, test$class)
plot(performance(rfPred, 'tpr', 'fpr'))
performance(rfPred, measure = 'auc')@y.values[[1]]
## [1] 0.7988664
The Random Forest model returns an AUC of 0.7988664 which is slightly better than the logistic regression model. This is the model which will be used for the final application.
The plot below shows the rank of importance for variables in the Random Forest model. Account balance is ranked as the most significant measurement in the model with purpose second. Purpose identifies the reason for the applicant’s request for credit e.g. car, education, business etc.
par(mfrow=c(1,1))
varImpPlot(rfModel, pch=1, main="Random Forest Model Variables Importance")
The confusion matrix below shows the split between prediction success of good (1) and bad (0) credit risks and an overall accuracy of 77%.
rfCM <- confusionMatrix(test$class,
predict(rfModel, test, type="class"))
rfCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 42 54
## good 15 189
##
## Accuracy : 0.77
## 95% CI : (0.7182, 0.8164)
## No Information Rate : 0.81
## P-Value [Acc > NIR] : 0.9646
##
## Kappa : 0.4078
##
## Mcnemar's Test P-Value : 4.77e-06
##
## Sensitivity : 0.7368
## Specificity : 0.7778
## Pos Pred Value : 0.4375
## Neg Pred Value : 0.9265
## Prevalence : 0.1900
## Detection Rate : 0.1400
## Detection Prevalence : 0.3200
## Balanced Accuracy : 0.7573
##
## 'Positive' Class : bad
##
Conclusion The above example shows the way in which a bank manager could use the application to guide decision making on credit applications. Although the application is quite basic, it gives an indication as to how value can be extracted from analysing data.
One limitation of the analysis is the fairly small and historic data used in this example. A good model would draw on thousands (if not millions) of customers data and be constantly adapting to the flow of information. However this was not possible in the example due to access to relevant data and computational limits.
Furthermore, more complicated models could certainly be developed to produce more accurate predictions. This example aims to show that fairly accurate models can be produced using only a few lines of code and well known statistical modeling methods.