==================================================================
The purpose of this database is to provide information about a bank’s customers so that machine learning models can be developed that can predict whether a particular customer will repay the loan or not. We investigated 4 algorithms: Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine (SVM).
The dataset used in this report is Credit Risk data in Kaggle. The dataset can be downloaded here. Link to dataset: https://www.kaggle.com/upadorprofzs/credit-risk
The dataset is downloaded from Kaggle and saved in the data folder. We use read.csv() function to read the dataset and put in CustomerCredit_df data frame.
CustomerCredit_df<-read.csv("C:/Users/dedis/Documents/Data Science Course/Portofolio/data/credit_risk.csv")
See the data dimension. The dataset has 2000 rows and 5 columns.
dim(CustomerCredit_df)
## [1] 2000 5
To find out the column names and types, we used str() function.
str(CustomerCredit_df)
## 'data.frame': 2000 obs. of 5 variables:
## $ clientid: int 1 2 3 4 5 6 7 8 9 10 ...
## $ income : num 66156 34415 57317 42710 66953 ...
## $ age : num 59 48.1 63.1 45.8 18.6 ...
## $ loan : num 8107 6565 8021 6104 8770 ...
## $ default : int 0 0 0 0 1 0 0 1 0 0 ...
From the result above, we know the following:
1. The first column is id. It is unique and unnecessary for prediction. So, it should be removed.
2. The fifth column is default Currently the type is int and it should be converted to factor and change variable name to be payment_status
# remove unnecessary columns
CustomerCredit_df$clientid <- NULL
# change to factor for target variable
colnames(CustomerCredit_df)[4] <- "payment_status"
CustomerCredit_df$payment_status <- factor(CustomerCredit_df$payment_status,
levels = c(0,1),
labels = c("Paid", "Unpaid"))
Analysis of a single variable.
Number of paid (P) and unpaid (UP) in payment_status column.
library(ggplot2)
ggplot(data=CustomerCredit_df, aes(x=payment_status)) + geom_bar(color="white",fill = "#49B3E8")+
geom_text(aes(y = ..count.. -50,label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count', position = position_stack(vjust=0.5), size = 5,color= "white") +
labs(title="Payment Status Of Client", x="Payment Status")
From the result above, we know that the number of customer Paid is more than Unpaid.
Distribution of loan variable in Bar
ggplot(data=CustomerCredit_df, aes(x=loan))+
geom_histogram(color="white",fill = "#49B3E8")+
scale_x_continuous(breaks = c(0,2000,4000,6000,8000,10000,12000,14000),
labels = c("0","2000","4000","6000","8000","10000","12000",
"14000"),
limits = c(1000,14000))+
labs(title = "Loan Of Client",
x="Loan", y="Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the result above, we know that data distribution of client loan is uneven, it’s leaning more to the left
Distribution of age variable in Bar
ggplot(data=CustomerCredit_df, aes(x=age))+
geom_histogram(color="white",fill = "#49B3E8")+
labs(title = "Age Of Client",
x="Age", y="Count")
Distribution of income variable in Bar
ggplot(data=CustomerCredit_df, aes(x=income))+
geom_histogram(color="white",fill = "#49B3E8")+
scale_x_continuous(breaks = c(20000,30000,40000,50000
,60000,70000),
labels = c("20000","30000",
"40000","50000","60000"
,"70000"),
limits = c(20000,70000))+
labs(title = "Income Of Client",
x="Income", y="Count")
From the result above, we know that data distribution of Income Clientis normal
Analysis of two variables.
Distribution of loan variable based on payment_status.
ggplot(data=CustomerCredit_df, aes(x=payment_status,
y=loan,
color = payment_status)) +
scale_y_continuous(breaks = c(0,2000,4000,6000,8000,10000,12000,14000),
labels = c("0","2000","4000","6000","8000","10000","12000",
"14000"),
limits = c(1000,14000))+
geom_boxplot(alpha=.3) +
geom_jitter(alpha = 0.3,
color = "blue",
width = 0.2) +
labs(title="Payment Status based on Loan",
x="Payment Status", y="Loan")
From the result above, we know the following:
1. Based on Loan , the number of paid is above 2000 and below 7000
2. Based on Loan , the number of unpaid is is above 5000 and below 10000
Distribution of income variable based on payment_status.
ggplot(data=CustomerCredit_df, aes(x=payment_status,
y=income,
color = payment_status)) +
scale_y_continuous(breaks = c(20000,30000,40000,50000
,60000,70000),
labels = c("20000","30000",
"40000","50000","60000"
,"70000"),
limits = c(20000,70000))+
geom_boxplot(alpha=.3) +
geom_jitter(alpha = 0.3,
color = "blue",
width = 0.2) +
labs(title="Payment Status based on Income",
x="Payment Status", y="Income")
From the result above, we know the following:
Based on Income of Customer, we see that customer who have a high salary or low salary has a potency to pay or unpaid their loan
Distribution of age variable based on payment_status.
ggplot(data=CustomerCredit_df, aes(x=age,
fill=payment_status)) +
geom_density(alpha=.3)+
labs(title="Payment Status based on Age",
x="Age")
## Warning: Removed 3 rows containing non-finite values (stat_density).
Compute and visualize correlation coefficient of each measurement.
Visualize Pearson’s Correlation Coefficient variables.
library(corrgram)
corrgram(CustomerCredit_df[,c("income", "loan", "age")],
main="Pearson’s Correlation Coefficient variables")
From the result above, we know the following:
variable income has a strong corelation with variable loan than variable age
# remove unnecessary columns
# CustomerCredit_df$clientid <- NULL
In this section, it is a part of getting the outliers value.
Get Outliers Values
out_age <- boxplot.stats(CustomerCredit_df$age)$out
out_age
## [1] -28.21836 -52.42328 -36.49698
Get Outliers Index
out_idx <- which(CustomerCredit_df$age %in% c(out_age))
out_idx
## [1] 16 22 27
Data Without Outliers
CustomerCredit_clean <-CustomerCredit_df[-out_idx,]
dim(CustomerCredit_clean)
## [1] 1997 4
Detect Missing Value on Data
Distribution of Missing Value in Pattern
library(mice)
md.pattern(CustomerCredit_clean)
## income loan payment_status age
## 1994 1 1 1 1 0
## 3 1 1 1 0 1
## 0 0 0 3 3
Imputation Missing Value With Mean
CustomerCredit_df.imput = CustomerCredit_clean
for(i in which(sapply(CustomerCredit_df.imput, is.numeric))){
CustomerCredit_df.imput[is.na(CustomerCredit_df.imput[, i]), i] <- mean(CustomerCredit_df.imput[, i], na.rm = TRUE)
}
set.seed(2021)
m = nrow(CustomerCredit_df.imput)
train_ind <- sample(m, 0.8 * m)
TrainingSet <- CustomerCredit_df.imput[train_ind, ]
TestingSet <- CustomerCredit_df.imput[-train_ind, ]
library(caret)
Model_LG <- train(payment_status ~ ., data = TrainingSet,
method = "glm",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none")
)
summary(Model_LG)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.26416 -0.08227 -0.00899 -0.00036 2.71212
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.6447 0.5937 -12.88 <2e-16 ***
## income -3.6388 0.3460 -10.52 <2e-16 ***
## age -4.7082 0.3880 -12.13 <2e-16 ***
## loan 5.3884 0.4544 11.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1316.55 on 1596 degrees of freedom
## Residual deviance: 360.52 on 1593 degrees of freedom
## AIC: 368.52
##
## Number of Fisher Scoring iterations: 9
# Feature importance
print(varImp(Model_LG))
## glm variable importance
##
## Overall
## age 100.00
## loan 82.88
## income 0.00
Importance <- varImp(Model_LG)
plot(Importance, col = "red")
library(party)
Model_CT <- train(payment_status ~ ., data = TrainingSet,
method = "LMT",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none")
)
summary(Model_CT)
##
## === Summary ===
##
## Correctly Classified Instances 1594 99.8121 %
## Incorrectly Classified Instances 3 0.1879 %
## Kappa statistic 0.9924
## Mean absolute error 0.0481
## Root mean squared error 0.0655
## Relative absolute error 19.4711 %
## Root relative squared error 18.6587 %
## Total Number of Instances 1597
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 1365 2 | a = Paid
## 1 229 | b = Unpaid
# Feature importance
print(varImp(Model_CT))
## ROC curve variable importance
##
## Importance
## age 100.00
## loan 85.41
## income 0.00
Importance <- varImp(Model_CT)
plot(Importance, col = "red")
library(randomForest)
set.seed(2021)
fit.forest <- randomForest(formula = payment_status ~ .,
data = TrainingSet,
na.action = na.roughfix)
fit.forest
##
## Call:
## randomForest(formula = payment_status ~ ., data = TrainingSet, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 1.13%
## Confusion matrix:
## Paid Unpaid class.error
## Paid 1363 4 0.002926116
## Unpaid 14 216 0.060869565
summary(fit.forest)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 1597 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 3194 matrix numeric
## oob.times 1597 -none- numeric
## classes 2 -none- character
## importance 3 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 1597 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
# Feature importance
varImpPlot(fit.forest)
library(e1071)
library(caret )
Model_SVM <- train(payment_status ~ ., data = TrainingSet,
method = "svmPoly",
na.action = na.omit,
preProcess=c("scale","center"),
trControl= trainControl(method="none"),
tuneGrid = data.frame(degree=1,scale=1,C=1)
)
summary(Model_SVM)
## Length Class Mode
## 1 ksvm S4
# Feature importance
print(varImp(Model_SVM))
## ROC curve variable importance
##
## Importance
## age 100.00
## loan 85.41
## income 0.00
Importance <- varImp(Model_SVM)
plot(Importance, col = "red")
Model.Testing.LG <- predict(Model_LG, TestingSet)
Model.testing.confusion.LG <-confusionMatrix(Model.Testing.LG, TestingSet$payment_status,
dnn = c("Predict","Actual"))
Model.testing.confusion.LG
## Confusion Matrix and Statistics
##
## Actual
## Predict Paid Unpaid
## Paid 336 8
## Unpaid 11 45
##
## Accuracy : 0.9525
## 95% CI : (0.9268, 0.9712)
## No Information Rate : 0.8675
## P-Value [Acc > NIR] : 1.362e-08
##
## Kappa : 0.7982
##
## Mcnemar's Test P-Value : 0.6464
##
## Sensitivity : 0.9683
## Specificity : 0.8491
## Pos Pred Value : 0.9767
## Neg Pred Value : 0.8036
## Prevalence : 0.8675
## Detection Rate : 0.8400
## Detection Prevalence : 0.8600
## Balanced Accuracy : 0.9087
##
## 'Positive' Class : Paid
##
ggplot(data=TestingSet, aes(x=payment_status,
y=Model.Testing.LG)) +
geom_boxplot(alpha=.3) +
geom_jitter(alpha = 0.3,
color = "blue",
width = 0.2) +
labs(title="Actual VS Predicted",
x="Actual", y="Predicted")
#### 5.2 Support Vector Machine
Model.Testing.SV <- predict(Model_SVM, TestingSet)
Model.testing.confusion <-confusionMatrix(Model.Testing.SV,
TestingSet$payment_status)
Model.testing.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction Paid Unpaid
## Paid 335 8
## Unpaid 12 45
##
## Accuracy : 0.95
## 95% CI : (0.9238, 0.9692)
## No Information Rate : 0.8675
## P-Value [Acc > NIR] : 4.066e-08
##
## Kappa : 0.7892
##
## Mcnemar's Test P-Value : 0.5023
##
## Sensitivity : 0.9654
## Specificity : 0.8491
## Pos Pred Value : 0.9767
## Neg Pred Value : 0.7895
## Prevalence : 0.8675
## Detection Rate : 0.8375
## Detection Prevalence : 0.8575
## Balanced Accuracy : 0.9072
##
## 'Positive' Class : Paid
##
ggplot(data=TestingSet, aes(x=payment_status,
y=Model.Testing.SV)) +
geom_boxplot(alpha=.3) +
geom_jitter(alpha = 0.3,
color = "blue",
width = 0.2) +
labs(title="Actual VS Predicted",
x="Actual", y="Predicted")
Model.Testing.RF <- predict(fit.forest, TestingSet)
Model.testing.confusion <-confusionMatrix(Model.Testing.RF,
TestingSet$payment_status)
Model.testing.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction Paid Unpaid
## Paid 347 3
## Unpaid 0 50
##
## Accuracy : 0.9925
## 95% CI : (0.9782, 0.9985)
## No Information Rate : 0.8675
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9666
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9434
## Pos Pred Value : 0.9914
## Neg Pred Value : 1.0000
## Prevalence : 0.8675
## Detection Rate : 0.8675
## Detection Prevalence : 0.8750
## Balanced Accuracy : 0.9717
##
## 'Positive' Class : Paid
##
ggplot(data=TestingSet, aes(x=payment_status,
y=Model.Testing.RF)) +
geom_boxplot(alpha=.3) +
geom_jitter(alpha = 0.3,
color = "blue",
width = 0.2) +
labs(title="Actual VS Predicted",
x="Actual", y="Predicted")
Model.Testing.CT <- predict(Model_CT, TestingSet)
Model.testing.confusion <-confusionMatrix(Model.Testing.CT,
TestingSet$payment_status)
Model.testing.confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction Paid Unpaid
## Paid 346 1
## Unpaid 1 52
##
## Accuracy : 0.995
## 95% CI : (0.9821, 0.9994)
## No Information Rate : 0.8675
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9783
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9971
## Specificity : 0.9811
## Pos Pred Value : 0.9971
## Neg Pred Value : 0.9811
## Prevalence : 0.8675
## Detection Rate : 0.8650
## Detection Prevalence : 0.8675
## Balanced Accuracy : 0.9891
##
## 'Positive' Class : Paid
##
ggplot(data=TestingSet, aes(x=payment_status,
y=Model.Testing.CT)) +
geom_boxplot(alpha=.3) +
geom_jitter(alpha = 0.3,
color = "blue",
width = 0.2) +
labs(title="Actual VS Predicted",
x="Actual", y="Predicted")
From the result above, we know the following:
1. Classification Tree is the best among all the tested algortihms
2. Based Classification Tree Model , the most important variables are age, loan and income
3. For age around 25 to 30 the value of the loan may be lowered The results can be improved by better data preparation