The goal of this task is to conduct an Exploratory Data Analysis (EDA) and model a credit portfolio.
Theese libraries are used throught the code. Libraries specific to the line of code are defined with the code.
library(tidyverse)
library(caret)
library(magrittr)
The Dataset contains 14 labels
Status: The Credit Status of the person
Seniority: Job Seniority
Home: Type of Home Ownership
Time: Time requested for Repayment of Loan
Age: Age
Maritial Status: Maritial Status
Records: If the person has a financial Record
Job: The Type of Job
Expense: Amoubt of Expense
Income: Amount of Income
Assets: Price of Assets owned
Debt: Amount of debt held
Amount: The amount requested in Loan
Price: The cost of the product
library(readxl)
mydata <- read_excel("Spotcap Data Science Case Study.xlsx", sheet=1)
This step introduces NA’s into the dataframe .
mydata[mydata==99999999]<- NA
Introducing addition labels for better visualizations.
Savings = Income - Expenses,
Net_worth = Assets - Debt,
Down_pay = Price - Amount,
Loan_perc = Percentage of the cost requested as loan.
mydata2 <- mydata %>% mutate(Savings = Income-Expenses, Net_worth = Assets-Debt,
Down_pay=Price-Amount,loan_perc= Amount*100/Price)
mydata2 <- mydata2[ c(1:8,10,9,15,11,12,16,13,14,17,18)] #Rearranging the Columns
Changing numerals into factors and adding NA’s to unavailable values.
mydata2$Status <- factor(mydata2$Status, levels = c(1,2), labels = c("Good", "Bad"))
mydata2$Home <- factor(mydata2$Home, levels = c(1,2,3,4,5,6),
labels = c("Rent", "Owner","Private", "Ignore", "Parents", "Other"))
mydata2$Marital <- factor(mydata2$Marital, levels = c(1,2,3,4,5),
labels = c("Single", "Married", "Widow", "Separated", "Divorced"))
mydata2$Records <- factor(mydata2$Records, levels = c(1,2),
labels = c("Not_Avaliable", "Avaliable"))
mydata2$Job <- factor(mydata2$Job, levels = c(1,2,3,4),
labels = c("Fixed", "Partime", "Freelance", "Others"))
summary(mydata2)
## Status Seniority Home Time Age
## Good:3200 Min. : 0.000 Rent : 973 Min. : 6.00 Min. :18.00
## Bad :1254 1st Qu.: 2.000 Owner :2107 1st Qu.:36.00 1st Qu.:28.00
## NA's: 1 Median : 5.000 Private: 247 Median :48.00 Median :36.00
## Mean : 7.987 Ignore : 20 Mean :46.44 Mean :37.08
## 3rd Qu.:12.000 Parents: 783 3rd Qu.:60.00 3rd Qu.:45.00
## Max. :48.000 Other : 319 Max. :72.00 Max. :68.00
## NA's : 6
## Marital Records Job Income
## Single : 978 Not_Avaliable:3682 Fixed :2806 Min. : 0.0
## Married :3241 Avaliable : 773 Partime : 452 1st Qu.: 80.0
## Widow : 67 Freelance:1024 Median :120.0
## Separated: 130 Others : 171 Mean :130.6
## Divorced : 38 NA's : 2 3rd Qu.:165.0
## NA's : 1 Max. :959.0
## NA's :34
## Expenses Savings Assets Debt
## Min. : 35.00 Min. :-165.00 Min. : 0 Min. : 0.0
## 1st Qu.: 35.00 1st Qu.: 30.00 1st Qu.: 0 1st Qu.: 0.0
## Median : 51.00 Median : 65.00 Median : 3000 Median : 0.0
## Mean : 55.57 Mean : 75.02 Mean : 5403 Mean : 342.9
## 3rd Qu.: 72.00 3rd Qu.: 108.00 3rd Qu.: 6000 3rd Qu.: 0.0
## Max. :180.00 Max. : 824.00 Max. :300000 Max. :30000.0
## NA's :34 NA's :47 NA's :18
## Net_worth Amount Price Down_pay
## Min. : -4000 Min. : 100 Min. : 105 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 700 1st Qu.: 1118 1st Qu.: 142.0
## Median : 3000 Median :1000 Median : 1400 Median : 300.0
## Mean : 5058 Mean :1039 Mean : 1463 Mean : 423.9
## 3rd Qu.: 5471 3rd Qu.:1300 3rd Qu.: 1692 3rd Qu.: 594.0
## Max. :300000 Max. :5000 Max. :11140 Max. :10140.0
## NA's :47
## loan_perc
## Min. : 6.702
## 1st Qu.: 60.040
## Median : 77.101
## Mean : 72.610
## 3rd Qu.: 88.422
## Max. :100.000
##
This creates a subset of the main data frame . If there are NA’s in any row, its subsetted here
Na_DF <- mydata2[rowSums(is.na(mydata2)) > 0,]
Compare original with subset using visualization
It is observed that the most NA’s occour with freelancers who do not show an income.
There exist 80 Rows with NA’s which is 1.7% of the dataset.
The Subset needs to be compared to the original dataset.
a= ggplot(data = mydata2, aes(x =Job, y = , fill = Status)) + geom_bar() +
scale_fill_manual("legend", values = c("Good" = "turquoise", "Bad" = "orangered1","NA"= "grey"))
b= ggplot(data = Na_DF, aes(x =Job, y = , fill = Status)) + geom_bar() +
scale_fill_manual("legend", values = c("Good" = "turquoise", "Bad" = "orangered1","NA"= "grey"))
library(cowplot)
plot_grid(a, b, labels = c('Main dataset', "Subset of Na's"),ncol = 1,align = 'V',label_size = 12,
label_x = 0.5, label_y = 1, rel_widths = c(1.8, 1.8))
The overall dataset is similar to NA’s dataset hence dropping the NA’s might not affect the dataset.
Dropping the NA’s.
mydata3 <- mydata2[complete.cases(mydata2),]
Looking for patterns in the data before applying machine learning.
#Comparing Job type with the Credit Status and avaliability of past records.
ggplot(data = mydata3, aes(x =Job, y = , fill = Status)) + geom_bar() +
facet_grid(Records ~ .) + xlab("Job Type") + ylab("Count") +
scale_fill_manual("legend", values = c("Good" = "turquoise", "Bad" = "orangered1"))
>
People with no past records tend to have a good status as compared to the people with records. There exists a bias as having no past records might be helping getting a good credit status.
# Comapring credit status with respect to age.
mydata4 <- mydata3
# discretization of age into 4 groups for plotting
mydata4$Age <- cut(mydata4$Age, breaks=c(17, 30, 43, 52 ,Inf),
labels=c("18 - 30", "31 - 43", "44 - 52", "52 - 68"))
ggplot(data = mydata4, aes(x =Job, y = , fill = Status)) + geom_bar() + facet_grid(Age ~ .) +
xlab("Job Type") + ylab("Count") +
scale_fill_manual("legend", values = c("Good" = "turquoise", "Bad" = "orangered1"))
>
The credit statuses are fairly similarly distributed across different age groups hence there are no patterns to be seen .
# checking the percentage of cost demanded as loan with respect to status
ggplot(data=mydata3, aes(x=loan_perc, y=Status))+
geom_jitter(size=2, na.rm=TRUE, aes(color=Status, shape=Status)) +
xlab("Percentage of cost requested in loan ") +ylab("Status")
>
It is observed that people with a bad credit status tend to pay less in downpayment.
Creating a 70% - 30% partition for the Train and the Test set using the Caret library
set.seed(345)
indexes <- createDataPartition(y=mydata3$Status, times=1,p=0.7,list=FALSE)
trainSet<- mydata3[indexes,]
testSet <- mydata3[-indexes,]
Training the algorithm
set.seed(2334)
ctrl <- trainControl(method="repeatedcv",number=2,repeats = 2)
start.time <- Sys.time() #To note the time taken to train an algorithm
Rf_tune <- train(Status~., data = trainSet, method= "rf",preProcess = c( "center","scale"),
trControl = ctrl, tuneLength = 8)
end.time <- Sys.time()
saveRDS(Rf_tune, file = "Rf_tune") # Saving the model
time.taken <- end.time - start.time
print(time.taken)
## Time difference of 1.120035 mins
Rf_tune
## Random Forest
##
## 3064 samples
## 17 predictor
## 2 classes: 'Good', 'Bad'
##
## Pre-processing: centered (26), scaled (26)
## Resampling: Cross-Validated (2 fold, repeated 2 times)
## Summary of sample sizes: 1532, 1532, 1532, 1532
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7800261 0.3416290
## 5 0.7904700 0.4212345
## 8 0.7919386 0.4337199
## 12 0.7903068 0.4360927
## 15 0.7907963 0.4374193
## 19 0.7903068 0.4374583
## 22 0.7888381 0.4368901
## 26 0.7890013 0.4366069
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
#Rf_tune <- readRDS("Rf_tune") #used to import pretrained model
plot(Rf_tune)
Testing the algorithm
Rf_test = predict(Rf_tune, newdata=testSet,metric= accuracy)
postResample(Rf_test, testSet$Status)
## Accuracy Kappa
## 0.7955759 0.4533897
confusionMatrix(data = Rf_test, testSet$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 853 174
## Bad 94 190
##
## Accuracy : 0.7956
## 95% CI : (0.7727, 0.8171)
## No Information Rate : 0.7223
## P-Value [Acc > NIR] : 6.381e-10
##
## Kappa : 0.4534
##
## Mcnemar's Test P-Value : 1.395e-06
##
## Sensitivity : 0.9007
## Specificity : 0.5220
## Pos Pred Value : 0.8306
## Neg Pred Value : 0.6690
## Prevalence : 0.7223
## Detection Rate : 0.6506
## Detection Prevalence : 0.7834
## Balanced Accuracy : 0.7114
##
## 'Positive' Class : Good
##
8 Applying the Support Vector Machine Algorithm
Training the algorithm
set.seed(238)
ctrl <- trainControl(method="repeatedcv",number=2,repeats = 2)
start.time <- Sys.time()
Svm_tune <- train(Status~., data = trainSet, method= "svmLinear",preProcess = c( "center","scale"),
trControl = ctrl, tuneLength = 8)
end.time <- Sys.time()
saveRDS(Svm_tune, file = "Svm_tune")
time.taken <- end.time - start.time
print(time.taken)
## Time difference of 3.446867 secs
Svm_tune
## Support Vector Machines with Linear Kernel
##
## 3064 samples
## 17 predictor
## 2 classes: 'Good', 'Bad'
##
## Pre-processing: centered (26), scaled (26)
## Resampling: Cross-Validated (2 fold, repeated 2 times)
## Summary of sample sizes: 1532, 1532, 1532, 1532
## Resampling results:
##
## Accuracy Kappa
## 0.801077 0.4614003
##
## Tuning parameter 'C' was held constant at a value of 1
Testing the algorithm
Svm_test = predict(Svm_tune, newdata=testSet,metric= accuracy)
postResample(Svm_test, testSet$Status)
## Accuracy Kappa
## 0.8009153 0.4532866
confusionMatrix(data = Svm_test, testSet$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 871 185
## Bad 76 179
##
## Accuracy : 0.8009
## 95% CI : (0.7783, 0.8222)
## No Information Rate : 0.7223
## P-Value [Acc > NIR] : 3.223e-11
##
## Kappa : 0.4533
##
## Mcnemar's Test P-Value : 2.309e-11
##
## Sensitivity : 0.9197
## Specificity : 0.4918
## Pos Pred Value : 0.8248
## Neg Pred Value : 0.7020
## Prevalence : 0.7223
## Detection Rate : 0.6644
## Detection Prevalence : 0.8055
## Balanced Accuracy : 0.7058
##
## 'Positive' Class : Good
##
Training the algorithm
set.seed(232)
ctrl <- trainControl(method = "repeatedcv", number = 2, repeats = 1)
start.time <- Sys.time()
Knn_tune <- train(Status~., data = trainSet, method= "knn",preProcess = c( "center","scale"),
trControl = ctrl,tuneLength = 8)
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)
## Time difference of 2.647449 secs
saveRDS(Knn_tune, file = "Knn_tune")
Knn_tune
## k-Nearest Neighbors
##
## 3064 samples
## 17 predictor
## 2 classes: 'Good', 'Bad'
##
## Pre-processing: centered (26), scaled (26)
## Resampling: Cross-Validated (2 fold, repeated 1 times)
## Summary of sample sizes: 1532, 1532
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7571802 0.3320484
## 7 0.7597911 0.3224582
## 9 0.7754569 0.3601179
## 11 0.7708877 0.3388454
## 13 0.7689295 0.3334152
## 15 0.7699086 0.3267938
## 17 0.7721932 0.3228294
## 19 0.7699086 0.3093801
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(Knn_tune)
Testing the algorithm
Knn_test = predict(Knn_tune, newdata=testSet,metric= accuracy)
postResample(Knn_test, testSet$Status)
## Accuracy Kappa
## 0.7826087 0.3905240
confusionMatrix(data = Knn_test, testSet$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 870 208
## Bad 77 156
##
## Accuracy : 0.7826
## 95% CI : (0.7593, 0.8047)
## No Information Rate : 0.7223
## P-Value [Acc > NIR] : 3.561e-07
##
## Kappa : 0.3905
##
## Mcnemar's Test P-Value : 1.355e-14
##
## Sensitivity : 0.9187
## Specificity : 0.4286
## Pos Pred Value : 0.8071
## Neg Pred Value : 0.6695
## Prevalence : 0.7223
## Detection Rate : 0.6636
## Detection Prevalence : 0.8223
## Balanced Accuracy : 0.6736
##
## 'Positive' Class : Good
##
This is a function that removes outliers.
This function uses the Tukey’s method which use interquartile (IQR) range approach and was written by Dr. Klodian Dhana.
This original function replaces outliers with NA but has been modified to replace Outliers with Median.
outlierKD <- function(dt, var) {
#define variables
var_name <- eval(substitute(var),eval(dt))
tot <- sum(!is.na(var_name))
m1 <- median(var_name, na.rm = T)
outlier <- boxplot.stats(var_name)$out
mo <- median(outlier)
#create 2x2 canvas
par(mfrow=c(2, 2), oma=c(0,0,3,0))
boxplot(var_name, main="With outliers")
hist(var_name, main="With outliers", xlab=NA, ylab=NA)
# If value is an outlier introduce median
# If not, do nothing
var_name <- ifelse(var_name %in% outlier, m1, var_name)
m2 <- median(var_name, na.rm = T)
na <- length(outlier)
boxplot(var_name, main="Without outliers")
hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
title("Outlier Check for var", outer=TRUE)
#print messages
message("Outliers identified: ", na, " from ", tot, " observations")
message("Proportion (%) of outliers: ", na / tot*100)
message("Median of the outliers: ", mo)
message("Median without removing outliers: ", m1)
message("Median if we remove outliers: ", m2)
dt[as.character(substitute(var))] <- invisible(var_name)
assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
message("Outliers successfully removed", "\n")
par(mfrow= c(1,1),oma=c(0,0,0,0))
return(invisible(dt))
}
mydata5 <- mydata3 # creating a copy of the dataset
outlierKD(mydata5,Savings) #remove outliers from Savings
outlierKD(mydata5,Assets) #remove outliers from Assets
outlierKD(mydata5,Net_worth) #remove outliers from Net_worth
Creating a new partition
set.seed(345)
indexes <- createDataPartition(y=mydata5$Status, times=1,p=0.7,list=FALSE)
trainSet1<- mydata5[indexes,]
testSet1 <- mydata5[-indexes,]
Training the algorithm
set.seed(238)
ctrl <- trainControl(method="repeatedcv",number=2,repeats = 1)
start.time <- Sys.time()
Svm_tune1 <- train(Status~., data = trainSet1, method= "svmLinear",preProcess = c( "center","scale"),
trControl = ctrl, tuneLength = 8)
end.time <- Sys.time()
saveRDS(Svm_tune1, file = "Svm_tune1")
time.taken <- end.time - start.time
print(time.taken)
## Time difference of 1.350445 secs
Svm_tune1
## Support Vector Machines with Linear Kernel
##
## 3064 samples
## 17 predictor
## 2 classes: 'Good', 'Bad'
##
## Pre-processing: centered (26), scaled (26)
## Resampling: Cross-Validated (2 fold, repeated 1 times)
## Summary of sample sizes: 1532, 1532
## Resampling results:
##
## Accuracy Kappa
## 0.8041775 0.4694132
##
## Tuning parameter 'C' was held constant at a value of 1
Testing the algorithm
Svm_test1 = predict(Svm_tune1, newdata=testSet1,metric= accuracy)
postResample(Svm_test1, testSet1$Status)
## Accuracy Kappa
## 0.8085431 0.4781225
confusionMatrix(data = Svm_test1, testSet$Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Good Bad
## Good 872 176
## Bad 75 188
##
## Accuracy : 0.8085
## 95% CI : (0.7862, 0.8295)
## No Information Rate : 0.7223
## P-Value [Acc > NIR] : 3.034e-13
##
## Kappa : 0.4781
##
## Mcnemar's Test P-Value : 2.756e-10
##
## Sensitivity : 0.9208
## Specificity : 0.5165
## Pos Pred Value : 0.8321
## Neg Pred Value : 0.7148
## Prevalence : 0.7223
## Detection Rate : 0.6651
## Detection Prevalence : 0.7994
## Balanced Accuracy : 0.7186
##
## 'Positive' Class : Good
##
This helps us find the top labels applied in the predictons
my_control <- rfeControl(functions = rfFuncs, method = "repeatedcv",
repeats = 3, verbose = FALSE)
results <- rfe(mydata4[,2:18], mydata4$Status, rfeControl=my_control)
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 3 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 4 0.7714 0.3772 0.01586 0.04419
## 8 0.7897 0.4370 0.02034 0.05885
## 16 0.7935 0.4475 0.01993 0.05908
## 17 0.7955 0.4508 0.01786 0.05595 *
##
## The top 5 variables (out of 17):
## Records, Job, Savings, Seniority, Income
# list the chosen features
predictors(results)
## [1] "Records" "Job" "Savings" "Seniority" "Income" "loan_perc"
## [7] "Amount" "Net_worth" "Down_pay" "Home" "Assets" "Price"
## [13] "Time" "Expenses" "Age" "Marital" "Debt"
# plot the results
plot(results, type=c("g", "o"))
We applied 3 Algorithms to the problem.
The Random forest had a training as well as testing accuracy of about 79 percent but had a training time of 1.16 minutes.
The Support vector machine had a similar accuracy to Random Forest but the training time was 3.5 seconds.
The Knn had a lower accuracy of 76 percent but the training time was 2.2 seconds.
As SVM has a low training time and a better accuracy hence larger datasets could be trained using this algorithm.
Using the KD outliers function to remove outliers only improved the SVM prediction in the test set by .2 percentage.
It would be fair to assume this model as the optimal model.
The Recursive Feature Engineer provides us with the 5 paramates which affect the Credit Status i.e, “Records”,“Job”,“Savings”,“Seniority”,“Income”.
One business recommendation would be providing more Good Credit Status to people who maintain records. People with no records have more chances of getting a good status, while people with records tend to get a bad status.