1 Goal
2 Universal Libraries
3 Data Import
4 Data Wrangling
5 Checking the NA
6 Visualization
7 Prepraring data for Machine Learning to predict Credit Status
8 Applying the Random Forest Algorithm
9 Applying the KNN Algorithm
10 Function to remove outliers.
11 Remove outliers to and apply SVM to check if accuracy is improved
12 Applying Recursive Feature Evaluation
13 Conclusion

1 Goal

The goal of this task is to conduct an Exploratory Data Analysis (EDA) and model a credit portfolio.

2 Universal Libraries

Theese libraries are used throught the code. Libraries specific to the line of code are defined with the code.

library(tidyverse)
library(caret)  
library(magrittr)

3 Data Import

The Dataset contains 14 labels
Status: The Credit Status of the person
Seniority: Job Seniority
Home: Type of Home Ownership
Time: Time requested for Repayment of Loan
Age: Age
Maritial Status: Maritial Status
Records: If the person has a financial Record
Job: The Type of Job
Expense: Amoubt of Expense
Income: Amount of Income
Assets: Price of Assets owned
Debt: Amount of debt held
Amount: The amount requested in Loan
Price: The cost of the product

library(readxl)
mydata <- read_excel("Spotcap Data Science Case Study.xlsx", sheet=1)

4 Data Wrangling

This step introduces NA’s into the dataframe .

mydata[mydata==99999999]<-  NA

Introducing addition labels for better visualizations.
Savings = Income - Expenses,
Net_worth = Assets - Debt,
Down_pay = Price - Amount,
Loan_perc = Percentage of the cost requested as loan.

mydata2 <- mydata %>%  mutate(Savings = Income-Expenses, Net_worth = Assets-Debt,
                              Down_pay=Price-Amount,loan_perc= Amount*100/Price) 
mydata2 <- mydata2[ c(1:8,10,9,15,11,12,16,13,14,17,18)] #Rearranging the Columns

Changing numerals into factors and adding NA’s to unavailable values.

mydata2$Status <- factor(mydata2$Status, levels = c(1,2), labels = c("Good", "Bad"))
mydata2$Home <- factor(mydata2$Home, levels = c(1,2,3,4,5,6),
                       labels = c("Rent", "Owner","Private", "Ignore", "Parents", "Other"))
mydata2$Marital <- factor(mydata2$Marital, levels = c(1,2,3,4,5),
                         labels =  c("Single", "Married", "Widow", "Separated", "Divorced"))
mydata2$Records <-  factor(mydata2$Records, levels = c(1,2),
                           labels =  c("Not_Avaliable", "Avaliable"))
mydata2$Job <- factor(mydata2$Job, levels = c(1,2,3,4),
                     labels = c("Fixed", "Partime", "Freelance", "Others"))
summary(mydata2)

##   Status       Seniority           Home           Time            Age       
##  Good:3200   Min.   : 0.000   Rent   : 973   Min.   : 6.00   Min.   :18.00  
##  Bad :1254   1st Qu.: 2.000   Owner  :2107   1st Qu.:36.00   1st Qu.:28.00  
##  NA's:   1   Median : 5.000   Private: 247   Median :48.00   Median :36.00  
##              Mean   : 7.987   Ignore :  20   Mean   :46.44   Mean   :37.08  
##              3rd Qu.:12.000   Parents: 783   3rd Qu.:60.00   3rd Qu.:45.00  
##              Max.   :48.000   Other  : 319   Max.   :72.00   Max.   :68.00  
##                               NA's   :   6                                  
##       Marital              Records            Job           Income     
##  Single   : 978   Not_Avaliable:3682   Fixed    :2806   Min.   :  0.0  
##  Married  :3241   Avaliable    : 773   Partime  : 452   1st Qu.: 80.0  
##  Widow    :  67                        Freelance:1024   Median :120.0  
##  Separated: 130                        Others   : 171   Mean   :130.6  
##  Divorced :  38                        NA's     :   2   3rd Qu.:165.0  
##  NA's     :   1                                         Max.   :959.0  
##                                                         NA's   :34     
##     Expenses         Savings            Assets            Debt        
##  Min.   : 35.00   Min.   :-165.00   Min.   :     0   Min.   :    0.0  
##  1st Qu.: 35.00   1st Qu.:  30.00   1st Qu.:     0   1st Qu.:    0.0  
##  Median : 51.00   Median :  65.00   Median :  3000   Median :    0.0  
##  Mean   : 55.57   Mean   :  75.02   Mean   :  5403   Mean   :  342.9  
##  3rd Qu.: 72.00   3rd Qu.: 108.00   3rd Qu.:  6000   3rd Qu.:    0.0  
##  Max.   :180.00   Max.   : 824.00   Max.   :300000   Max.   :30000.0  
##                   NA's   :34        NA's   :47       NA's   :18       
##    Net_worth          Amount         Price          Down_pay      
##  Min.   : -4000   Min.   : 100   Min.   :  105   Min.   :    0.0  
##  1st Qu.:     0   1st Qu.: 700   1st Qu.: 1118   1st Qu.:  142.0  
##  Median :  3000   Median :1000   Median : 1400   Median :  300.0  
##  Mean   :  5058   Mean   :1039   Mean   : 1463   Mean   :  423.9  
##  3rd Qu.:  5471   3rd Qu.:1300   3rd Qu.: 1692   3rd Qu.:  594.0  
##  Max.   :300000   Max.   :5000   Max.   :11140   Max.   :10140.0  
##  NA's   :47                                                       
##    loan_perc      
##  Min.   :  6.702  
##  1st Qu.: 60.040  
##  Median : 77.101  
##  Mean   : 72.610  
##  3rd Qu.: 88.422  
##  Max.   :100.000  
##

5 Checking the NA

This creates a subset of the main data frame . If there are NA’s in any row, its subsetted here

Na_DF <- mydata2[rowSums(is.na(mydata2)) > 0,]

Compare original with subset using visualization
It is observed that the most NA’s occour with freelancers who do not show an income.
There exist 80 Rows with NA’s which is 1.7% of the dataset.
The Subset needs to be compared to the original dataset.

a= ggplot(data = mydata2, aes(x =Job, y = , fill = Status)) + geom_bar() +
  scale_fill_manual("legend", values = c("Good" = "turquoise", "Bad" = "orangered1","NA"= "grey"))
b= ggplot(data = Na_DF, aes(x =Job, y = , fill = Status)) + geom_bar() +
  scale_fill_manual("legend", values = c("Good" = "turquoise", "Bad" = "orangered1","NA"= "grey"))
library(cowplot)
plot_grid(a, b, labels = c('Main dataset', "Subset of Na's"),ncol = 1,align = 'V',label_size = 12,
          label_x = 0.5, label_y = 1, rel_widths = c(1.8, 1.8))

The overall dataset is similar to NA’s dataset hence dropping the NA’s might not affect the dataset.
Dropping the NA’s.

mydata3 <- mydata2[complete.cases(mydata2),]

6 Visualization

Looking for patterns in the data before applying machine learning.

#Comparing Job type with the Credit Status and avaliability of past records.
ggplot(data = mydata3, aes(x =Job, y = , fill = Status)) + geom_bar() +
  facet_grid(Records ~ .) + xlab("Job Type") + ylab("Count") + 
  scale_fill_manual("legend", values = c("Good" = "turquoise", "Bad" = "orangered1"))

People with no past records tend to have a good status as compared to the people with records. There exists a bias as having no past records might be helping getting a good credit status.

# Comapring credit status with respect to age.
mydata4 <- mydata3
# discretization of age into 4 groups for plotting
mydata4$Age <- cut(mydata4$Age, breaks=c(17, 30, 43, 52 ,Inf),
                   labels=c("18 - 30", "31 - 43", "44 - 52", "52 - 68"))
ggplot(data = mydata4, aes(x =Job, y = , fill = Status)) + geom_bar() + facet_grid(Age ~ .) + 
  xlab("Job Type") + ylab("Count") +
  scale_fill_manual("legend", values = c("Good" = "turquoise", "Bad" = "orangered1"))

The credit statuses are fairly similarly distributed across different age groups hence there are no patterns to be seen .

# checking the percentage of cost demanded as loan with respect to status
ggplot(data=mydata3, aes(x=loan_perc, y=Status))+
  geom_jitter(size=2, na.rm=TRUE, aes(color=Status, shape=Status)) + 
  xlab("Percentage of cost requested in loan ") +ylab("Status")

It is observed that people with a bad credit status tend to pay less in downpayment.

7 Prepraring data for Machine Learning to predict Credit Status

Creating a 70% - 30% partition for the Train and the Test set using the Caret library

set.seed(345)
indexes <- createDataPartition(y=mydata3$Status, times=1,p=0.7,list=FALSE) 
trainSet<- mydata3[indexes,]
testSet <- mydata3[-indexes,]

8 Applying the Random Forest Algorithm

Training the algorithm

set.seed(2334)
ctrl <- trainControl(method="repeatedcv",number=2,repeats = 2) 
start.time <- Sys.time()  #To note the time taken to train an algorithm
Rf_tune <- train(Status~., data = trainSet, method= "rf",preProcess = c( "center","scale"),
                 trControl = ctrl, tuneLength = 8)
end.time <- Sys.time()
saveRDS(Rf_tune, file = "Rf_tune") # Saving the model 
time.taken <- end.time - start.time
print(time.taken)

## Time difference of 1.120035 mins

Rf_tune

## Random Forest 
## 
## 3064 samples
##   17 predictor
##    2 classes: 'Good', 'Bad' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (2 fold, repeated 2 times) 
## Summary of sample sizes: 1532, 1532, 1532, 1532 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7800261  0.3416290
##    5    0.7904700  0.4212345
##    8    0.7919386  0.4337199
##   12    0.7903068  0.4360927
##   15    0.7907963  0.4374193
##   19    0.7903068  0.4374583
##   22    0.7888381  0.4368901
##   26    0.7890013  0.4366069
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.

#Rf_tune <- readRDS("Rf_tune") #used to import pretrained model
plot(Rf_tune)

Testing the algorithm

Rf_test = predict(Rf_tune, newdata=testSet,metric= accuracy)
postResample(Rf_test, testSet$Status)

##  Accuracy     Kappa 
## 0.7955759 0.4533897

confusionMatrix(data = Rf_test, testSet$Status)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  853 174
##       Bad    94 190
##                                           
##                Accuracy : 0.7956          
##                  95% CI : (0.7727, 0.8171)
##     No Information Rate : 0.7223          
##     P-Value [Acc > NIR] : 6.381e-10       
##                                           
##                   Kappa : 0.4534          
##                                           
##  Mcnemar's Test P-Value : 1.395e-06       
##                                           
##             Sensitivity : 0.9007          
##             Specificity : 0.5220          
##          Pos Pred Value : 0.8306          
##          Neg Pred Value : 0.6690          
##              Prevalence : 0.7223          
##          Detection Rate : 0.6506          
##    Detection Prevalence : 0.7834          
##       Balanced Accuracy : 0.7114          
##                                           
##        'Positive' Class : Good            
##

8 Applying the Support Vector Machine Algorithm

Training the algorithm

set.seed(238)
ctrl <- trainControl(method="repeatedcv",number=2,repeats = 2) 
start.time <- Sys.time()
Svm_tune <- train(Status~., data = trainSet, method= "svmLinear",preProcess = c( "center","scale"),
                 trControl = ctrl, tuneLength = 8)
end.time <- Sys.time()
saveRDS(Svm_tune, file = "Svm_tune")
time.taken <- end.time - start.time
print(time.taken)

## Time difference of 3.446867 secs

Svm_tune

## Support Vector Machines with Linear Kernel 
## 
## 3064 samples
##   17 predictor
##    2 classes: 'Good', 'Bad' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (2 fold, repeated 2 times) 
## Summary of sample sizes: 1532, 1532, 1532, 1532 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.801077  0.4614003
## 
## Tuning parameter 'C' was held constant at a value of 1

Testing the algorithm

Svm_test = predict(Svm_tune, newdata=testSet,metric= accuracy)
postResample(Svm_test, testSet$Status)

##  Accuracy     Kappa 
## 0.8009153 0.4532866

confusionMatrix(data = Svm_test, testSet$Status)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  871 185
##       Bad    76 179
##                                           
##                Accuracy : 0.8009          
##                  95% CI : (0.7783, 0.8222)
##     No Information Rate : 0.7223          
##     P-Value [Acc > NIR] : 3.223e-11       
##                                           
##                   Kappa : 0.4533          
##                                           
##  Mcnemar's Test P-Value : 2.309e-11       
##                                           
##             Sensitivity : 0.9197          
##             Specificity : 0.4918          
##          Pos Pred Value : 0.8248          
##          Neg Pred Value : 0.7020          
##              Prevalence : 0.7223          
##          Detection Rate : 0.6644          
##    Detection Prevalence : 0.8055          
##       Balanced Accuracy : 0.7058          
##                                           
##        'Positive' Class : Good            
##

9 Applying the KNN Algorithm

Training the algorithm

set.seed(232)
ctrl <- trainControl(method = "repeatedcv", number = 2, repeats = 1) 
start.time <- Sys.time()
Knn_tune <- train(Status~., data = trainSet, method= "knn",preProcess = c( "center","scale"),
                  trControl = ctrl,tuneLength = 8)
end.time <- Sys.time()
time.taken <- end.time - start.time
print(time.taken)

## Time difference of 2.647449 secs

saveRDS(Knn_tune, file = "Knn_tune")
Knn_tune

## k-Nearest Neighbors 
## 
## 3064 samples
##   17 predictor
##    2 classes: 'Good', 'Bad' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (2 fold, repeated 1 times) 
## Summary of sample sizes: 1532, 1532 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7571802  0.3320484
##    7  0.7597911  0.3224582
##    9  0.7754569  0.3601179
##   11  0.7708877  0.3388454
##   13  0.7689295  0.3334152
##   15  0.7699086  0.3267938
##   17  0.7721932  0.3228294
##   19  0.7699086  0.3093801
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

plot(Knn_tune)

Testing the algorithm

Knn_test = predict(Knn_tune, newdata=testSet,metric= accuracy)
postResample(Knn_test, testSet$Status)

##  Accuracy     Kappa 
## 0.7826087 0.3905240

confusionMatrix(data = Knn_test, testSet$Status)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  870 208
##       Bad    77 156
##                                           
##                Accuracy : 0.7826          
##                  95% CI : (0.7593, 0.8047)
##     No Information Rate : 0.7223          
##     P-Value [Acc > NIR] : 3.561e-07       
##                                           
##                   Kappa : 0.3905          
##                                           
##  Mcnemar's Test P-Value : 1.355e-14       
##                                           
##             Sensitivity : 0.9187          
##             Specificity : 0.4286          
##          Pos Pred Value : 0.8071          
##          Neg Pred Value : 0.6695          
##              Prevalence : 0.7223          
##          Detection Rate : 0.6636          
##    Detection Prevalence : 0.8223          
##       Balanced Accuracy : 0.6736          
##                                           
##        'Positive' Class : Good            
##

10 Function to remove outliers.

This is a function that removes outliers.
This function uses the Tukey’s method which use interquartile (IQR) range approach and was written by Dr. Klodian Dhana.
This original function replaces outliers with NA but has been modified to replace Outliers with Median.

outlierKD <- function(dt, var) {
  #define variables
  var_name <- eval(substitute(var),eval(dt))
  tot <- sum(!is.na(var_name))
  m1 <- median(var_name, na.rm = T)
  outlier <- boxplot.stats(var_name)$out
  mo <- median(outlier)
  
  #create 2x2 canvas
  par(mfrow=c(2, 2), oma=c(0,0,3,0))
  boxplot(var_name, main="With outliers")
  hist(var_name, main="With outliers", xlab=NA, ylab=NA)
  
  # If value is an outlier introduce median
  # If not, do nothing
  var_name <- ifelse(var_name %in% outlier, m1, var_name)
  m2 <- median(var_name, na.rm = T)
  na <- length(outlier)
  boxplot(var_name, main="Without outliers")
  hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
  title("Outlier Check for var", outer=TRUE)
  
  #print messages
  message("Outliers identified: ", na, " from ", tot, " observations")
  message("Proportion (%) of outliers: ", na / tot*100)
  message("Median of the outliers: ", mo)
  message("Median without removing outliers: ", m1)
  message("Median if we remove outliers: ", m2)
  
  dt[as.character(substitute(var))] <- invisible(var_name)
  assign(as.character(as.list(match.call())$dt), dt, envir = .GlobalEnv)
  message("Outliers successfully removed", "\n")
  par(mfrow= c(1,1),oma=c(0,0,0,0))
  return(invisible(dt))
}

11 Remove outliers to and apply SVM to check if accuracy is improved

mydata5 <- mydata3  # creating a copy of the dataset
outlierKD(mydata5,Savings) #remove outliers from Savings

outlierKD(mydata5,Assets)  #remove outliers from Assets

outlierKD(mydata5,Net_worth) #remove outliers from Net_worth

Creating a new partition

set.seed(345)
indexes <- createDataPartition(y=mydata5$Status, times=1,p=0.7,list=FALSE)
trainSet1<- mydata5[indexes,]
testSet1 <- mydata5[-indexes,]

Training the algorithm

set.seed(238)
ctrl <- trainControl(method="repeatedcv",number=2,repeats = 1) 
start.time <- Sys.time()
Svm_tune1 <- train(Status~., data = trainSet1, method= "svmLinear",preProcess = c( "center","scale"),
                 trControl = ctrl, tuneLength = 8)
end.time <- Sys.time()
saveRDS(Svm_tune1, file = "Svm_tune1")
time.taken <- end.time - start.time
print(time.taken)

## Time difference of 1.350445 secs

Svm_tune1

## Support Vector Machines with Linear Kernel 
## 
## 3064 samples
##   17 predictor
##    2 classes: 'Good', 'Bad' 
## 
## Pre-processing: centered (26), scaled (26) 
## Resampling: Cross-Validated (2 fold, repeated 1 times) 
## Summary of sample sizes: 1532, 1532 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8041775  0.4694132
## 
## Tuning parameter 'C' was held constant at a value of 1

Testing the algorithm

Svm_test1 = predict(Svm_tune1, newdata=testSet1,metric= accuracy)
postResample(Svm_test1, testSet1$Status)

##  Accuracy     Kappa 
## 0.8085431 0.4781225

confusionMatrix(data = Svm_test1, testSet$Status)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Good Bad
##       Good  872 176
##       Bad    75 188
##                                           
##                Accuracy : 0.8085          
##                  95% CI : (0.7862, 0.8295)
##     No Information Rate : 0.7223          
##     P-Value [Acc > NIR] : 3.034e-13       
##                                           
##                   Kappa : 0.4781          
##                                           
##  Mcnemar's Test P-Value : 2.756e-10       
##                                           
##             Sensitivity : 0.9208          
##             Specificity : 0.5165          
##          Pos Pred Value : 0.8321          
##          Neg Pred Value : 0.7148          
##              Prevalence : 0.7223          
##          Detection Rate : 0.6651          
##    Detection Prevalence : 0.7994          
##       Balanced Accuracy : 0.7186          
##                                           
##        'Positive' Class : Good            
##

12 Applying Recursive Feature Evaluation

This helps us find the top labels applied in the predictons

my_control <- rfeControl(functions = rfFuncs, method = "repeatedcv", 
                         repeats = 3, verbose = FALSE)
results <- rfe(mydata4[,2:18], mydata4$Status, rfeControl=my_control)
print(results)

## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 3 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          4   0.7714 0.3772    0.01586 0.04419         
##          8   0.7897 0.4370    0.02034 0.05885         
##         16   0.7935 0.4475    0.01993 0.05908         
##         17   0.7955 0.4508    0.01786 0.05595        *
## 
## The top 5 variables (out of 17):
##    Records, Job, Savings, Seniority, Income

# list the chosen features
predictors(results)

##  [1] "Records"   "Job"       "Savings"   "Seniority" "Income"    "loan_perc"
##  [7] "Amount"    "Net_worth" "Down_pay"  "Home"      "Assets"    "Price"    
## [13] "Time"      "Expenses"  "Age"       "Marital"   "Debt"

# plot the results
plot(results, type=c("g", "o"))

13 Conclusion

We applied 3 Algorithms to the problem.
The Random forest had a training as well as testing accuracy of about 79 percent but had a training time of 1.16 minutes.
The Support vector machine had a similar accuracy to Random Forest but the training time was 3.5 seconds.
The Knn had a lower accuracy of 76 percent but the training time was 2.2 seconds.
As SVM has a low training time and a better accuracy hence larger datasets could be trained using this algorithm.
Using the KD outliers function to remove outliers only improved the SVM prediction in the test set by .2 percentage.
It would be fair to assume this model as the optimal model.
The Recursive Feature Engineer provides us with the 5 paramates which affect the Credit Status i.e, “Records”,“Job”,“Savings”,“Seniority”,“Income”.
One business recommendation would be providing more Good Credit Status to people who maintain records. People with no records have more chances of getting a good status, while people with records tend to get a bad status.

Credit portfolio Data Science Case Study

Roshan Kumar Bhuyan

1 Goal

2 Universal Libraries

3 Data Import

4 Data Wrangling

5 Checking the NA

6 Visualization

7 Prepraring data for Machine Learning to predict Credit Status

8 Applying the Random Forest Algorithm

8 Applying the Support Vector Machine Algorithm

9 Applying the KNN Algorithm

10 Function to remove outliers.

11 Remove outliers to and apply SVM to check if accuracy is improved

12 Applying Recursive Feature Evaluation

13 Conclusion