Use any of the three classification algorithms you’ve learned to predict the risk status of a bank loan. The variable default
in the dataset indicates whether the applicant defaulted on the loan issued by the bank. Start by reading in the loan.csv
dataset, which originally comes from Professor Dr. Hans Hofmann:
library(tm)
## Loading required package: NLP
library(mlr)
## Loading required package: ParamHelpers
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:mlr':
##
## impute
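# Read categorical columns as factors (the default changed to FALSE in R 4.0), since the classifiers below expect factors
loans <- read.csv("loan.csv", stringsAsFactors = TRUE)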
Use an R Markdown document to lay out your process, and explain the methodology in one or two brief paragraphs. The student should be awarded the full (3) points when:
- The preprocessing steps are done, and the student shows an understanding of holding out a test / cross-validation set to estimate the model’s performance on unseen data
- The model’s performance is sufficiently explained (accuracy may not be the most helpful metric here! Recall what you’ve learned about specificity and sensitivity)
- The student demonstrates extra effort in evaluating their model and proposes ways to improve the accuracy obtained from the initial model
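The cross-validation idea in the first criterion can be sketched with mlr's resampling interface. This is only an illustration under stated assumptions: the `toy` data frame and its columns are simulated stand-ins rather than the real loan.csv, and `"yes"` is assumed to be the class of interest.

```r
library(mlr)

# Simulated stand-in for the loans data (assumption): 70% "no", 30% "yes"
set.seed(42)
n <- 300
toy <- data.frame(
  amount  = c(rnorm(210, 3000, 1000), rnorm(90, 4000, 1000)),
  housing = factor(sample(c("own", "rent"), n, replace = TRUE)),
  default = factor(rep(c("no", "yes"), times = c(210, 90)))
)

# Treat "yes" (a default) as the positive class so that tpr is the sensitivity
toy_task   <- makeClassifTask(data = toy, target = "default", positive = "yes")
nb_learner <- makeLearner("classif.naiveBayes")

# Stratified 5-fold cross-validation: each fold keeps roughly the 70/30 class ratio
cv_desc   <- makeResampleDesc("CV", iters = 5, stratify = TRUE)
cv_result <- resample(nb_learner, toy_task, cv_desc,
                      measures = list(acc, tpr, tnr), show.info = FALSE)

# Mean accuracy, sensitivity and specificity over the five held-out folds
cv_result$aggr
```

Averaging over folds makes the estimate less dependent on one particular random split than a single 80/20 hold-out.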
# Split the data into training (80%) and test (20%) sets.
split_80 <- sample(nrow(loans), nrow(loans)*0.80)
loans.train <- loans[split_80, ]
loans.test <- loans[-split_80, ]
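Because `sample()` draws different rows on every run, the split above is not reproducible and the class ratio in each subset can drift. A possible variant, sketched here in base R on a simulated `toy` data frame (an assumption, not the real loans data), fixes the seed and samples 80% within each class:

```r
# Simulated stand-in for the loans data (assumption): 1000 rows, 30% "yes"
set.seed(100)  # fix the random seed so the split is reproducible
toy <- data.frame(
  amount  = round(runif(1000, 250, 18000)),
  default = factor(rep(c("no", "yes"), times = c(700, 300)))
)

# Sample 80% of the row indices within each class, then combine them
idx_by_class <- split(seq_len(nrow(toy)), toy$default)
train_idx <- unlist(
  lapply(idx_by_class, function(i) sample(i, floor(length(i) * 0.8))),
  use.names = FALSE
)

toy.train <- toy[train_idx, ]
toy.test  <- toy[-train_idx, ]

# Both subsets keep the original 70/30 class ratio
prop.table(table(toy.train$default))
prop.table(table(toy.test$default))
```

The seed value is arbitrary; what matters is that it is set before sampling so the same rows are selected each time the document is knitted.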
# Proportion of defaults in the full dataset
prop.table(table(loans$default))
##
##  no yes
## 0.7 0.3
# Create a classification task on the training data and specify the target feature
task <- makeClassifTask(data = loans.train, target = "default")
# Initialize the Naive Bayes classifier
selected_model <- makeLearner("classif.naiveBayes")
# Train the model
NB_mlr <- train(selected_model, task)
NB_mlr$learner.model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## no yes
## 0.6925 0.3075
##
## Conditional probabilities:
## checking_balance
## Y < 0 DM > 200 DM 1 - 200 DM unknown
## no 0.20036101 0.06137184 0.24187726 0.49638989
## yes 0.43495935 0.04878049 0.35365854 0.16260163
##
## months_loan_duration
## Y [,1] [,2]
## no 19.56859 11.29674
## yes 24.75203 13.40421
##
## credit_history
## Y critical good perfect poor very good
## no 0.35198556 0.50361011 0.01805054 0.09566787 0.03068592
## yes 0.15853659 0.56504065 0.09349593 0.09349593 0.08943089
##
## purpose
## Y business car car0 education furniture/appliances
## no 0.08844765 0.32851986 0.01263538 0.04512635 0.50541516
## yes 0.12601626 0.36585366 0.02032520 0.08130081 0.37804878
## purpose
## Y renovations
## no 0.01985560
## yes 0.02845528
##
## amount
## Y [,1] [,2]
## no 3023.919 2477.857
## yes 3872.890 3499.636
##
## savings_balance
## Y < 100 DM > 1000 DM 100 - 500 DM 500 - 1000 DM unknown
## no 0.54151625 0.05956679 0.10108303 0.07220217 0.22563177
## yes 0.73983740 0.02032520 0.10975610 0.02845528 0.10162602
##
## employment_duration
## Y < 1 year > 7 years 1 - 4 years 4 - 7 years unemployed
## no 0.13898917 0.27075812 0.33754513 0.19675090 0.05595668
## yes 0.23170732 0.23170732 0.34552846 0.13008130 0.06097561
##
## percent_of_income
## Y [,1] [,2]
## no 2.949458 1.117496
## yes 3.032520 1.083720
##
## years_at_residence
## Y [,1] [,2]
## no 2.828520 1.111106
## yes 2.849593 1.105524
##
## age
## Y [,1] [,2]
## no 36.30325 11.18390
## yes 34.49187 11.23506
##
## other_credit
## Y bank none store
## no 0.12454874 0.82851986 0.04693141
## yes 0.18699187 0.75203252 0.06097561
##
## housing
## Y other own rent
## no 0.09747292 0.75992780 0.14259928
## yes 0.15447154 0.60975610 0.23577236
##
## existing_loans_count
## Y [,1] [,2]
## no 1.422383 0.5848640
## yes 1.394309 0.5811742
##
## job
## Y management skilled unemployed unskilled
## no 0.13898917 0.62996390 0.02527076 0.20577617
## yes 0.17073171 0.63414634 0.02032520 0.17479675
##
## dependents
## Y [,1] [,2]
## no 1.160650 0.3675395
## yes 1.158537 0.3659880
##
## phone
## Y no yes
## no 0.5884477 0.4115523
## yes 0.6016260 0.3983740
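# Predict on the training rows; note that only the first three columns are passed as predictors
predictions_mlr <- as.data.frame(predict(NB_mlr, newdata = loans.train[, 1:3]))
# Confusion matrix (rows = predicted, columns = actual). This is a training-set result,
# so loans.test still has to be scored to estimate performance on unseen data.
table(predictions_mlr[, 1], loans.train$default)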
##
##        no yes
##   no  507 159
##   yes  47  87
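The confusion matrix above is built from `loans.train`, so it does not yet say how the model behaves on rows it has not seen. A minimal sketch of a held-out evaluation with mlr follows. It assumes simulated `toy` data in place of the real loans, `"yes"` as the positive class, and an illustrative cutoff of 0.3 that is not tuned.

```r
library(mlr)

# Simulated stand-in for loans.train / loans.test (assumption)
set.seed(7)
n <- 500
toy <- data.frame(
  amount  = c(rnorm(350, 3000, 1000), rnorm(150, 4000, 1000)),
  default = factor(rep(c("no", "yes"), times = c(350, 150)))
)
train_rows <- sample(n, n * 0.8)
toy.train  <- toy[train_rows, ]
toy.test   <- toy[-train_rows, ]

# Fit on the training rows only; request probabilities so the cutoff can be moved
toy_task <- makeClassifTask(data = toy.train, target = "default", positive = "yes")
nb_fit   <- train(makeLearner("classif.naiveBayes", predict.type = "prob"), toy_task)

# Predict on rows the model has not seen, keeping every predictor column
test_pred <- predict(nb_fit, newdata = toy.test)
calculateConfusionMatrix(test_pred)
performance(test_pred, measures = list(acc, tpr, tnr))

# Lowering the cutoff for "yes" below 0.5 trades specificity for sensitivity
test_pred_low <- setThreshold(test_pred, 0.3)
performance(test_pred_low, measures = list(acc, tpr, tnr))
```

For a lender, a missed default is usually costlier than a rejected good applicant, which is one reason to consider a cutoff below 0.5; the right value depends on those costs.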
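# Derive recall (sensitivity) and specificity from the confusion matrix rather than hard-coding counts
conf_mat <- table(predictions_mlr[, 1], loans.train$default)
reca <- round(conf_mat["yes", "yes"] / sum(conf_mat[, "yes"]), 2)
spec <- round(conf_mat["no", "no"] / sum(conf_mat[, "no"]), 2)
paste("Recall:", reca)
## [1] "Recall: 0.35"
paste("Specificity:", spec)
## [1] "Specificity: 0.92"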
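To show why accuracy alone can mislead here, the following base R sketch recomputes the usual metrics from the four counts in the printed confusion matrix and compares accuracy with the share of "no" cases. The object names (`cm`, `metrics`, `baseline`) are introduced only for this illustration.

```r
# Counts copied from the printed confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(507, 47, 159, 87), nrow = 2,
             dimnames = list(predicted = c("no", "yes"), actual = c("no", "yes")))

tp <- cm["yes", "yes"]  # defaults correctly flagged
fn <- cm["no",  "yes"]  # defaults the model missed
tn <- cm["no",  "no"]   # good loans correctly passed
fp <- cm["yes", "no"]   # good loans wrongly flagged

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp)
)
round(metrics, 3)

# Accuracy of a model that always predicts "no" on the same rows
baseline <- sum(cm[, "no"]) / sum(cm)
baseline
```

Accuracy comes out at roughly 0.74, only a few points above the 0.69 obtained by always predicting "no", while the model misses about two thirds of the actual defaults.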
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
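# Fit a conditional inference tree; the model gets its own name so it is not confused with the default column
loans.tree <- ctree(default ~ ., data = loans.train)
plot(loans.tree, type = "simple")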
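The tree can be scored on held-out rows in the same way as the Naive Bayes model, which makes the two comparable. The sketch below is an illustration on simulated `toy` data (an assumption), and `toy_tree`, `sens` and `spec` are names used only in this example.

```r
library(partykit)

# Simulated stand-in for the loans data (assumption): longer loans default more often
set.seed(11)
n <- 400
toy <- data.frame(
  months  = c(rpois(280, 18), rpois(120, 26)),
  housing = factor(sample(c("own", "rent", "other"), n, replace = TRUE)),
  default = factor(rep(c("no", "yes"), times = c(280, 120)))
)
train_rows <- sample(n, n * 0.8)
toy.train  <- toy[train_rows, ]
toy.test   <- toy[-train_rows, ]

# Fit the tree on the training rows only
toy_tree <- ctree(default ~ ., data = toy.train)

# Predicted classes for the held-out rows
tree_pred <- predict(toy_tree, newdata = toy.test, type = "response")
tree_cm   <- table(predicted = tree_pred, actual = toy.test$default)
tree_cm

# Sensitivity and specificity on unseen data, with "yes" as the positive class
sens <- tree_cm["yes", "yes"] / sum(tree_cm[, "yes"])
spec <- tree_cm["no", "no"] / sum(tree_cm[, "no"])
round(c(sensitivity = sens, specificity = spec), 2)
```

Reporting the same two metrics for both models on the same test rows gives a fairer basis for choosing between them than comparing training-set results.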