Classification in Machine Learning II

Analyze the loan dataset, which contains historical data on bank customers who defaulted or did not default on their loans. The data is stored in this repository as loan.csv. To complete this assignment, you will need to build classification models using the Naive Bayes, Decision Tree, and Random Forest algorithms by following these steps:

Data Exploration

Before we jump into modeling, we will explore the data. Load the given data (loan.csv), assign it to an object named loan, and then investigate the data using the str() or glimpse() function.

# your code here
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Based on our investigation above, the loan data consists of 1,000 observations and 17 variables. The description of each feature is given below:

  • checking_balance and savings_balance: Status of the existing checking/savings account
  • months_loan_duration: Duration of the loan period in months
  • credit_history: One of critical, good, perfect, poor, or very good
  • purpose: One of business, car (new), car (used), education, furniture, or renovations
  • amount: Loan amount in DM (Deutsche Mark)
  • employment_duration: Length of time at the current job
  • percent_of_income: Installment rate as a percentage of disposable income
  • years_at_residence: Number of years at the current residence
  • age: Customer’s age
  • other_credit: Other installment plans (bank/store)
  • housing: One of rent, own, or for free
  • existing_loans_count: Number of ongoing loans
  • job: One of management, skilled, unskilled, or unemployed
  • dependents: Number of people the customer is liable to provide maintenance for
  • phone: Either no or yes (registered under the customer’s name)
  • default: Either no or yes. A loan is labeled yes when it is defaulted, charged off, or past its due date

You should also make sure that each column is stored with the right data type. You can do data wrangling below if you need to.

Tip: You can also pass stringsAsFactors = TRUE to read.csv() so that all character columns are automatically stored as factors.

# your code
loan <- read.csv("loan.csv", stringsAsFactors = TRUE)
glimpse(loan)
## Rows: 1,000
## Columns: 17
## $ checking_balance     <fct> < 0 DM, 1 - 200 DM, unknown, < 0 DM, < 0 DM, unkn~
## $ months_loan_duration <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12~
## $ credit_history       <fct> critical, good, critical, good, poor, good, good,~
## $ purpose              <fct> furniture/appliances, furniture/appliances, educa~
## $ amount               <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3~
## $ savings_balance      <fct> unknown, < 100 DM, < 100 DM, < 100 DM, < 100 DM, ~
## $ employment_duration  <fct> > 7 years, 1 - 4 years, 4 - 7 years, 4 - 7 years,~
## $ percent_of_income    <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4~
## $ years_at_residence   <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4~
## $ age                  <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 2~
## $ other_credit         <fct> none, none, none, none, none, none, none, none, n~
## $ housing              <fct> own, own, own, other, other, other, own, rent, ow~
## $ existing_loans_count <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2~
## $ job                  <fct> skilled, skilled, unskilled, skilled, skilled, un~
## $ dependents           <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ phone                <fct> yes, no, no, no, no, yes, no, yes, no, no, no, no~
## $ default              <fct> no, yes, no, no, yes, no, no, no, no, yes, yes, y~

As a data scientist, you will develop a model that aids management in their decision-making process. The first thing we need to know is what business question we would like to solve. Loans are risky, but at the same time they are a product that generates profit for the institution through the spread between borrowing and lending rates, so identifying risky customers is one way to minimize lender losses. From there, we will try to predict the default variable using the given set of predictors.

Before we go through the modeling section, take your time with the exploration step. Try to investigate the historical number of defaulted customers for each loan purpose. Please do some data aggregation to get the answer.

Hint: Because we are only focusing on the customers who defaulted, filter the data on the condition needed (default == “yes”).

# your code here
x <- loan %>% filter(default == "yes")
summary(x$purpose)
##             business                  car                 car0 
##                   34                  106                    5 
##            education furniture/appliances          renovations 
##                   23                  124                    8
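
The same aggregation can also be written with dplyr verbs; a quick equivalent sketch, assuming the loan object from above (output not shown):

# Count defaulted customers per loan purpose, most frequent first
loan %>%
  filter(default == "yes") %>%
  count(purpose, sort = TRUE)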

  1. Based on the exploration above, which purpose defaults most often?
  • Furniture/appliances
  • Car
  • Business
  • Education ___

Cross-Validation

Before we build our model, we should split the dataset into training and test data. Please split the data into 80% training and 20% test using the sample() function with set.seed(100), and store the results as data_train and data_test.

Note: Make sure you call RNGkind() and set.seed() before splitting, and run them together with your sample() code.

RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(loan), nrow(loan) * 0.8)
data_train <- loan[intrain, ]
data_test <- loan[-intrain, ]

Let’s look at the proportion of our target classes in the train data using prop.table(table(object$target)) to make sure we have a balanced class proportion in the train data.

# your code here
prop.table(table(data_train$default))
## 
##   no  yes 
## 0.71 0.29

Based on the proportion above, we can conclude that our target variable is imbalanced; hence we will have to balance the data before using it for our models. One important thing to keep in mind is that all sub-sampling operations have to be applied only to the training dataset. So please apply it to data_train using the downSample() function from the caret package, and store the downsampled data in a data_train_down object. You also need to make sure that the target variable is already stored as a factor.

Note: set the argument yname = "default"

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(100)
# your code here
data_train_down <- downSample(x = data_train %>% select(-default), 
                              y = data_train$default,
                              yname = "default")
table(data_train_down$default)
## 
##  no yes 
## 232 232

In the following steps, please use data_train_down to build the Naive Bayes, Decision Tree, and Random Forest models below.

Naive Bayes

After splitting our data into train and test sets and downsampling our train data, let us build our first model: Naive Bayes. This model has several advantages, for example:

  • The model is relatively fast to train
  • It produces probabilistic predictions
  • It can handle irrelevant features

  1. Below are characteristics of Naive Bayes, EXCEPT …
  • It assumes that the predictor variables are independent of one another
  • It assumes that the target and predictor variables are independent
  • It suffers from skewness due to data scarcity ___

Build a Naive Bayes model using the naiveBayes() function from the e1071 package, setting the laplace parameter to 1. Store the model as model_naive before moving on to the next section.

library(e1071)
# your code here
model_naive <- naiveBayes(default ~ ., data = data_train_down, laplace = 1)
model_naive
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  no yes 
## 0.5 0.5 
## 
## Conditional probabilities:
##      checking_balance
## Y         < 0 DM   > 200 DM 1 - 200 DM    unknown
##   no  0.20762712 0.08050847 0.25847458 0.45338983
##   yes 0.42796610 0.05508475 0.37288136 0.14406780
## 
##      months_loan_duration
## Y         [,1]     [,2]
##   no  18.70259 11.13329
##   yes 25.03879 13.65573
## 
##      credit_history
## Y       critical       good    perfect       poor  very good
##   no  0.39240506 0.47257384 0.02531646 0.08016878 0.02953586
##   yes 0.16877637 0.56118143 0.08860759 0.08860759 0.09282700
## 
##      purpose
## Y       business        car       car0  education furniture/appliances
##   no  0.09663866 0.33613445 0.02521008 0.02941176           0.48319328
##   yes 0.12605042 0.34033613 0.01260504 0.08403361           0.41596639
##      purpose
## Y     renovations
##   no   0.02941176
##   yes  0.02100840
## 
##      amount
## Y         [,1]     [,2]
##   no  2748.065 2042.307
##   yes 4048.832 3658.293
## 
##      savings_balance
## Y       < 100 DM  > 1000 DM 100 - 500 DM 500 - 1000 DM    unknown
##   no  0.53586498 0.05485232   0.11814346    0.07594937 0.21518987
##   yes 0.70464135 0.02531646   0.13080169    0.04219409 0.09704641
## 
##      employment_duration
## Y       < 1 year  > 7 years 1 - 4 years 4 - 7 years unemployed
##   no  0.12658228 0.28691983  0.31645570  0.20675105 0.06329114
##   yes 0.24472574 0.21940928  0.32911392  0.13502110 0.07172996
## 
##      percent_of_income
## Y         [,1]     [,2]
##   no  2.857759 1.140187
##   yes 3.099138 1.070501
## 
##      years_at_residence
## Y         [,1]     [,2]
##   no  2.818966 1.140359
##   yes 2.857759 1.077728
## 
##      age
## Y         [,1]     [,2]
##   no  36.72845 10.52062
##   yes 33.73276 11.12141
## 
##      other_credit
## Y           bank       none      store
##   no  0.11914894 0.83829787 0.04255319
##   yes 0.19574468 0.73617021 0.06808511
## 
##      housing
## Y          other        own       rent
##   no  0.08510638 0.77021277 0.14468085
##   yes 0.14042553 0.62978723 0.22978723
## 
##      existing_loans_count
## Y         [,1]      [,2]
##   no  1.409483 0.5586748
##   yes 1.366379 0.5499223
## 
##      job
## Y     management    skilled unemployed  unskilled
##   no  0.13983051 0.60593220 0.02966102 0.22457627
##   yes 0.16949153 0.61016949 0.02118644 0.19915254
## 
##      dependents
## Y         [,1]      [,2]
##   no  1.172414 0.3785564
##   yes 1.159483 0.3669173
## 
##      phone
## Y            no       yes
##   no  0.5897436 0.4102564
##   yes 0.6324786 0.3675214
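
The tables above are the Laplace-smoothed conditional probabilities. As a sanity check, one categorical row can be reproduced by hand: with laplace = 1, P(level | class) = (count + 1) / (class size + 1 * number of levels). A minimal sketch, assuming data_train_down from above:

# Reproduce P(checking_balance = level | default = "yes") with laplace = 1
tab <- table(data_train_down$default, data_train_down$checking_balance)
(tab["yes", ] + 1) / (sum(tab["yes", ]) + 1 * ncol(tab))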

Naive Bayes Model Prediction

Try to predict our test data using model_naive, setting type = "class" to obtain class predictions. Store the prediction as the pred_naive object.

# your code here
pred_naive <- predict(model_naive, newdata = data_test, type = "class")
pred_naive
##   [1] no  no  no  no  no  no  yes no  no  yes no  yes no  yes no  yes no  yes
##  [19] yes no  yes no  no  yes no  no  no  yes yes no  no  no  yes no  yes no 
##  [37] yes yes no  no  yes yes no  no  yes no  yes yes no  no  no  no  no  yes
##  [55] no  yes no  no  no  yes no  no  no  no  no  yes yes no  yes yes no  yes
##  [73] yes no  no  no  no  yes no  no  yes no  no  no  no  yes no  no  no  yes
##  [91] no  no  yes no  yes yes yes yes no  no  yes no  no  yes no  yes no  yes
## [109] no  yes yes no  no  yes yes no  no  no  no  no  no  yes no  no  no  no 
## [127] yes no  yes yes yes no  yes no  no  no  no  yes no  no  no  no  yes no 
## [145] no  yes no  yes no  no  no  no  no  yes yes no  no  no  yes no  yes no 
## [163] yes yes no  no  yes no  no  no  no  yes yes yes no  no  no  yes yes yes
## [181] yes yes yes no  yes no  no  yes yes yes no  no  no  yes no  no  no  yes
## [199] yes yes
## Levels: no yes
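
If you would rather inspect probabilities than hard class labels, predict() on a naiveBayes model also supports type = "raw"; a quick sketch (output not shown):

# Posterior probability of each class for every test observation
prob_naive <- predict(model_naive, newdata = data_test, type = "raw")
head(prob_naive)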

Naive Bayes Model Evaluation

The last part of model building is model evaluation. You can check the performance of the Naive Bayes model using confusionMatrix(), comparing the predicted classes (pred_naive) with the actual labels in data_test. Make sure that you’re using the defaulted customers as the positive class (positive = "yes").

# your code here
library(caret)
confusionMatrix(data = pred_naive, reference = data_test$default, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  96  24
##        yes 36  44
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.6314, 0.7626)
##     No Information Rate : 0.66            
##     P-Value [Acc > NIR] : 0.1310          
##                                           
##                   Kappa : 0.359           
##                                           
##  Mcnemar's Test P-Value : 0.1556          
##                                           
##             Sensitivity : 0.6471          
##             Specificity : 0.7273          
##          Pos Pred Value : 0.5500          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.3400          
##          Detection Rate : 0.2200          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.6872          
##                                           
##        'Positive' Class : yes             
## 

Decision Tree

The next model we’re trying to build is a Decision Tree. Use the ctree() function from the partykit package to build the model and store it as the model_dt object. To tune our model, let’s set the parameter mincriterion = 0.90.

library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
set.seed(100)
# your code here
model_dt <- ctree(formula = default ~ ., 
                  data = data_train_down,
                  control = ctree_control(mincriterion = 0.90))
model_dt
## 
## Model formula:
## default ~ checking_balance + months_loan_duration + credit_history + 
##     purpose + amount + savings_balance + employment_duration + 
##     percent_of_income + years_at_residence + age + other_credit + 
##     housing + existing_loans_count + job + dependents + phone
## 
## Fitted party:
## [1] root
## |   [2] checking_balance < 0 DM, 1 - 200 DM
## |   |   [3] months_loan_duration <= 11: no (n = 45, err = 33.3%)
## |   |   [4] months_loan_duration > 11: yes (n = 250, err = 31.2%)
## |   [5] checking_balance > 200 DM, unknown
## |   |   [6] other_credit in bank, store: yes (n = 33, err = 45.5%)
## |   |   [7] other_credit in none
## |   |   |   [8] employment_duration < 1 year, unemployed: no (n = 31, err = 41.9%)
## |   |   |   [9] employment_duration > 7 years, 1 - 4 years, 4 - 7 years
## |   |   |   |   [10] job in management, unskilled: no (n = 32, err = 28.1%)
## |   |   |   |   [11] job in skilled: no (n = 73, err = 6.8%)
## 
## Number of inner nodes:    5
## Number of terminal nodes: 6

  1. In our decision tree model, the goal of setting mincriterion = 0.90 is …
  • To prune our model: we let a node split only when its p-value is <= 0.90
  • To prune our model: we let a node split only when its p-value is <= 0.10
  • To prune our model: we allow at most 10% of the data in the terminal nodes

To get a better grasp of our model, please plot the model and set type = "simple".

# your code here
plot(model_dt, type = "simple")

  1. Based on the plot, which of the following interpretations is TRUE?
  • A customer who has checking_balance > 200 DM, credit_history labelled “perfect”, and savings_balance “unknown” is expected to default
  • A customer who has checking_balance 1 - 200 DM and months_loan_duration < 21 is expected to default
  • A customer whose checking_balance is “unknown” and whose other_credit is “store” is expected to default

Decision Tree Model Prediction

Now that we have the model, please predict the test data based on model_dt using the predict() function, setting the parameter type = "response" to obtain class predictions.

# your code here
pred_dt <- predict(model_dt, newdata = data_test, type = "response")
pred_dt
##   1   7  22  23  25  34  40  46  49  55  57  77  78  79  81  90  94  98 106 110 
##  no  no  no  no  no  no  no  no  no yes yes yes  no  no  no yes  no yes yes yes 
## 112 124 126 132 133 145 152 153 155 159 162 166 176 186 187 188 192 202 210 224 
##  no  no yes yes yes  no  no  no yes yes  no  no yes  no  no yes yes yes  no  no 
## 231 237 239 247 248 252 253 256 261 268 270 297 299 305 306 310 312 315 319 321 
##  no  no  no  no yes  no yes yes yes yes  no  no  no yes  no  no yes  no  no yes 
## 326 336 340 342 349 354 356 359 360 361 363 368 369 372 377 385 387 388 391 395 
##  no  no  no yes  no yes yes  no yes yes yes yes yes  no  no  no  no yes  no  no 
## 396 401 406 411 413 418 421 424 430 432 436 443 447 448 451 461 468 476 489 496 
## yes  no yes yes yes yes  no  no yes yes yes yes yes  no  no yes yes yes  no yes 
## 505 516 521 522 524 531 535 541 545 549 550 563 572 584 586 589 595 602 610 615 
## yes  no  no yes  no yes  no yes  no yes  no yes  no yes yes yes yes  no yes  no 
## 622 631 633 643 644 645 651 653 654 658 659 662 679 681 684 694 701 702 711 714 
## yes yes yes  no  no yes yes yes yes yes yes yes yes  no  no  no  no yes yes  no 
## 716 725 728 734 743 745 750 752 755 757 758 765 774 776 783 792 798 803 815 818 
##  no  no yes  no  no yes  no yes  no  no  no  no yes yes yes  no  no yes yes  no 
## 819 831 832 833 837 840 854 857 860 861 862 870 882 886 901 911 913 920 922 926 
## yes  no yes yes  no  no yes  no  no  no  no yes  no yes yes  no yes yes  no yes 
## 927 938 939 943 945 948 950 952 959 969 970 971 972 973 975 977 979 981 989 999 
## yes  no yes yes yes  no  no yes yes  no  no yes  no yes  no  no  no yes yes yes 
## Levels: no yes

Decision Tree Model Evaluation

We can use confusionMatrix() to assess our model performance. Make sure that you’re using the defaulted customers as the positive class (positive = "yes").

# your code here
confusionMatrix(pred_dt, reference = data_test$default, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  85  15
##        yes 47  53
##                                           
##                Accuracy : 0.69            
##                  95% CI : (0.6209, 0.7533)
##     No Information Rate : 0.66            
##     P-Value [Acc > NIR] : 0.2066          
##                                           
##                   Kappa : 0.38            
##                                           
##  Mcnemar's Test P-Value : 8.251e-05       
##                                           
##             Sensitivity : 0.7794          
##             Specificity : 0.6439          
##          Pos Pred Value : 0.5300          
##          Neg Pred Value : 0.8500          
##              Prevalence : 0.3400          
##          Detection Rate : 0.2650          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.7117          
##                                           
##        'Positive' Class : yes             
## 

Random Forest

The last model we want to build is a Random Forest. The following are among the advantages of the random forest model:

  • It reduces overfitting, as it aggregates multiple decision trees
  • Automatic feature selection
  • It generates an unbiased estimate of the out-of-bag error

Now, let’s explore the random forest model we have prepared in model_rf.RDS. The model_rf.RDS file was built with the following hyperparameters (a hypothetical training sketch follows the list):

  • set.seed(100) # the seed number
  • number = 5 # the number of folds in k-fold cross-validation
  • repeats = 3 # the number of repetitions
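
For reference, a model with these settings could have been produced along the following lines. This is a hypothetical sketch, not the exact code used to create model_rf.RDS:

# Hypothetical training sketch for model_rf.RDS (assumes data_train_down)
library(caret)
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", # repeated k-fold cross-validation
                     number = 5,            # 5 folds
                     repeats = 3)           # repeated 3 times
model_rf <- train(default ~ .,
                  data = data_train_down,
                  method = "rf",            # randomForest under the hood
                  trControl = ctrl)
saveRDS(model_rf, "model_rf.RDS")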

In your environment, please load the random forest model (model_rf.RDS) and save it under the model_rf object using the readRDS() function.

# your code here
set.seed(100)
model_rf <- readRDS('model_rf.RDS')
model_rf
## Random Forest 
## 
## 476 samples
##  16 predictor
##   2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 381, 382, 380, 381, 380, 381, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.6812466  0.3623815
##   18    0.6708007  0.3414363
##   35    0.6694115  0.3387421
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Now check the summary of the final model we built using model_rf$finalModel.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# your code here
model_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 33.61%
## Confusion matrix:
##      no yes class.error
## no  158  80   0.3361345
## yes  80 158   0.3361345

In practice, the random forest already has an out-of-bag (OOB) error estimate that represents its performance on out-of-bag data (the observations that were not sampled when building each tree).
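
You can verify the reported rate directly from the OOB confusion matrix above; the OOB error is simply the share of misclassified out-of-bag observations:

# Off-diagonal (misclassified) counts divided by all OOB counts
(80 + 80) / (158 + 80 + 80 + 158)  # = 0.3361, i.e. the 33.61% reported above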


  1. Based on the model_rf$finalModel summary above, how can we interpret the out-of-bag error rate of our model?
  • We have a 33.61% error rate on our unseen data
  • We have a 33.61% error rate on our train data
  • We have a 33.61% error rate on our loan data
  • We have a 33.61% error rate on our in-sample data ___

We can also use variable importance to get a list of the most influential variables in our random forest. Many would argue that random forest, being a black-box model, offers no real information beyond its predictive accuracy; in practice, however, paying special attention to attributes like variable importance often helps us gain valuable insight into our data.

Please take your time to check which variables have a high influence on the prediction. You can use the varImp() function and pass the result to the plot() function to get a visualization.

# your code here
plot(varImp(model_rf))
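
If you prefer the raw numbers behind the plot, the scores can also be pulled out as a data frame; a quick sketch, assuming model_rf from above (output not shown):

# Importance scores from the caret model, most influential first
imp <- varImp(model_rf)$importance
imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]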


  1. From the plot you have created, which variable has the most influence on the prediction?
  • checking_balance
  • months_loan_duration
  • amount
  • purpose ___

Random Forest Model Prediction

After building the model, we can now predict the test data based on model_rf. You can use the predict() function and set the parameter type = "raw" to obtain class predictions.

# your code here
pred_rf <- predict(model_rf, newdata = data_test, type = "raw")
pred_rf
##   [1] no  no  no  no  no  no  yes no  no  yes yes yes no  no  no  yes no  yes
##  [19] yes no  no  no  no  yes no  no  no  yes yes yes no  no  no  no  yes no 
##  [37] yes yes no  no  yes yes no  no  yes no  yes no  no  no  no  no  no  yes
##  [55] no  no  no  no  no  yes no  yes no  no  no  yes yes no  yes yes no  yes
##  [73] yes no  no  no  no  no  no  no  no  no  yes no  no  no  no  no  yes yes
##  [91] yes no  yes no  no  yes no  yes no  yes yes no  yes yes no  yes no  yes
## [109] no  yes no  no  no  yes yes yes no  no  no  yes yes yes yes yes no  no 
## [127] yes yes yes no  no  yes no  no  no  no  no  yes no  no  no  yes yes no 
## [145] no  no  no  yes yes no  yes no  no  yes yes no  no  yes yes no  no  no 
## [163] yes yes no  no  yes no  no  no  no  yes no  yes yes no  yes yes yes yes
## [181] yes no  yes no  yes no  no  yes yes no  no  yes no  yes no  no  no  yes
## [199] yes yes
## Levels: no yes

Random Forest Model Evaluation

Next, let us evaluate the random forest model we built using confusionMatrix(). How should you evaluate the model performance?

# your code here
confusionMatrix(pred_rf, reference = data_test$default, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  108   9
##        yes  24  59
##                                           
##                Accuracy : 0.835           
##                  95% CI : (0.7762, 0.8836)
##     No Information Rate : 0.66            
##     P-Value [Acc > NIR] : 2.436e-08       
##                                           
##                   Kappa : 0.651           
##                                           
##  Mcnemar's Test P-Value : 0.01481         
##                                           
##             Sensitivity : 0.8676          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7108          
##          Neg Pred Value : 0.9231          
##              Prevalence : 0.3400          
##          Detection Rate : 0.2950          
##    Detection Prevalence : 0.4150          
##       Balanced Accuracy : 0.8429          
##                                           
##        'Positive' Class : yes             
## 

Another way of evaluating model performance is through the ROC curve and the AUC value. To calculate them, we need the probability of the positive class for each observation. Let’s focus on the ROC and AUC values of our random forest prediction. First, predict the test data using model_rf, but now with the parameter type = "prob". The prediction results in probability values for each class. You can store the prediction in a prob_test object.

# your code here
prob_test <- predict(model_rf, newdata = data_test, type = "prob")
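
The result is a data frame with one probability column per class, with each row summing to 1; you can peek at the first rows if you like (output not shown):

# One probability per class for every test observation
head(prob_test)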

Now, use the prediction() function from the ROCR package to compare the probability of the positive class, prob_test[, "yes"], with the actual data data_test$default, and store the result as a pred_roc object.

library(ROCR)
# your code here
pred_roc <- prediction(prob_test[, "yes"], data_test$default)

Next, please use the performance() function from the ROCR package, define the axes, and assign the result to a perf object. To use the performance() function, please define the arguments as below:

  • prediction.obj = pred_roc
  • measure = "tpr"
  • x.measure = "fpr"

# your code here
perf <- performance(pred_roc, "tpr", "fpr")
perf
## A performance instance
##   'False positive rate' vs. 'True positive rate' (alpha: 'Cutoff')
##   with 150 data points

After you have created the perf object, plot the performance by passing it to the plot() function.

# your code here
plot(perf)
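
Optionally, you can add the diagonal chance line that a random classifier would trace, which makes the curve easier to judge:

# Dashed 45-degree reference line; the further the curve bows above it, the better
abline(0, 1, lty = 2)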

Try to evaluate the ROC curve and see if there are any undesirable results from our model. Next, take a look at the AUC value using the performance() function, setting the arguments prediction.obj = pred_roc and measure = "auc", then save it under an auc object.

# your code here
auc <- performance(pred_roc, measure = "auc")
print(auc@y.values)
## [[1]]
## [1] 0.9263592

  1. From the result above, how do you interpret the AUC value?
  • 92.64% means that the model performance is good, because the closer to 1 the better
  • 92.64% means that the model performance is good at classifying positive classes
  • 92.64% means that the model performance is good at classifying both the positive and the negative class
  • 92.64%, as the area under the ROC curve, represents the accuracy of the model ___

Models Comparison

  1. As a data scientist in a financial institution, we are required to generate a rule-based model that can be easily implemented in the existing system. What is the best model for us to pick?
  • Naive Bayes, because all the conditional probabilities are well calculated
  • Decision Tree, because the model can be easily translated into a set of rules
  • Random Forest, because it is possible to trace back the rules using the variable importance information
  1. Among all the models we have made, which model performs best in terms of identifying all high-risk customers? (See the recall sketch after this list.)
  • Naive Bayes
  • Decision Tree
  • Random Forest
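
Identifying all high-risk customers corresponds to recall (sensitivity) on the positive class. As a minimal sketch, assuming the pred_naive, pred_dt, pred_rf, and data_test objects from the sections above, the three test-set recalls can be collected side by side:

# Test-set recall (sensitivity) of each model, positive class "yes"
data.frame(
  model  = c("Naive Bayes", "Decision Tree", "Random Forest"),
  recall = c(caret::sensitivity(pred_naive, data_test$default, positive = "yes"),
             caret::sensitivity(pred_dt,    data_test$default, positive = "yes"),
             caret::sensitivity(pred_rf,    data_test$default, positive = "yes")))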

Last but not least, the goal of a good machine learning model is to generalize well from the training data to any data from the problem domain, which allows us to make predictions on data the model has never seen. Overfitting and underfitting are the terms used in machine learning to describe how well a model learns and generalizes to new data.

To validate whether our model fits well, we can predict both the train and the test data and then evaluate the model performance on each. You can check whether the performance is well balanced based on the threshold you have set, as the sketch below illustrates for the random forest.
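
A minimal sketch, assuming the model_rf, data_train_down, pred_rf, and data_test objects from above (the same check applies to the other models):

# Compare in-sample vs. out-of-sample recall to spot over- or underfitting
pred_train <- predict(model_rf, newdata = data_train_down, type = "raw")
caret::sensitivity(pred_train, data_train_down$default, positive = "yes")
caret::sensitivity(pred_rf, data_test$default, positive = "yes")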


  1. Based on your knowledge of the characteristics of machine learning models, which statement below is FALSE?
  • Overfitting is a condition where a model performs well on the training data but very poorly on the test data.
  • Underfitting is a condition where a model performs poorly on the training data but well on the test data.
  • A machine learning model that fits just right may have a slightly lower performance on its test data than on its training data. ___