Practical 2

Exercice 1 (5.6 of the book)

a. Estimated profit WITHOUT any predictive model

If the firm conducts another study but with 1000 leads instead of 500 (note that it is IMPORTANT to know that the new 1000 leads are SIMILAR to the 500 that came before), the estimated profit (without using a predictive model) would simply remain the same : $2500 (without sales costs), because the only thing we are doing is increasing the sample size, which won’t change much average results.

Here we do not speak of “economic” profit (benefits MINUS costs), because in the prompt the book refers to the profit by EXCLUDING costs, so I considered (to have consistency with the prompt) that the “estimated profit” that this question asks to compute is ALSO exclusing costs (hence, I do not subtract the costs to the average estimated profit).

If I would have considered the economic profit, then the estimated profit without any predictive model would be $2500 - $2500 = $0.

For the rest of the exercice, I will not be considering the usual economic profit, but the one excluding costs (as it is given in the prompt).

b. How many deciles for the profit to roughly double the cost?

Thanks to the decile chart, the firm can identify which are the top leads, in order to target those instead of choosing leads at random. From the decile chart, we can see that, for the first decile, we are doing twice as good with the predictive model than with no model at all (random assignment), because the y-axis value equals to 2.0.

If with no model we get an average estimated profit of $2500, then by concentrating on the first decile (top 10% leads) the firm can double this number and get an average profit of $5000.

c. Cutoff of $2500

A cutoff of $2500 means that the predictive model is doing exactly as well as no model would, because $2500 is the estimated global profit before any model as implemented. So, this means that the predictive model is useless.

This corresponds to the 50th percentile, because its y-axis value equals 1.0.

In this context, a 1.0 means that there is no difference between the model we are considering and a random assigment of leads.

d. Why a two-stage process?

The two stages are :

First, create our model.
Second, choose the best records in line with our variable of interest (e.g. chosing the best 10% most likely to accept the loan). We do this by contructing lift charts or decile charts.

If we had wanted to directly build a model for the 1000 new leads, then we would not be able to recognize the BEST observations based on our variable of interest, because we would be treating all the observations equally.

Thus, applying this two-step procedure is really necessary to be sure that we concentrate our efforts on the best records, depending on our interests.

Exercice 2 (7.2 of the book)

Preliminaries (importing data & visualization)

rm(list = ls()) # clean environment
cat("\014") # clean console

setwd("~/UNIGE/MASTERS/MaBAn (2020-2022)/Part 1/CVTDM/HW2")
library(class)
library(caret)
library(dummies)
library(tibble)
library(ggplot2)
library(ggpmisc)
library(reshape)
library(e1071)
library(emo)

# Importing the data
bank <- read.csv("UniversalBank.csv", header = TRUE, sep = ",")

t(t(names(bank)))

##       [,1]                
##  [1,] "ID"                
##  [2,] "Age"               
##  [3,] "Experience"        
##  [4,] "Income"            
##  [5,] "ZIP.Code"          
##  [6,] "Family"            
##  [7,] "CCAvg"             
##  [8,] "Education"         
##  [9,] "Mortgage"          
## [10,] "Personal.Loan"     
## [11,] "Securities.Account"
## [12,] "CD.Account"        
## [13,] "Online"            
## [14,] "CreditCard"

str(bank)

## 'data.frame':    5000 obs. of  14 variables:
##  $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
##  $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
##  $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...

# We see that many variables are considered integers, so we must convert them into factors
bank$Education = as.factor(bank$Education)
bank$Personal.Loan = as.factor(bank$Personal.Loan)
bank$Securities.Account = as.factor(bank$Securities.Account)
bank$CD.Account = as.factor(bank$CD.Account)
bank$Online = as.factor(bank$Online)
bank$CreditCard = as.factor(bank$CreditCard)

# Now, we create dummies for our education
bank_2 <-
  cbind(bank[1:7], dummy(bank$Education, sep = "_"), bank[9:14])

names(bank_2)[8:10] <- c("Education_1", "Education_2", "Education_3")

attach(bank_2)

Data partition (training : 60%, validation : 40%)

set.seed(1)

# Partitioning the data
train.obs <- sample(rownames(bank_2), dim(bank_2)[1]*0.6)
train.set <- bank_2[train.obs, c(-1,-5)]  # We take away ID and ZIP.Code

valid.obs <- setdiff(rownames(bank_2), train.obs)
valid.set <- bank_2[valid.obs, c(-1,-5)]

a. Performing k-NN (with k = 1) to classify new customer

The first thing to know, is that we want to predict the variable called “Personal.Loan”. And to do this, we are using all the other explanatory variables except “ID” and “ZIP.Code”.

But first of all, we should normalize the data, since we have different scales and big values can dominate small values! We normalize only the numerical data.

# Normalizing the data

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

train.set.norm <- as.data.frame(lapply(train.set[, c(1:5, 9)], normalize))

valid.set.norm <- as.data.frame(lapply(valid.set[, c(1:5, 9)], normalize))

# Regrouping into final dataset, with replacement of non-normalized variables
train.final <- cbind(train.set.norm, train.set[, c(6:8, 10:14)])

valid.final <- cbind(valid.set.norm, valid.set[, c(6:8, 10:14)])

And now, we perform the k-NN :

# Creating new customer and then normalizing its variables :

# First, only create the numerical variables, in order to normalize
new_cust <-
  data.frame(
    Age = 40,
    Experience = 10,
    Income = 84,
    Family = 2,
    CCAvg = 2,
    Mortgage = 0
)

norm.values <- preProcess(train.set[, c(1:5, 9)], method = "range")

new_cust_temp <- predict(norm.values, new_cust)


# Add the remaining non-numerical predictors

new_cust_final <- add_column(new_cust_temp, Education_1 = 0, Education_2 = 1, Education_3 = 0, Securities.Account = 0, CD.Account = 0, Online = 1, CreditCard =1, .after = "Mortgage")


# Building the model :
set.seed(1)
k_nn <-
  knn(
    train = train.final[, -10],
    test = new_cust_final,
    cl = train.final[, 10],
    k = 1
  )

k_nn  # The model classifies this customer as NOT accepting the personal loan (value of "0")

## [1] 0
## Levels: 0 1

So the new customer will be classified as NOT accepting the personal loan, because the outcome is a “0”.

b. What is the best “k”?

set.seed(1)

# We do k-NN this time with validation data instead of the single customer

k_nn_2 <- knn(
    train = train.final[, -10],
    test = valid.final[, -10],
    cl = train.final[, 10],
    k = 1,
    prob = TRUE
    
  )


# The following is the resulting confusion matrix

confusionMatrix(data = k_nn_2, as.factor(valid.final[, 10]), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1769   52
##          1   26  153
##                                           
##                Accuracy : 0.961           
##                  95% CI : (0.9516, 0.9691)
##     No Information Rate : 0.8975          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7754          
##                                           
##  Mcnemar's Test P-Value : 0.004645        
##                                           
##             Sensitivity : 0.7463          
##             Specificity : 0.9855          
##          Pos Pred Value : 0.8547          
##          Neg Pred Value : 0.9714          
##              Prevalence : 0.1025          
##          Detection Rate : 0.0765          
##    Detection Prevalence : 0.0895          
##       Balanced Accuracy : 0.8659          
##                                           
##        'Positive' Class : 1               
##

From the confusion matrix, we see that concerning accuracy we did pretty well (96.1% of records were correctly classified). However, if we have a look at sensitivity, which captures the number of “1s” correctly classified, we see that we did worse (only 74.63%). So we may take into account sensitivity (instead of accuracy) when finding the best “k”, since it concerns the classification of our class of interest, the fact of accepting a loan or not!

# Doing Cross-validation to find best "k"

ctrl = trainControl(method = "cv",
                    number = 10,
                    summaryFunction = twoClassSummary)

set.seed(1)

knn_cross = train(
  as.factor(Personal.Loan) ~.,
  data = train.final,
  method = "knn",
  trControl = ctrl,
  metric = "Spec",
  tuneGrid = expand.grid(k = seq(50)),
)

# Plotting the results and finding the maximum sensitivity
ggplot(data = knn_cross, aes(x = x, y = y)) + geom_line() + stat_peaks(col = "red", span = NULL)

# Or getting the maximum with this line of code :
which.max(knn_cross$results$Spec)

## [1] 1

We clearly see from this graph (the maximum sensitivity is located at the RED point) that the best “k” in terms of maximizing sensitivity is :

k = 1.

IMPORTANT NOTE :

in the `train()` function, “Spec” means “Sens” and viceversa! So in the graph, the y-axis represents the SENSITIVITY, and not the specificity…

c. Confusion matrix for validation using best “k”

In fact, this has already been done during previous letter b :

confusionMatrix(data = k_nn_2, as.factor(valid.final[, 10]), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1769   52
##          1   26  153
##                                           
##                Accuracy : 0.961           
##                  95% CI : (0.9516, 0.9691)
##     No Information Rate : 0.8975          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7754          
##                                           
##  Mcnemar's Test P-Value : 0.004645        
##                                           
##             Sensitivity : 0.7463          
##             Specificity : 0.9855          
##          Pos Pred Value : 0.8547          
##          Neg Pred Value : 0.9714          
##              Prevalence : 0.1025          
##          Detection Rate : 0.0765          
##    Detection Prevalence : 0.0895          
##       Balanced Accuracy : 0.8659          
##                                           
##        'Positive' Class : 1               
##

d. Classify a new customer using best “k”

Since the best “k” is k = 1, this question has already been answered. Because this customer is exactly the same as the one in the previous letter b.!

e. Repartitioning the data TRIFOLD (50 - 30 - 20) and comparing

set.seed(1)

# Partitioning the data (50% train, 30% validation, 20% test)

train_2.obs <- sample(rownames(bank_2), dim(bank_2)[1]*0.5)
train_2.set <- bank_2[train_2.obs, c(-1,-5)]  # We take away ID and ZIP.Code

valid_2.obs <- sample(setdiff(rownames(bank_2), train_2.obs), dim(bank_2)[1]*0.3)
valid_2.set <- bank_2[valid_2.obs, c(-1,-5)]

test_2.obs <- setdiff(rownames(bank_2), union(train_2.obs, valid_2.obs))
test_2.set <- bank_2[test_2.obs, c(-1, -5)]

Now, we normalize the data

# Normalizing the data

train_2.set.norm <- as.data.frame(lapply(train_2.set[, c(1:5, 9)], normalize))

valid_2.set.norm <- as.data.frame(lapply(valid_2.set[, c(1:5, 9)], normalize))

test_2.set.norm <- as.data.frame(lapply(test_2.set[, c(1:5, 9)], normalize))

# Regrouping into final dataset, with replacement of non-normalized variables

train_2.final <- cbind(train_2.set.norm, train_2.set[, c(6:8, 10:14)])

valid_2.final <- cbind(valid_2.set.norm, valid_2.set[, c(6:8, 10:14)])

test_2.final <- cbind(test_2.set.norm, test_2.set[, c(6:8, 10:14)])

And now, we perform the k-NN, first with validation set, and second with test set :

set.seed(1)

# We do k-NN this time with validation data instead of the single customer

k_nn_first <- knn(
    train = train_2.final[, -10],
    test = valid_2.final[, -10],
    cl = train_2.final[, 10],
    k = 1,
    prob = TRUE
    
  )


# The following is the resulting confusion matrix

confusionMatrix(data = k_nn_first, as.factor(valid_2.final[, 10]), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1335   33
##          1   29  103
##                                           
##                Accuracy : 0.9587          
##                  95% CI : (0.9473, 0.9682)
##     No Information Rate : 0.9093          
##     P-Value [Acc > NIR] : 1.294e-13       
##                                           
##                   Kappa : 0.746           
##                                           
##  Mcnemar's Test P-Value : 0.7032          
##                                           
##             Sensitivity : 0.75735         
##             Specificity : 0.97874         
##          Pos Pred Value : 0.78030         
##          Neg Pred Value : 0.97588         
##              Prevalence : 0.09067         
##          Detection Rate : 0.06867         
##    Detection Prevalence : 0.08800         
##       Balanced Accuracy : 0.86805         
##                                           
##        'Positive' Class : 1               
##

# And now, we do k-NN with train and test sets :

k_nn_second <- knn(
    train = train_2.final[, -10],
    test = test_2.final[, -10],
    cl = train_2.final[, 10],
    k = 1,
    prob = TRUE
    
  )


# The resulting confusion matrix is :

confusionMatrix(data = k_nn_second, as.factor(test_2.final[, 10]), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 870  23
##          1  18  89
##                                           
##                Accuracy : 0.959           
##                  95% CI : (0.9448, 0.9704)
##     No Information Rate : 0.888           
##     P-Value [Acc > NIR] : 6.875e-16       
##                                           
##                   Kappa : 0.7898          
##                                           
##  Mcnemar's Test P-Value : 0.5322          
##                                           
##             Sensitivity : 0.7946          
##             Specificity : 0.9797          
##          Pos Pred Value : 0.8318          
##          Neg Pred Value : 0.9742          
##              Prevalence : 0.1120          
##          Detection Rate : 0.0890          
##    Detection Prevalence : 0.1070          
##       Balanced Accuracy : 0.8872          
##                                           
##        'Positive' Class : 1               
##

We can see that the main difference between the first (with validation) and the second (with test) concerns the SENSITIVITY. Clearly, with the test set the sensitivity is higher than with the validation, which means that the model is really doing well in correctly classifying the positive class (here, accepting the loan). And more generally, we see that all the metrics (accuracy, sensitivity and specificity) are higher when using the test set, so we may think that our model is a good one.

However, as always in stochastics, it is not a good practice to take a number and not to question it, meaning that every number has some uncertainty in it. For this reason, it is always necessary to create confidence intervals to have a better idea of the variation range for a given number.

In that sense, probably the higher accuracy with the test set would have been smaller if we simply took another sample from the population!

Exercice 3 (8.1 of the book)

Preliminaries (partitioning the data and selecting only 2 predictors)

# Selecting only CreditCard and Online

new_bank <- bank[, c(10, 13, 14)]

set.seed(1)

# Partitioning the data

train_3.obs <- sample(rownames(new_bank), dim(new_bank)[1]*0.6)
train_3.set <- new_bank[train_3.obs, ]

attach(train_3.set)

valid_3.obs <- setdiff(rownames(new_bank), train_3.obs)
valid_3.set <- new_bank[valid_3.obs, ]

attach(valid_3.set)

a. Creating a pivot table

First, we need to create a new variable that computes the sum of all the columns by rows. And only then can the pivot table be created.

# Creating a new "sum" variable :
train_3.set$Sum <- rowSums(sapply(train_3.set, as.numeric))

# Putting all the columns together :
mlt <- melt(train_3.set, id=c("Online", "CreditCard", "Personal.Loan"), measure=c("Sum"))

# Finally, creating the pivot table :
cast(mlt, CreditCard + Personal.Loan ~ Online, subset=variable=="Sum", margins=c("grand_row", "grand_col"))

## Aggregation requires fun.aggregate: length used as default

##   CreditCard Personal.Loan    0    1 (all)
## 1          0             0  805 1119  1924
## 2          0             1   79  119   198
## 3          1             0  332  469   801
## 4          1             1   30   47    77
## 5      (all)         (all) 1246 1754  3000

It is true that a simple `table()` is easier to do, as shown in the following few lines of code :

bank_ordered <- data.frame("CreditCard" =  train_3.set$CreditCard, "Online" = train_3.set$Online, "Personal.Loan" = train_3.set$Personal.Loan)

table(bank_ordered)

## , , Personal.Loan = 0
## 
##           Online
## CreditCard    0    1
##          0  805 1119
##          1  332  469
## 
## , , Personal.Loan = 1
## 
##           Online
## CreditCard    0    1
##          0   79  119
##          1   30   47

b. Calculating P(Personal.Loan = 1 | CC = 1, Online = 1) using pivot table

The faster and more intuitive way to do this, is simply the following :

First, find the total population consisting of ONLY having both CC = 1 and Online = 1. In this case, it is 469 + 47 = 516.
Second, find how many cases are favorable. In this case, it is 47, cause we are looking for Personal.Loan = 1.
Finally, since all probabilities are based on (# Favorable cases / Total # of cases), just compute 47/516. The result is thus 0.0916 (approximately).

Now, there is a more complicated way. Given that CreditCard = 1 and Online = 1, the probability of accepting the loan offer is :

\[\frac{P(CC = 1, Online = 1|PL = 1)*P(PL=1)}{P(CC = 1, Online = 1|PL = 0)*P(PL=0)+ P(CC = 1, Online = 1|PL = 1)*P(PL=1)}\]

where :

\[P(PL=1) =\frac{198 + 77}{3000}\] \[P(PL=0) =\frac{1924 + 801}{3000}\] \[P(CC = 1, Online = 1|PL = 1) = \frac{47}{275}\] \[P(CC = 1, Online = 1|PL = 0) = \frac{469}{2725}\]

The result is, approximately :

\[P(PL = 1 | CC = 1, Online = 1) = 0.0916\]

c. Creating TWO pivot tables

# Loan as function of Online

table(Personal.Loan, Online)

##              Online
## Personal.Loan    0    1
##             0  690 1105
##             1   80  125

# Loan as function of CreditCard

table(Personal.Loan, CreditCard)

##              CreditCard
## Personal.Loan    0    1
##             0 1269  526
##             1  139   66

d. Computing several probabilities (results are approximated)

\[ P(CC = 1|PL=1) = 0.322\]

\[ P(Online=1|PL=1) = 0.61\]

\[ P(PL=1) = 0.0916\]

\[ P(CC = 1|PL=0) = 0.294\] \[ P(Online=1|PL=0) = 0.583\] \[ P(PL=0) = 0.908\]

e. Compute P(Personal.Loan = 1 | CC = 1, Online = 1) using the above probabilities

The formula that makes use of the previous probabilities is :

\[\frac{P(CC = 1|PL = 1)*P(PL=1)*P(Online=1|PL=1)}{P(CC = 1|PL = 1)*P(PL=1)*P(Online=1|PL=1)+ P(CC = 1|PL = 0)*P(PL=0)*P(Online=1|PL=0)}\]

By replacing with the corresponding (approximated) numbers, we get :

\[P(PL= 1 | CC = 1, Online = 1) = \frac{0.017992}{0.017992+0.15563301} = 0.1036\]

f. Compare result in e. with result in b.

In b. we got 0.0916, and in e. we got 0.1036.

Due to approximations in the decimals, it is normal that both numbers are not completely equal. However, they are almost the same.

Concerning accuracy, the result in letter b. is more accurate, because in letter b. we actually performed an Optimal Bayes classifier, since we knew the posterior probabilities thanks to the table. It is “optimal” because we do not need to estimate anything, we already have everything we need. So, it is more accurate. This is possible because we ONLY have two predictor variables. If we had many predictors, just imagine the resulting pivot table… IMPOSSIBLE! We would need to create SEPARATE tables.

If we had many predictors, instead of doing the pivot table in letter b. (which, again, is impossible), we would only be able to do separated pivot tables, one for each predictor and the outcome variable. Now, THIS is actually what we did for letter e. However, we had to FIRST assume independence of the predictors, which changes the formula to the one of letter e. (where, for instance, P(CC = 1, Online = 1|PL = 1) is equal to P(CC = 1|PL = 1)*P(Online=1|PL=1), thanks to independence!). This corresponds to the “Naïve Bayes” classifier.

g. Run Naïve Bayes on the data with R

# Running the Naïve Bayes :

NB <- naiveBayes(Personal.Loan ~., data = train_3.set)

NB

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##          0          1 
## 0.90833333 0.09166667 
## 
## Conditional probabilities:
##    Online
## Y           0         1
##   0 0.4172477 0.5827523
##   1 0.3963636 0.6036364
## 
##    CreditCard
## Y          0        1
##   0 0.706055 0.293945
##   1 0.720000 0.280000
## 
##    Sum
## Y       [,1]      [,2]
##   0 3.876697 0.6726704
##   1 4.883636 0.6680300

# Creating a new record :

new_obs <- data.frame(Online = factor(1), CreditCard = factor(1), Sum = 2)

# Predicting this new record using our Naïve Bayes classifier :

predict(NB, newdata = new_obs)

## [1] 0
## Levels: 0 1

It is very logical for the prediction to be Personal.Loan = 0, because since we saw already many times, the probability of accepting the loan (Personal.Loan = 1) given that Online AND CreditCard both = 1 is quite low (around 9%).

Also, we can see that on the “NB” output, all the probabilities correspond to the ones we found previously in letter d. So this means that we did a good job! 😄

Machine Learning with R - Homework 2

Ángel Tomás-Ripoll

3/10/2020