Practical 4

Exercice 1 (13.1 of the book)

Preliminaries (importing data, preprocessing data, & visualization)

rm(list = ls()) # clean environment
cat("\014") # clean console
library(data.table)
library(class)
library(pander)
library(ROCR)
library(tidyverse)
library(caret)
library(rpart)
library(rpart.plot)
library(dummies)
library(modeest)
library(lift)
library(randomForest)
library(adabag)

setwd("~/UNIGE/MASTERS/MaBAn (2020-2022)/Part 1/CVTDM/HW4")

# Importing the data
bank <- read.csv("UniversalBank.csv", header = TRUE, sep = ",")

t(t(names(bank)))
##       [,1]                
##  [1,] "ID"                
##  [2,] "Age"               
##  [3,] "Experience"        
##  [4,] "Income"            
##  [5,] "ZIP.Code"          
##  [6,] "Family"            
##  [7,] "CCAvg"             
##  [8,] "Education"         
##  [9,] "Mortgage"          
## [10,] "Personal.Loan"     
## [11,] "Securities.Account"
## [12,] "CD.Account"        
## [13,] "Online"            
## [14,] "CreditCard"
str(bank)
## 'data.frame':    5000 obs. of  14 variables:
##  $ ID                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Experience        : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income            : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ ZIP.Code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
##  $ Family            : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ CCAvg             : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education         : int  1 1 1 2 2 2 2 3 2 3 ...
##  $ Mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Personal.Loan     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Securities.Account: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ CD.Account        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Online            : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ CreditCard        : int  0 0 0 0 1 0 0 1 0 0 ...
# We see that many variables are considered integers, so we must convert them into factors
bank$Education = as.factor(bank$Education)
bank$Personal.Loan = as.factor(bank$Personal.Loan)
bank$Securities.Account = as.factor(bank$Securities.Account)
bank$CD.Account = as.factor(bank$CD.Account)
bank$Online = as.factor(bank$Online)
bank$CreditCard = as.factor(bank$CreditCard)

# Now, we create dummies for our education
bank_2 <-
  cbind(bank[1:7], dummy(bank$Education, sep = "_"), bank[9:14])

names(bank_2)[8:10] <- c("Education_1", "Education_2", "Education_3")

attach(bank_2)


Now we must partition the data (training : 60%, validation : 40%)
set.seed(1)

# Partitioning the data
train.obs <- sample(rownames(bank_2), dim(bank_2)[1]*0.6)
train.set <- bank_2[train.obs, -c(1,5)]  # We take away ID and ZIP.Code

valid.obs <- setdiff(rownames(bank_2), train.obs)
valid.set <- bank_2[valid.obs, -c(1,5)]


Also, do not forget to normalize the data! We will need this when performing k-NN!
# Normalizing the data

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

train.set.norm <- as.data.frame(lapply(train.set[, c(1:5, 9)], normalize))

valid.set.norm <- as.data.frame(lapply(valid.set[, c(1:5, 9)], normalize))

# Regrouping into final dataset, with replacement of non-normalized variables
train.norm <- cbind(train.set.norm, train.set[, c(6:8, 10:14)])

valid.norm <- cbind(valid.set.norm, valid.set[, c(6:8, 10:14)])


And now, we are ready to begin!



a. Fiting each of the three models

Logistic regression

# Running the logistic model :
set.seed(1)

logistic <-
  glm(Personal.Loan ~ ., data = train.set, family = binomial)

summary(logistic)
## 
## Call:
## glm(formula = Personal.Loan ~ ., family = binomial, data = train.set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0741  -0.1898  -0.0701  -0.0232   4.1048  
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -8.1846978  2.3185999  -3.530 0.000416 ***
## Age                 -0.0324946  0.0858273  -0.379 0.704982    
## Experience           0.0390812  0.0851023   0.459 0.646072    
## Income               0.0591487  0.0038527  15.353  < 2e-16 ***
## Family               0.5787243  0.0993320   5.826 5.67e-09 ***
## CCAvg                0.1548502  0.0571994   2.707 0.006785 ** 
## Education_1         -3.9324934  0.3449324 -11.401  < 2e-16 ***
## Education_2         -0.1110033  0.2453163  -0.452 0.650916    
## Education_3                 NA         NA      NA       NA    
## Mortgage             0.0011437  0.0007617   1.501 0.133241    
## Securities.Account1 -0.7739587  0.3957411  -1.956 0.050498 .  
## CD.Account1          3.7923306  0.4801242   7.899 2.82e-15 ***
## Online1             -0.6626600  0.2103154  -3.151 0.001628 ** 
## CreditCard1         -1.1142969  0.2877270  -3.873 0.000108 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1838.26  on 2999  degrees of freedom
## Residual deviance:  704.97  on 2987  degrees of freedom
## AIC: 730.97
## 
## Number of Fisher Scoring iterations: 8
# Making predictions for the validation set :

pred_prob <- predict(logistic, valid.set, type = "response")


Now we have the predicted PROBABILITIES. However, we want to have the predicted class. But for that, we must choose a cutoff value to decide whether a given probability indicates class “0” or class “1”.
Usually, one goes for the “default” 0.5 cutoff. However, we can do something smarter! 😎


We can perform various classifications with many different cutoffs, and then compare the specificity versus the sensitivity.
We can use the function performance() inside the ROCR package.
# Creating a "prediction" object (compulsory) :

pred = prediction(pred_prob, labels = valid.set$Personal.Loan)


# Creating a "performance" object and ploting :

perfacc <- performance(prediction.obj = pred, measure="acc", x.measure="cutoff")

plot(perfacc)

# Finding the highest accuracy and the corresponding cutoff value :

which.max(perfacc@y.values[[1]])  # It gives "129", so :
## [1] 129
perfacc@y.values[[1]][129]        # It gives "0.9565", so :
## [1] 0.9565
perfacc@x.values[[1]][which(perfacc@y.values[[1]] == 0.9565)]
##        79      3146      3989      4017 
## 0.6592312 0.5228421 0.5192039 0.5115680
We can see that there are 4 cutoff values which yield the same highest accuracy (which is 0.9565). So, in fact we could choose any of them! Let’s choose the one fairly “in the middle”, so 0.5228421.


Now, let’s apply the cutoff value to classify into “0” or “1”, and then compute the confusion matrix.
# Applying the cutoff value for classification :

pred_class <- ifelse(pred_prob > 0.5228421, 1, 0)
head(pred_class)
##  2  5  6  8 10 12 
##  0  0  0  0  1  0
cm_logistic = confusionMatrix(as.factor(as.integer(pred_class)), as.factor(valid.norm[, 10]), positive = "1")

cm_logistic
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1777   69
##          1   18  136
##                                          
##                Accuracy : 0.9565         
##                  95% CI : (0.9466, 0.965)
##     No Information Rate : 0.8975         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7343         
##                                          
##  Mcnemar's Test P-Value : 8.296e-08      
##                                          
##             Sensitivity : 0.6634         
##             Specificity : 0.9900         
##          Pos Pred Value : 0.8831         
##          Neg Pred Value : 0.9626         
##              Prevalence : 0.1025         
##          Detection Rate : 0.0680         
##    Detection Prevalence : 0.0770         
##       Balanced Accuracy : 0.8267         
##                                          
##        'Positive' Class : 1              
## 


k-NN (k = 3)

set.seed(1)

k_nn <-
  knn3(Personal.Loan ~ ., data = train.norm,
       k = 3)

pred_knn_class = predict(k_nn, valid.norm, type = "class")


# The following is the resulting confusion matrix :

cm_knn = confusionMatrix(data = pred_knn_class, as.factor(valid.norm[, 10]), positive = "1")

cm_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1780   68
##          1   15  137
##                                           
##                Accuracy : 0.9585          
##                  95% CI : (0.9488, 0.9668)
##     No Information Rate : 0.8975          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7453          
##                                           
##  Mcnemar's Test P-Value : 1.145e-08       
##                                           
##             Sensitivity : 0.6683          
##             Specificity : 0.9916          
##          Pos Pred Value : 0.9013          
##          Neg Pred Value : 0.9632          
##              Prevalence : 0.1025          
##          Detection Rate : 0.0685          
##    Detection Prevalence : 0.0760          
##       Balanced Accuracy : 0.8300          
##                                           
##        'Positive' Class : 1               
## 


Classification tree

set.seed(1)

class_tree <-
  rpart(Personal.Loan ~ ., data = train.set, method = "class")


plot_class_tree = prp(
  class_tree,
  type = 1,
  extra = 1,
  under = TRUE,
  split.font = 1,
  varlen = -10
)

# Getting the predictions from the tree :

pred_valid <- predict(class_tree, valid.set, type = "class")


# The following is the resulting confusion matrix :

cm_tree = confusionMatrix(data = pred_valid, as.factor(valid.set[, 10]), positive = "1")

cm_tree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1787   32
##          1    8  173
##                                           
##                Accuracy : 0.98            
##                  95% CI : (0.9729, 0.9857)
##     No Information Rate : 0.8975          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8854          
##                                           
##  Mcnemar's Test P-Value : 0.0002762       
##                                           
##             Sensitivity : 0.8439          
##             Specificity : 0.9955          
##          Pos Pred Value : 0.9558          
##          Neg Pred Value : 0.9824          
##              Prevalence : 0.1025          
##          Detection Rate : 0.0865          
##    Detection Prevalence : 0.0905          
##       Balanced Accuracy : 0.9197          
##                                           
##        'Positive' Class : 1               
## 


b. Creating the data frame

my_df <-
  data.frame(
    actual = valid.set[, 10],
    predicted_logistic = as.factor(pred_class),
    predicted_knn = pred_knn_class,
    predicted_tree = pred_valid
  )


# Here are the 10 first rows :

pander(head(my_df, 10))
  actual predicted_logistic predicted_knn predicted_tree
2 0 0 0 0
5 0 0 0 0
6 0 0 0 0
8 0 0 0 0
10 1 1 1 1
12 0 0 0 0
13 0 0 0 0
19 1 1 1 1
20 0 0 0 0
21 0 0 0 0


c. Performing TWO ensemble methods

Now, we will perform two ensemble methods. To do this, we will create two more columns in our data frame, which will correspond, respectively, to :
  • Majority vote of predictions
  • Average of the predicted probabilities
# Creating the "majority vote" column :

my_df$majority_vote <- apply(my_df[, c(2:4)], 1, mfv)



# Accessing the probabilities for each of the methods :

# k-NN
pred_knn_prob_vect = predict(k_nn, valid.norm, type = "prob")

pred_knn_prob_matrix = matrix(
  pred_knn_prob_vect,
  nrow = dim(valid.set)[1],
  ncol = 2,
  dimnames = list(rownames(valid.set), c("0", "1"))
)

head(pred_knn_prob_matrix)
##            0         1
## 2  1.0000000 0.0000000
## 5  1.0000000 0.0000000
## 6  1.0000000 0.0000000
## 8  1.0000000 0.0000000
## 10 0.3333333 0.6666667
## 12 1.0000000 0.0000000
# Classification tree
pred_tree_prob_vect = predict(class_tree, valid.set)

pred_tree_prob_matrix = matrix(
  pred_tree_prob_vect,
  nrow = dim(valid.set)[1],
  ncol = 2,
  dimnames = list(rownames(valid.set), c("0", "1"))
)

head(pred_tree_prob_matrix)
##             0           1
## 2  0.99734865 0.002651348
## 5  0.99734865 0.002651348
## 6  0.99734865 0.002651348
## 8  0.99734865 0.002651348
## 10 0.01595745 0.984042553
## 12 0.99734865 0.002651348
# Logistic regression
pred_logistic_prob = matrix(
  data = NA,
  nrow = dim(valid.set)[1],
  ncol = 2,
  dimnames = list(rownames(valid.set), c("0", "1"))
)

pred_logistic_prob[, 2] <-
  predict(logistic, valid.set, type = "response")
pred_logistic_prob[, 1] <-
  1 - predict(logistic, valid.set, type = "response")

head(pred_logistic_prob)
##             0            1
## 2  0.99993437 6.563347e-05
## 5  0.99395955 6.040445e-03
## 6  0.99542784 4.572165e-03
## 8  0.99968396 3.160403e-04
## 10 0.02494355 9.750565e-01
## 12 0.99499324 5.006758e-03
# Creating the "probability average" column :

# For the "0" category :
for (i in 1:dim(valid.set)[1]) {
  my_df$prob_average_0[i] <-
    (pred_knn_prob_matrix[i, 1] +  pred_logistic_prob[i, 1] + pred_tree_prob_matrix[i, 1]) /
    3
  
}

# For the "1" category :
for (i in 1:dim(valid.set)[1]) {
  my_df$prob_average_1[i] <-
    (pred_knn_prob_matrix[i, 2] +  pred_logistic_prob[i, 2] + pred_tree_prob_matrix[i, 2]) /
    3
  
}


head(my_df$prob_average_0)
## [1] 0.9990943 0.9971027 0.9975922 0.9990109 0.1247448 0.9974473
head(my_df$prob_average_1)
## [1] 0.0009056604 0.0028972644 0.0024078375 0.0009891294 0.8752552236
## [6] 0.0025527021


Now, we are asked to produce a confusion matrix for each of these two ensemble methods.
However, a very important step BEFORE proceding to do that, is to convert the averaged probabilities in class membership (“0” or “1”). Otherwise, a confusion table for comparing with the real validation outcomes would be impossible!
For this, let’s use the same cutoff value that we found to be optimal in our logistic regression :
ave_prob.toclass = ifelse(my_df$prob_average_1 > 0.5228421, 1, 0)


Now, let’s do the 2 confusion matrices :
# For ensemble method nº1 (majority vote) :

cm_ens_1 = confusionMatrix(data = table(my_df$majority_vote, valid.set$Personal.Loan), positive = "1")

cm_ens_1
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0 1793   48
##   1    2  157
##                                           
##                Accuracy : 0.975           
##                  95% CI : (0.9672, 0.9814)
##     No Information Rate : 0.8975          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8491          
##                                           
##  Mcnemar's Test P-Value : 1.966e-10       
##                                           
##             Sensitivity : 0.7659          
##             Specificity : 0.9989          
##          Pos Pred Value : 0.9874          
##          Neg Pred Value : 0.9739          
##              Prevalence : 0.1025          
##          Detection Rate : 0.0785          
##    Detection Prevalence : 0.0795          
##       Balanced Accuracy : 0.8824          
##                                           
##        'Positive' Class : 1               
## 
# For ensemble method nº2 (average probability) :

cm_ens_2 = confusionMatrix(data = table(ave_prob.toclass, valid.set$Personal.Loan), positive = "1")

cm_ens_2
## Confusion Matrix and Statistics
## 
##                 
## ave_prob.toclass    0    1
##                0 1794   46
##                1    1  159
##                                           
##                Accuracy : 0.9765          
##                  95% CI : (0.9689, 0.9827)
##     No Information Rate : 0.8975          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8585          
##                                           
##  Mcnemar's Test P-Value : 1.38e-10        
##                                           
##             Sensitivity : 0.7756          
##             Specificity : 0.9994          
##          Pos Pred Value : 0.9937          
##          Neg Pred Value : 0.9750          
##              Prevalence : 0.1025          
##          Detection Rate : 0.0795          
##    Detection Prevalence : 0.0800          
##       Balanced Accuracy : 0.8875          
##                                           
##        'Positive' Class : 1               
## 


The overall accuracy for the 1rst method (majority vote) is 0.975.
The overall accuracy for the 2nd method (average probability) is 0.9765.


d. Comparing the individual methods with the ensemble methods

The error rate is nothing else than (1 - accuracy).
So, we will collect the accuracies of each method, and compute their respective error rate :
err_compare <-
  matrix(NA,
         nrow = 5,
         ncol = 1,
         dimnames = list(
           c("logistic", "k-NN", "tree", "ensemble nº1", "ensemble nº2"),
           c("Error rate")
         ))

err_compare[1, ] = as.numeric(1 - cm_logistic$overall[1])
err_compare[2, ] = as.numeric(1 - cm_knn$overall[1])
err_compare[3, ] = as.numeric(1 - cm_tree$overall[1])
err_compare[4, ] = as.numeric(1 - cm_ens_1$overall[1])
err_compare[5, ] = as.numeric(1 - cm_ens_2$overall[1])

pander(err_compare)
  Error rate
logistic 0.0435
k-NN 0.0415
tree 0.02
ensemble nº1 0.025
ensemble nº2 0.0235


As we can see from this table, the lowest error rate is for the classification tree.
However, the ensemble methods ALSO have very low error rates, and this is even more remarkable when we take into account that they comprised two methods with relatively quite high error rates (k-NN and logistic regression).


So, as a conclusion, one might think that ensemble methods perform even better than individual methods.



Exercice 2 (13.2 of the book)

Preliminaries (importing data, preprocessing data, & visualization)

ebay <- read.csv("eBayAuctions.csv", header = TRUE, sep = ",")

t(t(names(ebay)))
##      [,1]          
## [1,] "Category"    
## [2,] "currency"    
## [3,] "sellerRating"
## [4,] "Duration"    
## [5,] "endDay"      
## [6,] "ClosePrice"  
## [7,] "OpenPrice"   
## [8,] "Competitive."
str(ebay)
## 'data.frame':    1972 obs. of  8 variables:
##  $ Category    : chr  "Music/Movie/Game" "Music/Movie/Game" "Music/Movie/Game" "Music/Movie/Game" ...
##  $ currency    : chr  "US" "US" "US" "US" ...
##  $ sellerRating: int  3249 3249 3249 3249 3249 3249 3249 3249 3249 3249 ...
##  $ Duration    : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ endDay      : chr  "Mon" "Mon" "Mon" "Mon" ...
##  $ ClosePrice  : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
##  $ OpenPrice   : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
##  $ Competitive.: int  0 0 0 0 0 0 0 0 0 0 ...
attach(ebay)


Now we must partition the data (training : 60%, validation : 40%)
set.seed(1)

# Partitioning the data
train.obs <- sample(rownames(ebay), dim(ebay)[1]*0.6)
train.set <- ebay[train.obs,]  # We take away ID and ZIP.Code

valid.obs <- setdiff(rownames(ebay), train.obs)
valid.set <- ebay[valid.obs,]


a. Running default classification tree & computing lift

class_tree <-
  rpart(Competitive. ~ ., data = train.set, method = "class")


plot_class_tree = prp(
  class_tree,
  type = 1,
  extra = 1,
  under = TRUE,
  split.font = 1,
  varlen = -10
)


Now, to check the overall accuracy, we need to get the predictions as classes, and then we create the confusion matrix :
pred_class = predict(class_tree, valid.set, type = "class")


# Creating the confusion matrix :

confusionMatrix(as.factor(pred_class), as.factor(valid.set$Competitive.))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 335  99
##          1  47 308
##                                           
##                Accuracy : 0.815           
##                  95% CI : (0.7861, 0.8415)
##     No Information Rate : 0.5158          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6311          
##                                           
##  Mcnemar's Test P-Value : 2.434e-05       
##                                           
##             Sensitivity : 0.8770          
##             Specificity : 0.7568          
##          Pos Pred Value : 0.7719          
##          Neg Pred Value : 0.8676          
##              Prevalence : 0.4842          
##          Detection Rate : 0.4246          
##    Detection Prevalence : 0.5501          
##       Balanced Accuracy : 0.8169          
##                                           
##        'Positive' Class : 0               
## 
The overall accuracy is 0.815.


Now, let’s calculate the lift chart to know the lift corresponding to the first decile. But first, we must get the predicted PROBABILITIES :
# Getting predicted probabilities :

pred_prob = predict(class_tree, valid.set, type = "prob")


# Ploting the lift chart :

lift.class_tree <- lift(relevel(as.factor(valid.set$Competitive.), ref = "1") ~ pred_prob[,2], data = valid.set)

xyplot(lift.class_tree, plot = "gain")


Our class of interest is whether or not an auction is “competitive”, so classified as “1”. Therefore, to calculate the lift, we must first order the observations in decreasing order, according to the probability of being classified as “1”.
After that, we take the first decile (here, it is 79 observations, because 0.1*789 is approximately equal to 79) and count how many observations were classified as “1”.
We compare this number to a baseline (prediction with no model, so RANDOM), which would correspond to the number of “1”s in the validation dataset divided by the total number of observations in the same dataset. In this case, we have 407 “1”s in our validation dataset, and so 407/789 is our baseline. This comparison will be made by a ratio, and the result will be the LIFT of the first decile.


So, without further do, let’s get into it! 💪
# Calculating the baseline :

baseline = sum(valid.set$Competitive.)/789


# Creating data frame with both probabilities and classes :

full_df = as.data.frame(pred_class)
full_df$pred_prob = pred_prob[,2]


# Ordering the data frame in decreasing order according to probabilities :

full_df <- full_df[order(full_df$pred_prob, decreasing = T),]
We can see that ALL the rows in this data frame which are comprised between the 1rst and the 80th (so, our first decile) are “1”s, which means that with our model, the ENTIRE first decile has been classified as “1”.
Now, it is time to compute the LIFT :
lift = 79/(baseline*79)

lift
## [1] 1.938575
The lift we find is 1.938575. This means that with our model we are doing almost TWICE as good as if no model was implemented to identify the competitive auctions.


Actually, there is a package called lift which contains a formula specifically designed to DIRECTLY compute the lift for the Top 10% Decile.
Let’s try it!
TopDecileLift(pred_prob[,2], valid.set$Competitive.)
## [1] 1.914
Well, the lift is not exactly the same as the one we just calculated, BUT it is almost the same!
Therefore, in the following sections, I suppose that it would be better (less time consuming) to directly use this shortcut.


b. Running a boosted tree & computing lift

# IMPORTANT : need to convert the outcome variable into a FACTOR! Otherwise, the boosting algorithm won't work!

train.set$Competitive. = factor(train.set$Competitive.)
str(train.set$Competitive.)
##  Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 1 2 1 ...
# The category "Photography" seems to have problems when boosting, so we delete it:

train.set = train.set[train.set$Category!="Photography",]
valid.set = valid.set[valid.set$Category!="Photography",]


# Doing boosting (takes quite a little while...)

boost <- boosting(Competitive. ~ ., data = train.set[, c(-1)])


# Getting the predictions :

pred_boost <- predict(boost, valid.set)


# Doing the confusion matrix :

confusionMatrix(as.factor(pred_boost$class), as.factor(valid.set$Competitive.))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 352  47
##          1  30 354
##                                           
##                Accuracy : 0.9017          
##                  95% CI : (0.8786, 0.9216)
##     No Information Rate : 0.5121          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8034          
##                                           
##  Mcnemar's Test P-Value : 0.06825         
##                                           
##             Sensitivity : 0.9215          
##             Specificity : 0.8828          
##          Pos Pred Value : 0.8822          
##          Neg Pred Value : 0.9219          
##              Prevalence : 0.4879          
##          Detection Rate : 0.4496          
##    Detection Prevalence : 0.5096          
##       Balanced Accuracy : 0.9021          
##                                           
##        'Positive' Class : 0               
## 
# Computing the lift for the top 10% decile :

TopDecileLift(pred_boost$prob[,2], valid.set$Competitive.)
## [1] 1.953


We can see that, in terms of accuracy, the boosted tree did better than the “normal” and single tree. However, the lift remains roughly the same in both instances.


c. Running a bagged tree & computing lift

# Doing bagging :

bagg = bagging(Competitive. ~ ., data = train.set)


# Getting the predictions :

pred_bagg <- predict(bagg, valid.set, type = "class")


# Doing the confusion matrix :

confusionMatrix(as.factor(pred_bagg$class), as.factor(valid.set$Competitive.))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 347  66
##          1  35 335
##                                           
##                Accuracy : 0.871           
##                  95% CI : (0.8455, 0.8937)
##     No Information Rate : 0.5121          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7424          
##                                           
##  Mcnemar's Test P-Value : 0.002835        
##                                           
##             Sensitivity : 0.9084          
##             Specificity : 0.8354          
##          Pos Pred Value : 0.8402          
##          Neg Pred Value : 0.9054          
##              Prevalence : 0.4879          
##          Detection Rate : 0.4432          
##    Detection Prevalence : 0.5275          
##       Balanced Accuracy : 0.8719          
##                                           
##        'Positive' Class : 0               
## 
# Computing the lift for the top 10% decile :

TopDecileLift(pred_bagg$prob[,2], valid.set$Competitive.)
## [1] 1.953
Again, in terms of accuracy we did better than with the “normal” and single tree. This is because these methods (bagging and boosting) combine the predictions of many different trees, and so the results are more accurate (have less variance) and more “powerful”.
But the drawback is that we lose some interpretability and transparency, since a “normal” and single tree is very easy to explain to others, not a boosted or bagged tree!


The lift is exactly the SAME between the boosted and the bagged trees (1.939)!


d. Running a random forest & comparing with bagged tree

# Doing bagging :

rf = randomForest(Competitive. ~ ., data = train.set, mtry = 4)


# Getting the predictions :

pred_rf_class <- predict(rf, valid.set, type = "class")

pred_rf_prob <- predict(rf, valid.set, type = "prob")


# Doing the confusion matrix :

confusionMatrix(as.factor(pred_rf_class), as.factor(valid.set$Competitive.))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 342  45
##          1  40 356
##                                           
##                Accuracy : 0.8914          
##                  95% CI : (0.8675, 0.9124)
##     No Information Rate : 0.5121          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7828          
##                                           
##  Mcnemar's Test P-Value : 0.6644          
##                                           
##             Sensitivity : 0.8953          
##             Specificity : 0.8878          
##          Pos Pred Value : 0.8837          
##          Neg Pred Value : 0.8990          
##              Prevalence : 0.4879          
##          Detection Rate : 0.4368          
##    Detection Prevalence : 0.4943          
##       Balanced Accuracy : 0.8915          
##                                           
##        'Positive' Class : 0               
## 
# Computing the lift for the top 10% decile :

TopDecileLift(pred_rf_prob[,2], valid.set$Competitive.)
## [1] 1.953
The bagged tree has accuracy of 0.8682, while the random forest has accuracy of 0.8619, so a bit less.


Concerning the lift on the top 10% decile, actually the 3 methods (boosting, bagging and random forest) yield the same number of 1.939.


The difference between both methods (random forest and bagging) is a quite “small” one : in random forest, we do NOT only take a bootstrap sample of the sample, but ALSO of the predictors. In bagging, this is not the case, because we ONLY take a bootstrap sample of the sample, considering all the predictors at once.


Small anecdote 😜

Just a small anecdote : Bagging = Bootstrap aggregating 😄



Exercice 3 (15.3 of the book)

Preliminaries (importing data, preprocessing data, & visualization)

cereals <- read.csv("Cereals.csv", header = TRUE, sep = ",")

t(t(names(cereals)))
##       [,1]      
##  [1,] "name"    
##  [2,] "mfr"     
##  [3,] "type"    
##  [4,] "calories"
##  [5,] "protein" 
##  [6,] "fat"     
##  [7,] "sodium"  
##  [8,] "fiber"   
##  [9,] "carbo"   
## [10,] "sugars"  
## [11,] "potass"  
## [12,] "vitamins"
## [13,] "shelf"   
## [14,] "weight"  
## [15,] "cups"    
## [16,] "rating"
str(cereals)
## 'data.frame':    77 obs. of  16 variables:
##  $ name    : chr  "100%_Bran" "100%_Natural_Bran" "All-Bran" "All-Bran_with_Extra_Fiber" ...
##  $ mfr     : chr  "N" "Q" "K" "K" ...
##  $ type    : chr  "C" "C" "C" "C" ...
##  $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
##  $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
##  $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
##  $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
##  $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
##  $ potass  : int  280 135 320 330 NA 70 30 100 125 190 ...
##  $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
##  $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
##  $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ rating  : num  68.4 34 59.4 93.7 34.4 ...
attach(cereals)


# Eliminate all cereals with NAs :

table(is.na(cereals))
## 
## FALSE  TRUE 
##  1228     4
cereals_clean = na.omit(cereals)


It is VERY important that we normalize the data beforehand! This is because is clustering we are constantly working with distances, and therefore big differences in scaling can disturb distances very dramatically.
# Normalizing :

cereals_clean.norm_temp <- as.data.frame(lapply(cereals_clean[, 4:16], normalize))

cereals_clean.norm <- cbind(cereals_clean[, 1:3], cereals_clean.norm_temp)


Ok, we are ready to begin the party! 🎉



a. Applying hierarchical clustering

# Setting cereal names to each row :

row.names(cereals_clean.norm) <- cereals_clean.norm[, 1] 


# Computing EUCLIDEAN distance for ONLY numerical (standardized) variables :

distance = dist(cereals_clean.norm[, 4:16], method = "euclidean")


Now we are ready to apply the hierarchical clustering. Let’s do Single Linkage first!


Single Linkage

single.link = hclust(distance, method = "single")

plot(single.link, hang = -1, ann = FALSE)


Complete Linkage

compl.link = hclust(distance, method = "complete")

plot(compl.link, hang = -1, ann = FALSE)


Comparison

Just by the eye, complete linkage seems more “nice”, in the sense that it is less “crowded” and more “organized” than single linkage. Complete linkage seems more STRUCTURED in the space, we can without great difficulties count the number of clusters.
However, single linkage seems to “mix” everything and it is more difficult to get an idea of the clusters.


Probably because complete linkage considers bigger distances between clusters, it is more likely that we end with a more “separated” structure, because to find a near cluster the considered distance is bigger than with single linkage.


Determining optimal number of clusters

To determine the optimal number of clusters, we will create a graph that represents the Within Summ of Squares (WSS) (y-axis) against the number of clusters (x-axis).
We expect that, as the number of clusters increase, the WSS will drop (as each observation withing each cluster is closer to its mean). Our task is to determine at with point the WSS is not droping (or not decreasing so much MARGINALLY) as we increase the number of clusters.
# Computing the WSS :

set.seed(1)

wss <- (nrow(cereals_clean.norm)-1)*sum(apply(cereals_clean.norm[, 4:16],2,var))

wss
## [1] 52.14865
# Drawing the graph :

set.seed(1)

for (i in 2:24) wss[i] <- sum(kmeans(cereals_clean.norm[, 4:16], centers=i)$withinss)

plot(1:24, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", lty=1, col = 4, lwd = 1) +
abline(v=8, lty=2, lwd = 3, col = "green")
## integer(0)
legend(10, 40, legend = "Optimal number of clusters (8)", col = "green", lty=2, cex=0.8)


From the graph, we clearly see that just after the 8th cluster, the WSS increases (there is an “elbow” in the line).
So, the optimal number of clusters should be 8.


Once we found the optimal number of clusters, we can go ahead and look at the cluster centroids. We could not do this before, because we needed to know how many clusters to choose!


Finding the cluster centroids

Single Linkage

There is a function called cutree which actually performs a division into clusters. Let’s use it!
# Creating the clusters

opt_clusters_single = cutree(single.link, k = 8)

head(opt_clusters_single)
##                 100%_Bran         100%_Natural_Bran                  All-Bran 
##                         1                         2                         1 
## All-Bran_with_Extra_Fiber   Apple_Cinnamon_Cheerios               Apple_Jacks 
##                         3                         4                         4


Here we obtain, as output, to which of the 8 clusters each cereal belongs.
In the prompt it is indicated, as a hint, that to find the cluster centroids one can use the aggregate() function and compute the mean of the members in each cluster. So, let’s go!


centroids_single <- aggregate(cereals_clean.norm[, 4:16], by = list(opt_clusters_single), FUN = mean)


Et voilà! We found the centroids for each cluster.


Now, let’s visualize them!
centroids_MATRIX_single = as.matrix(centroids_single[, -1])

plot(c(0), xaxt = 'n', ylab = "", xlab = "", type = "l", ylim = c(min(centroids_MATRIX_single), max(centroids_MATRIX_single)), xlim = c(0, 15)) 

axis(1, at = c(1:13), labels = names(cereals_clean.norm[, 4:16]), las = 2) 

for(i in c(1:8))
  lines(centroids_MATRIX_single[i,], lty = i, lwd = 2, col = ifelse(i %in% c(1,3,5,7), "black", "dark grey"))

text(x = 0.2, y = centroids_MATRIX_single[1:4,1], labels = paste("Cluster", c(1:4)), cex = 0.5)

text(x = 14, y = centroids_MATRIX_single[5:8,13], labels = paste("Cluster", c(5:8)), cex = 0.5)


And now, do the same but with complete linkage.


Complete Linkage

# Creating the clusters :

opt_clusters_compl = cutree(compl.link, k = 8)

head(opt_clusters_compl)
##                 100%_Bran         100%_Natural_Bran                  All-Bran 
##                         1                         2                         1 
## All-Bran_with_Extra_Fiber   Apple_Cinnamon_Cheerios               Apple_Jacks 
##                         1                         3                         4
# Getting the centroids :

centroids_compl <- aggregate(cereals_clean.norm[4:16], by = list(opt_clusters_compl), FUN = mean)


# Plotting!

centroids_MATRIX_compl = as.matrix(centroids_compl[, -1])

plot(c(0), xaxt = 'n', ylab = "", xlab = "", type = "l", ylim = c(min(centroids_MATRIX_compl), max(centroids_MATRIX_compl)), xlim = c(0, 15)) 

axis(1, at = c(1:13), labels = names(cereals_clean.norm[, 4:16]), las = 2) 

for(i in c(1:8))
  lines(centroids_MATRIX_compl[i,], lty = i, lwd = 2, col = ifelse(i %in% c(1,3,5,7), "black", "dark grey"))

text(x = 0.2, y = centroids_MATRIX_compl[1:4,1], labels = paste("Cluster", c(1:4)), cex = 0.5)

text(x = 14, y = centroids_MATRIX_compl[5:8,13], labels = paste("Cluster", c(5:8)), cex = 0.5)


Comparison

We can see that, for single linkage, many clusters are quite similar in terms of centroids. So there is a kind of homogeneity. This is because some of the lines follow similar paths across the variables.
This leads to a clear answer to question b. (see next!)



b. Which method (single or complete linkage) leads to better clusters?

Clearly, complete linkage is better here, since we get more heterogenous clusters, and that is what we want when we do clustering, we want to get groups as different as possible!
Again, we reached this conclusion by looking at the aforementioned centroid graphs, where we saw that for complete linkage, each line follows a quite independant way.


c. Choose one of the methods (Number of clusters? / Distance used?)

From our answer in b., it would be dumb to choose single linkage, wouldn’t it? 😝
So we choose complete linkage, of course!


Also, we already saw that the optimal number of clusters is 8.


To see which distance between clusters has been used to get those 8 clusters, let’s look at the dendogram!
plot(compl.link, hang = -1, ann = FALSE)
abline(h = 1.42)

It was difficult to spot… 😵 but I finally got the (normalized) distance which separates the dendogram into 8 clusters (distance = 1.42).
We can see that we have one singleton (100%_Natural_Bran).



d. Finding a “healthy” cluster of cereals

melt.centroids <- melt(data = centroids_compl, id.vars = "Group.1")

ggplot(melt.centroids, aes(x = variable, y = value, group = Group.1, color = factor(Group.1), size = 1)) +
  labs(color = "Cluster numbers") +
    geom_line(size=0.5) +
  theme(axis.text.x = element_text(angle = 90))


From this profile plot, we can choose the cluster of cereals which have the less “sugars” and the less “fat”, since that is something to be considered “healthy”.


Cluster 8 has these properties. Let’s see which cereals are inside this cluster :
which(opt_clusters_compl == 8)
##  Puffed_Rice Puffed_Wheat 
##           53           54


Among the 77 cereals, these two names above correspond to those with both less fat and less sugars.
However, they don’t have much fibers either, and not much potassium… so we may choose perhaps Cluster 1, which also has quite low fat and sugars, but has quite a lot of potassium and fibers. So there is a trade-off…