The Confusion Matrix

# Set random seed. Don't remove this line
set.seed(1)

# Have a look at the structure of titanic
titanic <- read.csv("titanic_train.csv")
str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

A decision tree classification model is built on the data

library(rpart)
## Warning: package 'rpart' was built under R version 3.2.2
tree <- rpart(Survived ~ ., data = titanic, method = "class")

Use tree to predict() who survived in the titanic dataset. Use tree as the first argument and titanic as the second argument. Make sure to set the type parameter to “class”. Assign the result to pred.

pred <- predict(tree, titanic, type="class")

Build the confusion matrix with the table() function. This function builds a contingency table. The first argument corresponds to the rows in the matrix and should be the Survived column of titanic: the true labels. The second argument, corresponding to the columns, should be pred: the predicted labels.

table(titanic$Survived, pred)
##    pred
##       0   1
##   0 549   0
##   1   0 342
  • So to summarize: all 549 passengers who perished (Survived = 0) were correctly predicted to have perished, and all 342 survivors (Survived = 1) were correctly predicted to have survived.
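
You can read the overall accuracy straight off this table, since the diagonal holds the correct predictions (a quick check, not part of the original exercise; the conf_titanic name is just for illustration):

conf_titanic <- table(titanic$Survived, pred)
sum(diag(conf_titanic)) / sum(conf_titanic)   # equals 1 for the table above; keep in mind this is measured on the very data the tree was trained on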

Deriving ratios from the Confusion Matrix

#              predicted P   predicted N
#   actual P       TP            FN
#   actual N       FP            TN

## Accuracy
#  (TP + TN) / (TP + FN + FP + TN)
## Precision
#  TP / (TP + FP)
## Recall
#  TP / (TP + FN)
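
These ratios are easy to wrap in small helper functions (a sketch based on the formulas above; the function names are our own):

accuracy  <- function(TP, FN, FP, TN) (TP + TN) / (TP + FN + FP + TN)
precision <- function(TP, FP) TP / (TP + FP)
recall    <- function(TP, FN) TP / (TP + FN)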

Build up the confusion matrix conf as a data frame:

PredictionPositive <- c(212, 53)   # TP (actual positive), FP (actual negative)
PredictionNegative <- c(78, 371)   # FN (actual positive), TN (actual negative)
conf <- data.frame(PredictionPositive, PredictionNegative)
conf
##   PredictionPositive PredictionNegative
## 1                212                 78
## 2                 53                371
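
Optionally, give the data frame row names so it reads like the schematic above (assuming, as there, that the first row is the actual positive class and the second the actual negative class):

rownames(conf) <- c("ActualPositive", "ActualNegative")
conf   # same counts, now labelled by actual class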

Calculate accuracy, precision and recall

# Assign TP, FN, FP and TN using conf
TP <- 212
FN <- 78
FP <- 53
TN <- 371
# Calculate and print the accuracy: acc
acc <- (TP + TN) / (TP + FN + FP + TN)
acc
## [1] 0.8165266
# Calculate and print out the precision: prec
prec <- TP / (TP + FP)
prec
## [1] 0.8
# Calculate and print out the recall: rec
rec <- TP / (TP + FN)
rec
## [1] 0.7310345
  • With an accuracy of around 82%, about 18% of the passengers were classified incorrectly. Overall you have an acceptable, but not truly satisfying, classifier.
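
Rather than typing the four counts in by hand, you could also pull them out of conf by position (a small sketch; it assumes the first row and column hold the positive class, as in the schematic above):

TP <- conf[1, 1]   # 212
FN <- conf[1, 2]   # 78
FP <- conf[2, 1]   # 53
TN <- conf[2, 2]   # 371
(TP + TN) / (TP + FN + FP + TN)   # 0.8165266, as before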

The quality of a regression

Suppose you measure the sound pressure produced by an airplane's wing under different settings: the frequency of the wind, the angle of the wing, and several more. The results of this experiment are listed in the air dataset (Source: UCIMLR).

air <- read.csv("airfoil_self_noise.csv")
str(air)
## 'data.frame':    1503 obs. of  6 variables:
##  $ freq     : int  800 1000 1250 1600 2000 2500 3150 4000 5000 6300 ...
##  $ angle    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ch_length: num  0.305 0.305 0.305 0.305 0.305 ...
##  $ velocity : num  71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 ...
##  $ thickness: num  0.00266 0.00266 0.00266 0.00266 0.00266 ...
##  $ dec      : num  126 125 126 128 127 ...

Build a model that's able to predict the sound pressure based on these settings, instead of having to do those tedious experiments every time. Prepare a multiple linear regression model, fit, that takes as predictors the wind frequency (freq), the wing's angle (angle) and the chord's length (ch_length). The response is the sound pressure (dec). All these variables can be found in air.

fit <- lm(dec ~ freq + angle + ch_length, data = air)

Use the model to predict for all values: pred

pred <- predict(fit)

Assess the quality of the model by calculating the RMSE:

rmse <- sqrt(sum((air$dec-pred)^2)/nrow(air))
rmse
## [1] 5.215778
  • So the RMSE is about 5. But 5 what? Well, 5 decibels, the unit of the sound pressure, your response variable. As a standalone number it doesn't tell you a whole lot; to derive its meaning, it should be compared to the RMSE of a different model for the same problem.
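
Since the RMSE comes back in the next exercise, a small helper keeps the formula in one place (a sketch; sum(...) / nrow(air) is just mean(...)):

rmse_of <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_of(air$dec, pred)   # 5.215778, identical to the value above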

Adding complexity to increase quality

Adding the new variables will definitely increase the complexity of your model, but will it increase the performance? The measurements on free-stream velocity (velocity) and suction side displacement thickness (thickness) are now available in the air dataset as well!

fit2 <- lm(dec ~ freq + angle + ch_length + velocity + thickness, data = air)

Use the model to predict for all values: pred2

pred2 <- predict(fit2)
rmse2 <- sqrt(sum((air$dec - pred2) ^ 2) / nrow(air))
rmse2
## [1] 4.799244
  • Adding complexity seems to have caused the RMSE to decrease, from 5.216 to 4.799. But there's more going on here: for nested linear models, adding variables never increases the RMSE on the data the model was fit to, so a lower in-sample RMSE by itself doesn't tell you the more complex model will generalize better.

Do some clustering!

The seeds dataset contains measurements of seeds whose labels were lost. Hence, we don't know which measurements belong to which type of seed. What we do know is that there were three types of seeds.

seeds <- read.csv("seeds_dataset.csv")
str(seeds)
## 'data.frame':    210 obs. of  8 variables:
##  $ area         : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ perimeter    : num  14.8 14.6 14.1 13.9 15 ...
##  $ compactness  : num  0.871 0.881 0.905 0.895 0.903 ...
##  $ length       : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ width        : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ asymmetry    : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ groove_length: num  5.22 4.96 4.83 4.8 5.17 ...
##  $ X            : int  1 1 1 1 1 1 1 1 1 1 ...
set.seed(1)

The code below clusters the seeds into three clusters (km_seeds), but is it likely that these three clusters represent our seed types? There are two initial steps you could take:

  • Visualize the resulting clusters for two variables, for example length versus compactness.
  • Verify that the clusters are well separated and compact. To this end, you can calculate the between and within cluster sums of squares, respectively (see the sketch after the plot below).
km_seeds <- kmeans(seeds, 3)
km_seeds
## K-means clustering with 3 clusters of sizes 75, 61, 74
## 
## Cluster means:
##       area perimeter compactness   length    width asymmetry groove_length
## 1 11.90907  13.25027   0.8515493 5.222333 2.865093  4.722187      5.093040
## 2 18.72180  16.29738   0.8850869 6.208934 3.722672  3.603590      6.066098
## 3 14.63203  14.45324   0.8790973 5.561784 3.274892  2.744043      5.184932
##          X
## 1 2.866667
## 2 1.983607
## 3 1.135135
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 2 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 3 3 3 3 3 3 3
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 3 3
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 217.4687 185.0922 223.1591
##  (between_SS / total_SS =  78.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Color the points in the plot based on the clusters

plot(length ~ compactness, data = seeds, col=km_seeds$cluster)
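
To address the second bullet above (are the clusters compact and well separated?), you can read the relevant sums of squares straight from the kmeans object:

km_seeds$tot.withinss                 # total within-cluster sum of squares, the sum of the three values printed above
km_seeds$betweenss / km_seeds$totss   # about 0.781, the 78.1 % ratio reported in the output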

Split the sets

Let’s return to the titanic dataset for which we set up a decision tree. Previously you assessed its performance by calculating the confusion matrix. However, the tree was built using the entire set of observations. That confusion matrix doesn’t assess the predictive power of the tree. The training set and the test set were one and the same thing: this can be done better!

First off, you'll want to split the dataset into a train and test set. You'll notice that the titanic dataset is sorted on titanic$Survived, so you'll have to shuffle the dataset first in order to have a fair distribution of the output variable in both sets.

# Set random seed. Don't remove this line.
set.seed(1)

# Shuffle the dataset, call the result shuffled
n <- nrow(titanic)
shuffled <- titanic[sample(n), ]
train_indices <- 1:round(0.7 * n)

# Split the data in train and test
train <- shuffled[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test <- shuffled[test_indices, ]

# Print the structure of train and test
str(train)
## 'data.frame':    624 obs. of  12 variables:
##  $ PassengerId: int  237 332 510 807 179 796 837 585 556 55 ...
##  $ Survived   : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Pclass     : int  2 1 3 1 2 2 3 3 1 1 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 375 642 460 35 325 626 643 645 881 625 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Age        : num  44 45.5 26 39 30 39 21 NA 62 65 ...
##  $ SibSp      : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Parch      : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 199 21 81 13 170 224 269 292 53 30 ...
##  $ Fare       : num  26 28.5 56.5 0 13 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 58 1 13 1 1 1 1 1 25 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 2 4 2 ...
str(test)
## 'data.frame':    267 obs. of  12 variables:
##  $ PassengerId: int  561 316 468 786 306 677 559 790 82 789 ...
##  $ Survived   : int  0 1 0 0 1 0 1 0 1 1 ...
##  $ Pclass     : int  3 3 1 3 1 3 1 1 3 3 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 563 595 771 335 17 735 805 315 742 212 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 1 2 2 2 ...
##  $ Age        : num  NA 26 56 25 0.92 24.5 39 46 29 1 ...
##  $ SibSp      : int  0 0 0 0 1 0 1 0 0 1 ...
##  $ Parch      : int  0 0 0 0 2 0 1 0 0 2 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 472 344 45 475 38 295 2 593 312 549 ...
##  $ Fare       : num  7.75 7.85 26.55 7.25 151.55 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 1 1 1 64 1 136 46 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 3 4 4 4 4 4 4 2 4 4 ...
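
Because the whole point of shuffling was to get a fair distribution of the outcome in both sets, a quick sanity check (not part of the original exercise) is to compare the class proportions of Survived:

prop.table(table(train$Survived))   # proportion of 0s and 1s in the training set
prop.table(table(test$Survived))    # should be roughly the same in the test set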

First you train, then you test

This time, however, you’ll want to build a decision tree on the training set, and next assess its predictive power on a set that has not been used for training: the test set.

tree <- rpart(Survived ~ ., train, method = "class")
# Predict the outcome on the test set with tree: pred
pred <- predict(tree, test, type="class")

Use the table() function to calculate the confusion matrix. Assign this table to conf. Remember that the actual labels of the test set should come in the rows, and the predicted labels in the columns.

conf <- table(test$Survived, pred)
conf
##    pred
##       0   1
##   0 137  27
##   1  36  67
  • The confusion matrix reveals an accuracy of (137 + 67)/(137 + 67 + 27 + 36) = 76.41%. This is less than the 81.65% we calculated in the first section. However, this is a much more trustworthy estimate of the model's true predictive power.
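
The accuracy can also be computed from conf directly; the diagonal holds the correctly classified observations (the same formula the cross-validation loop below uses):

sum(diag(conf)) / sum(conf)   # (137 + 67) / 267, roughly 0.764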

Using Cross Validation

You will fold the dataset 6 times and calculate the accuracy for each fold. The mean of these accuracies forms a more robust estimation of the real accuracy on unseen data, as it depends less on the choice of training and test set.

str(shuffled)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  237 332 510 807 179 796 837 585 556 55 ...
##  $ Survived   : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Pclass     : int  2 1 3 1 2 2 3 3 1 1 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 375 642 460 35 325 626 643 645 881 625 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Age        : num  44 45.5 26 39 30 39 21 NA 62 65 ...
##  $ SibSp      : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Parch      : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 199 21 81 13 170 224 269 292 53 30 ...
##  $ Fare       : num  26 28.5 56.5 0 13 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 58 1 13 1 1 1 1 1 25 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 2 4 2 ...
# Set random seed. Don't remove this line.
set.seed(1)

# Initialize the accs vector
accs <- rep(0,6)
for (i in 1:6) {
  # These indices indicate the interval of the test set
  indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
  # Exclude them from the train set
  train <- shuffled[-indices,]
  # Include them in the test set
  test <- shuffled[indices,]
  # A model is learned using each training set
  tree <- rpart(Survived ~ ., train, method = "class")
  # Make a prediction on the test set using tree
  pred <- predict(tree, test, type="class")
  # Assign the confusion matrix to conf
  conf <- table(test$Survived, pred)
  # Assign the accuracy of this model to the ith index in accs
  accs[i] <- sum(diag(conf)) / sum(conf)
}
# Print out accs and their mean
accs
## [1] 0.8040541 0.7770270 0.7972973 0.7702703 0.7500000 0.7905405
mean(accs)
## [1] 0.7815315
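
One detail worth noting: round((1/6) * 891) is 148, so the six test intervals above cover only rows 1 to 888 and the last three rows of shuffled are never used as test observations. A sketch of an alternative fold assignment that uses every row exactly once (an illustration, not the course solution):

fold_id <- rep(1:6, length.out = nrow(shuffled))   # assign each row to one of 6 folds
folds <- split(1:nrow(shuffled), fold_id)          # list of 6 index vectors
sapply(folds, length)                              # fold sizes: 149 149 149 148 148 148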

How many folds?

Say you're doing cross validation on a dataset with 22680 observations; this number is stored as n in your workspace. You want each training set to contain 21420 entries, saved as tr. How many folds can you use for your cross validation? In other words, how many iterations with different test sets can you do on the dataset?

# 18
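
One way to see this, using the n and tr objects described above:

n <- 22680
tr <- 21420
n / (n - tr)   # each test set holds 1260 observations, and 22680 / 1260 = 18 folds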

Overfitting the spam!

Do you remember the crude spam filter, spam_classifier(), from chapter 1? It filters spam based on the average sequential use of capital letters (avg_capital_seq) to decide whether an email was spam (1) or not (0).

You may recall that it perfectly filtered the spam. However, the set you used to test your classifier, emails_small, was only a small fraction of the entire dataset, emails_full (Source: UCIMLR).

Your job is to verify whether the spam_classifier that is built for you generalizes to the entire set of emails. The accuracy for the set emails_small is equal to 1. Is the accuracy for the entire set emails_full substantially lower?

emails_full <- read.csv("emails_full.csv")
# The spam filter that has been 'learned' for you
spam_classifier <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 4] <- 1
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  return(factor(prediction, levels=c("1","0")))
}

# Apply spam_classifier to emails_full: pred_full
pred_full <- spam_classifier(emails_full$avg_capital_seq)

# Build confusion matrix for classifier on emails_full: conf_full
conf_full <- table(emails_full$spam, pred_full)

# Calculate the accuracy with conf_full: acc_full
acc_full <- sum(diag(conf_full)) / sum(conf_full)

# Print out acc_full
acc_full
## [1] 0.3438383
  • The official answer is 0.6561617, exactly 1 minus the value printed here (0.3438383 + 0.6561617 = 1); see the note on factor levels after the hint below.
## HINT
# The confusion matrix is given by (rows: actual class, columns: predicted class)
##              predicted P   predicted N
##   actual P       TP            FN
##   actual N       FP            TN

# The accuracy is given by
##   Accuracy = (TP + TN) / (TP + FN + FP + TN)
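
The mismatch above most likely comes from factor levels, not from the classifier itself: spam_classifier() returns a factor with levels c("1", "0"), while the values in emails_full$spam put the rows of the table in the order 0, 1. The diagonal then counts the disagreements, so the printed number is 1 minus the accuracy. A minimal fix (a sketch, assuming emails_full$spam holds 0/1 values) is to put both factors in the same level order before building the table:

pred_full_aligned <- factor(pred_full, levels = c("0", "1"))
conf_full_aligned <- table(factor(emails_full$spam, levels = c("0", "1")), pred_full_aligned)
sum(diag(conf_full_aligned)) / sum(conf_full_aligned)   # 0.6561617, the official answer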

Increasing the bias

It's official now: the spam_classifier from chapter 1 is bogus. It simply overfits on the emails_small set and, as a result, doesn't generalize to larger datasets such as emails_full.

So let’s try something else. On average, emails with a high frequency of sequential capital letters are spam. What if you simply filtered spam based on one threshold for avg_capital_seq?

For example, you could filter all emails with avg_capital_seq > 4 as spam. By doing this, you increase the interpretability of the classifier and restrict its complexity. However, this increases the bias, i.e. the error due to restricting your model.

Your job is to simplify the rules of spam_classifier and calculate the accuracy for the full set emails_full. Next, compare it to that of the small set emails_small, which is coded for you. Does the model generalize now?

  • Simplify the rules of the spam_classifier: emails with an avg_capital_seq strictly greater than 4 are spam; all others are not spam.
  • Inspect the code that defines conf_small and acc_small.
  • Set up the confusion matrix for the emails_full dataset. Put the true labels found in emails_full$spam in the rows and the predicted spam values in the columns. Assign to conf_full.
  • Use conf_full to calculate the accuracy. Assign this value to acc_full and print it out. Before, acc_small and acc_full were 100% and 65%, respectively; what do you conclude?
emails_small <- read.csv("emails_small.csv")
# You should change the code of the classifier, simplifying it
spam_classifier <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 4] <- 1
  prediction[x <= 4] <- 0
  return(factor(prediction, levels=c("1","0")))
}

# conf_small and acc_small have been calculated for you
conf_small <- table(emails_small$spam, spam_classifier(emails_small$avg_capital_seq))
acc_small <- sum(diag(conf_small)) / sum(conf_small)
acc_small
## [1] 0.2307692
  • The official answer is 0.7692308, again the complement of the printed value, caused by the same factor-level ordering issue explained above.
# Apply spam_classifier to emails_full and calculate the confusion matrix: conf_full
pred_full <- spam_classifier(emails_full$avg_capital_seq)
conf_full <- table(emails_full$spam, pred_full)

# Calculate acc_full
acc_full <- sum(diag(conf_full)) / sum(conf_full)

# Print acc_full
acc_full
## [1] 0.2740709
  • The official answer is 0.7259291, once more 1 minus the printed value, for the same factor-level reason.

  • Great! Your model no longer fits the small dataset perfectly, but it fits the big dataset better. You increased the bias of the model and caused it to generalize better over the complete dataset. Still, while the first classifier overfit the data, an accuracy of 73% is far from satisfying for a spam filter.

Interpretability

Which of the following models do you think would have the highest interpretability? Remember that a high interpretability usually implies a high bias.

  • 1. We've made a regression which fits perfectly through all 15 points in the training set, using a high-degree polynomial.
  • 2. A model predicts whether a person with certain attributes is male or female. It uses a single threshold on the height attribute to make its prediction.
  • 3. A spam filter uses various rules on 10 attributes, such as specific word count, character count, and so on, to classify whether an email is spam or not.
  • 4. We made a prediction model for car insurance claims. It uses a few attributes and the relations between them to predict the number of claims one will make.

Answer: 2. The single threshold on height makes this the simplest model and therefore the most interpretable (and the most biased).