2. Performance measures

The Confusion Matrix

set.seed(1)
str(titanic)
'data.frame':   714 obs. of  4 variables:
 $ Survived: Factor w/ 2 levels "1","0": 2 1 1 1 2 2 2 1 1 1 ...
 $ Pclass  : int  3 1 3 1 3 1 3 3 2 3 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
 $ Age     : num  22 38 26 35 35 54 2 27 14 4 ...
tree <- rpart(Survived ~ ., data = titanic, method = "class")
pred <- predict(tree, titanic, type="class")
conf <- table(titanic$Survived, pred) # store the confusion matrix; it is used below
conf
   pred
      1   0
  1 212  78
  0  53 371

Confusion Matrix

  • The survivors correctly predicted to have survived: true positives (TP)
  • The deceased who were wrongly predicted to have survived: false positives (FP)
  • The survivors who were wrongly predicted to have perished: false negatives (FN)
  • The deceased who were correctly predicted to have perished: true negatives (TN)
             Predicted
               P    N
Actual   p    TP   FN
         n    FP   TN

\[ Accuracy = \frac{TP+TN}{TP+FN+FP+TN} \]

\[ Precision = \frac{TP}{TP+FP} \]

\[ Recall = \frac{TP}{TP+FN} \]

TP <- conf[1, 1] # true positives: 212
FN <- conf[1, 2] # false negatives: 78
FP <- conf[2, 1] # false positives: 53
TN <- conf[2, 2] # true negatives: 371

acc <- (TP + TN) / (TP + FN + FP + TN)
prec <- (TP) / (TP + FP)
rec <- TP / (TP + FN)
acc   # [1] 0.8165266
prec  # [1] 0.8
rec   # [1] 0.7310345

The quality of a regression

Imagine you work at NASA. You have to measure the sound pressure generated by an airplane wing under various conditions. This sound pressure is related to the wind frequency, the angle of the wing, and several other settings. Instead of running tedious experiments, a thought crosses your mind: why not build a model that predicts the sound pressure from these settings?

Data from the previous experiments: UCIMLR

A multiple linear regression model

Predictors
  • freq : wind frequency
  • angle : angle of the wing
  • ch_length : chord length

\[ RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^N ( y_i - \hat{y}_i )^2 } \]
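Translated directly into R, the formula becomes a one-line helper. The function name rmse_of is invented for this sketch; it is reused further below.

rmse_of <- function(y, y_hat) {
  # mean() of the squared residuals is the (1/N) * sum(...) term
  sqrt(mean((y - y_hat)^2))
}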

air <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat")
colnames(air) <- c("freq", "angle", "ch_length", "velocity", "thickness", "dec")
str(air)
## 'data.frame':    1503 obs. of  6 variables:
##  $ freq     : int  800 1000 1250 1600 2000 2500 3150 4000 5000 6300 ...
##  $ angle    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ch_length: num  0.305 0.305 0.305 0.305 0.305 ...
##  $ velocity : num  71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 ...
##  $ thickness: num  0.00266 0.00266 0.00266 0.00266 0.00266 ...
##  $ dec      : num  126 125 126 128 127 ...
# Inspect your colleague's code to build the model
fit <- lm(dec ~ freq + angle + ch_length, data = air)

# Use the model to predict for all values: pred
pred <- predict(fit)

# Use air$dec and pred to calculate the RMSE 
rmse <- sqrt((1/nrow(air)) * sum((air$dec - pred)^2))
rmse
## [1] 5.215778

Is a more complex model better?

Your model used three variables: the wind frequency, the angle of the wing, and the chord length. Let's add the velocity and the thickness of the wing and compare whether the new model predicts more accurately than the previous one.

fit2 <- lm(dec ~ freq + angle + ch_length + velocity + thickness, data = air)
pred2 <- predict(fit2)
rmse2 <- sqrt(sum( (air$dec - pred2) ^ 2) / nrow(air))
rmse2
## [1] 4.799244
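Beware: both RMSE values are computed on the same data the models were fit on, so the lower rmse2 only shows a closer fit to the training data. To check that the richer model actually generalizes, compare the errors on held-out data. A minimal sketch, assuming a random 7/3 split and reusing the rmse_of helper from above:

set.seed(1)
n_air <- nrow(air)
shuffled_air <- air[sample(n_air), ]
train_air <- shuffled_air[1:round(0.7 * n_air), ]
test_air  <- shuffled_air[(round(0.7 * n_air) + 1):n_air, ]

# Fit both models on the training part only
fit_small <- lm(dec ~ freq + angle + ch_length, data = train_air)
fit_big   <- lm(dec ~ freq + angle + ch_length + velocity + thickness,
                data = train_air)

# Out-of-sample RMSE decides which model generalizes better
rmse_of(test_air$dec, predict(fit_small, test_air))
rmse_of(test_air$dec, predict(fit_big, test_air))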

If you want to study further: Software Carpentry

Clustering

In the dataset seeds you can find various metrics, such as area, perimeter and compactness, for 210 seeds. (Source: UCIMLR.) However, the seeds' labels were lost, so we don't know which metrics belong to which type of seed. What we do know is that there were three types of seeds.

seeds <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt")
# The file has an eighth column holding the original variety labels;
# name it explicitly so it doesn't end up with an NA column name
colnames(seeds) <- c("area", "perimeter", "compactness", "length",
                     "width", "asymmetry", "groove_length", "variety")
set.seed(1)
str(seeds)
## 'data.frame':    210 obs. of  8 variables:
##  $ area         : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ perimeter    : num  14.8 14.6 14.1 13.9 15 ...
##  $ compactness  : num  0.871 0.881 0.905 0.895 0.903 ...
##  $ length       : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ width        : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ asymmetry    : num  2.22 1.02 2.7 2.26 1.35 ...
##  $ groove_length: num  5.22 4.96 4.83 4.8 5.17 ...
##  $ variety      : int  1 1 1 1 1 1 1 1 1 1 ...
# Note: the variety column is still in the data frame, so it takes part
# in the clustering; dropping it first (kmeans(seeds[, 1:7], 3)) would
# match the premise that the labels were lost
km_seeds <- kmeans(seeds, 3)
plot(length ~ compactness, data = seeds, col = km_seeds$cluster)

# Print the ratio of the within-cluster sum of squares (WSS)
# to the between-cluster sum of squares (BSS)
km_seeds$tot.withinss / km_seeds$betweenss
## [1] 0.2800729
km_seeds
## K-means clustering with 3 clusters of sizes 75, 61, 74
## 
## Cluster means:
##       area perimeter compactness   length    width asymmetry groove_length
## 1 11.90907  13.25027   0.8515493 5.222333 2.865093  4.722187      5.093040
## 2 18.72180  16.29738   0.8850869 6.208934 3.722672  3.603590      6.066098
## 3 14.63203  14.45324   0.8790973 5.561784 3.274892  2.744043      5.184932
##    variety
## 1 2.866667
## 2 1.983607
## 3 1.135135
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 2 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 3 3 3 3 3 3 3
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 3 3
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 217.4687 185.0922 223.1591
##  (between_SS / total_SS =  78.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

If you want to study further

Training and Testing

set.seed(1)

# Shuffle the dataset, call the result shuffled
n <- nrow(titanic)
shuffled <- titanic[sample(n),]

# Split the data into train and test using a 7/3 split
train <- shuffled[1:round(0.7*n),]
test <- shuffled[(round(0.7*n) + 1):n,]

str(train)
str(test)

tree <- rpart(Survived ~ ., train, method = "class")
pred <- predict(tree, test, type="class")
conf <- table(test$Survived, pred) # confusion matrix
conf
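From this confusion matrix the test-set accuracy follows the same way as before: the diagonal holds the correctly classified cases.

# Test-set accuracy: correct predictions over all predictions
acc <- sum(diag(conf)) / sum(conf)
acc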

Using Cross Validation

set.seed(1)

# Initialize the accs vector
accs <- rep(0,6)

for (i in 1:6) {
  # These indices indicate the interval of the test set
  indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):((i*round((1/6) * nrow(shuffled))))
  
  # Exclude them from the train set
  train <- shuffled[-indices,]
  
  # Include them in the test set
  test <- shuffled[indices,]
  
  # A model is learned using each training set
  tree <- rpart(Survived ~ ., train, method = "class")
  
  # Make a prediction on the test set using tree
  pred <- predict(tree, test, type="class")
  
  # Assign the confusion matrix to conf
  conf <- table(test$Survived, pred)
  
  # Assign the accuracy of this model to the ith index in accs
  accs[i] <- sum(diag(conf))/sum(conf)
}

mean(accs) # Print out the mean of accs

How many folds for cross validation

\[ K = \frac{N}{N \cdot \text{ratio of test set}} = \frac{1}{\text{ratio of test set}} \]
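For example, the cross-validation loop above used a test-set ratio of 1/6 of the data, so

\[ K = \frac{N}{N \cdot \frac{1}{6}} = 6 \]

folds, which is why accs has length 6.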

Overfitting the spam

head(emails_full)
  avg_capital_seq spam
1           1.500    0
2           4.941    1
3           3.429    1
4           3.493    1
5           3.380    0
6           3.689    1
spam_classifier <- function(x){
  prediction <- rep(NA, length(x)) # initialize prediction vector
  prediction[x > 4] <- 1 
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  return(factor(prediction, levels = c("1", "0"))) # prediction is either 0 or 1
}

pred_full <- spam_classifier(emails_full$avg_capital_seq)
conf_full <- table(emails_full$spam, pred_full) # confusion matrix
acc_full <- sum(diag(conf_full))/sum(conf_full) # accuracy
acc_full

It’s official now, the spam_classifier() from chapter 1 is bogus. It simply overfits on the emails_small set and, as a result, doesn’t generalize to larger datasets such as emails_full.

So let’s try something else. On average, emails with a high frequency of sequential capital letters are spam. What if you simply filtered spam based on one threshold for avg_capital_seq?

For example, you could filter all emails with avg_capital_seq > 4 as spam. By doing this, you increase the interpretability of the classifier and restrict its complexity. However, this increases the bias, i.e. the error due to restricting your model.

Your model no longer fits the small dataset perfectly, but it fits the big dataset better: increasing the bias made it generalize better over the complete dataset. Still, even though this simpler classifier no longer overfits, an accuracy of around 73% is far from satisfying for a spam filter.
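A minimal sketch of that single-threshold filter (the name spam_classifier_simple is invented for this sketch; the roughly 73% accuracy quoted above comes from evaluating such a filter on emails_full):

# Single-threshold filter: only a high average capital-run
# length flags an email as spam
spam_classifier_simple <- function(x) {
  prediction <- rep(NA, length(x))
  prediction[x > 4]  <- 1
  prediction[x <= 4] <- 0
  factor(prediction, levels = c("1", "0"))
}

pred_simple <- spam_classifier_simple(emails_full$avg_capital_seq)
conf_simple <- table(emails_full$spam, pred_simple)
sum(diag(conf_simple)) / sum(conf_simple) # accuracy on the full set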

Interpretability

  • We made a prediction model for car insurance claims. It uses a few attributes, and relations between these attributes, to predict the number of claims a customer will make.

Using only a few attributes keeps the complexity down, but adding relationships between those attributes increases it again; see the sketch below.
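A hypothetical sketch of that trade-off (the insurance data frame and its columns are made up for illustration): a model with an interaction term relates two attributes to each other and gains an extra coefficient, i.e. complexity.

set.seed(1)
# Made-up insurance data, for illustration only
insurance <- data.frame(
  age       = sample(18:80, 200, replace = TRUE),
  car_value = runif(200, 5000, 60000),
  claims    = rpois(200, 2)
)

# Few attributes, no relations between them: simpler, easier to interpret
fit_simple  <- lm(claims ~ age + car_value, data = insurance)

# Same attributes plus their interaction: one relation, more complexity
fit_complex <- lm(claims ~ age * car_value, data = insurance)

length(coef(fit_simple))  # 3 coefficients
length(coef(fit_complex)) # 4 coefficients; the interaction added one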