set.seed(1)
str(titanic)
## 'data.frame': 714 obs. of 4 variables:
## $ Survived: Factor w/ 2 levels "1","0": 2 1 1 1 2 2 2 1 1 1 ...
## $ Pclass : int 3 1 3 1 3 1 3 3 2 3 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
## $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
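library(rpart) # provides rpart() for decision trees
tree <- rpart(Survived ~ ., data = titanic, method = "class")
pred <- predict(tree, titanic, type = "class")
# Confusion matrix: rows are the actual labels, columns the predicted labels
conf <- table(titanic$Survived, pred)
conf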
##    pred
##       1   0
##   1 212  78
##   0  53 371
| | P (predicted) | N (predicted) |
|---|---|---|
| p (actual) | TP | FN |
| n (actual) | FP | TN |
\[ Accuracy = \frac{TP+TN}{TP+FN+FP+TN} \]
\[ Precision = \frac{TP}{TP+FP} \]
\[ Recall = \frac{TP}{TP+FN} \]
TP <- conf[1, 1] # this will be 212
FN <- conf[1, 2] # this will be 78
FP <- conf[2, 1]
TN <- conf[2, 2]
acc <- (TP + TN) / (TP + FN + FP + TN)
prec <- (TP) / (TP + FP)
rec <- TP / (TP + FN)
acc # [1] 0.8165266
prec # [1] 0.8
rec # [1] 0.7310345
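The same three ratios can be wrapped in a small function so they are easy to reuse on other confusion matrices. This is only a sketch: `classification_metrics()` and `conf_example` are hypothetical names, and the function assumes a 2x2 matrix whose first row and first column hold the positive class, as in the table above.

```r
# Hypothetical helper: accuracy, precision and recall from a 2x2 confusion
# matrix laid out as rows = actual, columns = predicted, positive class first.
classification_metrics <- function(conf) {
  TP <- conf[1, 1]
  FN <- conf[1, 2]
  FP <- conf[2, 1]
  TN <- conf[2, 2]
  c(accuracy  = (TP + TN) / (TP + FN + FP + TN),
    precision = TP / (TP + FP),
    recall    = TP / (TP + FN))
}

# Counts copied from the confusion matrix printed above
conf_example <- matrix(c(212,  78,
                          53, 371),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(actual = c("1", "0"),
                                       pred   = c("1", "0")))

metrics <- classification_metrics(conf_example)
round(metrics, 4)
```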
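Imagine you work at NASA. You need to measure the sound pressure produced by an airplane wing under different conditions. This sound pressure depends on the wind frequency, the angle of the wing, and several other settings. Instead of running tedious experiments, it occurs to you that you could build a model that predicts the sound pressure from these settings.
Multivariable linear regression model
freq : wind frequency
angle : wing angle
ch_length : chord length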
\[ RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^N ( y_i - \widehat{y}_i)^2} \]
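To make the formula concrete, here is a minimal sketch on three made-up observations. The helper name `calc_rmse()` and the toy vectors are assumptions for illustration only; they are not part of the airfoil data.

```r
# Hypothetical helper implementing the RMSE formula above
calc_rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Made-up values: the errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
y     <- c(3, 5, 7)
y_hat <- c(2, 5, 9)

rmse_toy <- calc_rmse(y, y_hat)
rmse_toy  # sqrt(5 / 3)
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.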
air <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat")
colnames(air) <- c("freq", "angle", "ch_length", "velocity", "thickness", "dec")
str(air)
## 'data.frame': 1503 obs. of 6 variables:
## $ freq : int 800 1000 1250 1600 2000 2500 3150 4000 5000 6300 ...
## $ angle : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ch_length: num 0.305 0.305 0.305 0.305 0.305 ...
## $ velocity : num 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 71.3 ...
## $ thickness: num 0.00266 0.00266 0.00266 0.00266 0.00266 ...
## $ dec : num 126 125 126 128 127 ...
# Inspect your colleague's code to build the model
fit <- lm(dec ~ freq + angle + ch_length, data = air)
# Use the model to predict for all values: pred
pred <- predict(fit)
# Use air$dec and pred to calculate the RMSE
rmse <- sqrt((1/nrow(air)) * sum((air$dec - pred)^2))
rmse
## [1] 5.215778
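Your model used three variables: wind frequency, wing angle, and chord length. Let's check whether a new model that also includes velocity and wing thickness predicts more accurately than the previous one.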
fit2 <- lm(dec ~ freq + angle + ch_length + velocity + thickness, data = air)
pred2 <- predict(fit2)
rmse2 <- sqrt(sum( (air$dec - pred2) ^ 2) / nrow(air))
rmse2
## [1] 4.799244
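One caveat when comparing RMSE values computed on the data the model was fitted to: adding predictors to a least-squares fit can never increase the training RMSE, so a lower value alone does not show that the larger model generalizes better. The sketch below illustrates this on a held-out set. It assumes the built-in `mtcars` data as a stand-in and a 70/30 split; `model_rmse()` and all variable names are hypothetical.

```r
# Assumption: mtcars (shipped with R) stands in for the airfoil data,
# and a 70/30 split is used; neither choice comes from the text above.
set.seed(1)
n_cars     <- nrow(mtcars)
train_idx  <- sample(n_cars, round(0.7 * n_cars))
cars_train <- mtcars[train_idx, ]
cars_test  <- mtcars[-train_idx, ]

# Hypothetical helper: RMSE of a fitted model on a given data frame
model_rmse <- function(fit, data) {
  sqrt(mean((data$mpg - predict(fit, newdata = data))^2))
}

fit_small <- lm(mpg ~ wt, data = cars_train)
fit_large <- lm(mpg ~ wt + hp + qsec, data = cars_train)

# Training RMSE: the larger (nested) model is never worse here
train_rmse <- c(small = model_rmse(fit_small, cars_train),
                large = model_rmse(fit_large, cars_train))

# Test RMSE: the fairer comparison, and it can go either way
test_rmse <- c(small = model_rmse(fit_small, cars_test),
               large = model_rmse(fit_large, cars_test))

train_rmse
test_rmse
```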
For further study: Software Carpentry
In the dataset seeds you can find various metrics such as area, perimeter and compactness for 210 seeds. (Source: UCIMLR). However, the seeds' labels were lost, so we don't know which metrics belong to which type of seed. What we do know is that there were three types of seeds.
seeds <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt")
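# Only 7 names are supplied for 8 columns, so the last column
# (the seed type in the UCI file) shows up as NA in the output below
colnames(seeds) <- c("area", "perimeter", "compactness", "length", "width", "asymmetry", "groove_length")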
set.seed(1)
str(seeds)
## 'data.frame': 210 obs. of 8 variables:
## $ area : num 15.3 14.9 14.3 13.8 16.1 ...
## $ perimeter : num 14.8 14.6 14.1 13.9 15 ...
## $ compactness : num 0.871 0.881 0.905 0.895 0.903 ...
## $ length : num 5.76 5.55 5.29 5.32 5.66 ...
## $ width : num 3.31 3.33 3.34 3.38 3.56 ...
## $ asymmetry : num 2.22 1.02 2.7 2.26 1.35 ...
## $ groove_length: num 5.22 4.96 4.83 4.8 5.17 ...
## $ NA : int 1 1 1 1 1 1 1 1 1 1 ...
km_seeds <- kmeans(seeds, 3)
plot(length ~ compactness, data = seeds, col = km_seeds$cluster)
# Print out the ratio of the WSS to the BSS
# WSS : Within sum of squares
# BSS : between cluster sum of squares
km_seeds$tot.withinss / km_seeds$betweenss
## [1] 0.2800729
km_seeds
## K-means clustering with 3 clusters of sizes 75, 61, 74
##
## Cluster means:
## area perimeter compactness length width asymmetry groove_length
## 1 11.90907 13.25027 0.8515493 5.222333 2.865093 4.722187 5.093040
## 2 18.72180 16.29738 0.8850869 6.208934 3.722672 3.603590 6.066098
## 3 14.63203 14.45324 0.8790973 5.561784 3.274892 2.744043 5.184932
## <NA>
## 1 2.866667
## 2 1.983607
## 3 1.135135
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [36] 3 3 2 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 3 3 3 3 3 3 3
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 3 3
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 217.4687 185.0922 223.1591
## (between_SS / total_SS = 78.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
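The WSS/BSS ratio used above relies on the fact that k-means splits the total sum of squares into a within-cluster part and a between-cluster part. A minimal sketch of that relationship, assuming the built-in `iris` measurements as stand-in data and `nstart = 20` as an arbitrary choice (neither is from the seeds example):

```r
# Assumption: the four numeric iris columns stand in for the seeds data
set.seed(1)
km_iris <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Total sum of squares = within-cluster part + between-cluster part
decomposition_holds <- isTRUE(all.equal(
  km_iris$totss,
  km_iris$tot.withinss + km_iris$betweenss
))
decomposition_holds

# A small WSS/BSS ratio means compact clusters that lie far apart
ratio <- km_iris$tot.withinss / km_iris$betweenss
ratio
```

The "between_SS / total_SS" percentage that `kmeans` prints expresses the same split from the other side: the closer it is to 100 %, the smaller the WSS/BSS ratio.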
set.seed(1)
# Shuffle the dataset, call the result shuffled
n <- nrow(titanic)
shuffled <- titanic[sample(n),]
# Split the data in train and test using a 7/3 split
train <- shuffled[1:round(0.7*n),]
test <- shuffled[(round(0.7*n) + 1):n,]
str(train)
str(test)
tree <- rpart(Survived ~ ., train, method = "class")
pred <- predict(tree, test, type="class")
conf <- table(test$Survived, pred) # confusion matrix
conf
set.seed(1)
# Initialize the accs vector
accs <- rep(0,6)
for (i in 1:6) {
  # These indices indicate the interval of the test set
  fold_size <- round((1/6) * nrow(shuffled))
  indices <- ((i - 1) * fold_size + 1):(i * fold_size)
  # Exclude them from the train set
  train <- shuffled[-indices, ]
  # Include them in the test set
  test <- shuffled[indices, ]
  # A model is learned using each training set
  tree <- rpart(Survived ~ ., train, method = "class")
  # Make a prediction on the test set using tree
  pred <- predict(tree, test, type = "class")
  # Assign the confusion matrix to conf
  conf <- table(test$Survived, pred)
  # Assign the accuracy of this model to the ith index in accs
  accs[i] <- sum(diag(conf)) / sum(conf)
}
mean(accs) # Print out the mean of accs
\[ K = \frac{N}{N \times \text{ratio of test set}} \]
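Since N cancels, the number of folds is simply the reciprocal of the test-set ratio. The sketch below works this out for a ratio of 1/6 and assigns fold labels in a way that also copes with row counts that are not a multiple of K. The names `test_ratio`, `k` and `fold_id` are hypothetical, and the cyclic assignment is one possible choice that assumes the rows were already shuffled.

```r
n          <- 714    # number of rows, as in the titanic data above
test_ratio <- 1 / 6  # share of rows held out in each fold
k          <- round(n / (n * test_ratio))  # same as round(1 / test_ratio)
k

# Assign every row to exactly one fold; fold sizes differ by at most one row.
# This assumes the data frame has been shuffled beforehand.
fold_id <- rep(seq_len(k), length.out = n)
fold_sizes <- table(fold_id)
fold_sizes

# Rows of fold i would form the test set: which(fold_id == i)
test_rows_fold1 <- which(fold_id == 1)
length(test_rows_fold1)
```

With fixed-width index intervals based on `round()`, a row count that is not a multiple of K can leave a few rows untested or run past the end of the data; labelling each row with a fold avoids both cases.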
> head(emails_full)
avg_capital_seq spam
1 1.500 0
2 4.941 1
3 3.429 1
4 3.493 1
5 3.380 0
6 3.689 1
spam_classifier <- function(x) {
  prediction <- rep(NA, length(x)) # initialize prediction vector
  prediction[x > 4] <- 1
  prediction[x >= 3 & x <= 4] <- 0
  prediction[x >= 2.2 & x < 3] <- 1
  prediction[x >= 1.4 & x < 2.2] <- 0
  prediction[x > 1.25 & x < 1.4] <- 1
  prediction[x <= 1.25] <- 0
  return(factor(prediction, levels = c("1", "0"))) # prediction is either 0 or 1
}
pred_full <- spam_classifier(emails_full$avg_capital_seq)
conf_full <- table(emails_full$spam, pred_full) # confusion matrix
acc_full <- sum(diag(conf_full))/sum(conf_full) # accuracy
acc_full
It’s official now, the spam_classifier() from chapter 1 is bogus. It simply overfits on the emails_small set and, as a result, doesn’t generalize to larger datasets such as emails_full.
So let’s try something else. On average, emails with a high frequency of sequential capital letters are spam. What if you simply filtered spam based on one threshold for avg_capital_seq?
For example, you could filter all emails with avg_capital_seq > 4 as spam. By doing this, you increase the interpretability of the classifier and restrict its complexity. However, this increases the bias, i.e. the error due to restricting your model.
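A single-threshold filter of this kind could look like the sketch below. The function name `simple_classifier()` and the data frame `emails_head` are hypothetical; the six rows are the ones printed by `head(emails_full)` above, used only so the example runs on its own.

```r
# The six rows shown by head(emails_full); the full dataset is not reproduced here
emails_head <- data.frame(
  avg_capital_seq = c(1.500, 4.941, 3.429, 3.493, 3.380, 3.689),
  spam = factor(c(0, 1, 1, 1, 0, 1), levels = c("1", "0"))
)

# Hypothetical one-rule classifier: spam (1) if x exceeds the threshold, else 0
simple_classifier <- function(x, threshold = 4) {
  prediction <- ifelse(x > threshold, 1, 0)
  factor(prediction, levels = c("1", "0"))
}

pred_head <- simple_classifier(emails_head$avg_capital_seq)
conf_head <- table(emails_head$spam, pred_head)  # rows = actual, columns = predicted
acc_head  <- sum(diag(conf_head)) / sum(conf_head)

conf_head
acc_head
```

Six rows are far too few to judge the rule; the point is only the shape of the classifier, which has one tunable number instead of six hand-picked intervals.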
Your model no longer fits the small dataset perfectly, but it fits the big dataset better. You increased the bias of the model and caused it to generalize better over the complete dataset. The first classifier overfits the data, but the simpler one is not good enough either: an accuracy of 73% is far from satisfactory for a spam filter.
Using few attributes will reduce the complexity, but adding relationships between these attributes will increase the complexity.