In this lesson students will …
Learn the basics of image analysis
Apply machine learning algorithms to classify numeric characters
Compare model fit using confusion matrices
The first step in handling mail received at the post office is sorting letters by zip code. Originally, humans sorted the letters by hand, reading the zip code on each one. Today, thanks to machine learning algorithms, a computer can read the zip codes and a robot can then sort the letters.
Source: http://rafalab.dfci.harvard.edu/dsbook/introduction-to-machine-learning.html
Let’s start with a simplified case.
Goal: Classify 2’s vs 7’s
Group Discussion Question:
USE THIS SPACE FOR YOUR NOTES
library(tidyverse)
#install.packages("dslabs")
library(dslabs)
data(mnist_27)
# learn about the data
?mnist_27
# Let's make the data easier to work with
train27<-mnist_27$train
test27<-mnist_27$test
str(train27)
## 'data.frame': 800 obs. of 3 variables:
## $ y : Factor w/ 2 levels "2","7": 1 2 1 1 2 1 2 2 2 1 ...
## $ x_1: num 0.0395 0.1607 0.0213 0.1358 0.3902 ...
## $ x_2: num 0.1842 0.0893 0.2766 0.2222 0.3659 ...
Group Discussion Question:
USE THIS SPACE FOR YOUR NOTES
The data are already partitioned into training and testing sets. First, let’s create tables to assess the proportion of 2’s and 7’s in the partitions.
## TRAIN TABLE
tabTrain<-table(train27$y)
prop.table(tabTrain)
##
## 2 7
## 0.47375 0.52625
## TEST TABLE
tabTest<-table(test27$y)
prop.table(tabTest)
##
## 2 7
## 0.53 0.47
Now, let's create graphics to assess the distributions of the feature values across the two categories (2 and 7). In this dataset, \(x_1\) and \(x_2\) are the proportions of dark pixels in the upper-left and lower-right quadrants of each image.
## DENSITY for X_1
ggplot(train27, aes(x=x_1, fill=y))+
geom_density(alpha=.5)
## DENSITY for X_2
ggplot(train27, aes(x=x_2, fill=y))+
geom_density(alpha=.5)
Is there a bivariate relationship between \(X_1\) and \(X_2\)?
## SCATTERPLOT
ggplot(train27, aes(x=x_1, y=x_2, color=y))+
geom_point()
Question:
USE THIS SPACE FOR YOUR NOTES
One of the simplest and most widely used modeling techniques is linear regression. While it may not be the most appropriate technique for classification, if we recode the two groups as a binary 0/1 variable, we can fit a line that separates them.
## RECODE THE OUTCOME AS 0/1: "2" -> 0, "7" -> 1
train27$y01<-as.numeric(train27$y)-1
test27$y01<-as.numeric(test27$y)-1
lm27<-lm(y01~x_1+x_2, data=train27)
summary(lm27)
##
## Call:
## lm(formula = y01 ~ x_1 + x_2, data = train27)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25163 -0.28624 0.01555 0.29685 1.03077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17749 0.04976 3.567 0.000383 ***
## x_1 3.78654 0.16698 22.677 < 2e-16 ***
## x_2 -1.22031 0.17207 -7.092 2.92e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3898 on 797 degrees of freedom
## Multiple R-squared: 0.3927, Adjusted R-squared: 0.3912
## F-statistic: 257.7 on 2 and 797 DF, p-value: < 2.2e-16
Consider the decision boundary where the predicted value equals 0.5:
\[0.5=\hat{\beta}_0+\hat{\beta}_1 x_1+\hat{\beta}_2 x_2\]
A predicted value greater than 0.5 results in the classification "7". A predicted value less than 0.5 results in the classification "2".
Solving for \(x_2\), the decision boundary is the line
\[x_2=\frac{0.5-\hat{\beta}_0}{\hat{\beta}_2}-\frac{\hat{\beta}_1}{\hat{\beta}_2}x_1\]
Hint: Think of \(x_2\) as the response and \(x_1\) as the explanatory.
## BETAs
beta0<-lm27$coefficients[1]
beta_x1<-lm27$coefficients[2]
beta_x2<-lm27$coefficients[3]
## NEW COEFFICIENTS
new_yint<-(0.5-beta0)/beta_x2
new_slope<- -1*(beta_x1/beta_x2)
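Plugging the fitted coefficients from the regression summary into the boundary equation gives, approximately,
\[x_2\approx\frac{0.5-0.177}{-1.220}-\frac{3.787}{-1.220}\,x_1\approx-0.264+3.103\,x_1\]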
Add the line to the scatterplot
ggplot(train27, aes(x=x_1, y=x_2, color=y))+
geom_point()+
geom_abline(intercept=new_yint,
slope=new_slope)
For each test observation, compute the value of the boundary line at its \(x_1\). Because \(\hat{\beta}_2\) is negative, points that fall below the line (where the line's value exceeds \(x_2\)) are classified as "7" (coded 1), and points above it as "2" (coded 0):
## FIT FOR PREDICTED VALUES AND CLASSIFY
lmClass<-test27%>%
mutate(pred=new_yint+new_slope*x_1)%>%
mutate(class=as.numeric(pred>x_2))
## CORRECT RATE
mean(lmClass$y01==lmClass$class)
## [1] 0.75
Not too bad for an inappropriately applied model.
Next, we try logistic regression, which is designed for binary outcomes such as this one.
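The formula y ~ x_1*x_2 used below expands to x_1 + x_2 + x_1:x_2, so the model being fit is
\[\log\frac{p}{1-p}=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_1 x_2\]
where \(p\) is the probability that an image is a 7 (the second level of the factor y). Setting \(p = 0.5\) gives a curved decision boundary in the \((x_1, x_2)\) plane rather than a straight line.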
log27<-glm(y~x_1*x_2, data=train27, family = "binomial")
summary(log27)
##
## Call:
## glm(formula = y ~ x_1 * x_2, family = "binomial", data = train27)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4126 -0.6596 0.1544 0.6933 3.0199
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9096 0.9449 0.963 0.335722
## x_1 7.7811 4.9746 1.564 0.117780
## x_2 -18.9360 3.6413 -5.200 1.99e-07 ***
## x_1:x_2 58.8139 17.6345 3.335 0.000853 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1106.83 on 799 degrees of freedom
## Residual deviance: 706.16 on 796 degrees of freedom
## AIC: 714.16
##
## Number of Fisher Scoring iterations: 5
Question:
USE THIS SPACE FOR YOUR NOTES
## Predict the probability of being a 7
log27_prop<-predict(log27, test27, type="response")
head(log27_prop)
## 1 2 3 4 5 6
## 0.3499461 0.9091655 0.9135826 0.8284153 0.5434213 0.0491970
Use 0.5 as the threshold:
## THRESHOLD
log27_pred<-ifelse(log27_prop>.5, "7", "2")
### confusion mat
table(log27_pred, test27$y)
##
## log27_pred 2 7
## 2 82 24
## 7 24 70
## correct
mean(log27_pred==test27$y)
## [1] 0.76
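The confusion matrix carries more information than the overall correct rate. As a quick sketch (treating "7" as the positive class, a choice made here for illustration), we can save the table to an object, call it cmLog, and read sensitivity and specificity directly off it:
## SAVE THE CONFUSION MATRIX
cmLog<-table(log27_pred, test27$y)
## SENSITIVITY: true 7's classified as 7, i.e. 70/(70+24) from the table above
cmLog["7","7"]/sum(cmLog[,"7"])
## SPECIFICITY: true 2's classified as 2, i.e. 82/(82+24) from the table above
cmLog["2","2"]/sum(cmLog[,"2"])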
How about a non-parametric method? K-nearest neighbors (KNN) will do the trick: each test point is classified by a majority vote of its k closest training points.
## TRAINING
### FEATURE
trainFea<-train27%>%
select(-c(y, y01))
### OUTCOME
trainOut<-train27$y
## TESTING
### FEATURE
testFea<-test27%>%
select(-c(y, y01))
### OUTCOME
testOut<-test27$y
library(class)
set.seed(123)
### KNN
knn.pred=knn(train = trainFea,
test = testFea,
cl = trainOut,
k=1)
### CONFUSION MATRIX
cm<-table(knn.pred,testOut)
cm
## testOut
## knn.pred 2 7
## 2 85 31
## 7 21 63
### CORRECT RATE
mean(knn.pred==testOut)
## [1] 0.74
set.seed(123)
error <- c()
for (i in 1:30){
knnPred<- knn(train = trainFea,
test = testFea,
cl = trainOut,
k = i)
error[i] = 1- mean(knnPred==testOut)
}
ggplot(data = data.frame(error), aes(x = 1:30, y = error)) +
geom_line(color = "Blue")+
xlab("Neighborhood Size")
which.min(error)
## [1] 8
### k=8
knn.pred8=knn(train = trainFea,
test = testFea,
cl = trainOut,
k=8)
cm8<-table(knn.pred8,testOut)
cm8
## testOut
## knn.pred8 2 7
## 2 94 17
## 7 12 77
### correct rate
mean(knn.pred8==testOut)
## [1] 0.855
That’s a substantial increase!
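One caution: the loop above chose k using the test set, so the reported accuracy for k = 8 is a bit optimistic. A more careful approach tunes k by cross-validation on the training data only. Here is a minimal sketch (not part of the original analysis, assuming the caret package is installed; we load caret formally in the random forest section below):
library(caret)
set.seed(123)
### TUNE k WITH 10-FOLD CV ON THE TRAINING SET ONLY
knnCV<-train(y ~ x_1 + x_2,
             data = train27,
             method = "knn",
             tuneGrid = data.frame(k = 1:30),
             trControl = trainControl("cv", number = 10))
### k CHOSEN BY CROSS-VALIDATION
knnCV$bestTune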
Trees are nice because they are highly interpretable and mirror human decision making.
set.seed(123)
library(rpart)
### AVOID DATA LEAKAGE
train27<-train27%>%
select(-y01)
test27<-test27%>%
select(-y01)
### FIT A TREE
classTree<- rpart(y ~ .,
data = train27,
method = "class")
### PLOT TREE
library(rpart.plot)
rpart.plot(classTree)
Question:
USE THIS SPACE FOR YOUR NOTES
### Plot CP
plotcp(classTree)
## LOOKS LIKE THE FULL TREE IS BEST, NO NEED TO PRUNE
### PREDICT
predTree1<-predict(classTree , test27, type="class")
### CONFUSION MATRIX
cmTree1<-table(predTree1, test27$y)
cmTree1
##
## predTree1 2 7
## 2 85 18
## 7 21 76
## CORRECT RATE
mean(predTree1==test27$y)
## [1] 0.805
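Had the CP plot suggested that a smaller tree generalizes better, we could prune back to the complexity parameter with the lowest cross-validated error. A sketch using the cptable stored in the fitted rpart object (bestCP and prunedTree are new names introduced here):
### CP VALUE WITH THE SMALLEST CROSS-VALIDATED ERROR (xerror)
bestCP<-classTree$cptable[which.min(classTree$cptable[,"xerror"]), "CP"]
### PRUNE AND RE-PLOT
prunedTree<-prune(classTree, cp = bestCP)
rpart.plot(prunedTree)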
What's better than one tree? Many trees! A random forest grows many trees on bootstrap samples of the training data, considering a random subset of the features at each split, and classifies by majority vote across the trees.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(123)
caretRF <- train(y ~.,
data = train27,
method = "rf",
trControl = trainControl("cv", number = 10),
importance = TRUE
)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
# Best tuning parameter
caretRF$bestTune
## mtry
## 1 2
caretRF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 17.75%
## Confusion matrix:
## 2 7 class.error
## 2 305 74 0.1952507
## 7 68 353 0.1615202
## PREDICT
predCaretRF <- caretRF %>% predict(test27)
## TABLE
table(predCaretRF, test27$y)
##
## predCaretRF 2 7
## 2 83 22
## 7 23 72
## CORRECT RATE
mean(predCaretRF==test27$y)
## [1] 0.775
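As a quick check that was not part of the original output, because we set importance = TRUE we can also ask caret how much each feature contributes to the forest:
## VARIABLE IMPORTANCE
varImp(caretRF)
plot(varImp(caretRF))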
Linear discriminant analysis (LDA) is a modeling technique that can be used for both classification and dimension reduction. LDA seeks a projection that maximizes the distance between the projected class means while minimizing the projected within-class variance; an orthogonal direction then defines the decision boundary.
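In symbols, with class means \(\mu_2\) and \(\mu_7\) and pooled within-class covariance \(\Sigma_W\), LDA chooses the projection direction \(w\) that maximizes Fisher's criterion
\[J(w)=\frac{\left(w^\top(\mu_7-\mu_2)\right)^2}{w^\top\Sigma_W\,w},\]
and the decision boundary is a line orthogonal to that direction, shifted to account for the class priors.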
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Fit the model
lda27 <- lda(y~., data = train27)
lda27
## Call:
## lda(y ~ ., data = train27)
##
## Prior probabilities of groups:
## 2 7
## 0.47375 0.52625
##
## Group means:
## x_1 x_2
## 2 0.1287798 0.2831542
## 7 0.2341133 0.2881862
##
## Coefficients of linear discriminants:
## LD1
## x_1 15.509380
## x_2 -4.998317
# Make predictions
predLDA <- lda27 %>% predict(test27)
# Predicted classes
head(predLDA$class, 6)
## [1] 2 7 7 7 7 2
## Levels: 2 7
# Predicted probabilities of class membership.
head(predLDA$posterior, 6)
## 2 7
## 1 0.6319696 0.36803040
## 2 0.1063043 0.89369568
## 3 0.1656544 0.83434560
## 4 0.1387310 0.86126898
## 5 0.4707492 0.52925081
## 6 0.9617298 0.03827015
# Linear discriminants
head(predLDA$x, 3)
## LD1
## 1 -0.4436966
## 2 1.2160639
## 3 0.8975601
# Model accuracy
mean(predLDA$class==test27$y)
## [1] 0.75
plot(lda27)
library(klaR)
partimat(y~., data = train27, method = "lda")
Rather than fitting a straight (linear) decision boundary, we can use a quadratic one. Quadratic discriminant analysis (QDA) estimates a separate covariance matrix for each class, which makes the boundary a quadratic curve.
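In symbols, QDA assigns a point \(x\) to the class \(k\in\{2,7\}\) with the larger discriminant score, where each class has its own mean \(\mu_k\), covariance \(\Sigma_k\), and prior \(\pi_k\):
\[\delta_k(x)=-\tfrac{1}{2}\log|\Sigma_k|-\tfrac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)+\log\pi_k\]
Because the quadratic term \((x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)\) differs between the classes, the resulting boundary is curved.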
library(MASS)
# Fit the model
qda27 <- qda(y~., data = train27)
qda27
## Call:
## qda(y ~ ., data = train27)
##
## Prior probabilities of groups:
## 2 7
## 0.47375 0.52625
##
## Group means:
## x_1 x_2
## 2 0.1287798 0.2831542
## 7 0.2341133 0.2881862
# Make predictions
predQDA <- qda27 %>% predict(test27)
# Model accuracy
mean(predQDA$class==test27$y)
## [1] 0.82
library(klaR)
partimat(y~., data = train27, method = "qda")
Finally, let's compare the test-set correct rates of all the models we fit.
## CORRECT RATE: LINEAR REGRESSION
mean(lmClass$y01==lmClass$class)
## [1] 0.75
## CORRECT RATE: LOGISTIC REGRESSION
mean(log27_pred==test27$y)
## [1] 0.76
## CORRECT RATE: KNN (k=8)
mean(knn.pred8==testOut)
## [1] 0.855
## CORRECT RATE: CLASSIFICATION TREE
mean(predTree1==test27$y)
## [1] 0.805
## CORRECT RATE: RANDOM FOREST
mean(predCaretRF==test27$y)
## [1] 0.775
## CORRECT RATE: LDA
mean(predLDA$class==test27$y)
## [1] 0.75
## CORRECT RATE: QDA
mean(predQDA$class==test27$y)
## [1] 0.82
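We can also gather these correct rates into a single tibble for a side-by-side view (a convenience sketch that simply reuses the objects already in the workspace):
## SUMMARY TABLE OF TEST-SET ACCURACY
tibble(model = c("Linear regression", "Logistic regression", "KNN (k = 8)",
                 "Classification tree", "Random forest", "LDA", "QDA"),
       accuracy = c(mean(lmClass$y01==lmClass$class),
                    mean(log27_pred==test27$y),
                    mean(knn.pred8==testOut),
                    mean(predTree1==test27$y),
                    mean(predCaretRF==test27$y),
                    mean(predLDA$class==test27$y),
                    mean(predQDA$class==test27$y)))%>%
  arrange(desc(accuracy))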