Learning Objectives

In this lesson students will …

  • Learn the basics of image analysis

  • Apply machine learning algorithms to classify numeric characters

    • Linear Regression
    • K-nearest Neighbors
    • Logistic Regression
    • Classification Tree
    • Random Forest
    • Linear Discriminant Analysis
    • Quadratic Discriminant Analysis
  • Compare model fit using confusion matrices

Data Inspiration

The first step in handling mail at the post office is sorting letters by zip code. Originally, humans sorted the letters by hand, reading the zip code on each one. Today, thanks to machine learning algorithms, a computer can read the zip codes and a robot sorts the letters.

Source: http://rafalab.dfci.harvard.edu/dsbook/introduction-to-machine-learning.html

Example 1

Let’s start with a simplified case.

Goal: Classify 2’s vs 7’s

Group Discussion Question:

  • Task: Each group member, write down a 2 and a 7
  • Combine the data for your group. Do you see variability?
  • How do you distinguish the difference between a 2 and a 7?
  • How might you teach a computer to do this?
USE THIS SPACE FOR YOUR NOTES

Introducing the Data

library(tidyverse)
#install.packages("dslabs")
library(dslabs)

data(mnist_27)

# learn about the data
?mnist_27

# Let's make the data easier to work with
train27<-mnist_27$train
test27<-mnist_27$test
str(train27)
## 'data.frame':    800 obs. of  3 variables:
##  $ y  : Factor w/ 2 levels "2","7": 1 2 1 1 2 1 2 2 2 1 ...
##  $ x_1: num  0.0395 0.1607 0.0213 0.1358 0.3902 ...
##  $ x_2: num  0.1842 0.0893 0.2766 0.2222 0.3659 ...

Features Available

  • \(X_1\): Proportion of dark pixels that are in the upper left quadrant
  • \(X_2\): Proportion of dark pixels that are in the lower right quadrant
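To make these features concrete, here is a rough sketch of how quadrant proportions like these could be computed from the raw 28 x 28 MNIST images. The pixel-grid indexing and the “dark pixel” cutoff of 200 below are illustrative assumptions, not necessarily the exact recipe used to build mnist_27.

## SKETCH: quadrant features from raw MNIST (illustrative only)
mnist <- read_mnist()                                    # full MNIST data via dslabs (downloads it)

grid <- expand.grid(row = 1:28, col = 1:28)              # pixel positions in a 28 x 28 image
upper_left  <- which(grid$row <= 14 & grid$col <= 14)    # indices in the upper-left quadrant
lower_right <- which(grid$row >  14 & grid$col >  14)    # indices in the lower-right quadrant

dark <- mnist$train$images > 200                         # TRUE where a pixel is "dark" (assumed cutoff)
x_1_raw <- rowSums(dark[, upper_left])  / rowSums(dark)  # share of dark pixels in the upper left
x_2_raw <- rowSums(dark[, lower_right]) / rowSums(dark)  # share of dark pixels in the lower right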

Group Discussion Question:

  • Now that you know what variables are available, how would you teach a computer how to distinguish between a 2 and a 7?
  • Phrase a hypothesis in terms of the variables \(X_1\) and \(X_2\).
USE THIS SPACE FOR YOUR NOTES

STEP 0: Visualize

The data are already partitioned into training and testing sets. First, let’s create tables to assess the proportion of 2’s and 7’s in the partitions.

## TRAIN TABLE
tabTrain<-table(train27$y)
prop.table(tabTrain)
## 
##       2       7 
## 0.47375 0.52625
## TEST TABLE
tabTest<-table(test27$y)
prop.table(tabTest)
## 
##    2    7 
## 0.53 0.47

Now, let’s create graphics to assess the distributions of the feature values across the two categories (2 and 7).

## DENSITY for X_1
ggplot(train27, aes(x=x_1, fill=y))+
  geom_density(alpha=.5)

## DENSITY for X_2
ggplot(train27, aes(x=x_2, fill=y))+
  geom_density(alpha=.5)

Is there a bivariate relationship when we look at \(X_1\) and \(X_2\) together?

## SCATTERPLOT
ggplot(train27, aes(x=x_1, y=x_2, color=y))+
  geom_point()

Question:

  • What insights can be drawn from these graphics?
USE THIS SPACE FOR YOUR NOTES

Model A: Linear Regression

One of the simplest and most widely used modeling techniques is linear regression. While it might not be the most appropriate technique for classification, if we encode the two digits as a binary variable we can fit a linear model and use it to draw a line separating the two groups.
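In symbols, the idea is to code the outcome as 0 (a 2) or 1 (a 7), as we do in Step 1, and approximate the conditional probability of a 7 with a linear function of the two features:

\[p(x_1, x_2)=\Pr(Y=7\mid X_1=x_1, X_2=x_2)\approx\beta_0+\beta_1 x_1+\beta_2 x_2\]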

Step 1: Binary Variables

train27$y01<-as.numeric(train27$y)-1
test27$y01<-as.numeric(test27$y)-1

Step 2: Fit a MLR

lm27<-lm(y01~x_1+x_2, data=train27)
summary(lm27)
## 
## Call:
## lm(formula = y01 ~ x_1 + x_2, data = train27)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25163 -0.28624  0.01555  0.29685  1.03077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17749    0.04976   3.567 0.000383 ***
## x_1          3.78654    0.16698  22.677  < 2e-16 ***
## x_2         -1.22031    0.17207  -7.092 2.92e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3898 on 797 degrees of freedom
## Multiple R-squared:  0.3927, Adjusted R-squared:  0.3912 
## F-statistic: 257.7 on 2 and 797 DF,  p-value: < 2.2e-16

Step 3: Boundary Line

Consider the model:

\[0.5=\hat{\beta}_0+\hat{\beta}_1 x_1+\hat{\beta}_2 x_2\]

A predicted probability greater than 0.5 results in the classification “7”. A predicted probability less than 0.5 results in the classification “2”.

Then, we can fit a line
\[x_2=\frac{0.5-\hat{\beta}_0}{\hat{\beta}_2}-\frac{\hat{\beta}_1}{\hat{\beta}_2}x_1\]

Hint: Think of \(x_2\) as the response and \(x_1\) as the explanatory.

## BETAs
beta0<-lm27$coefficients[1]
beta_x1<-lm27$coefficients[2]
beta_x2<-lm27$coefficients[3]

## NEW COEFFICIENTS
new_yint<-(0.5-beta0)/beta_x2
new_slope<- -1*(beta_x1/beta_x2)

Add the line to the scatterplot

ggplot(train27, aes(x=x_1, y=x_2, color=y))+
  geom_point()+
  geom_abline(intercept=new_yint, 
              slope=new_slope)

Step 4: Prediction

For each test point, plug its \(x_1\) value into the boundary line defined above to get the boundary value of \(x_2\).

Then we can classify using the following rules:

  • If the observed \(X_2\) value is less than the boundary value (the point falls below the line), then classify as 7
  • If the observed \(X_2\) value is greater than the boundary value (the point falls above the line), then classify as 2
## FIT FOR PREDICTED VALUES AND CLASSIFY
lmClass<-test27%>%
  mutate(pred=new_yint+new_slope*x_1)%>%
  mutate(class=as.numeric(pred>x_2))

Step 5: Model Accuracy

## CORRECT RATE
mean(lmClass$y01==lmClass$class)
## [1] 0.75

Not too bad for an inappropriately applied model.

Model B: Logistic Regression

We will now try logistic regression, which is better suited to a binary outcome like this one.
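In symbols, the model fit below (note the \(x_1 x_2\) interaction implied by y ~ x_1*x_2) is

\[\log\left(\frac{\Pr(Y=7\mid x_1,x_2)}{1-\Pr(Y=7\mid x_1,x_2)}\right)=\beta_0+\beta_1 x_1+\beta_2 x_2+\beta_3 x_1 x_2\]

so, unlike linear regression, the fitted probabilities always stay between 0 and 1.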

Step 1: Fit A Logistic

log27<-glm(y~x_1*x_2, data=train27, family = "binomial")
summary(log27)
## 
## Call:
## glm(formula = y ~ x_1 * x_2, family = "binomial", data = train27)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4126  -0.6596   0.1544   0.6933   3.0199  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.9096     0.9449   0.963 0.335722    
## x_1           7.7811     4.9746   1.564 0.117780    
## x_2         -18.9360     3.6413  -5.200 1.99e-07 ***
## x_1:x_2      58.8139    17.6345   3.335 0.000853 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1106.83  on 799  degrees of freedom
## Residual deviance:  706.16  on 796  degrees of freedom
## AIC: 714.16
## 
## Number of Fisher Scoring iterations: 5

Question:

  • How can we interpret these coefficients in the context of these data?
USE THIS SPACE FOR YOUR NOTES

Step 2: Prediction

## Predict the probability of being a 7
log27_prop<-predict(log27, test27, type="response")
head(log27_prop)
##         1         2         3         4         5         6 
## 0.3499461 0.9091655 0.9135826 0.8284153 0.5434213 0.0491970

Step 3: Thresholding

Use 0.5 as the threshold:

## THRESHOLD
log27_pred<-ifelse(log27_prop>.5, "7", "2")

Step 4: Model Accuracy

### confusion mat
table(log27_pred, test27$y)
##           
## log27_pred  2  7
##          2 82 24
##          7 24 70
## correct
mean(log27_pred==test27$y)
## [1] 0.76
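Beyond the overall correct rate, a confusion matrix also gives sensitivity and specificity. One way to pull these out is sketched below; it assumes the caret package (which we also use later for the random forest) and treats “7” as the positive class.

## FULLER CONFUSION-MATRIX SUMMARY (sketch)
caret::confusionMatrix(factor(log27_pred, levels = levels(test27$y)),
                       test27$y,
                       positive = "7")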

Model C: K-Nearest Neighbors

How about a non-parametric method? K-nearest neighbors will do the trick.
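For a new point \((x_1, x_2)\), KNN finds the \(k\) closest training points (in Euclidean distance) and estimates the conditional probability of a 7 by the fraction of 7’s among those neighbors:

\[\hat{p}(x_1,x_2)=\frac{1}{k}\sum_{i\in N_k(x_1,x_2)} I(y_i=7)\]

where \(N_k(x_1,x_2)\) is the set of the \(k\) nearest neighbors. We classify as 7 when this estimate exceeds 0.5, i.e., by majority vote.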

Step 1: Define the Features and Outcomes

## TRAINING 
### FEATURE
trainFea<-train27%>%
  select(-c(y, y01))
### OUTCOME
trainOut<-train27$y

## TESTING 
### FEATURE
testFea<-test27%>%
  select(-c(y, y01))
### OUTCOME
testOut<-test27$y

Step 2: Fit a KNN (k=1)

library(class)

set.seed(123)

### KNN
knn.pred=knn(train = trainFea,
             test = testFea,
             cl = trainOut,
             k=1)

### CONFUSION MATRIX
cm<-table(knn.pred,testOut)
cm
##         testOut
## knn.pred  2  7
##        2 85 31
##        7 21 63
### CORRECT RATE
mean(knn.pred==testOut)
## [1] 0.74

Step 3: Find the Best k

set.seed(123)
error <- c()
for (i in 1:30){
  knnPred<- knn(train = trainFea,
                test = testFea,
                cl = trainOut, 
                k = i)
  error[i] = 1- mean(knnPred==testOut)
}

ggplot(data = data.frame(error), aes(x = 1:30, y = error)) +
  geom_line(color = "Blue")+
  xlab("Neighborhood Size")

which.min(error)
## [1] 8

Step 4: Fit a KNN (k=8)

### k=8
knn.pred8=knn(train = trainFea,
             test = testFea,
             cl = trainOut,
             k=8)

cm8<-table(knn.pred8,testOut)
cm8
##          testOut
## knn.pred8  2  7
##         2 94 17
##         7 12 77
### correct rate
mean(knn.pred8==testOut)
## [1] 0.855

That’s a substantial increase!

Model D: Classification Tree

Trees are nice because they are highly interpretable and mirror human decision making.
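Behind the scenes, rpart grows the tree by repeatedly choosing the split that most reduces node impurity. For classification, the default impurity measure is the Gini index, which for a node with proportion \(p\) of 7’s is

\[G=2p(1-p)\]

so a pure node (all 2’s or all 7’s) has \(G=0\) and a 50/50 node is maximally impure.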

Step 1: Fit a Tree

set.seed(123)
library(rpart)

### AVOID DATA LEAKAGE
train27<-train27%>%
  select(-y01)

test27<-test27%>%
  select(-y01)

### FIT A TREE
classTree<- rpart(y ~ .,
                  data = train27,
                  method = "class")

Step 2: Plot the Tree

### PLOT TREE
library(rpart.plot)
rpart.plot(classTree)

Question:

  • What patterns do you observe in the branching?
USE THIS SPACE FOR YOUR NOTES

Step 3: Pruning?

### Plot CP
plotcp(classTree)

## LOOKS LIKE THE FULL TREE IS BEST, NO NEED TO PRUNE
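For reference, if the CP plot had pointed to a smaller tree, pruning would look something like the sketch below (the cp value is purely illustrative):

### CP TABLE BEHIND THE PLOT ABOVE
printcp(classTree)

### PRUNE BACK TO THE SUBTREE AT A CHOSEN cp (illustrative value)
# prunedTree <- rpart::prune(classTree, cp = 0.02)
# rpart.plot(prunedTree)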

Step 4: Predict

### PREDICT
predTree1<-predict(classTree , test27, type="class")

Step 5: Model Accuracy

### CONFUSION MATRIX
cmTree1<-table(predTree1, test27$y)
cmTree1
##          
## predTree1  2  7
##         2 85 18
##         7 21 76
## CORRECT RATE
mean(predTree1==test27$y)
## [1] 0.805

Model E: Random Forest

What’s better than one tree? Many trees!

Step 1: Fit a Random Forest

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(123)
caretRF <- train(y ~., 
                 data = train27, 
                 method = "rf",
                 trControl = trainControl("cv", number = 10),
                 importance = TRUE
)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
# Best tuning parameter
caretRF$bestTune
##   mtry
## 1    2
caretRF$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 17.75%
## Confusion matrix:
##     2   7 class.error
## 2 305  74   0.1952507
## 7  68 353   0.1615202

Step 2: Predict

## PREDICT
predCaretRF <- caretRF %>% predict(test27)

Step 3: Model Accuracy

## TABLE
table(predCaretRF, test27$y)
##            
## predCaretRF  2  7
##           2 83 22
##           7 23 72
## CORRECT RATE
mean(predCaretRF==test27$y)
## [1] 0.775
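Because we set importance = TRUE, we can also check which feature the forest relies on most. A quick sketch using caret’s variable-importance helper:

## VARIABLE IMPORTANCE (sketch)
varImp(caretRF)
# plot(varImp(caretRF))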

Model F: Linear Discriminant

Linear discriminant analysis (LDA) is a modeling technique that can be used for both classification and dimension reduction. LDA finds the direction that maximizes the distance between the projected class means while minimizing the within-class variance of the projections; the decision boundary is then a line orthogonal to that direction.
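Under the standard LDA assumptions (each class is multivariate normal with its own mean \(\mu_k\) but a shared covariance matrix \(\Sigma\)), a point \(x=(x_1,x_2)^{\top}\) is assigned to the class with the largest discriminant score

\[\delta_k(x)=x^{\top}\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k+\log\pi_k\]

where \(\pi_k\) is the prior probability of class \(k\). Because \(\delta_k(x)\) is linear in \(x\), the boundary between the two classes is a straight line.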

Step 1: Fit LDA

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Fit the model
lda27 <- lda(y~., data = train27)
lda27
## Call:
## lda(y ~ ., data = train27)
## 
## Prior probabilities of groups:
##       2       7 
## 0.47375 0.52625 
## 
## Group means:
##         x_1       x_2
## 2 0.1287798 0.2831542
## 7 0.2341133 0.2881862
## 
## Coefficients of linear discriminants:
##           LD1
## x_1 15.509380
## x_2 -4.998317

Step 2: Prediction

# Make predictions
predLDA <- lda27 %>% predict(test27)

# Predicted classes
head(predLDA$class, 6)
## [1] 2 7 7 7 7 2
## Levels: 2 7
# Predicted probabilities of class membership
head(predLDA$posterior, 6)
##           2          7
## 1 0.6319696 0.36803040
## 2 0.1063043 0.89369568
## 3 0.1656544 0.83434560
## 4 0.1387310 0.86126898
## 5 0.4707492 0.52925081
## 6 0.9617298 0.03827015
# Linear discriminants
head(predLDA$x, 3)
##          LD1
## 1 -0.4436966
## 2  1.2160639
## 3  0.8975601

Step 3: Model Accuracy

# Model accuracy
mean(predLDA$class==test27$y)
## [1] 0.75
plot(lda27)

Step 4: Boundary Line

library(klaR)

partimat(y~., data = train27, method = "lda")

Model G: Quadratic Discriminant

Rather than a straight-line decision boundary, quadratic discriminant analysis (QDA) allows a curved (quadratic) one.
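QDA drops the shared-covariance assumption: each class gets its own covariance matrix \(\Sigma_k\), and a point is assigned to the class with the largest score

\[\delta_k(x)=-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)+\log\pi_k\]

which is quadratic in \(x\), so the decision boundary can curve.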

Step 1: Fit a QDA

library(MASS)

# Fit the model
qda27 <- qda(y~., data = train27)
qda27
## Call:
## qda(y ~ ., data = train27)
## 
## Prior probabilities of groups:
##       2       7 
## 0.47375 0.52625 
## 
## Group means:
##         x_1       x_2
## 2 0.1287798 0.2831542
## 7 0.2341133 0.2881862

Step 2: Predict

# Make predictions
predQDA <- qda27 %>% predict(test27)

Step 3: Model Accuracy

# Model accuracy
mean(predQDA$class==test27$y)
## [1] 0.82

Step 4: Boundary Line

library(klaR)

partimat(y~., data = train27, method = "qda")

Compare

Model A: Linear Regression

## CORRECT RATE
mean(lmClass$y01==lmClass$class)
## [1] 0.75

Model B: Logistic Regression

## correct
mean(log27_pred==test27$y)
## [1] 0.76

Model C: K-Nearest Neighbors

### correct rate
mean(knn.pred8==testOut)
## [1] 0.855

Model D: Classification Tree

## CORRECT RATE
mean(predTree1==test27$y)
## [1] 0.805

Model E: Random Forest

## CORRECT RATE
mean(predCaretRF==test27$y)
## [1] 0.775

Model F: Linear Discriminant

mean(predLDA$class==test27$y)
## [1] 0.75

Model G: Quadratic Discriminant

mean(predQDA$class==test27$y)
## [1] 0.82
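To see all of the correct rates side by side, one option is to collect them into a single table. A sketch, assuming all of the objects created above are still in the workspace:

## SUMMARY TABLE OF CORRECT RATES (sketch)
tibble(
  model = c("Linear regression", "Logistic regression", "KNN (k = 8)",
            "Classification tree", "Random forest", "LDA", "QDA"),
  accuracy = c(mean(lmClass$y01 == lmClass$class),
               mean(log27_pred == test27$y),
               mean(knn.pred8 == testOut),
               mean(predTree1 == test27$y),
               mean(predCaretRF == test27$y),
               mean(predLDA$class == test27$y),
               mean(predQDA$class == test27$y))
) %>%
  arrange(desc(accuracy))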