Overview

Taken from “Deep Learning” by Goodfellow, Bengio and Courville.

Tasks

  • Regression (learning an input–output relationship)
  • Classification
  • Transcription (handwriting and speech recognition)
  • Machine translation
  • Structured output (annotate aerial photos, describe images)
  • Anomaly detection
  • Synthesis (learn paintings of Picasso and generate samples)
  • Denoising
  • Density estimation (from samples \(x_i\), estimate the PDF \(p(x)\))
  • many more …

Performance

A quantitative measure of how well the task is performed, e.g., accuracy or error rate for classification.

Experience

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

The art of ML and AI

There are many hyperparameters that make things work. There are no general rules for setting them; they are often application specific. Success and failure can be hard to explain.

What we will do in the next 2.5 days

  • Classification – Fisher’s LDA, KNN, SVM, tree-based methods
  • Artificial neural networks
  • Unsupervised learning – PCA/SVD, clustering by K-means and hierarchical methods

What we won’t do

  • Recurrent NN
  • Reinforcement learning
  • Natural language processing

Fisher’s LDA (supervised learning)

Data: We are given a training set \((x_i,y_i)\), \(i=1:n\); \(x_i\in R^p\) are the input variables and \(y_i\) are the labels. Typically the label is one of \(K\geq 2\) classes.

Task: Given an input \(x\), predict the correct label.

The basic idea in Linear Discriminant Analysis is that the samples from each class are drawn from a multivariate normal distribution with class mean \(\mu_k\) and a common covariance \(\Sigma\). So, if \(X\) belongs to class \(k\), then

\[ X \sim N(\mu_k, \Sigma). \]

We use the training data to estimate \(\pi_k\), \(\mu_k\), and \(\Sigma\):

\[\hat{\pi}_k = N_k/N\]

\[\hat{\mu}_k = \sum_{y_i=k} x_i/N_k\]

\[\hat{\Sigma} = \sum_{k=1}^K \sum_{y_i=k} (x_i - \hat{\mu}_k)(x_i-\hat{\mu}_k)^T/(N-K)\]

Linear discriminant functions:

\[ \delta_k(x) = x^T\Sigma^{-1}\mu_k-\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \]

Method: Given \(x\), we calculate \(\delta_k(x)\) for each class and assign the class \(k\) that maximizes \(\delta_k(x)\). This rule is based on computing the posterior probability \(P(Y=k \mid X=x)\) and choosing the most probable class.

The decision boundary between class \(k\) and class \(k'\) is the set \(\{ x: \delta_k(x) = \delta_{k'}(x)\}\); since each \(\delta_k\) is linear in \(x\), the boundary is a hyperplane.
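
To make the method concrete, here is a minimal sketch that computes the \(\delta_k(x)\) directly from the plug-in estimates above. The function name lda_discriminants and the generic inputs X, y, x0 are illustrative; for real work we use MASS::lda below.

# Illustrative sketch: LDA discriminants from the plug-in estimates.
# X: n x p matrix of inputs, y: factor of class labels, x0: a new input vector.
lda_discriminants <- function(X, y, x0) {
  classes <- levels(y)
  N <- nrow(X); K <- length(classes)
  Sigma <- matrix(0, ncol(X), ncol(X))
  mus <- vector("list", K); pis <- numeric(K)
  for (k in seq_len(K)) {
    Xk <- X[y == classes[k], , drop = FALSE]
    mus[[k]] <- colMeans(Xk)                               # class mean
    pis[k]   <- nrow(Xk) / N                               # class prior
    Sigma    <- Sigma + crossprod(sweep(Xk, 2, mus[[k]]))  # within-class scatter
  }
  Sigma <- Sigma / (N - K)   # pooled covariance estimate, as in the formula above
  Sinv  <- solve(Sigma)
  # delta_k(x0) = x0' Sinv mu_k - (1/2) mu_k' Sinv mu_k + log pi_k
  sapply(seq_len(K), function(k)
    drop(t(x0) %*% Sinv %*% mus[[k]]
         - 0.5 * t(mus[[k]]) %*% Sinv %*% mus[[k]] + log(pis[k])))
}
# predicted class for x0: classes[which.max(lda_discriminants(X, y, x0))]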

library(ISLR)
library(MASS)
library(ggplot2)
library(dplyr)
library(class)

Select data for training and testing

# 70/30 train/test split of the Default data
smp_size <- floor(0.7 * nrow(Default))
set.seed(123)
train_ind <- sample(seq_len(nrow(Default)), size = smp_size)
train <- Default[train_ind, ]
test <- Default[-train_ind, ]
# visualize training set
ggplot(train,
       aes(x = income, y = balance, color = default)) +
  geom_point(size = 1, shape = 3)

LDA on training set

# fit LDA to the training data
lda.fit = lda(default ~ income + balance, data = train)
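
The fitted object stores the plug-in estimates discussed above, so printing them connects the code back to the formulas:

lda.fit$prior  # estimated priors pi_k (fractions of No/Yes in the training set)
lda.fit$means  # estimated class means mu_k for income and balance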

Plot decision boundary (from michael.hahsler.net)

# I got this from michael.hahsler.net
decisionplot <- function(model, data, class = NULL, predict_type = "class",
  resolution = 100, showgrid = TRUE, ...) {

  if(!is.null(class)) cl <- data[,class] else cl <- 1
  data <- data[,1:2]
  k <- length(unique(cl))

  plot(data, col = as.integer(cl)+1L, pch = as.integer(cl)+1L, ...)

  # make grid
  r <- sapply(data, range, na.rm = TRUE)
  xs <- seq(r[1,1], r[2,1], length.out = resolution)
  ys <- seq(r[1,2], r[2,2], length.out = resolution)
  g <- cbind(rep(xs, each = resolution), rep(ys, times = resolution))
  colnames(g) <- colnames(r)
  g <- as.data.frame(g)

  ### guess how to get class labels from predict
  ### (unfortunately not very consistent between models)
  p <- predict(model, g, type = predict_type)
  if(is.list(p)) p <- p$class
  p <- as.factor(p)

  if(showgrid) points(g, col = as.integer(p)+1L, pch = ".")

  z <- matrix(as.integer(p), nrow = resolution, byrow = TRUE)
  contour(xs, ys, z, add = TRUE, drawlabels = FALSE,
    lwd = 2, levels = (1:(k-1))+.5)

  invisible(z)
}
q <- train[1:7000, c("income","balance","default")]  # all 7000 training rows: x, y, class columns
decisionplot(lda.fit, q, class = "default")

Evaluate the LDA model on the test data

lda.pred = predict(lda.fit,test)
table(lda.pred$class,test$default)
##      
##         No  Yes
##   No  2906   71
##   Yes    7   16
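
From the confusion matrix we can read off the overall test error rate, here \((71+7)/3000 \approx 2.6\%\). Note, though, that this hides the class imbalance: only 16 of the 87 actual defaulters in the test set are caught.

mean(lda.pred$class != test$default)  # overall misclassification rate on the test set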

Homework: Learn about ROC curve for assessing classifier.
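
As a head start, here is one minimal sketch (not the only way): sweep a threshold over the posterior probability of default returned by predict and trace out the true and false positive rates.

# sketch: ROC curve for the LDA classifier by thresholding the posterior
post <- lda.pred$posterior[, "Yes"]
ts   <- seq(0, 1, length.out = 101)
tpr  <- sapply(ts, function(t) mean(post[test$default == "Yes"] > t))  # true positive rate
fpr  <- sapply(ts, function(t) mean(post[test$default == "No"]  > t))  # false positive rate
plot(fpr, tpr, type = "l",
     xlab = "false positive rate", ylab = "true positive rate")
abline(0, 1, lty = 2)  # diagonal = random guessing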

Quadratic Discriminant Analysis (QDA)

QDA assumes that each class has its own mean and covariance matrix:

\[X \sim N(\mu_k,\Sigma_k)\]

The discriminant function used to decide which class to assign to \(x\) is

\[\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) -\frac{1}{2}\log|\Sigma_k| + \log \pi_k \]

Note that this formula is quadratic in \(x\).
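
To see where this comes from, take the log of (prior \(\times\) Gaussian density) and drop whatever is constant across classes:

\[ \log\big(\pi_k f_k(x)\big) = \log \pi_k - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{p}{2}\log(2\pi) \]

The last term is the same for every \(k\), leaving \(\delta_k(x)\). Under the LDA assumption \(\Sigma_k = \Sigma\), the terms \(-\frac{1}{2}\log|\Sigma|\) and \(-\frac{1}{2}x^T\Sigma^{-1}x\) are also common to all classes, and expanding the quadratic recovers the linear \(\delta_k(x)\) given earlier.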

Try QDA on the Default data.

qda.fit = qda(default ~ income + balance, data = train)
decisionplot(qda.fit, q, class = "default")

qda.pred = predict(qda.fit, test)
table(qda.pred$class, test$default)
##      
##         No  Yes
##   No  2902   68
##   Yes   11   19
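
A quick comparison of overall test error suggests that the extra flexibility of QDA buys little on this data (LDA: 78, QDA: 79 errors out of 3000):

mean(lda.pred$class != test$default)  # LDA: 78/3000
mean(qda.pred$class != test$default)  # QDA: 79/3000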

K Nearest Neighbors

We are again given labeled data \((x_i,y_i)\). Note that KNN can also be used for regression tasks. The idea behind KNN is simple: you specify the number of neighbors \(k\) to check; given a test point \(x\), you find the \(k\) nearest neighbors of \(x\) among \(\{x_i\}\), read off their labels, and assign the majority label.
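
A minimal sketch of the rule for a single test point (the function name knn_predict_one is illustrative; below we use caret::knn3 and class::knn):

# illustrative: classify one point x0 by majority vote among its k nearest neighbors
knn_predict_one <- function(X, y, x0, k = 5) {
  d  <- sqrt(rowSums(sweep(X, 2, x0)^2))  # Euclidean distance to every training point
  nn <- order(d)[1:k]                     # indices of the k closest points
  names(which.max(table(y[nn])))          # majority label (ties broken by first max)
}

Because KNN is distance-based it is sensitive to the scale of the inputs; with balance and income on very different scales, standardizing the predictors first usually helps.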

library(caret)
## Loading required package: lattice
knn.fit <- knn3(default ~ balance + income, data = train, k=3)
decisionplot(knn.fit, q, class="default")

attach(Default)  # so balance, income, default can be referenced directly
# class::knn() takes raw matrices and a label vector rather than a formula
train.X = cbind(balance,income)[train_ind,]
test.X = cbind(balance,income)[-train_ind,]
train.default = default[train_ind]
knn.pred=knn(train.X,test.X,train.default,k=1)
table(knn.pred,default[-train_ind])
##         
## knn.pred   No  Yes
##      No  2847   65
##      Yes   66   22
knn.pred=knn(train.X,test.X,train.default,k=5)
table(knn.pred,default[-train_ind])
##         
## knn.pred   No  Yes
##      No  2891   75
##      Yes   22   12
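
The choice of \(k\) clearly matters, as the two tables above show. Here is a quick sketch for comparing several values (quick but slightly improper, since it reuses the test set; in practice you would pick \(k\) by cross-validation on the training data):

# compare misclassification rate over a grid of k values
ks   <- c(1, 3, 5, 10, 20, 50)
errs <- sapply(ks, function(k)
  mean(knn(train.X, test.X, train.default, k = k) != default[-train_ind]))
data.frame(k = ks, test_error = errs)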

Homework (easy): Do ISLR 4.6 “Lab: Logistic Regression, LDA, QDA, and KNN”

Homework (more challenging): Problem 10 on page 171 of ISLR (the Lab above will be helpful)
