Separating Hyperplane

Data are \((y_i,x_i)\) for \(i=1,\dots,n\), where \(y_i \in \{-1,+1\}\) are the labels of the two classes and \(x_i \in R^p\). We look for a hyperplane that separates the two classes. Assuming that the data are separable, we seek a vector \((\beta_0, \beta)\), \(\beta \in R^p\), such that \[ y_i (\beta_0 + \beta^T x_i) > 0 \] Once \((\beta_0,\beta)\) has been determined from the data, you use \[f(x) = \beta_0 + \beta^T x \] to classify: if \(f(x)>0\), assign label “1”; otherwise, assign label “-1”.
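
As a minimal sketch of this classification rule in R (the coefficients beta0 and beta below are made up purely for illustration):

# hypothetical coefficients, here with p = 2
beta0 <- -0.5
beta  <- c(1, 2)
f <- function(x) beta0 + sum(beta * x)
# assign a label by the sign of f(x)
x.new <- c(0.3, 0.4)
ifelse(f(x.new) > 0, 1, -1)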

If the data are indeed separable, there are typically infinitely many separating hyperplanes, so this condition alone does not determine a unique solution.

Maximal Margin Classifier

The margin is the smallest distance from the hyperplane to the observations \(\{x_i\}\). We can maximize the margin by solving the optimization problem \[\begin{eqnarray*} && \max_{(\beta_0,\beta), M} M \\ && \mbox{subj} \;\; \| \beta \|^2 = 1\\ && \;\;\;\;\; y_i (\beta_0 + \beta^T x_i) \geq M \end{eqnarray*}\]
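
Since the constraint \(\|\beta\|^2 = 1\) only fixes the scale of \((\beta_0,\beta)\), an equivalent formulation absorbs \(M\) into the scaling and solves \[\begin{eqnarray*} && \min_{(\beta_0,\beta)} \| \beta \|^2 \\ && \mbox{subj} \;\; y_i (\beta_0 + \beta^T x_i) \geq 1 \end{eqnarray*}\] in which case the maximal margin is \(M_* = 1/\|\beta_*\|\), where \(\beta_*\) is the minimizer.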

Support vectors are the points \(x_i\) for which \(y_i f(x_i) = M_*\), where \(M_*\) is the maximal margin (the optimal value of the problem above); they are the observations that lie exactly on the edge of the margin.

Support Vector Machine (SVM)

The idea behind the SVM is simple: it extends the maximal margin classifier to the case where the data are not separable by introducing slack variables into the formulation. Let \(\epsilon \in R^n\). We propose the optimization problem \[\begin{eqnarray*} && \max_{(\beta_0,\beta), \epsilon, M} M \\ && \mbox{subj} \;\; \| \beta \|^2 = 1\\ && \;\;\;\;\; y_i (\beta_0 + \beta^T x_i) \geq M (1-\epsilon_i)\\ && \;\;\;\;\; \epsilon_i \geq 0, \; \sum_{i=1}^n \epsilon_i \leq C \end{eqnarray*}\]

The constant \(C\) is a tuning parameter that controls how many training points may violate the margin or be misclassified (and hence the width of the margin). Note that setting \(C\) too small may make the problem infeasible when the data are not separable. Note also that the number of unknowns has blown up to \(1+p+n\), so if you have a billion data points, this may be difficult to compute.

Role of slack variables: if \(\epsilon_i = 0\), the \(i\)th observation is on the correct side of the margin; if \(0 < \epsilon_i \leq 1\), it violates the margin but is still on the correct side of the hyperplane; if \(\epsilon_i > 1\), it is on the wrong side of the hyperplane, i.e., misclassified.
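
In practice this formulation is rarely solved directly. The svm() function in the e1071 package (used in the demo below) instead takes a cost argument that penalizes margin violations, so a large cost roughly plays the role of a small slack budget \(C\) here. A common way to choose it is cross-validation with tune(). A minimal sketch, assuming a data frame dat with a factor response y like the one built in the demo below:

library(e1071)
# 10-fold cross-validation over an (arbitrarily chosen) grid of cost values
tune.out <- tune(svm, y ~ ., data = dat, kernel = "linear",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tune.out)
tune.out$best.model   # the fit at the cost value chosen by cross-validation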

Nonlinear decision boundaries are possible by using kernels (“the kernel trick”, or “kernelization”). We replace the classifier by \[ f(x) = \beta_0 + \sum_{i=1}^n \alpha_i K(x,x_i) \] where \(K(x,y)\) is the kernel. A popular choice of kernel is the radial basis function (RBF) \[K(x,x_i) = \exp(-\gamma \|x-x_i\|^2)\]
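
The RBF kernel is easy to compute directly. A minimal sketch (the function name rbf and the value of \(\gamma\) are arbitrary; this is the kernel used by kernel="radial" in svm() below, with gamma as its tuning parameter):

# RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)
rbf <- function(x, y, gamma = 0.5) exp(-gamma * sum((x - y)^2))
rbf(c(0, 0), c(1, 1))   # = exp(-0.5 * 2) = exp(-1)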

Demo.

library(e1071)
library(ggplot2)
#generate simulated data
set.seed(1)
x1 = rnorm(20)
x2 = rnorm(20)
y = c(rep(-1,10), rep(1,10))
#create better separation
x1[y==1] = x1[y==1] + 1
x2[y==1] = x2[y==1] + 1
# assemble into a data frame for plotting
D <- cbind(y,x1,x2)
Df <- as.data.frame(D)
ggplot(Df, aes(x=x1, y=x2, color=y)) + geom_point()

# rebuild the data frame with y as a factor, which svm() needs for classification
dat = data.frame(cbind(x1,x2), y=as.factor(y))
svmfit <- svm(y ~ ., data=dat, kernel="linear", cost=0.01, scale=FALSE)

plot(svmfit, dat)

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 0.01, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.01 
##       gamma:  0.5 
## 
## Number of Support Vectors:  20
## 
##  ( 10 10 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
svmfit <- svm(y ~ ., data=dat, kernel="radial", cost=10, scale=FALSE)

plot(svmfit, dat)

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "radial", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.5 
## 
## Number of Support Vectors:  10
## 
##  ( 6 4 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
# training and testing
set.seed(123)
# generate 200 points on the plane
x = matrix(rnorm(200*2), ncol=2)
# make data set:
# pull apart the first 100 from the next 100
x[1:100,] = x[1:100,] + 1.2
x[101:200,] = x[101:200,] - 1.2
y = c(rep(1,100), rep(-1,100))
dat = data.frame(x = x, y = as.factor(y))
train = sample(200,100)
# visualize
ggplot(dat[train,], aes(x=x.2, y=x.1, color=y)) + geom_point()

svmfit = svm(y ~ ., data = dat[train,], kernel="radial", gamma = 1, cost = 1)
plot(svmfit, dat[train,])

# try on test set
table(true=dat[-train,"y"], pred=predict(svmfit,dat[-train,]))
##     pred
## true -1  1
##   -1 51  2
##   1   0 47
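
The same comparison expressed as an overall test error rate (using the objects above):

# fraction of test points misclassified (2 out of 100 here)
mean(predict(svmfit, dat[-train,]) != dat[-train, "y"])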

# inspect the training rows if desired
dat[train,]

Decision Tree

(see PDF notes for background)

library(tree)
library(ISLR)
?Carseats
attach(Carseats)
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
# create a binary response: High = "Yes" when Sales exceeds 8 (thousand units)
High = ifelse(Sales <= 8, "No", "Yes")
Carseats = data.frame(Carseats, High)
# grab 300 (of 400) rows for training
train = sample(1:nrow(Carseats), 300)
Carseats.train = Carseats[train,]
Carseats.test = Carseats[-train,]
High.train = High[train]
High.test = High[-train]
# fit a classification tree on the training set, excluding Sales (High is derived from it)
tree.carseats = tree(High ~ . -Sales, Carseats.train)
summary(tree.carseats)
## 
## Classification tree:
## tree(formula = High ~ . - Sales, data = Carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "CompPrice"   "Income"      "Age"        
## [6] "Advertising" "Education"  
## Number of terminal nodes:  24 
## Residual mean deviance:  0.3641 = 100.5 / 276 
## Misclassification error rate: 0.08 = 24 / 300
plot(tree.carseats)
text(tree.carseats, pretty = 0)

# now test model
tree.pred = predict(tree.carseats, Carseats.test, type="class")
table(tree.pred, High.test)
##          High.test
## tree.pred No Yes
##       No  40  14
##       Yes 15  31
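
So on this train/test split the tree misclassifies \(14 + 15 = 29\) of the 100 test observations, a test error rate of about 29%.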

The deviance is \[ - 2 \sum_m \sum_k n_{mk}\log \hat{p}_{mk} \] where \(n_{mk}\) is the number of observations in the \(m\)th terminal node that belong to class \(k\), and \(\hat{p}_{mk}\) is the proportion of observations in that node belonging to class \(k\).
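
As a small numerical check of this formula (the counts below are made up: two terminal nodes, two classes):

# rows = terminal nodes m, columns = classes k (hypothetical counts)
n.mk <- rbind(c(40, 10), c(5, 45))
p.mk <- n.mk / rowSums(n.mk)             # class proportions within each node
dev.total <- -2 * sum(n.mk * log(p.mk))  # the deviance as defined above
dev.total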

The tree may be a bit unwieldy. You can prune it.

# prune back to 9 terminal nodes, guided by the misclassification rate
prune.carseats = prune.misclass(tree.carseats, best=9)
plot(prune.carseats)
text(prune.carseats, pretty=0)

Better still, use cv.tree(), which performs cross-validation to determine the optimal level of tree complexity.
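
A minimal sketch of how cv.tree() is typically used (this follows the usual ISLR pattern; the suggested homework below works through it in full, and results vary slightly from run to run since the CV folds are random):

# cross-validate, using the misclassification rate to guide pruning
cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass)
# dev is the number of CV misclassifications at each tree size
plot(cv.carseats$size, cv.carseats$dev, type = "b")
# prune to the size with the smallest CV error
best.size = cv.carseats$size[which.min(cv.carseats$dev)]
prune.cv = prune.misclass(tree.carseats, best = best.size)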

Suggested homework: work through the pruning example on p. 327 of ISLR.

Suggested homework (more involved): learn tree-based regression in ISLR and work through Sections 8.3.2 - 8.3.4