Read the \(social.csv\) file into R; it contains the gender, age and salary information for the customers of a store, together with a binary variable representing whether the customer purchased a product or not. The data set contains the following columns:
UserId: ID of the customer.
Gender: Gender of the customer.
Age: Age of the customer in years.
EstimatedSalary: Yearly estimated salary of the customer.
Purchased: Binary variable with value 1 if the customer purchased the item, 0 otherwise.
The aim of the following exercise is to train a support vector machine (SVM) that will predict whether a customer will purchase their product or not (based on the gender, age and estimated salary information).
library(readr)
social <- read_csv("./social.csv")
## Rows: 400 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Gender
## dbl (4): User ID, Age, EstimatedSalary, Purchased
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(social)
The SVM can also be formulated as the following unconstrained optimization problem:
\[ \min_{w , b}\; \frac{1}{2} \lVert w \rVert^2 + C \sum_{n = 1}^{N} \max\bigl(0,\, 1 - y_n ( \langle w, x_n \rangle + b ) \bigr) \]
(0.1)
(see e.g. equation (12.31)) where \(C > 0\) is a trade-off parameter between the classification error and the size of the margin. Here we take
\[ y_n = \begin{cases} +1 & \text{if Purchased} = 1, \\ -1 & \text{if Purchased} = 0 \end{cases} \]
and we take the input data (features) as
\[ x_n = [\, \mathrm{Age}_n,\ \mathrm{EstimatedSalary}_n \,] \]
where \(\mathrm{Age}_n\) and \(\mathrm{EstimatedSalary}_n\) are the n-th entries of the Age and EstimatedSalary columns in the data (corresponding to the n-th customer), respectively. For simplicity and ease of visualization, we will not be using Gender as a predictor (feature).
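As a concrete illustration (a minimal sketch; the object names y_vec and X_mat are only placeholders), the labels and the feature matrix could be built from the data frame as follows:
# labels y_n: +1 if the customer purchased, -1 otherwise
y_vec <- ifelse(social$Purchased == 1, 1, -1)
# feature matrix whose n-th row is x_n = [Age_n, EstimatedSalary_n]
X_mat <- as.matrix(social[, c("Age", "EstimatedSalary")])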
The objective (0.1) can be written as
\[ \min_{w , b} F(w,b) = \frac{1}{2} \lVert w \rVert^2 + C \sum_{n = 1}^{N} \ell(t_n), \]
(0.2)
where
\[ \ell(t) := \max(0,\, 1 - t), \qquad t_n = y_n \bigl( \langle w, x_n \rangle + b \bigr). \]
The function \(\ell(t)\) is called the “hinge loss”. This function has a “kink” at \(t = 1\) and is not differentiable at \(t = 1\), as illustrated in Figure 1 below.
Figure 1 Hinge loss \(\ell(t)\) vs. t
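For reference, a plot like Figure 1 can be reproduced with a couple of lines of base R (illustrative only):
# hinge loss l(t) = max(0, 1 - t) on a grid of t values
t_grid <- seq(-2, 3, by = 0.01)
plot(t_grid, pmax(0, 1 - t_grid), type = "l", xlab = "t", ylab = "hinge loss")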
At \(t = 1\), the tangent line to the hinge loss is not well defined: any line with slope \(m \in [-1, 0]\) passing through the point \((1, 0)\) lies below the function, and the ordinary gradient does not exist. At this point, instead of a single derivative there is a set of possible slopes lying between \(-1\) and \(0\); this leads to the notion of a subgradient, which is a generalized notion of the gradient. The subdifferential \(\partial\ell\) of the hinge loss at a point \(t\) is defined as the set
\[ \partial\ell (t) = \begin{cases} -1 & \text{if } t < 1, \\ {[-1, 0]} & \text{if } t = 1, \\ 0 & \text{if } t > 1 \end{cases} \]
(note that this is a generalization of the derivative) and any element of this set is called a “subgradient”. For example,
\[ g(t) = \begin{cases} -1 & \text{if } t < 1, \\ 0 & \text{if } t = 1, \\ 0 & \text{if } t > 1 \end{cases} \]
(0.3)
is a subgradient because \(g(t) \in \partial\ell(t)\). Similarly, we can consider the subgradients of the objective \(F\) with respect to the parameters \(w\) and \(b\); since \(\frac{\partial t_n}{\partial w} = y_n x_n\) and \(\frac{\partial t_n}{\partial b} = y_n\), the chain rule gives
\[ \frac{\partial F}{\partial w} = w + C \sum_{n = 1}^{N} g(t_n)\, y_n x_n, \qquad \frac{\partial F}{\partial b} = C \sum_{n = 1}^{N} g(t_n)\, y_n \] (0.4)
where N is the number of data points used for training the SVM model.
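For reference, (0.3) and (0.4) can be written compactly in vectorized R; the sketch below assumes a feature matrix X whose rows are the \(x_n\), a label vector y with entries \(\pm 1\), and the illustrative function name subgrad_F:
subgrad_F <- function(w, b, X, y, C) {
  t_vals <- as.vector(X %*% w + b) * y    # t_n = y_n * (<w, x_n> + b)
  g <- ifelse(t_vals < 1, -1, 0)          # hinge-loss subgradient g(t_n), equation (0.3)
  grad_w <- w + C * colSums((g * y) * X)  # dF/dw, equation (0.4)
  grad_b <- C * sum(g * y)                # dF/db, equation (0.4)
  list(grad_w = grad_w, grad_b = grad_b)
}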
# Split the data into a 75% training set and a 25% test set
dt <- sort(sample(nrow(social), nrow(social) * 0.75))
train <- social[dt, ]
test <- social[-dt, ]
# Standardize the Age and EstimatedSalary features (zero mean, unit variance)
train$Age <- scale(train$Age)
train$EstimatedSalary <- scale(train$EstimatedSalary)
test$Age <- scale(test$Age)
test$EstimatedSalary <- scale(test$EstimatedSalary)
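Note that sample() draws a different random split on each run, so the numbers reported below will vary slightly between runs. If a reproducible split is desired, a seed can be fixed before the call to sample() above, e.g. (the value 123 is arbitrary):
set.seed(123)   # arbitrary seed; makes the 75/25 split reproducible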
Let
\[ \nabla F(x) = \Bigl[ \frac{\partial F}{\partial w},\ \frac{\partial F}{\partial b} \Bigr]^T \]
denote the subgradient of the objective \(F(x)\) at the point \(x = [w, b]^T\), where \(\frac{\partial F}{\partial w}\) and \(\frac{\partial F}{\partial b}\) are given by (0.4).
Implement the subgradient descent method (where we replace the gradient with the subgradient):
\[ x^{k+1} = x^k - \alpha \nabla F(x^k), \]
where \(\alpha\) is a stepsize you can choose. Using the subgradient descent method, compute the optimal parameters \(x_* = [w_*, b_*]\). Note that these optimal parameters should be computed using only the “training data”, not the “test data”.
gradientDesc <- function(Input, learn_rate, conv_threshold, C, max_iter){
  f_val_trend <- c()
  gradient_trend <- c()
  n <- nrow(Input)
  A <- as.matrix(Input[, 3:4])   # features: standardized Age and EstimatedSalary
  y <- as.matrix(Input[, 5])     # labels: Purchased
  y <- replace(y, y == 0, -1)    # recode 0/1 labels as -1/+1
  # initial parameters x = (w1, w2, b)
  w1 <- 0
  w2 <- 1
  b <- 0
  w <- matrix(c(w1, w2), nrow = 2, ncol = 1)
  x <- matrix(c(w1, w2, b), nrow = 3, ncol = 1)
  converged <- "not yet"
  iter <- 0
  f_val <- 0
  # g(t_n): hinge-loss subgradient (0.3) evaluated at every training point
  g_tn_func <- function(w){
    g_tn <- c()
    for (i in seq_len(n)) {
      x_n <- matrix(c(A[i, 1], A[i, 2]), nrow = 2, ncol = 1)
      t_n <- y[i] * (t(w) %*% x_n + b)
      g_tn[i] <- if (t_n < 1) -1 else 0
    }
    return(g_tn)
  }
  # iterate while the subgradient norm is above the threshold, the iteration
  # budget is not exhausted, and the objective has not blown up
  while ((converged == 'not yet') & (iter < max_iter) & (f_val < 1e+250)){
    g_vals <- g_tn_func(w)
    # subgradient of F, equation (0.4)
    delta_w1 <- w1 + sum(mapply(function(g, yi, a) C * g * yi * a, g_vals, y, A[, 1]))
    delta_w2 <- w2 + sum(mapply(function(g, yi, a) C * g * yi * a, g_vals, y, A[, 2]))
    delta_b <- C * sum(mapply(function(g, yi) g * yi, g_vals, y))
    delta_f <- matrix(c(delta_w1, delta_w2, delta_b), nrow = 3, ncol = 1)
    norm_delta <- norm(delta_f)  # default 1-norm of the subgradient vector
    converged <- if (norm_delta <= conv_threshold) "fine" else "not yet"
    # subgradient descent step x^{k+1} = x^k - alpha * grad F(x^k)
    x <- x - learn_rate * delta_f
    w <- matrix(x[1:2], nrow = 2, ncol = 1)
    w1 <- w[1]
    w2 <- w[2]
    b <- x[3]
    # objective value (0.2) at the new iterate
    f_val <- 0.5 * (w1^2 + w2^2) +
      C * sum(mapply(function(yi, a1, a2) max(0, 1 - yi * (w1 * a1 + w2 * a2 + b)),
                     y, A[, 1], A[, 2]))
    iter <- iter + 1
    f_val_trend[iter] <- f_val
    gradient_trend[iter] <- norm_delta
  }
  print("Final x parameters:")
  print(x)
  title <- paste("learning rate a =", learn_rate)
  plot(f_val_trend, main = title, type = "l", xlab = "Iteration", ylab = "loss value")
  cat("Final norm(gradient):", tail(gradient_trend, n = 1))
  cat(" Total iterations:", iter)
}
# Call Function
learn_rate = 1e-5
conv_threshold = 0.01
C = 50
max_iter = 1000
gradientDesc(train, learn_rate, conv_threshold, C, max_iter)
## [1] "Final x parameters:"
## [,1]
## [1,] 1.618010
## [2,] 1.054904
## [3,] -0.917500
## Final norm(gradient): 69.53624 Total iterations: 1000
Now consider the decision function
\[ f(x) = \langle w_*, x\rangle + b_*. \]
If \(f(x) > 0\), we classify the input \(x\) as \(+1\) (Purchase); if \(f(x) < 0\), we classify it as \(-1\) (No Purchase). What percent of the test data can you classify correctly?
A_test <- as.matrix(test[, 3:4])   # standardized Age and EstimatedSalary of the test set
y_test <- as.matrix(test[, 5])     # true Purchased labels (0/1)
pred1 <- c()
# hard-coded parameters w_* (coef_w) and b_* (-0.793) obtained from training
coef_w <- matrix(c(2.012826, 1.036296), nrow = 1, ncol = 2)
for (i in seq_len(nrow(A_test))) {
  x_n <- matrix(c(A_test[i, 1], A_test[i, 2]), nrow = 2, ncol = 1)
  f_n <- coef_w %*% x_n - 0.793    # decision value f(x_n) = <w_*, x_n> + b_*
  pred1[i] <- ifelse(f_n < 0, 0, 1)
}
# count how many test points are classified correctly
accuracy <- 0
for (i in seq_len(nrow(y_test))) {
  if (y_test[i] == pred1[i]) {
    accuracy <- accuracy + 1
  }
}
print(accuracy / nrow(y_test))
## [1] 0.84
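Equivalently, the prediction loop above can be written in vectorized form (a compact alternative sketch; it was not used to produce the numbers reported here):
# vectorized predictions: sign of f(x_n) mapped to {0, 1}
pred_vec <- as.integer(A_test %*% t(coef_w) - 0.793 >= 0)
mean(pred_vec == y_test)   # proportion of correct test predictions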
Next, fill out the following table according to your model:

pred/actual | actual = 1 | actual = -1
pred = 1    |            |
pred = -1   |            |

The entries of this table count the number of occurrences of each case (how many times we predicted +1 while the actual label was -1, and so on). For example, if your table looks like this,

pred/actual | actual = 1 | actual = -1
pred = 1    | 4          | 1
pred = -1   | 2          | 3

it means that out of \(10\) data points, your classification was right \(4 + 3 = 7\) times and wrong \(3\) times (once you predicted \(+1\) but the actual was \(-1\), and twice you predicted \(-1\) but the actual was \(+1\)). This table is also known as the “confusion matrix”. What happens to the confusion matrix if you use C = 100? What happens if you use C = 1? Comment on the effect of C on the entries of the confusion matrix. What choice of C gives you the best classification accuracy on the test data?
table(pred1,y_test)
## y_test
## pred1 0 1
## 0 60 12
## 1 4 24
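As a quick check, the overall test accuracy can be read off this confusion matrix directly (using the pred1 and y_test objects defined above):
cm <- table(pred1, y_test)   # confusion matrix on the test set
sum(diag(cm)) / sum(cm)      # proportion of correct predictions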
Since \(C\) multiplies the classification-error term in (0.1), a large value such as C = 100 puts most of the weight on fitting the training data and little on the margin (regularization) term, so the model tends to overfit the training data; because the confusion matrix is built on unseen data, this can show up as additional false positives and false negatives. Conversely, a small value such as C = 1 regularizes heavily and tends to underfit the training data, giving a less informative model and again misclassified entries in the confusion matrix.
Finally, we plot the training and test data together with the decision boundary \(\langle w_*, x\rangle + b_* = 0\).
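The purple line in the plots below is this boundary rewritten as a line in the (Age, EstimatedSalary) plane; a short sketch of how its slope and intercept follow from the hard-coded \(w_*\) and \(b_*\) used in the prediction step:
# boundary w1*Age + w2*Salary + b = 0  <=>  Salary = -(b/w2) - (w1/w2)*Age
w_star <- c(2.012826, 1.036296)          # w_* from the prediction step
b_star <- -0.793                         # b_* from the prediction step
c(intercept = -b_star / w_star[2],       # approximately  0.765
  slope = -w_star[1] / w_star[2])        # approximately -1.942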
library(ggplot2)
# Training set: Salary vs Age coloured by Purchased, with the decision boundary in purple
train$Purchased <- factor(train$Purchased)
ggplot(train, aes(Age, EstimatedSalary, colour = Purchased)) + geom_point() + scale_color_manual(breaks = c("0", "1"), values = c("red", "green")) + ggtitle("Salary versus Age in the train set") + geom_abline(intercept = 0.765225, slope = -1.942327, color = 'purple', size = 1.2)
# Test set with the same decision boundary
test$Purchased <- factor(test$Purchased)
ggplot(test, aes(Age, EstimatedSalary, colour = Purchased)) + geom_point() + scale_color_manual(breaks = c("0", "1"), values = c("red", "green")) + ggtitle("Salary versus Age in the test set") + geom_abline(intercept = 0.765225, slope = -1.942327, color = 'purple', size = 1.2)