Optimization for Machine Learning

Homework 2

Dennis Duncan

RUID No. 046-00-9156

Please answer the following questions in electronic form and upload your files to the Blackboard site before the due date. Make sure to push the submit button. You can typeset your answers using MS Word (with MS Equation for mathematical formulas) or LaTeX, or you may simply handwrite your notes, scan them into a single PDF file, and upload it to Blackboard. Please also provide your code as a separate file; we may need to run your code.

Questions

Below are some (simplified versions of) objectives that arise frequently in machine learning. For simplicity, we provide the objectives in dimension two as a function of \(x = [x_1, x_2]^T\). Plot these functions and argue whether they are convex, concave, or non-convex. (A mathematical proof is not needed; the purpose is to visualize these functions and decide on their convexity by looking at their 3D visualizations.)

  1. \(f(x) = (1− x_1x_2)^2\)
library(ggplot2)
library(plotly)
# grid on which to evaluate f(x) = (1 - x1*x2)^2
x1 <- seq(0, 50, length = 500)
x2 <- seq(0, 50, length = 500)
fxa <- function(p1, p2) { (1 - p1 * p2)^2 }
zfxa <- outer(x1, x2, fxa)
fig <- plot_ly(x = x1, y = x2, z = zfxa, type = "surface")


# chord (red line) between the surface points (40, 50) and (50, 40);
# f takes the same value (1 - 2000)^2 at both endpoints, so the chord is horizontal
t <- seq(-1, 0, length = 20)
xl <- 50 + 10 * t
yl <- 40 - 10 * t
zl <- rep((1 - 50 * 40)^2, length(t))

fig <- fig %>% add_trace(x = xl, y = yl, z = zl, type = 'scatter3d', color = I('red'),
                         opacity = 1, mode = 'lines', line = list(width = 6))

fig

The function is not convex. The red chord between the surface points (40, 50) and (50, 40) passes below the surface: toward the middle of the segment the function value rises above the chord, which violates the chord condition for convexity.
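A quick midpoint check along that chord makes the violation concrete:

\[ f(40,50) = f(50,40) = (1 - 2000)^2 = 1999^2, \qquad f(45,45) = (1 - 2025)^2 = 2024^2 > 1999^2, \]

so the value at the midpoint exceeds the average of the endpoint values.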

  2. \(f(x) = \log(1 + \exp(-w^Tx))\) where \(w \in \mathbb{R}^2\) is a vector.
p1 <- seq(-100, 100, length = 1000)
p2 <- seq(-100, 100, length = 1000)
w  <- matrix(runif(2), 2, 1)   # random weight vector w
w1 <- w[1, 1]
w2 <- w[2, 1]
# logistic loss log(1 + exp(-w^T x)) evaluated on the grid
fxb <- function(x1, x2) { log(1 + exp(-(w1 * x1 + w2 * x2))) }
zfxb <- outer(p1, p2, fxb)
plot_ly(x = p1, y = p2, z = zfxb, type = "surface")

The scalar function \(z \mapsto \log(1 + e^{-z})\) (the logistic loss) is convex, and composing it with the affine map \(z = w^Tx\) preserves convexity, so \(f\) is convex. The plot agrees with this: the surface consists of a nearly flat region and a nearly planar rising region joined by a smooth upward bend, and every chord lies on or above the surface.
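The convexity of the scalar function can be checked directly from its second derivative,

\[ \frac{d^2}{dz^2}\,\log(1 + e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2} > 0 \quad \text{for all } z, \]

and the affine change of variables \(z = w^Tx\) does not affect convexity.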

  3. \(f(x) = \max(a_2 \max(a_1 x_1, b_1), b_2)\) where \(a_1, a_2, b_1, b_2\) are real scalars.
x1 <- seq(-1, 10, length = 50)
# random nonnegative scalars a1, a2, b1, b2
a1 <- runif(1, min = 0, max = 1)
a2 <- runif(1, min = 0, max = 1)
b1 <- runif(1, min = 0, max = 1)
b2 <- runif(1, min = 0, max = 1)

# f(x) = max(a2 * max(a1*x, b1), b2), the nested maxima from the problem statement
fxc <- function(x) {
  max(a2 * max(a1 * x, b1), b2)
}

y <- sapply(x1, fxc)
plot(x1, y, type = "l")

Affine functions are convex, the pointwise max of convex functions is convex, and multiplying by the nonnegative scalar \(a_2\) preserves convexity, so the nested maxima give a convex, piecewise-linear function. The above graph confirms this: the curve is flat where the constants dominate and bends upward at a single kink, so every chord lies on or above it.

  4. \(f(x) = \|Ax - b\|^2 + \lambda L_\delta(x)\)

where \(\lambda > 0\) is a regularization parameter, \(\delta > 0\) is a parameter, and \[ L_\delta (x) = \begin{cases} \frac{1}{2}\|x\|^2, & \|x\| \le \delta,\\ \delta\left(\|x\| - \frac{1}{2}\delta\right), & \text{otherwise,} \end{cases} \] is called the “Huber loss”. This function arises in robust linear regression problems.

# grid and a random right-hand side b for the objective ||Ax - b||^2 + lambda * L_delta(x)
p1 <- seq(-100, 100, length = 500)
p2 <- seq(-100, 100, length = 500)
b  <- matrix(rnorm(2), 2, 1)
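The evaluation and plotting code for this objective is not shown above; a minimal sketch, assuming a fixed random 2×2 matrix A and the illustrative parameter choices lambda = 1 and delta = 5, is:

A <- matrix(rnorm(4), 2, 2)                     # assumed design matrix
lambda <- 1                                     # assumed regularization parameter
delta  <- 5                                     # assumed Huber parameter
huber <- function(r) ifelse(r <= delta, 0.5 * r^2, delta * (r - 0.5 * delta))
fxd <- function(x1, x2) {
  r1 <- A[1, 1] * x1 + A[1, 2] * x2 - b[1]      # first component of Ax - b
  r2 <- A[2, 1] * x1 + A[2, 2] * x2 - b[2]      # second component of Ax - b
  r1^2 + r2^2 + lambda * huber(sqrt(x1^2 + x2^2))
}
zfxd <- outer(p1, p2, fxd)
plot_ly(x = p1, y = p2, z = zfxd, type = "surface")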

The function is convex. The least-squares term \(\|Ax - b\|^2\) is convex, the Huber loss is convex, and a sum of convex functions with weight \(\lambda > 0\) is convex. With the graph rotated and rescaled to examine the bottom of the surface, we can see a pronounced cup.

  5. \(f(x) = x_1\log(x_1) + x_2\log(x_2)\) for \(x_1 > 0\) and \(x_2 > 0\)
x1 <- seq(1, 50, length = 50)
x2 <- seq(1, 50, length = 50)

# f(x) = x1*log(x1) + x2*log(x2) (the negative entropy), evaluated on the grid
entropy <- function(x1, x2) {
  x1 * log(x1) + x2 * log(x2)
}

zEntropy <- outer(x1, x2, entropy)
fig <- plot_ly(x = x1, y = x2, z = zEntropy, type = "surface")



# chord (red line) between the surface points (1, 1, f(1,1) = 0) and (20, 20, f(20,20))
t <- seq(-1, 0, length = 20)
xl <- 1 - 19 * t
yl <- 1 - 19 * t
zl <- -entropy(20, 20) * t

fig <- fig %>% add_trace(x = xl, y = yl, z = zl, type = 'scatter3d', color = I('red'),
                         opacity = 1, mode = 'lines', line = list(width = 6))

fig

This function is convex. The partial red chord between \((1, 1)\) and \((20, 20)\) lies above the bowl-shaped surface, and since \(f\) is symmetric in \(x_1\) and \(x_2\), the mirror-image chord behaves the same way.
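This also follows from the second derivatives: since \(\frac{d^2}{dx^2}\, x\log x = \frac{1}{x}\), the Hessian is

\[ \nabla^2 f(x) = \begin{pmatrix} 1/x_1 & 0 \\ 0 & 1/x_2 \end{pmatrix} \succ 0 \quad \text{for } x_1, x_2 > 0, \]

so \(f\) is convex on the positive quadrant.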

  6. \(f(x) = -\log\det \begin{pmatrix} x_1 & x_2\\ x_2 & x_1 \end{pmatrix}\) for \((x_1,x_2)\) such that \(x_1 > |x_2| \ge 0\). Note that the determinant of a 2×2 matrix is given by the formula \(\det \begin{pmatrix} a & b\\ c & d \end{pmatrix} = ad - bc\).
x1 <- seq(20, 40, length = 100)
x2 <- seq(1, 19, length = 100)

# f(x) = -log det of the matrix [x1 x2; x2 x1], i.e. -log(x1^2 - x2^2)
logde <- function(x1, x2) {
  -log(x1 * x1 - x2 * x2)
}

zdet <- outer(x1, x2, logde)
fig <- plot_ly(x = x1, y = x2, z = zdet, type = "surface")


## chord (red line) between the surface points (20, 1, f(20,1)) and (40, 19, f(40,19))
t <- seq(-1, 0, length = 20)
xl <- 20 - 20 * t
yl <- 1 - 18 * t
zl <- logde(20, 1) + (logde(20, 1) - logde(40, 19)) * t

fig <- fig %>% add_trace(x = xl, y = yl, z = zl, type = 'scatter3d', color = I('red'),
                         opacity = 1, mode = 'lines', line = list(width = 6))

fig

The function is convex on this domain, as the red chord lying above the surface suggests. This is expected: \(-\log\det\) is convex over positive definite matrices, the map \((x_1, x_2) \mapsto \begin{pmatrix} x_1 & x_2\\ x_2 & x_1 \end{pmatrix}\) is linear, and the matrix is positive definite exactly when \(x_1 > |x_2|\).
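A short direct check (optional, since the problem only asks for the picture): writing \(u = x_1^2 - x_2^2 > 0\), the Hessian of \(f(x) = -\log u\) is

\[ \nabla^2 f(x) = \frac{2}{u^2} \begin{pmatrix} x_1^2 + x_2^2 & -2x_1x_2 \\ -2x_1x_2 & x_1^2 + x_2^2 \end{pmatrix}, \]

which has positive trace and determinant \(4(x_1^2 - x_2^2)^2/u^4 = 4/u^2 > 0\), so it is positive definite throughout the domain.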

  7. Sigmoid function:

\[ \sigma(x) = \frac{1}{1 + e^{-\|x\|}} \]

p1 <- seq(10, 100, length = 500)
p2 <- seq(10, 100, length = 500)

# sigma(x) = 1 / (1 + exp(-||x||)) evaluated on the grid
sigmoid <- function(x1, x2) {
  1 / (1 + exp(-sqrt(x1^2 + x2^2)))
}

zsigmoid <- outer(p1, p2, sigmoid)
fig <- plot_ly(x = p1, y = p2, z = zsigmoid, type = "surface")

## chord (red line) between the surface points (100, 10) and (10, 100);
## both endpoints have the same norm, so sigma takes the same value at each
t <- seq(-1, 0, length = 20)
xl <- 100 + 90 * t
yl <- 10 - 90 * t
zl <- rep(sigmoid(10, 100), length(t))

fig <- fig %>% add_trace(x = xl, y = yl, z = zl, type = 'scatter3d', color = I('red'),
                         opacity = 1, mode = 'lines', line = list(width = 6))

fig

We know that the scalar sigmoid is monotonic and has a bell-shaped first derivative; it is convex for arguments below its inflection point (here 0) and concave for arguments above it. Since \(\|x\| \ge 0\), the plotted surface lies entirely in the concave, saturating regime and flattens out toward 1, so \(\sigma(\|x\|)\) is not a convex function (a convex function that is bounded above everywhere would have to be constant).

## 2. Consider the following polynomial in dimension one:

\[ f(x) = x^4 - x^3 - x^2 + 3x - 1, \quad x \in \mathbb{R} \]

Plot this function on the [−2, 2] interval. Find its minimum using the Newton-Raphson method. Mark your iterations on the plot of the function. Is the performance of your algorithm sensitive to the initialization? Is it possible that your algorithm does not converge and why?

x1 <- seq(-2, 2, length = 100)

# objective f, its first derivative g, and its second derivative h
f <- function(x) {
  x^4 - x^3 - x^2 + 3 * x - 1
}

g <- function(x) {
  4 * x^3 - 3 * x^2 - 2 * x + 3
}

h <- function(x) {
  12 * x^2 - 6 * x - 2
}

plot(x1, f(x1), type = "l", ylim = range(c(-5, 10)))
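The iterations themselves can be run and marked on the plot with a short loop; a minimal sketch, using the starting point \(x^0 = 1\) from below and an arbitrary tolerance:

x <- 1                                   # initial guess x^0
iterates <- x
for (k in 1:20) {
  x_new <- x - g(x) / h(x)               # Newton-Raphson step on f'(x) = 0
  iterates <- c(iterates, x_new)
  if (abs(x_new - x) < 1e-8) break
  x <- x_new
}
# mark the iterates on the existing plot; with x^0 = 1 they bounce around at first
# (see the discussion of sensitivity and non-convergence below)
points(iterates, f(iterates), col = "red", pch = 19)
text(iterates, f(iterates), labels = seq_along(iterates) - 1, pos = 3)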

The above function is not convex. We apply the Newton-Raphson method to minimize \(f(x)\) by finding a root of \(g(x) = \nabla f(x) = f'(x)\).

Therefore,

\(x^{k+1} = x^{k} - \frac{f'(x^{k})}{f''(x^{k})}\)

Accordingly,

\(f'(x) = 4x^3 - 3x^2 - 2x + 3\) and \(f''(x) = 12x^2 - 6x - 2\)

\(x^0 = 1\)

\(x^1 = x^0 - \frac{f'(x^0)}{f''(x^0)} = 1 - \frac{2}{4} = 0.5\)

\(x^2 = x^1 - \frac{f'(x^1)}{f''(x^1)} = 0.5 - \frac{1.75}{-2} = 1.375\)

The performance is sensitive to the initialization, and the algorithm may fail to converge: because the function is not convex, \(f''\) can be negative (as at \(x^1 = 0.5\), where \(f''(x^1) = -2\)), so a Newton step can move uphill or overshoot, and the iterates can swing back and forth to the right instead of settling at the minimizer. Starting in the region around the minimizer where \(f''(x) > 0\), the iterations converge quickly.

  1. How does Newton's method work for solving the linear regression problem from Homework 1? Hint: In Homework 1, the objective is of the form \(f(x) = \frac{1}{2}\|y - Ax\|^2\). We have the formulas \[ \nabla f(x) = -A^T(y - Ax), \qquad \nabla^2 f(x) = A^TA. \] The Newton iterations are of the form \[ x^{k+1} = x^k - \nabla^2 f(x^k)^{-1}\nabla f(x^k). \]

Because the objective is quadratic, the Hessian \(A^TA\) is constant and the quadratic model used by Newton's method is exact, so Newton's method reaches the least-squares solution \(x^\star = (A^TA)^{-1}A^Ty\) in a single iteration from any starting point (provided \(A^TA\) is invertible), as discussed and shown in our class lecture.
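A quick numerical check of this one-step convergence (the small random A and y below are only for illustration):

set.seed(1)
A <- matrix(rnorm(20), 10, 2)                 # assumed small random design matrix
y <- rnorm(10)                                # assumed random targets
x0 <- c(0, 0)                                 # arbitrary starting point
H  <- t(A) %*% A                              # Hessian, constant for a quadratic objective
gr <- -t(A) %*% (y - A %*% x0)                # gradient at x0
x_newton <- x0 - solve(H, gr)                 # one Newton step
max(abs(x_newton - solve(H, t(A) %*% y)))     # difference from the least-squares solution (~ 0)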