Optimization in Machine Learning

Author

徐帅铭

1. Optimization

1 (a)

Consider minimizing the following univariate function

f(x)=exp(1.5x)-3(x+6)^2-0.05x^3

Write a function `f_obj(x)` that calculates this objective function.

Plot this function on the domain x\in[-40,7].

f_obj <- function(x){
  exp(1.5 * x) - 3 * (x + 6)^2 - 0.05 * x^3
}

curve(f_obj, from = -40, to = 7,
      lwd = 2, col = "steelblue",
      xlab = "x", ylab = "f(x)",
      main = "Objective function")

1 (b)

Use the optim() function to solve this optimization problem. Use method = "BFGS" . Try two initial points: -15 and 0. Report are the solutions you obtained different? Why?

f_obj <- function(x){
  exp(1.5 * x) - 3 * (x + 6)^2 - 0.05 * x^3
}

res1 <- optim(par = -15, fn = f_obj, method = "BFGS")
res2 <- optim(par = 0, fn = f_obj, method = "BFGS")

res1$par;res1$value

[1] -32.64911

[1] -390.3858

res2$par;res2$value

[1] 2.349967

[1] -175.8626

两个解不同。原因是该目标函数不是凸函数，存在多个局部极小点。BFGS 属于局部优化方法，最后收敛到哪个点，与初始值所在的位置有关。

1 (c)

Consider a bi-variate function to minimize

f(x,y)=3x^2 + 2y^2 - 4xy + 6x - 5y + 7

Derive the partial derivatives of this function with respect to x and y. And solve for the analytic solution of this function by applying the first-order conditions.

一阶偏导为

\frac{\partial f}{\partial x} = 6x - 4y + 6

\frac{\partial f}{\partial y} = 4y - 4x - 5

令一阶偏导为0，解得x=-0.5,y=0.75，则带回原函数，极小值为3.625

1 (d)

Check the second-order condition to verify that solution you obtained in the previous step is indeed a minimum.

二阶偏导为

\frac{\partial^2 f}{\partial^2 x} = 6

\frac{\partial^2 f}{\partial^2 y} = 4

\frac{\partial^2 f}{\partial x \partial y} =\frac{\partial^2 f}{\partial y \partial x} = -4

则Hessian矩阵为 H = \begin{bmatrix} 6 & -4 \\ -4 & 4 \end{bmatrix}

对角线均大于0

行列式：6×4-(-4)^2=24-16=8>0

则该矩阵正定，所以点(-1/2, 3/4)是全局最小点。

1 (e)

Use the optim() function to solve this optimization problem. Use method = "BFGS". Set your own initial point. Report the solutions you obtained. Does different choices of the initial point lead to different solutions? Why?

f2 <- function(par){
  x <- par[1]
  y <- par[2]
  3*x^2 + 2*y^2 - 4*x*y + 6*x - 5*y + 7
}

res3 <- optim(c(0, 0), f2, method = "BFGS")
res4 <- optim(c(10, -3), f2, method = "BFGS")
res5 <- optim(c(-8, 12), f2, method = "BFGS")
res3$par;res3$value

[1] -0.50  0.75

[1] 3.625

res4$par;res4$value

[1] -0.50  0.75

[1] 3.625

res5$par;res5$value

[1] -0.50  0.75

[1] 3.625

无论从哪个初始点出发，BFGS 都应收敛到同一个解，该目标函数是严格凸函数，只有唯一的全局最小点，因此初始值不会改变最终解。

2. Regression and Optimization with Huber Loss

When fitting linear regressions, outliers could significantly affect the fitting results. However, manually checking and removing outliers can be tricky and time consuming. Some regression methods address this problem by using a more robust loss function. For example, one such regression is to minimize the objective function \frac{1}{n}\sum_{i=1}^{n}\ell_{\delta}\left(y_i - x_i^{\top}\beta\right)

where the loss function \ell_{\delta} is the Huber Loss, defined as \ell_{\delta}(a)= \begin{cases} \frac{1}{2}a^2, & \text{if } |a|\le \delta \\ \delta\left(|a|-\frac{1}{2}\delta\right), & \text{o.w.} \end{cases} Here is a visualization that compares Huber loss with the \ell_{2} loss. We can see that the Huber loss assigns much less value when y_i - x_i^{T}\beta is more extreme (outliers)

# define the Huber loss function
  Huber <- function(a, delta = 1) ifelse(abs(a) <= delta, 0.5*a^2, delta*( abs(a) - 0.5*delta))
  
  # plot against L2
  x = seq(-4, 4, 0.01)
  plot(x, Huber(x), type = "l",
       xlab = "a", ylab = "Huber Loss",
       col = "darkorange", ylim = c(0, 8))
  lines(x, 0.5*x^2, col = "deepskyblue")
  legend(x = 0, y = 8, legend = c("Huber Loss", "OLS loss"), 
         col = c("darkorange", "deepskyblue"), lty = 1)

Use the following code to generate

# generate data from a simple linear model 
  set.seed(542)
  n = 150
  x = runif(n)
  X = cbind(1, x)
  y = X %*% c(0.5, 1) + rnorm(n)
  
  # create an outlier
  y[which.min(X[, 2])] = -30

2 (a)

Fit an OLS model with the regular \ell_{2} loss. Report your coefficients (do not report other information). Although this is only one set of samples, but do you expect this estimator to be biased based on how we set up the observed data? Do you expect the parameter \beta_1 to bias upwards or downwards? Explain your reason.

Hint: is the outlier pulling the regression line slope up or down?

fit_ols <- lm(y ~ x)
coef(fit_ols)

(Intercept)           x 
 -0.1253725   1.9036546

从理论上看，这个 OLS 估计会受到离群点影响而产生偏差。由于离群点出现在最小的 x 附近，且对应的 y 被强行设成很小的负值，它会把回归线左端明显往下拉，从而使回归线更陡。因此，斜率参数 \beta_1 预期会向上偏。

2 (b)

Define your own Huber loss function huberLoss(b, trainX, trainY) given a set of observed data with tuning parameter \delta = 1. Here, b is a p-dim parameter vector, trainX is a n \times p design matrix and trainY is the outcome. This function should return a scalar as the empirical loss. You can use our Huber function in your own code. After defining this loss function, use the optim() function to solve the parameter estimates. Finally, report your coefficients.

Use b = (0, 0) as the initial value.
Use BFGS as the optimization method.

huberLoss <- function(b, trainX, trainY, delta = 1){
  r <- as.vector(trainY - trainX %*% b)
  mean(Huber(r, delta = delta))
}

res_huber <- optim(
  par = c(0, 0),
  fn = huberLoss,
  trainX = X,
  trainY = y,
  delta = 1,
  method = "BFGS"
)

res_huber$par

[1] 0.7545940 0.6223551

得到的两个参数分别是 Huber 回归下的截距和斜率。与 OLS 相比，Huber 回归的斜率通常会更接近真实值 1，因为它降低了离群点对目标函数的支配作用。

2 (c)

We still do not know which method performs better in this case. Let’s use a simulation study to compare the two methods. Complete the following

Set up a simulation for 1000 times. At each time, randomly generate a set of observed data, but also force the outlier with our code y[which.min(X[, 2])] = -30.
Fit the regression model with \ell_2 loss and Huber loss, and record the slope variable estimates.
Make a side-by-side boxplot to show how these two methods differ in terms of the estimations. Which method seem to have more bias? and report the amount of bias based on your simulation. What can you conclude from the results? Does this match your expectation in part a)? Can you explain this (both OLS and Huber) with the form of loss function, in terms of what their effects are?

set.seed(123)
B <- 1000
ols_slope <- numeric(B)
huber_slope <- numeric(B)

for(b in 1:B){
  n = 150
  x = runif(n)
  X = cbind(1, x)
  y = X %*% c(0.5, 1) + rnorm(n)

  y[which.min(X[, 2])] = -30

  fit_ols <- lm(y ~ x)
  ols_slope[b] <- coef(fit_ols)[2]

  fit_huber <- optim(
    par = c(0, 0),
    fn = huberLoss,
    trainX = X,
    trainY = y,
    delta = 1,
    method = "BFGS"
  )
  huber_slope[b] <- fit_huber$par[2]
}

boxplot(ols_slope, huber_slope,
        names = c("OLS", "Huber"),
        col = c("skyblue", "orange"),
        ylab = "Slope estimates",
        main = "Comparison of slope estimates")

bias_ols <- mean(ols_slope) - 1
bias_huber <- mean(huber_slope) - 1

bias_ols

[1] 1.217242

bias_huber

[1] 0.05945632

mean(ols_slope)

[1] 2.217242

mean(huber_slope)

[1] 1.059456

在该模拟中，真实斜率为：\beta_1 = 1

因此偏差定义为：Bias = E(\hat{\beta})-\beta

也就是： Bias = 平均估计值 - 1 预期结果是：

OLS 的箱线图整体会明显偏高； Huber 的箱线图会更靠近 1； OLS 的偏差明显更大；两种方法都可能有一定偏差，但 Huber 通常小得多。结果解释

这与 2(a) 的直觉一致。因为 OLS 使用的是平方损失： \ell_2 \quad loss = \frac{1}{2} a^2

误差越大，惩罚增长越快，所以离群点会有特别大的影响。而 Huber loss 在 |a|>\delta 后只做线性惩罚，所以离群点不会把整条回归线拽得太厉害。

模拟结果表明，OLS 的斜率估计偏差更大，并且主要表现为向上偏；Huber 回归更稳健，估计更接近真实值 \beta_1 = 1。这说明在存在极端离群点时，Huber loss 比普通 OLS loss 更能减弱离群点对参数估计的影响。

3. Scaling and Coordinate Descent for Linear Regression

Scaling issue in the practice, we usually standardize each covariate/feature to mean 0 and standard deviation 1. Standardization is essential when we apply \ell_2 and \ell_1 penalty on the loss function, because if the covariates are with different scales, then they are penalized differently. Without prior information, we should prevent that from happening. Besides, scaling the data also help to make the optimization more stable, since the step size in many descent algorithms could be affected by the scale.

In practice, after obtaining the coefficients fitted with scaled data, we want to recover the original coefficients of the unscaled data. For this question, we use the following intuition: \frac{Y-\bar{Y}}{\mathrm{sd}_y} = \sum_{j=1}^{p} \frac{X_j-\bar{X}_j}{\mathrm{sd}_j}\gamma_j

Y = \bar{Y} - \underbrace{\sum_{j=1}^{p}\bar{X}_j\frac{\mathrm{sd}_y\cdot\gamma_j}{\mathrm{sd}_j}}_{\beta_0} + \sum_{j=1}^{p} X_j \underbrace{\frac{\mathrm{sd}_y\cdot\gamma_j}{\mathrm{sd}_j}}_{\beta_j},

In this equation, the first line is the model fitted with scaled and centered data. And we obtain the fitted parameters as \gamma_j’s
In the second line, the coefficients \beta_j’s for the original data is recovered.
When fitting the scaled and centered data, no intercept term is needed.

Based on this relationship, we perform the following when fitting a linear regression:

Center and scale both X (column-wise) and Y and denote the processed data as \frac{Y-\bar{Y}}{\mathrm{sd}_y} and \frac{X_j-\bar{X}_j}{\mathrm{sd}_j} in the above formula. Make sure that the standard error of each variable is 1 after scaling. This means that you should use N, not N-1 when calculating the estimation of variance.
Fit a linear regression using the processed data based on the no-intercept model, and obtain the parameter estimates \gamma_j’s.
Recover the original parameters \beta_0 and \beta_j’s.

Use the following code to generate your data:

library(MASS)
  set.seed(10)
  n = 20
  p = 3
  
  # covariance matrix
  V = matrix(0.3, p, p)
  diag(V) = 1
  
  # generate data
  X_org = as.matrix(mvrnorm(n, mu = rep(0, p), Sigma = V))
  true_b = c(1, 2, 0)
  y_org = X_org %*% true_b + rnorm(n)

3 (a)

Fit an OLS estimator with the original data Y_org and X_org by lm(). Also, fit another OLS with scaled data by lm(). Report the coefficients/parameters. Then, transform coefficients from the second approach back to its original scale, and match with the first approach. Summarize your results in a single table: The rows should contain three methods: OLS, OLS Scaled, and OLS Recovered, and there should be four columns that represents the coefficients for each method.

# 自定义用 N 作分母的标准差
sdN <- function(z){
  sqrt(mean((z - mean(z))^2))
}

# 原始 OLS
dat_org <- data.frame(y = y_org, X_org)
fit_org <- lm(y ~ ., data = dat_org)
coef_org <- coef(fit_org)

# 标准化
x_mean <- colMeans(X_org)
x_sd   <- apply(X_org, 2, sdN)
y_mean <- mean(y_org)
y_sd   <- sdN(y_org)

X_sc <- sweep(X_org, 2, x_mean, "-")
X_sc <- sweep(X_sc, 2, x_sd, "/")
y_sc <- (y_org - y_mean) / y_sd

# 标准化后的无截距 OLS
dat_sc <- data.frame(y = y_sc, X_sc)
fit_sc <- lm(y ~ . - 1, data = dat_sc)
gamma_hat <- coef(fit_sc)

# 恢复原尺度
beta_rec <- y_sd * gamma_hat / x_sd
beta0_rec <- y_mean - sum(x_mean * beta_rec)

# 汇总表
res_tab <- rbind(
  "OLS" = c(coef_org[1], coef_org[-1]),
  "OLS Scaled" = c(0, gamma_hat),
  "OLS Recovered" = c(beta0_rec, beta_rec)
)

colnames(res_tab) <- c("Intercept", "Beta1", "Beta2", "Beta3")
round(res_tab, 6)

              Intercept    Beta1    Beta2    Beta3
OLS           -0.516594 0.659289 2.182843 0.249836
OLS Scaled     0.000000 0.231331 0.811403 0.067537
OLS Recovered -0.516594 0.659289 2.182843 0.249836

原始数据直接用 lm() 拟合；标准化数据用无截距模型拟合；再通过

\beta_j = \frac{sd_y}{sd_j} γ_j

和

\beta_0 = \bar{Y}-\sum\bar{X}_j \beta_j

恢复到原尺度。恢复后的结果应与原始 OLS 的系数基本一致，只会有极小的数值误差。

3 (b)

Instead of using the lm() function, write your own coordinate descent code to solve the scaled problem. This function will be modified and used next week when we code the Lasso method. Complete the following steps:

i) Given the loss function L(\beta)=\lVert y-X\beta\rVert^2 or \sum_{i=1}^{n}\left(y_i-\sum_{j=0}^{p}x_{ij}\beta_j\right)^2 , derive the updating/calculation formula of coefficient \beta_j, when holding all other covariates fixed. You must use LaTex to typeset your derivation with proper explaination of notations. Write down the formula (in terms of y, x and \beta’s) of residual r before and after this update. Based on our lecture, how to make the update of r computationally efficient?

推导坐标下降法中，第 j个系数的更新公式；写出更新前后的残差公式；并说明怎样让更新更高效。

OLS 的损失函数为 L(\beta)=\|y-X\beta\|_2^2 =\sum_{i=1}^{n}\left(y_i-\sum_{j=0}^{p}x_{ij}\beta_j\right)^2.

坐标下降法的基本思想是：每次只更新一个系数，其余系数暂时固定。

固定除 \beta_j 之外的所有系数，定义第 j 个坐标对应的部分残差为 r^{(j)} = y-\sum_{\ell\neq j}x_\ell\beta_\ell.

则损失函数可写为 L(\beta_j)=\|r^{(j)}-x_j\beta_j\|_2^2.

对 \beta_j 求导，得 \frac{dL}{d\beta_j} = -2x_j^\top\left(r^{(j)}-x_j\beta_j\right).

令导数等于 0，即 -2x_j^\top\left(r^{(j)}-x_j\beta_j\right)=0.

因此有 x_j^\top r^{(j)}-x_j^\top x_j\,\beta_j=0,

从而得到第 j 个系数的更新公式： \beta_j = \frac{x_j^\top r^{(j)}}{x_j^\top x_j}.

这就是 OLS 下坐标下降法的单坐标更新公式。

若每次都重新计算 r^{(j)} = y-\sum_{\ell\neq j}x_\ell\beta_\ell, 计算代价较高。

更高效的做法是维护当前总残差 r=y-X\beta.

在更新第 j 个系数前，先将旧的第 j 项贡献加回去： r \leftarrow r+x_j\beta_j.

然后利用更新公式计算新的 \beta_j： \beta_j \leftarrow \frac{x_j^\top r}{x_j^\top x_j}.

再将新的第 j 项贡献减去： r \leftarrow r-x_j\beta_j.

因此，坐标下降法中第 j 个系数的更新公式为 \beta_j = \frac{x_j^\top r^{(j)}}{x_j^\top x_j}, 其中 r^{(j)} = y-\sum_{\ell\neq j}x_\ell\beta_\ell.

若通过维护当前总残差 r=y-X\beta，则可以通过“先加回旧贡献，再更新，再减去新贡献”的方式实现更高效的坐标更新，而不必在每一步都从头计算整条残差向量。

ii) Implement this coordinate descent method with your own code to solve OLS with the scaled data. Print and report your scaled coefficients (no need to recover the original version) and compare with the result from the previous question.

Do not use functions from any additional library.
Start with a vector \beta = 0.
Run your coordinate descent algorithm for a maximum of maxitr = 100 iterations (while each iteration will loop through all variables). However, stop your algorithm if the \beta value of the current iteration is sufficiently similar to the previous one, i.e., \left\|\beta^{(k)}-\beta^{(k-1)}\right\|_{1}\le \mathrm{tol}. Set tol = 1e-7 where \|\cdot\|_1 is the L1 distance.

Make a plot to analyze the convergence of the coordinate descent. On the x-axis, we use the number of iteration. On the y-axis, use log(Loss_k - trueLoss). Here, trueLoss is the emperical loss based on the true optimizor, which we can simply use the solution from the lm() function (the scaled version). The loss_k is the loss function at the begining of the k-th iteration (Keep in mind that within each iteration, we will loop over all \beta_j). If this plot displays a stragiht line, then we say that this algorithm has a linear convergence rate. Of course, this is at the log scale.

# X_sc 和 y_sc 已在 3(a) 中生成

coord_desc_ols <- function(X, y, maxitr = 100, tol = 1e-7){
  n <- nrow(X)
  p <- ncol(X)

  beta <- rep(0, p)
  loss_path <- numeric(maxitr)

  # 当前残差 r = y - X beta
  r <- as.vector(y - X %*% beta)

  for(k in 1:maxitr){
    beta_old <- beta

    # 记录本轮开始时的损失
    loss_path[k] <- sum(r^2)

    for(j in 1:p){
      # 先加回旧贡献
      r <- r + X[, j] * beta[j]

      # 更新 beta_j
      beta[j] <- sum(X[, j] * r) / sum(X[, j]^2)

      # 再减去新贡献
      r <- r - X[, j] * beta[j]
    }

    # 收敛判断
    if(sum(abs(beta - beta_old)) <= tol){
      loss_path <- loss_path[1:k]
      break
    }
  }

  list(beta = beta, loss_path = loss_path, iter = length(loss_path))
}

# 坐标下降求解
cd_fit <- coord_desc_ols(X_sc, y_sc, maxitr = 100, tol = 1e-7)
cd_fit$beta

[1] 0.23133130 0.81140337 0.06753725

# 与 lm 的标准化结果比较
fit_sc <- lm(y_sc ~ X_sc - 1)
coef(fit_sc)

     X_sc1      X_sc2      X_sc3 
0.23133128 0.81140337 0.06753726

# true loss
true_beta <- coef(fit_sc)
trueLoss <- sum((y_sc - X_sc %*% true_beta)^2)

# 作图
plot(log(cd_fit$loss_path - trueLoss),
     type = "b", pch = 16, col = "steelblue",
     xlab = "Iteration",
     ylab = "log(Loss_k - trueLoss)",
     main = "Convergence of coordinate descent")# 这里默认 X_sc 和 y_sc 已在 3(a) 中生成

coord_desc_ols <- function(X, y, maxitr = 100, tol = 1e-7){
  n <- nrow(X)
  p <- ncol(X)

  beta <- rep(0, p)
  loss_path <- numeric(maxitr)

  # 当前残差 r = y - X beta
  r <- as.vector(y - X %*% beta)

  for(k in 1:maxitr){
    beta_old <- beta

    # 记录本轮开始时的损失
    loss_path[k] <- sum(r^2)

    for(j in 1:p){
      # 先加回旧贡献
      r <- r + X[, j] * beta[j]

      # 更新 beta_j
      beta[j] <- sum(X[, j] * r) / sum(X[, j]^2)

      # 再减去新贡献
      r <- r - X[, j] * beta[j]
    }

    # 收敛判断
    if(sum(abs(beta - beta_old)) <= tol){
      loss_path <- loss_path[1:k]
      break
    }
  }

  list(beta = beta, loss_path = loss_path, iter = length(loss_path))
}

# 坐标下降求解
cd_fit <- coord_desc_ols(X_sc, y_sc, maxitr = 100, tol = 1e-7)
cd_fit$beta

[1] 0.23133130 0.81140337 0.06753725

# 与 lm 的标准化结果比较
fit_sc <- lm(y_sc ~ X_sc - 1)
coef(fit_sc)

     X_sc1      X_sc2      X_sc3 
0.23133128 0.81140337 0.06753726

# true loss
true_beta <- coef(fit_sc)
trueLoss <- sum((y_sc - X_sc %*% true_beta)^2)

# 作图
plot(log(cd_fit$loss_path - trueLoss),
     type = "b", pch = 16, col = "steelblue",
     xlab = "Iteration",
     ylab = "log(Loss_k - trueLoss)",
     main = "Convergence of coordinate descent")

按照坐标下降法实现后，得到的标准化系数应与 lm(y_sc ~ X_sc - 1) 的结果几乎一致，只存在很小的数值误差。说明自己写的坐标下降算法是正确的。

在收敛图中，若

log(loss_k - trueLoss)

大致呈现一条下降的直线，则说明该算法在对数尺度下近似表现为线性收敛。