During my undergrad I spent a ton of time working on convex optimization. In economics, this involves using a lot of linear algebra and calculus to prove that an model, under certain assumptions, has a local, or global minimum. This is used to validate a lot of economic models.

I didn’t really see how this was applicable because I didn’t have any interest in being a theoretical economist, and frankly, I doubt I’m smart enough to be anyways.

Still, I struggled through this because I really wanted an extra ‘H’ after my BA and because I’m a bit of an academic masochist.

However, I never foresaw myself using it as part of my job.

Now that I’m completing a Masters of Data Science, I use stuff like this frequently, the only difference is that I use it for analyzing data instead of validating theoretical models.

I like it a lot more now.

Ordinary Least Squares

Most people familiar with multiple linear regression know the linear algebra formula:

\[ m = (X^TX)^{-1}X^Ty \]

Where X is an n x m matrix and y is a m dimensional vector where m is the number of observations and n is the number of variables. It is important to note that if you have x features there will be n = x + 1 variables because the intercept variables are not included in x.

This formula will always produce a BLUE estimator.

For more complex algorithms like random forests and neural nets, something called gradient decent is used.

This can also be applied to linear regression. While gradient decent does not guarantee finding the global minimum, it will find a local one in all algorithms. However, because the gradient in OLS is strictly convex, it will find the global one because no local ones exist.

What is Gradient Decent?

Gradient decent is the processes of minimizing a cost function by taking the derivative of a matrix (called a gradient) and minimizing it’s values (decent).

Imagine that instead of doing matrix multiplication to come up with a random hypothesis:

\[ h_\theta = \theta^Tx \]

Where \[\theta\] is a vector of n coefficients and x is the value of these weights.

These weights need to be set to minimize the loss function:

\[ J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_ \theta(x_i) - y_i)^2 \\ \frac{\partial J(\theta_j)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1} ^m (h_ \theta(x_i) - y_i) \] The constant 2 above is there to make taking the derivative easier. Minimizing f(a), 2f(a) or 0.5f(a) will result in the same minimizing values.

People familiar with OLS wee recognize the interior term as the squared loss function.

We know have the formulas we need to run gradient decent.

\[ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1} ^m (h_ \theta(x_i) - y_i) \]

Here \(\alpha\) is the learning rate, or how large of a change is made. Also note, the := symbol stands for update.

In effect, we are updating the parameter \(\theta\) by some small move of \(\alpha\) in the direction indicated by the derivative \(\frac{\partial J(\theta_j)}{\partial \theta_j}\)

Thus, if there derivative is negative, there will be a small move in the positive direction, and visa versa.

If this is iterated enough times, we will approach the minimum. It is also interesting to note, that the closer the derivative gets to zero, the smaller the effect of \(\alpha\) will be. Thus, the rate of movement slows, allowing for a convergence on the true minimum.

In this graph, there is a value for theta on the x axis and the loss value on the right. The correct model is minimized at the * value. The line in red is the value of the mean of the predicted vs actual numbers \(\sum_{i=1} ^m (h_ \theta(x_i) - y_i)\).

If the prediction the predicted value is greater than the actual, this derivative will be positive, as shown in red. The model is then updated by \(- \alpha * mean\)

This results in a step downwards, dark green, and a new derivative is taken, the pink line. This process is repeated. As the derivative (mean) grows smaller, there are smaller steps taken.

If the mean was negative, then the weight would be increased.

This is repeated until the changes become very small.

As stated earlier, in linear models, this will find the global minimum, but in different models, it could get stuck at a local minimum.

Coding.

This math is a bit complicated, it might be easier to understand done in code.

Cost function

Cost <- function(X, y, theta){
  m <- length(y)
  J = 1/ (2*m) * sum((X %*% theta- y)^2)
  return(J)
}

Gradient Decent

gradDecent <- function(X, y, theta, alpha, iterations){
  m <- length(y)
  lossVal <- rep(0, iterations)
  for (i in 1:iterations){
    theta <- theta - alpha*(1/m)*(t(X)%*%(X%*%theta - y))
    #theta <- theta - alpha * (1/m) * (t(X) %*% (X %*% theta - y))
    # This looks different from the mathmatical formula we used. 
    # This is because we are using a vectorized process instead of a for loop 
    # to update each value.
    
    # Intuitivly, this can be though of as  (X %*% theta - y) being the loss function
    # which is a 1 x n matrix. Multiplying it by t(X) means that we will n x 1 column vector.
    lossVal[i] = Cost(X, y, theta)
  }
  Jplot <- qplot(x = 1:iterations, y = lossVal)
  ans <- list(theta, Jplot)
  return(ans)
}

Training

library(car)
library(tidyverse)
library(caret)
library(knitr)
df <- Prestige
df <- df %>% drop_na()
df %>% head() %>% kable()
education income women prestige census type
gov.administrators 13.11 12351 11.16 68.8 1113 prof
general.managers 12.26 25879 4.02 69.1 1130 prof
accountants 12.77 9271 15.70 63.4 1171 prof
purchasing.officers 11.42 8865 9.11 56.8 1175 prof
chemists 14.62 8403 11.68 73.5 2111 prof
physicists 15.64 11030 5.13 77.6 2113 prof

First we need to change the dummy factor variable type to numeric dummies. Note, we have to have only two dummies vs the three factors. We will also remove census.

dummies <- model.matrix( ~ type - 1, data=df) %>% 
  as_data_frame() %>% 
  select(-typewc)
dummies %>% head()
## # A tibble: 6 x 2
##   typebc typeprof
##    <dbl>    <dbl>
## 1      0        1
## 2      0        1
## 3      0        1
## 4      0        1
## 5      0        1
## 6      0        1
#df$type <- NULL
df <- cbind(df, dummies)
df$type <- NULL
df$census <- NULL

Now we have cleaned data.

We still need to separate X and y. We also have to scale X and y in order to get consistent results. Gradient decent can be screwed up with features that differ in scale.

test <- df %>% scale() %>% as.data.frame()
X <- test %>% select(-prestige) %>% scale() %>% data.matrix( rownames.force = FALSE)
y <- test %>% select(prestige) %>% data.matrix( rownames.force = FALSE)
# data.matrix(df, rownames.force = FALSE)
X<-cbind(rep(1, nrow(X)), X)
X <- matrix(X, ncol = ncol(X))
y <- matrix(y) %>% as.numeric()
theta<-rep(0, ncol(X))

We will next run the model with different levels of alpha and for different numbers of iterations.

First model

This model has an alpha of 0.1 and 100 iterations. It gets pretty close to the true value.

results <- gradDecent(X, y, theta, 0.1, 1000)
results
## [[1]]
##               [,1]
## [1,] -1.652306e-16
## [2,]  5.889189e-01
## [3,]  2.579191e-01
## [4,]  1.182848e-02
## [5,]  8.530877e-02
## [6,]  2.412328e-01
## 
## [[2]]

Model 2

This model has a much lower alpha but the same iterations. It doesn’t quite get to the correct value.

results <- gradDecent(X, y, theta, 0.0001, 1000)
results
## [[1]]
##               [,1]
## [1,] -1.406216e-17
## [2,]  7.489512e-02
## [3,]  6.094275e-02
## [4,] -9.752833e-03
## [5,] -5.295304e-02
## [6,]  7.082409e-02
## 
## [[2]]

Model 3

This model has the highest alpha that will work, any higher and it will go off into infinity.

results <- gradDecent(X, y, theta, 0.696, 1000)
results
## [[1]]
##               [,1]
## [1,] -1.527690e-16
## [2,]  5.875226e-01
## [3,]  2.568549e-01
## [4,]  1.190349e-02
## [5,]  8.651093e-02
## [6,]  2.399063e-01
## 
## [[2]]

Model 4

Because the alpha is too high, this model goes off course. The steps are so big that every iteration, it actually gets worse. If alpha gets large enough, it will get numbers so large that R will not be able to handle it.

results <- gradDecent(X, y, theta, 0.7, 1000)
results
## [[1]]
##               [,1]
## [1,] -4.095216e-14
## [2,] -1.321124e+02
## [3,] -1.005312e+02
## [4,]  7.125586e+00
## [5,]  1.138000e+02
## [6,] -1.252831e+02
## 
## [[2]]

Model 5

This model has a very slow learning rate, but for more iterations. It is able to get to the correct value but it takes longer.

results <- gradDecent(X, y, theta, 0.0001, 100000) # Increase this to get to the actual values of theta.
results
## [[1]]
##               [,1]
## [1,] -1.665350e-16
## [2,]  4.853657e-01
## [3,]  2.703631e-01
## [4,]  7.098734e-03
## [5,]  2.200456e-02
## [6,]  2.830984e-01
## 
## [[2]]

Using R

This model is how one would normally do it. Far easier, even though we used scaled variables, we didn’t need too. It just makes comparing them easier.

#test <- df %>% select(-prestige) %>% sca
summary(lm(prestige~., test))
## 
## Call:
## lm(formula = prestige ~ ., data = test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86274 -0.26217  0.01824  0.30698  1.08206 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.335e-16  4.214e-02   0.000 1.000000    
## education    5.889e-01  1.039e-01   5.671 1.63e-07 ***
## income       2.579e-01  6.487e-02   3.976 0.000139 ***
## women        1.183e-02  5.577e-02   0.212 0.832494    
## typebc       8.531e-02  7.795e-02   1.094 0.276626    
## typeprof     2.412e-01  7.639e-02   3.158 0.002150 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4172 on 92 degrees of freedom
## Multiple R-squared:  0.8349, Adjusted R-squared:  0.826 
## F-statistic: 93.07 on 5 and 92 DF,  p-value: < 2.2e-16

Using Matrix Algebra

Finally we will look at the matrix algebra approach. We have to change the first row from ones to zeros.

solve(crossprod(X), crossprod(X,y))
##              [,1]
## [1,] 1.825353e-16
## [2,] 5.889230e-01
## [3,] 2.579185e-01
## [4,] 1.182842e-02
## [5,] 8.531094e-02
## [6,] 2.412309e-01

Refernces: