Summary of Assignment Instructions: Perform a Multivariate Linear Regression analysis using Stochastic Gradient Descent on the supplied dataset. The dataset has two predictor variables:

  1. living area (given in square feet)
  2. number of bedrooms

The response variable is the price of the house. There are 47 observations in total.

Since the predictor variables are given in different units, we are asked to standardize them by estimating the mean and standard deviation of each variable and then computing a standardized version of each using the following formula:

\[ x_{std} = \frac{x - \mu}{\sigma} \]

where \(\mu\) and \(\sigma\) are the mean and standard deviation of \(x\), respectively.

# Read the predictor data into a table
predData <- read.table("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-605/master/Final/ex3x.dat", col.names=c('Sqft', 'BR'))

# Read the response variable data into a table
respData <- read.table("https://raw.githubusercontent.com/jtopor/CUNY-MSDA-605/master/Final/ex3y.dat", col.names=c('Price'))

# Copy the original predictor data into another vector for calculations
predStd <- predData

# get the mean and standard deviation for the Sqft variable
muSqft <- mean(predStd$Sqft)
sdSqft <- sd(predStd$Sqft)

# Transform the Sqft data using (x - mu) / stddev
predStd$Sqft <- (predStd$Sqft - muSqft) / sdSqft

# get the mean and standard deviation for the BR variable
muBR <- mean(predStd$BR)
sdBR <- sd(predStd$BR)

# Transform the BR data using (x - mu) / stddev
predStd$BR <- (predStd$BR - muBR) / sdBR
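
As a quick check, the same transformation can also be applied with R's built-in scale() function, which centers each column by its mean and divides by its standard deviation. A minimal sketch, using the predData and predStd objects defined above:

# standardize both predictors in one step using scale()
predStd_alt <- as.data.frame(scale(predData))

# the result should match the manual (x - mu) / stddev calculations above
all.equal(predStd_alt$Sqft, predStd$Sqft)
all.equal(predStd_alt$BR, predStd$BR)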

With the data standardization complete, we can now construct the matrices to be used in the gradient descent calculations.

# initialize dummy variable x0 = 1 for each row of predictor data
x0 <- rep(1, nrow(predStd))

# x will have 3 columns: the constant x0 plus the two standardized predictors
x <- data.frame(cbind(x0,predStd))
y <- data.frame(respData)

Two separate functions are used to perform the stochastic gradient descent (SGD) calculations: one computes the partial derivative component of the standard gradient descent formula, and the other uses random sampling to construct the “mini batches” and iterates through a specified number of cycles, recording the value of theta from each iteration in a results matrix.

Define the function that calculates the partial derivative component of the standard gradient descent formula:

# ---------------------------------------------------------------------
# define a function that implements the partial derivative component of 
# the gradient descent formula
g_part_deriv <- function(x, y, theta) {
  
  # need to transpose both x and theta to enable matrix multiplication in R
  part_deriv <- (1/ nrow(y))* (t(x) %*% ((x %*% t(theta)) - y))
  
  # return the transpose of the partial derivative so its dimensions match
  # the 1 x 3 theta matrix used in the update step
  return(t(part_deriv))
}
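
In matrix form, the quantity this function computes is the gradient of the (halved) mean squared error cost over a mini batch of \(m\) rows:

\[ \nabla_{\theta} J(\theta) = \frac{1}{m} X^{T}\left(X\theta - y\right) \]

The extra transposes in the R code appear only because theta is stored as a \(1 \times 3\) row vector rather than a column vector.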

Define the primary stochastic gradient descent function:

# ---------------------------------------------------------------------
# define stochastic gradient descent algorithm
stoch_gradDescent <- function(x, y, alpha, n){
  
  # merge x and y to enable accurate random sampling
  xy <- data.frame(cbind(y,x))
  
  # theta must have 3 columns to match the width of the x matrix;
  # initialize all elements of theta to 0
  theta <- matrix(c(0, 0, 0), nrow = 1)
    
  # Initialize a matrix to store values of theta for each iteration
  thetaIter <- matrix(NA, nrow = n, ncol = 3)
    
  # set seed value for random sampling
  set.seed(42)
    
  # now iterate using mini batches of randomly sampled data, updating theta each step
  for (i in 1:n) {
    # randomly sample 5 items from the combined xy data frame
    xysamp <- as.matrix( xy[sample(nrow(xy), 5, replace = TRUE), ] )
      
    # isolate 'x' component of random samples
    xsamp <- as.matrix(xysamp[,2:4])
    
    # isolate 'y' component of random samples
    ysamp <- as.matrix(xysamp[,1])
      
    # update theta using the current mini batch and the supplied learning rate alpha
    theta <- theta - alpha * g_part_deriv(xsamp, ysamp, theta)
      
    # save the theta values for iteration i to a matrix for future plotting
    thetaIter[i,] <- theta
      
  } # end for loop
  return(thetaIter)
}
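
Each pass through the loop applies the standard gradient descent update rule, but with the gradient estimated from a random mini batch of five observations rather than from the full data set:

\[ \theta := \theta - \alpha \, \nabla_{\theta} J(\theta) \]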

Now perform SGD for various values of \(\alpha\):

# run for alpha = 0.001
result1 <- stoch_gradDescent(x, y, 0.001, 500)

# run for alpha = 0.01
result2 <- stoch_gradDescent(x, y, 0.01, 500)

# run for alpha = 0.1
result3 <- stoch_gradDescent(x, y, 0.1, 500)

# run for alpha = 1
result4 <- stoch_gradDescent(x, y, 1, 500)

Now that we have the SGD results for the four values of \(\alpha\), we can plot the values of \(\theta\) against the iteration number to visualize how quickly the coefficient estimates converge for each choice of \(\alpha\).
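
One way to produce such a plot from the matrix returned by stoch_gradDescent is sketched below; the colors and labels are illustrative choices rather than part of the original figures:

# plot the three theta estimates for alpha = 0.001 against the iteration number;
# each column of the result matrix holds one coefficient (Y-intercept, SqFt, BRs)
matplot(result1, type = "l", lty = 1, col = c("black", "red", "blue"),
        xlab = "Iteration", ylab = "Theta estimate",
        main = "Theta vs. Iteration, Alpha = 0.001")
legend("topleft", legend = c("Y-Intercept", "SqFt", "BRs"),
       col = c("black", "red", "blue"), lty = 1)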

Theta after 500 Iterations, Alpha = 0.001
Y-Intercept        SqFt         BRs
  134076.9     37939.28    17252.19

As the plot shows, for \(\alpha = 0.001\) we do not achieve convergence within 500 iterations. Using a value as small as \(0.001\) for \(\alpha\) may therefore be too computationally expensive if we want to achieve convergence within a reasonable number of iterations on this data set. The table above shows the values of \(\theta\) after 500 iterations with \(\alpha = 0.001\).

Theta after 500 Iterations, Alpha = 0.01
Y-Intercept        SqFt         BRs
  339327.3     100529.5    3207.527

The plot for \(\alpha = 0.01\) shows that we may be achieving convergence within 500 iterations. For example, the estimate of the Y-intercept rises sharply during roughly the first 200 iterations and then continues to increase smoothly, but at a slower rate. The plot does not show the “choppiness” we would expect to see if we were continually overshooting the minima. The table above shows the values of \(\theta\) after 500 iterations with \(\alpha = 0.01\).

Theta after 500 Iterations, Alpha = 0.1
Y-Intercept        SqFt         BRs
  345724.9       105554    3642.815

For \(\alpha = 0.1\), the plot shows that we may have achieved convergence rather quickly, perhaps after 50 iterations or so. Beyond that point the estimates for the three coefficients bounce up and down repeatedly, which likely indicates that the “step” size we have chosen for \(\alpha\) is causing the estimates to oscillate around their optimal values. The table above shows the values of \(\theta\) after 500 iterations with \(\alpha = 0.1\).

Theta after 500 Iterations, Alpha = 1.0
Y-Intercept        SqFt         BRs
  368528.3     125503.6   -14053.34

For \(\alpha = 1.0\), the very noisy pattern displayed for each of the three coefficients suggests that this value of \(\alpha\) is simply too large. Too large a step size can cause such behavior in SGD because the algorithm repeatedly “overshoots” the minima it is searching for, particularly when the data set is relatively small. The table above shows the values of \(\theta\) after 500 iterations with \(\alpha = 1.0\).

-----------------------------------------------------------------------

Estimating Using R’s Linear Regression Function

Using R’s built-in linear regression function, we get:

# convert the standardized predictors and the response to matrices
matx <- as.matrix(x)
maty <- as.matrix(y)

# fit an ordinary least squares model: price ~ standardized Sqft + standardized BR
mdl <- lm(maty ~ matx[, 2] + matx[, 3])
summary(mdl)
## 
## Call:
## lm(formula = maty ~ matx[, 2] + matx[, 3])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -130582  -43636  -10829   43698  198147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   340413       9637  35.323  < 2e-16 ***
## matx[, 2]     110631      11758   9.409 4.22e-12 ***
## matx[, 3]      -6650      11758  -0.566    0.575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66070 on 44 degrees of freedom
## Multiple R-squared:  0.7329, Adjusted R-squared:  0.7208 
## F-statistic: 60.38 on 2 and 44 DF,  p-value: 2.428e-13

The fitted regression equation for the estimate of a home’s price derived via R’s linear regression function, recalling that both predictors have been standardized, is:

Home Price Estimate = \(340{,}413 + 110{,}631 \times (\text{standardized square footage}) - 6{,}650 \times (\text{standardized number of bedrooms})\)
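
To illustrate how this equation would be applied, the sketch below estimates the price of a hypothetical house; the 2,000 square foot / 3 bedroom figures are purely illustrative. Note that any new inputs must be standardized using the same means and standard deviations computed from the training data:

# hypothetical new house: 2000 sq ft, 3 bedrooms (illustrative values only)
new_sqft <- 2000
new_br <- 3

# standardize the new inputs using the training-data means and standard deviations
new_sqft_std <- (new_sqft - muSqft) / sdSqft
new_br_std <- (new_br - muBR) / sdBR

# apply the coefficients from the fitted lm() model
coefs <- coef(mdl)
price_est <- coefs[1] + coefs[2] * new_sqft_std + coefs[3] * new_br_std
price_est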

-----------------------------------------------------------------------

Comparing SGD Results to Linear Regression

Let’s compare this to the results of our SGD calculations after 500 iterations:

Alpha     Y-Intercept      Sq Feet      Bedrooms
0.001      134,076.90    37,939.28     17,252.19
0.01       339,327.30   100,529.50      3,207.53
0.1        345,724.90   105,554.00      3,642.82
1.0        368,528.30   125,503.60    -14,053.34

Of the four SGD results, those for \(\alpha = 0.001\) and \(\alpha = 1.0\) are clearly very far off from the results of the linear regression modeling.

For \(\alpha = 0.01\) and \(\alpha = 0.1\), \(\alpha = 0.01\) appears to yield a less “noisy” result while \(\alpha = 0.1\) appears to have converged faster. Both Y-intercept estimates differ from the linear regression result by less than two percent. With regard to the square footage coefficient, \(\alpha = 0.1\) yields a result somewhat closer to that of linear regression, while the bedroom coefficient estimates differ from the linear regression value by roughly equal amounts for both \(\alpha = 0.01\) and \(\alpha = 0.1\).
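
These differences can be quantified directly by comparing the final row of each SGD result matrix against the lm() coefficients; a minimal sketch, using the objects defined above:

# coefficients from the lm() fit: intercept, Sqft, BR
lm_coefs <- coef(mdl)

# final theta values (last row) from the alpha = 0.01 and alpha = 0.1 runs
theta_001 <- result2[nrow(result2), ]  # alpha = 0.01
theta_010 <- result3[nrow(result3), ]  # alpha = 0.1

# percentage difference of each SGD estimate relative to the lm() estimate
round(100 * (theta_001 - lm_coefs) / lm_coefs, 2)
round(100 * (theta_010 - lm_coefs) / lm_coefs, 2)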

However, such differences should be expected: the linear regression model uses the data set in its entirety to find the best-fitting least squares solution, while the “mini batch” SGD performed here relies on repeated random sampling of the original data and yields only an approximation of the best-fitting least squares coefficients.

Also, SGD is purported to work very well when applied to a large data set, yielding results comparable to non-stochastic gradient descent at a greatly reduced computational cost. However, the data set we are using here is relatively small (47 observations of 3 variables). As such, it might be inappropriate to use SGD on this data, particularly if we are concerned with the accuracy of the results.