Summary

Multiple linear regression implemented in R. The code comes from the Machine Learning course by Andrew Ng on Coursera. It was originally written in Matlab/Octave as part of the weekly exercises; this is a port to R.

Note: before using these functions, a column of 1s should be added to X so that the first parameter, Theta0, acts as the intercept.
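
For example (a minimal sketch, using the mtcars columns from the Testing section below):

X <- as.matrix(mtcars[, 2:7])  # feature matrix
X <- cbind(1, X)               # prepend a column of 1s so the first column multiplies Theta0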

Cost function

Note: Theta should be a vector, X a matrix, and y a vector.
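
For reference, the regularized cost implemented below is

$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\Theta(x^{(i)}) - y^{(i)}\big)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\Theta_j^2$$

where the predictions are computed as $X\Theta$ and the intercept term $\Theta_0$ is excluded from the regularization sum.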

Cost_function <- function(X, y, Theta, Lambda = 0){
  m <- dim(X)[1]                          # number of training examples
  assign("m", m, .GlobalEnv)              # kept in the global environment for reuse
  Hypothesis <- X %*% Theta               # predictions h(x) = X * Theta
  assign("Hypothesis", Hypothesis, .GlobalEnv)
  J <- (1 / (2 * m)) * t(Hypothesis - y) %*% (Hypothesis - y)
  
  # regularization term (Theta0 excluded); leave Lambda at 0 if not needed
  Reg <- (Lambda / (2 * m)) * sum(Theta[2:length(Theta)]^2)
  J <- J + Reg
  return(J)
}
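
A minimal usage sketch (X_demo and y_demo are illustrative names; mtcars is the dataset used in the Testing section, here with an intercept column added):

X_demo <- cbind(1, as.matrix(mtcars[, 2:7]))
y_demo <- mtcars[, 1]
Cost_function(X_demo, y_demo, rep(0, ncol(X_demo)))  # cost with all parameters at zero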

Feature scaling and mean normalization

Method: if the features are on a similar scale, gradient descent converges more quickly, so this step is meant to be used together with gradient descent.

Both steps (scaling and mean normalization) are implemented below in a single function.

Implementation note: when normalizing the features, it is important to store the values used for the normalization - the mean and the scaling factor (here the range max - min, stored in sigma) - because they are needed again at prediction time.

After learning the parameters of the model, we often want to make predictions for examples we have not seen before (in the original exercise, house prices predicted from the living-room area and the number of bedrooms).

Given a new x, we must first normalize it using the mean and scaling values previously computed from the training set, as sketched after the function below.

FeatScal_MeanNorm <- function(X){
  
  X_norm <- X
  mu <- vector(mode = "numeric", length = dim(X)[2])
  sigma <- vector(mode = "numeric", length = dim(X)[2])
  
  for(i in 1:dim(X)[2]){
    avg <- mean(X[, i])
    S <- max(X[, i]) - min(X[, i])       # scaling factor: the range of feature i
    X_norm[, i] <- (X[, i] - avg) / S    # mean normalization + feature scaling
    # save the mean and the scaling factor for later use on new examples
    mu[i] <- avg
    sigma[i] <- S
  }
  
  assign("mu", mu, .GlobalEnv)
  assign("sigma", sigma, .GlobalEnv)
  assign("X_norm", X_norm, .GlobalEnv)
  return(X_norm)
}
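
A short sketch of the prediction-time use described above (the new car's feature values are made up for illustration): the stored mu and sigma are reused to normalize a new example.

FeatScal_MeanNorm(as.matrix(mtcars[, 2:7]))  # creates X_norm, mu and sigma in the global environment

x_new <- c(6, 200, 120, 3.5, 3.0, 17)        # hypothetical new car: cyl, disp, hp, drat, wt, qsec
x_new_norm <- (x_new - mu) / sigma           # same transformation as applied to the training set
x_new_norm <- c(1, x_new_norm)               # prepend 1 for Theta0 before computing a prediction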

Gradient function with regularization

If J increases from one iteration to the next, the learning rate Alpha needs to be lowered, e.g. to 0.0001.

Gradient_descent <- function(X, y, Theta, Alpha, nb_iteration, Lambda = 0){
  m <- dim(X)[1]
  Theta_grad <- Theta
  J <- Cost_function(X, y, Theta_grad, Lambda)
  df_learning <- data.frame(Iteration = integer(), J = numeric()) # data frame to track the cost at each iteration
  
  for (i in 1:nb_iteration){
    # Theta_grad <- Theta_grad - (Alpha / m) * (t(X) %*% ((X %*% Theta_grad) - y)) # vectorized update without regularization
    
    # Gradient terms, fully vectorized:
    # the sum inside the gradient is computed as the product of the transpose of
    # the input examples and the prediction errors (row i of t(X) holds feature i
    # for every example). Since the intercept Theta0 is NOT regularized, the first
    # element of the regularization vector is set to 0.
    pred <- X %*% Theta_grad
    grad <- ((t(X) %*% (pred - y)) / m) + ((Lambda / m) * c(0, Theta_grad[2:length(Theta_grad)]))
    
    Theta_grad <- Theta_grad - Alpha * grad
    
    J <- Cost_function(X, y, Theta_grad, Lambda)
    df_learning[i, ] <- c(i, J)
  }
  
  print(paste("local min J", J))
  assign("df_learning", df_learning, .GlobalEnv) # save the data frame so the learning curve can be plotted
  return(Theta_grad) # note: execution stops after return()
}
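
A short sketch (the Alpha values are only examples) of how the learning curve stored in df_learning can be used to compare learning rates, on the mtcars data from the Testing section:

X_demo <- as.matrix(mtcars[, 2:7])
y_demo <- mtcars[, 1]

Gradient_descent(X_demo, y_demo, rep(0, 6), Alpha = 1e-05, nb_iteration = 800)
curve_1 <- df_learning  # learning curve for Alpha = 1e-05

Gradient_descent(X_demo, y_demo, rep(0, 6), Alpha = 1e-06, nb_iteration = 800)
curve_2 <- df_learning  # learning curve for Alpha = 1e-06

plot(curve_1$Iteration, curve_1$J, type = "l", xlab = "Nb iterations", ylab = "J cost")
lines(curve_2$Iteration, curve_2$J, lty = 2)  # dashed line: the smaller Alpha (slower decrease of J)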

Normal equation with or without regularization

The goal of regularization is to reduce overfitting.

Normal equation does not need feature scaling!
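
For reference, the closed-form solution implemented below is

$$\Theta = (X^T X + \lambda L)^{-1} X^T y$$

where L is the identity matrix with its first diagonal entry set to 0, so that Theta0 is not regularized. With Lambda = 0 this reduces to the usual normal equation. The code uses ginv() from MASS, a pseudo-inverse, in place of an exact inverse, which also copes with a non-invertible t(X) %*% X.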

Norm_equ <- function(X, y, Lambda = 0){
  library(MASS) # for ginv(), the Moore-Penrose pseudo-inverse
  
  # Regularization matrix: the identity with its first diagonal entry set to 0,
  # so that the first parameter (Theta0) is not regularized
  n <- dim(X)[2]
  Mat <- diag(n)
  Mat[1, ] <- 0
  
  if (Lambda == 0) {
    Theta_grad <- ginv(t(X) %*% X) %*% t(X) %*% y
  } else {
    Theta_grad <- ginv(t(X) %*% X + Lambda * Mat) %*% t(X) %*% y
  }
  
  assign("Theta_grad", Theta_grad, .GlobalEnv)
  return(Theta_grad) # note: execution stops after return()
}

Testing

Gradient descent without regularization

We will use the mtcars dataset for this example, with mpg as the dependent variable. Note that no intercept column of 1s is added here, so the model is fitted without Theta0.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
X <- as.matrix(mtcars[, 2:7])
Y <- mtcars[, 1]

Gradient_descent(X,Y, c(0,0,0,0,0,0),0.00001,800)
## [1] "local min J 29.1175717142582"
##             [,1]
## cyl   0.08612188
## disp -0.02429728
## hp    0.06951919
## drat  0.13716588
## wt    0.04651859
## qsec  0.64248892
plot(df_learning$Iteration, df_learning$J,
     main = "Learning curve", 
     xlab = "Nb iterations",
     ylab = "J cost")

Gradient descent with regularization

Gradient_descent(X,Y, c(0,0,0,0,0,0),0.00001,800,50)
## [1] "local min J 29.6709653789403"
##             [,1]
## cyl   0.08625318
## disp -0.02418372
## hp    0.06969790
## drat  0.13640546
## wt    0.04626597
## qsec  0.63893032
plot(df_learning$Iteration, df_learning$J, 
     main = "Learning curve", 
     xlab = "Nb iterations",
     ylab = "J cost")

Normal equation with and without regularization

Norm_equ(X,Y,0) # without regularization
##             [,1]
## [1,]  0.14804477
## [2,]  0.01271651
## [3,] -0.01063254
## [4,]  3.26954359
## [5,] -4.75444330
## [6,]  1.19404896
Cost_function(X, Y, Theta_grad, 0) # cost at the normal-equation solution
##          [,1]
## [1,] 2.884699
Norm_equ(X, Y, 5) # with regularization
##              [,1]
## [1,]  0.252642608
## [2,] -0.009190186
## [3,] -0.005295806
## [4,]  2.382375503
## [5,] -2.479987066
## [6,]  1.163583137
Cost_function(X, Y, Theta_grad, 0) # cost at the regularized solution
##          [,1]
## [1,] 3.493251
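
Finally, a small prediction sketch (the new car's feature values are made up): the coefficients returned by Norm_equ can be used to predict mpg for an unseen example.

Theta_hat <- Norm_equ(X, Y, 0)          # X and Y as defined above (no intercept column)
x_new <- c(6, 200, 120, 3.5, 3.0, 17)   # hypothetical car: cyl, disp, hp, drat, wt, qsec
as.numeric(x_new %*% Theta_hat)         # predicted mpg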