Introduction

Welcome to my blog. If you are reading this, most likely you’ve heard of or have taken Andrew Ng’s Machine Learning course on Coursera. One thing I always wished was that the course were taught in R instead of Octave/Matlab, so I decided to do all of the course assignments in R.

Implementation Notes
This document is aligned with the course assignments, therefore NO math derivations. The ML code is wrapped into the modeling interface of the lm and glm functions: you can supply data frames directly together with model specifications in the format y~x1+x2… I will make every effort to avoid using any packages other than ggplot2. Enjoy!

The Github repository for this blog can be found here.

Part 1: Gradient Descent

Let’s start by loading the housing data and the food truck profit data from ex1. Note that the data are loaded into data frames and not into individual X and y matrices.
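A minimal sketch of that loading step, assuming the course files ex1data1.txt (food truck data) and ex1data2.txt (housing data) sit in the working directory; the housing column names footage, bedroom and price match the summaries shown later, while df_profit and its column names are my own choice for illustration:

```r
# Load the two ex1 datasets into data frames (no separate X and y matrices).
df_profit <- read.csv("ex1data1.txt", header = FALSE,
                      col.names = c("population", "profit"))
df_prices <- read.csv("ex1data2.txt", header = FALSE,
                      col.names = c("footage", "bedroom", "price"))
```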

Gradient descent

Next, we will implement gradient descent similar to the one in gradientDescentMulti.m. The function below has the following inputs:

  • X,y data in matrix format
  • theta initial value for \(\theta\)
  • alpha the learning rate \(\alpha\)
  • num_iters how many iterations the loop should make
  • threshold break execution if the difference in \(J(\theta)\) from one iteration to the next falls below threshold

The function outputs a list that has 2 elements:

  • theta parameter vector
  • iteration how many iterations it took to converge
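A sketch of how such a loop might look in R, following the inputs and outputs listed above; the function name gradientDescent and the squared-error cost \(J(\theta) = \frac{1}{2m}\sum(X\theta - y)^2\) used for the stopping rule are my own choices, and the original implementation may differ in detail:

```r
gradientDescent <- function(X, y, theta, alpha, num_iters, threshold = 1e-9) {
  m <- nrow(X)
  J_prev <- Inf
  for (i in seq_len(num_iters)) {
    error <- X %*% theta - y
    theta <- theta - (alpha / m) * (t(X) %*% error)   # simultaneous update
    J <- sum((X %*% theta - y)^2) / (2 * m)
    # stop early once the improvement in the cost falls below the threshold
    if (abs(J_prev - J) < threshold) break
    J_prev <- J
  }
  list(theta = theta, iteration = i)
}
```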

Testing: Check the results of Gradient Descent against the built-in lm function

##      [,1]     [,2]
## y -3.8952 1.192975
## (Intercept)           X 
##   -3.895781    1.193034

Feature Normalization

In this part we will look at the housing prices dataset df_prices. More specifically, the variables that will be part of the X matrix: footage and bedroom:

##     footage        bedroom    
##  Min.   : 852   Min.   :1.00  
##  1st Qu.:1432   1st Qu.:3.00  
##  Median :1888   Median :3.00  
##  Mean   :2001   Mean   :3.17  
##  3rd Qu.:2269   3rd Qu.:4.00  
##  Max.   :4478   Max.   :5.00

Values of the feature bedroom are about 1,000 times smaller than the values of footage. Performing feature scaling will ensure that gradient descent converges much more quickly.
The function featureNormalize takes the following inputs:

  • xm matrix of X’s
  • infl scalar, 1 if intercept is present in the model, 0 otherwise

The function outputs a list that has 2 elements:

  • x matrix of standardized X’s
  • scalingMatrix matrix of means and st. deviations for each regressor. If infl=1, the mean is set to 0 and the st. dev to 1 for the first column of X’s
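A possible implementation consistent with the inputs and outputs above (the original featureNormalize in the repository may differ in detail):

```r
featureNormalize <- function(xm, infl = 1) {
  mu    <- colMeans(xm)
  sigma <- apply(xm, 2, sd)
  if (infl == 1) {       # do not rescale the intercept column
    mu[1]    <- 0
    sigma[1] <- 1
  }
  x <- sweep(sweep(xm, 2, mu, "-"), 2, sigma, "/")
  scalingMatrix <- rbind(mu = mu, sigma = sigma)
  list(x = x, scalingMatrix = scalingMatrix)
}
```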

Testing: Compare the original matrix of X’s to the feature-normalized one

##         footage   bedroom
## mu    2000.6809 3.1702128
## sigma  794.7024 0.7609819
##          footage    bedroom
## [1,]  0.13000987 -0.2236752
## [2,] -0.50418984 -0.2236752
## [3,]  0.50247636 -0.2236752
## [4,] -0.73572306 -1.5377669
## [5,]  1.25747602  1.0904165
## [6,] -0.01973173  1.0904165
##   footage bedroom
## 1    2104       3
## 2    1600       3
## 3    2400       3
## 4    1416       2
## 5    3000       4
## 6    1985       4

Putting it all together: gd function

In this section I would like to discuss a “wrapper” function gd that works with the functions above in a similar fashion as lm does with lm.fit. That is, the goal here is to make gradient descent a practical solution that could serve as an alternative to lm.
The function gd takes the following inputs:

  • formula an object of class “formula”: a symbolic description of the model to be fitted
  • data data frame
  • subset an optional vector specifying a subset of observations to be used in the fitting process
  • theta a vector of initial values of \(\theta\). Length of the vector should match model specification
  • alpha the learning rate \(\alpha\)
  • num_iters how many iterations the loop should make
  • threshold break execution if the difference in \(J(\theta)\) from one iteration to the next falls below threshold
  • normalize logical. If FALSE, the scaling matrix will have mu=0 and st. dev=1 for all variables

The function outputs a list that has 3 elements:

  • theta parameter vector of model coefficients
  • scalingMatrix matrix of means and st. deviations for each regressor. If there is an Intercept in the model, the mean is set to 0 and the st. dev to 1 for the first column of X’s
  • iteration how many iterations it took to converge
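One way such a wrapper could be assembled with model.frame/model.matrix, shown here as a sketch of the idea rather than the repository’s exact code; it reuses the gradientDescent and featureNormalize sketches above, and subset handling is omitted for brevity:

```r
gd <- function(formula, data, subset = NULL, theta = NULL,
               alpha = 0.01, num_iters = 10000, threshold = 1e-9,
               normalize = TRUE) {
  mf <- model.frame(formula, data = data)          # subset handling omitted
  X  <- model.matrix(formula, mf)
  y  <- matrix(model.response(mf), ncol = 1)
  infl <- as.integer("(Intercept)" %in% colnames(X))
  if (normalize) {
    norm <- featureNormalize(X, infl)
    X    <- norm$x
    scalingMatrix <- norm$scalingMatrix
  } else {
    scalingMatrix <- rbind(mu = rep(0, ncol(X)), sigma = rep(1, ncol(X)))
  }
  if (is.null(theta)) theta <- matrix(0, ncol(X), 1)
  fit <- gradientDescent(X, y, theta, alpha, num_iters, threshold)
  list(theta = fit$theta, scalingMatrix = scalingMatrix,
       iteration = fit$iteration)
}
```

With this interface a call reads just like lm, e.g. gd(price ~ bedroom + footage, data = df_prices, alpha = 0.1, num_iters = 5000).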

Compare to lm

In this section we are going to look at the house prices dataset and compare the results of the new gd function vs. R’s built-in lm. For this test we will also be performing feature normalization:

  • gd:
## $theta
##                   [,1]
## (Intercept) 340412.659
## bedroom      -6644.255
## footage     110625.831
## 
## $scalingMatrix
##       (Intercept)      bedroom      footage
## mu              0 2.698896e-16 2.315118e-17
## sigma           1 1.000000e+00 1.000000e+00
## 
## $iteration
## [1] 2161
  • lm
## 
## Call:
## lm(formula = price ~ bedroom + footage, data = df_prices_scaled)
## 
## Coefficients:
## (Intercept)      bedroom      footage  
##      340413        -6649       110631

As expected, the results for the 2 approaches are very close. Note: gd also outputs scaling data that can be used to pre-process the data prior to using predict.

Part 2: Logistic Regression

In this section, I will implement logistic regression and compare the performance to glm(family=”binomial”) output. Our dataset contains data on university admittance based on the scores a student gets on two exams.
Let’s load the data…

Sigmoid function

Before we can start with the actual cost function, we need to implement the sigmoid function, \(g(x)\):
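In R this is a one-liner that vectorizes over scalars, vectors and matrices alike:

```r
# Sigmoid g(x) = 1 / (1 + e^(-x)); works element-wise on matrices.
sigmoid <- function(z) 1 / (1 + exp(-z))
```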

Cost function and gradient (Regularized)

In this section we will be implementing a function similar to costFunctionReg.m. This function will have two outputs: cost and gradient. In place of Octave/Matlab’s fminunc we will be using R’s general purpose optimization function optim. The interfaces of the two functions are very similar.
Some additional notes about implementation in R:

  • Since the function is capable of dealing with regularization, it also requires an input infl that indicates whether or not the model specification has an intercept, to avoid regularizing it
  • The actual cost/gradient function is wrapped into a simple function that takes only one input, type, and outputs two closures that can be supplied to optim as separate functions: cost and gradient

The function ComputeCostGradient takes the following inputs:

  • type 2 options: “J” to output cost function, “grad” to output gradient function

The function outputs 2 closures:

  • J for cost function
  • grad for gradient function
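A hedged sketch of such a closure factory; the argument order of the returned functions (theta first, then X, y, lambda, infl) is my assumption, chosen so they can be passed to optim as fn and gr, and the original may differ:

```r
ComputeCostGradient <- function(type = c("J", "grad")) {
  type <- match.arg(type)
  # Regularized logistic-regression cost
  J <- function(theta, X, y, lambda = 0, infl = 1) {
    m <- nrow(X)
    h <- sigmoid(X %*% theta)
    reg <- if (infl == 1) theta[-1] else theta      # skip the intercept
    sum(-y * log(h) - (1 - y) * log(1 - h)) / m +
      lambda / (2 * m) * sum(reg^2)
  }
  # Regularized gradient (intercept term left unpenalized when infl = 1)
  grad <- function(theta, X, y, lambda = 0, infl = 1) {
    m <- nrow(X)
    h <- sigmoid(X %*% theta)
    g <- t(X) %*% (h - y) / m
    penalty <- lambda / m * theta
    if (infl == 1) penalty[1] <- 0
    as.vector(g + penalty)
  }
  if (type == "J") J else grad
}
```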

Putting it all together

In this section I will be introducing the “wrapper” function gdlreg, which is modeled on the lm function.
The function gdlreg takes the following inputs:

  • formula an object of class “formula”: a symbolic description of the model to be fitted
  • data data frame
  • subset an optional vector specifying a subset of observations to be used in the fitting process
  • theta a vector of initial values of \(\theta\). Length of the vector should match model specification
  • lambda a regularization parameter
  • method the method to be used. See ?optim/method for details

The function outputs:

  • theta parameter vector of model coefficients
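A sketch of how gdlreg might tie ComputeCostGradient and optim together; the defaults and the omitted subset handling are my own simplifications, not necessarily what the repository does:

```r
gdlreg <- function(formula, data, subset = NULL, theta = NULL,
                   lambda = 0, method = "BFGS") {
  mf <- model.frame(formula, data = data)          # subset handling omitted
  X  <- model.matrix(formula, mf)
  y  <- matrix(model.response(mf), ncol = 1)
  infl <- as.integer("(Intercept)" %in% colnames(X))
  if (is.null(theta)) theta <- rep(0, ncol(X))
  fit <- optim(par = theta,
               fn  = ComputeCostGradient("J"),
               gr  = ComputeCostGradient("grad"),
               X = X, y = y, lambda = lambda, infl = infl,
               method = method)
  setNames(fit$par, colnames(X))                   # named coefficient vector
}
```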

Plotting decision boundary

  • Coming up …

Compare to glm

In this section we will compare the new gdlreg2 function to R’s glm function. We will be using the exam/admittance data loaded at the beginning of this section:

  • gdlreg2:
## (Intercept)          e1          e2 
## -25.1647048   0.2062524   0.2015046
  • glm/binomial:
## (Intercept)          e1          e2 
## -25.1613335   0.2062317   0.2014716

Part 3: Multi-class Classification

In this section I will be implementing one-vs-all logistic regression to recognize hand-written digits. We have 2 files that contain 5,000 training examples of handwritten digits and the corresponding labels. These files are in native Octave/Matlab format and require the package rmatio from CRAN to read them into R. The same 2 datasets will be used in the following section on Neural Networks.


Cost function and gradient (Regularized)

In this section we will be implementing a function similar to lrCostFunction.m. This function happens to be exactly the same as ComputeCostGradient defined in the previous section, so we will be reusing it here.

One-vs-all Classification

In this part of the exercise, I will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the K classes in our dataset. In the handwritten digits dataset, K = 10, but our code will work for any value of K.

To accomplish this, I will be using the function gdlregMulti, which mimics OneVSAll.m. The function gdlregMulti takes the following inputs:

  • formula an object of class “formula”: a symbolic description of the model to be fitted
  • data data frame
  • subset an optional vector specifying a subset of observations to be used in the fitting process
  • theta a vector of initial values of \(\theta\). Length of the vector should match model specification
  • lambda a regularization parameter
  • method the method to be used. See ?optim/method for details

The function outputs a list that has 2 elements:

  • all_theata a dataframe, each row contains parameter estimates for each class K
  • y_class vector of class labels in the model
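A sketch of the one-vs-all loop implied by the description above, reusing the ComputeCostGradient closures; the internals are my own illustration, and all_theata is kept as spelled in the text:

```r
gdlregMulti <- function(formula, data, subset = NULL, theta = NULL,
                        lambda = 0, method = "BFGS") {
  mf <- model.frame(formula, data = data)          # subset handling omitted
  X  <- model.matrix(formula, mf)
  yv <- model.response(mf)
  infl    <- as.integer("(Intercept)" %in% colnames(X))
  y_class <- sort(unique(yv))
  if (is.null(theta)) theta <- rep(0, ncol(X))
  # One binary "class k vs. the rest" fit per class; rows of all_theata = classes.
  all_theata <- t(sapply(y_class, function(k) {
    yk <- matrix(as.numeric(yv == k), ncol = 1)
    optim(theta, ComputeCostGradient("J"), ComputeCostGradient("grad"),
          X = X, y = yk, lambda = lambda, infl = infl,
          method = method)$par
  }))
  list(all_theata = as.data.frame(all_theata), y_class = y_class)
}
```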

One-vs-all Prediction

In this section I will examine predictOneVsAll function that mimics predictOneVsAll.m.

The function takes one input, the fitted gdlregMulti classifier, which contains 2 elements:

  • all_theata a dataframe, each row contains parameter estimates for each class K
  • y_class vector of class labels in the model
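A hedged sketch: besides the fitted classifier, prediction needs a design matrix for the observations to score, which I pass explicitly here as newX; the original predictOneVsAll may construct it from a formula/data pair instead:

```r
predictOneVsAll <- function(classifier, newX) {
  theta_mat <- as.matrix(classifier$all_theata)   # K x p parameter matrix
  scores    <- sigmoid(newX %*% t(theta_mat))     # n x K class scores
  classifier$y_class[max.col(scores)]             # label with the highest score
}
```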

Testing

Evaluation of model performance will consist of several steps:

  1. Split the dataset into 3 parts: training (70%), cross-validation (20%), test (10%)
  2. Train the model on the training set over a range of \(\lambda\) values
  3. Evaluate the accuracy of each model on the cross-validation set and pick the value of \(\lambda\) that maximizes it
  4. Train the model on the training dataset with the optimal \(\lambda\)
  5. Evaluate the accuracy of the final model on the test set
## Optimal value of lambda: 2
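The selection loop above could be scripted roughly as follows; df_digits, the y ~ . formula and the \(\lambda\) grid are placeholders for illustration, not the exact objects used to produce the output above:

```r
set.seed(1)
n   <- nrow(df_digits)
idx <- sample(c("train", "cv", "test"), n, replace = TRUE,
              prob = c(0.7, 0.2, 0.1))
lambdas <- c(0.01, 0.1, 1, 2, 3, 10)
# Cross-validation accuracy for each candidate lambda
acc <- sapply(lambdas, function(l) {
  fit  <- gdlregMulti(y ~ ., data = df_digits[idx == "train", ], lambda = l)
  newX <- model.matrix(y ~ ., df_digits[idx == "cv", ])
  mean(predictOneVsAll(fit, newX) == df_digits$y[idx == "cv"])
})
lambda_opt <- lambdas[which.max(acc)]
```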

Final model evaluation

In this section I will be estimating the model based on the training set and the optimal value of \(\lambda\) found above:

## Model Accuracy: 87.7%
##          actual
## predicted  1  2  3  4  5  6  7  8  9 10
##        1  46  0  2  0  1  1  3  1  0  0
##        2   0 41  6  0  0  0  0  0  0  0
##        3   0  0 45  0  2  0  0  3  0  1
##        4   0  1  0 43  0  1  3  0  3  0
##        5   0  1  2  0 35  0  0  2  0  3
##        6   1  0  0  1  2 58  0  0  0  0
##        7   0  0  0  0  0  1 49  1  2  0
##        8   0  1  0  1  0  0  0 39  0  0
##        9   0  1  2  5  0  0  1  2 46  0
##        10  0  1  0  0  0  0  1  0  3 41

Part 4: Neural Networks (coming up)

Part 8: Anomaly Detection and Recommender Systems

In this part I will implement collaborative filtering and apply it to a dataset of movie ratings. This dataset consists of ratings on a scale of 1 to 5. The dataset has nu=943 users and nm=1,682 movies. Start by loading the movie database files:

Cost function and gradient (Regularized)

In this section we will be implementing a function similar to cofiCostFunc.m. This function will have two outputs: cost and gradient.
Some additional notes about implementation in R:

  • The actual cost/gradient function is wrapped into a simple function that takes only one input, type, and outputs two closures that can be supplied to optim as separate functions: cost and gradient

The function cofi takes the following inputs:

  • type 2 options: “J” to output cost function, “grad” to output gradient function

The inner anonymous function takes the following inputs:

  • params initial unrolled values for X and Theta stacked
  • Y matrix of movie ratings
  • R binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise

The function outputs 2 closures:

  • J for cost function
  • grad for gradient function
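A sketch of cofi consistent with the description above; the returned closures take the unrolled parameter vector followed by Y, R, the problem dimensions and lambda, which is an assumed argument order rather than the repository’s exact one:

```r
cofi <- function(type = c("J", "grad")) {
  type <- match.arg(type)
  # Rebuild X (movies x features) and Theta (users x features) from params.
  unroll <- function(params, num_users, num_movies, num_features) {
    X     <- matrix(params[1:(num_movies * num_features)],
                    num_movies, num_features)
    Theta <- matrix(params[-(1:(num_movies * num_features))],
                    num_users, num_features)
    list(X = X, Theta = Theta)
  }
  J <- function(params, Y, R, num_users, num_movies, num_features, lambda = 0) {
    p   <- unroll(params, num_users, num_movies, num_features)
    err <- (p$X %*% t(p$Theta) - Y) * R            # errors only where rated
    0.5 * sum(err^2) + lambda / 2 * (sum(p$Theta^2) + sum(p$X^2))
  }
  grad <- function(params, Y, R, num_users, num_movies, num_features, lambda = 0) {
    p   <- unroll(params, num_users, num_movies, num_features)
    err <- (p$X %*% t(p$Theta) - Y) * R
    X_grad     <- err %*% p$Theta + lambda * p$X
    Theta_grad <- t(err) %*% p$X + lambda * p$Theta
    c(as.vector(X_grad), as.vector(Theta_grad))    # unrolled like params
  }
  if (type == "J") J else grad
}
```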

Gradient Checking

Before proceeding further, I’m going to test the performance of the cofi function; specifically, I will be comparing analytical gradients to numerical ones. To perform the gradient checking, I created the function checkCOGradients. This function will output two columns of numbers: numerical and analytical gradients. The function takes the following inputs:

  • costFunc function to be evaluated
  • Y matrix of movie ratings
  • R binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise
  • num_users
  • num_movies
  • num_features
  • lambda a regularization parameter
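A minimal numerical-gradient checker along these lines; it assumes costFunc is the closure factory (so costFunc("J") and costFunc("grad") return the two functions), draws random parameters itself, and uses the usual central-difference perturbation of 1e-4:

```r
checkCOGradients <- function(costFunc, Y, R, num_users, num_movies,
                             num_features, lambda = 0) {
  J    <- costFunc("J")
  grad <- costFunc("grad")
  params <- runif(num_movies * num_features + num_users * num_features)
  analytical <- grad(params, Y, R, num_users, num_movies, num_features, lambda)
  e <- 1e-4
  # Central differences of the cost, one parameter at a time
  numeric <- sapply(seq_along(params), function(i) {
    hi <- lo <- params
    hi[i] <- hi[i] + e
    lo[i] <- lo[i] - e
    (J(hi, Y, R, num_users, num_movies, num_features, lambda) -
     J(lo, Y, R, num_users, num_movies, num_features, lambda)) / (2 * e)
  })
  cbind(Numeric = numeric, Analytical = analytical)
}
```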

To test the cost function I will be creating a small test dataset:

Perform gradient checking:

##          Numeric Analytical
##  [1,]  0.4906855  0.4906855
##  [2,]  1.7131053  1.7131053
##  [3,]  1.2870548  1.2870548
##  [4,]  1.5119594  1.5119594
##  [5,]  1.2060363  1.2060363
##  [6,] -1.5411221 -1.5411221
##  [7,]  1.9451505  1.9451505
##  [8,]  1.4613903  1.4613903
##  [9,]  1.7167589  1.7167589
## [10,]  1.3693976  1.3693976
## [11,] -4.2376488 -4.2376488
## [12,] -0.8775910 -0.8775910
## [13,] -0.6593336 -0.6593336
## [14,] -0.7745479 -0.7745479
## [15,] -0.6178293 -0.6178293
## [16,]  1.4007275  1.4007275
## [17,] -5.8526847 -5.8526847
## [18,]  0.0000000  0.0000000
## [19,]  0.0000000  0.0000000
## [20,]  7.4273223  7.4273223
## [21,]  6.2368812  6.2368812
## [22,]  0.0000000  0.0000000
## [23,]  0.0000000  0.0000000
## [24,] -4.4764255 -4.4764255
## [25,] -6.0927194 -6.0927194
## [26,]  0.0000000  0.0000000
## [27,]  0.0000000  0.0000000

Analytical and numerical gradients look exactly the same.

Normalize Ratings

In this step we will be implementing the function that normalizes movie ratings by the mean rating for each movie. The resulting normalized entries will have a mean rating of zero for each movie. The function takes the following inputs:

  • Y matrix of movie ratings
  • R binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise

The function outputs 2 matrices:

  • Ynorm mean normalized version of Y
  • Ymean vector of length num_movies with the mean rating for each movie
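A compact sketch of this normalization, computing each movie’s mean only over the entries that were actually rated:

```r
normalizeRatings <- function(Y, R) {
  Ymean <- rowSums(Y * R) / pmax(rowSums(R), 1)   # avoid division by zero
  Ynorm <- (Y - Ymean) * R                        # zero out unrated entries
  list(Ynorm = Ynorm, Ymean = Ymean)
}
```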

We can now move on to creating movie recommendations.

Learning Movie Recommendations

In this step we will train the collaborative filtering model. I’ll be using the same movie preferences as in the course material:
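A rough sketch of the training pipeline, assuming my_ratings is the new user’s rating vector, 10 latent features, \(\lambda\)=10 and 200 L-BFGS-B iterations; all of these names and settings are illustrative rather than the exact ones used to produce the output below:

```r
Yall <- cbind(my_ratings, Y)                      # add the new user as column 1
Rall <- cbind(as.numeric(my_ratings != 0), R)
norm <- normalizeRatings(Yall, Rall)
nu <- ncol(Yall); nm <- nrow(Yall); nf <- 10
init <- rnorm(nm * nf + nu * nf)                  # random initial X and Theta
fit <- optim(init, cofi("J"), cofi("grad"),
             Y = norm$Ynorm, R = Rall, num_users = nu, num_movies = nm,
             num_features = nf, lambda = 10, method = "L-BFGS-B",
             control = list(maxit = 200))
X     <- matrix(fit$par[1:(nm * nf)], nm, nf)
Theta <- matrix(fit$par[-(1:(nm * nf))], nu, nf)
pred  <- X %*% t(Theta) + norm$Ymean              # add the movie means back
top10 <- order(pred[, 1], decreasing = TRUE)[1:10]
```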

## 
## 
## New user ratings:
## Rated 4 for Toy Story (1995) 
## Rated 3 for Twelve Monkeys (1995) 
## Rated 5 for Usual Suspects, The (1995) 
## Rated 4 for Outbreak (1995) 
## Rated 5 for Shawshank Redemption, The (1994) 
## Rated 3 for While You Were Sleeping (1995) 
## Rated 5 for Forrest Gump (1994) 
## Rated 2 for Silence of the Lambs, The (1991) 
## Rated 4 for Alien (1979) 
## Rated 5 for Die Hard 2 (1990) 
## Rated 5 for Sphere (1998)
##      index                                         MovieName
## 1293  1293                                   Star Kid (1997)
## 1189  1189                                Prefontaine (1997)
## 1653  1653 Entertaining Angels: The Dorothy Day Story (1996)
## 1201  1201        Marlene Dietrich: Shadow and Light (1996) 
## 1500  1500                         Santa with Muscles (1996)
## 1599  1599                     Someone Else's America (1995)
## 1122  1122                    They Made Me a Criminal (1939)
## 1467  1467              Saint of Fort Washington, The (1993)
## 814    814                     Great Day in Harlem, A (1994)
## 1536  1536                              Aiqing wansui (1994)

Final Comments on CoFi

I tried several optimization methods: BFGS, CG and L-BFGS-B:

  • Both CG and L-BFGS-B produce the same top 10 as the Octave implementation, in a different order
  • BFGS produces a result that is close, but not exact. That is, I need to expand the list to the top 12 to get the 10 recommendations produced by the other methods