Introduction

Welcome to my blog. If you are reading this, most likely you’ve heard of or have taken Andrew Ng’s Machine Learning course on Coursera. One thing I always wished was that the course were taught in R instead of Octave/Matlab, so I decided to do all of the course assignments in R.

Implementation Notes
This document is aligned with the course assignments, therefore NO math derivations. The ML code is wrapped into the modeling interface of the lm and glm functions: you can supply data frames directly together with model specifications in the format y~x1+x2… I will make every effort to avoid using any packages other than ggplot2. Enjoy!

The Github repository for this blog can be found here.

Part 1: Gradient Descent

Let’s start by loading the housing data and the food truck profit data from ex1. Note that the data are loaded into data frames and not into individual X and y matrices.
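A minimal sketch of that loading step, assuming the course files ex1data1.txt (food truck data) and ex1data2.txt (housing data) sit in the working directory; the housing column names footage, bedroom and price match the summaries shown later, while df_profit and its column names are my own choice for illustration:

```r
# Load the two ex1 datasets into data frames (no separate X and y matrices).
df_profit <- read.csv("ex1data1.txt", header = FALSE,
                      col.names = c("population", "profit"))
df_prices <- read.csv("ex1data2.txt", header = FALSE,
                      col.names = c("footage", "bedroom", "price"))
```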

Gradient descent

Next, we will implement gradient descent similar to the one in gradientDescentMulti.m. The function below has the following inputs:

  • X,y data in matrix format
  • theta initial value for \(\theta\)
  • alpha the learning rate \(\alpha\)
  • num_iters how many iterations the loop should make
  • threshold break execution if the difference in \(J(\theta)\) from one iteration to the next falls below threshold

The function outputs a list that has 2 elements:

  • theta parameter vector
  • iteration how many iterations it took to converge
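A sketch of how such a loop might look in R, following the inputs and outputs listed above; the function name gradientDescent and the squared-error cost \(J(\theta) = \frac{1}{2m}\sum(X\theta - y)^2\) used for the stopping rule are my own choices, and the original implementation may differ in detail:

```r
gradientDescent <- function(X, y, theta, alpha, num_iters, threshold = 1e-9) {
  m <- nrow(X)
  J_prev <- Inf
  for (i in seq_len(num_iters)) {
    error <- X %*% theta - y
    theta <- theta - (alpha / m) * (t(X) %*% error)   # simultaneous update
    J <- sum((X %*% theta - y)^2) / (2 * m)
    # stop early once the improvement in the cost falls below the threshold
    if (abs(J_prev - J) < threshold) break
    J_prev <- J
  }
  list(theta = theta, iteration = i)
}
```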

Testing: Check the results of Gradient Descent against the built-in lm function

##      [,1]     [,2]
## y -3.8952 1.192975
## (Intercept)           X 
##   -3.895781    1.193034

Feature Normalization

In this part we will look at the housing prices dataset df_prices. More specifically, the variables that will be part of the X matrix: footage and bedroom:

##     footage        bedroom    
##  Min.   : 852   Min.   :1.00  
##  1st Qu.:1432   1st Qu.:3.00  
##  Median :1888   Median :3.00  
##  Mean   :2001   Mean   :3.17  
##  3rd Qu.:2269   3rd Qu.:4.00  
##  Max.   :4478   Max.   :5.00

Values of the feature bedroom are about 1,000 times smaller than the values of footage. Performing feature scaling will ensure that gradient descent converges much more quickly.
The function featureNormalize takes the following inputs:

  • xm matrix of X’s
  • infl scalar, 1 if intercept is present in the model, 0 otherwise

The function outputs a list that has 2 elements:

  • x matrix of standardized X’s
  • scalingMatrix matrix of means and st. deviations for each regressor. If infl=1, the mean is set to 0 and the st. dev to 1 for the first column of X’s
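A possible implementation consistent with the inputs and outputs above (the original featureNormalize in the repository may differ in detail):

```r
featureNormalize <- function(xm, infl = 1) {
  mu    <- colMeans(xm)
  sigma <- apply(xm, 2, sd)
  if (infl == 1) {       # do not rescale the intercept column
    mu[1]    <- 0
    sigma[1] <- 1
  }
  x <- sweep(sweep(xm, 2, mu, "-"), 2, sigma, "/")
  scalingMatrix <- rbind(mu = mu, sigma = sigma)
  list(x = x, scalingMatrix = scalingMatrix)
}
```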

Testing: Compare the original matrix of X’s to the feature-normalized one

##         footage   bedroom
## mu    2000.6809 3.1702128
## sigma  794.7024 0.7609819
##          footage    bedroom
## [1,]  0.13000987 -0.2236752
## [2,] -0.50418984 -0.2236752
## [3,]  0.50247636 -0.2236752
## [4,] -0.73572306 -1.5377669
## [5,]  1.25747602  1.0904165
## [6,] -0.01973173  1.0904165
##   footage bedroom
## 1    2104       3
## 2    1600       3
## 3    2400       3
## 4    1416       2
## 5    3000       4
## 6    1985       4

Putting it all together: gd function

In this section I would like to discuss a “wrapper” function gd that works with the functions above in a similar fashion as lm does with lm.fit. That is, the goal here is to make gradient descent a practical solution that could serve as an alternative to lm.
The function gd takes the following inputs:

  • formula an object of class “formula”: a symbolic description of the model to be fitted
  • data data frame
  • subset an optional vector specifying a subset of observations to be used in the fitting process
  • theta a vector of initial values of \(\theta\). Length of the vector should match model specification
  • alpha the learning rate \(\alpha\)
  • num_iters how many iterations the loop should make
  • threshold break execution if the difference in \(J(\theta)\) from one iteration to the next falls below threshold
  • normalize logical. If FALSE, the scaling matrix will have mu=0 and st. dev=1 for all variables

The function outputs a list that has 3 elements:

  • theta parameter vector of model coefficients
  • scalingMatrix matrix of means and st. deviations for each regressor. If there is an Intercept in the model, the mean is set to 0 and the st. dev to 1 for the first column of X’s
  • iteration how many iterations it took to converge
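One way such a wrapper could be assembled with model.frame/model.matrix, shown here as a sketch of the idea rather than the repository’s exact code; it reuses the gradientDescent and featureNormalize sketches above, and subset handling is omitted for brevity:

```r
gd <- function(formula, data, subset = NULL, theta = NULL,
               alpha = 0.01, num_iters = 10000, threshold = 1e-9,
               normalize = TRUE) {
  mf <- model.frame(formula, data = data)          # subset handling omitted
  X  <- model.matrix(formula, mf)
  y  <- matrix(model.response(mf), ncol = 1)
  infl <- as.integer("(Intercept)" %in% colnames(X))
  if (normalize) {
    norm <- featureNormalize(X, infl)
    X    <- norm$x
    scalingMatrix <- norm$scalingMatrix
  } else {
    scalingMatrix <- rbind(mu = rep(0, ncol(X)), sigma = rep(1, ncol(X)))
  }
  if (is.null(theta)) theta <- matrix(0, ncol(X), 1)
  fit <- gradientDescent(X, y, theta, alpha, num_iters, threshold)
  list(theta = fit$theta, scalingMatrix = scalingMatrix,
       iteration = fit$iteration)
}
```

With this interface a call reads just like lm, e.g. gd(price ~ bedroom + footage, data = df_prices, alpha = 0.1, num_iters = 5000).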

Compare to lm

In this section we are going to look at the house prices dataset and compare the results of the new gd function vs. R’s built-in lm. For this test we will also be performing feature normalization:

  • gd:
## $theta
##                   [,1]
## (Intercept) 340412.659
## bedroom      -6644.255
## footage     110625.831
## 
## $scalingMatrix
##       (Intercept)      bedroom      footage
## mu              0 2.698896e-16 2.315118e-17
## sigma           1 1.000000e+00 1.000000e+00
## 
## $iteration
## [1] 2161
  • lm
## 
## Call:
## lm(formula = price ~ bedroom + footage, data = df_prices_scaled)
## 
## Coefficients:
## (Intercept)      bedroom      footage  
##      340413        -6649       110631

As expected, the results for the 2 approaches are very close. Note: gd also outputs scaling data that can be used to pre-process the data prior to using predict.

Part 2: Logistic Regression

In this section, I will implement logistic regression and compare the performance to glm(family=”binomial”) output. Our dataset contains data on university admittance based on the scores a student gets on two exams.
Let’s load the data…

Sigmoid function

Before we can start with the actual cost function, we need to implement the sigmoid function, \(g(x)\):
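In R this is a one-liner that vectorizes over scalars, vectors and matrices alike:

```r
# Sigmoid g(x) = 1 / (1 + e^(-x)); works element-wise on matrices.
sigmoid <- function(z) 1 / (1 + exp(-z))
```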

Cost function and gradient (Regularized)

In this section we will be implementing a function similar to costFunctionReg.m. This function will have two outputs: cost and gradient. In place of Octave/Matlab’s fminunc we will be using R’s general purpose optimization function optim. The interfaces of the two functions are very similar.
Some additional notes about implementation in R:

  • Since the function is capable of dealing with regularization, it also requires an input infl that indicates whether or not the model specification has an intercept, to avoid regularizing it
  • The actual cost/gradient function is wrapped into a simple function that takes only one input, type, and outputs two closures that can be supplied to optim as separate functions: cost and gradient

The function ComputeCostGradient takes the following inputs:

  • type 2 options: “J” to output cost function, “grad” to output gradient function

The function outputs 2 closures:

  • J for cost function
  • grad for gradient function
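A hedged sketch of such a closure factory; the argument order of the returned functions (theta first, then X, y, lambda, infl) is my assumption, chosen so they can be passed to optim as fn and gr, and the original may differ:

```r
ComputeCostGradient <- function(type = c("J", "grad")) {
  type <- match.arg(type)
  # Regularized logistic-regression cost
  J <- function(theta, X, y, lambda = 0, infl = 1) {
    m <- nrow(X)
    h <- sigmoid(X %*% theta)
    reg <- if (infl == 1) theta[-1] else theta      # skip the intercept
    sum(-y * log(h) - (1 - y) * log(1 - h)) / m +
      lambda / (2 * m) * sum(reg^2)
  }
  # Regularized gradient (intercept term left unpenalized when infl = 1)
  grad <- function(theta, X, y, lambda = 0, infl = 1) {
    m <- nrow(X)
    h <- sigmoid(X %*% theta)
    g <- t(X) %*% (h - y) / m
    penalty <- lambda / m * theta
    if (infl == 1) penalty[1] <- 0
    as.vector(g + penalty)
  }
  if (type == "J") J else grad
}
```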

Putting it all together

In this section I will be introducing the “wrapper” function gdlreg, which is modeled on the lm function.
The function gdlreg takes the following inputs:

  • formula an object of class “formula”: a symbolic description of the model to be fitted
  • data data frame
  • subset an optional vector specifying a subset of observations to be used in the fitting process
  • theta a vector of initial values of \(\theta\). Length of the vector should match model specification
  • lambda a regularization parameter
  • method the method to be used. See ?optim/method for details

The function outputs:

  • theta parameter vector of model coefficients
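A sketch of how gdlreg might tie ComputeCostGradient and optim together; the defaults and the omitted subset handling are my own simplifications, not necessarily what the repository does:

```r
gdlreg <- function(formula, data, subset = NULL, theta = NULL,
                   lambda = 0, method = "BFGS") {
  mf <- model.frame(formula, data = data)          # subset handling omitted
  X  <- model.matrix(formula, mf)
  y  <- matrix(model.response(mf), ncol = 1)
  infl <- as.integer("(Intercept)" %in% colnames(X))
  if (is.null(theta)) theta <- rep(0, ncol(X))
  fit <- optim(par = theta,
               fn  = ComputeCostGradient("J"),
               gr  = ComputeCostGradient("grad"),
               X = X, y = y, lambda = lambda, infl = infl,
               method = method)
  setNames(fit$par, colnames(X))                   # named coefficient vector
}
```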

Plotting decision boundary

  • Coming up …

Compare to glm

In this section we will compare the new gdlreg2 function to R’s glm function. We will be using the exam/admittance data loaded at the beginning of this section:

  • gdlreg2:
## (Intercept)          e1          e2 
## -25.1647048   0.2062524   0.2015046
  • glm/binomial:
## (Intercept)          e1          e2 
## -25.1613335   0.2062317   0.2014716

Part 3: Multi-class Classification

In this section I will be implementing one-vs-all logistic regression to recognize hand-written digits. We have 2 files that contain 5,000 training examples of handwritten digits and the corresponding labels. These files are in native Octave/Matlab format and require the package rmatio from CRAN to read them into R. The same 2 datasets will be used in the following section on Neural Networks.


Cost function and gradient (Regularized)

In this section we will be implementing a function similar to lrCostFunction.m. This function happens to be exactly the same as ComputeCostGradient defined in the previous section, so we will be reusing it here.

One-vs-all Classification

In this part of the exercise, I will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the K classes in our dataset. In the handwritten digits dataset, K = 10, but our code will work for any value of K.

To accomplish this, I will be using the function gdlregMulti, which mimics OneVSAll.m. The function gdlregMulti takes the following inputs:

  • formula an object of class “formula”: a symbolic description of the model to be fitted
  • data data frame
  • subset an optional vector specifying a subset of observations to be used in the fitting process
  • theta a vector of initial values of \(\theta\). Length of the vector should match model specification
  • lambda a regularization parameter
  • method the method to be used. See ?optim/method for details

The function outputs a list that has 2 elements:

  • all_theata a dataframe, each row contains parameter estimates for each class K
  • y_class vector of class labels in the model
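A sketch of the one-vs-all loop implied by the description above, reusing the ComputeCostGradient closures; the internals are my own illustration, and all_theata is kept as spelled in the text:

```r
gdlregMulti <- function(formula, data, subset = NULL, theta = NULL,
                        lambda = 0, method = "BFGS") {
  mf <- model.frame(formula, data = data)          # subset handling omitted
  X  <- model.matrix(formula, mf)
  yv <- model.response(mf)
  infl    <- as.integer("(Intercept)" %in% colnames(X))
  y_class <- sort(unique(yv))
  if (is.null(theta)) theta <- rep(0, ncol(X))
  # One binary "class k vs. the rest" fit per class; rows of all_theata = classes.
  all_theata <- t(sapply(y_class, function(k) {
    yk <- matrix(as.numeric(yv == k), ncol = 1)
    optim(theta, ComputeCostGradient("J"), ComputeCostGradient("grad"),
          X = X, y = yk, lambda = lambda, infl = infl,
          method = method)$par
  }))
  list(all_theata = as.data.frame(all_theata), y_class = y_class)
}
```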

One-vs-all Prediction

In this section I will examine predictOneVsAll function that mimics predictOneVsAll.m.

The function takes one input, the fitted gdlregMulti classifier, which contains 2 elements:

  • all_theata a dataframe, each row contains parameter estimates for each class K
  • y_class vector of class labels in the model
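A hedged sketch: besides the fitted classifier, prediction needs a design matrix for the observations to score, which I pass explicitly here as newX; the original predictOneVsAll may construct it from a formula/data pair instead:

```r
predictOneVsAll <- function(classifier, newX) {
  theta_mat <- as.matrix(classifier$all_theata)   # K x p parameter matrix
  scores    <- sigmoid(newX %*% t(theta_mat))     # n x K class scores
  classifier$y_class[max.col(scores)]             # label with the highest score
}
```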

Testing

Evaluation of model performance will consist of several steps:

  1. Split the dataset into 3 parts: training (70%), cross-validation (20%), test (10%)
  2. Train the model on the training set over a range of \(\lambda\) values
  3. Evaluate the accuracy of each model on the cross-validation set and pick the value of \(\lambda\) that maximizes it
  4. Train the model on the training dataset with the optimal \(\lambda\)
  5. Evaluate the accuracy of the final model on the test set
## Optimal value of lambda: 2
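The selection loop above could be scripted roughly as follows; df_digits, the y ~ . formula and the \(\lambda\) grid are placeholders for illustration, not the exact objects used to produce the output above:

```r
set.seed(1)
n   <- nrow(df_digits)
idx <- sample(c("train", "cv", "test"), n, replace = TRUE,
              prob = c(0.7, 0.2, 0.1))
lambdas <- c(0.01, 0.1, 1, 2, 3, 10)
# Cross-validation accuracy for each candidate lambda
acc <- sapply(lambdas, function(l) {
  fit  <- gdlregMulti(y ~ ., data = df_digits[idx == "train", ], lambda = l)
  newX <- model.matrix(y ~ ., df_digits[idx == "cv", ])
  mean(predictOneVsAll(fit, newX) == df_digits$y[idx == "cv"])
})
lambda_opt <- lambdas[which.max(acc)]
```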

Final model evaluation

In this section I will be estimating the model based on the training set and the optimal value of \(\lambda\) found above:

## Model Accuracy: 87.7%
##          actual
## predicted  1  2  3  4  5  6  7  8  9 10
##        1  46  0  2  0  1  1  3  1  0  0
##        2   0 41  6  0  0  0  0  0  0  0
##        3   0  0 45  0  2  0  0  3  0  1
##        4   0  1  0 43  0  1  3  0  3  0
##        5   0  1  2  0 35  0  0  2  0  3
##        6   1  0  0  1  2 58  0  0  0  0
##        7   0  0  0  0  0  1 49  1  2  0
##        8   0  1  0  1  0  0  0 39  0  0
##        9   0  1  2  5  0  0  1  2 46  0
##        10  0  1  0  0  0  0  1  0  3 41

Part 4: Neural Networks (coming up)

Part 8: Anomaly Detection and Recommender Systems

In this part I will implement collaborative filtering and apply it to a dataset of movie ratings. This dataset consists of ratings on a scale of 1 to 5. The dataset has nu=943 users and nm=1,682 movies. Start by loading the movie database files:

Cost function and gradient (Regularized)

In this section we will be implementing a function similar to cofiCostFunc.m. This function will have two outputs: cost and gradient.
Some additional notes about implementation in R:

  • The actual cost/gradient function is wrapped into a simple function that takes only one input, type, and outputs two closures that can be supplied to optim as separate functions: cost and gradient

The function cofi takes the following inputs:

  • type 2 options: “J” to output cost function, “grad” to output gradient function

The inner anonymous function takes the following inputs:

  • params initial unrolled values for X and Theta stacked
  • Y matrix of movie ratings
  • R binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise

The function outputs 2 closures:

  • J for cost function
  • grad for gradient function
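A sketch of cofi consistent with the description above; the returned closures take the unrolled parameter vector followed by Y, R, the problem dimensions and lambda, which is an assumed argument order rather than the repository’s exact one:

```r
cofi <- function(type = c("J", "grad")) {
  type <- match.arg(type)
  # Rebuild X (movies x features) and Theta (users x features) from params.
  unroll <- function(params, num_users, num_movies, num_features) {
    X     <- matrix(params[1:(num_movies * num_features)],
                    num_movies, num_features)
    Theta <- matrix(params[-(1:(num_movies * num_features))],
                    num_users, num_features)
    list(X = X, Theta = Theta)
  }
  J <- function(params, Y, R, num_users, num_movies, num_features, lambda = 0) {
    p   <- unroll(params, num_users, num_movies, num_features)
    err <- (p$X %*% t(p$Theta) - Y) * R            # errors only where rated
    0.5 * sum(err^2) + lambda / 2 * (sum(p$Theta^2) + sum(p$X^2))
  }
  grad <- function(params, Y, R, num_users, num_movies, num_features, lambda = 0) {
    p   <- unroll(params, num_users, num_movies, num_features)
    err <- (p$X %*% t(p$Theta) - Y) * R
    X_grad     <- err %*% p$Theta + lambda * p$X
    Theta_grad <- t(err) %*% p$X + lambda * p$Theta
    c(as.vector(X_grad), as.vector(Theta_grad))    # unrolled like params
  }
  if (type == "J") J else grad
}
```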

Gradient Checking

Before proceeding further, I’m going to test the performance of the cofi function; specifically, I will be comparing analytical gradients to numerical ones. To perform the gradient checking, I created the function checkCOGradients. This function will output two columns of numbers: numerical and analytical gradients. The function takes the following inputs:

  • costFunc function to be evaluated
  • Y matrix of movie ratings
  • R binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise
  • num_users
  • num_movies
  • num_features
  • lambda a regularization parameter
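A minimal numerical-gradient checker along these lines; it assumes costFunc is the closure factory (so costFunc("J") and costFunc("grad") return the two functions), draws random parameters itself, and uses the usual central-difference perturbation of 1e-4:

```r
checkCOGradients <- function(costFunc, Y, R, num_users, num_movies,
                             num_features, lambda = 0) {
  J    <- costFunc("J")
  grad <- costFunc("grad")
  params <- runif(num_movies * num_features + num_users * num_features)
  analytical <- grad(params, Y, R, num_users, num_movies, num_features, lambda)
  e <- 1e-4
  # Central differences of the cost, one parameter at a time
  numeric <- sapply(seq_along(params), function(i) {
    hi <- lo <- params
    hi[i] <- hi[i] + e
    lo[i] <- lo[i] - e
    (J(hi, Y, R, num_users, num_movies, num_features, lambda) -
     J(lo, Y, R, num_users, num_movies, num_features, lambda)) / (2 * e)
  })
  cbind(Numeric = numeric, Analytical = analytical)
}
```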

To test the cost function I will be creating a small test dataset:

Perform gradient checking:

##          Numeric Analytical
##  [1,]  0.4906855  0.4906855
##  [2,]  1.7131053  1.7131053
##  [3,]  1.2870548  1.2870548
##  [4,]  1.5119594  1.5119594
##  [5,]  1.2060363  1.2060363
##  [6,] -1.5411221 -1.5411221
##  [7,]  1.9451505  1.9451505
##  [8,]  1.4613903  1.4613903
##  [9,]  1.7167589  1.7167589
## [10,]  1.3693976  1.3693976
## [11,] -4.2376488 -4.2376488
## [12,] -0.8775910 -0.8775910
## [13,] -0.6593336 -0.6593336
## [14,] -0.7745479 -0.7745479
## [15,] -0.6178293 -0.6178293
## [16,]  1.4007275  1.4007275
## [17,] -5.8526847 -5.8526847
## [18,]  0.0000000  0.0000000
## [19,]  0.0000000  0.0000000
## [20,]  7.4273223  7.4273223
## [21,]  6.2368812  6.2368812
## [22,]  0.0000000  0.0000000
## [23,]  0.0000000  0.0000000
## [24,] -4.4764255 -4.4764255
## [25,] -6.0927194 -6.0927194
## [26,]  0.0000000  0.0000000
## [27,]  0.0000000  0.0000000

Analytical and numerical gradients look exactly the same.

Normalize Ratings

In this step we will be implementing the function that normalizes movie ratings by the mean rating for each movie. The resulting normalized entries will have a mean rating of zero for each movie. The function takes the following inputs:

  • Y matrix of movie ratings
  • R binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise

The function outputs 2 matrices:

  • Ynorm mean normalized version of Y
  • Ymean vector of length num_movies with the mean rating for each movie
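A compact sketch of this normalization, computing each movie’s mean only over the entries that were actually rated:

```r
normalizeRatings <- function(Y, R) {
  Ymean <- rowSums(Y * R) / pmax(rowSums(R), 1)   # avoid division by zero
  Ynorm <- (Y - Ymean) * R                        # zero out unrated entries
  list(Ynorm = Ynorm, Ymean = Ymean)
}
```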

We can now move on to creating movie recommendations.

Learning Movie Recommendations

In this step we will train the collaborative filtering model. I’ll be using the same movie preferences as in the course material:
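A rough sketch of the training pipeline, assuming my_ratings is the new user’s rating vector, 10 latent features, \(\lambda\)=10 and 200 L-BFGS-B iterations; all of these names and settings are illustrative rather than the exact ones used to produce the output below:

```r
Yall <- cbind(my_ratings, Y)                      # add the new user as column 1
Rall <- cbind(as.numeric(my_ratings != 0), R)
norm <- normalizeRatings(Yall, Rall)
nu <- ncol(Yall); nm <- nrow(Yall); nf <- 10
init <- rnorm(nm * nf + nu * nf)                  # random initial X and Theta
fit <- optim(init, cofi("J"), cofi("grad"),
             Y = norm$Ynorm, R = Rall, num_users = nu, num_movies = nm,
             num_features = nf, lambda = 10, method = "L-BFGS-B",
             control = list(maxit = 200))
X     <- matrix(fit$par[1:(nm * nf)], nm, nf)
Theta <- matrix(fit$par[-(1:(nm * nf))], nu, nf)
pred  <- X %*% t(Theta) + norm$Ymean              # add the movie means back
top10 <- order(pred[, 1], decreasing = TRUE)[1:10]
```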

## 
## 
## New user ratings:
## Rated 4 for Toy Story (1995) 
## Rated 3 for Twelve Monkeys (1995) 
## Rated 5 for Usual Suspects, The (1995) 
## Rated 4 for Outbreak (1995) 
## Rated 5 for Shawshank Redemption, The (1994) 
## Rated 3 for While You Were Sleeping (1995) 
## Rated 5 for Forrest Gump (1994) 
## Rated 2 for Silence of the Lambs, The (1991) 
## Rated 4 for Alien (1979) 
## Rated 5 for Die Hard 2 (1990) 
## Rated 5 for Sphere (1998)
##      index                                         MovieName
## 1293  1293                                   Star Kid (1997)
## 1189  1189                                Prefontaine (1997)
## 1653  1653 Entertaining Angels: The Dorothy Day Story (1996)
## 1201  1201        Marlene Dietrich: Shadow and Light (1996) 
## 1500  1500                         Santa with Muscles (1996)
## 1599  1599                     Someone Else's America (1995)
## 1122  1122                    They Made Me a Criminal (1939)
## 1467  1467              Saint of Fort Washington, The (1993)
## 814    814                     Great Day in Harlem, A (1994)
## 1536  1536                              Aiqing wansui (1994)

Final Comments on CoFi

I tried several optimization methods: BFGS, CG and L-BFGS-B:

  • Both CG and L-BFGS-B produce the same top 10 as the Octave implementation, in a different order
  • BFGS produces a result that is close, but not exact. That is, I need to expand the list to the top 12 to get the 10 recommendations produced by the other methods