Welcome to my blog. If you are reading this, you have most likely heard of or taken Andrew Ng's Machine Learning course on Coursera. One thing I always wished was that the course were taught in R instead of Octave/Matlab, so I decided to do all of the course assignments in R.
Implementation Notes
This document is aligned with the course assignments, therefore there are NO math derivations. The ML code is wrapped into the model format of the lm and glm functions, so you can supply data frames directly along with model specifications of the form y~x1+x2… I will make every effort to avoid using any packages other than ggplot2. Enjoy!
The Github repository for this blog can be found here.
Let's start by loading the housing data and the food truck profit data from ex1. Note that the data is loaded into data frames and not into individual X and y matrices.
df_prices<-read.csv("w2/ex1data2.txt",header = FALSE)
names(df_prices)<-c("footage","bedroom","price")
df_foodTruck<-read.csv("w2/ex1data1.txt",header = FALSE)
names(df_foodTruck)<-c("X","y")
Next, let's visualize the data and examine the relationship between the population of a city and the profit of a food truck in that city:
ggplot(data=df_foodTruck)+geom_point((aes(x=X,y=y)),shape=4,color="red",size=3)+
scale_y_continuous(labels = scales::dollar)+
scale_x_continuous(labels = scales::comma)+
labs(x="Population of City in 10,000s", y="Price in $10,000s")+
theme(panel.background = element_blank(),axis.line = element_line(colour = "black"))
In this section we will examine the implementation of the cost function:
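The definition of computeCost is not reproduced in this post; below is a minimal sketch of the vectorized squared-error cost, \(J(\theta)=\frac{1}{2m}\sum(X\theta-y)^2\), which the calls further down assume (the exact code is an assumption, mirroring computeCost.m):
# minimal sketch of computeCost (assumed, not shown in the original post)
computeCost<-function(X, y, theta){
m<-nrow(X) # number of training examples
error<-X%*%theta-y # residuals h(x) - y
sum(error^2)/(2*m) # squared-error cost
}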
Do the testing:
X<-as.matrix(cbind(rep(1,nrow(df_foodTruck)),df_foodTruck[,c("X")]))
y<-as.matrix(df_foodTruck["y"])
J<-computeCost(X, y, theta=c(0,0))
sprintf("Computed value of cost function = %s. Expected cost value (approx) 32.07",
round(J,digits=2))
## [1] "Computed value of cost function = 32.07. Expected cost value (approx) 32.07"
Next, we will implement gradient descent similar to the one in gradientDescentMulti.m. The function below has the following inputs:
X, y: data in matrix format
theta: initial value for \(\theta\)
alpha: the learning rate \(\alpha\)
num_iters: how many iterations the loop should make
threshold: break execution if the difference in \(J(\theta)\) from one iteration to another falls below threshold
The function outputs a list that has 2 elements:
theta: parameter vector
iteration: how many iterations it took to converge
gradientDescentMulti<-function(X, y, theta, alpha, num_iters,threshold){
J_history<-vector(mode = "numeric", length = num_iters)
m<-nrow(X)
for(i in seq_len(num_iters)){
theta<-theta-(t(X)%*%(X%*%theta-y))*(alpha/m)
J_history[i]<-computeCost(X, y, theta)
# relative change of the cost function between iterations
converged<-(i>1 && abs(1-(J_history[i]/J_history[i-1]))<threshold)
if(!is.na(converged)){
if(converged) break
} else {
# cost turned NaN/Inf: gradient descent diverged because alpha is too large
stop(sprintf("Failed to converge. Learning rate alpha = '%s' is too large.",
alpha), domain = NA)
}
}
list(theta=theta,iteration=i)
}
Testing: Check the results of Gradient Descent against the built-in lm function
theta_GD<-gradientDescentMulti(X,y,theta=c(0,0),alpha=.02,num_iters = 10000,threshold=5e-11)$theta
theta_LM<-lm(y~X,data=df_foodTruck)$coef
#Compare to Linear regression
t(theta_GD)
##      [,1]     [,2]
## y -3.8952 1.192975
## (Intercept) X
## -3.895781 1.193034
In this part we will look at the housing prices dataset df_prices, more specifically at the variables that will be part of the X matrix: footage and bedroom:
## footage bedroom
## Min. : 852 Min. :1.00
## 1st Qu.:1432 1st Qu.:3.00
## Median :1888 Median :3.00
## Mean :2001 Mean :3.17
## 3rd Qu.:2269 3rd Qu.:4.00
## Max. :4478 Max. :5.00
Values of the feature bedroom are about 1,000 times smaller than the values of footage. Performing feature scaling will ensure that gradient descent converges much more quickly.
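Concretely, featureNormalize below standardizes each regressor as \(x_{norm} = (x - \mu)/\sigma\), where \(\mu\) and \(\sigma\) are the column mean and standard deviation.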
The function featureNormalize takes the following inputs:
xm: matrix of X's
infl: scalar, 1 if an intercept is present in the model, 0 otherwise
The function outputs a list that has 2 elements:
x: matrix of standardized X's
scalingMatrix: matrix of means and st. deviations for each regressor. If infl=1, the mean is set to 0 and the st. dev to 1 for the first column of X's
featureNormalize<-function(xm, infl){
scalingMatrix<-apply(xm,2,function(x){
cbind(mean(x),sd(x))
})
rownames(scalingMatrix)<-c("mu","sigma")
# if model has constant, do not scale the constant
if(infl==1){
scalingMatrix[1,1]<-0
scalingMatrix[2,1]<-1
}
mus<-scalingMatrix[rep(1,nrow(xm)),]
sds<-scalingMatrix[rep(2,nrow(xm)),]
X_norm<-(xm-mus)/sds
list(x=X_norm,scalingMatrix=scalingMatrix)
}
Testing: Compare the original matrix of X's to the feature-normalized one
normalize<-featureNormalize(
xm=as.matrix(df_prices[c("footage","bedroom")]),
infl=0
)
normalize$scalingMatrix
##        footage   bedroom
## mu 2000.6809 3.1702128
## sigma 794.7024 0.7609819
## footage bedroom
## [1,] 0.13000987 -0.2236752
## [2,] -0.50418984 -0.2236752
## [3,] 0.50247636 -0.2236752
## [4,] -0.73572306 -1.5377669
## [5,] 1.25747602 1.0904165
## [6,] -0.01973173 1.0904165
## footage bedroom
## 1 2104 3
## 2 1600 3
## 3 2400 3
## 4 1416 2
## 5 3000 4
## 6 1985 4
gd function
In this section I would like to discuss a "wrapper" function gd that works with the functions above in a similar fashion as lm does with lm.fit. The goal here is to make gradient descent a practical solution that could serve as an alternative to lm.
The function gd takes the following inputs:
formula: an object of class "formula", a symbolic description of the model to be fitted
data: a data frame
subset: an optional vector specifying a subset of observations to be used in the fitting process
theta: a vector of initial values of \(\theta\). The length of the vector should match the model specification
alpha: the learning rate \(\alpha\)
num_iters: how many iterations the loop should make
threshold: break execution if the difference in \(J(\theta)\) from one iteration to another falls below threshold
normalize: logical. If FALSE the scaling matrix will have mu=0 and st. dev=1 for all variables
The function outputs a list that has 3 elements:
theta: parameter vector of model coefficients
scalingMatrix: matrix of means and st. deviations for each regressor. If there is an intercept in the model, the mean is set to 0 and the st. dev to 1 for the first column of X's
iteration: how many iterations it took to converge
gd<-function(formula,data,subset,theta,alpha=1e-4, num_iters=1e+4, threshold=5e-10, normalize=T){
mf <- match.call(expand.dots = F)
m <- match(c("formula", "data","subset"),
names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
mt <- attr(mf, "terms")
y <- model.response(mf, "numeric")
x <- model.matrix(mt, mf)
# flag if intercept was selected. will set lambda vector[1] (regularization variable) to zero if intercept present in formula
infl <- attr(mt,"intercept")
#
n<-ncol(x)
if(n!=length(theta)) stop("Model formula and initial theta have incompatible dimensions")
#lm.fit (x, y)$coefficients
# normalize features and keep the scaling matrix computed from the original x
scalingMatrix<-NULL
if(normalize){
norm<-featureNormalize(x,infl=infl)
x<-norm$x
scalingMatrix<-norm$scalingMatrix
}
out<-gradientDescentMulti(X=x, y=y, theta=theta, alpha=alpha, num_iters=num_iters,threshold=threshold)
list(theta=out$theta, scalingMatrix=scalingMatrix, iteration=out$iteration)
}
In this section we are going to look at the house prices dataset and compare the results of the new gd function vs. R's built-in lm. For this test we will also be performing feature normalization:
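The call that produced the output below is not shown in the original post; it was presumably something like this (the object name fit_gd and the alpha and threshold values are assumptions):
fit_gd<-gd(price~bedroom+footage,data=df_prices,theta=c(0,0,0),alpha=0.05,threshold=5e-11)
fit_gd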
## $theta
## [,1]
## (Intercept) 340412.659
## bedroom -6644.255
## footage 110625.831
##
## $scalingMatrix
## (Intercept) bedroom footage
## mu 0 3.1702128 2000.6809
## sigma 1 0.7609819 794.7024
##
## $iteration
## [1] 2161
# create a copy of the dataset prior to scaling
df_prices_scaled<-df_prices
# scale
df_prices_scaled[,c("footage","bedroom")]<-lapply(df_prices[,c("footage","bedroom")],scale)
lm(price~bedroom+footage,data=df_prices_scaled)
## 
## Call:
## lm(formula = price ~ bedroom + footage, data = df_prices_scaled)
##
## Coefficients:
## (Intercept) bedroom footage
## 340413 -6649 110631
As expected, the results for the 2 approaches are very close. Note: gd also outputs the scaling data that can be used to pre-process new data prior to prediction.
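For example, a minimal sketch of scoring a hypothetical 1,650 sq. ft., 3-bedroom house with the fit_gd object assumed above (the house values are made up for illustration):
# scale the new observation with the stored means and st. deviations
sm<-fit_gd$scalingMatrix
x_new<-c(1,(3-sm["mu","bedroom"])/sm["sigma","bedroom"],(1650-sm["mu","footage"])/sm["sigma","footage"])
# predicted price = x_new' * theta
as.numeric(x_new%*%fit_gd$theta)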
In this section, I will implement logistic regression and compare the results to the output of glm(family="binomial"). Our dataset contains data on university admittance based on the scores a student received on two exams.
Let's load the data…
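The loading step itself is hidden in the original; a sketch, assuming the course file ex2data1.txt lives in a w3 folder (the path is an assumption; the column names match the plot code below):
grades<-read.csv("w3/ex2data1.txt",header = FALSE)
names(grades)<-c("e1","e2","success")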
Next, let's visualize the data and examine the relationship between our variables. It would appear that students are more likely to get admitted if they scored high on both exams:
ggplot(data=grades)+geom_point(aes(x=e1,y=e2,
shape=factor(success),color=factor(success)),size=3)+
scale_shape_manual(values=c(1,3),labels=c("Not admitted","Admitted"))+
scale_color_manual(values=c("red","black"),labels=c("Not admitted","Admitted"))+
scale_y_continuous(breaks=seq(0,100,by=10))+
scale_x_continuous(breaks=seq(0,100,by=10))+
labs(x="Exam 1 score", y="Exam 2 score", color=NULL,shape=NULL)+
theme(
axis.line=element_line(color="black"),
panel.background = element_blank(),
panel.grid.major = element_line(color="grey90"),
legend.position = c(.85,.9),
legend.background = element_rect(color="black"),
legend.key = element_blank()
)
Before we can start with the actual cost function, we need to implement the sigmoid function, \(g(x)\):
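The sigmoid definition is not reproduced in the original post; the standard one-liner, \(g(z)=1/(1+e^{-z})\), which the code below assumes:
sigmoid<-function(z) 1/(1+exp(-z))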
In this section we will be implementing a function similar to costFunctionReg.m. This function will have two outputs: cost and gradient. In place of Octave/Matlab's fminunc we will be using R's general-purpose optimization function optim. The interfaces of the two functions are very similar.
Some additional notes about the implementation in R:
infl: indicates whether or not the model specification has an intercept, to avoid regularizing it
type and output: two closures that can be supplied to optim as separate functions, cost and gradient
The function ComputeCostGradient takes the following input:
type: 2 options, "J" to output the cost function, "grad" to output the gradient function
The function outputs 2 closures:
J: the cost function
grad: the gradient function
ComputeCostGradient<-function(type){
function(X, y, theta, infl, lambda){
# length of theta or design matrix
n<-dim(X)[2]
# number of observations
m<-dim(X)[1]
# vector to drop Xo from regularization component. check if Xo supplied to the model formula
if(infl==1)
lambda_vector<-c(0,rep(1,n-1))
else
lambda_vector<-c(rep(1,n))
if(type=="J"){
(-1/m)*sum(y*log(sigmoid(X%*%theta))+(1-y)*log(1-sigmoid(X%*%theta)))+lambda/(2*m)*lambda_vector%*%theta^2
}
else if(type=="grad"){
as.numeric((1/m)*t(X)%*%(sigmoid(X%*%theta)-y)+lambda/m*lambda_vector*theta)
}
else stop("Invalid output request from CostGradient: acceptable values are: 'J' and 'grad'")
}
}
In this section I will be introducing a "wrapper" function gdlreg2 that is based on the lm function.
The function gdlreg2 takes the following inputs:
formula: an object of class "formula", a symbolic description of the model to be fitted
data: a data frame
subset: an optional vector specifying a subset of observations to be used in the fitting process
theta: a vector of initial values of \(\theta\). The length of the vector should match the model specification
lambda: a regularization parameter
method: the method to be used. See ?optim for details
The function outputs:
theta: parameter vector of model coefficients
gdlreg2<-function(formula,data,subset,theta, lambda=0, method ="Nelder-Mead"){
if(is.na(match(method,c("Nelder-Mead", "BFGS", "L-BFGS-B"))))
stop("Please select on of the tested methods: Nelder-Mead, BFGS, L-BFGS-B")
mf <- match.call(expand.dots = F)
m <- match(c("formula", "data","subset"),
names(mf), 0L)
mf <- mf[c(1L, m)]
#mf <- mf[m]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
mt <- attr(mf, "terms")
y <- model.response(mf, "numeric")
x <- model.matrix(mt, mf)
# flag if intercept was selected. will set lambda vector[1] (regularization variable) to zero if intercept present in formula
infl <- attr(mt,"intercept")
#
n<-ncol(x)
if(n!=length(theta))
stop("Model formula and initial theta have incompatible dimensions")
# two closures are created by ComputeCostGradient function for J and grad
J<-ComputeCostGradient("J")
grad<-ComputeCostGradient("grad")
# optimize based on advanced algorithm. same as Octave's fminunc
res<-optim(theta,J,grad,X=x,y=y,infl=infl,lambda=lambda,method=method)$par
names(res)<-colnames(x)
res
}
In this section we will compare the new gdlreg2 function against R's glm function, using the exam/admittance data loaded at the beginning of this section:
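The two calls that produced the coefficient estimates below are hidden in the original; presumably something along these lines (the initial theta and the method are assumptions):
# regularization off (lambda = 0) to match plain logistic regression
gdlreg2(success~e1+e2,data=grades,theta=c(0,0,0),method="BFGS")
glm(success~e1+e2,data=grades,family=binomial)$coef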
## (Intercept) e1 e2
## -25.1647048 0.2062524 0.2015046
## (Intercept) e1 e2
## -25.1613335 0.2062317 0.2014716
In this section I will be implementing one-vs-all logistic regression to recognize hand-written digits. We have 2 files that contain 5,000 training examples of handwritten digits and the corresponding labels. These files are in the native Octave/Matlab format and require the package rmatio from CRAN to read them into R. The same 2 datasets will be used in the following section on Neural Networks.
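The loading step is not reproduced in the original; a sketch, assuming the .mat file holding the 5,000 x 400 matrix X and the label vector y sits at w4/ex3data1.mat (the path is an assumption):
library(rmatio)
handwrit_data<-read.mat("w4/ex3data1.mat")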
## Warning: package 'rmatio' was built under R version 3.6.1
Next, let's visualize the data. We begin by randomly selecting 100 rows. Some implementation tips: 1. roll the vector of x's for each row into a 20x20 matrix/grid; 2. use geom_raster to plot a grid; 3. use facet_wrap to display all 100 digits in one plot:
library(ggplot2)
m=dim(handwrit_data$X)[1]
rand_indices<-sample(m,100)
sel<-handwrit_data$X[rand_indices[1:100],]
#Roll vector into a grid with 2 coordinates and the value
number_df<-NULL
# individual number (vector of 400 features)
for (s in 1:100){
temp_matrix<-matrix(sel[s,],nrow=20,byrow=T)
for(i in 1:20){
for(j in 1:20){
temp<-data.frame(index=s,x1=i,x2=j,value=temp_matrix[i,21-j])
number_df<-rbind(number_df,temp)
#print(temp)
}
}
}
ggplot(data=number_df,aes(x=x1,y=x2))+geom_raster(aes(fill=value))+facet_wrap(~index, ncol=10)+
scale_fill_continuous(type = "viridis")+
labs(x="",y=NULL,title="Hand-written numbers: sample")+
theme(
strip.background = element_blank(),
strip.text.x = element_blank(),
legend.position = "none",
#axis.title=element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()
)
In this section we will be implementing a function similar to lrCostFunction.m. This function happens to be exactly the same as ComputeCostGradient defined in the previous section, so we will be reusing it here.
In this part of the exercise, we will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the K classes in our dataset. In the handwritten digits dataset, K = 10, but our code will work for any value of K.
To accomplish this, I will be using the function gdlregMulti that mimics oneVsAll.m. The function gdlregMulti takes the following inputs:
formula: an object of class "formula", a symbolic description of the model to be fitted
data: a data frame
subset: an optional vector specifying a subset of observations to be used in the fitting process
theta: a vector of initial values of \(\theta\). The length of the vector should match the model specification
lambda: a regularization parameter
method: the method to be used. See ?optim for details
The function outputs a list that has 2 elements:
all_theta: a dataframe, each row contains the parameter estimates for one of the K classes
y_class: vector of class labels in the model
gdlregMulti<-function(formula,data,subset,theta, lambda=0, method ="BFGS"){
#method BFGS works with a large number of features
if(is.na(match(method,c("Nelder-Mead", "BFGS", "L-BFGS-B"))))
stop("Please select on of the tested methods: Nelder-Mead, BFGS, L-BFGS-B")
mf <- match.call(expand.dots = F)
m <- match(c("formula", "data","subset"),
names(mf), 0L)
mf <- mf[c(1L, m)]
#mf <- mf[m]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
mt <- attr(mf, "terms")
ymulti <- model.response(mf, "numeric")
y_class<-unique(ymulti)
x <- model.matrix(mt, mf)
# flag if intercept was selected. will set lambda vector[1] (regularization variable) to zero if intercept present in formula
infl <- attr(mt,"intercept")
#
n<-ncol(x)
if(n!=length(theta))
stop("Model formula and initial theta have incompatible dimensions")
# two closures are created by ComputeCostGradient function for J and grad
all_theta<-NULL
for (i in y_class){
y<-as.numeric(ymulti %in% i)
J<-ComputeCostGradient("J")
grad<-ComputeCostGradient("grad")
res<-optim(theta,J,grad,X=x,y=y,infl=infl,lambda=lambda,method=method)$par
#all_theta<-data.frame(rbind(all_theta,c(class=i,res)))
all_theta<-rbind(all_theta,res)
colnames(all_theta)<-colnames(x)
rownames(all_theta)<-NULL
}
out<-list(all_theta=all_theta,y_class=y_class)
}
In this section I will examine the predictOneVsAll function that mimics predictOneVsAll.m.
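The definition of predictOneVsAll is not reproduced in the post; a minimal sketch of what it could look like (assumed, not the author's exact code; it relies on the sigmoid function from the logistic regression section):
predictOneVsAll<-function(classifier,newdata){
# add the intercept column so the columns line up with all_theta
X<-cbind(1,as.matrix(newdata))
# class probabilities for every observation (one column per class)
probs<-sigmoid(X%*%t(classifier$all_theta))
# pick the label of the most probable class
classifier$y_class[max.col(probs)]
}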
The function takes one input, the output of the gdlregMulti classifier, which contains 2 elements:
all_theta: a dataframe, each row contains the parameter estimates for one of the K classes
y_class: vector of class labels in the model
Evaluation of model performance will consist of several steps:
df_handwrit<-as.data.frame(handwrit_data)
# training index
train_in<-factor(sample(c("train","cv","test"),m,replace=T, prob=c(.7,.2,.1)))
# training set
X_train<-df_handwrit[train_in=="train",-401]
y_train<-df_handwrit[train_in=="train",401]
# cross-validation set
X_cv<-df_handwrit[train_in=="cv",-401]
y_cv<-df_handwrit[train_in=="cv",401]
# test set
X_test<-df_handwrit[train_in=="test",-401]
y_test<-df_handwrit[train_in=="test",401]
# accuracy collector
accuracy<-NULL
for (i in 0:20){
# train the model
hr_train<-gdlregMulti(y~.,data=df_handwrit,subset=(train_in=="train"),theta=rep(0,401),lambda=i,method="BFGS")
# test CV to determine lambda
hr_cv_predict<-predictOneVsAll(hr_train,X_cv)
# accuracy test
temp_acc<-sum(hr_cv_predict==y_cv)/length(y_cv)
accuracy<-c(accuracy,temp_acc)
}
optimal_l<-(0:20)[which.max(accuracy)]
cat("Optimal value of lambda:",optimal_l)## Optimal value of lambda: 2
df_accuracy<-data.frame(lambda=0:20,Accuracy=accuracy)
ggplot(data=df_accuracy)+geom_point(aes(lambda,Accuracy))+geom_line(aes(lambda,Accuracy))+
geom_vline(aes(xintercept=optimal_l))+
geom_point(aes(x=optimal_l,y=accuracy[which.max(accuracy)]),color="red",size=3)+
scale_x_continuous(labels=0:20,breaks=0:20)+labs(title="Accuracy Vs. lambda")+
theme(panel.background = element_blank(),axis.line = element_line(colour = "black"))
In this section I will be estimating the model based on the training set and the optimal value of \(\lambda\) found above:
hr_train_optimal_l<-gdlregMulti(y~.,data=df_handwrit,subset=(train_in=="train"),theta=rep(0,401),lambda=optimal_l,method="BFGS")
hr_test_predict<-predictOneVsAll(hr_train_optimal_l,X_test)
test_accuracy<-sum(hr_test_predict==y_test)/length(y_test)
cat("Model Accuracy:",percent(test_accuracy))## Model Accuracy: 87.7%
## actual
## predicted 1 2 3 4 5 6 7 8 9 10
## 1 46 0 2 0 1 1 3 1 0 0
## 2 0 41 6 0 0 0 0 0 0 0
## 3 0 0 45 0 2 0 0 3 0 1
## 4 0 1 0 43 0 1 3 0 3 0
## 5 0 1 2 0 35 0 0 2 0 3
## 6 1 0 0 1 2 58 0 0 0 0
## 7 0 0 0 0 0 1 49 1 2 0
## 8 0 1 0 1 0 0 0 39 0 0
## 9 0 1 2 5 0 0 1 2 46 0
## 10 0 1 0 0 0 0 1 0 3 41
In this part I will implement collaborative filtering and apply it to a dataset of movie ratings. The ratings are on a scale of 1 to 5. The dataset has nu=943 users and nm=1,682 movies. We start by loading the movie database files:
# initial parameters
movieParams<-read.mat("w9/ex8_movieParams.mat")
# movie ratings
movies<-read.mat("w9/ex8_movies.mat")
# movie IDs
movieList<-readLines("w9/movie_ids.txt")
movieList<-data.frame(index=regmatches(movieList,regexpr("^[0-9]*",movieList)),MovieName=sub("^[0-9]*\\s","\\1",movieList))
Y<-movies[["Y"]]
R<-movies[["R"]]
X<-movieParams[["X"]]
Theta<-movieParams[["Theta"]]
In this section we will be implementing a function similar to cofiCostFunc.m. This function will have two outputs: cost and gradient.
Some additional notes about the implementation in R:
type and output: two closures that can be supplied to optim as separate functions, cost and gradient
The function cofi takes the following input:
type: 2 options, "J" to output the cost function, "grad" to output the gradient function
The inner anonymous function takes the following inputs:
params: initial unrolled values for X and Theta, stacked
Y: matrix of movie ratings
R: binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise
The function outputs 2 closures:
J: the cost function
grad: the gradient function
cofi<-function(type){
function(params, Y, R, num_users, num_movies, num_features, lambda){
X<-matrix(
params[1:(num_movies*num_features)],
nrow=num_movies,
ncol=num_features
)
Theta<-matrix(
params[(num_movies*num_features+1):length(params)],
nrow=num_users,
ncol=num_features
)
# COST FUNCTION
J_reg<-(lambda/2)*sum(Theta^2)+(lambda/2)*sum(X^2)
J<-(1/2)*sum(R*(X%*%t(Theta)-Y)^2)+J_reg
# GRADIENTS
Theta_grad=t(R*(X%*%t(Theta)-Y))%*%X+lambda*Theta
X_grad=(R*(X%*%t(Theta)-Y))%*%Theta+lambda*X
grad<-c(as.vector(X_grad),as.vector(Theta_grad))
if(type=="J"){
J
}
else if(type=="grad"){
grad
}
else stop("Invalid output request from CostGradient: acceptable values are: 'J' and 'grad'")
}
}
Before proceeding further, I'm going to test the performance of the cofi function; specifically, I will be comparing analytical gradients to numeric ones. To perform the gradient checking, I created the function checkCOGradients. This function outputs two columns of numbers: analytical and numeric gradients. The function takes the following inputs:
costFunc: the cost/gradient function to be evaluated
Y: matrix of movie ratings
R: binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise
num_users, num_movies, num_features: the numbers of users, movies and features
lambda: a regularization parameter
checkCOGradients<-function(costFunc,Y, R, num_users, num_movies, num_features, lambda=0){
grad<-costFunc("grad")
J<-costFunc("J")
params<-rnorm((num_users+num_movies)*num_features)
#
g_analytical<-grad(params=params,Y=Y, R=R, num_users=num_users, num_movies=num_movies, num_features=num_features,lambda=lambda)
l<-length(g_analytical)
e<-1e-4
g_numeric<-vector(mode="numeric",length=l)
for (i in 1:l){
e_vector<-rep(0,l)
e_vector[i]<-e
param_plus<-params+e_vector
param_minus<-params-e_vector
JP<-J(param_plus,Y=Y, R=R, num_users=num_users, num_movies=num_movies, num_features=num_features,lambda=lambda)
JM<-J(param_minus,Y=Y, R=R, num_users=num_users, num_movies=num_movies, num_features=num_features,lambda=lambda)
g_numeric[i]<-(JP-JM)/(2*e)
}
cbind(Numeric=g_numeric,Analytical=g_analytical[1:l])
}
To test the cost function I will be creating a small test dataset:
num_users<-4
num_movies<-5
num_features<-3
X<-X[1:num_movies, 1:num_features]
Theta<-Theta[1:num_users, 1:num_features]
Y<-Y[1:num_movies, 1:num_users]
R<-R[1:num_movies, 1:num_users]
Perform gradient checking:
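The call itself is not shown in the original; presumably something like this (leaving lambda at its default of 0 is an assumption):
checkCOGradients(cofi,Y=Y,R=R,num_users=num_users,num_movies=num_movies,num_features=num_features)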
## Numeric Analytical
## [1,] 0.4906855 0.4906855
## [2,] 1.7131053 1.7131053
## [3,] 1.2870548 1.2870548
## [4,] 1.5119594 1.5119594
## [5,] 1.2060363 1.2060363
## [6,] -1.5411221 -1.5411221
## [7,] 1.9451505 1.9451505
## [8,] 1.4613903 1.4613903
## [9,] 1.7167589 1.7167589
## [10,] 1.3693976 1.3693976
## [11,] -4.2376488 -4.2376488
## [12,] -0.8775910 -0.8775910
## [13,] -0.6593336 -0.6593336
## [14,] -0.7745479 -0.7745479
## [15,] -0.6178293 -0.6178293
## [16,] 1.4007275 1.4007275
## [17,] -5.8526847 -5.8526847
## [18,] 0.0000000 0.0000000
## [19,] 0.0000000 0.0000000
## [20,] 7.4273223 7.4273223
## [21,] 6.2368812 6.2368812
## [22,] 0.0000000 0.0000000
## [23,] 0.0000000 0.0000000
## [24,] -4.4764255 -4.4764255
## [25,] -6.0927194 -6.0927194
## [26,] 0.0000000 0.0000000
## [27,] 0.0000000 0.0000000
Analytical and numerical gradients look exactly the same.
In this step we will be implementing the function that normalizes movie ratings by the mean rating for each movie. The resulting normalized entries will have a mean rating of zero for each movie. The function takes the following inputs:
Y: matrix of movie ratings
R: binary-valued indicator matrix, where R(i,j)=1 if user j gave a rating to movie i, and 0 otherwise
The function outputs 2 matrices:
Ynorm: mean-normalized version of Y
Ymean: vector of length num_movies with the mean rating for each movie
normalizeRatings<-function(Y, R){
m<-dim(Y)[1] # movies
n<-dim(Y)[2] # users
Ymean<-vector(mode="numeric",length=m)
Ynorm<-matrix(0,nrow=m,ncol=n)
for (i in 1:m){
idx<-which(R[i,]==1)
Ymean[i]<-mean(Y[i,idx])
Ynorm[i,idx]<-Y[i,idx]-Ymean[i]
}
list(Ynorm=Ynorm,Ymean=Ymean)
}
We can now move on to creating movie recommendations.
In this step we will train the collaborative filtering model. I'll be using the same movie preferences as in the course material:
# re-load full dataset
Y<-movies[["Y"]]
R<-movies[["R"]]
my_ratings<-vector(mode="numeric",length=dim(Y)[1])
my_ratings[1]<-4
my_ratings[98]<-2
my_ratings[7]<-3
my_ratings[12]<-5
my_ratings[54]<-4
my_ratings[64]<-5
my_ratings[66]<-3
my_ratings[69]<-5
my_ratings[183]<-4
my_ratings[226]<-5
my_ratings[355]<-5
cat('\n\nNew user ratings:\n')
## 
##
## New user ratings:
for (i in seq_along(my_ratings)){
if(my_ratings[i]>0){
cat(sprintf("Rated %d for %s \n",my_ratings[i],movieList$MovieName[i]))
}
}
## Rated 4 for Toy Story (1995)
## Rated 3 for Twelve Monkeys (1995)
## Rated 5 for Usual Suspects, The (1995)
## Rated 4 for Outbreak (1995)
## Rated 5 for Shawshank Redemption, The (1994)
## Rated 3 for While You Were Sleeping (1995)
## Rated 5 for Forrest Gump (1994)
## Rated 2 for Silence of the Lambs, The (1991)
## Rated 4 for Alien (1979)
## Rated 5 for Die Hard 2 (1990)
## Rated 5 for Sphere (1998)
# append the data to the main dataset
Y<-cbind(as.matrix(my_ratings),Y)
R<-cbind(as.matrix(my_ratings!=0),R)
# Normalize (mean) ratings
Y_norm_list<-normalizeRatings(Y, R)
Ynorm<-Y_norm_list[["Ynorm"]]
Ymean<-Y_norm_list[["Ymean"]]
num_users<-dim(Y)[2]
num_movies<-dim(Y)[1]
num_features<-10 # arbitrary value
# Set initial values for Theta and X
X<-matrix(rnorm(num_movies*num_features),num_movies,num_features)
Theta<-matrix(rnorm(num_users*num_features),num_users,num_features)
# initial parameters
initial_parameters<-c(as.vector(X),as.vector(Theta))
lambda<-10
# run
J<-cofi("J")
grad<-cofi("grad")
res<-optim(initial_parameters,J,grad,Y=Ynorm,R=R,num_users=num_users, num_movies=num_movies, num_features=num_features,lambda=lambda,method="L-BFGS-B")
res<-res$par
# unroll to matricies
X<-matrix(
res[1:(num_movies*num_features)],
nrow=num_movies,
ncol=num_features
)
Theta<-matrix(
res[(num_movies*num_features+1):length(res)],
nrow=num_users,
ncol=num_features
)
# predictions
p<-X%*%t(Theta)
my_predictions <- p[,1] + Ymean
movieList[order(my_predictions,decreasing = T),][1:10,]
##      index                                         MovieName
## 1293 1293 Star Kid (1997)
## 1189 1189 Prefontaine (1997)
## 1653 1653 Entertaining Angels: The Dorothy Day Story (1996)
## 1201 1201 Marlene Dietrich: Shadow and Light (1996)
## 1500 1500 Santa with Muscles (1996)
## 1599 1599 Someone Else's America (1995)
## 1122 1122 They Made Me a Criminal (1939)
## 1467 1467 Saint of Fort Washington, The (1993)
## 814 814 Great Day in Harlem, A (1994)
## 1536 1536 Aiqing wansui (1994)
I tried several optimization methods: BFGS, CG and L-BFGS-B.
CG and L-BFGS-B produce the same top 10 as the Octave implementation, in a different order.
BFGS produces a result that is close, but not exact; I need to expand the list to the top 12 to get the 10 recommendations from the other methods.