We begin by installing and loading a few packages.
# install.packages("dslabs")
# install.packages("softImpute")
library("dslabs")
## Warning: package 'dslabs' was built under R version 4.0.4
library("softImpute")
## Warning: package 'softImpute' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded softImpute 1.4
library("Matrix")
We will use the MovieLens collaborative filtering data set.
data("movielens")
# we remove rows with missing values (e.g. movies with no name)
movielens = na.omit(movielens)
We create a vector of user IDs, a vector of movie titles, a vector of movie IDs, and a vector of all observed ratings.
users = movielens$userId
movies_titles = as.factor(movielens$title)
movies_IDs = as.numeric(movies_titles)
ratings = movielens$rating
head(movies_titles)
## [1] Dangerous Minds
## [2] Dumbo
## [3] Sleepers
## [4] Escape from New York
## [5] Cinema Paradiso (Nuovo cinema Paradiso)
## [6] Deer Hunter, The
## 8831 Levels: 'burbs, The ... Zulu
We remove duplicated (user, movie) pairs, keeping only the first occurrence.
dup = duplicated(cbind(users, movies_IDs))
idx_cleaned = which(!dup)
users = users[idx_cleaned]
movies_IDs = movies_IDs[idx_cleaned]
ratings = ratings[idx_cleaned]
mean_rating = mean(ratings)
cat("The average observed rating is", mean_rating, "out of 5.")
## The average observed rating is 3.544333 out of 5.
In collaborative filtering, rating matrices are huge but very sparse. In that context, it is much easier to work with the matrix in a sparse format. The softImpute package provides the Incomplete() function to build one:
X = Incomplete(i = users, j = movies_IDs, x = ratings)
X[1:6,1:6]
## 6 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . .
## [2,] . . . . . .
## [3,] . . . . . .
## [4,] . . . . . .
## [5,] . . . . . .
## [6,] 4 . . . . .
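As a quick sanity check, we can look at the dimensions of X and at the fraction of entries that are observed; nnzero() comes from the Matrix package, and since no MovieLens rating equals 0, it counts exactly the observed ratings (a sketch):
dim(X)                     # number of users x number of movies
nnzero(X) / prod(dim(X))   # fraction of observed entries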
We can also easily go back to the usual matrix format this way:
as.matrix(X)[1:6,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA
## [6,] 4 NA NA NA NA NA
softImpute imputes the missing values by nuclear-norm-penalised matrix completion, as seen in class. For example:
res = softImpute(X,maxit = 100)
## Warning in Ssimpute.als(x, J, thresh, lambda, maxit, trace.it, warm.start, :
## Convergence not achieved by 100 iterations
Xhat = complete(X,res)
# that's the complete (imputed) matrix
Xhat[1:6,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.428969 0.1679077 0.009187765 0.9191349 0.50185147 0.2938384
## [2,] 2.838157 0.3058214 0.015632513 2.0322370 1.14723028 0.5351874
## [3,] 2.731614 0.5028762 0.034760141 0.3982139 -0.02989460 0.8800334
## [4,] 3.599128 0.1464728 -0.002992381 4.3799482 2.76731641 0.2563274
## [5,] 3.205785 0.5978419 0.041519436 0.4100204 -0.07681244 1.0462233
## [6,] 4.000000 0.4740279 0.032996024 0.3006179 -0.07684634 0.8295489
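The fitted object stores the solution in SVD form, with components u, d and v; note that softImpute caps the rank at rank.max, which defaults to 2. As a quick sketch, we can inspect the singular values and rebuild the low-rank fit by hand (unlike complete(), which also keeps the observed entries in place):
res$d                                        # singular values of the low-rank solution
Xhat_lowrank = res$u %*% (res$d * t(res$v))  # equivalent to u diag(d) v'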
Play around with softImpute, and implement very simple baselines as competitors (e.g. using constant imputations). In particular, you should create validation and test sets to assess the quality of the imputation schemes. What is your best model? What are the best hyperparameters? Write a report as a notebook or as a pdf (with accompanying code).
First, we split the data into a training set and a test set.
n_tot = length(users)
# indices of users/movies/ratings kept for training:
# we keep 95% of the data for training and the rest for testing
ind_train = sample(n_tot, floor(n_tot * 0.95))
users_train = users[ind_train]
movies_titles_train = movies_titles[ind_train]
movies_IDs_train = movies_IDs[ind_train]
ratings_train = ratings[ind_train]
X_train = Incomplete(i = users_train, j = movies_IDs_train, x = ratings_train)
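It is also useful to keep the held-out triplets explicitly; with them, the package's impute() function can predict individual (user, movie) entries directly, which avoids any matrix bookkeeping. A sketch (for a reproducible split, one would call set.seed() before sample()):
ind_test = setdiff(seq_len(n_tot), ind_train)   # the 5% held out
users_test = users[ind_test]
movies_IDs_test = movies_IDs[ind_test]
ratings_test = ratings[ind_test]
# given a fitted model `res`:
# pred = impute(res, i = users_test, j = movies_IDs_test)
# mean((ratings_test - pred)^2)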
# try a few values of the penalty lambda and report the held-out MSE;
# the held-out entries are exactly where the sparse matrices X_train and X
# disagree (missing entries are stored as 0, and no true rating equals 0)
for (lam in c(0.1, 0.8, 1, 2)) {
  res = softImpute(X_train, lambda = lam, maxit = 300)
  Xhat = complete(X_train, res)
  print(mean((X[which(X_train != X)] - Xhat[which(X_train != X)])^2))
}
## [1] 1.545286
## [1] 1.567859
## [1] 1.519424
## [1] 1.558507
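The lambda values above are somewhat arbitrary. The package's lambda0() function returns the smallest penalty for which the solution is identically zero, so a more principled search runs over a log-spaced grid below that value, warm-starting each fit from the previous one (a sketch; the grid size and the lower endpoint 1 are arbitrary choices):
lam0 = lambda0(X_train)   # penalties above this give the all-zero solution
lambda_grid = exp(seq(log(lam0), log(1), length.out = 10))
fit = NULL
for (lam in lambda_grid) {
  fit = softImpute(X_train, lambda = lam, maxit = 300, warm.start = fit)
  # evaluate the held-out error of `fit` here, as above
}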
We compare these errors with the mean squared error of a constant imputation by the overall mean rating:
mean((X[which(X_train!=X)]- mean_rating)^2)
## [1] 1.141181
It can be seen that, without centering and scaling the data, the softImpute model does not perform very well: the constant imputation with the mean rating does better. The error varies only mildly with lambda (among the four values tried, lambda = 1 gives the lowest MSE), and it stays well above the constant baseline in all cases. Now let us look at the results when we center and scale the data.
# with centering and scaling the data
X = scale(X)
X_train = scale(X_train)
mean_rating = mean(scale(ratings))
for (lam in c(0.1, 0.8, 1, 2)) {
  res = softImpute(X_train, lambda = lam, maxit = 300)
  Xhat = complete(X_train, res)
  print(mean((X[which(X_train != X)] - Xhat[which(X_train != X)])^2))
}
}
## [1] 0.1420519
## [1] 0.1420519
## [1] 0.1420519
## [1] 0.1420519
mean((X[which(X_train!=X)]-mean_rating)^2)
## [1] 0.9985097
After centering and scaling the data, the reported error drops sharply and the hyperparameter lambda appears to make no difference at all. One caveat is in order: after scale() the matrices are dense with NAs, and which(X_train != X) drops NA comparisons, so this mask no longer selects the held-out entries; that is consistent with the four lambda values returning exactly the same number. A mask built directly from the missingness patterns, used below to score the baselines, avoids this pitfall. In any case, the experiments show that softImpute is sensitive to how the data are scaled.
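Note that for an Incomplete matrix the softImpute package ships its own standardisation routine, biScale(), which centres and scales rows and columns using only the observed entries; a fit on the biScale'd matrix can then be unscaled back to the original rating scale when completing. A sketch, rebuilding the sparse train matrix from the train triplets (biScale can struggle to converge when some users or movies have very few ratings, in which case switching off some of the four operations, e.g. row.scale = FALSE, helps):
X_inc = Incomplete(i = users_train, j = movies_IDs_train, x = ratings_train)
X_bs = biScale(X_inc, maxit = 20, trace = FALSE)   # row/column centering and scaling
res_bs = softImpute(X_bs, lambda = 1, maxit = 300)
Xhat_bs = complete(X_inc, res_bs, unscale = TRUE)  # predictions back on the 0.5-5 scale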
As simple competitors, we now build three constant-imputation baselines on the training matrix: global mean, per-movie (column) mean, and per-user (row) mean.
row_mean = rowMeans(X_train, na.rm = TRUE)
col_mean = colMeans(X_train, na.rm = TRUE)
X_global = X_train   # baseline 1: impute with the global mean
X_global[is.na(X_global)] <- mean(X_train, na.rm = TRUE)
X_col = X_train      # baseline 2: impute with column (movie) means
for (j in 1:ncol(X_col)) { X_col[is.na(X_col[, j]), j] <- col_mean[j] }
X_row = X_train      # baseline 3: impute with row (user) means
ind <- which(is.na(X_row), arr.ind = TRUE)
X_row[ind] <- row_mean[ind[, 1]]
X_row[1:6, 1:6]
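To rank the baselines against softImpute, we can score them on the held-out entries, identified directly as the positions that are missing in the train matrix but observed in the full one (a sketch; note that col_mean or row_mean can be NaN for movies or users with no training ratings, so a global-mean fallback might be needed):
# held-out entries: missing in the train matrix, observed in the full matrix
test_mask = is.na(X_train) & !is.na(X)
mean((X[test_mask] - X_global[test_mask])^2)   # global-mean baseline
mean((X[test_mask] - X_col[test_mask])^2)      # movie-mean baseline
mean((X[test_mask] - X_row[test_mask])^2)      # user-mean baseline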