Libraries

library("recommenderlab")
library("ggplot2")

I will begin by loading the movies lense data-set from the recommender lab into the environment. I will also subset the data for movies have been rated by more than 50 user and films rated more than 100 times.

Data

data("MovieLense")
ratings_movies = MovieLense[rowCounts(MovieLense)>50,colCounts(MovieLense)>100]

The movie rating distribution is right-skewed with the median of

quantile(colMeans(ratings_movies, na.rm = T), .5)

##      50% 
## 3.684294

movie_mean_rating = colMeans(ratings_movies, na.rm = T)

qplot(movie_mean_rating) + ggtitle("Distribution of movie Ratings")

rm(movie_mean_rating)

The user rating distribution seems to somewhat normal with the median of

quantile(rowMeans(ratings_movies, na.rm = T), .5)

##      50% 
## 3.740624

user_mean_rating = rowMeans(ratings_movies, na.rm = T)
qplot(user_mean_rating) + ggtitle("Distribution of User Ratings")

rm(user_mean_rating)

A heat map of the first 100 users and movies reveals sparsity, but also some very active users.

image(ratings_movies[1:100,1:100], main="Heatmap First 100 Users & movies")

To create a train and test set an evaluation scheme is needed. The function takes the rating data. The data will be split 80% train and 20% test.

set.seed(143)
eval_sets = evaluationScheme(data =ratings_movies,method ="cross-validation",train=.8,given=15,goodRating=3,k=10)

rm(MovieLense,MovieLenseMeta,ratings_movies)

IBCF

IBCF Model 1

The first model is an Item Base collaborative filtering model. The data will be normalized using centering. The distance measurement used is cosine. The models will be evaluated using the unknown data set and stored in a data frame for comparison to other models. Finally, the top predictions for the model will be printed.

IBCF_cosine_model = Recommender(data=getData(eval_sets,"train"),method ="IBCF",parameter=list( normalize = "center", method="cosine"))

pred_IBCF_cos = predict(object =IBCF_cosine_model,newdata=getData(eval_sets,"known"),n=10,type="ratings")

model_stats =data.frame(IBCF_cosine_model=calcPredictionAccuracy(x=pred_IBCF_cos,data=getData(eval_sets,"unknown")))

pred_IBCF_cos_top = predict(object =IBCF_cosine_model,newdata=getData(eval_sets,"known"),n=10,type="topNList")

pred_IBCF_cos_top@itemLabels[pred_IBCF_cos_top@items[[1]]]

##  [1] "Babe (1995)"                      "Usual Suspects, The (1995)"      
##  [3] "Taxi Driver (1976)"               "Apollo 13 (1995)"                
##  [5] "Hoop Dreams (1994)"               "Natural Born Killers (1994)"     
##  [7] "Quiz Show (1994)"                 "Shawshank Redemption, The (1994)"
##  [9] "Forrest Gump (1994)"              "Mask, The (1994)"

rm(IBCF_cosine_model,pred_IBCF_cos,pred_IBCF_cos_top)

IBCF Model 2

IBCF model 2 uses Pearson distance measurement otherwise it is similar to IBCF model one.

IBCF_pearson_model = Recommender(data=getData(eval_sets,"train"),method ="IBCF",parameter=list( normalize = "center", method="pearson"))

pred_IBCF_pear = predict(object =IBCF_pearson_model,newdata=getData(eval_sets,"known"),n=10,type="ratings")

model_stats$IBCF_pearson_model= cbind( calcPredictionAccuracy(x=pred_IBCF_pear,data=getData(eval_sets,"unknown")))

pred_IBCF_pear_top = predict(object =IBCF_pearson_model,newdata=getData(eval_sets,"known"),n=10,type="topNList")

pred_IBCF_pear_top@itemLabels[pred_IBCF_pear_top@items[[1]]]

##  [1] "Toy Story (1995)"            "Get Shorty (1995)"          
##  [3] "Usual Suspects, The (1995)"  "Braveheart (1995)"          
##  [5] "Crimson Tide (1995)"         "Ed Wood (1994)"             
##  [7] "Professional, The (1994)"    "Sleepless in Seattle (1993)"
##  [9] "True Romance (1993)"         "Cold Comfort Farm (1995)"

rm(IBCF_pearson_model,pred_IBCF_pear,pred_IBCF_pear_top)

UBCF

UBCF Model 1

The user base collaborative filtering model will use the same methodology implemented earlier. The model will be normalization using centering and the distance measurement used is cosine.

UBCF_cosine_model = Recommender(data=getData(eval_sets,"train"),method ="UBCF",parameter=list( normalize = "center", method="cosine"))

pred_UBCF_cos = predict(object =UBCF_cosine_model,newdata=getData(eval_sets,"known"),n=10,type="ratings")

model_stats$UBCF_cosine_model= cbind( calcPredictionAccuracy(x=pred_UBCF_cos,data=getData(eval_sets,"unknown")))

pred_UBCF_cos_top = predict(object =UBCF_cosine_model,newdata=getData(eval_sets,"known"),n=10,type="topNList")

pred_UBCF_cos_top@itemLabels[pred_UBCF_cos_top@items[[1]]]

##  [1] "Silence of the Lambs, The (1991)" "Empire Strikes Back, The (1980)" 
##  [3] "Godfather, The (1972)"            "Princess Bride, The (1987)"      
##  [5] "Shawshank Redemption, The (1994)" "Schindler's List (1993)"         
##  [7] "Casablanca (1942)"                "Return of the Jedi (1983)"       
##  [9] "Alien (1979)"                     "Blade Runner (1982)"

rm(UBCF_cosine_model,pred_UBCF_cos,pred_UBCF_cos_top)

UBCF Model 2

UBCF model 2 will use Z-score normalization technique and a distance measurement of Pearson.

UBCF_pearson_modelz = Recommender(data=getData(eval_sets,"train"),method ="UBCF",parameter=list( normalize = "Z-score", method="pearson"))

pred_UBCF_pear = predict(object =UBCF_pearson_modelz,newdata=getData(eval_sets,"known"),n=10,type="ratings")

model_stats$UBCF_pearson_modelz= cbind( calcPredictionAccuracy(x=pred_UBCF_pear,data=getData(eval_sets,"unknown")))

pred_UBCF_pear_top = predict(object =UBCF_pearson_modelz,newdata=getData(eval_sets,"known"),n=10,type="topNList")

pred_UBCF_pear_top@itemLabels[pred_UBCF_pear_top@items[[1]]]

##  [1] "Silence of the Lambs, The (1991)"                                           
##  [2] "Schindler's List (1993)"                                                    
##  [3] "Godfather, The (1972)"                                                      
##  [4] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)"
##  [5] "L.A. Confidential (1997)"                                                   
##  [6] "Fargo (1996)"                                                               
##  [7] "Good Will Hunting (1997)"                                                   
##  [8] "Usual Suspects, The (1995)"                                                 
##  [9] "Raising Arizona (1987)"                                                     
## [10] "Close Shave, A (1995)"

rm(UBCF_pearson_modelz,pred_UBCF_pear,pred_UBCF_pear_top,eval_sets)

Conslusions

After evaluating all four models, the user based collaborative filtering models performed better than IBCF using RMSE. In my next build, I would like to search techniques that allow me to evaluate multiple models using ROC/AUC, and implementations of hybrid methods for collaborative filtering.

knitr::kable(model_stats,caption="Model Statistics")

Model Statistics
	IBCF_cosine_model	IBCF_pearson_model	UBCF_cosine_model	UBCF_pearson_modelz
RMSE	1.0887593	1.2041807	0.9490189	0.9759794
MSE	1.1853968	1.4500512	0.9006369	0.9525357
MAE	0.7971769	0.8791676	0.7488709	0.7756216

rm(model_stats)

Sources

R Bloggers

Stack Overflow

Predicting Ratings

Project 2 Data643

Christopher Estevez

June 19, 2018