library("recommenderlab")
library("ggplot2")
I will begin by loading the movies lense data-set from the recommender lab into the environment. I will also subset the data for movies have been rated by more than 50 user and films rated more than 100 times.
data("MovieLense")
ratings_movies = MovieLense[rowCounts(MovieLense)>50,colCounts(MovieLense)>100]
The movie rating distribution is right-skewed with the median of
quantile(colMeans(ratings_movies, na.rm = T), .5)
## 50%
## 3.684294
movie_mean_rating = colMeans(ratings_movies, na.rm = T)
qplot(movie_mean_rating) + ggtitle("Distribution of movie Ratings")
rm(movie_mean_rating)
The user rating distribution seems to somewhat normal with the median of
quantile(rowMeans(ratings_movies, na.rm = T), .5)
## 50%
## 3.740624
user_mean_rating = rowMeans(ratings_movies, na.rm = T)
qplot(user_mean_rating) + ggtitle("Distribution of User Ratings")
rm(user_mean_rating)
A heat map of the first 100 users and movies reveals sparsity, but also some very active users.
image(ratings_movies[1:100,1:100], main="Heatmap First 100 Users & movies")
To create a train and test set an evaluation scheme is needed. The function takes the rating data. The data will be split 80% train and 20% test.
set.seed(143)
eval_sets = evaluationScheme(data =ratings_movies,method ="cross-validation",train=.8,given=15,goodRating=3,k=10)
rm(MovieLense,MovieLenseMeta,ratings_movies)
The first model is an Item Base collaborative filtering model. The data will be normalized using centering. The distance measurement used is cosine. The models will be evaluated using the unknown data set and stored in a data frame for comparison to other models. Finally, the top predictions for the model will be printed.
IBCF_cosine_model = Recommender(data=getData(eval_sets,"train"),method ="IBCF",parameter=list( normalize = "center", method="cosine"))
pred_IBCF_cos = predict(object =IBCF_cosine_model,newdata=getData(eval_sets,"known"),n=10,type="ratings")
model_stats =data.frame(IBCF_cosine_model=calcPredictionAccuracy(x=pred_IBCF_cos,data=getData(eval_sets,"unknown")))
pred_IBCF_cos_top = predict(object =IBCF_cosine_model,newdata=getData(eval_sets,"known"),n=10,type="topNList")
pred_IBCF_cos_top@itemLabels[pred_IBCF_cos_top@items[[1]]]
## [1] "Babe (1995)" "Usual Suspects, The (1995)"
## [3] "Taxi Driver (1976)" "Apollo 13 (1995)"
## [5] "Hoop Dreams (1994)" "Natural Born Killers (1994)"
## [7] "Quiz Show (1994)" "Shawshank Redemption, The (1994)"
## [9] "Forrest Gump (1994)" "Mask, The (1994)"
rm(IBCF_cosine_model,pred_IBCF_cos,pred_IBCF_cos_top)
IBCF model 2 uses Pearson distance measurement otherwise it is similar to IBCF model one.
IBCF_pearson_model = Recommender(data=getData(eval_sets,"train"),method ="IBCF",parameter=list( normalize = "center", method="pearson"))
pred_IBCF_pear = predict(object =IBCF_pearson_model,newdata=getData(eval_sets,"known"),n=10,type="ratings")
model_stats$IBCF_pearson_model= cbind( calcPredictionAccuracy(x=pred_IBCF_pear,data=getData(eval_sets,"unknown")))
pred_IBCF_pear_top = predict(object =IBCF_pearson_model,newdata=getData(eval_sets,"known"),n=10,type="topNList")
pred_IBCF_pear_top@itemLabels[pred_IBCF_pear_top@items[[1]]]
## [1] "Toy Story (1995)" "Get Shorty (1995)"
## [3] "Usual Suspects, The (1995)" "Braveheart (1995)"
## [5] "Crimson Tide (1995)" "Ed Wood (1994)"
## [7] "Professional, The (1994)" "Sleepless in Seattle (1993)"
## [9] "True Romance (1993)" "Cold Comfort Farm (1995)"
rm(IBCF_pearson_model,pred_IBCF_pear,pred_IBCF_pear_top)
The user base collaborative filtering model will use the same methodology implemented earlier. The model will be normalization using centering and the distance measurement used is cosine.
UBCF_cosine_model = Recommender(data=getData(eval_sets,"train"),method ="UBCF",parameter=list( normalize = "center", method="cosine"))
pred_UBCF_cos = predict(object =UBCF_cosine_model,newdata=getData(eval_sets,"known"),n=10,type="ratings")
model_stats$UBCF_cosine_model= cbind( calcPredictionAccuracy(x=pred_UBCF_cos,data=getData(eval_sets,"unknown")))
pred_UBCF_cos_top = predict(object =UBCF_cosine_model,newdata=getData(eval_sets,"known"),n=10,type="topNList")
pred_UBCF_cos_top@itemLabels[pred_UBCF_cos_top@items[[1]]]
## [1] "Silence of the Lambs, The (1991)" "Empire Strikes Back, The (1980)"
## [3] "Godfather, The (1972)" "Princess Bride, The (1987)"
## [5] "Shawshank Redemption, The (1994)" "Schindler's List (1993)"
## [7] "Casablanca (1942)" "Return of the Jedi (1983)"
## [9] "Alien (1979)" "Blade Runner (1982)"
rm(UBCF_cosine_model,pred_UBCF_cos,pred_UBCF_cos_top)
UBCF model 2 will use Z-score normalization technique and a distance measurement of Pearson.
UBCF_pearson_modelz = Recommender(data=getData(eval_sets,"train"),method ="UBCF",parameter=list( normalize = "Z-score", method="pearson"))
pred_UBCF_pear = predict(object =UBCF_pearson_modelz,newdata=getData(eval_sets,"known"),n=10,type="ratings")
model_stats$UBCF_pearson_modelz= cbind( calcPredictionAccuracy(x=pred_UBCF_pear,data=getData(eval_sets,"unknown")))
pred_UBCF_pear_top = predict(object =UBCF_pearson_modelz,newdata=getData(eval_sets,"known"),n=10,type="topNList")
pred_UBCF_pear_top@itemLabels[pred_UBCF_pear_top@items[[1]]]
## [1] "Silence of the Lambs, The (1991)"
## [2] "Schindler's List (1993)"
## [3] "Godfather, The (1972)"
## [4] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)"
## [5] "L.A. Confidential (1997)"
## [6] "Fargo (1996)"
## [7] "Good Will Hunting (1997)"
## [8] "Usual Suspects, The (1995)"
## [9] "Raising Arizona (1987)"
## [10] "Close Shave, A (1995)"
rm(UBCF_pearson_modelz,pred_UBCF_pear,pred_UBCF_pear_top,eval_sets)
After evaluating all four models, the user based collaborative filtering models performed better than IBCF using RMSE. In my next build, I would like to search techniques that allow me to evaluate multiple models using ROC/AUC, and implementations of hybrid methods for collaborative filtering.
knitr::kable(model_stats,caption="Model Statistics")
IBCF_cosine_model | IBCF_pearson_model | UBCF_cosine_model | UBCF_pearson_modelz | |
---|---|---|---|---|
RMSE | 1.0887593 | 1.2041807 | 0.9490189 | 0.9759794 |
MSE | 1.1853968 | 1.4500512 | 0.9006369 | 0.9525357 |
MAE | 0.7971769 | 0.8791676 | 0.7488709 | 0.7756216 |
rm(model_stats)