The goal of this assignment is to give you practice working with accuracy and other recommender system metrics.

Get Data

The dataset was retrieved from the Kaggle website. It contains user ratings for restaurants on a 0-2 scale, with 2 as the highest rating. Each review includes an overall rating, a food rating, and a service rating. We first tabulate the original overall ratings, then tabulate the average of the three ratings, and we use that average in the rest of the analysis.

Restaurant Review Dataset

https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings

## Observations: 1,161
## Variables: 5
## $ userID         <fct> U1077, U1077, U1077, U1077, U1068, U1068, U1068, U10...
## $ placeID        <int> 135085, 135038, 132825, 135060, 135104, 132740, 1326...
## $ rating         <int> 2, 2, 2, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 0, 1, 1, 0, 1...
## $ food_rating    <int> 2, 2, 2, 2, 1, 0, 1, 0, 1, 2, 1, 2, 0, 0, 0, 2, 0, 2...
## $ service_rating <int> 2, 1, 2, 2, 2, 0, 1, 0, 1, 2, 1, 2, 1, 0, 0, 1, 2, 0...
##                vars    n      mean      sd median   trimmed   mad    min    max
## userID*           1 1161     70.90   40.60     73     71.25 53.37      1    138
## placeID           2 1161 134192.04 1100.92 135030 134275.03 68.20 132560 135109
## rating            3 1161      1.20    0.77      1      1.25  1.48      0      2
## food_rating       4 1161      1.22    0.79      1      1.27  1.48      0      2
## service_rating    5 1161      1.09    0.79      1      1.11  1.48      0      2
##                range  skew kurtosis    se
## userID*          137 -0.08    -1.22  1.19
## placeID         2549 -0.51    -1.73 32.31
## rating             2 -0.36    -1.25  0.02
## food_rating        2 -0.40    -1.30  0.02
## service_rating     2 -0.16    -1.39  0.02

Initial Ratings Table:

## 
##   0   1   2 
## 254 421 486

Average Ratings Table:

## 
##   0   1   2 
## 236 490 435
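For reference, the averaged rating is the per-review mean of the overall, food, and service ratings, rounded to the nearest integer; the core of this step (the full code is in the appendix) is roughly:

# Average of the overall, food and service ratings for each review, rounded to the nearest integer
urate$trating <- round((urate$rating + urate$food_rating + urate$service_rating)/3, 0)
table(as.vector(urate$trating))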

Create Matrix

In this section we build a matrix with userID in the rows, placeID in the columns, and the average rating as the values, and then convert it to a realRatingMatrix.

set.seed(123)
# Users in rows, places in columns, averaged ratings as values
data.mat <- matrix(data=urate$trating,ncol=length(unique(urate$placeID)),nrow=length(unique(urate$userID)))
rownames(data.mat)<-c(paste(unique(urate$userID)))
colnames(data.mat)<-c(paste(unique(urate$placeID)))
glimpse(data.mat)
##  num [1:138, 1:130] 2 2 2 2 1 0 1 0 1 2 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:130] "135085" "135038" "132825" "135060" ...
rdata.mat<-as(data.mat, "realRatingMatrix")
head(rdata.mat)
## 1 x 130 rating matrix of class 'realRatingMatrix' with 130 ratings.

Review

In this review we plot a histogram of the ratings distribution and an image of the user-to-item matrix. The similarity matrix below shows the cosine similarity among the first 7 users.
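The table, plots, and similarity matrix are produced with recommenderlab and ggplot2; a condensed version of the appendix code is:

# Tabulate the ratings stored in the matrix, plot their distribution and a user-to-item image
table(as.vector(rdata.mat@data))
vector_rates <- factor(as.vector(rdata.mat@data))
qplot(vector_rates) + ggtitle("Distribution of ratings")
image(rdata.mat[1:138, 1:130])
# Cosine similarity between the first 7 users
similarity(rdata.mat[1:7, ], method = "cosine", which = "users")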

## 
##    0    1    2 
## 3643 7559 6738

##           1         2         3         4         5         6
## 2 0.8524753                                                  
## 3 0.8246838 0.8332040                                        
## 4 0.8014966 0.8106867 0.8253189                              
## 5 0.8083101 0.7837322 0.8184162 0.8128217                    
## 6 0.7427496 0.7594570 0.7638613 0.7550580 0.7988795          
## 7 0.7950447 0.7793995 0.7525363 0.7650547 0.7688747 0.7405923

Partition Data

For the partition we set the training proportion to 80%, the number of items to keep (given) per test user to 15, the good-rating threshold to 2, and the number of evaluation runs to 1.

percentage_training<-.8
items_to_keep<- 15
rating_threshold<- 2
n_eval<-1
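These values are the inputs to recommenderlab's evaluationScheme. As a minimal sketch, a single 80/20 split scheme built from them would look like the following (eval_split is an illustrative name; the analysis itself uses the cross-validation scheme shown in the next section):

# Illustrative only: a single train/test split scheme built from the parameters above
eval_split <- evaluationScheme(data = rdata.mat, method = "split",
                               train = percentage_training, given = items_to_keep,
                               goodRating = rating_threshold, k = n_eval)
eval_split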

Evaluation Techniques

Our first evaluation technique compares the RMSE, MSE, and MAE metrics; a short computation sketch follows the list:

  • Root mean square error (RMSE): the square root of the mean squared difference between the real and predicted ratings.
  • Mean squared error (MSE): the mean of the squared difference between the real and predicted ratings. It is the square of the RMSE, so it contains the same information.
  • Mean absolute error (MAE): the mean of the absolute difference between the real and predicted ratings.
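As a quick illustration, the three metrics computed directly in R (the vectors actual and predicted are made-up placeholder values):

# Toy illustration of the three metrics on hypothetical actual vs. predicted ratings
actual    <- c(2, 1, 0, 2, 1)
predicted <- c(1.8, 1.2, 0.5, 1.5, 0.9)
err  <- actual - predicted
mse  <- mean(err^2)       # mean squared error
rmse <- sqrt(mse)         # RMSE is the square root of the MSE
mae  <- mean(abs(err))    # mean absolute error
c(RMSE = rmse, MSE = mse, MAE = mae)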

Cross Validation K-fold

Here we use k-fold cross-validation to split the data into chunks. We set k to 4, which results in 4 training chunks of 102 rows each.
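The folds are built with recommenderlab's evaluationScheme; the call (also shown in the appendix) is:

# 4-fold cross-validation; each test user keeps 15 'given' items, and 2 counts as a good rating
n_fold <- 4
eval_sets3 <- evaluationScheme(data = rdata.mat, method = "cross-validation",
                               k = n_fold, given = items_to_keep, goodRating = rating_threshold)
sapply(eval_sets3@runsTrain, length)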

## [1] 102 102 102 102

In this technique the preferred model has lower values for the RMSE, MSE, and MAE metrics, since the goal is to minimize error. The values listed below are the first 6 results by user, followed by one line aggregated over all users. The plots show the distribution of ratings per user and of the RMSE by user. The bulk of the users sit at the lower end of the RMSE range when using the IBCF model.
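These numbers come from training an IBCF recommender on the training folds, predicting ratings for the known part of each test user, and scoring against the withheld items; condensed from the appendix:

# Train IBCF, predict ratings for the test users, then compute accuracy per user and overall
eval_recommender <- Recommender(data = getData(eval_sets3, "train"), method = "IBCF")
eval_prediction  <- predict(eval_recommender, newdata = getData(eval_sets3, "known"),
                            n = 10, type = "ratings")
head(calcPredictionAccuracy(eval_prediction, data = getData(eval_sets3, "unknown"), byUser = TRUE))
calcPredictionAccuracy(eval_prediction, data = getData(eval_sets3, "unknown"), byUser = FALSE)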

##           RMSE       MSE       MAE
## [1,] 0.8009738 0.6415590 0.6732201
## [2,] 0.7889011 0.6223649 0.6346206
## [3,] 0.9071276 0.8228804 0.7344508
## [4,] 0.8662004 0.7503031 0.6972052
## [5,] 0.8301767 0.6891933 0.6685685
## [6,] 0.8930307 0.7975038 0.7163296

##      RMSE       MSE       MAE 
## 0.8744908 0.7647342 0.7061930

ROC and Precision-Recall

Another evaluation technique is to use ROC and precision-recall curves to compare the accuracy on positives and negatives. In this dataset we treat a rating of 2 as a positive result. The ROC curve compares the true positive rate (TPR) to the false positive rate (FPR). The precision-recall curve for this dataset compares the recommended (expected) ratings to the actual ratings.
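These curves are produced by evaluating the top-N recommendations over a range of list lengths and accumulating the confusion-matrix counts across folds; condensed from the appendix:

# Evaluate top-N lists for N = 10, 20, ..., 100, sum TP/FP/FN/TN over the folds, then plot
results <- evaluate(x = eval_sets3, method = "IBCF", n = seq(10, 100, 10))
columns_to_sum <- c("TP", "FP", "FN", "TN")
Reduce("+", getConfusionMatrix(results))[, columns_to_sum]
plot(results, annotate = TRUE, main = "ROC Curve")
plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")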

The ROC for this IBCF model shows no real curve, and the precision-recall plot shows that as the number of predictions increases, so do the matching actual ratings. This model appears to overfit.

Evaluate Recommendations

In this section we evaluate the following models; the evaluate() call is sketched after the list:

  • IBCF Cosine
  • IBCF Pearson
  • UBCF Cosine
  • UBCF Pearson
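The candidates are passed to evaluate() as a named list (condensed from the appendix; a RANDOM recommender is included there as a baseline):

# Named list of candidate models, evaluated over top-N list lengths 1, 5, 10, ..., 100
model_to_evaluate <- list(
    IBCF_cos = list(name = "IBCF", param = list(method = "cosine")),
    IBCF_cor = list(name = "IBCF", param = list(method = "pearson")),
    UBCF_cos = list(name = "UBCF", param = list(method = "cosine")),
    UBCF_cor = list(name = "UBCF", param = list(method = "pearson")),
    random   = list(name = "RANDOM", param = NULL))
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets3, method = model_to_evaluate, n = n_recommendations)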

Model Selection

The precision-recall plot shows that, for all of the models, recall decreases as precision increases. The ROC curve shows that the UBCF_cor model has the highest area under the curve (AUC) of all the models.

Parameter Selection

Using UBCF_cor, varying the k parameter over the list of candidate values does not change the ROC or precision-recall plots.
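The sweep over k reuses the same evaluate() call with one UBCF/Pearson entry per candidate value (condensed from the appendix):

# One UBCF / Pearson candidate per k value, evaluated with the same scheme
vector_k <- c(5, 10, 20, 30, 40)
model_to_evaluate <- lapply(vector_k, function(k){
    list(name = "UBCF", param = list(method = "pearson"), k = k)
})
names(model_to_evaluate) <- paste0("UBCF_k_", vector_k)
list_results <- evaluate(x = eval_sets3, method = model_to_evaluate, n = n_recommendations)
plot(list_results, annotate = 1, legend = "topleft")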

Conclusion

Based on this analysis, we select the UBCF_cor model with its default k value to generate the most accurate rating predictions for this dataset.

References

https://blog.exploratory.io/working-with-json-data-in-very-simple-way-

https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

Building a Recommendation System with R, Suresh K Gorakala and Michele Usuelli, Packt Publishing, 2015

APPENDIX

Code used in analysis

knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(recommenderlab)
library(jsonlite)
library(stringr)
rm(list=ls())#removes all variables previously stored

urate<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project4/rating_final.csv", header=TRUE)

glimpse(urate)
describe(urate)
table(as.vector(urate$rating))
urate$trating<-round((urate$rating + urate$food_rating + urate$service_rating)/3,0)
table(as.vector(urate$trating))

set.seed(123)
data.mat <- matrix(data=urate$trating,ncol=length(unique(urate$placeID)),nrow=length(unique(urate$userID)))
rownames(data.mat)<-c(paste(unique(urate$userID)))
colnames(data.mat)<-c(paste(unique(urate$placeID)))
glimpse(data.mat)
rdata.mat<-as(data.mat, "realRatingMatrix")
head(rdata.mat)
table(as.vector(rdata.mat@data))
vector_rates<-as.vector(rdata.mat@data)
vector_rates<-factor(vector_rates)
qplot(vector_rates)+ggtitle("Distribution of ratings")
image(rdata.mat[1:138, 1:130])
similarity(rdata.mat[1:7,],method="cosine", which="users")
percentage_training<-.8
items_to_keep<- 15
rating_threshold<- 2
n_eval<-1

n_fold<-4
eval_sets3<-evaluationScheme(data=rdata.mat, method="cross-validation", k=n_fold, given= items_to_keep, goodRating = rating_threshold)
size_sets<-sapply(eval_sets3@runsTrain, length)
size_sets

model_to_evaluate<- "IBCF"
model_parameters<-NULL

eval_recommender <- Recommender(data=getData(eval_sets3, "train"), method=model_to_evaluate, parameter=model_parameters)
items_to_recommend<-10

eval_prediction<-predict(object= eval_recommender, newdata=getData(eval_sets3, "known"), n=items_to_recommend, type="ratings")

qplot(rowCounts(eval_prediction)) + geom_histogram(binwidth = 5) + ggtitle("Distribution of ratings per user")

eval_accuracy<- calcPredictionAccuracy(x = eval_prediction, data=getData(eval_sets3, "unknown"), byUser=TRUE)
head(eval_accuracy)

qplot(eval_accuracy[,"RMSE"]) + geom_histogram(binwidth = .01) + ggtitle("Distribution of the RMSE by user")

eval_accuracy2<- calcPredictionAccuracy(x = eval_prediction, data=getData(eval_sets3, "unknown"), byUser=FALSE)
eval_accuracy2
results<-evaluate(x= eval_sets3, method= model_to_evaluate, n=seq(10,100,10))
columns_to_sum<-c("TP","FP","FN","TN")
indices_summed<-Reduce("+",getConfusionMatrix(results))[,columns_to_sum]
plot(results, annotate= TRUE, main= "ROC Curve")
plot(results, "prec/rec", annotate=TRUE, main="Precision-recall")
model_to_evaluate<- list(
    IBCF_cos = list(name="IBCF", param=list(method="cosine")),
    IBCF_cor = list(name="IBCF", param=list(method="pearson")),
    UBCF_cos = list(name="UBCF", param=list(method="cosine")),
    UBCF_cor = list(name="UBCF", param=list(method="pearson")),
    random = list(name="RANDOM", param=NULL))
n_recommendations<-c(1,5,seq(10,100,10))

list_results<-evaluate(x=eval_sets3, method=model_to_evaluate, n=n_recommendations)

plot(list_results, annotate=1, legend="topleft") 
title("ROC curve")
plot(list_results,"prec/rec", annotate=1, legend="bottomright") 
title("Precision-recall")
vector_k <- c(5,10,20,30,40)
model_to_evaluate <- lapply(vector_k, function(k){
    list(name="UBCF", param=list(method="pearson"),k=k)
})
names(model_to_evaluate)<-paste0("UBCF_k_", vector_k)
list_results<-evaluate(x=eval_sets3, method=model_to_evaluate, n=n_recommendations)
plot(list_results, annotate=1, legend="topleft") 
title("ROC curve")
plot(list_results,"prec/rec", annotate=1, legend="bottomright") 
title("Precision-recall")