The goal of this assignment is to give you practice working with accuracy and other recommender system metrics.
The dataset was retrieved from the Kaggle website. It contains user ratings for restaurants on a 0-2 scale, with 2 as the highest rating. Each review includes an overall rating, a food rating, and a service rating. We tabulate the initial overall ratings, then tabulate the average of the three ratings, and we use that average in our analysis.
Restaurant Review Dataset
https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings
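The ratings file is read from a GitHub copy of the Kaggle data, and glimpse() and describe() produce the summaries shown below (code reproduced from the appendix):
urate <- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project4/rating_final.csv", header=TRUE)
glimpse(urate)
describe(urate)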
## Observations: 1,161
## Variables: 5
## $ userID <fct> U1077, U1077, U1077, U1077, U1068, U1068, U1068, U10...
## $ placeID <int> 135085, 135038, 132825, 135060, 135104, 132740, 1326...
## $ rating <int> 2, 2, 2, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 0, 1, 1, 0, 1...
## $ food_rating <int> 2, 2, 2, 2, 1, 0, 1, 0, 1, 2, 1, 2, 0, 0, 0, 2, 0, 2...
## $ service_rating <int> 2, 1, 2, 2, 2, 0, 1, 0, 1, 2, 1, 2, 1, 0, 0, 1, 2, 0...
## vars n mean sd median trimmed mad min max
## userID* 1 1161 70.90 40.60 73 71.25 53.37 1 138
## placeID 2 1161 134192.04 1100.92 135030 134275.03 68.20 132560 135109
## rating 3 1161 1.20 0.77 1 1.25 1.48 0 2
## food_rating 4 1161 1.22 0.79 1 1.27 1.48 0 2
## service_rating 5 1161 1.09 0.79 1 1.11 1.48 0 2
## range skew kurtosis se
## userID* 137 -0.08 -1.22 1.19
## placeID 2549 -0.51 -1.73 32.31
## rating 2 -0.36 -1.25 0.02
## food_rating 2 -0.40 -1.30 0.02
## service_rating 2 -0.16 -1.39 0.02
Initial Ratings Table:
##
## 0 1 2
## 254 421 486
Average Ratings Table:
##
## 0 1 2
## 236 490 435
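The averaged rating used in the remainder of the analysis is the mean of the overall, food, and service ratings, rounded to the nearest integer (code reproduced from the appendix):
table(as.vector(urate$rating))
urate$trating <- round((urate$rating + urate$food_rating + urate$service_rating)/3, 0)
table(as.vector(urate$trating))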
In this section we create a realRatingMatrix with userID in the rows, placeID in the columns, and the average rating as the cell value.
set.seed(123)
data.mat <- matrix(data=urate$trating,ncol=length(unique(urate$placeID)),nrow=length(unique(urate$userID)))
rownames(data.mat)<-c(paste(unique(urate$userID)))
colnames(data.mat)<-c(paste(unique(urate$placeID)))
glimpse(data.mat)
## num [1:138, 1:130] 2 2 2 2 1 0 1 0 1 2 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:130] "135085" "135038" "132825" "135060" ...
rdata.mat<-as(data.mat, "realRatingMatrix")
head(rdata.mat)
## 1 x 130 rating matrix of class 'realRatingMatrix' with 130 ratings.
In this section we plot a histogram of the rating distribution and an image of the user-to-item rating matrix. The similarity matrix below shows the cosine similarity between the first 7 users.
##
## 0 1 2
## 3643 7559 6738
## 1 2 3 4 5 6
## 2 0.8524753
## 3 0.8246838 0.8332040
## 4 0.8014966 0.8106867 0.8253189
## 5 0.8083101 0.7837322 0.8184162 0.8128217
## 6 0.7427496 0.7594570 0.7638613 0.7550580 0.7988795
## 7 0.7950447 0.7793995 0.7525363 0.7650547 0.7688747 0.7405923
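The rating-count table, histogram, user-to-item image, and similarity matrix above are produced as follows (code reproduced from the appendix; recommenderlab's similarity() expects which="users" to compare users):
table(as.vector(rdata.mat@data))
vector_rates <- factor(as.vector(rdata.mat@data))
qplot(vector_rates) + ggtitle("Distribution of ratings")
image(rdata.mat[1:138, 1:130])
similarity(rdata.mat[1:7,], method="cosine", which="users")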
For the partition we set the training proportion to 80%, the number of items to keep to 15, the good-rating threshold to 2, and the number of evaluation runs to 1.
percentage_training<-.8
items_to_keep<- 15
rating_threshold<- 2
n_eval<-1
Our first evaluation technique compares the root mean squared error (RMSE), mean squared error (MSE), and mean absolute error (MAE) metrics:
Here we use k-fold cross-validation to split the data into folds. We set k to 4, which results in four runs whose training sets each contain 102 users.
## [1] 102 102 102 102
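The scheme is built with recommenderlab's evaluationScheme(); the vector above is the size of each run's training set (code reproduced from the appendix):
n_fold <- 4
eval_sets3 <- evaluationScheme(data=rdata.mat, method="cross-validation", k=n_fold, given=items_to_keep, goodRating=rating_threshold)
sapply(eval_sets3@runsTrain, length)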
With this technique the preferred model is the one with the lowest RMSE, MSE, and MAE, since the goal is to minimize error. The values listed below are the first six per-user results, followed by a single row averaged over all users. The plots show the distribution of ratings per user and the distribution of RMSE by user; the bulk of the users sit at the lower end of the RMSE range when using the IBCF model.
## RMSE MSE MAE
## [1,] 0.8009738 0.6415590 0.6732201
## [2,] 0.7889011 0.6223649 0.6346206
## [3,] 0.9071276 0.8228804 0.7344508
## [4,] 0.8662004 0.7503031 0.6972052
## [5,] 0.8301767 0.6891933 0.6685685
## [6,] 0.8930307 0.7975038 0.7163296
## RMSE MSE MAE
## 0.8744908 0.7647342 0.7061930
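These error metrics come from training an IBCF recommender on the training folds, predicting ratings from the known items of each test user, and comparing the predictions against the withheld (unknown) ratings (condensed from the appendix):
eval_recommender <- Recommender(data=getData(eval_sets3, "train"), method="IBCF", parameter=NULL)
eval_prediction <- predict(object=eval_recommender, newdata=getData(eval_sets3, "known"), n=10, type="ratings")
head(calcPredictionAccuracy(x=eval_prediction, data=getData(eval_sets3, "unknown"), byUser=TRUE))
calcPredictionAccuracy(x=eval_prediction, data=getData(eval_sets3, "unknown"), byUser=FALSE)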
Another evaluation technique is to use ROC and precision-recall curves to compare the accuracy of positive and negative predictions. In this dataset we treat a rating of 2 as a positive result. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), while the precision-recall curve compares the recommended items against the actual ratings: precision is the share of recommendations that are relevant, and recall is the share of relevant items that are recommended.
The ROC for this IBCF model shows essentially no curve, and the precision-recall plot shows that as the number of predictions increases, so does the count of actual positive ratings recovered. This model appears to overfit.
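The ROC and precision-recall curves for the IBCF model are generated by evaluating the scheme over an increasing number of recommendations (code reproduced from the appendix):
results <- evaluate(x=eval_sets3, method="IBCF", n=seq(10, 100, 10))
plot(results, annotate=TRUE, main="ROC Curve")
plot(results, "prec/rec", annotate=TRUE, main="Precision-recall")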
In this section we evaluate five models: IBCF with cosine similarity, IBCF with Pearson correlation, UBCF with cosine similarity, UBCF with Pearson correlation, and a random recommender as a baseline.
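The comparison uses the following model definitions and evaluation call (reproduced from the code appendix):
model_to_evaluate <- list(
  IBCF_cos = list(name="IBCF", param=list(method="cosine")),
  IBCF_cor = list(name="IBCF", param=list(method="pearson")),
  UBCF_cos = list(name="UBCF", param=list(method="cosine")),
  UBCF_cor = list(name="UBCF", param=list(method="pearson")),
  random = list(name="RANDOM", param=NULL))
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x=eval_sets3, method=model_to_evaluate, n=n_recommendations)
plot(list_results, annotate=1, legend="topleft")
title("ROC curve")
plot(list_results, "prec/rec", annotate=1, legend="bottomright")
title("Precision-recall")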
The precision-recall plot shows that, for every model, recall decreases as precision increases. The ROC curves show that the UBCF_cor model has the highest area under the curve (AUC) of all the models.
Using UBCF_cor, we vary the neighborhood-size parameter k over a vector of candidate values (see the sweep below); the ROC and precision-recall plots do not noticeably change.
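The sweep is set up as below (see the code appendix); recommenderlab's UBCF model takes its neighborhood size through the nn parameter, so the candidate values are passed as nn inside param:
vector_k <- c(5, 10, 20, 30, 40)
model_to_evaluate <- lapply(vector_k, function(k){
  list(name="UBCF", param=list(method="pearson", nn=k))  # nn = neighborhood size
})
names(model_to_evaluate) <- paste0("UBCF_k_", vector_k)
list_results <- evaluate(x=eval_sets3, method=model_to_evaluate, n=n_recommendations)
plot(list_results, annotate=1, legend="topleft")
title("ROC curve")
plot(list_results, "prec/rec", annotate=1, legend="bottomright")
title("Precision-recall")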
Based on this analysis, we select the UBCF_cor model with the default k value to generate the most accurate rating predictions for this dataset.
Code used in analysis
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE
)
#knitr::opts_chunk$set(echo = TRUE)
require(knitr)
library(ggplot2)
library(tidyr)
library(MASS)
library(psych)
library(kableExtra)
library(dplyr)
library(faraway)
library(gridExtra)
library(reshape2)
library(leaps)
library(pROC)
library(caret)
library(naniar)
library(pander)
library(mlbench)
library(e1071)
library(fpp2)
library(mlr)
library(recommenderlab)
library(jsonlite)
library(stringr)
rm(list=ls())#removes all variables previously stored
urate<- read.csv("https://raw.githubusercontent.com/apag101/Data612/master/Projects/Project4/rating_final.csv", header=TRUE)
glimpse(urate)
describe(urate)
table(as.vector(urate$rating))
# average the overall, food, and service ratings and round to the nearest integer
urate$trating<-round((urate$rating + urate$food_rating + urate$service_rating)/3,0)
table(as.vector(urate$trating))
set.seed(123)
# reshape the averaged ratings into a users (rows) x places (columns) matrix
data.mat <- matrix(data=urate$trating,ncol=length(unique(urate$placeID)),nrow=length(unique(urate$userID)))
rownames(data.mat)<-c(paste(unique(urate$userID)))
colnames(data.mat)<-c(paste(unique(urate$placeID)))
glimpse(data.mat)
rdata.mat<-as(data.mat, "realRatingMatrix")
head(rdata.mat)
table(as.vector(rdata.mat@data))
vector_rates<-as.vector(rdata.mat@data)
vector_rates<-factor(vector_rates)
qplot(vector_rates)+ggtitle("Distribution of ratings")
image(rdata.mat[1:138, 1:130])
similarity(rdata.mat[1:7,],method="cosine", which="users")
percentage_training<-.8
items_to_keep<- 15
rating_threshold<- 2
n_eval<-1
n_fold<-4
# 4-fold cross-validation; each test user keeps `given` items as known ratings, and ratings >= goodRating count as positives
eval_sets3<-evaluationScheme(data=rdata.mat, method="cross-validation", k=n_fold, given= items_to_keep, goodRating = rating_threshold)
size_sets<-sapply(eval_sets3@runsTrain, length)
size_sets
model_to_evaluate<- "IBCF"
model_parameters<-NULL
eval_recommender <- Recommender(data=getData(eval_sets3, "train"), method=model_to_evaluate, parameter=model_parameters)
items_to_recommend<-10
eval_prediction<-predict(object= eval_recommender, newdata=getData(eval_sets3, "known"), n=items_to_recommend, type="ratings")
qplot(rowCounts(eval_prediction)) + geom_histogram(binwidth = 5) + ggtitle("Distribution of ratings per user")
eval_accuracy<- calcPredictionAccuracy(x = eval_prediction, data=getData(eval_sets3, "unknown"), byUser=TRUE)
head(eval_accuracy)
qplot(eval_accuracy[,"RMSE"]) + geom_histogram(binwidth = .01) + ggtitle("Distribution of the RMSE by user")
eval_accuracy2<- calcPredictionAccuracy(x = eval_prediction, data=getData(eval_sets3, "unknown"), byUser=FALSE)
eval_accuracy2
results<-evaluate(x= eval_sets3, method= model_to_evaluate, n=seq(10,100,10))
columns_to_sum<-c("TP","FP","FN","TN")
indices_summed<-Reduce("+",getConfusionMatrix(results))[,columns_to_sum]
plot(results, annotate= TRUE, main= "ROC Curve")
plot(results, "prec/rec", annotate=TRUE, main="Precision-recall")
model_to_evaluate<- list(
IBCF_cos = list(name="IBCF", param=list(method="cosine")),
IBCF_cor = list(name="IBCF", param=list(method="pearson")),
UBCF_cos = list(name="UBCF", param=list(method="cosine")),
UBCF_cor = list(name="UBCF", param=list(method="pearson")),
random = list(name="RANDOM", param=NULL))
n_recommendations<-c(1,5,seq(10,100,10))
list_results<-evaluate(x=eval_sets3, method=model_to_evaluate, n=n_recommendations)
plot(list_results, annotate=1, legend="topleft")
title("ROC curve")
plot(list_results,"prec/rec", annotate=1, legend="bottomright")
title("Precision-recall")
vector_k <- c(5,10,20,30,40)
model_to_evaluate <- lapply(vector_k, function(k){
  # nn is recommenderlab's neighborhood-size parameter for UBCF
  list(name="UBCF", param=list(method="pearson", nn=k))
})
names(model_to_evaluate)<-paste0("UBCF_k_", vector_k)
list_results<-evaluate(x=eval_sets3, method=model_to_evaluate, n=n_recommendations)
plot(list_results, annotate=1, legend="topleft")
title("ROC curve")
plot(list_results,"prec/rec", annotate=1, legend="bottomright")
title("Precision-recall")