Data 612 - Project 4

The goal of this assignment is to give you practice working with accuracy and other recommender system metrics. In this assignment you’re asked to do at least one, or (if you like) both, of the following:

  • Work in a small group, and/or
  • Choose a different dataset to work with from your previous projects.

Data Transformation: Raw Data

## [1] "Raw.Df:   c(24983, 101)"

Data Filtering

  • The dataset is very large (24,983 rows × 101 columns), so I remove some data to make it more manageable
  • Subset on column 0, the number of jokes each user rated, keeping only the users who rated 80 jokes (a count selected at random)
  • Remove the columns that are entirely null (unrated jokes)
  • Remove the columns where the joke was rated by all users
  • The final subset is more manageable: 145 rows (reviewers) and 25 columns (jokes)
## [1] "Subset.1 (Num Jokes Filter):   c(145, 101)"
## [1] "Subset.2(Null Eval):   c(145, 25)"
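
The three filtering steps above can be sketched on a toy table. This is only an illustration in Python, not the code that produced the output above: the rows, the target count of 3, the use of None for "not rated", and the helper name filter_ratings are all assumptions made for the example.

```python
# Hypothetical toy data: each row is [num_rated, rating_1, ..., rating_5].
# None stands in for "not rated"; the real file's missing-value code may differ.
rows = [
    [3, 1.5, None, -2.0, 4.0, None],
    [3, 0.5, None, 3.0, None, -1.0],
    [2, 2.0, None, None, 7.5, None],
    [3, -4.0, None, None, 6.0, 1.0],
]


def filter_ratings(rows, target_count):
    """Return (kept_column_indices, filtered_rows) after the three filtering steps."""
    # Step 1: keep users whose count column (column 0) equals target_count,
    # then drop the count column itself.
    users = [r[1:] for r in rows if r[0] == target_count]
    if not users:
        return [], []
    n_users = len(users)
    keep = []
    for j in range(len(users[0])):
        n_rated = sum(1 for u in users if u[j] is not None)
        # Step 2 drops jokes nobody rated; step 3 drops jokes everybody rated.
        if 0 < n_rated < n_users:
            keep.append(j)
    return keep, [[u[j] for j in keep] for u in users]


keep, filtered = filter_ratings(rows, target_count=3)
print(keep)
print(filtered)
```

In the toy table, the first joke is dropped because every remaining user rated it and the second because nobody did, which mirrors how the 101 columns shrink to 25 in the real subset.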

EDA

  • Evaluate how many users rated each joke (displayed as a histogram)
  • Evaluate the average rating of each joke (displayed as a histogram)
  • Visualize the whole ratings matrix as a heatmap with image(), where the colors represent the ratings. Y-axis = selected users (50), X-axis = jokes (25)
  • Create a matrix from the subset2 dataset for later use
  • Create a realRatingMatrix

Joke Histograms

  • Count of ratings per joke (Num. Joke Ranked)
  • Mean joke rating (Average Joke Ranking - Joke)
  • Mean Joke rating per User (Average joke Ranking - User)
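
The three quantities behind these histograms are simple column and row summaries that skip missing ratings. A minimal Python sketch, assuming a small hypothetical matrix with None for unrated cells (the function names column_stats and row_means are made up for the example):

```python
from statistics import mean

# Hypothetical 3 users x 3 jokes matrix; None means "not rated".
ratings = [
    [-2.0, 4.0, None],
    [3.0, None, -1.0],
    [None, 6.0, 1.0],
]


def column_stats(matrix):
    """Per joke: number of ratings and mean rating, ignoring unrated cells."""
    counts, means = [], []
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix if row[j] is not None]
        counts.append(len(col))
        means.append(mean(col) if col else None)
    return counts, means


def row_means(matrix):
    """Per user: mean of the ratings that user actually gave."""
    out = []
    for row in matrix:
        rated = [v for v in row if v is not None]
        out.append(mean(rated) if rated else None)
    return out


counts, joke_means = column_stats(ratings)
user_means = row_means(ratings)
print(counts, joke_means, user_means)
```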

Joke Heat Map

  • rating vs users

Modeling

Modeling: Train/Test Dataset

  • Split the data into train and test datasets
  • Develop dynamic functions that wrap the Recommender(), predict() & calcPredictionAccuracy() commands
  • Pass attributes for IBCF COS, IBCF PEARSON, UBCF COS & UBCF PEARSON into the dynamic functions
  • Evaluate performance in the model statistics functions
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=1.164286
## Data set: 145 x 25 rating matrix of class 'realRatingMatrix' with 725 ratings.
## 126 x 25 rating matrix of class 'realRatingMatrix' with 630 ratings.
## 19 x 25 rating matrix of class 'realRatingMatrix' with 76 ratings.
## 19 x 25 rating matrix of class 'realRatingMatrix' with 19 ratings.
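
In the output above, the 19 test users keep 76 known ratings (4 each) and have 19 withheld (1 each), which is what an "all-but-1" scheme implies. The idea can be sketched as follows; this is an illustration in Python with a hypothetical user and seed, not the routine that produced the output.

```python
import random


def all_but_one(user_ratings, rng):
    """Split one user's {item: rating} dict into (known, unknown), withholding one item."""
    held = rng.choice(sorted(user_ratings))  # sorted keys make the draw reproducible
    known = {item: r for item, r in user_ratings.items() if item != held}
    unknown = {held: user_ratings[held]}
    return known, unknown


# Hypothetical test user with five ratings; the seed is arbitrary.
rng = random.Random(612)
user = {"V72": 1.2, "V74": -3.4, "V80": 5.0, "V85": 0.3, "V91": -7.1}
known, unknown = all_but_one(user, rng)
print(len(known), "known;", len(unknown), "withheld")
```

The model sees only the known ratings of a test user and is scored on how well it predicts the withheld one.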

Modeling: IBCF Model - Cosine

## [1] "IBCF.COS PRED (Joke IDS): c(\"V74\", \"V80\", \"V81\", \"V83\", \"V85\", \"V86\", \"V88\", \"V91\", \"V93\", \"V95\")"

Modeling: IBCF Model - Pearson

## [1] "IBCF.PEARSON.PRED (Joke IDS): c(\"V75\", \"V85\", \"V77\", \"V72\", \"V80\", \"V82\", \"V83\", \"V86\", \"V87\", \"V89\")"

Modeling: UBCF Model - Cosine

## [1] "UBCF.COS PRED (Joke IDS): c(\"V95\", \"V82\", \"V73\", \"V77\", \"V83\", \"V90\", \"V86\", \"V72\", \"V74\", \"V92\")"

Modeling: UBCF Model - Pearson

## [1] "UBCF.PEARSON.PRED (Joke IDS): c(\"V85\", \"V83\", \"V88\", \"V87\", \"V82\", \"V81\", \"V77\", \"V86\", \"V75\", \"V74\")"

Modeling: Model Statistics

        IBCF.COS  IBCF.PEARSON   UBCF.COS  UBCF.PEARSON
RMSE    6.223929      4.701376   4.409739      4.988854
MSE    38.737289     22.102935  19.445803     24.888662
MAE     5.663571      3.445970   3.342035      4.030587
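
For reference, the three statistics are computed from the prediction errors on the withheld ratings: MSE is the mean squared error, RMSE is its square root (in the table, 6.223929 squared is approximately 38.737), and MAE is the mean absolute error. A minimal Python sketch with hypothetical predictions; the function name accuracy is made up for the example:

```python
import math


def accuracy(predicted, actual):
    """Return RMSE, MSE and MAE for paired predicted/actual ratings."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}


# Hypothetical values, not taken from the Jester subset.
predicted = [2.0, -1.0, 4.0, 0.0]
actual = [5.0, -1.0, 0.0, 0.0]
stats = accuracy(predicted, actual)
print(stats)
```

Because RMSE squares the errors before averaging, a few large misses raise it more than they raise MAE, which is why the two can rank models differently.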

Modeling: Findings

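  • With cosine similarity, the user-based model clearly outperformed its item-based counterpart (RMSE 4.41 vs. 6.22).
  • With Pearson similarity, the item-based model had the slightly lower error (RMSE 4.70 vs. 4.99).
  • Overall, the user-based cosine model had the lowest RMSE, MSE, and MAE of the four.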

Comparing Modeling: Additional Models

  • Building on the above, eight models with different attributes are evaluated against a random baseline
  • IBCF & UBCF, each with cosine, Pearson, or Jaccard similarity
  • IBCF & UBCF with cosine similarity on ratings normalized by Z-score
  • Evaluation log, summarized: each of the nine recommenders (IBCF ×4, UBCF ×4, RANDOM) ran over 10 folds in two evaluation passes; model and prediction times were at most 0.04 sec per fold
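
The similarity measures and the Z-score normalization named in the list above can be sketched as follows. This is a simplified Python illustration, not the library's implementation: it assumes cosine and Pearson are taken over co-rated items only, Jaccard over the sets of rated items, and the Z-score uses the sample standard deviation of each user's ratings. The two rating vectors are hypothetical.

```python
import math
from statistics import mean, stdev


def co_rated(a, b):
    """Pairs of ratings for the items both vectors have rated."""
    return [(x, y) for x, y in zip(a, b) if x is not None and y is not None]


def cosine(a, b):
    pairs = co_rated(a, b)
    dot = sum(x * y for x, y in pairs)
    na = math.sqrt(sum(x * x for x, _ in pairs))
    nb = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (na * nb) if na and nb else None


def pearson(a, b):
    pairs = co_rated(a, b)
    if len(pairs) < 2:
        return None
    ma = mean([x for x, _ in pairs])
    mb = mean([y for _, y in pairs])
    # Pearson is cosine similarity on mean-centered ratings.
    dot = sum((x - ma) * (y - mb) for x, y in pairs)
    na = math.sqrt(sum((x - ma) ** 2 for x, _ in pairs))
    nb = math.sqrt(sum((y - mb) ** 2 for _, y in pairs))
    return dot / (na * nb) if na and nb else None


def jaccard(a, b):
    """Overlap of the rated-item sets; the rating values themselves are ignored."""
    sa = {i for i, x in enumerate(a) if x is not None}
    sb = {i for i, y in enumerate(b) if y is not None}
    union = sa | sb
    return len(sa & sb) / len(union) if union else None


def z_score(row):
    """Center and scale one user's ratings (needs at least two ratings)."""
    rated = [v for v in row if v is not None]
    mu, sd = mean(rated), stdev(rated)
    return [None if v is None else (v - mu) / sd for v in row]


# Hypothetical rating vectors; None means "not rated".
a = [1.0, 2.0, None, 3.0]
b = [2.0, 4.0, 5.0, None]
print(cosine(a, b), pearson(a, b), jaccard(a, b), z_score(a))
```

Jaccard only looks at which jokes were rated, so on a subset where every user rated a different handful of jokes it captures overlap in exposure rather than agreement in taste.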

Comparing Modeling: Additional Models ROC

Comparing Modeling: Additional Models Precision/Recall
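
ROC and precision/recall curves for top-N lists are built from confusion counts: a withheld joke counts as "good" when its rating meets the threshold (the evaluation scheme above reports good ratings as >= 1.164286), and each point on a curve comes from one list length N. A rough Python sketch of the per-user calculation, using hypothetical item IDs and hypothetical helper names:

```python
def confusion(recommended, good, all_items):
    """Counts of true/false positives and negatives for one top-N list."""
    rec, good, items = set(recommended), set(good), set(all_items)
    tp = len(rec & good)           # recommended and actually good
    fp = len(rec - good)           # recommended but not good
    fn = len(good - rec)           # good but not recommended
    tn = len(items - rec - good)   # neither recommended nor good
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Precision, recall (true positive rate) and false positive rate."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr


# Hypothetical example: 10 candidate jokes, a top-3 list, 4 good jokes.
all_items = ["J%d" % i for i in range(1, 11)]
recommended = ["J1", "J2", "J3"]
good = ["J2", "J3", "J7", "J8"]
tp, fp, fn, tn = confusion(recommended, good, all_items)
precision, recall, fpr = rates(tp, fp, fn, tn)
print(precision, recall, fpr)
```

A ROC curve plots recall (TPR) against FPR as N grows, and a precision/recall curve plots precision against recall over the same values of N.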

Comparing Modeling: RMSE

                 RMSE      MSE       MAE
IBCF.COS         6.305947  41.02033  5.272161
IBCF.PEAR        5.601020  33.05513  4.196643
IBCF.JACC        5.658788  32.56521  4.378391
IBCF.COS.ZSCORE  6.018913  37.80703  5.076153
UBCF.COS         5.098177  26.50560  3.839372
UBCF.PEAR        5.244337  27.93409  3.988254
UBCF.JACC        5.082091  26.37037  3.938151
UBCF.COS.ZSCORE  5.132665  26.83506  3.864089
random           6.154002  38.28992  4.749700

Comparing Modeling: Summary

The user-based models performed better than their item-based counterparts on RMSE, MSE, and MAE. The best model by RMSE was the user-based Jaccard model at 5.08, although the user-based cosine model had the lowest MAE (3.84). The item-based cosine model (RMSE 6.31) scored worse than the random baseline (6.15). The user-based models also had the most “normal” distributions. However, based on the shape of the ROC curves for all the models, there is some room for improvement.

Serendipity, finding something good or useful while not specifically searching for it, could possibly be increased by appending new jokes to the suggestion list.

The models for the Jester analysis assessed offline accuracy using a dataset of jokes that already have ratings. An online implementation could work as a recommender system that recommends jokes for users to rate, with success judged by how closely the users' actual ratings match the predictions. Should Jester implement an online system, A/B testing could be used to determine which system is more effective.
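
One simple way such an A/B comparison could be scored is a two-proportion z-test on the share of recommended jokes that users rate as "good" under each system. The sketch below is only an illustration in Python: the counts are hypothetical, the function name is made up, and the choice of test is an assumption rather than something the analysis above prescribes.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic and two-sided p-value for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: "good" ratings out of jokes recommended by each system.
z, p_value = two_proportion_z(success_a=40, n_a=100, success_b=60, n_b=100)
print(z, p_value)
```

A small p-value would suggest that the difference between the two systems is unlikely to be due to chance alone, given enough recommendations in each arm.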