Project4_Data612

Data 612 - Project 4

The goal of this assignment is to give you practice working with accuracy and other recommender system metrics. In this assignment you’re asked to do at least one or (if you like) both of the following:

Work in a small group, and/or
Choose a different dataset to work with from your previous projects.

Data Transformation: Raw Data

## [1] "Raw.Df:   c(24983, 101)"

Data Filtering

The dataset is very large (24,983 rows by 101 columns), so some data is removed to make it more manageable
Subset by the number of jokes rated by a user (column 0). The value of 80 jokes rated was selected at random.
Remove the columns that are entirely null (unrated jokes)
Remove the columns where the joke was rated by all users
The final subset is more manageable, with 145 rows (reviewers) & 25 columns (jokes)

## [1] "Subset.1 (Num Jokes Filter):   c(145, 101)"

## [1] "Subset.2(Null Eval):   c(145, 25)"
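The filtering steps above can be sketched in a few lines. This is a minimal Python illustration, not the code behind this report: the toy rows, the use of None for an unrated joke, and the helper name filter_ratings are all assumptions made for the example.

```python
# Hypothetical toy data: column 0 is the number of jokes the user rated,
# the remaining columns are ratings. None marks an unrated joke
# (an assumption for this sketch, not the dataset's own encoding).
raw = [
    [3, 1.5, -2.0, None, 4.0, None],
    [3, 0.5, None, None, -1.0, 2.0],
    [2, 7.0, None, None, None, 1.0],
    [3, -3.0, 6.0, None, None, 8.5],
]


def filter_ratings(rows, n_rated):
    """Keep users with exactly n_rated ratings, then drop uninformative columns."""
    # Step 1: subset users by the count in column 0 and drop that column.
    users = [row[1:] for row in rows if row[0] == n_rated]
    if not users:
        return [], []
    n_users = len(users)
    keep = []
    for j in range(len(users[0])):
        rated = sum(1 for u in users if u[j] is not None)
        # Step 2: drop columns nobody rated. Step 3: drop columns everybody rated.
        if 0 < rated < n_users:
            keep.append(j)
    return [[u[j] for j in keep] for u in users], keep


subset, kept_columns = filter_ratings(raw, 3)
print(len(subset), "users x", len(kept_columns), "jokes")
```

In the toy data the first joke is rated by every remaining user and the third by none, so both columns are dropped and three partially rated jokes remain.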

EDA

Evaluate how many users rated each joke (displayed as a histogram)
Evaluate the average rating of each joke (displayed as a histogram)
Visualize the whole matrix of ratings as a heat map built with image(), where the colors represent the ratings. Y-axis = selected users (50), X-axis = jokes (25)
Create a matrix from the Subset.2 dataset for later use
Create a realRatingMatrix

Joke Histograms

Count of rating per joke (Num. Joke Ranked)
Mean joke rating (Average Joke Ranking - Joke)
Mean Joke rating per User (Average joke Ranking - User)
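The three quantities plotted in these histograms are simple column and row summaries of the rating matrix. A minimal Python sketch, assuming a small hypothetical matrix in which None marks an unrated joke and every user has at least one rating:

```python
from statistics import mean

# Hypothetical ratings: rows are users, columns are jokes, None = unrated.
ratings = [
    [-2.0, 4.0, None],
    [None, -1.0, 2.0],
    [6.0, None, 8.5],
]


def joke_stats(matrix):
    """Return (ratings per joke, mean rating per joke) over the rated cells only."""
    counts, means = [], []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix if row[j] is not None]
        counts.append(len(column))
        means.append(mean(column) if column else None)
    return counts, means


def user_means(matrix):
    """Mean rating per user, ignoring unrated jokes."""
    return [mean([r for r in row if r is not None]) for row in matrix]


counts, joke_means = joke_stats(ratings)
print(counts, joke_means, user_means(ratings))
```

Each list can then be passed to any histogram routine. Unrated cells are skipped rather than treated as zero, which would otherwise pull the means toward the middle of the rating scale.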

Joke Heat Map

rating vs users

Modeling

Modeling: Train/Test Dataset

Split the data into train and test datasets
Develop dynamic functions for the Recommender(), predict() & calcPredictionAccuracy() commands
Pass attributes for IBCF COS, IBCF PEARSON, UBCF COS & UBCF PEARSON into the dynamic functions
Evaluate performance in the model statistics functions

## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=1.164286
## Data set: 145 x 25 rating matrix of class 'realRatingMatrix' with 725 ratings.

## 126 x 25 rating matrix of class 'realRatingMatrix' with 630 ratings.

## 19 x 25 rating matrix of class 'realRatingMatrix' with 76 ratings.

## 19 x 25 rating matrix of class 'realRatingMatrix' with 19 ratings.
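In the output above, the 19 test users contribute 76 known ratings and 19 withheld ratings, which is consistent with an all-but-1 scheme: one rating per test user is hidden and later compared with the prediction. A minimal Python sketch of that split for a single user follows; the ratings, the seed, and the helper name all_but_one are hypothetical.

```python
import random


def all_but_one(user_ratings, rng):
    """Split one user's {item: rating} dict into known ratings and one held-out rating."""
    held_out_item = rng.choice(sorted(user_ratings))
    known = {i: r for i, r in user_ratings.items() if i != held_out_item}
    unknown = {held_out_item: user_ratings[held_out_item]}
    return known, unknown


rng = random.Random(612)  # fixed seed so the sketch is repeatable
user = {"V72": 1.2, "V74": -3.4, "V80": 5.0, "V83": 0.3, "V86": -7.1}
known, unknown = all_but_one(user, rng)
print(len(known), "known,", len(unknown), "held out")
```

The model only sees the known part when predicting, so the error on the held-out rating is an honest estimate for that user.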

Modeling: IBCF Model - Cosine

## [1] "IBCF.COS PRED (Joke IDS): c(\"V74\", \"V80\", \"V81\", \"V83\", \"V85\", \"V86\", \"V88\", \"V91\", \"V93\", \"V95\")"

Modeling: IBCF Model - Pearson

## [1] "IBCF.PEARSON.PRED (Joke IDS): c(\"V75\", \"V85\", \"V77\", \"V72\", \"V80\", \"V82\", \"V83\", \"V86\", \"V87\", \"V89\")"

Modeling: UBCF Model - Cosine

## [1] "UBCF.COS PRED (Joke IDS): c(\"V95\", \"V82\", \"V73\", \"V77\", \"V83\", \"V90\", \"V86\", \"V72\", \"V74\", \"V92\")"

Modeling: UBCF Model - Pearson

## [1] "UBCF.PEARSON.PRED (Joke IDS): c(\"V85\", \"V83\", \"V88\", \"V87\", \"V82\", \"V81\", \"V77\", \"V86\", \"V75\", \"V74\")"

Modeling: Model Statistics

	IBCF.COS	IBCF.PEARSON	UBCF.COS	UBCF.PEARSON
RMSE	6.223929	4.701376	4.409739	4.988854
MSE	38.737289	22.102935	19.445803	24.888662
MAE	5.663571	3.445970	3.342035	4.030587
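For reference, the three statistics in the table are related: MSE is the mean of the squared errors, RMSE is its square root (for example, 6.223929 squared is about 38.737), and MAE is the mean of the absolute errors. A minimal Python sketch with hypothetical predicted and actual ratings:

```python
import math


def prediction_errors(predicted, actual):
    """Return RMSE, MSE and MAE for paired predicted and actual ratings."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae}


# Hypothetical values, not taken from the Jester data.
metrics = prediction_errors([2.0, -1.0, 4.0], [5.0, -1.0, 0.0])
print(metrics)
```

Because errors are squared before averaging, RMSE penalizes a few large misses more heavily than MAE does, which is why the two can rank models differently.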

Modeling: Findings

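With cosine similarity, the user-based collaborative filtering model performed better than its item-based counterpart (RMSE 4.41 vs. 6.22). With Pearson correlation, the item-based model was slightly better (RMSE 4.70 vs. 4.99). Overall, the user-based cosine model had the lowest RMSE, MSE and MAE.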

Comparing Modeling: Additional Models

Building on the above, eight models with different attributes are evaluated, along with a random baseline
IBCF & UBCF (cosine, Pearson or Jaccard similarity)
Normalized by Z-score
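The similarity measures and the normalization named above can be illustrated with textbook definitions. The Python sketch below is an assumption-laden illustration: it compares two rating vectors on their co-rated entries only, treats Jaccard as overlap of the rated sets, and uses the sample standard deviation for the Z-score. The library used for this report may differ in these details.

```python
import math
from statistics import mean, stdev


def co_rated(a, b):
    """Pairs of ratings where both vectors have a value (None = unrated)."""
    return [(x, y) for x, y in zip(a, b) if x is not None and y is not None]


def cosine_pairs(pairs):
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b) if norm_a and norm_b else None


def cosine(a, b):
    return cosine_pairs(co_rated(a, b))


def pearson(a, b):
    """Cosine of the mean-centered co-rated ratings."""
    pairs = co_rated(a, b)
    if len(pairs) < 2:
        return None
    mean_a = mean(x for x, _ in pairs)
    mean_b = mean(y for _, y in pairs)
    return cosine_pairs([(x - mean_a, y - mean_b) for x, y in pairs])


def jaccard(a, b):
    """Overlap of the sets of rated positions, ignoring the rating values."""
    rated_a = {i for i, x in enumerate(a) if x is not None}
    rated_b = {i for i, y in enumerate(b) if y is not None}
    union = rated_a | rated_b
    return len(rated_a & rated_b) / len(union) if union else None


def zscore(row):
    """Center and scale one user's ratings; rows with no spread are returned as-is."""
    values = [x for x in row if x is not None]
    if len(values) < 2 or stdev(values) == 0:
        return list(row)
    mu, sd = mean(values), stdev(values)
    return [None if x is None else (x - mu) / sd for x in row]


a = [1.0, 2.0, None, 4.0]
b = [2.0, 4.0, 3.0, None]
print(cosine(a, b), pearson(a, b), jaccard(a, b), zscore([1.0, None, 3.0]))
```

Cosine and Pearson use the rating values, while Jaccard only asks whether the same jokes were rated, so Jaccard can be informative even when rating scales vary widely between users.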

## IBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
##   2  [0sec/0.02sec] 
##   3  [0.01sec/0sec] 
##   4  [0sec/0.01sec] 
##   5  [0sec/0.02sec] 
##   6  [0sec/0.01sec] 
##   7  [0sec/0.02sec] 
##   8  [0sec/0.02sec] 
##   9  [0.01sec/0.02sec] 
##   10  [0sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0sec/0sec] 
##   2  [0sec/0sec] 
##   3  [0sec/0sec] 
##   4  [0sec/0sec] 
##   5  [0.02sec/0sec] 
##   6  [0sec/0.02sec] 
##   7  [0sec/0.01sec] 
##   8  [0sec/0.02sec] 
##   9  [0sec/0.02sec] 
##   10  [0sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
##   2  [0sec/0.02sec] 
##   3  [0sec/0.02sec] 
##   4  [0sec/0sec] 
##   5  [0.01sec/0sec] 
##   6  [0.02sec/0sec] 
##   7  [0sec/0.01sec] 
##   8  [0sec/0.02sec] 
##   9  [0sec/0.02sec] 
##   10  [0sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/0sec] 
##   2  [0sec/0.03sec] 
##   3  [0.01sec/0.02sec] 
##   4  [0.01sec/0.03sec] 
##   5  [0sec/0.02sec] 
##   6  [0.02sec/0.01sec] 
##   7  [0.02sec/0.01sec] 
##   8  [0.01sec/0sec] 
##   9  [0.02sec/0.01sec] 
##   10  [0.01sec/0sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.03sec] 
##   2  [0sec/0.01sec] 
##   3  [0sec/0.03sec] 
##   4  [0sec/0sec] 
##   5  [0sec/0.01sec] 
##   6  [0sec/0.01sec] 
##   7  [0sec/0.01sec] 
##   8  [0sec/0.01sec] 
##   9  [0sec/0.01sec] 
##   10  [0sec/0.02sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0.01sec] 
##   3  [0sec/0.01sec] 
##   4  [0sec/0.03sec] 
##   5  [0sec/0.03sec] 
##   6  [0sec/0.01sec] 
##   7  [0sec/0.02sec] 
##   8  [0sec/0.01sec] 
##   9  [0sec/0.01sec] 
##   10  [0sec/0.01sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0.01sec] 
##   3  [0sec/0.01sec] 
##   4  [0sec/0.02sec] 
##   5  [0sec/0.01sec] 
##   6  [0sec/0.01sec] 
##   7  [0sec/0.03sec] 
##   8  [0sec/0.01sec] 
##   9  [0sec/0.01sec] 
##   10  [0sec/0.01sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.01sec] 
##   2  [0.01sec/0.02sec] 
##   3  [0sec/0.02sec] 
##   4  [0sec/0.02sec] 
##   5  [0sec/0.01sec] 
##   6  [0.02sec/0.01sec] 
##   7  [0.01sec/0.02sec] 
##   8  [0sec/0.01sec] 
##   9  [0.02sec/0.02sec] 
##   10  [0.02sec/0.01sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0.02sec] 
##   3  [0sec/0sec] 
##   4  [0sec/0sec] 
##   5  [0sec/0sec] 
##   6  [0sec/0.02sec] 
##   7  [0sec/0sec] 
##   8  [0.02sec/0sec] 
##   9  [0sec/0.01sec] 
##   10  [0sec/0sec]

## IBCF run fold/sample [model time/prediction time]
##   1  [0sec/0sec] 
##   2  [0sec/0sec] 
##   3  [0sec/0sec] 
##   4  [0sec/0sec] 
##   5  [0.01sec/0sec] 
##   6  [0.02sec/0sec] 
##   7  [0.01sec/0sec] 
##   8  [0.02sec/0sec] 
##   9  [0.02sec/0sec] 
##   10  [0sec/0sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0sec/0sec] 
##   2  [0sec/0sec] 
##   3  [0sec/0sec] 
##   4  [0.01sec/0sec] 
##   5  [0.02sec/0sec] 
##   6  [0.02sec/0sec] 
##   7  [0.01sec/0sec] 
##   8  [0.02sec/0sec] 
##   9  [0sec/0.01sec] 
##   10  [0sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
##   2  [0sec/0.02sec] 
##   3  [0sec/0.01sec] 
##   4  [0sec/0sec] 
##   5  [0sec/0sec] 
##   6  [0sec/0sec] 
##   7  [0sec/0sec] 
##   8  [0.01sec/0sec] 
##   9  [0.02sec/0sec] 
##   10  [0.01sec/0sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.02sec/0.01sec] 
##   2  [0.02sec/0.01sec] 
##   3  [0.03sec/0.02sec] 
##   4  [0.03sec/0.04sec] 
##   5  [0.01sec/0.03sec] 
##   6  [0.03sec/0.01sec] 
##   7  [0.02sec/0.02sec] 
##   8  [0.03sec/0.01sec] 
##   9  [0.02sec/0.01sec] 
##   10  [0.02sec/0sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0.02sec] 
##   3  [0.01sec/0sec] 
##   4  [0sec/0.01sec] 
##   5  [0sec/0.02sec] 
##   6  [0sec/0.01sec] 
##   7  [0sec/0.04sec] 
##   8  [0sec/0.01sec] 
##   9  [0sec/0.01sec] 
##   10  [0sec/0.01sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0.01sec] 
##   3  [0sec/0.04sec] 
##   4  [0.01sec/0sec] 
##   5  [0sec/0sec] 
##   6  [0sec/0.02sec] 
##   7  [0sec/0.03sec] 
##   8  [0sec/0.02sec] 
##   9  [0sec/0.02sec] 
##   10  [0sec/0.01sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0.02sec] 
##   3  [0sec/0.01sec] 
##   4  [0sec/0.02sec] 
##   5  [0sec/0.02sec] 
##   6  [0.01sec/0sec] 
##   7  [0sec/0.01sec] 
##   8  [0sec/0.01sec] 
##   9  [0sec/0.02sec] 
##   10  [0sec/0.02sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0.01sec/0.02sec] 
##   2  [0.01sec/0.03sec] 
##   3  [0sec/0.02sec] 
##   4  [0.01sec/0.02sec] 
##   5  [0.02sec/0.01sec] 
##   6  [0sec/0.02sec] 
##   7  [0sec/0.01sec] 
##   8  [0sec/0.01sec] 
##   9  [0.02sec/0.02sec] 
##   10  [0.01sec/0.02sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
##   2  [0sec/0sec] 
##   3  [0sec/0sec] 
##   4  [0sec/0.01sec] 
##   5  [0sec/0sec] 
##   6  [0sec/0sec] 
##   7  [0sec/0sec] 
##   8  [0sec/0sec] 
##   9  [0sec/0.02sec] 
##   10  [0sec/0sec]

Comparing Modeling: Additional Models ROC

Comparing Modeling: Additional Models Precision/Recall
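Precision and recall for a top-N list depend on which withheld ratings count as "good"; the evaluation scheme output earlier reports a threshold of 1.164286. The Python sketch below shows the calculation for one user. The recommended list and withheld ratings are hypothetical, and the helper name precision_recall is introduced only for this example.

```python
def precision_recall(recommended, held_out, good_rating):
    """Precision and recall of a recommendation list against withheld ratings."""
    # An item is relevant if its withheld rating meets the "good" threshold.
    relevant = {item for item, rating in held_out.items() if rating >= good_rating}
    hits = len(relevant & set(recommended))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical top-3 list and withheld ratings for one user.
recommended = ["V74", "V80", "V81"]
held_out = {"V80": 3.2, "V85": 5.0, "V91": -2.0}
precision, recall = precision_recall(recommended, held_out, good_rating=1.164286)
print(precision, recall)
```

Repeating this for several list lengths N and averaging over users gives the points of a precision/recall curve; plotting true positive rate against false positive rate in the same way gives the ROC curve.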

Comparing Modeling: RMSE

	RMSE	MSE	MAE
IBCF.COS	6.305947	41.02033	5.272161
IBCF.PEAR	5.601020	33.05513	4.196643
IBCF.JACC	5.658788	32.56521	4.378391
IBCF.COS.ZSCORE	6.018913	37.80703	5.076153
UBCF.COS	5.098177	26.50560	3.839372
UBCF.PEAR	5.244337	27.93409	3.988254
UBCF.JACC	5.082091	26.37037	3.938151
UBCF.COS.ZSCORE	5.132665	26.83506	3.864089
random	6.154002	38.28992	4.749700

Comparing Modeling: Summary

The user-based models performed better than their item-based counterparts: every UBCF variant had a lower RMSE than every IBCF variant. The best model by RMSE was the user-based Jaccard model (5.08), although the user-based cosine model had the lowest MAE (3.84). The user-based models also had the most “normal” distributions. However, based on the shape of the ROC curves for all the models, there is some room for improvement.

Increased serendipity (finding something good or useful while not specifically searching for it) could possibly be achieved by appending new jokes to the suggestion list.

The models in the Jester analysis were assessed for offline accuracy using a dataset of jokes that already have ratings. An online implementation could work as a recommender system that recommends jokes for users to rate, with success judged by how closely the users' ratings match the predictions. Should an online system be implemented for the Jester dataset, A/B testing could be used to determine which system is more effective.
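One common way to read an A/B test is a two-proportion z-test on a success rate, for example the share of recommended jokes that a user rates as good under each system. The Python sketch below is only an illustration of that idea: the counts are invented, and the choice of metric and test is an assumption rather than part of this analysis.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # both rates are all-success or all-failure
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: good ratings out of jokes recommended by systems A and B.
z, p_value = two_proportion_z(120, 1000, 150, 1000)
print(round(z, 3), round(p_value, 4))
```

A small p-value suggests the difference between the two systems is unlikely to be due to chance alone, though the sample size and the length of the experiment should be fixed before looking at the results.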