Final Project - Data 612

This project is an opportunity to implement two recommendation algorithms on the relatively large, and interesting, BeerAdvocate review data set. The data set consists of reviewers and their explicit ratings of each beer. The goal is simple: provide a tailored list of beers for each reviewer. Two recommendation models are considered: a user-based collaborative filtering (UBCF) model and a singular value decomposition (SVD) model. Because bigger data sets present bigger challenges, the size and complexity of the base beer data are reduced before modeling. Each model returns a maximum of ten beer recommendations for a randomly selected user.

Data

Requirements: The goal for the final project is to build out a recommender system using a large data set (e.g., 1M+ ratings or 10k+ users, 10k+ items). The dimensions of the raw BeerAdvocate data file are shown below.
Data set: The BeerAdvocate data set provides reviews for a variety of beers over a period of more than 10 years. It includes approximately 1.5 million reviews, each scoring five “aspects”: appearance, aroma, palate, taste, and overall impression. Each review includes product and user information, followed by these five ratings and a plain-text review. Source: BeerAdvocate.

Data: Load

Load the CSV and print the dimensions (ratings, reviewers, items)

## [1] "Raw.Df Dimensions:   c(1586614, 13)"

## [1] "Distinct Reviewers 33388"

## [1] "Distinct Beers 66055"

Data: Metadata Storage

Map reviewer profile name to a sequential ID
Map the beer names to a sequential ID
This makes the data easier to work with. This metadata will be used for the predictions
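
The ID mapping described above can be sketched as follows. This is a minimal illustration in Python rather than the project's R code; the helper name `build_id_map` and the choice of 1-based IDs in order of first appearance are assumptions, since the report does not say how the IDs were assigned.

```python
def build_id_map(names):
    """Assign 1-based sequential IDs to names in order of first appearance."""
    id_map = {}
    for name in names:
        if name not in id_map:
            id_map[name] = len(id_map) + 1
    return id_map


# Hypothetical sample: a reviewer who appears twice receives a single ID.
reviewers = ["BuckeyeNation", "mikesgroove", "BuckeyeNation", "WesWes"]
reviewer_ids = build_id_map(reviewers)

# Reverse lookup, used later to turn predicted IDs back into names.
id_to_reviewer = {rid: name for name, rid in reviewer_ids.items()}
```

Keeping both directions of the mapping is what lets the prediction step report names instead of IDs.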

Data: Transform/Filter

To make this a more actionable and complete data set:
Filter out reviews that have no reviewer name
Filter for beers that have been reviewed more than 100 times
Filter for reviewers who have written more than 50 reviews
Create a summary from the filtered data set by beer and by user

This should provide a more actively reviewed data set and hopefully more meaningful recommendations.

## [1] "Filtered Dimensions:   c(643604, 12)"

## [1] "Distinct Reviewers 1763"

## [1] "Distinct Beers 2227"
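
A rough sketch of the filtering step, written in Python rather than the project's R code. It assumes the review counts are computed once, after dropping unnamed reviews, and that both thresholds are then applied together; the report does not state the order actually used. The function name `filter_reviews` and the sample rows are hypothetical, while the column names `review_profilename` and `beer_name` come from the tables in this report.

```python
from collections import Counter


def filter_reviews(reviews, min_beer_reviews=100, min_user_reviews=50):
    """Keep named reviews of well-reviewed beers written by active reviewers."""
    # Drop reviews with a missing or empty reviewer name.
    named = [r for r in reviews if r["review_profilename"]]
    beer_counts = Counter(r["beer_name"] for r in named)
    user_counts = Counter(r["review_profilename"] for r in named)
    # Strictly greater than each threshold, matching "more than" in the text.
    return [
        r for r in named
        if beer_counts[r["beer_name"]] > min_beer_reviews
        and user_counts[r["review_profilename"]] > min_user_reviews
    ]


# Tiny hypothetical sample, with thresholds lowered so the effect is visible.
sample = [
    {"review_profilename": "a", "beer_name": "X"},
    {"review_profilename": "a", "beer_name": "Y"},
    {"review_profilename": "b", "beer_name": "X"},
    {"review_profilename": "b", "beer_name": "Y"},
    {"review_profilename": "", "beer_name": "X"},
    {"review_profilename": "c", "beer_name": "Z"},
]
kept = filter_reviews(sample, min_beer_reviews=1, min_user_reviews=1)
```

Applying the thresholds in sequence instead (beers first, then reviewers) could give slightly different counts, because removing rows changes the remaining totals.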

Data: Summarize

The data summary for the original data set will serve as context for the prediction-based portion of the project
beer.summary.user: group by reviewer and compute the mean of each review-based rating
beer.summary.beer: group by beer and compute the mean of each review-based rating

beer.summary.user
reviewer_ID	review_profilename	overall.rt.mean	aroma.rt.mean	appreance.rt.mean	palate.rt.mean	taste.rt.mean	num.beer.reviewed
129	BuckeyeNation	3.914141	3.761031	3.931951	3.750665	3.793727	1881
64	mikesgroove	4.084906	3.837907	3.964265	3.946541	3.910520	1749
161	BEERchitect	3.886252	3.859741	3.890567	3.735820	3.864365	1622
108	brentk56	3.906600	3.937422	4.069116	3.954234	3.951744	1606
43	northyorksammy	3.805243	3.727528	3.785268	3.662921	3.716916	1602
189	WesWes	3.961615	3.965917	3.896095	3.889146	3.870946	1511

beer.summary.beer
beer_beerid	beer_name	overall.rt.mean	aroma.rt.mean	appreance.rt.mean	palate.rt.mean	taste.rt.mean	num.beer.reviewed
2093	90 Minute IPA	4.125526	4.191094	4.175666	4.166550	4.279804	1426
412	Old Rasputin Russian Imperial Stout	4.171418	4.185674	4.353528	4.211333	4.318603	1403
1904	Sierra Nevada Celebration Ale	4.184345	4.073884	4.235552	4.076810	4.165325	1367
1093	Two Hearted Ale	4.353481	4.286917	4.183627	4.153405	4.329763	1307
4083	Stone Ruination IPA	4.152140	4.333463	4.175486	4.185214	4.322568	1285
680	Brooklyn Black Chocolate Stout	4.031915	4.105989	4.270686	4.162727	4.181245	1269

Data: Pre-Process

A user-item matrix was constructed for use in both the UBCF and SVD models
Since some users reviewed the same beer multiple times, the dcast() function was used. dcast() allows the data to be aggregated while reshaping it into a user-item matrix
Print the dimensions and a sample

## [1] "The dimensions of the user-item matrix:  c(1763, 2203)"

user.beer.matrix
	#9	10 Commandments	10th Anniversary Double India Pale Ale	120 Minute IPA	14’ER ESB	1554 Enlightened Black Ale
1fastz28	0.0	0	0.0	3.0	3	3.5
4DAloveofSTOUT	0.0	0	0.0	3.5	0	0.0
99bottles	0.0	0	0.0	3.0	0	0.0
9InchNails	0.0	0	0.0	0.0	0	0.0
aaronh	2.0	0	3.5	3.5	0	5.0
AaronHomoya	3.5	0	0.0	4.0	0	0.0
AaronRed	3.0	0	0.0	0.0	0	0.0
aasher	2.5	0	0.0	0.0	0	4.0
abcsofbeer	3.0	0	0.0	4.0	4	0.0
abrand	0.0	4	0.0	0.0	0	4.5
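
The reshaping step can be illustrated with a small Python sketch (not the dcast() call itself). It assumes that duplicate user/beer reviews are averaged and that unrated cells are filled with 0, as the sample above suggests; the aggregation function actually passed to dcast() is not stated in the report. The function name `build_user_item_matrix` and the sample ratings are hypothetical.

```python
from collections import defaultdict


def build_user_item_matrix(reviews, fill=0.0):
    """Pivot (user, beer, rating) triples into a dense user-by-beer matrix."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user, beer, rating in reviews:
        sums[(user, beer)] += rating
        counts[(user, beer)] += 1

    users = sorted({user for user, _ in counts})
    beers = sorted({beer for _, beer in counts})
    # Average repeated reviews; cells with no review get the fill value.
    matrix = [
        [sums[(u, b)] / counts[(u, b)] if (u, b) in counts else fill
         for b in beers]
        for u in users
    ]
    return users, beers, matrix


reviews = [
    ("aaronh", "#9", 2.0),
    ("aaronh", "#9", 3.0),  # repeat review of the same beer
    ("abrand", "10 Commandments", 4.0),
]
users, beers, matrix = build_user_item_matrix(reviews)
```

Here the two reviews of "#9" by the same user collapse into a single cell holding their mean.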

Model 1: User-Based Collaborative Filter

UBCF: Setup

Before building the model, create the various matrices that will be evaluated for use in the model
Create a realRatingMatrix and a normalized realRatingMatrix, and perform some initial EDA (heat map, distribution of ratings)
Binarize the matrices: the ultimate goal of the UBCF is to provide the reviewer with beers s/he will enjoy. Since binary models tend to yield more accurate results, the binary version of the user.beer.matrix will feed the UBCF model.
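
To make the normalized and binary variants concrete, here is a small Python sketch rather than the recommenderlab calls used in the project. Two assumptions are made that the report does not confirm: normalization is taken to mean centering each user's ratings on that user's mean, and the binarization cutoff `min_rating=3.5` is purely illustrative. A 0 in the input is treated as "not rated".

```python
def center_rows(matrix, missing=0):
    """Subtract each user's mean rating from that user's rated cells."""
    centered = []
    for row in matrix:
        rated = [x for x in row if x != missing]
        mean = sum(rated) / len(rated) if rated else 0.0
        # Unrated cells become None so they cannot be mistaken for a
        # centered rating that happens to equal zero.
        centered.append([x - mean if x != missing else None for x in row])
    return centered


def binarize(matrix, min_rating=3.5):
    """Mark a cell 1 when the rating meets the (assumed) cutoff, else 0."""
    return [[1 if x >= min_rating else 0 for x in row] for row in matrix]


# The row for user "aaronh" from the sample matrix above.
ratings = [[2.0, 0, 3.5, 3.5, 0, 5.0]]
normalized = center_rows(ratings)
binary = binarize(ratings)
```

The binary form discards how much a user liked a beer and keeps only whether the rating cleared the cutoff, which is the input a Jaccard-based UBCF works on.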

UBCF: Setup - Heat Maps

UBCF: Setup - Summary Stats of Beer Ratings

Using the regular and normalized versions of the user.beer.matrix

UBCF: Model Construction

The UBCF model was constructed using the R recommenderlab package
As mentioned above, the binary version of the “user.beer.matrix” was used, as it has traditionally produced better results
The binary matrix was split into train and test data sets
The UBCF Jaccard method was employed as it has produced the best UBCF results in previous projects
Generate a listing of predicted beers by user
In the performance metrics (model_stats), precision is reasonable at 0.74 but recall is very low at 0.03. With at most 10 recommendations per user against roughly 360 relevant beers on average (TP + FN), recall is necessarily small

##           model.name        ID
## TP              7.38        TP
## FP              2.62        FP
## FN            354.24        FN
## TN           1828.76        TN
## precision       0.74 precision
## recall          0.03    recall
## TPR             0.03       TPR
## FPR             0.00       FPR

UBCF: Model Beer Recommender Generator (Beers You Might Enjoy!)

Using the model results above, generate a maximum of 10 beer recommendations per user
Loop through the predicted values and construct a data frame of beer names
Remove all beers that have already been reviewed by the user
Randomly select a user and return the beer recommendations

## [1] "Beers  abcsofbeer  may enjoy (below list):"

##                       beer_name
## 1  Belhaven Twisted Thistle IPA
## 2                           Hex
## 3          Sless' Oatmeal Stout
## 4                Menace De Dieu
## 5                 Wild Ride IPA
## 6                   Gumballhead
## 7   Atwater Vanilla Java Porter
## 8                        Uff-da
## 9              Brooklyn Local 1
## 10                      Choklat

Model 2: SVD

SVD: Setup

The SVD is calculated using the base R svd() function
The previously created “user.beer.matrix” is the input to the function
Accessed the left singular vectors (row values): $u, stored as “U”
Accessed the right singular vectors (column values): $v, stored as “V”
Built a diagonal matrix from the singular values ($d) with diag() (https://www.rdocumentation.org/packages/base/versions/3.6.0/topics/diag), stored as “SMat”

## List of 3
##  $ d: num [1:1763] 1759 610 412 383 357 ...
##  $ u: num [1:1763, 1:1763] -0.0203 -0.0106 -0.0163 -0.0106 -0.0276 ...
##  $ v: num [1:2203, 1:1763] -0.0123 -0.0374 -0.0148 -0.0075 -0.0105 ...

SVD: Plot Singular Values

The plot shows the descending order of the singular values
The magnitudes appear to rapidly decline through the first ~250 & level out around 500
The plot seems to indicate that the first ~250 values capture a lot of the variability from the user beer matrix

SVD: Dimensionality Reduction

Reduce the dimensionality of the SVD using the sum-of-squares approach
Sum the squares of the singular values: sum(beer_svd$d^2)
Goal: find the first k singular values whose squares sum to at least 90% of that total
Plot a running sum of squares for the singular values
The model can retain 90% of the variability by keeping 650 singular values
Create new SVD matrices using the new k
Predict values from the truncated SVD matrices: new.U %*% new.SMat %*% new.V
Assign metadata from original user.beer.matrix

## [1] "The model can retain 90% of the variability by keeping:  650  singular values"

## [1] 650 650

## [1] 1763  650

## [1]  650 2203

## [1] 1763 2203
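
The sum-of-squares rule for choosing k can be sketched as below. This is a Python illustration rather than the project's R code; the function name `rank_for_energy` and the sample singular values are hypothetical, and the input is assumed to be sorted in descending order, as svd() returns it.

```python
def rank_for_energy(singular_values, target=0.90):
    """Smallest k whose leading squared singular values reach the target share."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    if total == 0:
        return 0
    running = 0.0
    for k, square in enumerate(squares, start=1):
        running += square
        if running / total >= target:
            return k
    return len(squares)


# Hypothetical singular values: squares are 100, 25, 4, 1 (total 130).
d = [10.0, 5.0, 2.0, 1.0]
k_90 = rank_for_energy(d)        # 125 / 130 is the first share above 0.90
k_99 = rank_for_energy(d, 0.99)  # 129 / 130 is the first share above 0.99
```

Once k is chosen, the truncated matrices keep the first k columns of U, the leading k-by-k block of the diagonal matrix, and the first k right singular vectors.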

SVD: Prediction Assessment

Check min/max predicted ratings
Overwrite values that are outside the rating bounds (1-5)
Get prediction summary
Get accuracy/performance metrics
RMSE (root mean squared error) is the square root of the mean squared difference between the predicted and actual values. The SVD model's RMSE is 0.44, meaning predictions typically deviate from the actual ratings by about 0.44 rating points.

## [1] "min values: -2.06704979248077"

## [1] "max values: 6.70776345237745"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1711  0.7400  0.6166  5.0000

## [1] "RMSE of SVD Model; 0.44"
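
The clamping and RMSE steps can be sketched in Python as follows (an illustration, not the project's R code). The bounds are passed as parameters and default to the 1-5 range named in the text. Which cells entered the reported RMSE is not stated in the report, so the sample vectors here are hypothetical.

```python
import math


def clamp(value, low=1.0, high=5.0):
    """Force a predicted rating back inside the valid rating range."""
    return max(low, min(high, value))


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical raw predictions, two of which fall outside the bounds.
raw_predictions = [-2.07, 6.71, 3.5, 4.0]
bounded = [clamp(p) for p in raw_predictions]
actual = [1.0, 4.0, 3.5, 5.0]
error = rmse(bounded, actual)
```

Clamping before scoring keeps out-of-range predictions, such as the -2.07 and 6.71 extremes reported below the metrics list, from inflating the error.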

SVD: Prediction Examples

Loop through the row names & construct a data frame from beer name, user and ratings
Remove all beers that have been reviewed by the user
Sort resulting data frame by rating in descending order
Store the top 10 ratings to a df_total data frame
Randomly select a user & call for the top 10 ratings stored in df_total
Provide context by displaying the randomly selected user's average ratings in the base data set
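
The selection steps above can be sketched for a single user as follows. This is a Python illustration rather than the project's R loop; it assumes that a 0 in the original user-item matrix marks a beer the user has not reviewed, and the function name `top_n_unreviewed` and the sample values are hypothetical.

```python
def top_n_unreviewed(predicted_row, original_row, beer_names, n=10):
    """Return the n highest predicted ratings among beers not yet reviewed."""
    candidates = [
        (pred, name)
        for pred, orig, name in zip(predicted_row, original_row, beer_names)
        if orig == 0  # assumed marker for "not reviewed"
    ]
    # Highest prediction first; ties are broken alphabetically by beer name.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return candidates[:n]


beer_names = ["A", "B", "C", "D"]
original_row = [4.0, 0, 0, 0]        # the user has already reviewed beer A
predicted_row = [5.0, 3.2, 4.8, 4.8]
top_two = top_n_unreviewed(predicted_row, original_row, beer_names, n=2)
```

Beer A has the highest predicted rating but is excluded because the user has already reviewed it.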

SVD Top Review/User Sample
Rating_Pred	beer_name	review_profilename	reviewer_ID
5.00	Gulden Draak (Dark Triple)	1fastz28	1076
5.00	Hoegaarden Original White Ale	1fastz28	1076
5.00	Pranqster	1fastz28	1076
5.00	Samuel Smith’s Nut Brown Ale	1fastz28	1076
4.94	Nut Brown Ale	1fastz28	1076
4.90	La Chouffe	1fastz28	1076

## [1] "Top 10 Beer Recommendations for User:  axeman9182  found below"

##    Rating_Pred                          beer_name review_profilename
## 1         5.00                      Festina Pêche         axeman9182
## 2         5.00                    La Fin Du Monde         axeman9182
## 3         5.00                         Temptation         axeman9182
## 4         4.99             Racer 5 India Pale Ale         axeman9182
## 5         4.98                 Bell's Hopslam Ale         axeman9182
## 6         4.98                       Consecration         axeman9182
## 7         4.97 Bourbon County Brand Vanilla Stout         axeman9182
## 8         4.95              Terrapin Rye Pale Ale         axeman9182
## 9         4.93        Founders CBS Imperial Stout         axeman9182
## 10        4.90                       Supplication         axeman9182
##    reviewer_ID
## 1         3908
## 2         3908
## 3         3908
## 4         3908
## 5         3908
## 6         3908
## 7         3908
## 8         3908
## 9         3908
## 10        3908

## [1] "The Average Beer Rating for User in The Base Dataset:"

## # A tibble: 0 x 8
## # Groups:   reviewer_ID [0]
## # ... with 8 variables: reviewer_ID <int>, review_profilename <fct>,
## #   overall.rt.mean <dbl>, aroma.rt.mean <dbl>, appreance.rt.mean <dbl>,
## #   palate.rt.mean <dbl>, taste.rt.mean <dbl>, num.beer.reviewed <int>

Conclusions:

This project allowed for a greater understanding of user discovery-based recommender systems. More specifically, I had the opportunity to learn how to produce and tune recommendations with a user-based collaborative filtering model and a singular value decomposition model. In future iterations of this project, I would like to build a user interface for interacting with the models.