Data 612 - Final Project

Data

  • Requirements: The goal of the final project is to build a recommender system using a large data set (e.g., 1M+ ratings, or 10k+ users and 10k+ items). The dimensions of the raw BeerAdvocate data file are given below.

  • Data set: The BeerAdvocate data set provides reviews for a variety of beers over a period of more than 10 years. The data set includes approximately 1.5 million reviews, each scoring a beer on five “aspects”: appearance, aroma, palate, taste, and overall impression. Each review includes product and user information, followed by these five ratings and a plain-text review. Source: BeerAdvocate.

Data: Load

  • Load the CSV and print the dimensions (ratings, reviewers, items)
## [1] "Raw.Df Dimensions:   c(1586614, 13)"
## [1] "Distinct Reviewers 33388"
## [1] "Distinct Beers 66055"

Data: Metadata Storage

  • Map reviewer profile name to a sequential ID
  • Map the beer names to a sequential ID
  • Sequential IDs make the data easier to work with. This metadata is used later to label the predictions.
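
The ID mapping described above can be illustrated with a minimal sketch. The report was produced in R and does not show its code, so this Python version is only an illustration: the helper name `build_id_map`, the choice of 1-based IDs in order of first appearance, and the sample names are all assumptions.

```python
# Hypothetical sample of reviewer profile names (with repeats), not the real data.
profile_names = ["BuckeyeNation", "mikesgroove", "BuckeyeNation", "brentk56"]


def build_id_map(values):
    """Assign 1-based sequential IDs in order of first appearance (an assumption)."""
    id_map = {}
    for value in values:
        if value not in id_map:
            id_map[value] = len(id_map) + 1
    return id_map


reviewer_ids = build_id_map(profile_names)

# Reverse lookup, useful for turning predicted IDs back into readable names.
id_to_reviewer = {rid: name for name, rid in reviewer_ids.items()}

print(reviewer_ids)
```

The same helper could be applied to beer names to obtain the second lookup table.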

Data: Transform/Filter

To make this a more actionable and complete data set:

  • Filter out reviews that have no reviewer name
  • Keep only beers that have been reviewed more than 100 times
  • Keep only reviewers who have written more than 50 reviews
  • Create a summary from the filtered data set by beer and by user

This should provide a more actively reviewed data set and hopefully more meaningful recommendations.
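
A minimal sketch of the filtering step follows, written in Python for illustration only (the report's own R code is not shown). The record layout, the function name `filter_reviews`, and the decision to count reviews once, before either threshold is applied, are assumptions. The toy call uses thresholds of 1 where the report uses 100 and 50.

```python
from collections import Counter


def filter_reviews(reviews, min_beer_reviews, min_user_reviews):
    """Drop unnamed reviewers, then keep rows whose beer and reviewer
    both exceed the given review-count thresholds."""
    named = [r for r in reviews if r["reviewer"]]
    # Counts are taken once on the named rows (an assumption; the original
    # may have applied the two filters one after the other).
    beer_counts = Counter(r["beer"] for r in named)
    user_counts = Counter(r["reviewer"] for r in named)
    return [
        r for r in named
        if beer_counts[r["beer"]] > min_beer_reviews
        and user_counts[r["reviewer"]] > min_user_reviews
    ]


# Hypothetical toy records, not taken from the data set.
toy_reviews = [
    {"reviewer": "ann", "beer": "IPA", "overall": 4.0},
    {"reviewer": "ann", "beer": "Stout", "overall": 3.5},
    {"reviewer": "bob", "beer": "IPA", "overall": 4.5},
    {"reviewer": "", "beer": "IPA", "overall": 2.0},
    {"reviewer": "bob", "beer": "Porter", "overall": 3.0},
]

filtered = filter_reviews(toy_reviews, min_beer_reviews=1, min_user_reviews=1)
print(len(filtered))
```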

## [1] "Filtered Dimensions:   c(643604, 12)"
## [1] "Distinct Reviewers 1763"
## [1] "Distinct Beers 2227"

Data: Summarize

  • The data summary for the original data set will serve as context for the prediction-based portion of the project
  • beer.summary.user: group by reviewer and compute the mean of each review-based criterion
  • beer.summary.beer: group by beer and compute the mean of each review-based criterion

beer.summary.user
reviewer_ID | review_profilename | overall.rt.mean | aroma.rt.mean | appreance.rt.mean | palate.rt.mean | taste.rt.mean | num.beer.reviewed
129 | BuckeyeNation | 3.914141 | 3.761031 | 3.931951 | 3.750665 | 3.793727 | 1881
64 | mikesgroove | 4.084906 | 3.837907 | 3.964265 | 3.946541 | 3.910520 | 1749
161 | BEERchitect | 3.886252 | 3.859741 | 3.890567 | 3.735820 | 3.864365 | 1622
108 | brentk56 | 3.906600 | 3.937422 | 4.069116 | 3.954234 | 3.951744 | 1606
43 | northyorksammy | 3.805243 | 3.727528 | 3.785268 | 3.662921 | 3.716916 | 1602
189 | WesWes | 3.961615 | 3.965917 | 3.896095 | 3.889146 | 3.870946 | 1511

beer.summary.beer
beer_beerid | beer_name | overall.rt.mean | aroma.rt.mean | appreance.rt.mean | palate.rt.mean | taste.rt.mean | num.beer.reviewed
2093 | 90 Minute IPA | 4.125526 | 4.191094 | 4.175666 | 4.166550 | 4.279804 | 1426
412 | Old Rasputin Russian Imperial Stout | 4.171418 | 4.185674 | 4.353528 | 4.211333 | 4.318603 | 1403
1904 | Sierra Nevada Celebration Ale | 4.184345 | 4.073884 | 4.235552 | 4.076810 | 4.165325 | 1367
1093 | Two Hearted Ale | 4.353481 | 4.286917 | 4.183627 | 4.153405 | 4.329763 | 1307
4083 | Stone Ruination IPA | 4.152140 | 4.333463 | 4.175486 | 4.185214 | 4.322568 | 1285
680 | Brooklyn Black Chocolate Stout | 4.031915 | 4.105989 | 4.270686 | 4.162727 | 4.181245 | 1269

Data: Pre-Process

  • A user-item matrix was constructed and later used in the SVD and UBCF models
  • Since some users reviewed the same beer multiple times, the dcast() function was used. dcast() can aggregate duplicate entries while reshaping the data into a user-item matrix
  • Print dimensions and sample
## [1] "The dimensions of the user-item matrix:  c(1763, 2203)"
user.beer.matrix
review_profilename | # 100 | #9 | 10 Commandments | 1000 IBU | 10th Anniversary Double India Pale Ale | 120 Minute IPA | 12th Anniversary Undercover Investigation Shut-Down Ale | 14’ER ESB | 1554 Enlightened Black Ale | 15th Anniversary Wood Aged
1fastz28 | 0 | 0.0 | 0 | 0 | 0.0 | 3.0 | 0 | 3 | 3.5 | 0
4DAloveofSTOUT | 0 | 0.0 | 0 | 0 | 0.0 | 3.5 | 0 | 0 | 0.0 | 0
99bottles | 0 | 0.0 | 0 | 0 | 0.0 | 3.0 | 0 | 0 | 0.0 | 0
9InchNails | 0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0
aaronh | 0 | 2.0 | 0 | 0 | 3.5 | 3.5 | 0 | 0 | 5.0 | 0
AaronHomoya | 0 | 3.5 | 0 | 0 | 0.0 | 4.0 | 0 | 0 | 0.0 | 0
AaronRed | 0 | 3.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0
aasher | 0 | 2.5 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 4.0 | 0
abcsofbeer | 0 | 3.0 | 0 | 0 | 0.0 | 4.0 | 0 | 4 | 0.0 | 0
abrand | 0 | 0.0 | 4 | 0 | 0.0 | 0.0 | 0 | 0 | 4.5 | 0
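
The reshaping step can be sketched as follows. This is a Python illustration, not the author's dcast() call: it assumes duplicate user-beer reviews are averaged (the report does not say which aggregation function was used) and fills unrated cells with 0, as in the sample above. The function name and toy data are hypothetical.

```python
from collections import defaultdict


def build_user_item_matrix(reviews):
    """Pivot (user, beer, rating) triples into a dense user-item matrix.

    Duplicate (user, beer) pairs are averaged (an assumption);
    unrated cells are filled with 0.0.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user, beer, rating in reviews:
        sums[(user, beer)] += rating
        counts[(user, beer)] += 1

    users = sorted({user for user, _ in counts})
    beers = sorted({beer for _, beer in counts})
    matrix = [
        [
            sums[(u, b)] / counts[(u, b)] if (u, b) in counts else 0.0
            for b in beers
        ]
        for u in users
    ]
    return users, beers, matrix


# Hypothetical toy reviews; "aaronh" rates "#9" twice.
toy = [
    ("aaronh", "#9", 2.0),
    ("aaronh", "#9", 3.0),
    ("abrand", "10 Commandments", 4.0),
]
users, beers, matrix = build_user_item_matrix(toy)
print(users, beers, matrix)
```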

Model 1: User-Based Collaborative Filter

UBCF: Setup

  • Before building the model, create the various matrices that will be evaluated for use in the model
  • realRatingMatrix & normalized realRatingMatrix
  • Perform some initial EDA (heat map, distribution of ratings)
  • Binarizing the matrices: the goal of the UBCF model is to provide the reviewer with beers s/he will enjoy. Since binary models tend to yield more accurate results, the binary version of user.beer.matrix will feed the UBCF model.

UBCF: Setup - Heat Maps

UBCF: Setup - Summary Stats of Beer Ratings

  • Using the regular and normalized versions of the user.beer.matrix

UBCF: Model Construction

  • The UBCF model was constructed using the R recommenderlab package
  • As mentioned above, the binary version of the “user.beer.matrix” was used, as it traditionally produces better results
  • The binary matrix was split into train and test data sets
  • The UBCF Jaccard method was employed as it has produced the best UBCF results in previous projects
  • Generate a listing of predicted beers by user
  • In the performance metrics (model_stats), precision is fairly high (0.74) but recall is very low (0.03)
##           model.name        ID
## TP              7.38        TP
## FP              2.62        FP
## FN            354.24        FN
## TN           1828.76        TN
## precision       0.74 precision
## recall          0.03    recall
## TPR             0.03       TPR
## FPR             0.00       FPR
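
As a rough cross-check of the table above, the metrics can be recomputed from the averaged confusion counts. This is a Python sketch with a hypothetical helper name, not part of the original analysis. Ratios of averaged counts need not equal averages of per-user ratios, which is presumably why the recall computed this way (about 0.02) differs slightly from the reported 0.03.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard precision, recall (TPR) and false-positive rate."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Averaged counts copied from the model_stats output above.
metrics = confusion_metrics(tp=7.38, fp=2.62, fn=354.24, tn=1828.76)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Read this way, roughly 7 of every 10 recommended beers are hits, but those hits cover only a small share of the held-out beers for each user.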

UBCF: Model Beer Recommender Generator (Beers You Might Enjoy!)

  • Utilizing the model results above, generate a maximum of 10 beer recommendations per user
  • Loop through the predicted values & construct a data frame from beer names
  • Remove all beers that have been reviewed by the user
  • Randomly select a user & return the beer recommendations
## [1] "Beers  abcsofbeer  may enjoy (below list):"
##                       beer_name
## 1  Belhaven Twisted Thistle IPA
## 2                           Hex
## 3          Sless' Oatmeal Stout
## 4                Menace De Dieu
## 5                 Wild Ride IPA
## 6                   Gumballhead
## 7   Atwater Vanilla Java Porter
## 8                        Uff-da
## 9              Brooklyn Local 1
## 10                      Choklat
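
The idea behind Jaccard-based UBCF on binary data can be sketched in a few lines. This is a simplified Python illustration, not the recommenderlab implementation: the function names, the neighbour count, the similarity-weighted scoring, and the toy data are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, user_items, n_neighbors=2, top_n=10):
    """Score unseen beers by the summed similarity of the nearest neighbours."""
    seen = user_items[target]
    neighbors = sorted(
        (
            (jaccard(seen, items), user)
            for user, items in user_items.items()
            if user != target
        ),
        reverse=True,
    )[:n_neighbors]

    scores = {}
    for similarity, user in neighbors:
        # Only beers the target user has not reviewed are candidates.
        for beer in user_items[user] - seen:
            scores[beer] = scores.get(beer, 0.0) + similarity

    ranked = sorted(scores, key=lambda beer: (-scores[beer], beer))
    return ranked[:top_n]


# Hypothetical binary data: the set of beers each user has reviewed.
user_items = {
    "ann": {"IPA", "Stout", "Porter"},
    "bob": {"IPA", "Stout", "Saison"},
    "cat": {"IPA", "Lager"},
    "dan": {"Pilsner"},
}
print(recommend("ann", user_items))
```

Here "bob" shares two of four distinct beers with "ann" (similarity 0.5), so his unseen beer outranks the one contributed by "cat" (similarity 0.25).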

Model 2: SVD

SVD: Setup

## List of 3
##  $ d: num [1:1763] 1759 610 412 383 357 ...
##  $ u: num [1:1763, 1:1763] -0.0203 -0.0106 -0.0163 -0.0106 -0.0276 ...
##  $ v: num [1:2203, 1:1763] -0.0123 -0.0374 -0.0148 -0.0075 -0.0105 ...

SVD: Plot Singular Values

  • The plot shows the descending order of the singular values
  • The magnitudes appear to rapidly decline through the first ~250 & level out around 500
  • The plot seems to indicate that the first ~250 values capture a lot of the variability from the user beer matrix

SVD: Dimensionality Reduction

  • Reduce the dimensionality of the SVD using the sum-of-squares approach
  • Sum the squares of the singular values: sum(beer_svd$d^2)

  • Goal: find the first k singular values whose squares sum to at least 90% of that total (the variability)
  • Plot of a running sum of squares for the singular values
  • The model can retain 90% of the variability by keeping 650 singular values

  • Create new SVD matrices considering new k
  • Predict values based on the SVD matrices: new.U %*% new.SMat %*% new.V
  • Assign metadata from original user.beer.matrix

## [1] "The model can retain 90% of the variability by keeping:  650  singular values"
## [1] 650 650
## [1] 1763  650
## [1]  650 2203
## [1] 1763 2203
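
The 90% rule used to choose k can be sketched as below. This Python snippet is illustrative only: the function name and the toy singular values are hypothetical, and it is not the report's R code.

```python
def rank_for_energy(singular_values, threshold=0.90):
    """Smallest k such that the first k squared singular values reach
    `threshold` of the total sum of squares."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    running = 0.0
    for k, square in enumerate(squares, start=1):
        running += square
        if running >= threshold * total:
            return k
    return len(squares)


# Hypothetical singular values in descending order.
toy_values = [10.0, 5.0, 2.0, 1.0]
k = rank_for_energy(toy_values)
print(k)
```

For the toy values the squares are 100, 25, 4 and 1 (total 130), so the first two already exceed 90% of the total. In the report the same criterion yields k = 650 out of 1763 singular values.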

SVD: Prediction Assessment

  • Check min/max predicted ratings
  • Overwrite values that are outside the rating bounds (1-5)
  • Get prediction summary
  • Get accuracy/performance metrics
  • RMSE (root mean squared error) is the square root of the mean squared difference between the predicted and actual values. The RMSE of the SVD model is 0.44, meaning that predictions typically deviate from the actual ratings by about 0.44 rating points.
## [1] "min values: -2.06704979248077"
## [1] "max values: 6.70776345237745"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.1711  0.7400  0.6166  5.0000
## [1] "RMSE of SVD Model; 0.44"
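
A minimal sketch of the clamping and RMSE steps, in Python for illustration. The helper names and toy numbers are hypothetical. The bounds default to the 1-5 range stated above; note that the printed summary shows a minimum of 0, so the original code may have used a different lower bound.

```python
import math


def clamp(value, low=1.0, high=5.0):
    """Force a predicted rating into the [low, high] range."""
    return max(low, min(high, value))


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical raw predictions, two of them outside the rating scale.
raw = [-2.07, 6.71, 3.5]
clipped = [clamp(v) for v in raw]
error = rmse(clipped, [1.0, 4.0, 3.5])
print(clipped, round(error, 3))
```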

SVD: Prediction Examples

  • Loop through the row names & construct a data frame from beer name, user and ratings
  • Remove all beers that have been reviewed by the user
  • Sort resulting data frame by rating in descending order
  • Store the top 10 ratings to a df_total data frame

  • Randomly select a user & call for the top 10 ratings stored in df_total
  • Provide context by displaying the average rating by the randomly generated user in the base data set
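
The selection logic in the steps above (drop already-reviewed beers, sort by predicted rating, keep the top N) can be sketched as follows. This is a Python illustration with a hypothetical function name and hypothetical toy values, not the report's loop over df_total.

```python
def top_predictions(predicted, already_reviewed, n=10):
    """Return the n highest-rated (beer, rating) pairs the user has not reviewed."""
    candidates = [
        (beer, rating)
        for beer, rating in predicted.items()
        if beer not in already_reviewed
    ]
    # Sort by rating descending, breaking ties alphabetically by beer name.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:n]


# Hypothetical predicted ratings for one user.
predicted = {
    "La Chouffe": 4.90,
    "Pranqster": 5.00,
    "Nut Brown Ale": 4.94,
    "120 Minute IPA": 3.00,
}
top_two = top_predictions(predicted, already_reviewed={"120 Minute IPA"}, n=2)
print(top_two)
```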

SVD Top Review/User Sample
Rating_Pred | beer_name | review_profilename | reviewer_ID
5.00 | Gulden Draak (Dark Triple) | 1fastz28 | 1076
5.00 | Hoegaarden Original White Ale | 1fastz28 | 1076
5.00 | Pranqster | 1fastz28 | 1076
5.00 | Samuel Smith’s Nut Brown Ale | 1fastz28 | 1076
4.94 | Nut Brown Ale | 1fastz28 | 1076
4.90 | La Chouffe | 1fastz28 | 1076
## [1] "Top 10 Beer Recommendations for User:  axeman9182  found below"
##    Rating_Pred                          beer_name review_profilename
## 1         5.00                     Festina Pêche         axeman9182
## 2         5.00                    La Fin Du Monde         axeman9182
## 3         5.00                         Temptation         axeman9182
## 4         4.99             Racer 5 India Pale Ale         axeman9182
## 5         4.98                 Bell's Hopslam Ale         axeman9182
## 6         4.98                       Consecration         axeman9182
## 7         4.97 Bourbon County Brand Vanilla Stout         axeman9182
## 8         4.95              Terrapin Rye Pale Ale         axeman9182
## 9         4.93        Founders CBS Imperial Stout         axeman9182
## 10        4.90                       Supplication         axeman9182
##    reviewer_ID
## 1         3908
## 2         3908
## 3         3908
## 4         3908
## 5         3908
## 6         3908
## 7         3908
## 8         3908
## 9         3908
## 10        3908
## [1] "The Average Beer Rating for User in The Base Dataset:"
## # A tibble: 0 x 8
## # Groups:   reviewer_ID [0]
## # ... with 8 variables: reviewer_ID <int>, review_profilename <fct>,
## #   overall.rt.mean <dbl>, aroma.rt.mean <dbl>, appreance.rt.mean <dbl>,
## #   palate.rt.mean <dbl>, taste.rt.mean <dbl>, num.beer.reviewed <int>


Conclusions:

This project allowed for a greater understanding of user-discovery-based recommender systems. More specifically, I had the opportunity to better understand how to produce and optimize recommendations with a collaborative filtering model and a singular value decomposition model. In future iterations of this project, I would like to explore building a user interface for interacting with the models.