PROJECT 1: GLOBAL BASELINE PREDICTORS & RMSE

Briefly describe the recommender system that you’re going to build out from a business perspective, e.g. “This system recommends data science books to readers.”

The recommender system built for this project recommends music albums to users based on the reviews of professional critics (from other websites and print publications) as collected on Metacritic.com. Since the ratings from different critics are on different scales (out of 4, 5, 10, etc.), Metacritic “normalizes” these ratings and then combines them to assign a “Metascore” to each album, movie, game, or television show. A not-so-detailed explanation of their system can be found here: http://www.metacritic.com/about-metascores. For this project, the individual (normalized) ratings of the different critics, not the overall “Metascore”, will be used to build the system.

Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.

To start out, a handful of albums reviewed by mainly the same websites/publications were selected. Reviews came from Rolling Stone, Entertainment Weekly, Pitchfork, The New York Times, Spin, and AllMusic.com. On its own, this did not provide a sparse enough matrix, so a few of the scores were randomly deleted.

Load your data into (for example) an R or pandas dataframe, a Python dictionary or list of lists, (or another data structure of your choosing). From there, create a user-item matrix.

Artist.Album <- c("Kendrick Lamar, DAMN", "Ed Sheeran, Divide", "Bruno Mars, 24K Magic", 
                  "Chainsmokers, Memories...", "Adele, 25", "Harry Styles, Harry Styles")

critics <- c("Rolling Stone", "Entertainment Weekly", "Pitchfork", "New York Times", "Spin", "AllMusic")

RS <- c(90, 80, 60, 40, 100, 80)
EW <- c(100, 67, 83, 58, 91, 75)
PF <- c(92, 28, 62, 42, 73, NA)
NYT <- c(90, 70, 70, NA, 70, NA)
SPN <- c(90, NA, NA, 30, 60, 50)
AM <- c(90, 70, 80, 40, NA, 70)

scores <- data.frame(RS, EW, PF, NYT, SPN, AM, stringsAsFactors = F)
rownames(scores) <- Artist.Album
scores
##                             RS  EW PF NYT SPN AM
## Kendrick Lamar, DAMN        90 100 92  90  90 90
## Ed Sheeran, Divide          80  67 28  70  NA 70
## Bruno Mars, 24K Magic       60  83 62  70  NA 80
## Chainsmokers, Memories...   40  58 42  NA  30 40
## Adele, 25                  100  91 73  70  60 NA
## Harry Styles, Harry Styles  80  75 NA  NA  50 70
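
As a quick check on the required sparsity, we can count how much of the matrix is missing; this is just a small sketch, and the counts in the comments follow from the scores shown above:

sum(is.na(scores))    # 6 of the 36 cells are missing
mean(is.na(scores))   # roughly 17% of the matrix is NA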

Break your ratings into separate training and test datasets.

First, we’ll select a few user/item (reviewer/album) combinations and store their values in a vector for our test dataset (the index pairs are hard-coded below; a sketch of how they could be drawn at random follows the output):

vals <- rbind(c(2,1), c(4,2), c(1,3), c(2,6), c(5,5), c(3,2))  # (row, column) index pairs, e.g. c(2,1) = Ed Sheeran's Rolling Stone score
test <- as.numeric(scores[vals])  # matrix indexing pulls the score at each (row, column) pair
test
## [1] 80 58 92 70 60 83
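
If we wanted to draw the held-out user/item pairs at random instead of fixing them by hand, something like the following would work (a sketch only; the seed and the `observed` object are illustrative choices, not part of the build above):

set.seed(123)                                         # illustrative seed for reproducibility
observed <- which(!is.na(scores), arr.ind = TRUE)     # (row, col) positions of every non-missing score
vals_random <- observed[sample(nrow(observed), 6), ]  # draw 6 observed (row, col) pairs at random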

To make sure we’re not using those same values in the training data, those user/item combinations will be replaced with NAs.

training <- scores
training[vals] <- NA  # replace the elements that went to the test set with NA values
training
##                             RS  EW PF NYT SPN AM
## Kendrick Lamar, DAMN        90 100 NA  90  90 90
## Ed Sheeran, Divide          NA  67 28  70  NA NA
## Bruno Mars, 24K Magic       60  NA 62  70  NA 80
## Chainsmokers, Memories...   40  NA 42  NA  30 40
## Adele, 25                  100  91 73  70  NA NA
## Harry Styles, Harry Styles  80  75 NA  NA  50 70

Using your training data, calculate the raw average (mean) rating for every user-item combination.

Ignoring the NA values in the training data set, we can calculate the raw average (the expected value of the entire matrix):

train_avg <- mean(as.matrix(training), na.rm=TRUE)
train_avg
## [1] 69.08333

Here, the average value is 69.08333, just under 70. On a 5-point scale this would be about \(3\frac{1}{2}\) stars, which seems reasonable, as media reviews tend to skew positive (3-5 out of 5 stars, for example).

Calculate the RMSE for raw average for both your training data and your test data.
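
For reference, the quantity computed over the next few steps is the standard RMSE, where \(R\) is the set of observed (critic, album) pairs and, for the raw-average predictor, \(\hat{r}_{ui}\) is simply the training average \(\mu\) found above:

\[
RMSE = \sqrt{\frac{1}{|R|}\sum_{(u,i) \in R}\left(r_{ui} - \hat{r}_{ui}\right)^{2}}
\]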

To calculate the RMSE for the training and test sets, we’ll first find the difference (error) between the actual and the expected values for the training set:

train_err <- training - train_avg
train_err
##                                    RS        EW         PF        NYT
## Kendrick Lamar, DAMN        20.916667 30.916667         NA 20.9166667
## Ed Sheeran, Divide                 NA -2.083333 -41.083333  0.9166667
## Bruno Mars, 24K Magic       -9.083333        NA  -7.083333  0.9166667
## Chainsmokers, Memories...  -29.083333        NA -27.083333         NA
## Adele, 25                   30.916667 21.916667   3.916667  0.9166667
## Harry Styles, Harry Styles  10.916667  5.916667         NA         NA
##                                  SPN          AM
## Kendrick Lamar, DAMN        20.91667  20.9166667
## Ed Sheeran, Divide                NA          NA
## Bruno Mars, 24K Magic             NA  10.9166667
## Chainsmokers, Memories...  -39.08333 -29.0833333
## Adele, 25                         NA          NA
## Harry Styles, Harry Styles -19.08333   0.9166667

Since the errors can be positive or negative, we’ll square them so that they don’t cancel out when summed (the “squared” part of RMSE):

train_RMSE <- (train_err)^2  # squared errors (the train_RMSE name is reused for each intermediate step)
train_RMSE
##                                   RS         EW         PF         NYT
## Kendrick Lamar, DAMN       437.50694 955.840278         NA 437.5069444
## Ed Sheeran, Divide                NA   4.340278 1687.84028   0.8402778
## Bruno Mars, 24K Magic       82.50694         NA   50.17361   0.8402778
## Chainsmokers, Memories...  845.84028         NA  733.50694          NA
## Adele, 25                  955.84028 480.340278   15.34028   0.8402778
## Harry Styles, Harry Styles 119.17361  35.006944         NA          NA
##                                  SPN          AM
## Kendrick Lamar, DAMN        437.5069 437.5069444
## Ed Sheeran, Divide                NA          NA
## Bruno Mars, 24K Magic             NA 119.1736111
## Chainsmokers, Memories...  1527.5069 845.8402778
## Adele, 25                         NA          NA
## Harry Styles, Harry Styles  364.1736   0.8402778

Next, we’ll sum all of the values, and divide by the number of non-NA values in the training set data frame:

train_RMSE <- sum(train_RMSE, na.rm=TRUE)

train_RMSE <- train_RMSE / sum(!is.na(train_err))  # divide by the number of non-NA training values (24 here)
train_RMSE
## [1] 440.6597

Last, we’ll take the square root to find the RMSE of the training set:

train_RMSE <- sqrt(train_RMSE)
train_RMSE
## [1] 20.9919

The RMSE of the training set for the raw average is above. Next, we’ll do the same, but for the test set:

test_RMSE <- sqrt(sum(((test - train_avg)^2), na.rm = TRUE) / length(test[!is.na(test)]))
test_RMSE
## [1] 13.19222
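
The step-by-step calculation above can also be wrapped in a small helper; `rmse` below is a hypothetical function name, not used elsewhere in this write-up, but it reproduces the same two values when given the raw average as the prediction:

rmse <- function(actual, predicted) {
  # mean of squared errors over the non-NA cells, then the square root
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# rmse(as.matrix(training), train_avg)   # ~20.99, matching train_RMSE above
# rmse(test, train_avg)                  # ~13.19, matching test_RMSE above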

Using your training data, calculate the bias for each user and each item.

To find the bias for each user (reviewer/publication) and item (album), we simply find the average rating for the applicable user or item, and then subtract the raw average found earlier using the training set:

album_bias <- rowMeans(training, na.rm = TRUE) - train_avg
album_bias
##       Kendrick Lamar, DAMN         Ed Sheeran, Divide 
##                 22.9166667                -14.0833333 
##      Bruno Mars, 24K Magic  Chainsmokers, Memories... 
##                 -1.0833333                -31.0833333 
##                  Adele, 25 Harry Styles, Harry Styles 
##                 14.4166667                 -0.3333333

Looking at the bias for each album, we can see how far each album’s average rating sits above or below the average rating across all user/item combinations. The Harry Styles and Bruno Mars albums are rated right around the overall average, whereas Kendrick Lamar scores much better than average, and the Chainsmokers album seems to be really disliked by critics.

rev_bias <- colMeans(training, na.rm = TRUE) - train_avg
rev_bias
##          RS          EW          PF         NYT         SPN          AM 
##   4.9166667  14.1666667 -17.8333333   5.9166667 -12.4166667   0.9166667

As with the albums, the critics also skew more positive or negative with their reviews. Rolling Stone, The New York Times, and AllMusic are fairly even across this small sample of scores, Entertainment Weekly tends to give more positive scores, and Pitchfork and Spin lean more negative with their reviews.

From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.
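
In symbols (a standard formulation), each baseline predictor combines the raw training average \(\mu\) with the critic (user) bias \(b_u\) and the album (item) bias \(b_i\) just calculated:

\[
\hat{r}_{ui} = \mu + b_u + b_i
\]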

baseline <- as.data.frame(matrix(nrow=6, ncol=6))

# Each row is: raw average + album (item) bias + critic (user) bias
for (i in 1:length(album_bias)){
  baseline[i, ] <- rev_bias + album_bias[i] + train_avg
}

# Cap any values over the maximum possible score of 100
baseline[baseline > 100] <- 100

rownames(baseline) <- Artist.Album
colnames(baseline) <- colnames(scores)
round(baseline, 2)
##                               RS     EW    PF   NYT   SPN    AM
## Kendrick Lamar, DAMN       96.92 100.00 74.17 97.92 79.58 92.92
## Ed Sheeran, Divide         59.92  69.17 37.17 60.92 42.58 55.92
## Bruno Mars, 24K Magic      72.92  82.17 50.17 73.92 55.58 68.92
## Chainsmokers, Memories...  42.92  52.17 20.17 43.92 25.58 38.92
## Adele, 25                  88.42  97.67 65.67 89.42 71.08 84.42
## Harry Styles, Harry Styles 73.67  82.92 50.92 74.67 56.33 69.67
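
The loop above can also be written in one vectorized step with `outer()`, which adds every album bias to every critic bias; this is just an equivalent sketch reusing the objects already defined (`baseline_alt` is an illustrative name):

baseline_alt <- outer(album_bias, rev_bias, "+") + train_avg   # album bias + critic bias + raw average
baseline_alt <- pmin(baseline_alt, 100)                        # cap at the maximum possible score of 100
# all.equal(as.matrix(baseline), baseline_alt, check.attributes = FALSE)  # should be TRUE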

Because the same calculation is applied to every cell, regardless of whether the actual score was observed or missing, the baseline-predictor matrix is filled in for every combination with: training-set average + item (album) bias + user (critic) bias.

Compared to the original scores, some of the predictions are very close. Where they are off, the bias of the other review sources skewed the results. For example, the actual scores for Kendrick Lamar’s new album “DAMN” are nearly perfect across the board, but Pitchfork tended to rate everything much lower than the other sources. So, although “DAMN” still has the highest predicted rating of all of the albums reviewed by Pitchfork, the prediction is still far off from the actual score of 92.

Calculate the RMSE for the baseline predictors for both your training data and your test data.

Since the original training data contains NAs for some of the values, we can simply subtract the baseline-predictor data frame from the training data frame without needing to remove the values that were originally missing or the ones held out for the test set; the NAs just drop out of the sum. The RMSE is then calculated the same way as before (note that the denominator below is the number of non-missing entries in the original scores matrix):

RMSE for the Training Set:

train_bias_RMSE <- sqrt(sum((training - baseline)^2, na.rm=TRUE) / length(scores[!is.na(scores)]))
train_bias_RMSE
## [1] 8.370777

RMSE for the Test Set:

# pull out the user/item combinations used in the Test set from the baseline predictions
test_baseline <- baseline[vals]
test_baseline
## [1] 59.91667 52.16667 74.16667 55.91667 71.08333 82.16667
test
## [1] 80 58 92 70 60 83

The first set of scores above contains the baseline-predictor results for the user/item pairs held out for the test set; the second is the actual test-set scores. Below, the same RMSE calculation is done using the original test set and the test-set baseline predictors:

# calculate the RMSE
test_bias_RMSE <- sqrt(sum((test - test_baseline)^2) / length(test))
test_bias_RMSE
## [1] 13.39945

Summarize your results. (What method worked best, what did you learn?)

A comparison of the RMSE calculated for the raw average and the baseline predictors is below:

| Data Subset | Raw Avg. RMSE | Baseline Pred. RMSE | Change                           |
|-------------|---------------|---------------------|----------------------------------|
| Training    | 20.9919       | 8.370777            | 60.1% improvement                |
| Test        | 13.19222      | 13.39945            | 1.57% increase (slightly worse)  |

Examining the two methods on the training data, accounting for the bias of the review sources and artists resulted in a much lower RMSE and fairly accurate predicted scores for each album. Using the raw average of scores alone yielded acceptable results, but, similar to regression analysis, considering more parameters will sometimes give a better prediction than a single predictor.

On the test set, however, there was not a huge difference between the RMSE for the raw average and the RMSE for the baseline predictors (in fact, there was a very small increase). The reason for this may be the relatively small size of the user/item matrix, where a single heavily skewed score, or one or two missing values for a review source or album, can noticeably shift the results.

It was interesting to see the bias introduced by different review sources. In the past, I have noticed that reviews from niche sources (blogs or smaller publications) seem a little more pretentious, whereas reviews from publications with a wider appeal (Entertainment Weekly, USA Today) seem a little friendlier. Using some fairly simple math and logic yielded acceptable results for a recommender system; however, using more complex linear algebra and/or algorithms would hopefully give more accurate predictions.