The recommender system built for this project recommends music albums to users based on the normalized scores of professional critics (from other websites and print publications) collected on Metacritic.com. Since ratings from different critics are on different scales (out of 4, 5, 10, etc.), Metacritic “normalizes” these ratings and then combines them to assign a “Metascore” to each album, movie, game, or television show. A brief explanation of their system can be found here: http://www.metacritic.com/about-metascores. For this project, the individual ratings of the different critics, not the overall “Metascore,” will be used to build the system.
To start, a handful of albums reviewed by largely the same websites/publications were selected, using reviews from Rolling Stone, Entertainment Weekly, Pitchfork, The New York Times, Spin, and AllMusic.com. This did not provide a sparse enough matrix, so a few of the scores were randomly deleted.
Artist.Album <- c("Kendrick Lamar, DAMN", "Ed Sheeran, Divide", "Bruno Mars, 24K Magic",
"Chainsmokers, Memories...", "Adele, 25", "Harry Styles, Harry Styles")
critics <- c("Rolling Stone", "Entertainment Weekly", "Pitchfork", "New York Times", "Spin", "AllMusic")
RS <- c(90, 80, 60, 40, 100, 80)
EW <- c(100, 67, 83, 58, 91, 75)
PF <- c(92, 28, 62, 42, 73, NA)
NYT <- c(90, 70, 70, NA, 70, NA)
SPN <- c(90, NA, NA, 30, 60, 50)
AM <- c(90, 70, 80, 40, NA, 70)
scores <- data.frame(RS, EW, PF, NYT, SPN, AM, stringsAsFactors = F)
rownames(scores) <- Artist.Album
scores
## RS EW PF NYT SPN AM
## Kendrick Lamar, DAMN 90 100 92 90 90 90
## Ed Sheeran, Divide 80 67 28 70 NA 70
## Bruno Mars, 24K Magic 60 83 62 70 NA 80
## Chainsmokers, Memories... 40 58 42 NA 30 40
## Adele, 25 100 91 73 70 60 NA
## Harry Styles, Harry Styles 80 75 NA NA 50 70
First, we’ll select a few user/item (reviewer/album) combinations at random, and then store those in a vector for our test dataset:
vals <- rbind(c(2,1), c(4,2), c(1,3), c(2,6), c(5,5), c(3,2)) # (row, column) index pairs: (album, critic)
test <- as.numeric(scores[vals]) # the actual scores at those positions
test
## [1] 80 58 92 70 60 83
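The index pairs above are hard-coded; as a minimal sketch, assuming we only want to sample cells that actually contain a score, they could instead be drawn at random like this (the names `observed` and `vals_random` are only illustrative):
# Hypothetical alternative: sample 6 observed (album, critic) cells at random
set.seed(42)                                                  # illustrative seed
observed <- which(!is.na(as.matrix(scores)), arr.ind = TRUE)  # (row, col) positions of all non-NA scores
vals_random <- observed[sample(nrow(observed), 6), , drop = FALSE]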
To make sure we’re not using those same values in the training data, those user/item combinations will be replaced with NAs.
training <- scores
training[vals] <- NA # replace the elements that went to the test set with NA values
training
## RS EW PF NYT SPN AM
## Kendrick Lamar, DAMN 90 100 NA 90 90 90
## Ed Sheeran, Divide NA 67 28 70 NA NA
## Bruno Mars, 24K Magic 60 NA 62 70 NA 80
## Chainsmokers, Memories... 40 NA 42 NA 30 40
## Adele, 25 100 91 73 70 NA NA
## Harry Styles, Harry Styles 80 75 NA NA 50 70
After removing the NA values from the training data set, the expected value (raw average) of the entire matrix can be calculated:
train_avg <- mean(as.matrix(training), na.rm=TRUE)
train_avg
## [1] 69.08333
Here, the average value is 69.08333, just under 70. On a scale of 5, this would be \(3\frac{1}{2}\) stars, which makes sense, as reviews of media tend to skew positive (3-5 out of 5 stars, for example).
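For reference, the error metric used throughout is the root mean square error (RMSE), where \(r_{ui}\) is the actual score from critic \(u\) for album \(i\), \(\hat{r}_{ui}\) is the predicted score (for now, just the raw average), and \(N\) is the number of non-missing scores being evaluated:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{(u,i)}\left(r_{ui} - \hat{r}_{ui}\right)^2}
\]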
To calculate the RMSE for the training and test sets, we’ll first find the difference (error) between the actual and the expected values for the training set:
train_err <- training - train_avg
train_err
## RS EW PF NYT
## Kendrick Lamar, DAMN 20.916667 30.916667 NA 20.9166667
## Ed Sheeran, Divide NA -2.083333 -41.083333 0.9166667
## Bruno Mars, 24K Magic -9.083333 NA -7.083333 0.9166667
## Chainsmokers, Memories... -29.083333 NA -27.083333 NA
## Adele, 25 30.916667 21.916667 3.916667 0.9166667
## Harry Styles, Harry Styles 10.916667 5.916667 NA NA
## SPN AM
## Kendrick Lamar, DAMN 20.91667 20.9166667
## Ed Sheeran, Divide NA NA
## Bruno Mars, 24K Magic NA 10.9166667
## Chainsmokers, Memories... -39.08333 -29.0833333
## Adele, 25 NA NA
## Harry Styles, Harry Styles -19.08333 0.9166667
Since some of the errors are negative, we’ll square the differences so that positive and negative errors don’t cancel out:
train_RMSE <- (train_err)^2
train_RMSE
## RS EW PF NYT
## Kendrick Lamar, DAMN 437.50694 955.840278 NA 437.5069444
## Ed Sheeran, Divide NA 4.340278 1687.84028 0.8402778
## Bruno Mars, 24K Magic 82.50694 NA 50.17361 0.8402778
## Chainsmokers, Memories... 845.84028 NA 733.50694 NA
## Adele, 25 955.84028 480.340278 15.34028 0.8402778
## Harry Styles, Harry Styles 119.17361 35.006944 NA NA
## SPN AM
## Kendrick Lamar, DAMN 437.5069 437.5069444
## Ed Sheeran, Divide NA NA
## Bruno Mars, 24K Magic NA 119.1736111
## Chainsmokers, Memories... 1527.5069 845.8402778
## Adele, 25 NA NA
## Harry Styles, Harry Styles 364.1736 0.8402778
Next, we’ll sum all of the values, and divide by the number of non-NA values in the training set data frame:
train_RMSE <- sum(train_RMSE, na.rm=TRUE)
train_RMSE <- train_RMSE / sum(!is.na(train_err)) # divide by the number of non-NA errors
train_RMSE
## [1] 440.6597
Last, we’ll take the square root to find the RMSE of the training set:
train_RMSE <- sqrt(train_RMSE)
train_RMSE
## [1] 20.9919
The RMSE of the training set for the raw average is above. Next, we’ll do the same, but for the test set:
test_RMSE <- sqrt(sum(((test - train_avg)^2), na.rm = TRUE) / length(test[!is.na(test)]))
test_RMSE
## [1] 13.19222
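As an aside, the same arithmetic can be wrapped in a small helper; `rmse()` below is only a hypothetical convenience function, not part of the original workflow:
rmse <- function(actual, predicted) {
  err <- as.matrix(actual) - predicted               # element-wise errors (NAs stay NA)
  sqrt(sum(err^2, na.rm = TRUE) / sum(!is.na(err)))  # mean squared error over observed cells, then the square root
}
rmse(training, train_avg)  # should reproduce train_RMSE above
rmse(test, train_avg)      # should reproduce test_RMSE above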
To find the bias for each user (reviewer/publication) and item (album), we simply find the average rating for the applicable user or item, and then subtract the raw average found earlier using the training set:
album_bias <- rowMeans(training, na.rm = TRUE) - train_avg
album_bias
## Kendrick Lamar, DAMN Ed Sheeran, Divide
## 22.9166667 -14.0833333
## Bruno Mars, 24K Magic Chainsmokers, Memories...
## -1.0833333 -31.0833333
## Adele, 25 Harry Styles, Harry Styles
## 14.4166667 -0.3333333
Looking at the bias for each album, we can see how far its average rating sits above or below the average rating across all user/item combinations. The Harry Styles and Bruno Mars albums are rated right around the average score, whereas the Kendrick Lamar album scores much better than average, and the Chainsmokers album seems to be really disliked by critics.
rev_bias <- colMeans(training, na.rm = TRUE) - train_avg
rev_bias
## RS EW PF NYT SPN AM
## 4.9166667 14.1666667 -17.8333333 5.9166667 -12.4166667 0.9166667
As with the albums, the critics also skew more positive or negative in their scoring. Rolling Stone, the New York Times, and AllMusic appear fairly even across this small sample of scores, Entertainment Weekly looks to give more positive scores, and Pitchfork and Spin lean more negative with their reviews.
Using these biases, the baseline predictor (training-set average + album bias + critic bias) can now be calculated for every user-item combination.
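In the usual baseline-predictor notation, with \(\mu\) the training-set average, \(b_u\) the critic (user) bias, and \(b_i\) the album (item) bias calculated above, each prediction is simply:
\[
\hat{r}_{ui} = \mu + b_u + b_i
\]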
baseline <- as.data.frame(matrix(nrow=6, ncol=6))
# For each album (row), add its bias and each critic's bias to the training-set average
for (i in 1:length(album_bias)){
  row <- c(rev_bias + album_bias[i] + train_avg)
  baseline[i, ] <- row
}
# Cap any values over the maximum possible score of 100
baseline[baseline > 100] <- 100
rownames(baseline) <- Artist.Album
colnames(baseline) <- colnames(scores)
round(baseline, 2)
## RS EW PF NYT SPN AM
## Kendrick Lamar, DAMN 96.92 100.00 74.17 97.92 79.58 92.92
## Ed Sheeran, Divide 59.92 69.17 37.17 60.92 42.58 55.92
## Bruno Mars, 24K Magic 72.92 82.17 50.17 73.92 55.58 68.92
## Chainsmokers, Memories... 42.92 52.17 20.17 43.92 25.58 38.92
## Adele, 25 88.42 97.67 65.67 89.42 71.08 84.42
## Harry Styles, Harry Styles 73.67 82.92 50.92 74.67 56.33 69.67
Since the same calculation is applied to every user/item cell, regardless of whether the actual score was present or missing, the matrix of baseline predictors is fully populated with training-set average + critic bias + album bias.
Compared to the original scores, some of the results are very close. Where predictions are off, the bias of the other review sources skewed the results. For example, the actual scores for Kendrick Lamar’s album “DAMN” are nearly perfect across the board, but Pitchfork tended to rate everything much lower than the other sources. So, although “DAMN” still has the highest baseline prediction of all the albums reviewed by Pitchfork, it is still far off from the actual score of 92.
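(As a side note, the same baseline matrix could also be built without an explicit loop using base R’s outer(); `baseline_alt` below is just an illustrative name.)
# Alternative sketch: album bias + critic bias + training-set average, all at once
baseline_alt <- outer(album_bias, rev_bias, "+") + train_avg
baseline_alt <- pmin(baseline_alt, 100)  # cap at the maximum possible score of 100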
Since the original training data contained NAs for some of the values, we can simply subtract the baseline predictor data frame from the training data frame without needing to remove the values that were originally missing or those held out for the test set. The RMSE is calculated the same way as before:
RMSE for the Training Set:
train_bias_RMSE <- sqrt(sum((training - baseline)^2, na.rm=TRUE) / length(scores[!is.na(scores)]))
train_bias_RMSE
## [1] 8.370777
RMSE for the Test Set:
# pull out the user/item combinations used in the Test set from the baseline predictions
test_baseline <- baseline[vals]
test_baseline
## [1] 59.91667 52.16667 74.16667 55.91667 71.08333 82.16667
test
## [1] 80 58 92 70 60 83
Above are the baseline predictions for the user/item pairs held out for the test set, followed by the actual test set scores. Below, the same RMSE calculation is done using the original test set and the test set baseline predictors:
# calculate the RMSE
test_bias_RMSE <- sqrt(sum((test - test_baseline)^2) / length(test))
test_bias_RMSE
## [1] 13.39945
A comparison of the RMSE calculated for the raw average and the baseline predictors is below:
| Data Subset | Raw Avg. RMSE | Baseline Pred. RMSE | % Improvement |
|---|---|---|---|
| Training | 20.9919 | 8.370777 | +60.1% |
| Test | 13.19222 | 13.39945 | -1.57% (slightly worse) |
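The last column is just the relative change in RMSE, which can be checked directly in R:
# Relative change in RMSE (positive = improvement, negative = worse)
(train_RMSE - train_bias_RMSE) / train_RMSE * 100  # ~ +60.1
(test_RMSE - test_bias_RMSE) / test_RMSE * 100     # ~ -1.57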
Examining the two methods, accounting for the bias of the review sources and the artists resulted in a much lower RMSE on the training set, along with fairly accurate predicted scores for each album. Using the raw average of scores yielded acceptable results, but, much like adding predictors in a regression analysis, considering more parameters will often give a better prediction than a single one.
On the test set, however, there was not a huge difference in the RMSE between the raw average and the baseline predictors (in fact, there was a very small increase). The reason for this may be the relatively small size of the user/item matrix, which can be heavily affected by a single skewed score or one or two missing values for a review source or album.
It was interesting to see the bias introduced by different review sources. In the past, I have noticed that reviews from niche sources (blogs or smaller publications) seem a little more pretentious, whereas reviews from publications with a wider appeal (Entertainment Weekly, USA Today) seem a little friendlier. Using some fairly simple math and logic yielded acceptable results for a recommender system; however, using more complex linear algebra and/or algorithms would hopefully give more accurate predictions.
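As one illustration of that direction (purely a sketch with base R, not part of this project), a low-rank SVD approximation of the baseline residuals could be layered on top of the baseline predictors; the zero-imputation and rank choice below are assumptions:
# Illustrative sketch: rank-2 SVD approximation of the residuals left after the baseline
resid <- as.matrix(training) - as.matrix(baseline)  # residuals on observed cells, NA elsewhere
resid[is.na(resid)] <- 0                            # crude assumption: no residual where unobserved
s <- svd(resid)
rank2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])
svd_pred <- as.matrix(baseline) + rank2             # baseline + low-rank residual estimate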