University of Minnesota’s Introduction to Recommender Systems, Assignment 1.
Implementation in R:
suppressWarnings(suppressMessages(require(operators)))
suppressWarnings(suppressMessages(require(plyr)))
movies <- read.csv("movies.csv",header=TRUE)
ratings <- read.csv("ratings.csv",header=TRUE)
Calculate a movie’s mean rating and return the top 10 movies by mean rating, with their mean, id number, and title.
meanRatings <- aggregate(rating ~ movieId, ratings, mean)
joinedRatings <- join(meanRatings, movies, by = "movieId", type = "left")
reducedRatings <- joinedRatings[,3:1]
sortedRatings <- arrange(reducedRatings,-rating)
head(sortedRatings,10)
## title rating
## 1 Nobody Loves Me (Keiner liebt mich) (1994) 5
## 2 Margaret's Museum (1995) 5
## 3 Jupiter's Wife (1994) 5
## 4 Pie in the Sky (1996) 5
## 5 Neon Bible, The (1995) 5
## 6 Before the Rain (Pred dozhdot) (1994) 5
## 7 Panther (1995) 5
## 8 Red Firecracker, Green Firecracker (Pao Da Shuang Deng) (1994) 5
## 9 To Live (Huozhe) (1994) 5
## 10 Jason's Lyric (1994) 5
## movieId
## 1 106
## 2 114
## 3 128
## 4 129
## 5 138
## 6 214
## 7 297
## 8 309
## 9 326
## 10 391
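Every movie in this list has a mean of exactly 5, which is often a sign that a movie has only a few ratings. A minimal sketch on made-up data (the `toy` data frame and its values are assumptions for illustration, not taken from ratings.csv) shows how printing the number of ratings next to the mean makes this visible:

```r
# Toy ratings: movie 1 has four ratings, movie 2 has a single 5-star rating.
toy <- data.frame(
  movieId = c(1, 1, 1, 1, 2),
  rating  = c(5, 4, 5, 4, 5)
)

# Mean and count per movie, using base R only.
toyMeans  <- aggregate(rating ~ movieId, toy, mean)
toyCounts <- aggregate(rating ~ movieId, toy, length)
names(toyCounts)[2] <- "numRatings"

toySummary <- merge(toyMeans, toyCounts, by = "movieId")
# Sort by mean rating, highest first.
toySummary <- toySummary[order(-toySummary$rating), ]
print(toySummary)
```

In this toy case the movie with one rating ranks above the movie with four, which is the behaviour a damped mean is meant to correct.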
Calculate a movie’s number of ratings and return the 10 movies with the most ratings, with their number of ratings, id number, and title.
numRatingsTable <- table(ratings$movieId)
numRatingsDF <- as.data.frame(numRatingsTable)
colnames(numRatingsDF) <- c("movieId","numRatings")
numRatingsSorted <- arrange(numRatingsDF,-numRatings)
numRatingsSorted <- join(numRatingsSorted, movies, by="movieId", type="left")
numRatingsSorted <- numRatingsSorted[,-4] # shed genre column
head(numRatingsSorted,10)
## movieId numRatings title
## 1 296 333 Pulp Fiction (1994)
## 2 356 321 Forrest Gump (1994)
## 3 318 310 Shawshank Redemption, The (1994)
## 4 480 307 Jurassic Park (1993)
## 5 593 299 Silence of the Lambs, The (1991)
## 6 260 265 Star Wars: Episode IV - A New Hope (1977)
## 7 1 263 Toy Story (1995)
## 8 2571 247 Matrix, The (1999)
## 9 110 244 Braveheart (1995)
## 10 589 244 Terminator 2: Judgment Day (1991)
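One detail worth knowing about this step: as.data.frame(table(...)) stores movieId as a factor rather than a number, which matters whenever the column is converted or joined against numeric ids. A small sketch on made-up ids (the `ids` vector is an assumption for illustration):

```r
# Toy movie ids, deliberately not 1, 2, 3.
ids <- c(296, 296, 356, 10)

counts <- as.data.frame(table(ids))
colnames(counts) <- c("movieId", "numRatings")

# The ids are now factor levels, not numbers.
isFactor <- is.factor(counts$movieId)

# Converting the factor directly returns the internal level codes.
codes <- as.integer(counts$movieId)

# Going through character first recovers the original ids.
realIds <- as.integer(as.character(counts$movieId))

print(isFactor)
print(codes)
print(realIds)
```

Converting through as.character() before as.integer() is the usual way to get the original ids back.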
Calculate a movie’s damped mean, with a damping term of 5. Return the top 10 movies by damped mean rating, with their damped mean, id number, and title.
\[\frac{\sum_{u} r_{ui} + k\mu}{n + k}\]
where \(\sum_{u} r_{ui}\) is the sum over users' ratings for an item, \(k\) is the damping term, \(n\) is the number of ratings for the item, and \(\mu\) is the global mean.
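To get a feel for the formula, here is a minimal sketch on made-up numbers. The helper `dampedMean()` and the global mean of 3.5 are assumptions for illustration only; they are not part of the assignment code.

```r
# Damped mean of one movie's ratings: (sum of ratings + k * mu) / (n + k).
dampedMean <- function(movieRatings, k, mu) {
  (sum(movieRatings) + k * mu) / (length(movieRatings) + k)
}

mu <- 3.5  # assumed global mean, for illustration only
k  <- 5    # damping term used in this assignment

# A single 5-star rating is pulled strongly toward the global mean.
oneRating <- dampedMean(c(5), k, mu)                 # (5 + 17.5) / 6

# One hundred ratings averaging 4.5 are barely affected.
manyRatings <- dampedMean(rep(c(5, 4), 50), k, mu)   # (450 + 17.5) / 105

print(oneRating)
print(manyRatings)
```

Because the result is a weighted average of the ratings and the global mean, it always lies between the lowest and highest possible rating.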
globalMean <- mean(ratings$rating)
ratingSums <- aggregate(rating ~ movieId, ratings, sum)
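# table() stored movieId in numRatingsDF as a factor; convert it back to the
# numeric ids so the join pairs each sum with the right count. The listing
# below shows values far above 5, which is impossible for a damped mean on a
# 5-point scale and points to counts being mismatched without this conversion.
numRatingsDF$movieId <- as.integer(as.character(numRatingsDF$movieId))
ratingsByMovie <- join(ratingSums, numRatingsDF, by="movieId",type="left")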
ratingsByMovie$dampedMean <- (ratingsByMovie$rating + 5*globalMean)/(ratingsByMovie$numRatings + 5)
ratingsByMovie <- arrange(ratingsByMovie,-dampedMean)
ratingsByMovie <- ratingsByMovie[,-2:-3] # drop columns
ratingsByMovie <- join(ratingsByMovie, movies, by="movieId", type="left") # add titles
ratingsByMovie <- ratingsByMovie[,-4] # drop genre column after join
head(ratingsByMovie,10)
## movieId dampedMean
## 1 356 191.2116
## 2 260 185.9135
## 3 593 184.1401
## 4 2571 177.4968
## 5 2858 153.2468
## 6 50 147.0687
## 7 2959 142.1635
## 8 7153 127.3302
## 9 364 127.1635
## 10 5952 126.4968
## title
## 1 Forrest Gump (1994)
## 2 Star Wars: Episode IV - A New Hope (1977)
## 3 Silence of the Lambs, The (1991)
## 4 Matrix, The (1999)
## 5 American Beauty (1999)
## 6 Usual Suspects, The (1995)
## 7 Fight Club (1999)
## 8 Lord of the Rings: The Return of the King, The (2003)
## 9 Lion King, The (1994)
## 10 Lord of the Rings: The Two Towers, The (2002)
Calculate the similarity of one movie to another based on how likely a user is to rate one given that they rated the other (ignoring the rating values), using the simple \(\frac{x \wedge y}{x}\) method. Here \(x\) is the number of users who rated the reference movie and \(x \wedge y\) is the number of users who rated both it and the other movie. Return the 10 movies most similar to movie 1389 (Jaws 3-D) using the simple metric, with their similarity scores, id number, and title.
subRatings <- ratings[,1:2]
allUsers <- unique(subRatings$userId)
movie1389ratings <- subset(ratings,movieId==1389)
usersWhoRated1389 <- unique(movie1389ratings$userId)
# Flag each rating row by whether its user also rated movie 1389 (vectorized, no loop needed)
subRatings$rated1389 <- as.integer(subRatings$userId %in% usersWhoRated1389)
subRatings$didntRate1389 <- 1L - subRatings$rated1389
head(subRatings)
## userId movieId rated1389 didntRate1389
## 1 1 1 0 1
## 2 1 2 0 1
## 3 1 10 0 1
## 4 1 32 0 1
## 5 1 34 0 1
## 6 1 47 0 1
sumXY <- aggregate(cbind(rated1389,didntRate1389) ~ movieId, subRatings, sum)
# (x AND y) / x: users who rated both movies, over users who rated 1389
sumXY$similarity1389 <- sumXY$rated1389/length(usersWhoRated1389)
sumXY <- join(sumXY, movies, by="movieId", type="left")
sumXY <- sumXY[,c(4,5,1)] # drop unwanted columns
sumXY <- arrange(sumXY,-similarity1389)
head(sumXY,11)
## similarity1389 title
## 1 1.0000000 Jaws 3-D (1983)
## 2 0.8666667 Toy Story (1995)
## 3 0.8666667 Shawshank Redemption, The (1994)
## 4 0.8666667 Jurassic Park (1993)
## 5 0.8666667 Terminator 2: Judgment Day (1991)
## 6 0.8666667 Silence of the Lambs, The (1991)
## 7 0.8666667 Independence Day (a.k.a. ID4) (1996)
## 8 0.8666667 Star Wars: Episode V - The Empire Strikes Back (1980)
## 9 0.8666667 Star Wars: Episode VI - Return of the Jedi (1983)
## 10 0.8000000 Braveheart (1995)
## 11 0.8000000 Star Wars: Episode IV - A New Hope (1977)
## movieId
## 1 1389
## 2 1
## 3 318
## 4 480
## 5 589
## 6 593
## 7 780
## 8 1196
## 9 1210
## 10 110
## 11 260
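Here we see that the most similar movie is Jaws 3-D itself, a good indication that the formula works: everyone who rated movie 1389 has, by definition, rated movie 1389, so its similarity to itself is exactly 1. That is why we print 11 rows rather than 10. Most of the other entries are simply the most-rated movies overall, which is the weakness of the simple metric.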
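The simple metric can be checked by hand on a tiny example. The sketch below uses made-up data and a hypothetical helper, `simpleSimilarity()`, neither of which is part of the assignment; like the code above, it assumes each user rates a movie at most once.

```r
# Toy data: which user rated which movie (rating values are ignored).
toy <- data.frame(
  userId  = c(1, 1, 2, 2, 3, 3, 4),
  movieId = c(10, 20, 10, 20, 10, 30, 20)
)

simpleSimilarity <- function(ratings, target) {
  # x: users who rated the target movie.
  xUsers <- unique(ratings$userId[ratings$movieId == target])
  # Flag rows whose user is in x, then count those rows per movie (x AND y).
  ratings$ratedTarget <- as.integer(ratings$userId %in% xUsers)
  both <- aggregate(ratedTarget ~ movieId, ratings, sum)
  both$similarity <- both$ratedTarget / length(xUsers)
  both[order(-both$similarity), c("movieId", "similarity")]
}

sims <- simpleSimilarity(toy, 10)
print(sims)
```

Users 1, 2 and 3 rated movie 10; two of them also rated movie 20 and one rated movie 30, so the similarities are 1, 2/3 and 1/3.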
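Now we calculate the similarity of one movie to another using the advanced \(\frac{\frac{x \wedge y}{1+x}}{1+\frac{!x \wedge y}{1+!x}}\) method, whose denominator grows with how often users who did not rate x still rated y, so movies that are popular with everyone are penalized. We will return the 10 movies most similar to movie 1389 (Jaws 3-D) using the advanced metric, with their similarity scores, id number, and title.
First we need to recalculate sumXY from subRatings, since the previous step reduced it to three columns, this time adding 1 to each denominator. Note that the code below also adds 1 to the count of users who rated y but not x, so the term it computes is \(\frac{1+(!x \wedge y)}{1+!x}\).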
sumXY <- aggregate(cbind(rated1389,didntRate1389) ~ movieId, subRatings, sum)
# (x AND y) / (1 + x): users who rated both, over users who rated 1389
sumXY$similarity1389 <- sumXY$rated1389/(1+length(usersWhoRated1389))
# (1 + (!x AND y)) / (1 + !x): users who rated y but not 1389, over users who did not rate 1389
sumXY$dissimilarity1389 <- (1+sumXY$didntRate1389)/(1+length(allUsers) - length(usersWhoRated1389))
sumXY$index <- sumXY$similarity1389/(1+sumXY$dissimilarity1389)
sumXY <- join(sumXY, movies, by="movieId", type="left")
sumXY <- sumXY[,c(6,7,1)] # drop unwanted columns
sumXY <- arrange(sumXY,-index)
head(sumXY,11)
## index title movieId
## 1 0.9361702 Jaws 3-D (1983) 1389
## 2 0.6759777 Jaws 2 (1978) 1388
## 3 0.6641509 Full Metal Jacket (1987) 1222
## 4 0.6583541 Jaws (1975) 1387
## 5 0.6583541 Back to the Future Part III (1990) 2012
## 6 0.6575342 Stand by Me (1986) 1259
## 7 0.6410596 Unforgiven (1992) 1266
## 8 0.6300716 Clear and Present Danger (1994) 349
## 9 0.6293206 Die Hard (1988) 1036
## 10 0.6293206 Terminator, The (1984) 1240
## 11 0.6285714 Star Wars: Episode V - The Empire Strikes Back (1980) 1196
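The advanced metric can also be verified on a small made-up example. The sketch below mirrors the calculation as coded above, including the 1 added to the count of users who rated y but not x; the toy data and the helper name `advancedSimilarity()` are assumptions for illustration.

```r
# Toy data: which user rated which movie (rating values are ignored).
toy <- data.frame(
  userId  = c(1, 1, 2, 2, 3, 3, 4),
  movieId = c(10, 20, 10, 20, 10, 30, 20)
)

advancedSimilarity <- function(ratings, target) {
  allUsers <- unique(ratings$userId)
  xUsers   <- unique(ratings$userId[ratings$movieId == target])
  nX    <- length(xUsers)             # users who rated the target
  nNotX <- length(allUsers) - nX      # users who did not

  # Per movie: ratings from users in x, and ratings from users not in x.
  ratings$withX    <- as.integer(ratings$userId %in% xUsers)
  ratings$withoutX <- 1L - ratings$withX
  counts <- aggregate(cbind(withX, withoutX) ~ movieId, ratings, sum)

  similarity    <- counts$withX / (1 + nX)
  dissimilarity <- (1 + counts$withoutX) / (1 + nNotX)
  counts$index  <- similarity / (1 + dissimilarity)
  counts[order(-counts$index), c("movieId", "index")]
}

adv <- advancedSimilarity(toy, 10)
print(adv)
```

Here movie 20 is rated by two of the three users who rated movie 10, but also by the only user who did not, so its index drops to 0.25, half the value of movie 10 itself.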