University of Minnesota’s Introduction to Recommender Systems, Assignment 1.

Implementation in R:

suppressWarnings(suppressMessages(require(operators)))
suppressWarnings(suppressMessages(require(plyr)))
movies <- read.csv("movies.csv",header=TRUE)
ratings <- read.csv("ratings.csv",header=TRUE)

Calculate each movie’s mean rating and return the top 10 movies by mean rating, with their mean, id number, and title.

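# Mean rating per movie
meanRatings <- aggregate(rating ~ movieId, ratings, mean)
# Attach titles, then reorder the columns to title, rating, movieId
joinedRatings <- join(meanRatings, movies, by = "movieId", type = "left")
reducedRatings <- joinedRatings[,3:1]
# Sort by mean rating, highest first
sortedRatings <- arrange(reducedRatings,-rating)
head(sortedRatings,10)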
##                                                             title rating
## 1                      Nobody Loves Me (Keiner liebt mich) (1994)      5
## 2                                        Margaret's Museum (1995)      5
## 3                                           Jupiter's Wife (1994)      5
## 4                                           Pie in the Sky (1996)      5
## 5                                          Neon Bible, The (1995)      5
## 6                           Before the Rain (Pred dozhdot) (1994)      5
## 7                                                  Panther (1995)      5
## 8  Red Firecracker, Green Firecracker (Pao Da Shuang Deng) (1994)      5
## 9                                         To Live (Huozhe) (1994)      5
## 10                                           Jason's Lyric (1994)      5
##    movieId
## 1      106
## 2      114
## 3      128
## 4      129
## 5      138
## 6      214
## 7      297
## 8      309
## 9      326
## 10     391
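
Every movie in this list has a perfect mean of 5, which is what we would expect if each was rated by only a handful of users. The following is a minimal sketch on made-up toy data (the `toy` data frame and its values are hypothetical, not taken from ratings.csv) showing how a single 5-star rating can outrank a well-rated movie when we sort by the raw mean.

```r
# Toy ratings: hypothetical values for illustration only
toy <- data.frame(
  movieId = c(1, 1, 1, 1, 2, 3, 3, 3),
  rating  = c(5, 5, 4, 5, 5, 2, 3, 2)
)

# Mean and number of ratings per movie (base R only)
means  <- aggregate(rating ~ movieId, toy, mean)
counts <- aggregate(rating ~ movieId, toy, length)
names(means)[2]  <- "meanRating"
names(counts)[2] <- "numRatings"

# Combine and sort by mean rating, highest first
stats <- merge(means, counts, by = "movieId")
stats <- stats[order(-stats$meanRating), ]
print(stats)
```

In this toy data, movie 2 comes first on the strength of one rating, ahead of movie 1 with four ratings averaging 4.75.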

Calculate a movie’s number of ratings and return the 10 movies with the most ratings, with their number of ratings, id number, and title.

numRatingsTable <- table(ratings$movieId)
numRatingsDF <- as.data.frame(numRatingsTable)
colnames(numRatingsDF) <- c("movieId","numRatings")
numRatingsSorted <- arrange(numRatingsDF,-numRatings)
numRatingsSorted <- join(numRatingsSorted, movies, by="movieId", type="left")
numRatingsSorted <- numRatingsSorted[,-4] # shed genre column
head(numRatingsSorted,10)
##    movieId numRatings                                     title
## 1      296        333                       Pulp Fiction (1994)
## 2      356        321                       Forrest Gump (1994)
## 3      318        310          Shawshank Redemption, The (1994)
## 4      480        307                      Jurassic Park (1993)
## 5      593        299          Silence of the Lambs, The (1991)
## 6      260        265 Star Wars: Episode IV - A New Hope (1977)
## 7        1        263                          Toy Story (1995)
## 8     2571        247                        Matrix, The (1999)
## 9      110        244                         Braveheart (1995)
## 10     589        244         Terminator 2: Judgment Day (1991)
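
One detail to keep in mind when reusing numRatingsDF: a data frame built from `table()` stores the movie ids as a factor, not as integers. The sketch below uses hypothetical toy ids to show the difference between the factor's internal codes and the original ids, and one common way to convert back.

```r
# Hypothetical movie ids, one entry per rating
toyIds <- c(10, 10, 20, 300)

counts <- as.data.frame(table(toyIds))
colnames(counts) <- c("movieId", "numRatings")

# The id column is a factor
isFactor <- is.factor(counts$movieId)

# as.integer() on a factor returns level codes (1, 2, 3), not the ids
codes <- as.integer(counts$movieId)

# Going through character first recovers the real ids (10, 20, 300)
counts$movieId <- as.integer(as.character(counts$movieId))
print(counts)
```

Converting the key this way keeps it the same type as `movieId` in ratings and movies, which avoids surprises in later joins.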

Calculate a movie’s damped mean, with a damping term of 5. Return the top 10 movies by damped mean rating, with their damped mean, id number, and title.

\[\frac{\sum_{u}r_{ui}+k\mu}{n+k}\]

Here the sum runs over all users’ ratings \(r_{ui}\) of the item, \(k\) is the damping term, \(n\) is the number of ratings for the item, and \(\mu\) is the global mean rating.
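
As a quick check of the formula, here is a minimal sketch on hypothetical toy data (the `toy` data frame is made up for illustration). Because the damped mean is a weighted blend of a movie's own ratings and the global mean, every result should fall between the lowest and highest rating in the data.

```r
# Toy ratings: hypothetical values for illustration only
toy <- data.frame(
  movieId = c(1, 1, 1, 1, 2, 3, 3, 3),
  rating  = c(5, 5, 4, 5, 5, 2, 3, 2)
)

k  <- 5                 # damping term
mu <- mean(toy$rating)  # global mean over all ratings

# Per-movie sum and count; both results are ordered by movieId
sums   <- aggregate(rating ~ movieId, toy, sum)
counts <- aggregate(rating ~ movieId, toy, length)

damped <- data.frame(
  movieId    = sums$movieId,
  dampedMean = (sums$rating + k * mu) / (counts$rating + k)
)
print(damped[order(-damped$dampedMean), ])
```

In this toy data, movie 2 has a raw mean of 5 from a single rating, but its damped mean drops below that of movie 1, which has four ratings.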

globalMean <- mean(ratings$rating)
ratingSums <- aggregate(rating ~ movieId, ratings, sum)
ratingsByMovie <- join(ratingSums, numRatingsDF, by="movieId",type="left")
ratingsByMovie$dampedMean <- (ratingsByMovie$rating + 5*globalMean)/(ratingsByMovie$numRatings + 5)
ratingsByMovie <- arrange(ratingsByMovie,-dampedMean)
ratingsByMovie <- ratingsByMovie[,-2:-3] # drop columns
ratingsByMovie <- join(ratingsByMovie, movies, by="movieId", type="left") # add titles
ratingsByMovie <- ratingsByMovie[,-4] # drop genre column after join
head(ratingsByMovie,10)
##    movieId dampedMean
## 1      356   191.2116
## 2      260   185.9135
## 3      593   184.1401
## 4     2571   177.4968
## 5     2858   153.2468
## 6       50   147.0687
## 7     2959   142.1635
## 8     7153   127.3302
## 9      364   127.1635
## 10    5952   126.4968
##                                                    title
## 1                                    Forrest Gump (1994)
## 2              Star Wars: Episode IV - A New Hope (1977)
## 3                       Silence of the Lambs, The (1991)
## 4                                     Matrix, The (1999)
## 5                                 American Beauty (1999)
## 6                             Usual Suspects, The (1995)
## 7                                      Fight Club (1999)
## 8  Lord of the Rings: The Return of the King, The (2003)
## 9                                  Lion King, The (1994)
## 10         Lord of the Rings: The Two Towers, The (2002)

Note that these values are far above the maximum rating of 5, so they cannot be valid damped means. The likely cause is the join of ratingSums with numRatingsDF: movieId is an integer in the former but a factor in the latter (it comes from table()), so each movie can be paired with the wrong number of ratings. Converting the key first with as.integer(as.character(numRatingsDF$movieId)) should bring the results back within the rating scale.

Calculate the similarity of one movie to another based on how likely a user is to rate one given that they rated the other (ignoring the rating values), using the simple \(\frac{x \wedge y}{x}\) method, where \(x\) is the number of users who rated the reference movie and \(x \wedge y\) is the number who rated both it and the candidate movie. Return the 10 movies most similar to movie 1389 (Jaws 3-D) using the simple metric, with their similarity scores, id number, and title.

subRatings <- ratings[,1:2]
allUsers <- unique(subRatings$userId)
movie1389ratings <- subset(ratings,movieId==1389)
usersWhoRated1389 <- unique(movie1389ratings$userId)
# Flag each rating row by whether its user also rated movie 1389
# (vectorized; equivalent to looping over every row)
subRatings$rated1389 <- as.numeric(subRatings$userId %in% usersWhoRated1389)
subRatings$didntRate1389 <- 1 - subRatings$rated1389

head(subRatings)
##   userId movieId rated1389 didntRate1389
## 1      1       1         0             1
## 2      1       2         0             1
## 3      1      10         0             1
## 4      1      32         0             1
## 5      1      34         0             1
## 6      1      47         0             1
# Per movie: how many of its raters did / did not also rate movie 1389
sumXY <- aggregate(cbind(rated1389,didntRate1389) ~ movieId, subRatings, sum)
# Simple similarity: co-raters divided by the number of users who rated 1389
sumXY$similarity1389 <- sumXY$rated1389/length(usersWhoRated1389)
sumXY <- join(sumXY, movies, by="movieId", type="left") 
sumXY <- sumXY[,c(4,5,1)] # keep similarity1389, title, movieId
sumXY <- arrange(sumXY,-similarity1389)
head(sumXY,11)
##    similarity1389                                                 title
## 1       1.0000000                                       Jaws 3-D (1983)
## 2       0.8666667                                      Toy Story (1995)
## 3       0.8666667                      Shawshank Redemption, The (1994)
## 4       0.8666667                                  Jurassic Park (1993)
## 5       0.8666667                     Terminator 2: Judgment Day (1991)
## 6       0.8666667                      Silence of the Lambs, The (1991)
## 7       0.8666667                  Independence Day (a.k.a. ID4) (1996)
## 8       0.8666667 Star Wars: Episode V - The Empire Strikes Back (1980)
## 9       0.8666667     Star Wars: Episode VI - Return of the Jedi (1983)
## 10      0.8000000                                     Braveheart (1995)
## 11      0.8000000             Star Wars: Episode IV - A New Hope (1977)
##    movieId
## 1     1389
## 2        1
## 3      318
## 4      480
## 5      589
## 6      593
## 7      780
## 8     1196
## 9     1210
## 10     110
## 11     260

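Here we see that the most similar movie is Jaws 3-D itself, which is a good sanity check: every user who rated it trivially also rated it, so its score is exactly 1. The rest of the list consists largely of movies that also appear among the most-rated titles above, which suggests that the simple metric favors popular movies.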
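
The simple metric can be sketched on a small example. The toy data below is hypothetical (four users, three movies, one row per user-movie pair), and `target` plays the role of movie 1389.

```r
# Toy data: one row per (user, movie) pair; values are hypothetical
toy <- data.frame(
  userId  = c(1, 1, 2, 2, 2, 3, 3, 4),
  movieId = c(10, 20, 10, 20, 30, 20, 30, 20)
)
target <- 10

# x: users who rated the target movie
usersX <- unique(toy$userId[toy$movieId == target])

# x AND y: for each movie, count its raters who also rated the target
both <- tapply(toy$userId %in% usersX, toy$movieId, sum)

# Simple similarity (x AND y) / x, named by movieId
simple <- both / length(usersX)
print(sort(simple, decreasing = TRUE))
```

In this toy data, movie 20 was rated by every user, so it ties with the target itself at 1, which is the same popularity effect visible in the real results.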

Now we calculate the similarity of one movie to another, using the advanced \(\frac{\frac{x \wedge y}{1+x}}{1+\frac{!x \wedge y}{1+!x}}\) method. We will return the 10 movies most similar to movie 1389 (Jaws 3-D) using the advanced metric, with their similarity scores, id number, and title.

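First we recalculate sumXY, this time adding 1 to each denominator. Note that the code below also adds 1 to the numerator of the \(!x\) term, a small departure from the formula as written.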

sumXY <- aggregate(cbind(rated1389,didntRate1389) ~ movieId, subRatings, sum)
# (x and y) / (1 + x): co-raters over users who rated 1389
sumXY$similarity1389 <- sumXY$rated1389/(1+length(usersWhoRated1389))
# (!x and y) / (1 + !x), with an extra 1 added to the numerator here
sumXY$dissimilarity1389 <- (1+sumXY$didntRate1389)/(1+length(allUsers) - length(usersWhoRated1389))
# Final score: similarity penalized by popularity among users who did not rate 1389
sumXY$index <- sumXY$similarity1389/(1+sumXY$dissimilarity1389)

sumXY <- join(sumXY, movies, by="movieId", type="left") 
sumXY <- sumXY[,c(6,7,1)] # drop unwanted columns
sumXY <- arrange(sumXY,-index)
head(sumXY,11)
##        index                                                 title movieId
## 1  0.9361702                                       Jaws 3-D (1983)    1389
## 2  0.6759777                                         Jaws 2 (1978)    1388
## 3  0.6641509                              Full Metal Jacket (1987)    1222
## 4  0.6583541                                           Jaws (1975)    1387
## 5  0.6583541                    Back to the Future Part III (1990)    2012
## 6  0.6575342                                    Stand by Me (1986)    1259
## 7  0.6410596                                     Unforgiven (1992)    1266
## 8  0.6300716                       Clear and Present Danger (1994)     349
## 9  0.6293206                                       Die Hard (1988)    1036
## 10 0.6293206                                Terminator, The (1984)    1240
## 11 0.6285714 Star Wars: Episode V - The Empire Strikes Back (1980)    1196
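
To see why the advanced metric ranks differently from the simple one, here is a minimal sketch on hypothetical toy data. It follows the formula exactly as written above, so unlike the code in this section it does not add 1 to the numerator of the \(!x\) term. The data and the `target` id are made up for illustration.

```r
# Toy data: one row per (user, movie) pair; values are hypothetical
toy <- data.frame(
  userId  = c(1, 1, 2, 2, 2, 3, 3, 4),
  movieId = c(10, 20, 10, 20, 30, 20, 30, 20)
)
target <- 10

allUsers <- unique(toy$userId)
usersX   <- unique(toy$userId[toy$movieId == target])
ratedX   <- toy$userId %in% usersX

# Per movie: raters who also rated the target, and raters who did not
both  <- tapply(ratedX,  toy$movieId, sum)
onlyY <- tapply(!ratedX, toy$movieId, sum)

nX    <- length(usersX)             # x: users who rated the target
nNotX <- length(allUsers) - nX      # !x: users who did not

# Advanced similarity, as written in the formula
advanced <- (both / (1 + nX)) / (1 + onlyY / (1 + nNotX))
print(sort(advanced, decreasing = TRUE))
```

In this toy data, movie 20 is rated by everyone, including both users who never rated the target, so its score falls to 0.4 while the target keeps the top score of about 0.67. The denominator is what penalizes movies that are popular regardless of the target.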