Let’s load the dataset (from the Coursera course Introduction to Recommender Systems), along with a couple of commonly used packages (we won’t actually need most of them):
setwd("C:/Users/ftorrent/Desktop/MovieLens Recommender System/Recommender by fatty")
library(knitr)
library(class)
recommender <- read.csv("C:/Users/ftorrent/Desktop/MovieLens Recommender System/Recommender by fatty/recommender1.csv")
We should also use the movie titles as row names and rename the user columns, for example from X1650 to user1650:
row.names(recommender) <- recommender[,1]
recommender <- recommender[,-1]
names(recommender) <- gsub("X", "user", names(recommender))
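The X prefix appears because read.csv, by default, turns column names that start with a digit into syntactically valid R names. A minimal sketch of the same three steps on a made-up two-movie file (the header name movie and all values are assumptions, not taken from recommender1.csv):

```r
# Made-up CSV content standing in for the real file (illustrative only).
csv.text <- "movie,1648,1649
1: Movie A,4,NA
2: Movie B,NA,3.5"

toy <- read.csv(text = csv.text)
names(toy)                     # numeric headers are now prefixed with "X"

# Same steps as above: titles become row names, "X" becomes "user".
row.names(toy) <- toy[, 1]
toy <- toy[, -1]
names(toy) <- gsub("X", "user", names(toy))
toy
```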
Now the data is in reasonable shape, with the movies as rows and the users as columns.
Since we are applying the kNN algorithm, we need the ratings to be unary (movie seen = 1, movie not seen = 0). So, for now, we ignore the rating itself and focus only on whether a user has seen the movie (for simplicity, we assume that a user who has seen a movie liked it).
First, we set all NA values to 0:
recommender[is.na(recommender)]<-0
Then, we need to set all other values to 1:
recommender[recommender >0] <- 1
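To make the two steps concrete, here is a small sketch on a made-up ratings table (the movie names, user names and ratings are invented for illustration):

```r
# Made-up ratings: NA means the user has not rated (seen) the movie.
toy <- data.frame(
  user1 = c(4.5, NA, 3),
  user2 = c(NA, NA, 5),
  row.names = c("Movie A", "Movie B", "Movie C")
)

toy[is.na(toy)] <- 0   # not seen -> 0
toy[toy > 0] <- 1      # any rating -> 1
toy
```

After both assignments every cell is either 0 or 1, regardless of the original rating.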
So we now have the data frame we wanted: unary, with items as rows and users as columns. Let’s apply the kNN algorithm.
To begin with, we compute the distance matrix between all pairs of movies (how similar or different they are, based on which users have seen them):
distances <- as.matrix(dist(recommender, method="euclidean"))
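On unary data, the Euclidean distance between two movies is simply the square root of the number of users who have seen one movie but not the other. A short sketch with a made-up matrix (three movies, four users) illustrates this:

```r
# Made-up unary matrix: rows are movies, columns are users.
toy <- rbind(
  "Movie A" = c(1, 1, 0, 0),
  "Movie B" = c(1, 1, 1, 0),
  "Movie C" = c(0, 0, 1, 1)
)

toy.distances <- as.matrix(dist(toy, method = "euclidean"))
toy.distances

# A and B differ for one user, A and C for all four users.
toy.distances["Movie A", "Movie B"]   # sqrt(1)
toy.distances["Movie A", "Movie C"]   # sqrt(4)
```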
Now, the function k.nearest.neighbors, which returns the row indices of the k nearest neighbors of row i in the distance matrix:
k.nearest.neighbors <- function(i, distances, k = 5)
{
  # Sort row i of the distance matrix in ascending order, so that
  # the closest points come first.
  ordered.neighbors <- order(distances[i, ])
  # The first entry is normally i itself (its distance to itself is 0), so
  # let's ignore that entry and return points 2:(k+1) instead of 1:k
  return(ordered.neighbors[2:(k + 1)])
}
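One caveat with skipping the first entry: if another movie has exactly the same viewing pattern, its distance is also 0, and order() leaves ties in their original order, so the movie itself is not guaranteed to come first. The sketch below shows this on a made-up matrix and uses a hypothetical helper, nearest.excluding.self (not part of the code above), that drops the index explicitly instead:

```r
# Made-up unary matrix; movies A and B have identical viewers.
toy <- rbind(
  A = c(1, 0, 1),
  B = c(1, 0, 1),
  C = c(0, 1, 1),
  D = c(1, 1, 1)
)
toy.distances <- as.matrix(dist(toy, method = "euclidean"))

# Hypothetical helper: remove index i by value rather than by position.
nearest.excluding.self <- function(i, distances, k = 5) {
  ordered <- order(distances[i, ])
  head(ordered[ordered != i], k)
}

# For movie B (row 2), positions 2:(k+1) of the plain ordering contain B itself:
plain <- order(toy.distances[2, ])[2:3]
row.names(toy)[plain]

# The helper returns B's two actual neighbors:
safe <- nearest.excluding.self(2, toy.distances, k = 2)
row.names(toy)[safe]
```

With real rating data, exact duplicates may be rare, so this is a possible refinement rather than a required change.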
Now, here’s a function to compute the probability that a movie has been seen by a given user:
seen.probability <- function(user, movie, recommender, distances, k = 25)
{
  # Find the k movies closest to the movie in question.
  neighbors <- k.nearest.neighbors(which(row.names(recommender) == movie), distances, k)
  # Return the fraction of those neighboring movies that this user has seen,
  # which we use as the estimated probability.
  return(mean(recommender[neighbors, user]))
}
For example:
seen.probability("user1648", "13: Forrest Gump (1994)", recommender, distances)
## [1] 0.52
This tells us that the estimated probability that user 1648 has seen Forrest Gump is 52%.
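Because the data is unary, the mean over the neighbors is just the share of neighboring movies the user has seen. With the default k = 25, a value of 0.52 corresponds to 13 seen movies out of 25, as this small check on a made-up vector shows:

```r
# Made-up seen/not-seen flags for 25 neighboring movies: 13 seen, 12 not seen.
neighbor.seen <- c(rep(1, 13), rep(0, 12))

length(neighbor.seen)   # 25 neighbors
sum(neighbor.seen)      # 13 of them seen
mean(neighbor.seen)     # share of neighbors seen
```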
So if we can estimate the probability that a user has seen a given movie, we can run this function over every movie the user has not yet seen and pick the ones with the highest probability. The following function does this. It takes the user in question, the recommender data.frame, the distances between movies, and the number of neighbors k, and returns the row indices of the movies, ordered from most to least recommended for that user:
most.probable.movies <- function(user, recommender, distances, k = 25)
{
  # Movies the user has already seen keep a score of 0.
  probabilities <- rep(0, nrow(recommender))
  for (i in seq_len(nrow(recommender))) { # For each movie
    if (recommender[i, user] == 1) {
      next # The user has already seen the movie.
    }
    probabilities[i] <- seen.probability(user, row.names(recommender)[i], recommender, distances, k)
  }
  return(order(probabilities, decreasing = TRUE))
}
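Note that movies the user has already seen are not removed from the result; they only sink to the bottom with a score of 0, alongside any unseen movie that also scores 0. A possible variant, sketched here on made-up scores rather than on the real data, filters out the seen movies before ranking:

```r
# Made-up scores and viewing flags for four movies (illustrative only).
probabilities <- c(A = 0.9, B = 0.4, C = 0.7, D = 0.2)
seen <- c(A = 1, B = 0, C = 0, D = 1)

# Keep only the unseen movies, then rank them from most to least probable.
unseen.scores <- probabilities[seen == 0]
ranked <- names(sort(unseen.scores, decreasing = TRUE))
ranked
```

Here only the unseen movies B and C can be recommended, with C ranked first because of its higher score.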
So, for instance, let’s recommend to user 860 the 15 movies he or she is most likely to watch:
user <- "user860"
listing <- most.probable.movies(user, recommender, distances)
rownames(recommender)[listing[1:15]]
## [1] "453: A Beautiful Mind (2001)"
## [2] "278: The Shawshank Redemption (1994)"
## [3] "180: Minority Report (2002)"
## [4] "121: The Lord of the Rings: The Two Towers (2002)"
## [5] "954: Mission: Impossible (1996)"
## [6] "1894: Star Wars: Episode II - Attack of the Clones (2002)"
## [7] "187: Sin City (2005)"
## [8] "558: Spider-Man 2 (2004)"
## [9] "955: Mission: Impossible II (2000)"
## [10] "38: Eternal Sunshine of the Spotless Mind (2004)"
## [11] "114: Pretty Woman (1990)"
## [12] "424: Schindler's List (1993)"
## [13] "36658: X2: X-Men United (2003)"
## [14] "120: The Lord of the Rings: The Fellowship of the Ring (2001)"
## [15] "122: The Lord of the Rings: The Return of the King (2003)"
So we have the list of movies that user 860 is most likely to watch and, under our simplifying assumption, to enjoy.