Recommendation systems (sometimes called recommender systems) are a collection of algorithms used to recommend items to users based on information about those users. They have become ubiquitous and are commonly seen in online stores, movie databases and job finders. In this notebook, we will explore recommendation systems based on Collaborative Filtering and implement a simple version of one using R.
Acquiring the data
# rating dataset
download.file("https://ibm.box.com/shared/static/q61myoukbyz969b97ddlcq0dny0l07bf.dat", "ratings.dat")
#Movie dataset
download.file("https://ibm.box.com/shared/static/dn84btkn9gmxmdau32c5xb0vamie6jy4.dat", "movies.dat")
Preprocessing
Let’s begin by loading the data into their dataframes:
#Loading the movie information into a dataframe
movies_df <- read.csv('movies.dat', header = FALSE, sep=":")
# head is a function that returns the first 6 rows of a dataframe by default
head(movies_df)
## V1 V2 V3 V4 V5
## 1 1 Toy Story (1995) Animation|Children's|Comedy
## 2 2 Jumanji (1995) Adventure|Children's|Fantasy
## 3 3 Grumpier Old Men (1995) Comedy|Romance
## 4 4 Waiting to Exhale (1995) Comedy|Drama
## 5 5 Father of the Bride Part II (1995) Comedy
## 6 6 Heat (1995) Action|Crime|Thriller
#Loading the ratings information into a dataframe
ratings_df <- read.csv('ratings.dat', header = FALSE, sep=":")
# Alternatively let's look at the first 20 rows of the dataframe
head(ratings_df, 20)
## V1 V2 V3 V4 V5 V6 V7
## 1 1 NA 1193 NA 5 NA 978300760
## 2 1 NA 661 NA 3 NA 978302109
## 3 1 NA 914 NA 3 NA 978301968
## 4 1 NA 3408 NA 4 NA 978300275
## 5 1 NA 2355 NA 5 NA 978824291
## 6 1 NA 1197 NA 3 NA 978302268
## 7 1 NA 1287 NA 5 NA 978302039
## 8 1 NA 2804 NA 5 NA 978300719
## 9 1 NA 594 NA 4 NA 978302268
## 10 1 NA 919 NA 4 NA 978301368
## 11 1 NA 595 NA 5 NA 978824268
## 12 1 NA 938 NA 4 NA 978301752
## 13 1 NA 2398 NA 4 NA 978302281
## 14 1 NA 2918 NA 4 NA 978302124
## 15 1 NA 1035 NA 5 NA 978301753
## 16 1 NA 2791 NA 4 NA 978302188
## 17 1 NA 2687 NA 3 NA 978824268
## 18 1 NA 2018 NA 4 NA 978301777
## 19 1 NA 3105 NA 5 NA 978301713
## 20 1 NA 2797 NA 4 NA 978302039
You can see here that some issues arise when reading the data. Empty columns appear between the real fields, and movies that have a colon in the title generate additional fields, so that column 4, for example, holds the part of a movie’s title that appears after the colon. We will now run some code to deal with these issues.
Let’s have a look at the raw data to see what may be causing the problem.
We will do this by using the function readLines to store the raw data and using the head function to preview it.
# Here we read the movies data again in the raw format and display the first few rows
lines <- readLines("movies.dat")
head(lines, 20)
## [1] "1::Toy Story (1995)::Animation|Children's|Comedy"
## [2] "2::Jumanji (1995)::Adventure|Children's|Fantasy"
## [3] "3::Grumpier Old Men (1995)::Comedy|Romance"
## [4] "4::Waiting to Exhale (1995)::Comedy|Drama"
## [5] "5::Father of the Bride Part II (1995)::Comedy"
## [6] "6::Heat (1995)::Action|Crime|Thriller"
## [7] "7::Sabrina (1995)::Comedy|Romance"
## [8] "8::Tom and Huck (1995)::Adventure|Children's"
## [9] "9::Sudden Death (1995)::Action"
## [10] "10::GoldenEye (1995)::Action|Adventure|Thriller"
## [11] "11::American President, The (1995)::Comedy|Drama|Romance"
## [12] "12::Dracula: Dead and Loving It (1995)::Comedy|Horror"
## [13] "13::Balto (1995)::Animation|Children's"
## [14] "14::Nixon (1995)::Drama"
## [15] "15::Cutthroat Island (1995)::Action|Adventure|Romance"
## [16] "16::Casino (1995)::Drama|Thriller"
## [17] "17::Sense and Sensibility (1995)::Drama|Romance"
## [18] "18::Four Rooms (1995)::Thriller"
## [19] "19::Ace Ventura: When Nature Calls (1995)::Comedy"
## [20] "20::Money Train (1995)::Action"
It would appear that for each line of the data, the information that would go into each column is separated by a double colon (::) as opposed to the single colon (:) we used for the sep value in our read.csv function call. However, the read.csv function only allows a single character as the field separator (sep) value.
We can use the function gsub to replace the double colons (::) in our data with the symbol tilde (~).
# Here we replace the sep character used in the data ("::") with one that does not appear in the data ("~")
lines <- gsub("::", "~", lines)
head(lines, 20)
## [1] "1~Toy Story (1995)~Animation|Children's|Comedy"
## [2] "2~Jumanji (1995)~Adventure|Children's|Fantasy"
## [3] "3~Grumpier Old Men (1995)~Comedy|Romance"
## [4] "4~Waiting to Exhale (1995)~Comedy|Drama"
## [5] "5~Father of the Bride Part II (1995)~Comedy"
## [6] "6~Heat (1995)~Action|Crime|Thriller"
## [7] "7~Sabrina (1995)~Comedy|Romance"
## [8] "8~Tom and Huck (1995)~Adventure|Children's"
## [9] "9~Sudden Death (1995)~Action"
## [10] "10~GoldenEye (1995)~Action|Adventure|Thriller"
## [11] "11~American President, The (1995)~Comedy|Drama|Romance"
## [12] "12~Dracula: Dead and Loving It (1995)~Comedy|Horror"
## [13] "13~Balto (1995)~Animation|Children's"
## [14] "14~Nixon (1995)~Drama"
## [15] "15~Cutthroat Island (1995)~Action|Adventure|Romance"
## [16] "16~Casino (1995)~Drama|Thriller"
## [17] "17~Sense and Sensibility (1995)~Drama|Romance"
## [18] "18~Four Rooms (1995)~Thriller"
## [19] "19~Ace Ventura: When Nature Calls (1995)~Comedy"
## [20] "20~Money Train (1995)~Action"
# Now we recreate the movies dataframe using the updated data
movies_df <- read.csv(text=lines, sep="~", header = FALSE)
head(movies_df, 20)
## V1 V2 V3
## 1 1 Toy Story (1995) Animation|Children's|Comedy
## 2 2 Jumanji (1995) Adventure|Children's|Fantasy
## 3 3 Grumpier Old Men (1995) Comedy|Romance
## 4 4 Waiting to Exhale (1995) Comedy|Drama
## 5 5 Father of the Bride Part II (1995) Comedy
## 6 6 Heat (1995) Action|Crime|Thriller
## 7 7 Sabrina (1995) Comedy|Romance
## 8 8 Tom and Huck (1995) Adventure|Children's
## 9 9 Sudden Death (1995) Action
## 10 10 GoldenEye (1995) Action|Adventure|Thriller
## 11 11 American President, The (1995) Comedy|Drama|Romance
## 12 12 Dracula: Dead and Loving It (1995) Comedy|Horror
## 13 13 Balto (1995) Animation|Children's
## 14 14 Nixon (1995) Drama
## 15 15 Cutthroat Island (1995) Action|Adventure|Romance
## 16 16 Casino (1995) Drama|Thriller
## 17 17 Sense and Sensibility (1995) Drama|Romance
## 18 18 Four Rooms (1995) Thriller
## 19 19 Ace Ventura: When Nature Calls (1995) Comedy
## 20 20 Money Train (1995) Action
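As an alternative to swapping the separator, each raw line could be split on the literal "::" directly. The following is a minimal sketch on two sample lines in the same layout as movies.dat; the names sample_lines, parts_matrix and sample_movies are made up for illustration and are not part of the notebook.

```r
# Two sample lines in the same "id::title::genres" layout as movies.dat
sample_lines <- c(
  "1::Toy Story (1995)::Animation|Children's|Comedy",
  "12::Dracula: Dead and Loving It (1995)::Comedy|Horror"
)

# fixed = TRUE treats "::" as a literal string rather than a regular expression,
# so the single colon inside a title is left alone
parts <- strsplit(sample_lines, "::", fixed = TRUE)

# Stack the pieces row by row, then build a data frame with named columns
parts_matrix <- do.call(rbind, parts)
sample_movies <- data.frame(
  movieId = as.integer(parts_matrix[, 1]),
  title   = parts_matrix[, 2],
  genres  = parts_matrix[, 3],
  stringsAsFactors = FALSE
)
print(sample_movies)
```

This sketch assumes every line has exactly three fields; a line with a different number of fields would need separate handling.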
So each movie has a unique ID, a title that includes its release year, and several different genres in the same field. Let’s name the columns and then use R’s handy “sub” function to clean any trailing whitespace from the title column. The release year stays part of the title, which is how the titles are matched later on.
names(movies_df)[names(movies_df)=="V1"] = "movieId"
names(movies_df)[names(movies_df)=="V2"] = "title"
names(movies_df)[names(movies_df)=="V3"] = "genres"
#Using sub with a regular expression to get rid of any trailing whitespace characters that may have appeared
movies_df$title = sub("\\s+$", "", movies_df$title)
head(movies_df, 20)
## movieId title genres
## 1 1 Toy Story (1995) Animation|Children's|Comedy
## 2 2 Jumanji (1995) Adventure|Children's|Fantasy
## 3 3 Grumpier Old Men (1995) Comedy|Romance
## 4 4 Waiting to Exhale (1995) Comedy|Drama
## 5 5 Father of the Bride Part II (1995) Comedy
## 6 6 Heat (1995) Action|Crime|Thriller
## 7 7 Sabrina (1995) Comedy|Romance
## 8 8 Tom and Huck (1995) Adventure|Children's
## 9 9 Sudden Death (1995) Action
## 10 10 GoldenEye (1995) Action|Adventure|Thriller
## 11 11 American President, The (1995) Comedy|Drama|Romance
## 12 12 Dracula: Dead and Loving It (1995) Comedy|Horror
## 13 13 Balto (1995) Animation|Children's
## 14 14 Nixon (1995) Drama
## 15 15 Cutthroat Island (1995) Action|Adventure|Romance
## 16 16 Casino (1995) Drama|Thriller
## 17 17 Sense and Sensibility (1995) Drama|Romance
## 18 18 Four Rooms (1995) Thriller
## 19 19 Ace Ventura: When Nature Calls (1995) Comedy
## 20 20 Money Train (1995) Action
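The titles keep their release year in this notebook. If the year were wanted as a separate column, it could be split off with the same kind of regular expression used for the whitespace clean-up. A small sketch, assuming every title ends in a four-digit year in parentheses; the vectors titles, trimmed, year and bare_title are illustrative only.

```r
titles <- c("Toy Story (1995)  ", "Breakfast Club, The (1985)")

# Strip trailing whitespace (same effect as sub("\\s+$", "", titles))
trimmed <- trimws(titles, which = "right")

# Capture the four-digit year inside the final pair of parentheses
year <- as.integer(sub("^.*\\(([0-9]{4})\\)$", "\\1", trimmed))

# Title with the trailing " (YYYY)" removed
bare_title <- sub("[[:space:]]*\\([0-9]{4}\\)$", "", trimmed)

print(data.frame(title = trimmed, year = year, bare_title = bare_title))
```

Removing the year from the title column itself would also require the input titles to be written without it, since they are matched against this column.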
With that, let’s also drop the genres column since we won’t need it for this particular recommendation system.
#Dropping the genres column
movies_df$genres = NULL
Here’s the final movies dataframe:
head(movies_df,20)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## 7 7 Sabrina (1995)
## 8 8 Tom and Huck (1995)
## 9 9 Sudden Death (1995)
## 10 10 GoldenEye (1995)
## 11 11 American President, The (1995)
## 12 12 Dracula: Dead and Loving It (1995)
## 13 13 Balto (1995)
## 14 14 Nixon (1995)
## 15 15 Cutthroat Island (1995)
## 16 16 Casino (1995)
## 17 17 Sense and Sensibility (1995)
## 18 18 Four Rooms (1995)
## 19 19 Ace Ventura: When Nature Calls (1995)
## 20 20 Money Train (1995)
head(ratings_df)
## V1 V2 V3 V4 V5 V6 V7
## 1 1 NA 1193 NA 5 NA 978300760
## 2 1 NA 661 NA 3 NA 978302109
## 3 1 NA 914 NA 3 NA 978301968
## 4 1 NA 3408 NA 4 NA 978300275
## 5 1 NA 2355 NA 5 NA 978824291
## 6 1 NA 1197 NA 3 NA 978302268
Every row in the ratings dataframe holds a user id, the id of one movie, the rating the user gave it and a timestamp showing when they reviewed it. Let’s remove the empty columns, name the remaining ones accordingly and drop the timestamp column since we won’t be using it for this type of recommendation.
#Removing the empty columns (V2, V4 and V6) using the subset function.
#These columns were generated because the data is separated by "::" while the read.csv function only accepts single characters
#for the sep value such as ":" or "~", thus the read function assumed that our data was separated by single colons (":").
ratings_df <- subset( ratings_df, select = -c(V2, V4, V6 ))
head(ratings_df)
## V1 V3 V5 V7
## 1 1 1193 5 978300760
## 2 1 661 3 978302109
## 3 1 914 3 978301968
## 4 1 3408 4 978300275
## 5 1 2355 5 978824291
## 6 1 1197 3 978302268
Let’s name the columns in ratings_df as follows: V1 as userId, V3 as movieId, V5 as rating and V7 as timestamp. We then remove the timestamp column.
names(ratings_df)[names(ratings_df)=="V1"] = "userId"
names(ratings_df)[names(ratings_df)=="V3"] = "movieId"
names(ratings_df)[names(ratings_df)=="V5"] = "rating"
names(ratings_df)[names(ratings_df)=="V7"] = "timestamp"
ratings_df$timestamp = NULL
# Here's what the final ratings dataframe looks like:
head(ratings_df)
## userId movieId rating
## 1 1 1193 5
## 2 1 661 3
## 3 1 914 3
## 4 1 3408 4
## 5 1 2355 5
## 6 1 1197 3
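The ratings clean-up above (dropping the empty columns, renaming, removing the timestamp) could also be condensed by applying the same separator swap that was used for the movies file. A minimal sketch on two sample lines taken from the preview above; sample_rating_lines, clean_lines and sample_ratings are illustrative names.

```r
# Two raw lines in the "userId::movieId::rating::timestamp" layout
sample_rating_lines <- c(
  "1::1193::5::978300760",
  "1::661::3::978302109"
)

# Swap the two-character separator for a single character
clean_lines <- gsub("::", "~", sample_rating_lines, fixed = TRUE)

# col.names labels the columns at read time, so no empty columns appear
sample_ratings <- read.csv(
  text = clean_lines, sep = "~", header = FALSE,
  col.names = c("userId", "movieId", "rating", "timestamp")
)

# Drop the timestamp, which is not used for this recommendation
sample_ratings$timestamp <- NULL
print(sample_ratings)
```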
Collaborative Filtering
Now, time to start our work on recommendation systems.
The first technique we’re going to take a look at is user-based Collaborative Filtering, also known as User-User Filtering. As its alternate name hints, this technique uses other users’ data to recommend items to the input user. It attempts to find users whose preferences and opinions are similar to the input user’s and then recommends items that they have liked. There are several methods of finding similar users (some even make use of Machine Learning); the one we will be using here is based on the Pearson Correlation Coefficient.
The process for creating a User Based recommendation system is as follows:
1. Select a user with the movies the user has watched
2. Based on the user’s ratings of those movies, find the top X neighbours
3. Get the watched movie record of the user for each neighbour
4. Calculate a similarity score using some formula
5. Recommend the items with the highest score

Let’s begin by creating an input user to recommend movies to:
Notice: To add more movies, simply add more elements to inputUser. Feel free to add more! Just be sure to write each title exactly as it appears in the movies dataframe: with capital letters, with the release year in parentheses, and, if a movie starts with a “The”, like “The Breakfast Club”, with the article moved to the end, like this: ‘Breakfast Club, The (1985)’.
inputUser = data.frame("title"=c("Breakfast Club, The (1985)", "Toy Story (1995)", "Jumanji (1995)", "Pulp Fiction (1994)", "Akira (1988)"),
"rating"=c(5, 3.5, 2, 5, 4.5))
head(inputUser)
## title rating
## 1 Breakfast Club, The (1985) 5.0
## 2 Toy Story (1995) 3.5
## 3 Jumanji (1995) 2.0
## 4 Pulp Fiction (1994) 5.0
## 5 Akira (1988) 4.5
Adding movieIds to the input user
With the input complete, let’s extract the input movies’ IDs from the movies dataframe and add them to it.
We can achieve this by selecting the rows that contain the input movies’ titles and getting their IDs.
inputUser$movieId = rep(NA, length(inputUser$title))
for (i in 1:length(inputUser$title)){
inputUser$movieId[i] = as.character(movies_df$movieId[movies_df$title == inputUser$title[i]])
}
head(inputUser)
## title rating movieId
## 1 Breakfast Club, The (1985) 5.0 1968
## 2 Toy Story (1995) 3.5 1
## 3 Jumanji (1995) 2.0 2
## 4 Pulp Fiction (1994) 5.0 296
## 5 Akira (1988) 4.5 1274
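The lookup above can also be written without a loop using base R's match, which additionally makes it easy to spot titles that do not appear in the movies dataframe (the loop would stop with an error on such a title). A minimal sketch on made-up miniature tables; movies_small, input_small and unmatched are hypothetical names, and the title "Not In Catalogue (2001)" is invented to show the unmatched case.

```r
# Miniature stand-ins for movies_df and inputUser
movies_small <- data.frame(
  movieId = c(1L, 2L, 296L),
  title   = c("Toy Story (1995)", "Jumanji (1995)", "Pulp Fiction (1994)"),
  stringsAsFactors = FALSE
)
input_small <- data.frame(
  title  = c("Pulp Fiction (1994)", "Toy Story (1995)", "Not In Catalogue (2001)"),
  rating = c(5, 3.5, 4),
  stringsAsFactors = FALSE
)

# match() returns, for each input title, its row position in movies_small (NA if absent)
input_small$movieId <- movies_small$movieId[match(input_small$title, movies_small$title)]

# Report titles that could not be found, then keep only the matched rows
unmatched <- input_small$title[is.na(input_small$movieId)]
input_small <- input_small[!is.na(input_small$movieId), ]

print(unmatched)
print(input_small)
```

Unlike the loop, this keeps movieId as an integer rather than a character string.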
The users who have seen the same movies
With the movie IDs in our input, we can now get the subset of users who have watched and reviewed the movies that our input user has seen.
#Selecting the ratings of users that have watched movies that the input has watched and storing them
userSubset = ratings_df[ratings_df$movieId %in% inputUser$movieId,]
head(userSubset)
## userId movieId rating
## 41 1 1 5
## 175 2 1968 2
## 231 3 1968 4
## 286 5 296 4
## 464 6 296 2
## 470 6 1 4
With every user extracted, let’s sort them by the number of movies that they have in common with the input and keep the first 100 of them.
top100 <- head(sort(table(factor(userSubset$userId)), decreasing = TRUE), 100)
head(top100)
##
## 18 48 272 284 424 549
## 5 5 5 5 5 5
Now let’s extract the userIDs from the table and transform it into a data frame to make it easier to subset the data later on.
userList <- as.data.frame.table(top100)
colnames(userList) <- c("userId","commonMovies")
head(userList)
## userId commonMovies
## 1 18 5
## 2 48 5
## 3 272 5
## 4 284 5
## 5 424 5
## 6 549 5
Now let’s get the movies watched by these 100 users from the ratings dataframe and then rebuild the userSubset data frame, using the merge function to attach to each row the number of these users who rated the movie (Freq):
userSubset = ratings_df[ratings_df$userId %in% userList$userId,]
temp = as.data.frame(table(userSubset$movieId))
names(temp)[names(temp)=="Var1"] = "movieId"
userSubset = merge(temp, userSubset)
This is what our final userSubset dataframe looks like:
head(userSubset)
## movieId Freq userId rating
## 1 1 99 4448 4
## 2 1 99 5046 5
## 3 1 99 284 5
## 4 1 99 1680 4
## 5 1 99 424 4
## 6 1 99 1733 4
Let’s look at one of the users, e.g. the one with userID 533.
head(userSubset[userSubset$userId == 533,])
## movieId Freq userId rating
## 7 1 99 533 4
## 146 10 79 533 5
## 314 101 22 533 5
## 537 1019 38 533 3
## 572 1020 36 533 2
## 946 1031 33 533 4
Now let’s keep only the movies that occur more than 10 times in our dataframe, i.e. movies rated by more than 10 of the selected users:
userSubset = userSubset[userSubset$Freq > 10,]
head(userSubset)
## movieId Freq userId rating
## 1 1 99 4448 4
## 2 1 99 5046 5
## 3 1 99 284 5
## 4 1 99 1680 4
## 5 1 99 424 4
## 6 1 99 1733 4
Similarity of users to input user
Next, we are going to compare the top users to our specified user and find out how similar each of them is to the input user, using the Pearson Correlation Coefficient. It measures the strength of the linear association between two variables and ranges from -1 to 1.
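For reference, the coefficient can be written in terms of the three sums that the code computes as Sxx, Syy and Sxy. Here x_i and y_i stand for the two users' ratings of the i-th movie they have in common and n for the number of common movies; these symbols are notation introduced for this formula only.

```latex
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},
\qquad
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n},
\quad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n},
\quad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
```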
pearson_df = data.frame("userId"=integer(), "similarityIndex"=double())
for (user in userList$userId)
{
  #Ratings given by the current user
  userRating = userSubset[userSubset$userId == user,]
  #x: the current user's ratings of the input movies
  moviesInCommonX = userRating[userRating$movieId %in% inputUser$movieId,]
  moviesInCommonX = moviesInCommonX[complete.cases(moviesInCommonX),]
  #y: the input user's ratings of the movies the current user has also rated
  moviesInCommonY = inputUser[inputUser$movieId %in% userRating$movieId,]
  moviesInCommonY = moviesInCommonY[complete.cases(moviesInCommonY),]
  #Now let's calculate the pearson correlation between two users, so called, x and y
  #Note: the product in Sxy pairs the two rating vectors row by row, so it relies on both listing the common movies in the same order
  Sxx = sum(moviesInCommonX$rating^2) - (sum(moviesInCommonX$rating)^2)/nrow(moviesInCommonX)
  Syy = sum(moviesInCommonY$rating^2) - (sum(moviesInCommonY$rating)^2)/nrow(moviesInCommonY)
  Sxy = sum(moviesInCommonX$rating*moviesInCommonY$rating) - (sum(moviesInCommonX$rating)*sum(moviesInCommonY$rating))/nrow(moviesInCommonX)
  #A zero sum means the correlation is undefined or zero, so we store 0
  if(Sxx == 0 | Syy == 0 | Sxy == 0)
  {
    pearsonCorrelation = 0
  }
  else
  {
    pearsonCorrelation = Sxy/sqrt(Sxx*Syy)
  }
  pearson_df = rbind(pearson_df, data.frame("userId"=user, "similarityIndex"=pearsonCorrelation))
}
Here’s a look at the similarity scores:
head(pearson_df)
## userId similarityIndex
## 1 18 -0.6016568
## 2 48 -0.4385290
## 3 272 0.0000000
## 4 284 0.3580574
## 5 424 0.2192645
## 6 549 0.5547002
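The element-wise product in Sxy pairs the two users' ratings by row position, so both sets of ratings need to list the common movies in the same order. One way to guarantee that is to merge the two rating tables on movieId before correlating. The following is a minimal sketch under that assumption, using base R's cor on made-up neighbour ratings; pearson_on_common, neighbour and input are hypothetical names, and this is not the notebook's own loop.

```r
# Hypothetical helper: Pearson correlation over the movies two users share.
# x and y are data frames with columns movieId and rating.
pearson_on_common <- function(x, y) {
  # merge() pairs the ratings by movieId, whatever the row order of x and y
  common <- merge(x, y, by = "movieId", suffixes = c(".x", ".y"))
  # Fewer than two shared movies, or constant ratings, give no usable correlation
  if (nrow(common) < 2) return(0)
  if (sd(common$rating.x) == 0 || sd(common$rating.y) == 0) return(0)
  cor(common$rating.x, common$rating.y)
}

# Made-up neighbour ratings, deliberately listed in a different movie order
# from the input user's ratings
neighbour <- data.frame(movieId = c(1, 1274, 1968, 2, 296),
                        rating  = c(4, 5, 5, 2, 5))
input     <- data.frame(movieId = c(1968, 1, 2, 296, 1274),
                        rating  = c(5, 3.5, 2, 5, 4.5))

similarity <- pearson_on_common(neighbour, input)
print(similarity)
```

Because the pairing is done by movieId, reordering the rows of either data frame leaves the result unchanged.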
The top x similar users to input user
Now let’s get the top 50 users that are most similar to the input.
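No code is shown for this selection, and the merge further down uses pearson_df as it is. A minimal sketch of how such a selection could look, using the six similarity scores printed above and a hypothetical cut-off topN of 3 instead of 50; toy_pearson, ranked and top_users are illustrative names.

```r
# The six similarity scores shown in the preview of pearson_df
toy_pearson <- data.frame(
  userId = c(18, 48, 272, 284, 424, 549),
  similarityIndex = c(-0.6016568, -0.4385290, 0, 0.3580574, 0.2192645, 0.5547002)
)

# Hypothetical cut-off; the text above uses 50 on the full table
topN <- 3

# Sort by similarity, highest first, and keep the first topN rows
ranked <- toy_pearson[order(toy_pearson$similarityIndex, decreasing = TRUE), ]
top_users <- head(ranked, topN)
print(top_users)
```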
Now, let’s start recommending movies to the input user.
Rating of selected users to all movies
We’re going to do this by taking the weighted average of the ratings of the movies using the Pearson Correlation as the weight. But to do this, we first need to get the movies watched by the users in pearson_df from the userSubset dataframe and store their correlation in a new column called “similarityIndex”. This is achieved below by merging these two tables.
topUsersRating = merge(userSubset, pearson_df)
head(topUsersRating, 15)
## userId movieId Freq rating similarityIndex
## 1 18 1408 60 5 -0.6016568
## 2 18 260 99 5 -0.6016568
## 3 18 2089 31 4 -0.6016568
## 4 18 262 13 5 -0.6016568
## 5 18 3489 71 4 -0.6016568
## 6 18 2384 38 2 -0.6016568
## 7 18 3668 23 3 -0.6016568
## 8 18 2807 15 1 -0.6016568
## 9 18 1873 16 5 -0.6016568
## 10 18 2142 26 3 -0.6016568
## 11 18 1580 94 5 -0.6016568
## 12 18 2718 30 2 -0.6016568
## 13 18 1411 25 5 -0.6016568
## 14 18 2394 43 4 -0.6016568
## 15 18 3255 62 4 -0.6016568
A true weighted average would multiply each movie rating by its weight (the similarity index), sum up these weighted ratings and divide the result by the sum of the weights.
The code below takes a simpler route: it multiplies the two columns and then takes the mean of the weighted ratings for each movieId, so the resulting scores rank the movies but are not on the original rating scale:
#Multiplies the similarity by the user's ratings
topUsersRating$weightedRating = topUsersRating$similarityIndex*topUsersRating$rating
weightedAverage_df = aggregate(topUsersRating$weightedRating, list(topUsersRating$movieId), mean)
head(weightedAverage_df)
## Group.1 x
## 1 1 0.2364408
## 2 2 0.1369868
## 3 3 0.2994215
## 4 5 0.5463121
## 5 6 0.2496468
## 6 7 0.3516583
names(weightedAverage_df)[names(weightedAverage_df)=="Group.1"] = "movieId"
names(weightedAverage_df)[names(weightedAverage_df)=="x"] = "weightedAverage"
head(weightedAverage_df)
## movieId weightedAverage
## 1 1 0.2364408
## 2 2 0.1369868
## 3 3 0.2994215
## 4 5 0.5463121
## 5 6 0.2496468
## 6 7 0.3516583
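For comparison, a weighted average that is normalised by the sum of the weights could be sketched as follows. The data frame toy and its values are made up, and all weights are positive here by assumption; with negative or near-zero similarity sums the division would need extra care.

```r
# Made-up ratings of two movies by users with positive similarity weights
toy <- data.frame(
  movieId         = c(1, 1, 2, 2),
  rating          = c(5, 3, 4, 2),
  similarityIndex = c(0.5, 0.25, 0.5, 0.5)
)

# Weight each rating by the similarity of the user who gave it
toy$weightedRating <- toy$similarityIndex * toy$rating

# Per movie: sum of weighted ratings and sum of weights
sums <- aggregate(cbind(weightedRating, similarityIndex) ~ movieId,
                  data = toy, FUN = sum)

# Weighted average = sum of weighted ratings / sum of weights
sums$weightedAverage <- sums$weightedRating / sums$similarityIndex
print(sums)
```

With positive weights the result stays on the original rating scale, which makes the scores easier to interpret than a plain mean of weighted ratings.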
Now we merge the averages with the movies dataframe so we can get their titles.
recommendation_df = merge(weightedAverage_df, movies_df)
And then we finally sort it to see the top 20 movies that the algorithm recommended!
head(recommendation_df[order(-recommendation_df$weightedAverage),], 20)
## movieId weightedAverage title
## 291 1480 1.4659647 Smilla's Sense of Snow (1997)
## 672 2202 1.3000818 Lifeboat (1944)
## 952 2729 1.2735141 Lolita (1962)
## 664 2176 1.2283686 Rope (1948)
## 1288 3469 1.1484300 Inherit the Wind (1960)
## 217 1340 1.0922209 Bride of Frankenstein (1935)
## 1404 371 1.0908185 Paper, The (1994)
## 266 1414 1.0753449 Mother (1996)
## 121 1211 1.0648503 Wings of Desire (Der Himmel über Berlin) (1987)
## 1749 965 1.0551635 39 Steps, The (1935)
## 977 2789 0.9969701 Damien: Omen II (1978)
## 3 100 0.9953413 City Hall (1996)
## 1560 513 0.9823467 Radioland Murders (1994)
## 987 280 0.9466496 Murder in the First (1995)
## 710 230 0.9223159 Dolores Claiborne (1994)
## 1145 3142 0.9129529 U2: Rattle and Hum (1988)
## 1748 955 0.9052864 Bringing Up Baby (1938)
## 666 2186 0.8992181 Strangers on a Train (1951)
## 1479 3916 0.8950870 Remember the Titans (2000)
## 332 1594 0.8928542 In the Company of Men (1997)
Conclusion
Advantages and Disadvantages of collaborative filtering:
Advantages
- Takes other users’ ratings into consideration
- Doesn’t need to study or extract information from the recommended item
- Adapts to the user’s interests, which might change over time
Disadvantages
- Approximation function can be slow
- There might be a low number of users to approximate from
- Privacy issues when trying to learn the user’s preferences