Collaborative Filtering in R

Recommendation systems (sometimes called recommender systems) are a collection of algorithms that recommend items to users based on information gathered about those users. These systems have become ubiquitous and are commonly seen in online stores, movie databases and job finders. In this notebook, we will explore recommendation systems based on Collaborative Filtering and implement a simple version of one using R.

Acquiring the data

# rating dataset
download.file("https://ibm.box.com/shared/static/q61myoukbyz969b97ddlcq0dny0l07bf.dat", "ratings.dat")

#Movie dataset
download.file("https://ibm.box.com/shared/static/dn84btkn9gmxmdau32c5xb0vamie6jy4.dat", "movies.dat")

Preprocessing

Let’s begin by loading the data into their dataframes:

#Loading the movie information into a dataframe
movies_df <- read.csv('movies.dat', header = FALSE, sep=":")

# Head is a function that gets the first 6 rows of a dataframe
head(movies_df)

##   V1 V2                                 V3 V4                           V5
## 1  1                      Toy Story (1995)     Animation|Children's|Comedy
## 2  2                        Jumanji (1995)    Adventure|Children's|Fantasy
## 3  3               Grumpier Old Men (1995)                  Comedy|Romance
## 4  4              Waiting to Exhale (1995)                    Comedy|Drama
## 5  5    Father of the Bride Part II (1995)                          Comedy
## 6  6                           Heat (1995)           Action|Crime|Thriller

#Loading the user information into a dataframe
ratings_df <- read.csv('ratings.dat', header = FALSE, sep=":")

# Alternatively, let's look at the first 20 rows of the dataframe
head(ratings_df, 20)

##    V1 V2   V3 V4 V5 V6        V7
## 1   1 NA 1193 NA  5 NA 978300760
## 2   1 NA  661 NA  3 NA 978302109
## 3   1 NA  914 NA  3 NA 978301968
## 4   1 NA 3408 NA  4 NA 978300275
## 5   1 NA 2355 NA  5 NA 978824291
## 6   1 NA 1197 NA  3 NA 978302268
## 7   1 NA 1287 NA  5 NA 978302039
## 8   1 NA 2804 NA  5 NA 978300719
## 9   1 NA  594 NA  4 NA 978302268
## 10  1 NA  919 NA  4 NA 978301368
## 11  1 NA  595 NA  5 NA 978824268
## 12  1 NA  938 NA  4 NA 978301752
## 13  1 NA 2398 NA  4 NA 978302281
## 14  1 NA 2918 NA  4 NA 978302124
## 15  1 NA 1035 NA  5 NA 978301753
## 16  1 NA 2791 NA  4 NA 978302188
## 17  1 NA 2687 NA  3 NA 978824268
## 18  1 NA 2018 NA  4 NA 978301777
## 19  1 NA 3105 NA  5 NA 978301713
## 20  1 NA 2797 NA  4 NA 978302039

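You can see here that some issues arise when reading the data. Both dataframes come out with more columns than the files have fields: the movies dataframe has empty columns between the ID, the title and the genres, and the ratings dataframe has columns that contain only NA values (V2, V4 and V6). In addition, a movie title that itself contains a colon is split in two, so the part of the title after the colon ends up in a separate column (column 4 of the movies dataframe). We will now run some code to deal with these issues.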

Let’s have a look at the raw data to see what may be causing the problem.

We will do this by using the function readLines to store the raw data and using the head function to preview it.

# Here we read the movies data again in the raw format and display the first few rows
lines <- readLines("movies.dat")
head(lines, 20)

##  [1] "1::Toy Story (1995)::Animation|Children's|Comedy"        
##  [2] "2::Jumanji (1995)::Adventure|Children's|Fantasy"         
##  [3] "3::Grumpier Old Men (1995)::Comedy|Romance"              
##  [4] "4::Waiting to Exhale (1995)::Comedy|Drama"               
##  [5] "5::Father of the Bride Part II (1995)::Comedy"           
##  [6] "6::Heat (1995)::Action|Crime|Thriller"                   
##  [7] "7::Sabrina (1995)::Comedy|Romance"                       
##  [8] "8::Tom and Huck (1995)::Adventure|Children's"            
##  [9] "9::Sudden Death (1995)::Action"                          
## [10] "10::GoldenEye (1995)::Action|Adventure|Thriller"         
## [11] "11::American President, The (1995)::Comedy|Drama|Romance"
## [12] "12::Dracula: Dead and Loving It (1995)::Comedy|Horror"   
## [13] "13::Balto (1995)::Animation|Children's"                  
## [14] "14::Nixon (1995)::Drama"                                 
## [15] "15::Cutthroat Island (1995)::Action|Adventure|Romance"   
## [16] "16::Casino (1995)::Drama|Thriller"                       
## [17] "17::Sense and Sensibility (1995)::Drama|Romance"         
## [18] "18::Four Rooms (1995)::Thriller"                         
## [19] "19::Ace Ventura: When Nature Calls (1995)::Comedy"       
## [20] "20::Money Train (1995)::Action"

It would appear that in each line of the data, the fields are separated by a double colon (::) rather than the single colon (:) we passed as the sep value in our read.csv call. However, read.csv only accepts a single character as its field separator (sep).

We can use the function gsub to replace the double colons (::) in our data with a tilde (~), a character that does not appear in the data.

# Here we replace the sep character used in the data ("::") with one that does not appear in the data ("~")
lines <- gsub("::", "~", lines)
head(lines, 20)

##  [1] "1~Toy Story (1995)~Animation|Children's|Comedy"        
##  [2] "2~Jumanji (1995)~Adventure|Children's|Fantasy"         
##  [3] "3~Grumpier Old Men (1995)~Comedy|Romance"              
##  [4] "4~Waiting to Exhale (1995)~Comedy|Drama"               
##  [5] "5~Father of the Bride Part II (1995)~Comedy"           
##  [6] "6~Heat (1995)~Action|Crime|Thriller"                   
##  [7] "7~Sabrina (1995)~Comedy|Romance"                       
##  [8] "8~Tom and Huck (1995)~Adventure|Children's"            
##  [9] "9~Sudden Death (1995)~Action"                          
## [10] "10~GoldenEye (1995)~Action|Adventure|Thriller"         
## [11] "11~American President, The (1995)~Comedy|Drama|Romance"
## [12] "12~Dracula: Dead and Loving It (1995)~Comedy|Horror"   
## [13] "13~Balto (1995)~Animation|Children's"                  
## [14] "14~Nixon (1995)~Drama"                                 
## [15] "15~Cutthroat Island (1995)~Action|Adventure|Romance"   
## [16] "16~Casino (1995)~Drama|Thriller"                       
## [17] "17~Sense and Sensibility (1995)~Drama|Romance"         
## [18] "18~Four Rooms (1995)~Thriller"                         
## [19] "19~Ace Ventura: When Nature Calls (1995)~Comedy"       
## [20] "20~Money Train (1995)~Action"

# Now we recreate the movies dataframe using the updated data
movies_df <- read.csv(text=lines, sep="~", header = FALSE)
head(movies_df, 20)

##    V1                                    V2                           V3
## 1   1                      Toy Story (1995)  Animation|Children's|Comedy
## 2   2                        Jumanji (1995) Adventure|Children's|Fantasy
## 3   3               Grumpier Old Men (1995)               Comedy|Romance
## 4   4              Waiting to Exhale (1995)                 Comedy|Drama
## 5   5    Father of the Bride Part II (1995)                       Comedy
## 6   6                           Heat (1995)        Action|Crime|Thriller
## 7   7                        Sabrina (1995)               Comedy|Romance
## 8   8                   Tom and Huck (1995)         Adventure|Children's
## 9   9                   Sudden Death (1995)                       Action
## 10 10                      GoldenEye (1995)    Action|Adventure|Thriller
## 11 11        American President, The (1995)         Comedy|Drama|Romance
## 12 12    Dracula: Dead and Loving It (1995)                Comedy|Horror
## 13 13                          Balto (1995)         Animation|Children's
## 14 14                          Nixon (1995)                        Drama
## 15 15               Cutthroat Island (1995)     Action|Adventure|Romance
## 16 16                         Casino (1995)               Drama|Thriller
## 17 17          Sense and Sensibility (1995)                Drama|Romance
## 18 18                     Four Rooms (1995)                     Thriller
## 19 19 Ace Ventura: When Nature Calls (1995)                       Comedy
## 20 20                    Money Train (1995)                       Action
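As an alternative to swapping the separator, each line can be split on the literal "::" directly. The following is a minimal sketch on two of the raw lines shown earlier; sample_lines and sample_movies are illustrative names that are not part of the notebook, and the sketch assumes every line has exactly three fields.

```r
# Two raw lines in the same "::"-separated layout as movies.dat
sample_lines <- c(
  "1::Toy Story (1995)::Animation|Children's|Comedy",
  "12::Dracula: Dead and Loving It (1995)::Comedy|Horror"
)

# fixed = TRUE treats "::" as a literal string, so the single colon
# inside a title is left alone
parts <- strsplit(sample_lines, "::", fixed = TRUE)

# Stack the pieces into a three-column data frame
sample_movies <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(sample_movies) <- c("movieId", "title", "genres")
sample_movies$movieId <- as.integer(sample_movies$movieId)
```

Splitting on the full separator avoids having to pick a replacement character that is guaranteed to be absent from the data.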

So each movie has a unique ID, a title that includes its release year, and several genres combined in a single field. Let’s name the columns and then strip any trailing whitespace from the title column using R’s handy “sub” function. The release year stays in the title, since we will look movies up by their full title later on.

names(movies_df)[names(movies_df)=="V1"] = "movieId"
names(movies_df)[names(movies_df)=="V2"] = "title"
names(movies_df)[names(movies_df)=="V3"] = "genres"

#Using sub to get rid of any trailing whitespace characters that may have appeared
movies_df$title = sub("\\s+$", "", movies_df$title)

head(movies_df, 20)

##    movieId                                 title                       genres
## 1        1                      Toy Story (1995)  Animation|Children's|Comedy
## 2        2                        Jumanji (1995) Adventure|Children's|Fantasy
## 3        3               Grumpier Old Men (1995)               Comedy|Romance
## 4        4              Waiting to Exhale (1995)                 Comedy|Drama
## 5        5    Father of the Bride Part II (1995)                       Comedy
## 6        6                           Heat (1995)        Action|Crime|Thriller
## 7        7                        Sabrina (1995)               Comedy|Romance
## 8        8                   Tom and Huck (1995)         Adventure|Children's
## 9        9                   Sudden Death (1995)                       Action
## 10      10                      GoldenEye (1995)    Action|Adventure|Thriller
## 11      11        American President, The (1995)         Comedy|Drama|Romance
## 12      12    Dracula: Dead and Loving It (1995)                Comedy|Horror
## 13      13                          Balto (1995)         Animation|Children's
## 14      14                          Nixon (1995)                        Drama
## 15      15               Cutthroat Island (1995)     Action|Adventure|Romance
## 16      16                         Casino (1995)               Drama|Thriller
## 17      17          Sense and Sensibility (1995)                Drama|Romance
## 18      18                     Four Rooms (1995)                     Thriller
## 19      19 Ace Ventura: When Nature Calls (1995)                       Comedy
## 20      20                    Money Train (1995)                       Action
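The titles above still carry the release year. If the year were wanted as a separate field, it could be pulled out with the same sub function. This is only a sketch under the assumption that every title ends in a four-digit year in parentheses; titles, year and title_only are illustrative names, and the notebook itself keeps the year inside the title.

```r
titles <- c("Toy Story (1995)", "American President, The (1995)")

# Capture the four digits inside the final pair of parentheses
year <- as.integer(sub("^.*\\(([0-9]{4})\\)\\s*$", "\\1", titles))

# Drop the trailing "(YYYY)" together with the whitespace before it
title_only <- sub("\\s*\\([0-9]{4}\\)\\s*$", "", titles)
```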

With that, let’s also drop the genres column since we won’t need it for this particular recommendation system.

#Dropping the genres column
movies_df$genres = NULL

Here’s the final movies dataframe:

head(movies_df,20)

##    movieId                                 title
## 1        1                      Toy Story (1995)
## 2        2                        Jumanji (1995)
## 3        3               Grumpier Old Men (1995)
## 4        4              Waiting to Exhale (1995)
## 5        5    Father of the Bride Part II (1995)
## 6        6                           Heat (1995)
## 7        7                        Sabrina (1995)
## 8        8                   Tom and Huck (1995)
## 9        9                   Sudden Death (1995)
## 10      10                      GoldenEye (1995)
## 11      11        American President, The (1995)
## 12      12    Dracula: Dead and Loving It (1995)
## 13      13                          Balto (1995)
## 14      14                          Nixon (1995)
## 15      15               Cutthroat Island (1995)
## 16      16                         Casino (1995)
## 17      17          Sense and Sensibility (1995)
## 18      18                     Four Rooms (1995)
## 19      19 Ace Ventura: When Nature Calls (1995)
## 20      20                    Money Train (1995)

head(ratings_df)

##   V1 V2   V3 V4 V5 V6        V7
## 1  1 NA 1193 NA  5 NA 978300760
## 2  1 NA  661 NA  3 NA 978302109
## 3  1 NA  914 NA  3 NA 978301968
## 4  1 NA 3408 NA  4 NA 978300275
## 5  1 NA 2355 NA  5 NA 978824291
## 6  1 NA 1197 NA  3 NA 978302268

Each row in the ratings dataframe holds a user ID, a movie ID, a rating and a timestamp showing when the user reviewed the movie. Let’s drop the empty columns, name the remaining ones accordingly and drop the timestamp column, since we won’t be using it for this type of recommendation.

#Removing the empty columns (V2, V4 and V6) using the subset function.
#These columns were generated because the data is separated by "::" while read.csv only accepts a single character
#for the sep value, such as ":" or "~", so each "::" was read as two separators with an empty field in between.
ratings_df <- subset( ratings_df, select = -c(V2, V4, V6 ))

head(ratings_df)

##   V1   V3 V5        V7
## 1  1 1193  5 978300760
## 2  1  661  3 978302109
## 3  1  914  3 978301968
## 4  1 3408  4 978300275
## 5  1 2355  5 978824291
## 6  1 1197  3 978302268

Let’s name the columns in ratings_df as follows and then remove the timestamp column:

- V1 as userId
- V3 as movieId
- V5 as rating
- V7 as timestamp

names(ratings_df)[names(ratings_df)=="V1"] = "userId"
names(ratings_df)[names(ratings_df)=="V3"] = "movieId"
names(ratings_df)[names(ratings_df)=="V5"] = "rating"
names(ratings_df)[names(ratings_df)=="V7"] = "timestamp"
ratings_df$timestamp = NULL

# Here's what the final ratings dataframe looks like:
head(ratings_df)

##   userId movieId rating
## 1      1    1193      5
## 2      1     661      3
## 3      1     914      3
## 4      1    3408      4
## 5      1    2355      5
## 6      1    1197      3
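The ratings file could also be read with the same separator swap that was used for the movies file, which avoids the empty columns in the first place. The sketch below runs on two raw lines written in the "::" layout, with values taken from the first two rows shown above; sample_ratings and ratings_sample_df are illustrative names.

```r
# Two raw lines in the "::"-separated layout of ratings.dat
sample_ratings <- c("1::1193::5::978300760", "1::661::3::978302109")

# Swap "::" for "~" and name the columns while reading
ratings_sample_df <- read.csv(
  text = gsub("::", "~", sample_ratings, fixed = TRUE),
  sep = "~", header = FALSE,
  col.names = c("userId", "movieId", "rating", "timestamp")
)

# The timestamp is not used by this recommender
ratings_sample_df$timestamp <- NULL
```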

Collaborative Filtering

Now, time to start our work on recommendation systems.

The first technique we’re going to take a look at is called Collaborative Filtering, here in its user-based form, which is also known as User-User Filtering. As hinted by its alternate name, this technique uses other users’ data to recommend items to the input user. It attempts to find users whose preferences and opinions are similar to the input user’s and then recommends items that they have liked. There are several methods of finding similar users (some even make use of Machine Learning); the one we will use here is based on the Pearson Correlation Function.

The process for creating a User Based recommendation system is as follows:

1. Select a user with the movies the user has watched
2. Based on the user’s ratings of movies, find the top X neighbours
3. Get the watched movie record of the user for each neighbour
4. Calculate a similarity score using some formula
5. Recommend the items with the highest score

Let’s begin by creating an input user to recommend movies to:

Note: To add more movies, simply add more elements to inputUser. Feel free to add more! Just be sure to write each title exactly as it appears in the movies dataframe, including the capitalisation and the release year, and if a movie starts with “The”, like “The Matrix”, write it like this: ‘Matrix, The (1999)’.

inputUser = data.frame("title"=c("Breakfast Club, The (1985)", "Toy Story (1995)", "Jumanji (1995)", "Pulp Fiction (1994)", "Akira (1988)"), 
                       "rating"=c(5, 3.5, 2, 5, 4.5))
head(inputUser)

##                        title rating
## 1 Breakfast Club, The (1985)    5.0
## 2           Toy Story (1995)    3.5
## 3             Jumanji (1995)    2.0
## 4        Pulp Fiction (1994)    5.0
## 5               Akira (1988)    4.5

Adding movieIds to the input user

With the input complete, let’s extract the input movies’ IDs from the movies dataframe and add them to it.

We can achieve this by selecting the rows that contain the input movies’ titles and taking their IDs.

inputUser$movieId = rep(NA, length(inputUser$title))
for (i in 1:length(inputUser$title)){
    inputUser$movieId[i] = as.character(movies_df$movieId[movies_df$title == inputUser$title[i]])
}
head(inputUser)

##                        title rating movieId
## 1 Breakfast Club, The (1985)    5.0    1968
## 2           Toy Story (1995)    3.5       1
## 3             Jumanji (1995)    2.0       2
## 4        Pulp Fiction (1994)    5.0     296
## 5               Akira (1988)    4.5    1274
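The loop above assumes that every input title has exactly one match in the movies dataframe. A vectorised lookup with match is one way to make a missing title visible instead. This is a sketch on made-up miniature data: movies_sample, input_sample and the title "Not In Catalogue (2000)" are hypothetical.

```r
movies_sample <- data.frame(
  movieId = c(1L, 2L, 296L),
  title = c("Toy Story (1995)", "Jumanji (1995)", "Pulp Fiction (1994)"),
  stringsAsFactors = FALSE
)
input_sample <- data.frame(
  title = c("Pulp Fiction (1994)", "Toy Story (1995)", "Not In Catalogue (2000)"),
  rating = c(5, 3.5, 4),
  stringsAsFactors = FALSE
)

# match returns the row position of each title, or NA when it is absent
input_sample$movieId <- movies_sample$movieId[match(input_sample$title, movies_sample$title)]

# Titles that could not be found, for example because of a typo
unmatched <- input_sample$title[is.na(input_sample$movieId)]
```

Unlike the loop, this keeps movieId as an integer rather than converting it to character.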

The users who have seen the same movies

With the movie IDs in our input, we can now get the subset of users who have watched and reviewed the movies that our input user has seen.

#Selecting the ratings of movies that the input user has watched and storing them
userSubset = ratings_df[ratings_df$movieId %in% inputUser$movieId,]
head(userSubset)

##     userId movieId rating
## 41       1       1      5
## 175      2    1968      2
## 231      3    1968      4
## 286      5     296      4
## 464      6     296      2
## 470      6       1      4

With every user extracted, let’s sort them by the number of movies they have in common with the input user and keep the first 100.

top100 <- head(sort(table(factor(userSubset$userId)), decreasing = TRUE), 100)

head(top100)

## 
##  18  48 272 284 424 549 
##   5   5   5   5   5   5

Now let’s extract the userIDs from the table and transform it into a data frame to make it easier to subset the data later on.

userList <- as.data.frame.table(top100)
colnames(userList) <-  c("userId","commonMovies")
head(userList)

##   userId commonMovies
## 1     18            5
## 2     48            5
## 3    272            5
## 4    284            5
## 5    424            5
## 6    549            5
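The counting step can be checked on a miniature example. The sketch below uses made-up ratings (subset_sample and user_list_sample are illustrative names) and converts the user IDs back to integers, since as.data.frame.table stores them as a factor.

```r
# Made-up ratings: user 3 shares three movies, user 2 two, user 1 one
subset_sample <- data.frame(
  userId = c(1, 2, 2, 3, 3, 3),
  movieId = c(1, 1, 2, 1, 2, 296)
)

# Count rows per user and sort so the largest counts come first
counts <- sort(table(subset_sample$userId), decreasing = TRUE)
top2 <- head(counts, 2)

# The table names hold the user IDs as strings; convert them back to integers
user_list_sample <- data.frame(
  userId = as.integer(names(top2)),
  commonMovies = as.integer(top2)
)
```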

Now let’s get the movies watched by these 100 users from the ratings dataframe and then build the userSubset dataframe, using the merge function to attach to each rating the number of times its movie was rated by these users (Freq).

userSubset = ratings_df[ratings_df$userId %in% userList$userId,]
temp = as.data.frame(table(userSubset$movieId))
names(temp)[names(temp)=="Var1"] = "movieId"
userSubset = merge(temp, userSubset)

This is what our final userSubset dataframe looks like:

head(userSubset)

##   movieId Freq userId rating
## 1       1   99   4448      4
## 2       1   99   5046      5
## 3       1   99    284      5
## 4       1   99   1680      4
## 5       1   99    424      4
## 6       1   99   1733      4

Let’s look at one of the users, e.g. the one with userID 533.

head(userSubset[userSubset$userId == 533,])

##     movieId Freq userId rating
## 7         1   99    533      4
## 146      10   79    533      5
## 314     101   22    533      5
## 537    1019   38    533      3
## 572    1020   36    533      2
## 946    1031   33    533      4

Now let’s filter out the movies with 10 or fewer occurrences in our dataframe:

userSubset = userSubset[userSubset$Freq > 10,]
head(userSubset)

##   movieId Freq userId rating
## 1       1   99   4448      4
## 2       1   99   5046      5
## 3       1   99    284      5
## 4       1   99   1680      4
## 5       1   99    424      4
## 6       1   99   1733      4
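The table-and-merge step turns movieId into a factor and re-sorts the rows, which is why the IDs above appear in string order (1, 10, 101, ...). One alternative, sketched here on made-up data with a threshold of 1 instead of 10, is to compute the per-movie count in place with ave. ratings_sample and popular are illustrative names.

```r
ratings_sample <- data.frame(
  userId = c(1, 2, 3, 1, 2, 1),
  movieId = c(10, 10, 10, 20, 20, 30),
  rating = c(4, 5, 3, 2, 4, 5)
)

# ave applies length within each movieId group and returns one value per row,
# so the original row order and column types are kept
ratings_sample$Freq <- ave(ratings_sample$movieId, ratings_sample$movieId, FUN = length)

# Keep only movies rated more than once in this toy example
popular <- ratings_sample[ratings_sample$Freq > 1, ]
```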

Similarity of users to input user

Next, we are going to compare each of the top users to our input user and measure how similar they are. We’ll do this with the Pearson Correlation Coefficient, which measures the strength of the linear association between two variables. It ranges from -1 (the two sets of ratings move in opposite directions) to 1 (they move together).
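For reference, the loop below uses the computational form of the coefficient. Writing x_i for a neighbour's rating and y_i for the input user's rating of the same movie, over the n movies they have in common, the quantities named Sxx, Syy and Sxy in the code correspond to:

```latex
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}

r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
```

The coefficient is undefined when either user gave the same rating to every common movie (S_xx or S_yy is zero), which is why the code falls back to 0 in that case.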

pearson_df = data.frame("userId"=integer(), "similarityIndex"=double())
for (user in userList$userId)
{
    userRating = userSubset[userSubset$userId == user,]
    
    moviesInCommonX = userRating[userRating$movieId %in% inputUser$movieId,]
    moviesInCommonX = moviesInCommonX[complete.cases(moviesInCommonX),]
    
    moviesInCommonY = inputUser[inputUser$movieId %in% userRating$movieId,]
    moviesInCommonY = moviesInCommonY[complete.cases(moviesInCommonY),]
    
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum(moviesInCommonX$rating^2) - (sum(moviesInCommonX$rating)^2)/nrow(moviesInCommonX)
    Syy = sum(moviesInCommonY$rating^2) - (sum(moviesInCommonY$rating)^2)/nrow(moviesInCommonY)
    Sxy = sum(moviesInCommonX$rating*moviesInCommonY$rating) - (sum(moviesInCommonX$rating)*sum(moviesInCommonY$rating))/nrow(moviesInCommonX)
    
    
    if(Sxx == 0 | Syy == 0 | Sxy == 0)
    {
        pearsonCorrelation = 0
    }
    else
    {
        pearsonCorrelation = Sxy/sqrt(Sxx*Syy)
    }
    
    pearson_df = rbind(pearson_df, data.frame("userId"=user, "similarityIndex"=pearsonCorrelation))   
}
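Note that the product inside Sxy pairs the two rating vectors row by row, so it relies on moviesInCommonX and moviesInCommonY listing the common movies in the same order. A sketch that makes the pairing explicit by joining on movieId and then calling R's built-in cor is shown below. pearson_aligned, nb and inp are hypothetical names, the ratings are made up, and the sketch assumes movieId has a comparable type in both data frames.

```r
pearson_aligned <- function(neighbourRatings, inputRatings) {
  # Join on movieId so each row holds both users' ratings of the same movie
  common <- merge(neighbourRatings[, c("movieId", "rating")],
                  inputRatings[, c("movieId", "rating")],
                  by = "movieId", suffixes = c(".x", ".y"))
  # The correlation is undefined with fewer than two movies or constant ratings
  if (nrow(common) < 2 || sd(common$rating.x) == 0 || sd(common$rating.y) == 0) {
    return(0)
  }
  cor(common$rating.x, common$rating.y)
}

# Made-up neighbour and input ratings, deliberately listed in different orders
nb <- data.frame(movieId = c(1, 1274, 1968, 2, 296), rating = c(4, 5, 5, 2, 5))
inp <- data.frame(movieId = c(1968, 1, 2, 296, 1274), rating = c(5, 3.5, 2, 5, 4.5))
similarity <- pearson_aligned(nb, inp)
```

Because the join decides which ratings are paired, the result does not depend on the row order of either input.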

Here’s a look at the similarity scores:

head(pearson_df)

##   userId similarityIndex
## 1     18      -0.6016568
## 2     48      -0.4385290
## 3    272       0.0000000
## 4    284       0.3580574
## 5    424       0.2192645
## 6    549       0.5547002

The top x similar users to input user

Now let’s get the top 50 users that are most similar to the input.
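The notebook does not show code for this selection, and the steps that follow merge the full pearson_df. A minimal sketch of how such a selection could look, using the six similarity scores printed above and keeping the top 3 instead of the top 50, is given here; pearson_sample and topUsers are hypothetical names.

```r
# The six similarity scores shown in the output above
pearson_sample <- data.frame(
  userId = c(18, 48, 272, 284, 424, 549),
  similarityIndex = c(-0.6016568, -0.4385290, 0, 0.3580574, 0.2192645, 0.5547002)
)

# Sort by similarity, highest first, and keep the first rows
topUsers <- head(pearson_sample[order(-pearson_sample$similarityIndex), ], 3)
```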

Now, let’s start recommending movies to the input user.

Rating of selected users to all movies

We’re going to do this by taking the weighted average of the ratings of the movies, using the Pearson Correlation as the weight. But to do this, we first need to get the movies watched by the users in pearson_df from the ratings dataframe and then store their correlation in a new column called “similarityIndex”. This is achieved below by merging these two tables.

topUsersRating = merge(userSubset, pearson_df)
head(topUsersRating, 15)

##    userId movieId Freq rating similarityIndex
## 1      18    1408   60      5      -0.6016568
## 2      18     260   99      5      -0.6016568
## 3      18    2089   31      4      -0.6016568
## 4      18     262   13      5      -0.6016568
## 5      18    3489   71      4      -0.6016568
## 6      18    2384   38      2      -0.6016568
## 7      18    3668   23      3      -0.6016568
## 8      18    2807   15      1      -0.6016568
## 9      18    1873   16      5      -0.6016568
## 10     18    2142   26      3      -0.6016568
## 11     18    1580   94      5      -0.6016568
## 12     18    2718   30      2      -0.6016568
## 13     18    1411   25      5      -0.6016568
## 14     18    2394   43      4      -0.6016568
## 15     18    3255   62      4      -0.6016568

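Now all we need to do is multiply each movie rating by its weight (the similarity index) and combine the weighted ratings per movie. A normalised weighted average would sum the weighted ratings and divide by the sum of the weights; the code below takes the simpler route of averaging the weighted ratings for each movie, which gives a score for ranking rather than a predicted rating.

We can do this by multiplying two columns, then using aggregate to take the mean of the weighted ratings grouped by movieId: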

#Multiplies the similarity by the user's ratings
topUsersRating$weightedRating = topUsersRating$similarityIndex*topUsersRating$rating
weightedAverage_df = aggregate(topUsersRating$weightedRating, list(topUsersRating$movieId), mean)
head(weightedAverage_df)

##   Group.1         x
## 1       1 0.2364408
## 2       2 0.1369868
## 3       3 0.2994215
## 4       5 0.5463121
## 5       6 0.2496468
## 6       7 0.3516583

names(weightedAverage_df)[names(weightedAverage_df)=="Group.1"] = "movieId"
names(weightedAverage_df)[names(weightedAverage_df)=="x"] = "weightedAverage"
head(weightedAverage_df)

##   movieId weightedAverage
## 1       1       0.2364408
## 2       2       0.1369868
## 3       3       0.2994215
## 4       5       0.5463121
## 5       6       0.2496468
## 6       7       0.3516583
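The code above averages the weighted ratings per movie. A normalised weighted average, which divides the sum of the weighted ratings by the sum of the weights, could be sketched as follows. The data is made up and uses only positive weights; with negative correlations the sum of weights can be close to zero, so restricting the input to positively correlated users is an assumption made here, not something the notebook does.

```r
# Made-up ratings of two movies by neighbours with positive similarity
ratings_sample <- data.frame(
  movieId = c(1, 1, 2, 2),
  rating = c(4, 2, 5, 3),
  similarityIndex = c(0.5, 0.25, 0.5, 0.5)
)
ratings_sample$weightedRating <- ratings_sample$similarityIndex * ratings_sample$rating

# Sum the weighted ratings and the weights per movie in one pass
sums <- aggregate(cbind(weightedRating, similarityIndex) ~ movieId,
                  data = ratings_sample, FUN = sum)

# Dividing by the total weight puts the result back on the rating scale
sums$weightedAverage <- sums$weightedRating / sums$similarityIndex
```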

Now we merge the averages with the movies dataframe so we can get their titles.

recommendation_df = merge(weightedAverage_df, movies_df)

And then we finally sort it to see the top 20 movies that the algorithm recommended!

head(recommendation_df[order(-recommendation_df$weightedAverage),], 20)

##      movieId weightedAverage                                           title
## 291     1480       1.4659647                   Smilla's Sense of Snow (1997)
## 672     2202       1.3000818                                 Lifeboat (1944)
## 952     2729       1.2735141                                   Lolita (1962)
## 664     2176       1.2283686                                     Rope (1948)
## 1288    3469       1.1484300                         Inherit the Wind (1960)
## 217     1340       1.0922209                    Bride of Frankenstein (1935)
## 1404     371       1.0908185                               Paper, The (1994)
## 266     1414       1.0753449                                   Mother (1996)
## 121     1211       1.0648503 Wings of Desire (Der Himmel über Berlin) (1987)
## 1749     965       1.0551635                            39 Steps, The (1935)
## 977     2789       0.9969701                          Damien: Omen II (1978)
## 3        100       0.9953413                                City Hall (1996)
## 1560     513       0.9823467                        Radioland Murders (1994)
## 987      280       0.9466496                      Murder in the First (1995)
## 710      230       0.9223159                        Dolores Claiborne (1994)
## 1145    3142       0.9129529                       U2: Rattle and Hum (1988)
## 1748     955       0.9052864                         Bringing Up Baby (1938)
## 666     2186       0.8992181                     Strangers on a Train (1951)
## 1479    3916       0.8950870                      Remember the Titans (2000)
## 332     1594       0.8928542                    In the Company of Men (1997)
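The ranked list can still contain movies the input user has already rated. A possible extra step, not part of the notebook, is to drop those before showing the result. The sketch below uses a three-row stand-in for recommendation_df with rounded, illustrative scores; recs_sample, seen and unseen are hypothetical names.

```r
# Stand-in for recommendation_df with illustrative scores
recs_sample <- data.frame(
  movieId = c(1, 1480, 2202),
  weightedAverage = c(0.24, 1.47, 1.30),
  title = c("Toy Story (1995)", "Smilla's Sense of Snow (1997)", "Lifeboat (1944)"),
  stringsAsFactors = FALSE
)

# Movie IDs the input user has already rated
seen <- c(1, 2, 296, 1274, 1968)

# Keep only unseen movies, then sort by score with the highest first
unseen <- recs_sample[!(recs_sample$movieId %in% seen), ]
unseen <- unseen[order(-unseen$weightedAverage), ]
```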

Conclusion

Advantages and Disadvantages of collaborative filtering:

Advantages

- Takes other users’ ratings into consideration
- Doesn’t need to study or extract information from the recommended item
- Adapts to the user’s interests, which might change over time

Disadvantages

- The approximation function can be slow
- There might be a low number of users to approximate from
- Privacy issues when trying to learn the user’s preferences

Anoop

15/07/2021