Up until now, the recommendation algorithms I have worked with were only based on Collaborative Filtering on ratings. I would like to include content-based recommendation that may improve the recommendations from previous algothrims.
My hybrid approach will use both collaborative and content-based methods. I will be using collobrative filtering on Movie ratings to get top 10 movies for a user. Then, I will futher augment the 10 rated movies with the preference of tags user has using tag data. Movies will be filtered down to only those that have been rated more than 500 times.
Load Necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(tidytext)
library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
library(sparklyr)
library(knitr)
Data: MovieLens + IMDb/Rotten Tomatoes, https://grouplens.org/datasets/hetrec-2011/
Number of Users: 2113 Number of Movies: 10197
user_count <- 2113
movie_count <- 10197
# tags
tags <- read.csv('tags.dat', sep='\t')
head(tags)
## id value
## 1 1 earth
## 2 2 police
## 3 3 boxing
## 4 4 painter
## 5 5 whale
## 6 6 medieval
nrow(tags)
## [1] 13222
There are 13,222 tags in total that can be associated to a movie! Most of these tags have not been tagged by users. We will further refine this data to tags that only users in the database have tagged.
# movie tag data
movie_tags <- read.csv('movie_tags.dat', sep='\t')
head(movie_tags)
## movieID tagID tagWeight
## 1 1 7 1
## 2 1 13 3
## 3 1 25 3
## 4 1 55 3
## 5 1 60 1
## 6 1 146 1
summary(movie_tags)
## movieID tagID tagWeight
## Min. : 1 Min. : 1 Min. : 1.000
## 1st Qu.: 1861 1st Qu.: 775 1st Qu.: 1.000
## Median : 4399 Median : 2738 Median : 1.000
## Mean :12163 Mean : 4355 Mean : 1.381
## 3rd Qu.: 8494 3rd Qu.: 6800 3rd Qu.: 1.000
## Max. :65130 Max. :16518 Max. :42.000
# movies data
movies <- read.csv('movies.dat', sep='\t')
head(movies)
## id title imdbID
## 1 1 Toy story 114709
## 2 2 Jumanji 113497
## 3 3 Grumpy Old Men 107050
## 4 4 Waiting to Exhale 114885
## 5 5 Father of the Bride Part II 113041
## 6 6 Heat 113277
## spanishTitle
## 1 Toy story (juguetes)
## 2 Jumanji
## 3 Dos viejos gruñones
## 4 Esperando un respiro
## 5 Vuelve el padre de la novia (Ahora también abuelo)
## 6 Heat
## imdbPictureURL
## 1 http://ia.media-imdb.com/images/M/MV5BMTMwNDU0NTY2Nl5BMl5BanBnXkFtZTcwOTUxOTM5Mw@@._V1._SX214_CR0,0,214,314_.jpg
## 2 http://ia.media-imdb.com/images/M/MV5BMzM5NjE1OTMxNV5BMl5BanBnXkFtZTcwNDY2MzEzMQ@@._V1._SY314_CR3,0,214,314_.jpg
## 3 http://ia.media-imdb.com/images/M/MV5BMTI5MTgyMzE0OF5BMl5BanBnXkFtZTYwNzAyNjg5._V1._SX214_CR0,0,214,314_.jpg
## 4 http://ia.media-imdb.com/images/M/MV5BMTczMTMyMTgyM15BMl5BanBnXkFtZTcwOTc4OTQyMQ@@._V1._SY314_CR4,0,214,314_.jpg
## 5 http://ia.media-imdb.com/images/M/MV5BMTg1NDc2MjExOF5BMl5BanBnXkFtZTcwNjU1NDAzMQ@@._V1._SY314_CR5,0,214,314_.jpg
## 6 http://ia.media-imdb.com/images/M/MV5BMTM1NDc4ODkxNV5BMl5BanBnXkFtZTcwNTI4ODE3MQ@@._V1._SY314_CR1,0,214,314_.jpg
## year rtID rtAllCriticsRating
## 1 1995 toy_story 9
## 2 1995 1068044-jumanji 5.6
## 3 1993 grumpy_old_men 5.9
## 4 1995 waiting_to_exhale 5.6
## 5 1995 father_of_the_bride_part_ii 5.3
## 6 1995 1068182-heat 7.7
## rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten
## 1 73 73 0
## 2 28 13 15
## 3 36 24 12
## 4 25 14 11
## 5 19 9 10
## 6 58 50 8
## rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews
## 1 100 8.5 17
## 2 46 5.8 5
## 3 66 7 6
## 4 56 5.5 11
## 5 47 5.4 5
## 6 86 7.2 17
## rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore
## 1 17 0 100
## 2 2 3 40
## 3 5 1 83
## 4 5 6 45
## 5 1 4 20
## 6 14 3 82
## rtAudienceRating rtAudienceNumRatings rtAudienceScore
## 1 3.7 102338 81
## 2 3.2 44587 61
## 3 3.2 10489 66
## 4 3.3 5666 79
## 5 3 13761 64
## 6 3.9 42785 92
## rtPictureURL
## 1 http://content7.flixster.com/movie/10/93/63/10936393_det.jpg
## 2 http://content8.flixster.com/movie/56/79/73/5679734_det.jpg
## 3 http://content6.flixster.com/movie/25/60/256020_det.jpg
## 4 http://content9.flixster.com/movie/10/94/17/10941715_det.jpg
## 5 http://content8.flixster.com/movie/25/54/255426_det.jpg
## 6 http://content9.flixster.com/movie/26/80/268099_det.jpg
total_movies <- nrow(movies)
# Number of movies tagged
movies_tagged <- length(unique(movie_tags$movieID))
# check sparsity in tags
movies_tagged/total_movies
## [1] 0.701677
The maximum tags associated with a movie is total of 42 while most of the movies on average are tagged once.
There are total of 10,197 movies in the matrix. Roughly 70% of the movies are tagged and 30% have no associations.
# movie tag data
utp_data <- read.csv('user_taggedmovies.dat', sep='\t')
head(utp_data)
## userID movieID tagID date_day date_month date_year date_hour date_minute
## 1 75 353 5290 29 10 2006 23 20
## 2 78 4223 5264 16 4 2007 4 43
## 3 127 1343 1544 28 8 2007 3 42
## 4 127 1343 12330 28 8 2007 3 42
## 5 127 2080 1451 28 8 2007 3 42
## 6 127 2080 1574 28 8 2007 3 42
## date_second
## 1 15
## 2 45
## 3 27
## 4 27
## 5 47
## 6 47
summary(utp_data)
## userID movieID tagID date_day
## Min. : 75 Min. : 1 Min. : 1 Min. : 1.0
## 1st Qu.:23032 1st Qu.: 1580 1st Qu.: 606 1st Qu.: 9.0
## Median :30167 Median : 4216 Median : 1974 Median :16.0
## Mean :34708 Mean :12228 Mean : 4100 Mean :15.9
## 3rd Qu.:50846 3rd Qu.: 8512 3rd Qu.: 6794 3rd Qu.:23.0
## Max. :71534 Max. :65130 Max. :16529 Max. :31.0
## date_month date_year date_hour date_minute
## Min. : 1.000 Min. :2005 Min. : 0.0 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.:2006 1st Qu.: 5.0 1st Qu.:14.00
## Median : 4.000 Median :2007 Median :14.0 Median :29.00
## Mean : 5.257 Mean :2007 Mean :12.3 Mean :29.51
## 3rd Qu.: 8.000 3rd Qu.:2007 3rd Qu.:19.0 3rd Qu.:45.00
## Max. :12.000 Max. :2009 Max. :23.0 Max. :59.00
## date_second
## Min. : 0.00
## 1st Qu.:14.00
## Median :29.00
## Mean :29.38
## 3rd Qu.:44.00
## Max. :59.00
top10tags <- as.data.frame(head(sort(table(utp_data$tagID), decreasing = TRUE),10), stringsAsFactors = FALSE)
top10tags$Var1 <- as.numeric(top10tags$Var1)
tagNames <- list()
for(i in 1:10){
tagNames[i] <- as.character(tags[top10tags[i,1],2])
}
top10tags <- cbind.data.frame(top10tags,unlist(tagNames), stringsAsFactors = FALSE)
colnames(top10tags) <- c("tagID","Freq","tagName")
library(ggplot2)
ggplot(top10tags, aes(tagName,Freq)) + geom_bar(stat="identity") + coord_flip()
# user rated movies
user_rated_movies <- read.csv('user_ratedmovies.dat', sep='\t')
head(user_rated_movies)
## userID movieID rating date_day date_month date_year date_hour
## 1 75 3 1.0 29 10 2006 23
## 2 75 32 4.5 29 10 2006 23
## 3 75 110 4.0 29 10 2006 23
## 4 75 160 2.0 29 10 2006 23
## 5 75 163 4.0 29 10 2006 23
## 6 75 165 4.5 29 10 2006 23
## date_minute date_second
## 1 17 16
## 2 23 44
## 3 30 8
## 4 16 52
## 5 29 30
## 6 25 15
summary(user_rated_movies)
## userID movieID rating date_day
## Min. : 75 Min. : 1 Min. :0.500 Min. : 1.00
## 1st Qu.:18161 1st Qu.: 1367 1st Qu.:3.000 1st Qu.: 8.00
## Median :33866 Median : 3249 Median :3.500 Median :15.00
## Mean :35191 Mean : 8710 Mean :3.438 Mean :15.57
## 3rd Qu.:52004 3rd Qu.: 6534 3rd Qu.:4.000 3rd Qu.:23.00
## Max. :71534 Max. :65133 Max. :5.000 Max. :31.00
## date_month date_year date_hour date_minute
## Min. : 1.000 Min. :1997 Min. : 0.00 Min. : 0.00
## 1st Qu.: 4.000 1st Qu.:2004 1st Qu.: 5.00 1st Qu.:15.00
## Median : 7.000 Median :2006 Median :13.00 Median :30.00
## Mean : 6.541 Mean :2005 Mean :12.12 Mean :29.65
## 3rd Qu.:10.000 3rd Qu.:2007 3rd Qu.:19.00 3rd Qu.:45.00
## Max. :12.000 Max. :2009 Max. :23.00 Max. :59.00
## date_second
## Min. : 0.00
## 1st Qu.:15.00
## Median :30.00
## Mean :29.51
## 3rd Qu.:44.00
## Max. :59.00
ggplot2::qplot(user_rated_movies$rating, geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Minimum rating is 0.5 and maximum is 5.0. Average ratings on higher end between 3 and 4.
possible_ratings <- user_count * movie_count
rated <- nrow(user_rated_movies)
rated
## [1] 855598
possible_ratings
## [1] 21546261
rated/possible_ratings
## [1] 0.03970981
There are roughly 860,000 movies rated out of 21 million possible ratings. As you can see, the data is very sparse and only 4% has ratings.
# read all data sets again
mtp_data <- read.csv('movie_tags.dat', sep='\t')
utp_data <- read.csv('user_taggedmovies.dat', sep='\t')
umr_data <- read.csv('user_ratedmovies.dat', sep='\t')
tags <- read.csv('tags.dat', sep='\t')
movies <- read.csv('movies.dat', sep='\t')[,1:2]
movies$title <- as.character(movies$title)
###### --- U X M (user-movie ratings), finds tag preferences to users ---####
umr_long <- umr_data %>% select(userID, movieID, rating)# %>%
mtp_long <- mtp_data %>% select(movieID, tagID, tagWeight)
utp_long <- utp_data %>% select(userID, tagID) %>%
group_by(userID, tagID) %>%
summarise(n = n())
It appears that many words occur rarely and fewer words are tagged more often. We will employ tf-idf to find the important tags by decreasing the weight for tags used more often. The following is the equation for tf-idf but We will use the bind_td_idf from the tidytext package.
# select data only needed
mtp_pr <- sqldf("SELECT mtp_long.movieID, mtp_long.tagID, mtp_long.tagWeight FROM mtp_long")
## Loading required package: tcltk
## Warning: Quoted identifiers should have class SQL, use DBI::SQL() if the
## caller performs the quoting.
# filter to only tags that exist in user tag db as well
mtp_pr <- sqldf("SELECT * from mtp_pr where
tagID in (select tagID from utp_long)")
# calculate tf-idf
mtp_pr <- mtp_pr %>% bind_tf_idf(movieID, tagID, tagWeight) %>% select(movieID, tagID, tf_idf)
# reshape data in wide-format
mtp <- dcast(mtp_pr, movieID ~ tagID)
## Using tf_idf as value column: use value.var to override.
rownames(mtp) <- mtp[,1]
mtp <- within(mtp, rm(movieID))
mtp[is.na(mtp)] <- 0
dim(mtp)
## [1] 7022 4435
kable(mtp[1:5,1:20])
| 1 | 2 | 3 | 4 | 6 | 7 | 8 | 9 | 11 | 12 | 13 | 19 | 21 | 22 | 23 | 24 | 25 | 27 | 29 | 31 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0.3090722 | 0 | 0 | 0 | 0 | 0.2207658 | 0 | 0 | 0 | 0 | 0 | 2.78165 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0.1748226 | 0 | 0 | 0 | 0 | 0 | 0.00000 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0.00000 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0.00000 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0.0000000 | 0 | 0 | 0 | 0 | 0 | 0.00000 | 0 | 0 | 0 |
# select tags only that exists in movie db
utp_pr <- sqldf("SELECT utp_long.userID, utp_long.tagID, utp_long.n from utp_long
WHERE utp_long.tagID in (select tagID from mtp_pr)")
utp_users <- as.data.frame(unique(utp_pr$userID))
colnames(utp_users) <- "userID"
# reshape data in wide-format
utp<- dcast(utp_pr, userID ~ tagID)
## Using n as value column: use value.var to override.
rownames(utp) <- utp[,1]
utp <- within(utp, rm(userID))
utp[is.na(utp)] <- 0
dim(utp)
## [1] 1875 4435
kable(utp[1:5,1:10])
| 1 | 2 | 3 | 4 | 6 | 7 | 8 | 9 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 127 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 170 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 175 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 190 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# cut down number of movies and consider only that have been rated more than 500 times
umr_movies <- sqldf("SELECT DISTINCT umr_long.movieID, count(*) from umr_long
WHERE umr_long.MovieID IN (SELECT DISTINCT mtp_long.movieID from mtp_long)
AND umr_long.userID in (SELECT DISTINCT utp_long.userID from utp_long)
GROUP BY umr_long.movieID
HAVING count(*) > 500")
umr_movies <- as.data.frame(unique(umr_movies$movieID))
colnames(umr_movies) <- "movieID"
umr_pr <- sqldf("SELECT * from umr_long
WHERE umr_long.movieID in (SELECT movieID from umr_movies)
AND umr_long.userID in (SELECT userID from utp_users)")
umr<- dcast(umr_pr, userID ~ movieID)
## Using rating as value column: use value.var to override.
rownames(umr) <- umr[,1]
umr <- within(umr, rm(userID))
dim(umr)
## [1] 1868 368
kable(umr[1:5,1:10])
| 1 | 2 | 6 | 10 | 16 | 19 | 21 | 32 | 34 | 39 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 78 | NA | NA | NA | NA | NA | NA | NA | 5 | NA | NA |
| 127 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 170 | 3.0 | 2 | NA | 3.5 | NA | 2.0 | 3.0 | 4 | 2 | 2 |
| 175 | 4.0 | NA | 5 | NA | NA | 0.5 | 0.5 | 4 | NA | 5 |
| 190 | 4.5 | 4 | NA | NA | NA | NA | NA | 3 | NA | NA |
# convert matrix
umr_matrix <- as(as.matrix(umr), "realRatingMatrix")
# run recommenderlab builtin function
recsys <- Recommender(data = umr_matrix, method = "IBCF", parameter = list(method = "Cosine"))
recsysTop5 <- predict(object = recsys, newdata = umr_matrix, n = 5)
top5items <- recsysTop5@items
head(top5items, n=5)
## $`78`
## [1] 153 306 265 168 327
##
## $`127`
## [1] 34 145 221 271 312
##
## $`170`
## [1] 355 342 274 142 113
##
## $`175`
## [1] 223 336 268 172 225
##
## $`190`
## [1] 210 326 204 280 140
eval_sets <- evaluationScheme(data = umr_matrix, method = "split", train = 0.7, given = 1, goodRating = 3, k = 1)
#IBCF
eval_recommender <- Recommender(data = getData(eval_sets, "train"), method = "IBCF", parameter = NULL)
eval_prediction <- predict(object = eval_recommender, newdata = getData(eval_sets, "known"), n = 10, type = "ratings")
calcPredictionAccuracy(x = eval_prediction, data = getData(eval_sets, "unknown"), byUser = FALSE)
## RMSE MSE MAE
## 0.9895285 0.9791667 0.7083333
Let’s explore recommendation for the first user. We will take these recommendations and find their weight using the Movie-Tag and User-Tag Matrix.
user_one <- top5items[1]
mtp_rows<- which(rownames(mtp) %in% as.character(unlist(user_one)))
mtp_selected_rows <- mtp[mtp_rows,]
# get user weight
userweight <- utp["78",]
# multiply the movie tag weight with user tag column
finalresults <- as.matrix(mtp_selected_rows) %*% t(as.matrix(userweight))
# sort the results
finalmovies <- as.vector(rownames(finalresults))
user_results <- movies %>% filter(id %in% finalmovies) %>% select(title)
colnames(user_results) <- c("Top 5 Movies")
user_results
## Top 5 Movies
## 1 Batman Forever
## 2 First Knight
## 3 Como agua para chocolate
## 4 Trois couleurs: Rouge
## 5 Tank Girl
From working on this project, the most challenging part I found was setting up data. Total of 90% of my time was consumed by how I should lay out the matrices so the rows and columns of movies and tags align. If I had filtered out movies from the movie-tag profile then I would have to update both ratings and user tag matrices. Working with 3 matrices to create subsets were not easy and I believe there is still lot more work involved in getting the data right.
As for making predictions, since algorithms were available to us, it was not as consuming to make predictions on ratings and then incorporate them with weight of tags.