While researching recommender systems online, I came across an article based on a hybrid filtering approach:
http://shodhganga.inflibnet.ac.in/bitstream/10603/45058/14/16_chapter7.pdf
Up until now, the recommendation algorithms I have worked with were based only on movie ratings. I would like to add content-based recommendation, which may improve on the recommendations from those earlier algorithms.
My hybrid approach will use both collaborative and content-based methods. Pearson correlation will be used to create the similarity matrices.
The first step is to create a user-based rating matrix. The second step is to create a movie-tag matrix showing the number of times each movie is tagged. The third step is to create a hybrid matrix that uses both the rating and tag matrices to recommend a movie to a user.
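To make the three steps concrete, here is a minimal sketch on toy data. Everything in it is an assumption for illustration: the `ratings` and `tag_counts` matrices are made up, the similarities are computed between movies (not users) so that the two matrices line up, and the blend is a simple weighted average with a hypothetical weight `alpha`. The article's actual combination rule may differ.

```r
# Toy user-by-movie rating matrix (hypothetical values); NA = not rated
ratings <- matrix(
  c(5,  4, 1, NA,
    4,  5, 2,  1,
    1,  2, 5,  4,
    2,  1, 4,  5,
    5, NA, 1,  2),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("u", 1:5), paste0("m", 1:4))
)

# Toy movie-by-tag count matrix (hypothetical values)
tag_counts <- matrix(
  c(3, 1, 0,
    2, 2, 0,
    0, 1, 3,
    0, 0, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("m", 1:4), c("animation", "family", "crime"))
)

# Step 1: movie-movie Pearson similarity from ratings.
# cor() correlates columns; pairwise.complete.obs uses only the users
# who rated both movies in each pair.
rating_sim <- cor(ratings, use = "pairwise.complete.obs", method = "pearson")

# Step 2: movie-movie Pearson similarity from tag counts.
# Movies are rows here, so transpose to put them in columns.
tag_sim <- cor(t(tag_counts), method = "pearson")

# Step 3: blend the two similarity matrices (assumed weighted average)
alpha <- 0.5
hybrid_sim <- alpha * rating_sim + (1 - alpha) * tag_sim

# Movie most similar to m1 under the blended similarity
m1_neighbours <- sort(hybrid_sim["m1", -1], decreasing = TRUE)
names(m1_neighbours)[1]
```

Setting `alpha` to 1 falls back to pure collaborative similarity and 0 to pure tag-based similarity, which gives an easy way to compare the hybrid against the earlier rating-only approach.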
Data: MovieLens + IMDb/Rotten Tomatoes, https://grouplens.org/datasets/hetrec-2011/
Number of Users: 2113
Number of Movies: 10197
user_count <- 2113
movie_count <- 10197
# tags
tags <- read.csv('tags.dat', sep='\t')
head(tags)
## id value
## 1 1 earth
## 2 2 police
## 3 3 boxing
## 4 4 painter
## 5 5 whale
## 6 6 medieval
nrow(tags)
## [1] 13222
There are 13,222 distinct tags in total that can be associated with a movie.
# movie tag data
movie_tags <- read.csv('movie_tags.dat', sep='\t')
head(movie_tags)
## movieID tagID tagWeight
## 1 1 7 1
## 2 1 13 3
## 3 1 25 3
## 4 1 55 3
## 5 1 60 1
## 6 1 146 1
summary(movie_tags)
## movieID tagID tagWeight
## Min. : 1 Min. : 1 Min. : 1.000
## 1st Qu.: 1861 1st Qu.: 775 1st Qu.: 1.000
## Median : 4399 Median : 2738 Median : 1.000
## Mean :12163 Mean : 4355 Mean : 1.381
## 3rd Qu.: 8494 3rd Qu.: 6800 3rd Qu.: 1.000
## Max. :65130 Max. :16518 Max. :42.000
# movies data
movies <- read.csv('movies.dat', sep='\t')
head(movies)
## id title imdbID
## 1 1 Toy story 114709
## 2 2 Jumanji 113497
## 3 3 Grumpy Old Men 107050
## 4 4 Waiting to Exhale 114885
## 5 5 Father of the Bride Part II 113041
## 6 6 Heat 113277
## spanishTitle
## 1 Toy story (juguetes)
## 2 Jumanji
## 3 Dos viejos gruñones
## 4 Esperando un respiro
## 5 Vuelve el padre de la novia (Ahora también abuelo)
## 6 Heat
## imdbPictureURL
## 1 http://ia.media-imdb.com/images/M/MV5BMTMwNDU0NTY2Nl5BMl5BanBnXkFtZTcwOTUxOTM5Mw@@._V1._SX214_CR0,0,214,314_.jpg
## 2 http://ia.media-imdb.com/images/M/MV5BMzM5NjE1OTMxNV5BMl5BanBnXkFtZTcwNDY2MzEzMQ@@._V1._SY314_CR3,0,214,314_.jpg
## 3 http://ia.media-imdb.com/images/M/MV5BMTI5MTgyMzE0OF5BMl5BanBnXkFtZTYwNzAyNjg5._V1._SX214_CR0,0,214,314_.jpg
## 4 http://ia.media-imdb.com/images/M/MV5BMTczMTMyMTgyM15BMl5BanBnXkFtZTcwOTc4OTQyMQ@@._V1._SY314_CR4,0,214,314_.jpg
## 5 http://ia.media-imdb.com/images/M/MV5BMTg1NDc2MjExOF5BMl5BanBnXkFtZTcwNjU1NDAzMQ@@._V1._SY314_CR5,0,214,314_.jpg
## 6 http://ia.media-imdb.com/images/M/MV5BMTM1NDc4ODkxNV5BMl5BanBnXkFtZTcwNTI4ODE3MQ@@._V1._SY314_CR1,0,214,314_.jpg
## year rtID rtAllCriticsRating
## 1 1995 toy_story 9
## 2 1995 1068044-jumanji 5.6
## 3 1993 grumpy_old_men 5.9
## 4 1995 waiting_to_exhale 5.6
## 5 1995 father_of_the_bride_part_ii 5.3
## 6 1995 1068182-heat 7.7
## rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten
## 1 73 73 0
## 2 28 13 15
## 3 36 24 12
## 4 25 14 11
## 5 19 9 10
## 6 58 50 8
## rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews
## 1 100 8.5 17
## 2 46 5.8 5
## 3 66 7 6
## 4 56 5.5 11
## 5 47 5.4 5
## 6 86 7.2 17
## rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore
## 1 17 0 100
## 2 2 3 40
## 3 5 1 83
## 4 5 6 45
## 5 1 4 20
## 6 14 3 82
## rtAudienceRating rtAudienceNumRatings rtAudienceScore
## 1 3.7 102338 81
## 2 3.2 44587 61
## 3 3.2 10489 66
## 4 3.3 5666 79
## 5 3 13761 64
## 6 3.9 42785 92
## rtPictureURL
## 1 http://content7.flixster.com/movie/10/93/63/10936393_det.jpg
## 2 http://content8.flixster.com/movie/56/79/73/5679734_det.jpg
## 3 http://content6.flixster.com/movie/25/60/256020_det.jpg
## 4 http://content9.flixster.com/movie/10/94/17/10941715_det.jpg
## 5 http://content8.flixster.com/movie/25/54/255426_det.jpg
## 6 http://content9.flixster.com/movie/26/80/268099_det.jpg
total_movies <- nrow(movies)
# Number of movies tagged
movies_tagged <- length(unique(movie_tags$movieID))
# check sparsity in tags
movies_tagged/total_movies
## [1] 0.701677
The largest tagWeight is 42, meaning a single tag was applied to one movie 42 times, while most movie-tag pairs have a weight of 1 (median 1, mean 1.38).
There are 10,197 movies in total. Roughly 70% of them have at least one tag, and the remaining 30% have none.
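As a rough sketch of how the long-format tag data could be turned into a movie-tag matrix, the snippet below uses base R's `xtabs()` on a toy data frame. The data frame `movie_tags_toy` and the id range `all_movie_ids` are hypothetical stand-ins with the same column names as `movie_tags`; converting `movieID` to a factor with all movie ids as levels is my own choice, so that untagged movies still appear as rows of zeros.

```r
# Toy stand-in for movie_tags (hypothetical values, same column names)
movie_tags_toy <- data.frame(
  movieID   = c(1, 1, 2, 3, 3, 3),
  tagID     = c(7, 13, 7, 13, 25, 55),
  tagWeight = c(1, 3, 2, 1, 1, 4)
)
all_movie_ids <- 1:5  # assume movies 4 and 5 exist but carry no tags

# Factors keep untagged movies as all-zero rows in the cross-tabulation
movie_tags_toy$movieID <- factor(movie_tags_toy$movieID, levels = all_movie_ids)
movie_tags_toy$tagID   <- factor(movie_tags_toy$tagID)

# Movie-by-tag matrix whose cells hold the summed tagWeight
tag_matrix <- xtabs(tagWeight ~ movieID + tagID, data = movie_tags_toy)

# Distinct tags per movie (not the same thing as tagWeight)
tags_per_movie <- rowSums(tag_matrix > 0)

# Share of movies with at least one tag
tagged_share <- mean(rowSums(tag_matrix) > 0)
tagged_share
```

On the real data the same idea would produce a 10,197-row matrix with one column per tag that actually appears in `movie_tags`, so a sparse representation may be worth considering.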
# user rated movies
user_rated_movies <- read.csv('user_ratedmovies.dat', sep='\t')
head(user_rated_movies)
## userID movieID rating date_day date_month date_year date_hour
## 1 75 3 1.0 29 10 2006 23
## 2 75 32 4.5 29 10 2006 23
## 3 75 110 4.0 29 10 2006 23
## 4 75 160 2.0 29 10 2006 23
## 5 75 163 4.0 29 10 2006 23
## 6 75 165 4.5 29 10 2006 23
## date_minute date_second
## 1 17 16
## 2 23 44
## 3 30 8
## 4 16 52
## 5 29 30
## 6 25 15
summary(user_rated_movies)
## userID movieID rating date_day
## Min. : 75 Min. : 1 Min. :0.500 Min. : 1.00
## 1st Qu.:18161 1st Qu.: 1367 1st Qu.:3.000 1st Qu.: 8.00
## Median :33866 Median : 3249 Median :3.500 Median :15.00
## Mean :35191 Mean : 8710 Mean :3.438 Mean :15.57
## 3rd Qu.:52004 3rd Qu.: 6534 3rd Qu.:4.000 3rd Qu.:23.00
## Max. :71534 Max. :65133 Max. :5.000 Max. :31.00
## date_month date_year date_hour date_minute
## Min. : 1.000 Min. :1997 Min. : 0.00 Min. : 0.00
## 1st Qu.: 4.000 1st Qu.:2004 1st Qu.: 5.00 1st Qu.:15.00
## Median : 7.000 Median :2006 Median :13.00 Median :30.00
## Mean : 6.541 Mean :2005 Mean :12.12 Mean :29.65
## 3rd Qu.:10.000 3rd Qu.:2007 3rd Qu.:19.00 3rd Qu.:45.00
## Max. :12.000 Max. :2009 Max. :23.00 Max. :59.00
## date_second
## Min. : 0.00
## 1st Qu.:15.00
## Median :30.00
## Mean :29.51
## 3rd Qu.:44.00
## Max. :59.00
ggplot2::qplot(user_rated_movies$rating, geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The minimum rating is 0.5 and the maximum is 5.0. Ratings skew toward the higher end: the mean is 3.44, the median is 3.5, and the middle half of all ratings falls between 3 and 4.
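Because ratings come in half-star steps, a per-value tabulation can complement the histogram. The sketch below uses a made-up vector `ratings_toy` in place of `user_rated_movies$rating`; the numbers are purely illustrative.

```r
# Toy stand-in for user_rated_movies$rating (hypothetical values)
ratings_toy <- c(0.5, 3, 3.5, 3.5, 4, 4, 4, 4.5, 5, 2)

# All possible half-star values, so unused ratings show up as zero counts
rating_levels <- seq(0.5, 5, by = 0.5)

# Count and proportion of ratings at each half-star value
rating_counts <- table(factor(ratings_toy, levels = rating_levels))
rating_share  <- prop.table(rating_counts)

# Share of ratings that fall between 3 and 4 inclusive
share_3_to_4 <- mean(ratings_toy >= 3 & ratings_toy <= 4)
share_3_to_4
```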
possible_ratings <- user_count * movie_count
rated <- nrow(user_rated_movies)
rated
## [1] 855598
possible_ratings
## [1] 21546261
rated/possible_ratings
## [1] 0.03970981
There are roughly 856,000 ratings out of about 21.5 million possible user-movie pairs. As you can see, the data is very sparse: only about 4% of the possible ratings are present.
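The same sparsity figure can be read directly off a user-by-movie matrix. Below is a small sketch that reshapes long-format ratings into such a matrix with base R and measures how much of it is filled. The data frame `ratings_long` is a hypothetical stand-in for `user_rated_movies`, and `user_ids` and `movie_ids` are names I introduce for illustration.

```r
# Toy stand-in for user_rated_movies (hypothetical values)
ratings_long <- data.frame(
  userID  = c(75, 75, 78, 78, 78, 127),
  movieID = c(3, 32, 3, 110, 160, 32),
  rating  = c(1.0, 4.5, 3.0, 4.0, 2.0, 5.0)
)

user_ids  <- sort(unique(ratings_long$userID))
movie_ids <- sort(unique(ratings_long$movieID))

# Start from an all-NA matrix: NA means "not rated"
rating_matrix <- matrix(
  NA_real_,
  nrow = length(user_ids), ncol = length(movie_ids),
  dimnames = list(as.character(user_ids), as.character(movie_ids))
)

# Fill the rated cells using a two-column (row, column) index matrix
idx <- cbind(match(ratings_long$userID, user_ids),
             match(ratings_long$movieID, movie_ids))
rating_matrix[idx] <- ratings_long$rating

# Fraction of user-movie cells that hold a rating
density <- mean(!is.na(rating_matrix))
density
```

A dense matrix of doubles for the full 2,113 by 10,197 grid would take about 172 MB (21,546,261 cells at 8 bytes each), so with only about 4% of cells filled a sparse format is probably the more practical choice.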
This completes the overview of the MovieLens data I will be working with to create a hybrid recommendation system.