Hybrid Recommendation System - MovieLens Data

While researching recommender systems online, I came across an article based on a hybrid filtering approach.

http://shodhganga.inflibnet.ac.in/bitstream/10603/45058/14/16_chapter7.pdf

Up until now, the recommendation algorithms I have worked with were based only on movie ratings. I would like to include content-based recommendation, which may improve on the recommendations from the previous algorithms.

My hybrid approach will use both collaborative and content-based methods. Pearson correlation will be used to create the similarity matrices.
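To make the similarity step concrete, here is a minimal sketch of a Pearson user-user similarity matrix in base R. The toy matrix and its values are hypothetical and are not taken from the MovieLens data; handling missing ratings with `use = "pairwise.complete.obs"` is one reasonable choice, not necessarily the one used later.

```r
# Toy user-by-movie rating matrix (hypothetical values); NA marks an unrated movie.
ratings <- matrix(
  c(5, 4, NA, 1,
    4, 5,  2, NA,
    1, 2,  5, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("m1", "m2", "m3", "m4"))
)

# cor() correlates columns, so transpose the matrix to compare users.
# "pairwise.complete.obs" uses only the movies that both users have rated.
user_sim <- cor(t(ratings), method = "pearson", use = "pairwise.complete.obs")
round(user_sim, 2)
```

With so few co-rated movies the correlations are extreme; on real data it is common to require a minimum number of co-rated movies before trusting a similarity value.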

The first step will be to create a user-based rating matrix. The second step will be to create a movie-tag matrix showing the number of times a movie is tagged. The third step will be to create a hybrid matrix that uses both the rating and tag matrices to recommend a movie to a user.
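The first two steps can be sketched on toy data as follows. The data frames are hypothetical and only mirror the column names of user_ratedmovies.dat and movie_tags.dat; the third step is left out because how the two matrices are combined depends on the hybrid method chosen.

```r
# Hypothetical long-format data mirroring the columns of the real files.
toy_ratings <- data.frame(
  userID  = c(1, 1, 2, 2, 3),
  movieID = c(10, 20, 10, 30, 20),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)
toy_tags <- data.frame(
  movieID   = c(10, 10, 20, 30),
  tagID     = c(7, 13, 7, 25),
  tagWeight = c(1, 3, 2, 1)
)

# Step 1: user-by-movie rating matrix; NA where a user has not rated a movie.
rating_matrix <- tapply(toy_ratings$rating,
                        list(toy_ratings$userID, toy_ratings$movieID),
                        mean)

# Step 2: movie-by-tag matrix of tag weights; 0 where a tag was never applied.
tag_matrix <- xtabs(tagWeight ~ movieID + tagID, data = toy_tags)

rating_matrix
tag_matrix
```

For the full data set, dense matrices of this shape are mostly empty, so a sparse representation would likely be needed.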

Data: MovieLens + IMDb/Rotten Tomatoes, https://grouplens.org/datasets/hetrec-2011/

Number of Users: 2113

Number of Movies: 10197

user_count <- 2113
movie_count <- 10197

Explore Movie Tag Data

# tags
tags <- read.csv('tags.dat', sep='\t')
head(tags)
##   id    value
## 1  1    earth
## 2  2   police
## 3  3   boxing
## 4  4  painter
## 5  5    whale
## 6  6 medieval
nrow(tags)
## [1] 13222

There are 13,222 distinct tags in total that can be associated with a movie!

# movie tag data
movie_tags <- read.csv('movie_tags.dat', sep='\t')
head(movie_tags)
##   movieID tagID tagWeight
## 1       1     7         1
## 2       1    13         3
## 3       1    25         3
## 4       1    55         3
## 5       1    60         1
## 6       1   146         1
summary(movie_tags)
##     movieID          tagID         tagWeight     
##  Min.   :    1   Min.   :    1   Min.   : 1.000  
##  1st Qu.: 1861   1st Qu.:  775   1st Qu.: 1.000  
##  Median : 4399   Median : 2738   Median : 1.000  
##  Mean   :12163   Mean   : 4355   Mean   : 1.381  
##  3rd Qu.: 8494   3rd Qu.: 6800   3rd Qu.: 1.000  
##  Max.   :65130   Max.   :16518   Max.   :42.000
# movies data
movies <- read.csv('movies.dat', sep='\t')
head(movies)
##   id                       title imdbID
## 1  1                   Toy story 114709
## 2  2                     Jumanji 113497
## 3  3              Grumpy Old Men 107050
## 4  4           Waiting to Exhale 114885
## 5  5 Father of the Bride Part II 113041
## 6  6                        Heat 113277
##                                         spanishTitle
## 1                               Toy story (juguetes)
## 2                                            Jumanji
## 3                                Dos viejos gruñones
## 4                               Esperando un respiro
## 5 Vuelve el padre de la novia (Ahora también abuelo)
## 6                                               Heat
##                                                                                                     imdbPictureURL
## 1 http://ia.media-imdb.com/images/M/MV5BMTMwNDU0NTY2Nl5BMl5BanBnXkFtZTcwOTUxOTM5Mw@@._V1._SX214_CR0,0,214,314_.jpg
## 2 http://ia.media-imdb.com/images/M/MV5BMzM5NjE1OTMxNV5BMl5BanBnXkFtZTcwNDY2MzEzMQ@@._V1._SY314_CR3,0,214,314_.jpg
## 3     http://ia.media-imdb.com/images/M/MV5BMTI5MTgyMzE0OF5BMl5BanBnXkFtZTYwNzAyNjg5._V1._SX214_CR0,0,214,314_.jpg
## 4 http://ia.media-imdb.com/images/M/MV5BMTczMTMyMTgyM15BMl5BanBnXkFtZTcwOTc4OTQyMQ@@._V1._SY314_CR4,0,214,314_.jpg
## 5 http://ia.media-imdb.com/images/M/MV5BMTg1NDc2MjExOF5BMl5BanBnXkFtZTcwNjU1NDAzMQ@@._V1._SY314_CR5,0,214,314_.jpg
## 6 http://ia.media-imdb.com/images/M/MV5BMTM1NDc4ODkxNV5BMl5BanBnXkFtZTcwNTI4ODE3MQ@@._V1._SY314_CR1,0,214,314_.jpg
##   year                        rtID rtAllCriticsRating
## 1 1995                   toy_story                  9
## 2 1995             1068044-jumanji                5.6
## 3 1993              grumpy_old_men                5.9
## 4 1995           waiting_to_exhale                5.6
## 5 1995 father_of_the_bride_part_ii                5.3
## 6 1995                1068182-heat                7.7
##   rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten
## 1                     73                   73                     0
## 2                     28                   13                    15
## 3                     36                   24                    12
## 4                     25                   14                    11
## 5                     19                    9                    10
## 6                     58                   50                     8
##   rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews
## 1               100                8.5                     17
## 2                46                5.8                      5
## 3                66                  7                      6
## 4                56                5.5                     11
## 5                47                5.4                      5
## 6                86                7.2                     17
##   rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore
## 1                   17                     0               100
## 2                    2                     3                40
## 3                    5                     1                83
## 4                    5                     6                45
## 5                    1                     4                20
## 6                   14                     3                82
##   rtAudienceRating rtAudienceNumRatings rtAudienceScore
## 1              3.7               102338              81
## 2              3.2                44587              61
## 3              3.2                10489              66
## 4              3.3                 5666              79
## 5                3                13761              64
## 6              3.9                42785              92
##                                                   rtPictureURL
## 1 http://content7.flixster.com/movie/10/93/63/10936393_det.jpg
## 2  http://content8.flixster.com/movie/56/79/73/5679734_det.jpg
## 3      http://content6.flixster.com/movie/25/60/256020_det.jpg
## 4 http://content9.flixster.com/movie/10/94/17/10941715_det.jpg
## 5      http://content8.flixster.com/movie/25/54/255426_det.jpg
## 6      http://content9.flixster.com/movie/26/80/268099_det.jpg
total_movies <- nrow(movies)

# Number of movies tagged
movies_tagged <- length(unique(movie_tags$movieID))

# check sparsity in tags
movies_tagged/total_movies
## [1] 0.701677

The maximum tagWeight, that is, the number of times a single tag was applied to a movie, is 42, while the median is 1 and the mean is 1.38, so most movie-tag pairs occur only once.

There are a total of 10,197 movies in the data set. Roughly 70% of the movies are tagged and 30% have no tags.
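Since tagWeight describes a movie-tag pair rather than a movie, the number of distinct tags per movie has to be computed separately. A minimal sketch on hypothetical toy data (the `toy_` names and values are made up for illustration):

```r
# Hypothetical movie-tag rows with the same columns as movie_tags.dat.
toy_movie_tags <- data.frame(
  movieID   = c(1, 1, 1, 2, 3, 3),
  tagID     = c(7, 13, 25, 7, 55, 60),
  tagWeight = c(1, 3, 3, 1, 2, 1)
)
toy_total_movies <- 4  # hypothetical catalogue size; movie 4 has no tags

# Number of distinct tags attached to each tagged movie.
tags_per_movie <- table(toy_movie_tags$movieID)

# Total number of tag applications per movie (sum of weights).
weight_per_movie <- tapply(toy_movie_tags$tagWeight,
                           toy_movie_tags$movieID, sum)

# Share of the catalogue that has at least one tag.
coverage <- length(tags_per_movie) / toy_total_movies

tags_per_movie
weight_per_movie
coverage
```

Running `summary()` on the per-movie counts from the real movie_tags data frame would show how unevenly tags are spread across movies.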

Explore Movie Ratings Data

# user rated movies
user_rated_movies <- read.csv('user_ratedmovies.dat', sep='\t')
head(user_rated_movies)
##   userID movieID rating date_day date_month date_year date_hour
## 1     75       3    1.0       29         10      2006        23
## 2     75      32    4.5       29         10      2006        23
## 3     75     110    4.0       29         10      2006        23
## 4     75     160    2.0       29         10      2006        23
## 5     75     163    4.0       29         10      2006        23
## 6     75     165    4.5       29         10      2006        23
##   date_minute date_second
## 1          17          16
## 2          23          44
## 3          30           8
## 4          16          52
## 5          29          30
## 6          25          15
summary(user_rated_movies)
##      userID         movieID          rating         date_day    
##  Min.   :   75   Min.   :    1   Min.   :0.500   Min.   : 1.00  
##  1st Qu.:18161   1st Qu.: 1367   1st Qu.:3.000   1st Qu.: 8.00  
##  Median :33866   Median : 3249   Median :3.500   Median :15.00  
##  Mean   :35191   Mean   : 8710   Mean   :3.438   Mean   :15.57  
##  3rd Qu.:52004   3rd Qu.: 6534   3rd Qu.:4.000   3rd Qu.:23.00  
##  Max.   :71534   Max.   :65133   Max.   :5.000   Max.   :31.00  
##    date_month       date_year      date_hour      date_minute   
##  Min.   : 1.000   Min.   :1997   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 4.000   1st Qu.:2004   1st Qu.: 5.00   1st Qu.:15.00  
##  Median : 7.000   Median :2006   Median :13.00   Median :30.00  
##  Mean   : 6.541   Mean   :2005   Mean   :12.12   Mean   :29.65  
##  3rd Qu.:10.000   3rd Qu.:2007   3rd Qu.:19.00   3rd Qu.:45.00  
##  Max.   :12.000   Max.   :2009   Max.   :23.00   Max.   :59.00  
##   date_second   
##  Min.   : 0.00  
##  1st Qu.:15.00  
##  Median :30.00  
##  Mean   :29.51  
##  3rd Qu.:44.00  
##  Max.   :59.00
ggplot2::qplot(user_rated_movies$rating, geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The minimum rating is 0.5 and the maximum is 5.0. Ratings sit toward the higher end: the mean is 3.44 and the middle 50% of ratings fall between 3 and 4.

Check the Sparsity of Ratings Data

possible_ratings <- user_count * movie_count
rated <- nrow(user_rated_movies)

rated
## [1] 855598
possible_ratings
## [1] 21546261
rated/possible_ratings
## [1] 0.03970981

There are roughly 856,000 ratings out of about 21.5 million possible user-movie pairs. As you can see, the data is very sparse: only about 4% of the possible ratings are present.

This completes an overview of the MovieLens data I will be working with to create a hybrid recommendation system.
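The same check can be written so that the user and movie counts come from the data rather than from hard-coded constants. This is a sketch on hypothetical toy rows; applied to the real file, the number of distinct rated movies may be lower than the catalogue size, which would change the denominator.

```r
# Hypothetical ratings with the same key columns as user_ratedmovies.dat.
toy <- data.frame(
  userID  = c(75, 75, 78, 78, 127),
  movieID = c(3, 32, 3, 110, 32),
  rating  = c(1.0, 4.5, 4.0, 2.0, 4.0)
)

n_users  <- length(unique(toy$userID))   # distinct users who rated something
n_movies <- length(unique(toy$movieID))  # distinct movies with a rating

# Density is the share of user-movie pairs that carry a rating;
# sparsity is the share that do not.
density  <- nrow(toy) / (n_users * n_movies)
sparsity <- 1 - density

density
sparsity
```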