DATA 643 - Final Project

Project Description

Hybrid Recommendation System - MovieLens Data

Up until now, the recommendation algorithms I have worked with were only based on Collaborative Filtering on ratings. I would like to include content-based recommendation that may improve the recommendations from previous algothrims.

My hybrid approach will use both collaborative and content-based methods. I will be using collobrative filtering on Movie ratings to get top 10 movies for a user. Then, I will futher augment the 10 rated movies with the preference of tags user has using tag data. Movies will be filtered down to only those that have been rated more than 500 times.

Explore Data Sets

Load Necessary libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(reshape2)
library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

library(tidytext)
library(recommenderlab)

## Loading required package: Matrix

## Loading required package: arules

## 
## Attaching package: 'arules'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

## Loading required package: proxy

## 
## Attaching package: 'proxy'

## The following object is masked from 'package:Matrix':
## 
##     as.matrix

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

## Loading required package: registry

library(sparklyr)
library(knitr)

Movie Data

Data: MovieLens + IMDb/Rotten Tomatoes, https://grouplens.org/datasets/hetrec-2011/

Number of Users: 2113 Number of Movies: 10197

user_count <- 2113
movie_count <- 10197

Movie Tag Data

# tags
tags <- read.csv('tags.dat', sep='\t')
head(tags)

##   id    value
## 1  1    earth
## 2  2   police
## 3  3   boxing
## 4  4  painter
## 5  5    whale
## 6  6 medieval

nrow(tags)

## [1] 13222

There are 13,222 tags in total that can be associated to a movie! Most of these tags have not been tagged by users. We will further refine this data to tags that only users in the database have tagged.

# movie tag data
movie_tags <- read.csv('movie_tags.dat', sep='\t')
head(movie_tags)

##   movieID tagID tagWeight
## 1       1     7         1
## 2       1    13         3
## 3       1    25         3
## 4       1    55         3
## 5       1    60         1
## 6       1   146         1

summary(movie_tags)

##     movieID          tagID         tagWeight     
##  Min.   :    1   Min.   :    1   Min.   : 1.000  
##  1st Qu.: 1861   1st Qu.:  775   1st Qu.: 1.000  
##  Median : 4399   Median : 2738   Median : 1.000  
##  Mean   :12163   Mean   : 4355   Mean   : 1.381  
##  3rd Qu.: 8494   3rd Qu.: 6800   3rd Qu.: 1.000  
##  Max.   :65130   Max.   :16518   Max.   :42.000

# movies data
movies <- read.csv('movies.dat', sep='\t')
head(movies)

##   id                       title imdbID
## 1  1                   Toy story 114709
## 2  2                     Jumanji 113497
## 3  3              Grumpy Old Men 107050
## 4  4           Waiting to Exhale 114885
## 5  5 Father of the Bride Part II 113041
## 6  6                        Heat 113277
##                                         spanishTitle
## 1                               Toy story (juguetes)
## 2                                            Jumanji
## 3                                Dos viejos gruñones
## 4                               Esperando un respiro
## 5 Vuelve el padre de la novia (Ahora también abuelo)
## 6                                               Heat
##                                                                                                     imdbPictureURL
## 1 http://ia.media-imdb.com/images/M/MV5BMTMwNDU0NTY2Nl5BMl5BanBnXkFtZTcwOTUxOTM5Mw@@._V1._SX214_CR0,0,214,314_.jpg
## 2 http://ia.media-imdb.com/images/M/MV5BMzM5NjE1OTMxNV5BMl5BanBnXkFtZTcwNDY2MzEzMQ@@._V1._SY314_CR3,0,214,314_.jpg
## 3     http://ia.media-imdb.com/images/M/MV5BMTI5MTgyMzE0OF5BMl5BanBnXkFtZTYwNzAyNjg5._V1._SX214_CR0,0,214,314_.jpg
## 4 http://ia.media-imdb.com/images/M/MV5BMTczMTMyMTgyM15BMl5BanBnXkFtZTcwOTc4OTQyMQ@@._V1._SY314_CR4,0,214,314_.jpg
## 5 http://ia.media-imdb.com/images/M/MV5BMTg1NDc2MjExOF5BMl5BanBnXkFtZTcwNjU1NDAzMQ@@._V1._SY314_CR5,0,214,314_.jpg
## 6 http://ia.media-imdb.com/images/M/MV5BMTM1NDc4ODkxNV5BMl5BanBnXkFtZTcwNTI4ODE3MQ@@._V1._SY314_CR1,0,214,314_.jpg
##   year                        rtID rtAllCriticsRating
## 1 1995                   toy_story                  9
## 2 1995             1068044-jumanji                5.6
## 3 1993              grumpy_old_men                5.9
## 4 1995           waiting_to_exhale                5.6
## 5 1995 father_of_the_bride_part_ii                5.3
## 6 1995                1068182-heat                7.7
##   rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten
## 1                     73                   73                     0
## 2                     28                   13                    15
## 3                     36                   24                    12
## 4                     25                   14                    11
## 5                     19                    9                    10
## 6                     58                   50                     8
##   rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews
## 1               100                8.5                     17
## 2                46                5.8                      5
## 3                66                  7                      6
## 4                56                5.5                     11
## 5                47                5.4                      5
## 6                86                7.2                     17
##   rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore
## 1                   17                     0               100
## 2                    2                     3                40
## 3                    5                     1                83
## 4                    5                     6                45
## 5                    1                     4                20
## 6                   14                     3                82
##   rtAudienceRating rtAudienceNumRatings rtAudienceScore
## 1              3.7               102338              81
## 2              3.2                44587              61
## 3              3.2                10489              66
## 4              3.3                 5666              79
## 5                3                13761              64
## 6              3.9                42785              92
##                                                   rtPictureURL
## 1 http://content7.flixster.com/movie/10/93/63/10936393_det.jpg
## 2  http://content8.flixster.com/movie/56/79/73/5679734_det.jpg
## 3      http://content6.flixster.com/movie/25/60/256020_det.jpg
## 4 http://content9.flixster.com/movie/10/94/17/10941715_det.jpg
## 5      http://content8.flixster.com/movie/25/54/255426_det.jpg
## 6      http://content9.flixster.com/movie/26/80/268099_det.jpg

total_movies <- nrow(movies)

# Number of movies tagged
movies_tagged <- length(unique(movie_tags$movieID))

# check sparsity in tags
movies_tagged/total_movies

## [1] 0.701677

The maximum tags associated with a movie is total of 42 while most of the movies on average are tagged once.

There are total of 10,197 movies in the matrix. Roughly 70% of the movies are tagged and 30% have no associations.

User Tag Data

# movie tag data
utp_data <- read.csv('user_taggedmovies.dat', sep='\t')
head(utp_data)

##   userID movieID tagID date_day date_month date_year date_hour date_minute
## 1     75     353  5290       29         10      2006        23          20
## 2     78    4223  5264       16          4      2007         4          43
## 3    127    1343  1544       28          8      2007         3          42
## 4    127    1343 12330       28          8      2007         3          42
## 5    127    2080  1451       28          8      2007         3          42
## 6    127    2080  1574       28          8      2007         3          42
##   date_second
## 1          15
## 2          45
## 3          27
## 4          27
## 5          47
## 6          47

summary(utp_data)

##      userID         movieID          tagID          date_day   
##  Min.   :   75   Min.   :    1   Min.   :    1   Min.   : 1.0  
##  1st Qu.:23032   1st Qu.: 1580   1st Qu.:  606   1st Qu.: 9.0  
##  Median :30167   Median : 4216   Median : 1974   Median :16.0  
##  Mean   :34708   Mean   :12228   Mean   : 4100   Mean   :15.9  
##  3rd Qu.:50846   3rd Qu.: 8512   3rd Qu.: 6794   3rd Qu.:23.0  
##  Max.   :71534   Max.   :65130   Max.   :16529   Max.   :31.0  
##    date_month       date_year      date_hour     date_minute   
##  Min.   : 1.000   Min.   :2005   Min.   : 0.0   Min.   : 0.00  
##  1st Qu.: 2.000   1st Qu.:2006   1st Qu.: 5.0   1st Qu.:14.00  
##  Median : 4.000   Median :2007   Median :14.0   Median :29.00  
##  Mean   : 5.257   Mean   :2007   Mean   :12.3   Mean   :29.51  
##  3rd Qu.: 8.000   3rd Qu.:2007   3rd Qu.:19.0   3rd Qu.:45.00  
##  Max.   :12.000   Max.   :2009   Max.   :23.0   Max.   :59.00  
##   date_second   
##  Min.   : 0.00  
##  1st Qu.:14.00  
##  Median :29.00  
##  Mean   :29.38  
##  3rd Qu.:44.00  
##  Max.   :59.00

top10tags <- as.data.frame(head(sort(table(utp_data$tagID), decreasing = TRUE),10), stringsAsFactors = FALSE)
top10tags$Var1 <- as.numeric(top10tags$Var1)

tagNames <- list()

for(i in 1:10){
  tagNames[i] <- as.character(tags[top10tags[i,1],2])
}
top10tags <- cbind.data.frame(top10tags,unlist(tagNames), stringsAsFactors = FALSE)                   
colnames(top10tags) <- c("tagID","Freq","tagName")

library(ggplot2)
ggplot(top10tags, aes(tagName,Freq)) + geom_bar(stat="identity") + coord_flip()

Movie Ratings data

# user rated movies
user_rated_movies <- read.csv('user_ratedmovies.dat', sep='\t')
head(user_rated_movies)

##   userID movieID rating date_day date_month date_year date_hour
## 1     75       3    1.0       29         10      2006        23
## 2     75      32    4.5       29         10      2006        23
## 3     75     110    4.0       29         10      2006        23
## 4     75     160    2.0       29         10      2006        23
## 5     75     163    4.0       29         10      2006        23
## 6     75     165    4.5       29         10      2006        23
##   date_minute date_second
## 1          17          16
## 2          23          44
## 3          30           8
## 4          16          52
## 5          29          30
## 6          25          15

summary(user_rated_movies)

##      userID         movieID          rating         date_day    
##  Min.   :   75   Min.   :    1   Min.   :0.500   Min.   : 1.00  
##  1st Qu.:18161   1st Qu.: 1367   1st Qu.:3.000   1st Qu.: 8.00  
##  Median :33866   Median : 3249   Median :3.500   Median :15.00  
##  Mean   :35191   Mean   : 8710   Mean   :3.438   Mean   :15.57  
##  3rd Qu.:52004   3rd Qu.: 6534   3rd Qu.:4.000   3rd Qu.:23.00  
##  Max.   :71534   Max.   :65133   Max.   :5.000   Max.   :31.00  
##    date_month       date_year      date_hour      date_minute   
##  Min.   : 1.000   Min.   :1997   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 4.000   1st Qu.:2004   1st Qu.: 5.00   1st Qu.:15.00  
##  Median : 7.000   Median :2006   Median :13.00   Median :30.00  
##  Mean   : 6.541   Mean   :2005   Mean   :12.12   Mean   :29.65  
##  3rd Qu.:10.000   3rd Qu.:2007   3rd Qu.:19.00   3rd Qu.:45.00  
##  Max.   :12.000   Max.   :2009   Max.   :23.00   Max.   :59.00  
##   date_second   
##  Min.   : 0.00  
##  1st Qu.:15.00  
##  Median :30.00  
##  Mean   :29.51  
##  3rd Qu.:44.00  
##  Max.   :59.00

ggplot2::qplot(user_rated_movies$rating, geom="histogram")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Minimum rating is 0.5 and maximum is 5.0. Average ratings on higher end between 3 and 4.

Check the Sparasity of Ratings Data

possible_ratings <- user_count * movie_count
rated <- nrow(user_rated_movies)

rated

## [1] 855598

possible_ratings

## [1] 21546261

rated/possible_ratings

## [1] 0.03970981

There are roughly 860,000 movies rated out of 21 million possible ratings. As you can see, the data is very sparse and only 4% has ratings.

Load data to create matrices

# read all data sets again

mtp_data <- read.csv('movie_tags.dat', sep='\t')
utp_data <- read.csv('user_taggedmovies.dat', sep='\t')
umr_data <- read.csv('user_ratedmovies.dat', sep='\t')
tags <- read.csv('tags.dat', sep='\t') 
movies <- read.csv('movies.dat', sep='\t')[,1:2]
movies$title <- as.character(movies$title)



###### --- U X M (user-movie ratings), finds tag preferences to users ---####
umr_long <- umr_data %>% select(userID, movieID, rating)# %>%

mtp_long <- mtp_data %>% select(movieID, tagID, tagWeight)

utp_long <- utp_data %>% select(userID, tagID)  %>% 
            group_by(userID, tagID) %>% 
            summarise(n = n())

Create Movie-Tag Matrix

Used for Content

It appears that many words occur rarely and fewer words are tagged more often. We will employ tf-idf to find the important tags by decreasing the weight for tags used more often. The following is the equation for tf-idf but We will use the bind_td_idf from the tidytext package.

# select data only needed
mtp_pr <- sqldf("SELECT mtp_long.movieID, mtp_long.tagID, mtp_long.tagWeight FROM mtp_long")

## Loading required package: tcltk

## Warning: Quoted identifiers should have class SQL, use DBI::SQL() if the
## caller performs the quoting.

# filter to only tags that exist in user tag db as well
mtp_pr <- sqldf("SELECT * from mtp_pr where
                tagID in (select tagID from utp_long)")

# calculate tf-idf
mtp_pr <- mtp_pr %>% bind_tf_idf(movieID, tagID, tagWeight) %>% select(movieID, tagID, tf_idf)

# reshape data in wide-format
mtp <- dcast(mtp_pr, movieID ~ tagID)

## Using tf_idf as value column: use value.var to override.

rownames(mtp) <- mtp[,1]
mtp <- within(mtp, rm(movieID))
mtp[is.na(mtp)] <- 0

dim(mtp)

## [1] 7022 4435

kable(mtp[1:5,1:20])

	7	13	25
1	0.3090722	0.2207658	2.78165
2	0.0000000	0.1748226	0.00000
3	0.0000000	0.0000000	0.00000
5	0.0000000	0.0000000	0.00000
6	0.0000000	0.0000000	0.00000

Create User-Tag Matrix

# select tags only that exists in movie db
utp_pr <- sqldf("SELECT utp_long.userID, utp_long.tagID, utp_long.n from utp_long
             WHERE utp_long.tagID in (select tagID from mtp_pr)")

utp_users <- as.data.frame(unique(utp_pr$userID))
colnames(utp_users) <- "userID"

# reshape data in wide-format
utp<- dcast(utp_pr, userID ~ tagID)

## Using n as value column: use value.var to override.

rownames(utp) <- utp[,1]
utp <- within(utp, rm(userID))
utp[is.na(utp)] <- 0

dim(utp)

## [1] 1875 4435

kable(utp[1:5,1:10])

	1	2	3	4	6	7	8	9	11	12
78	0	0	0	0	0	0	0	0	0	0
127	0	0	0	0	0	1	0	0	0	0
170	0	0	0	0	0	0	0	0	0	0
175	0	0	0	0	0	0	0	0	0	0
190	0	0	0	0	0	0	0	0	0	0

Create User-Movie Matrix

# cut down number of movies and consider only that have been rated more than 500 times
umr_movies <- sqldf("SELECT DISTINCT umr_long.movieID, count(*) from umr_long
                     WHERE umr_long.MovieID IN (SELECT DISTINCT mtp_long.movieID from mtp_long)
                     AND umr_long.userID in (SELECT DISTINCT utp_long.userID from utp_long)
                     GROUP BY umr_long.movieID 
                     HAVING count(*) > 500")

umr_movies <- as.data.frame(unique(umr_movies$movieID))
colnames(umr_movies) <- "movieID"

umr_pr <- sqldf("SELECT * from umr_long
                WHERE umr_long.movieID in (SELECT movieID from umr_movies)
                AND umr_long.userID in (SELECT userID from utp_users)")


umr<- dcast(umr_pr, userID ~ movieID)

## Using rating as value column: use value.var to override.

rownames(umr) <- umr[,1]
umr <- within(umr, rm(userID))

dim(umr)

## [1] 1868  368

kable(umr[1:5,1:10])

	1	2	6	10	16	19	21	32	34	39
78	NA	NA	NA	NA	NA	NA	NA	5	NA	NA
127	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
170	3.0	2	NA	3.5	NA	2.0	3.0	4	2	2
175	4.0	NA	5	NA	NA	0.5	0.5	4	NA	5
190	4.5	4	NA	NA	NA	NA	NA	3	NA	NA

User-Movie Rating Predictions - IBCF

# convert matrix
umr_matrix <- as(as.matrix(umr), "realRatingMatrix")

# run recommenderlab builtin function
recsys <- Recommender(data = umr_matrix, method = "IBCF", parameter = list(method = "Cosine"))
recsysTop5 <- predict(object = recsys, newdata = umr_matrix, n = 5)
top5items <- recsysTop5@items

head(top5items, n=5)

## $`78`
## [1] 153 306 265 168 327
## 
## $`127`
## [1]  34 145 221 271 312
## 
## $`170`
## [1] 355 342 274 142 113
## 
## $`175`
## [1] 223 336 268 172 225
## 
## $`190`
## [1] 210 326 204 280 140

Evaluate

eval_sets <- evaluationScheme(data = umr_matrix, method = "split", train = 0.7, given = 1, goodRating = 3, k = 1)

#IBCF
eval_recommender <- Recommender(data = getData(eval_sets, "train"), method = "IBCF", parameter = NULL)
eval_prediction <- predict(object = eval_recommender, newdata = getData(eval_sets, "known"), n = 10, type = "ratings")
calcPredictionAccuracy(x = eval_prediction, data = getData(eval_sets, "unknown"), byUser = FALSE)

##      RMSE       MSE       MAE 
## 0.9895285 0.9791667 0.7083333

Add Tag Weights (Content-Based)

Let’s explore recommendation for the first user. We will take these recommendations and find their weight using the Movie-Tag and User-Tag Matrix.

user_one <- top5items[1]
mtp_rows<- which(rownames(mtp) %in% as.character(unlist(user_one)))
mtp_selected_rows <- mtp[mtp_rows,]

# get user weight
userweight <- utp["78",]

# multiply the movie tag weight with user tag column
finalresults <- as.matrix(mtp_selected_rows) %*% t(as.matrix(userweight))

# sort the results
finalmovies <- as.vector(rownames(finalresults))

user_results <- movies %>% filter(id %in% finalmovies) %>% select(title)
colnames(user_results) <- c("Top 5 Movies")

user_results

##               Top 5 Movies
## 1           Batman Forever
## 2             First Knight
## 3 Como agua para chocolate
## 4    Trois couleurs: Rouge
## 5                Tank Girl

Conclusion

From working on this project, the most challenging part I found was setting up data. Total of 90% of my time was consumed by how I should lay out the matrices so the rows and columns of movies and tags align. If I had filtered out movies from the movie-tag profile then I would have to update both ratings and user tag matrices. Working with 3 matrices to create subsets were not easy and I believe there is still lot more work involved in getting the data right.
As for making predictions, since algorithms were available to us, it was not as consuming to make predictions on ratings and then incorporate them with weight of tags.