Note that not all of the code is displayed.
Recommendation systems are models built around filtering: they seek to predict the “rating” or preference a user would give to each item in a set. There are two major types, the collaborative recommendation system and the content-based recommendation system, and they are used mainly in the media and entertainment industry. A collaborative recommendation system aggregates users' ratings or recommendations of objects, identifies similarities between users from those ratings, and generates new recommendations from inter-user comparisons. It rests on the assumption that users who agreed in the past will agree in the future and will like objects similar to those they liked before. A content-based recommendation system instead describes each object by the features associated with it, and learns a profile of the user’s interests from the features of the items that user has rated.
In this report, we use the TMDB (The Movie Database) 5000 Movie dataset to build a content-based recommendation system that, given a movie as input, recommends a set of movies to the user.
To achieve this aim, we set the following objectives:

1. Clean the data.
2. Engineer the data so that the dataset has useful features.
3. Perform exploratory data analysis to gain a fair insight into the data.
4. Create a recommendation system.
5. Test the system by feeding it different movies.

As a first step, we use the “glimpse” function to view the structure of the dataset. Several variables are encoded in JSON format, and each holds enough information to form a set of variables of its own. We will attempt to extract variables from these JSON-encoded variables. We renamed the “id” variable to “ID” to avoid conflicts later in the analysis. We also checked for NAs and found none.
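The inspection steps just described might look roughly like the sketch below. The `movies` data frame and its values are hypothetical stand-ins for the real table, and `na_counts` is a name introduced only for this illustration.

```r
library(dplyr)

# Toy stand-in for the movies table (hypothetical values, not the real data)
movies <- data.frame(
  id = c(19995, 285),
  title = c("Avatar", "Pirates of the Caribbean: At World's End"),
  genres = c('[{"id": 28, "name": "Action"}]',
             '[{"id": 12, "name": "Adventure"}]'),
  stringsAsFactors = FALSE
)

glimpse(movies)                      # compact view of column types and first values
movies <- rename(movies, ID = id)    # rename id -> ID to avoid clashes in later joins
na_counts <- colSums(is.na(movies))  # number of missing values per column
na_counts
```

A vector of zeros from `colSums(is.na(...))` corresponds to the "no NAs" finding reported above.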
To extract more useful features from the dataset, we process the features encoded in JSON format, such as genres and production_companies; the same applies to crew and cast in the HMDB_c. These features include additional information that is useful for the recommendation. We create separate data frames from these columns: genres, production companies, cast (major actors) and movie directors.
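The report does not show how the JSON columns were unpacked, so the following is only a minimal sketch of one way to do it, assuming the `jsonlite` package. The toy `movies` rows and the name `genres_long` are hypothetical; the output columns `ID` and `genres` are chosen to match the `genres` data frame used in the later code.

```r
library(jsonlite)

# Toy stand-in for three rows of the movies table (hypothetical values)
movies <- data.frame(
  ID = c(1, 2, 3),
  genres = c('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
             '[{"id": 18, "name": "Drama"}]',
             '[]'),
  stringsAsFactors = FALSE
)

# Parse each JSON string and keep one row per (movie, genre) pair
genre_rows <- lapply(seq_len(nrow(movies)), function(i) {
  parsed <- fromJSON(movies$genres[i])      # data frame with columns id and name
  if (length(parsed) == 0) return(NULL)     # skip movies with an empty JSON array
  data.frame(ID = movies$ID[i], genres = parsed$name, stringsAsFactors = FALSE)
})
genres_long <- do.call(rbind, genre_rows)
genres_long
```

The same pattern would apply to production companies, cast and crew, with the director obtained by filtering the crew entries on their job field.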
Below are the first few observations of directors extracted from the JSON-encoded variable.
Table 2.1

| movie_id | name |
|---|---|
| 19995 | James Cameron |
| 285 | Gore Verbinski |
| 206647 | Sam Mendes |
| 49026 | Christopher Nolan |
| 49529 | Andrew Stanton |
We continue the feature engineering by grouping average votes into four categories: poor for average votes below 3, fair for average votes above 3 but below 6, good for average votes below 8, and excellent for average votes of 8 and above.
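A binning like this can be sketched with base R's `cut()`. The text does not say which category a vote of exactly 3 or exactly 6 falls into, so the sketch assumes left-closed intervals (3 counts as fair, 6 as good). The vote values and the names `vote_average` and `rating` are used here only for illustration.

```r
# Hypothetical average votes used only for illustration
vote_average <- c(2.5, 3.0, 5.9, 6.0, 7.9, 8.0, 9.1)

# right = FALSE makes each interval closed on the left: [3, 6), [6, 8), [8, Inf)
rating <- cut(vote_average,
              breaks = c(-Inf, 3, 6, 8, Inf),
              labels = c("poor", "fair", "good", "excellent"),
              right = FALSE)

data.frame(vote_average, rating)
```

Switching to `right = TRUE` would move the boundary values into the lower category instead.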
We visualize some features in the dataset, starting with the top directors by number of movies directed. Figure 2.2.1 shows the result.
```r
library(dplyr)
library(ggplot2)
library(wesanderson)

# First listed genre of each movie, used to colour the bars
genre_1 <- genres %>%
  group_by(ID) %>%
  summarise(genre = first(genres))

# Identify top directors by number of movies
top_directors <- director %>%
  group_by(name) %>%
  summarise(total_n = n()) %>%
  top_n(20, wt = total_n)

# Count movies per director and genre, then plot as stacked horizontal bars
director %>%
  filter(name %in% top_directors$name) %>%
  left_join(genre_1, by = c('movie_id' = 'ID')) %>%
  count(name, genre) %>%
  left_join(top_directors, by = 'name') %>%
  ggplot(aes(x = reorder(name, total_n), y = n, fill = genre)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = wes_palette('Darjeeling1',
                                         length(unique(genre_1$genre)),
                                         type = 'continuous')) +
  labs(x = 'Directors', y = 'Number of Movies')
```
Figure 2.2.1
Another interesting feature is the top movies by average vote, shown in Figure 2.2.2.
Figure 2.2.2
Figure 2.2.3 shows the top genres by total number of movies.
```r
# Top genres by number of movies
genres %>%
  count(genres) %>%
  ggplot(aes(x = reorder(genres, n), y = n)) +
  geom_col(fill = wes_palette('Rushmore1', 1, type = 'discrete')) +
  coord_flip() +
  labs(x = 'Genres', y = 'Number of Movies')
```
Figure 2.2.3
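We build the first recommender system on genre, actors, rating and director. This requires some slight data engineering: we strip the spaces inside each genre and name so that each becomes a single term, then paste the terms for every movie together to form a corpus (a collection of documents, one per movie).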
```r
library(stringr)
library(tm)

# Strip spaces so each genre or name becomes a single term, then paste per movie
pasted_genres <- genres %>%
  mutate(genres = str_replace_all(genres, ' ', '')) %>%
  group_by(ID) %>%
  summarise(genres = paste(genres, collapse = ' '))

pasted_directors <- director %>%
  mutate(director = str_replace_all(name, ' ', '')) %>%
  group_by(movie_id) %>%
  summarise(director = first(director))

pasted_cast <- cast %>%
  mutate(actors = str_replace_all(name, ' ', '')) %>%
  group_by(movie_id) %>%
  summarise(actors = paste(actors, collapse = ' '))

# Create corpus
corpus_metadata <- TMDB %>%
  select(ID, title, Rating) %>%
  distinct(title, .keep_all = TRUE) %>%
  left_join(pasted_genres, by = 'ID') %>%
  left_join(pasted_directors, by = c('ID' = 'movie_id')) %>%
  left_join(pasted_cast, by = c('ID' = 'movie_id')) %>%
  transmute(doc_id = title, text = paste(genres, director, actors,
                                         Rating)) %>%
  as.data.frame() %>%
  DataframeSource() %>%
  Corpus()

# Form dtm with binary weighting
dt_md_bin <- DocumentTermMatrix(corpus_metadata,
                                control = list(weighting = function(x) weightBin(x)))

# Convert into matrix
frec_matrix <- dt_md_bin %>% as.matrix()
```
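To make the shape of the binary matrix concrete, here is a small base R sketch that builds one by hand. It is an illustration only and does not reproduce `tm`'s tokenizer; the three documents and the names `docs`, `tokens`, `vocab` and `bin_matrix` are hypothetical.

```r
# Hypothetical documents: genres, director and rating pasted into one string
docs <- c(
  "Movie A" = "Action Adventure JamesCameron good",
  "Movie B" = "Action Drama JamesCameron excellent",
  "Movie C" = "Comedy Romance good"
)

# Split each document into terms on single spaces
tokens <- strsplit(docs, " ", fixed = TRUE)
names(tokens) <- names(docs)

# Vocabulary = every distinct term across all documents
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term; 1 if the term occurs, else 0
bin_matrix <- t(sapply(tokens, function(tok) as.integer(vocab %in% tok)))
colnames(bin_matrix) <- vocab
bin_matrix
```

Because the space in a name such as "James Cameron" was removed beforehand, the director contributes the single column `JamesCameron` rather than two separate columns that could match unrelated people.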
With the corpus formed, we create a recommendation function and apply it to the resulting matrix. Feeding it the movie Spider-Man 3 produces the top 10 results below; the input movie itself ranks first because it shares every one of its terms with itself.
```r
library(knitr)  # for kable()

## Recommendation ####
Recommendation <- function(matr, title = 'Avatar', db = TMDB, n = 10) {
  # matr: matrix of weights (one row per movie, one column per term)
  # title: string title of the movie
  # db: original df to extract votes
  # n: number of recommendations to provide
  ind <- which(row.names(matr) == title)
  frec_movie <- matr[ind, ]
  # Add the movie's vector to every row; terms shared with the movie sum to 2
  mutual_terms <- sweep(matr, 2, frec_movie, '+')
  # Count the mutual terms per movie
  most_frec <- apply(mutual_terms, 1, function(x) {sum(x == 2)})
  # Join with original df to get vote_average
  recomms <- data.frame(title = names(most_frec), frec = most_frec) %>%
    mutate_if(is.character, factor) %>%
    left_join(db %>%
                select(title, vote_average) %>%
                distinct(title, .keep_all = TRUE) %>%
                mutate_if(is.character, factor),
              by = 'title') %>%
    arrange(desc(frec), desc(vote_average)) %>%
    select(title) %>%
    head(n)
  return(recomms)
}

kable(Recommendation(frec_matrix, 'Spider-Man 3'))
```
| title |
|---|
| Spider-Man 3 |
| Spider-Man 2 |
| Spider-Man |
| Oz: The Great and Powerful |
| Small Soldiers |
| The Mummy Returns |
| Reign of Fire |
| The Monkey King 2 |
| Predators |
| Suicide Squad |
Next, we fed it the movie John Carter; these are the top 10 recommendations produced by the system.
```r
kable(Recommendation(frec_matrix, 'John Carter'))
```
| title |
|---|
| John Carter |
| X-Men Origins: Wolverine |
| Guardians of the Galaxy |
| Return of the Jedi |
| Captain America: The Winter Soldier |
| X-Men: Days of Future Past |
| The Avengers |
| Star Trek Into Darkness |
| Iron Man |
| Star Trek |
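
The counting step inside `Recommendation` can be traced on a tiny example. The sketch below uses a hypothetical 3 x 4 binary matrix (the movies `A`, `B`, `C` and the terms are invented) and repeats the same `sweep` and `apply` calls as the function.

```r
# Hypothetical binary term matrix: rows are movies, columns are terms
m <- matrix(c(1, 1, 0, 1,
              1, 0, 1, 1,
              0, 0, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"),
                            c("action", "drama", "comedy", "nolan")))

query <- m["A", ]                       # term vector of the input movie
summed <- sweep(m, 2, query, "+")       # add the query vector to every row
shared <- apply(summed, 1, function(x) sum(x == 2))  # 2 means both movies have the term
shared

# For 0/1 matrices the same counts come from a matrix product
shared_alt <- as.vector(m %*% query)
```

Here `A` shares 3 terms with itself, `B` shares 2 (`action` and `nolan`), and `C` shares none, so the ranking is A, B, C. This also shows why the input movie heads each table above: no other movie can share more of its terms than it shares with itself.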