1.0 Recommendation System

Note that not all of the code is displayed.

1.1 Introduction

Recommendation systems are models built around filtering: they seek to predict the “rating” or preference a user would give to an item from a given set. There are two major types, the collaborative recommendation system and the content-based recommendation system, and both are widely used in the media and entertainment industry. A collaborative recommendation system aggregates users' ratings or recommendations of objects, identifies similarities between users from those ratings, and generates new recommendations based on inter-user comparisons. It rests on the assumption that users who agreed in the past will agree in the future and will like objects similar to those they liked before. In a content-based recommendation system, objects are defined mainly by the features associated with them. It operates by learning a profile of the user's interests from the features of the items that user has rated.
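As a rough illustration of the collaborative idea described above, the following base R sketch compares users through their ratings and suggests an unseen movie from the most similar user. The rating matrix, the user names and the helper cosine_sim are hypothetical and serve only to make the inter-user comparison concrete; the rest of this report builds a content-based system instead.

```r
# Toy user-by-movie rating matrix (hypothetical data; NA = not rated)
ratings <- rbind(
  alice = c(5, 4, NA, 1),
  bob   = c(4, 5, 2, 1),
  carol = c(1, 2, 5, NA)
)
colnames(ratings) <- c("MovieA", "MovieB", "MovieC", "MovieD")

# Cosine similarity computed over the movies both users have rated
cosine_sim <- function(u, v) {
  both <- !is.na(u) & !is.na(v)
  sum(u[both] * v[both]) / (sqrt(sum(u[both]^2)) * sqrt(sum(v[both]^2)))
}

# Similarity of every other user to alice
others <- setdiff(rownames(ratings), "alice")
sim_to_alice <- sapply(others, function(other) {
  cosine_sim(ratings["alice", ], ratings[other, ])
})
nearest <- names(which.max(sim_to_alice))

# Suggest the movie alice has not rated that the nearest user rated highest
unseen <- is.na(ratings["alice", ])
suggestion <- colnames(ratings)[unseen][which.max(ratings[nearest, unseen])]

print(round(sim_to_alice, 3))
print(suggestion)
```

A content-based system would replace the user-to-user comparison with a comparison between item features, which is the approach taken from Section 3.0 onwards.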

1.2 Statement of Problem

In this report, we use the TMDB (The Movie Database) 5000 Movie dataset to build a content-based recommendation system that, given a movie fed into the system, recommends a set of movies to the user.

To achieve this aim, we set the following objectives:

1. Clean the data.
2. Do data engineering to make sure the dataset has useful features.
3. Perform exploratory data analysis to gain fair insight into the data.
4. Create a recommendation system.
5. Test the system by feeding it different movies.

2.0 Data Preparation

The first step is to use the “glimpse” function to view the structure of the dataset. The dataset contains some variables encoded in JSON format, each holding enough information to form a set of variables of its own. We will attempt to extract variables from these JSON variables. We renamed the “id” variable to “ID” to avoid conflicts later in the analysis. We also checked for NAs and found none.
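The preparation steps above can be sketched in base R as follows. The small movies data frame is a hypothetical stand-in for the real table (the IDs are taken from Table 2.1, while the titles and vote averages are placeholders), and str() is used here as a base R analogue of glimpse.

```r
# Hypothetical three-row stand-in for the movie table
movies <- data.frame(
  id = c(19995, 285, 206647),
  title = c("Movie A", "Movie B", "Movie C"),   # placeholder titles
  vote_average = c(7.0, 6.5, 6.0),              # placeholder values
  stringsAsFactors = FALSE
)

# Inspect the structure (base R analogue of dplyr::glimpse)
str(movies)

# Rename "id" to "ID" to avoid name clashes in later joins
names(movies)[names(movies) == "id"] <- "ID"

# Count missing values per column
na_counts <- colSums(is.na(movies))
print(na_counts)
```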

2.1 Feature Engineering

To extract more useful features from the dataset, we deal with the features encoded in JSON format, such as genres and production_companies; the same applies to crew and cast in the HMDB_c. These features include additional information useful for the recommendation. We create data frames from those columns: genres, production companies, cast (major actors) and movie directors.

Below are the first few observations of directors extracted from the JSON-encoded variable.

Table 2.1

movie_id	name
19995	James Cameron
285	Gore Verbinski
206647	Sam Mendes
49026	Christopher Nolan
49529	Andrew Stanton
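A table like Table 2.1 can be derived from a JSON-encoded crew column roughly as sketched below. This is a minimal sketch, not the code used for the report: it assumes the jsonlite package, a crew column holding a JSON array of objects with job and name fields, and a hypothetical helper extract_directors. The two-row credits data frame is a trimmed stand-in, and "Example Editor" is a placeholder entry.

```r
library(jsonlite)

# Hypothetical two-row stand-in for the credits table; crew is a JSON string
credits <- data.frame(
  movie_id = c(19995, 285),
  crew = c(
    '[{"job": "Director", "name": "James Cameron"}, {"job": "Editor", "name": "Example Editor"}]',
    '[{"job": "Director", "name": "Gore Verbinski"}]'
  ),
  stringsAsFactors = FALSE
)

# Parse one movie's crew JSON and keep only the rows whose job is "Director"
extract_directors <- function(movie_id, crew_json) {
  crew <- fromJSON(crew_json)                 # array of objects -> data frame
  if (!is.data.frame(crew)) return(NULL)      # e.g. an empty array
  directors <- crew[crew$job == "Director", , drop = FALSE]
  if (nrow(directors) == 0) return(NULL)
  data.frame(movie_id = movie_id, name = directors$name,
             stringsAsFactors = FALSE)
}

# Apply the helper to every row and stack the results
director <- do.call(rbind, Map(extract_directors, credits$movie_id, credits$crew))
print(director)
```

The same pattern would apply to the genres, production_companies and cast columns, with the filter step adjusted or removed.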

We continue the feature engineering by grouping the average votes into 4 categories: poor for average votes less than 3, fair for average votes greater than 3 but less than 6, good for average votes less than 8, and excellent for average votes of 8 and above.
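One way to produce such categories is base R's cut(), sketched below. The vote_average values are made up for illustration, and the handling of the boundaries is an assumption: the text does not say where an average of exactly 3 or 6 belongs, so this sketch places each boundary value in the higher category. The result is stored in a variable named Rating, matching the column name used later when the corpus is built.

```r
# Illustrative average votes (made-up values covering every category)
vote_average <- c(2.5, 3.0, 5.9, 6.0, 7.9, 8.0, 9.1)

# right = FALSE makes the intervals [lower, upper), so 3, 6 and 8
# each fall into the higher category (an assumption of this sketch)
Rating <- cut(vote_average,
              breaks = c(-Inf, 3, 6, 8, Inf),
              labels = c("poor", "fair", "good", "excellent"),
              right = FALSE)

print(data.frame(vote_average, Rating))
```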

2.2 Exploratory Analysis

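We now visualize some features in the dataset. We start by identifying the top directors by number of movies directed, shown in Figure 2.2.1.

# Packages used by the plots in this section
library(dplyr)
library(ggplot2)
library(wesanderson)

# Keep the first listed genre of each movie as its main genre
genre_1 <- genres %>% group_by(ID) %>% summarise(genre = first(genres))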

# Identify top directors by num of movies
top_directors <- director %>% 
        group_by(name) %>%
        summarise(total_n = n()) %>% 
        top_n(20, wt = total_n)

director %>% 
        filter(name %in% top_directors$name) %>% 
        left_join(genre_1, by = c('movie_id' = 'ID')) %>% 
        count(name, genre) %>% 
        left_join(top_directors, by = 'name') %>% 
        ggplot(aes(x = reorder(name, total_n), y = n, fill = genre)) +
        geom_col() + 
        coord_flip() +
        scale_fill_manual(values = wes_palette('Darjeeling1', 
                                               length(unique(genre_1$genre)), 
                                               type = 'continuous')) +
        labs(x = 'Directors', y = 'Number of Movies')

Figure 2.2.1

Another interesting feature is the set of top movies by average vote, shown in Figure 2.2.2.

Figure 2.2.2

Figure 2.2.3 shows the top genres by total number of movies.

#Top genre by number of movies
genres %>% 
        count(genres) %>% 
        ggplot(aes(x = reorder(genres, n), y = n)) +
        geom_col(fill = wes_palette('Rushmore1', 1, type = 'discrete')) +
        coord_flip() +
        labs(x = 'Genres', y = 'Number of Movies')

Figure 2.2.3

3.0 The Recommendation System

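We build the first recommender system on genre, actors, rating and director. To do so, we do some light data engineering to turn the genres, together with the directors and actors, into a corpus (a collection of text documents, one per movie).

# Packages used to build the corpus and the document-term matrix
library(stringr)
library(tm)

# Remove spaces so that each multi-word name becomes a single token
pasted_genres <- genres %>% 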
        mutate(genres = str_replace_all(genres, ' ', '')) %>% 
        group_by(ID) %>% 
        summarise(genres = paste(genres, collapse = ' ' ))

pasted_directors <- director %>% 
        mutate(director = str_replace_all(name, ' ', '')) %>% 
        group_by(movie_id) %>% 
        summarise(director = first(director))

pasted_cast <- cast %>%
        mutate(actors = str_replace_all(name, ' ', '')) %>% 
        group_by(movie_id) %>% 
        summarise(actors = paste(actors, collapse = ' ' ))

# Create corpus
corpus_metadata <- TMDB %>%
        select(ID, title, Rating) %>% 
        distinct(title, .keep_all = TRUE) %>% 
        left_join(pasted_genres, by = 'ID') %>% 
        left_join(pasted_directors, by = c('ID' = 'movie_id')) %>% 
        left_join(pasted_cast, by = c('ID' = 'movie_id')) %>%
        transmute(doc_id = title, text = paste(genres, director, actors,
                                               Rating)) %>% 
        as.data.frame() %>% 
        DataframeSource() %>%
        Corpus()

# Form dtm with binary weighting
dt_md_bin <- DocumentTermMatrix(corpus_metadata,
                                control = list(weighting = function(x) weightBin(x)))

# Convert into Matrix
frec_matrix <- dt_md_bin %>% as.matrix()
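To make the shape of a binary document-term matrix concrete, here is a small base R sketch that builds one by hand. It is a simplified stand-in for the tm pipeline above, not a reimplementation of it: the three documents and the names in them are hypothetical, tokens are split on single spaces, and lowercasing is a choice made for this sketch.

```r
# Hypothetical documents: one string of space-free tokens per movie
docs <- c(
  MovieA = "Action Adventure JaneDoe good",
  MovieB = "Action Comedy JaneDoe JohnRoe good",
  MovieC = "Drama JohnRoe fair"
)

# Split each document into lowercase tokens
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# The vocabulary is the sorted set of all distinct tokens
vocab <- sort(unique(unlist(tokens)))

# Binary weighting: 1 if the term occurs in the document, 0 otherwise
toy_matrix <- t(sapply(tokens, function(tk) as.integer(vocab %in% tk)))
colnames(toy_matrix) <- vocab

print(toy_matrix)
```

Each row is a movie and each column a term, so two movies sharing a genre, an actor, a director or a rating category both have a 1 in the same column. Removing the spaces inside names beforehand keeps a name such as a director's from being split into separate first-name and surname terms.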

Once the corpus is formed, we create a Recommendation function and apply it to the resulting matrix. We fed it the movie Spider-Man 3, and these are the top 10 recommendations produced by the system. The queried movie itself appears in the list, since it shares every one of its terms with itself.

##Recommendation####
Recommendation <- function(matr, title = 'Avatar', db = TMDB, n = 10){
        
        # matr: matrix of weights
        # title: string title of the movie
        # db: original df to extract votes
        # n: number of recommendations to provide
        
        ind <- which(row.names(matr) == title)
        
        frec_movie <- matr[ind,]
        
        # Add the target row to every row: a cell equals 2 only when both
        # the target movie and the row's movie contain that term
        mutual_terms <- sweep(matr, 2, frec_movie, '+')
        
        # Count the shared terms for each movie
        most_frec <- apply(mutual_terms, 1, function(x) {sum(x == 2)})
        
        # Join with original df to get vote_average
        recomms <- data.frame(title = names(most_frec), frec = most_frec) %>% 
                mutate_if(is.character, factor) %>% 
                left_join(db %>%
                                  select(title, vote_average) %>% 
                                  distinct(title, .keep_all = TRUE) %>% 
                                  mutate_if(is.character, factor), 
                          by = 'title') %>% 
                arrange(desc(frec), desc(vote_average)) %>% 
                select(title) %>% 
                head(n)
        
        return(recomms)
}

kable(Recommendation(frec_matrix, 'Spider-Man 3'))

title
Spider-Man 3
Spider-Man 2
Spider-Man
Oz: The Great and Powerful
Small Soldiers
The Mummy Returns
Reign of Fire
The Monkey King 2
Predators
Suicide Squad

Next, we fed it the movie John Carter, and these are the top 10 recommendations produced by the system.

kable(Recommendation(frec_matrix, 'John Carter'))

title
John Carter
X-Men Origins: Wolverine
Guardians of the Galaxy
Return of the Jedi
Captain America: The Winter Soldier
X-Men: Days of Future Past
The Avengers
Star Trek Into Darkness
Iron Man
Star Trek
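The scoring step inside Recommendation can be checked on a toy matrix. The sketch below uses hypothetical movies and terms; it reproduces the sweep-and-count step and shows that, for a 0/1 matrix, the same counts come from a plain matrix product. The last line drops the queried movie before ranking, which is an optional variation and not what the function above does.

```r
# Hypothetical binary document-term matrix (rows = movies, columns = terms)
toy_matrix <- rbind(
  MovieA = c(1, 1, 0, 1, 1, 0),
  MovieB = c(1, 0, 1, 1, 1, 1),
  MovieC = c(0, 0, 0, 0, 0, 1)
)
colnames(toy_matrix) <- c("action", "adventure", "comedy",
                          "good", "janedoe", "johnroe")

target <- toy_matrix["MovieA", ]

# Same approach as Recommendation(): add the target row to every row,
# then count the cells equal to 2 (terms present in both movies)
via_sweep <- apply(sweep(toy_matrix, 2, target, "+"), 1,
                   function(x) sum(x == 2))

# Equivalent for 0/1 data: dot product of each row with the target row
via_product <- drop(toy_matrix %*% target)

# Optional variation: exclude the queried movie before ranking
ranked <- sort(via_sweep[names(via_sweep) != "MovieA"], decreasing = TRUE)

print(via_sweep)
print(ranked)
```

In this toy case MovieA shares four terms with itself, three with MovieB and none with MovieC, so MovieB ranks first once the queried movie is excluded.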