Recommendation systems: TMDB 5000 Movie Dataset

Abstract
I. Introduction
II. Dataset and executive summary
III. Preprocessing
IV. Methods and modelling approaches
V. Results
VI. Conclusion
References

Abstract

write the abstract text here….

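on_kaggle <- 1

if (on_kaggle == 0) {
  # load local copy of dataset; the trailing slash is needed because
  # path is later prefixed directly to the file name
  path <- paste0(getwd(), "/")
} else {
  path <- "" # access dataset on Kaggle
}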

I. Introduction

Recommender systems are one of the most successful and widespread applications of machine learning technologies in business. They are a subclass of information filtering systems that seek to predict the “rating” or “preference” a user would give to an item. One of the most famous success stories of recommender systems is the Netflix Prize competition, launched in October 2006. In 2009, at the end of the challenge, Netflix awarded a one-million-dollar prize to a team of developers for an algorithm that improved the accuracy of the company’s recommendation engine by 10%. Many well-known recommendation algorithms, such as latent factor models, were popularized by the Netflix contest. The Netflix Prize has become notable for its numerous contributions to the data science community.

According to Aggarwal (2016), the recommendation problem may be formulated in various ways, the two main ones being as follows. The first approach, the “prediction version of the problem”, aims to predict the rating value for a user-item combination. It is also referred to as the “matrix completion problem”. The second approach, the “ranking version of the problem”, seeks to recommend the top-k items for a particular user, or to determine the top-k users to target for a particular item. In the second case, the absolute values of the predicted ratings are not important. The first formulation is more general, because solutions to the second can be derived by solving the first for various user-item combinations and then ranking the predictions.
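The relationship between the two formulations can be illustrated with a minimal sketch in R. The predicted rating matrix below is made-up toy data, and top_k_items is a hypothetical helper, not part of any package; it only shows how a top-k list follows from predicted ratings by sorting.

```r
# Toy predicted-rating matrix: rows are users, columns are items.
# All values are invented for illustration only.
pred <- matrix(
  c(4.1, 2.3, 3.7, 4.8,
    1.9, 4.5, 3.2, 2.8,
    3.3, 3.9, 4.6, 2.1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("A", "B", "C", "D"))
)

# Hypothetical helper: for each user, return the names of the k items
# with the highest predicted rating (ranking derived from prediction).
top_k_items <- function(pred, k) {
  ranked <- apply(pred, 1, function(r) {
    names(sort(r, decreasing = TRUE))[seq_len(k)]
  })
  # apply() returns a k x n_users matrix (or a vector when k = 1);
  # reshape so that each row corresponds to one user.
  t(matrix(ranked, nrow = k, dimnames = list(NULL, rownames(pred))))
}

top2 <- top_k_items(pred, k = 2)
print(top2)
```

Only the order of the predictions within each row matters here, which is why the absolute rating values are unimportant in the ranking version.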

For this ML project, we use the TMDB 5000 Movie Dataset available on the Kaggle platform. This dataset was generated from The Movie Database API. The principal question arising from the description of the challenge is to predict which films will be highly rated, whether or not they are a commercial success. This means that it is mainly a “ranking version of the problem”, since it does not expect a submission file with predicted ratings for each film, but a list of the top-k recommended films. This approach will be studied, and the possible features processed and analyzed, with several machine learning techniques, focusing on content-based, collaborative filtering, and hybrid recommender systems as described in Adomavicius et al. (2010) and Zanker et al. ().

To achieve the goal of our analysis, we followed these steps:

II. Dataset and executive summary
III. Preprocessing
IV. Methods and modelling approaches
V. Results

II. Dataset and executive summary

1. Dataset overview

Loading libraries and data

library(plyr) # loaded before tidyverse so that plyr does not mask dplyr functions
library(tidyverse) # attaches ggplot2, dplyr, stringr, readr, among others
library(scales)
library(jsonlite) # JSON format
library(knitr)
library(kableExtra)
library(ggrepel)
library(gridExtra)
library(lubridate)
library(tidytext)
library(formattable)
library(splitstackshape)
library(wordcloud)
library(RColorBrewer)
library(ggthemes)
library(tm)
library(RSentiment)
library(zoo)

films <- read_csv(str_c(path, "tmdb_5000_movies.csv"), na="NA")
credits <- read_csv(str_c(path, "tmdb_5000_credits.csv"),  na="NA")

Summary

Movie dataset

class(films)

## [1] "tbl_df"     "tbl"        "data.frame"

glimpse(films)

## Observations: 4,803
## Variables: 20
## $ budget               <int> 237000000, 300000000, 245000000, 25000000...
## $ genres               <chr> "[{\"id\": 28, \"name\": \"Action\"}, {\"...
## $ homepage             <chr> "http://www.avatarmovie.com/", "http://di...
## $ id                   <int> 19995, 285, 206647, 49026, 49529, 559, 38...
## $ keywords             <chr> "[{\"id\": 1463, \"name\": \"culture clas...
## $ original_language    <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ original_title       <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ overview             <chr> "In the 22nd century, a paraplegic Marine...
## $ popularity           <dbl> 150.43758, 139.08262, 107.37679, 112.3129...
## $ production_companies <chr> "[{\"name\": \"Ingenious Film Partners\",...
## $ production_countries <chr> "[{\"iso_3166_1\": \"US\", \"name\": \"Un...
## $ release_date         <date> 2009-12-10, 2007-05-19, 2015-10-26, 2012...
## $ revenue              <dbl> 2787965087, 961000000, 880674609, 1084939...
## $ runtime              <int> 162, 169, 148, 165, 132, 139, 100, 141, 1...
## $ spoken_languages     <chr> "[{\"iso_639_1\": \"en\", \"name\": \"Eng...
## $ status               <chr> "Released", "Released", "Released", "Rele...
## $ tagline              <chr> "Enter the World of Pandora.", "At the en...
## $ title                <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ vote_average         <dbl> 7.2, 6.9, 6.3, 7.6, 6.1, 5.9, 7.4, 7.3, 7...
## $ vote_count           <int> 11800, 4500, 4466, 9106, 2124, 3576, 3330...

summary(films)

##      budget             genres            homepage        
##  Min.   :        0   Length:4803        Length:4803       
##  1st Qu.:   790000   Class :character   Class :character  
##  Median : 15000000   Mode  :character   Mode  :character  
##  Mean   : 29045040                                        
##  3rd Qu.: 40000000                                        
##  Max.   :380000000                                        
##                                                           
##        id           keywords         original_language  original_title    
##  Min.   :     5   Length:4803        Length:4803        Length:4803       
##  1st Qu.:  9014   Class :character   Class :character   Class :character  
##  Median : 14629   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 57166                                                           
##  3rd Qu.: 58611                                                           
##  Max.   :459488                                                           
##                                                                           
##    overview           popularity      production_companies
##  Length:4803        Min.   :  0.000   Length:4803         
##  Class :character   1st Qu.:  4.668   Class :character    
##  Mode  :character   Median : 12.922   Mode  :character    
##                     Mean   : 21.492                       
##                     3rd Qu.: 28.314                       
##                     Max.   :875.581                       
##                                                           
##  production_countries  release_date           revenue         
##  Length:4803          Min.   :1916-09-04   Min.   :0.000e+00  
##  Class :character     1st Qu.:1999-07-14   1st Qu.:0.000e+00  
##  Mode  :character     Median :2005-10-03   Median :1.917e+07  
##                       Mean   :2002-12-27   Mean   :8.226e+07  
##                       3rd Qu.:2011-02-16   3rd Qu.:9.292e+07  
##                       Max.   :2017-02-03   Max.   :2.788e+09  
##                       NA's   :1                               
##     runtime    spoken_languages      status            tagline         
##  Min.   :  0   Length:4803        Length:4803        Length:4803       
##  1st Qu.: 94   Class :character   Class :character   Class :character  
##  Median :104   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :107                                                           
##  3rd Qu.:118                                                           
##  Max.   :338                                                           
##  NA's   :80                                                            
##     title            vote_average      vote_count     
##  Length:4803        Min.   : 0.000   Min.   :    0.0  
##  Class :character   1st Qu.: 5.600   1st Qu.:   54.0  
##  Mode  :character   Median : 6.200   Median :  235.0  
##                     Mean   : 6.092   Mean   :  690.2  
##                     3rd Qu.: 6.800   3rd Qu.:  737.0  
##                     Max.   :10.000   Max.   :13752.0  
##

Credit dataset

class(credits)

## [1] "tbl_df"     "tbl"        "data.frame"

glimpse(credits)

## Observations: 4,803
## Variables: 4
## $ movie_id <int> 19995, 285, 206647, 49026, 49529, 559, 38757, 99861, ...
## $ title    <chr> "Avatar", "Pirates of the Caribbean: At World's End",...
## $ cast     <chr> "[{\"cast_id\": 242, \"character\": \"Jake Sully\", \...
## $ crew     <chr> "[{\"credit_id\": \"52fe48009251416c750aca23\", \"dep...

summary(credits)

##     movie_id         title               cast               crew          
##  Min.   :     5   Length:4803        Length:4803        Length:4803       
##  1st Qu.:  9014   Class :character   Class :character   Class :character  
##  Median : 14629   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 57166                                                           
##  3rd Qu.: 58611                                                           
##  Max.   :459488

After loading the two provided files “tmdb_5000_movies.csv” and “tmdb_5000_credits.csv”, we can see that the movie dataset has 20 features and 4,803 observations, while the credit dataset contains the same number of observations but only 4 attributes. Across the two datasets, the key feature and common identifier is the movie ID, named id in the movie dataset and movie_id in the credit dataset. The credit dataset has two attributes, “cast” and “crew”, that are not present in the movie dataset. Here are some attributes and their characteristics:

-movie_id (credit dataset) or id (movie dataset): numeric, unique ID for the movie.

-budget: numeric, financial investment in the production of a movie.

-etc
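Since id and movie_id identify the same film, the two tables can be combined with a join. The following is a minimal sketch on toy data, not the report's actual preprocessing: films_toy and credits_toy are hypothetical stand-ins for films and credits, and the cast and crew strings are placeholders. The dplyr:: prefix is used explicitly because plyr is also loaded in this report.

```r
library(dplyr)

# Toy stand-ins for the two tables (ids and titles taken from the glimpse output;
# cast and crew are placeholder strings).
films_toy <- tibble(
  id = c(19995, 285),
  title = c("Avatar", "Pirates of the Caribbean: At World's End"),
  vote_average = c(7.2, 6.9)
)
credits_toy <- tibble(
  movie_id = c(285, 19995),
  title = c("Pirates of the Caribbean: At World's End", "Avatar"),
  cast = c("[]", "[]"),
  crew = c("[]", "[]")
)

# Drop the duplicated title column from the credit table, then join on the
# movie identifier, which has a different name in each table.
movies_full <- films_toy %>%
  dplyr::inner_join(
    dplyr::select(credits_toy, -title),
    by = c("id" = "movie_id")
  )

print(movies_full)
```

An inner join keeps only films present in both tables; a left join would be the alternative if films without credits should be retained.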

As explained in the Kaggle overview of the competition, some of the columns are in JSON format. Looking at the output of the glimpse function, we can recognize them because their values start with “[{”, that is, a JSON array of objects: genres, keywords, production_companies, production_countries, and spoken_languages in the movie dataset, as well as cast and crew in the credit dataset.

Data Wrangling

To extract the information contained in the JSON-formatted attributes, we apply the following code:

#Loading.....
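As an illustration of this kind of extraction, here is a minimal sketch using jsonlite::fromJSON on toy data. It is an assumption about how the step could look, not the report's final code: films_toy is a hypothetical two-row stand-in for films, the genre values are illustrative, and extract_names is a hypothetical helper.

```r
library(jsonlite)

# Toy stand-in for the films table; the genre strings are illustrative.
films_toy <- data.frame(
  id = c(19995, 285),
  genres = c(
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]' # a film with no genre listed
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse one JSON string and return the "name" values.
# fromJSON() returns a data frame for an array of objects and an empty
# list for "[]", so the empty case is handled explicitly.
extract_names <- function(json_string) {
  parsed <- fromJSON(json_string)
  if (is.data.frame(parsed)) parsed$name else character(0)
}

# Build a long table with one row per (film, genre) pair.
genres_long <- do.call(rbind, lapply(seq_len(nrow(films_toy)), function(i) {
  g <- extract_names(films_toy$genres[i])
  if (length(g) == 0) return(NULL)
  data.frame(id = films_toy$id[i], genre = g, stringsAsFactors = FALSE)
}))

print(genres_long)
```

The same helper could in principle be applied to the other JSON columns, since they share the array-of-objects layout, although their field names differ (for example iso_3166_1 in production_countries).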

comment of the results…. etc..

2. Preliminary descriptive statistics

numeric variables: id, budget, revenue, popularity, runtime, vote_average, vote_count

Loading….
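The summary output above shows a minimum budget of 0, a first-quartile revenue of 0, and missing values in runtime, so a first check on the numeric variables could count zeros and NAs per column. The sketch below runs on a hypothetical toy table (films_toy, with invented rows), and whether a zero actually encodes a missing value is an assumption to be verified against the data.

```r
# Toy stand-in for the numeric columns of the films table.
films_toy <- data.frame(
  budget  = c(237000000, 0, 245000000, 0),
  revenue = c(2787965087, 0, 880674609, 0),
  runtime = c(162, NA, 148, 0)
)

# Count zeros and missing values for each numeric variable.
quality <- data.frame(
  variable = names(films_toy),
  n_zero = vapply(films_toy, function(x) sum(x == 0, na.rm = TRUE), integer(1)),
  n_na = vapply(films_toy, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)

print(quality)
```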

nominal variables: genres, title, etc.

Loading….

III. Preprocessing

Loading….

IV. Methods and modelling approaches

Loading….

V. Results

Loading…

VI. Conclusion

Loading…

References

Loading…

Recommendation systems: TMDB 5000 Movie Dataset

Team38

January 6th, 2019
