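on_kaggle <- 1 # 1 = running on Kaggle, 0 = running locally
if (on_kaggle == 0) {
  # load local copy of dataset; getwd() has no trailing slash,
  # so add one because the file name is appended directly to path
  path <- paste0(getwd(), "/")
} else {
  path <- "" # access dataset on Kaggle
}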
Recommender systems are one of the most successful and widespread applications of machine learning technologies in business. They are a subclass of information filtering systems that seek to predict the “rating” or “preference” a user would give to an item. One of the most famous success stories of recommender systems is the Netflix competition, launched in October 2006. In 2009, at the end of the challenge, Netflix awarded a one-million-dollar prize to a developer team for an algorithm that increased the accuracy of the company’s recommendation engine by 10%. Many well-known recommendation algorithms, such as latent factor models, were popularized by the Netflix contest. The Netflix Prize contest has become notable for its numerous contributions to the data science community.
According to Aggarwal (2016), the recommendation problem may be formulated in various ways, of which the two main ones are as follows. The first approach, the “prediction version of problem”, aims to predict the rating value for a user-item combination. It is also referred to as the “matrix completion problem”. The second approach, the “ranking version of problem”, seeks to recommend the top-k items for a particular user, or to determine the top-k users to target for a particular item. In the second case, the absolute values of the predicted ratings are not important. The first formulation is more general, because solutions to the second can be derived by solving the first for various user-item combinations and then ranking the predictions.
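The link between the two formulations can be illustrated with a minimal sketch. The matrix `pred` and the helper `top_k_items()` are hypothetical names introduced only for this illustration, and the rating values are made up; the point is simply that a top-k list falls out of ranking predicted ratings.

```r
# Hypothetical predicted ratings: rows are users, columns are items.
pred <- matrix(
  c(4.1, 2.3, 3.8, 4.9,
    1.5, 4.7, 3.2, 2.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("user_1", "user_2"),
                  c("item_a", "item_b", "item_c", "item_d"))
)

# Rank one user's predicted ratings and keep the names of the k best items.
top_k_items <- function(pred, user, k = 2) {
  scores <- sort(pred[user, ], decreasing = TRUE)
  names(scores)[seq_len(min(k, length(scores)))]
}

top_k_items(pred, "user_1")
top_k_items(pred, "user_2")
```

Only the order of the scores matters here, which is why the absolute rating values are irrelevant in the ranking version.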
For this ML project, we use the TMDB 5000 Movie Dataset available on the Kaggle platform. This dataset was generated from The Movie Database API. The principal question arising from the description of the challenge is to predict which films will be highly rated, whether or not they are a commercial success. This means that it is mainly a “ranking version of problem”, since it does not expect a submission file with predicted ratings for each film, but a list of the top-k recommended films. We study this approach and process and analyze the candidate features with several machine learning techniques, focusing on content-based, collaborative filtering, and hybrid recommender systems as described in Adomavicius et al. (2010) and Zanker et al. ().
To achieve the goal of our analysis, we followed these steps:
Loading libraries and data
library(plyr) # load plyr before the tidyverse so it does not mask dplyr functions
library(tidyverse)
library(scales)
library(jsonlite) # JSON format
library(knitr)
library(kableExtra)
library(ggrepel)
library(gridExtra)
library(lubridate)
library(tidytext)
library(formattable)
library(splitstackshape)
library(wordcloud)
library(RColorBrewer)
library(ggthemes)
library(tm)
library(RSentiment)
library(zoo)
library(stringr)
library(ggplot2)
library(readr)
films <- read_csv(str_c(path, "tmdb_5000_movies.csv"), na="NA")
credits <- read_csv(str_c(path, "tmdb_5000_credits.csv"), na="NA")
Summary
class(films)
## [1] "tbl_df" "tbl" "data.frame"
glimpse(films)
## Observations: 4,803
## Variables: 20
## $ budget <int> 237000000, 300000000, 245000000, 25000000...
## $ genres <chr> "[{\"id\": 28, \"name\": \"Action\"}, {\"...
## $ homepage <chr> "http://www.avatarmovie.com/", "http://di...
## $ id <int> 19995, 285, 206647, 49026, 49529, 559, 38...
## $ keywords <chr> "[{\"id\": 1463, \"name\": \"culture clas...
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ original_title <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ overview <chr> "In the 22nd century, a paraplegic Marine...
## $ popularity <dbl> 150.43758, 139.08262, 107.37679, 112.3129...
## $ production_companies <chr> "[{\"name\": \"Ingenious Film Partners\",...
## $ production_countries <chr> "[{\"iso_3166_1\": \"US\", \"name\": \"Un...
## $ release_date <date> 2009-12-10, 2007-05-19, 2015-10-26, 2012...
## $ revenue <dbl> 2787965087, 961000000, 880674609, 1084939...
## $ runtime <int> 162, 169, 148, 165, 132, 139, 100, 141, 1...
## $ spoken_languages <chr> "[{\"iso_639_1\": \"en\", \"name\": \"Eng...
## $ status <chr> "Released", "Released", "Released", "Rele...
## $ tagline <chr> "Enter the World of Pandora.", "At the en...
## $ title <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ vote_average <dbl> 7.2, 6.9, 6.3, 7.6, 6.1, 5.9, 7.4, 7.3, 7...
## $ vote_count <int> 11800, 4500, 4466, 9106, 2124, 3576, 3330...
summary(films)
## budget genres homepage
## Min. : 0 Length:4803 Length:4803
## 1st Qu.: 790000 Class :character Class :character
## Median : 15000000 Mode :character Mode :character
## Mean : 29045040
## 3rd Qu.: 40000000
## Max. :380000000
##
## id keywords original_language original_title
## Min. : 5 Length:4803 Length:4803 Length:4803
## 1st Qu.: 9014 Class :character Class :character Class :character
## Median : 14629 Mode :character Mode :character Mode :character
## Mean : 57166
## 3rd Qu.: 58611
## Max. :459488
##
## overview popularity production_companies
## Length:4803 Min. : 0.000 Length:4803
## Class :character 1st Qu.: 4.668 Class :character
## Mode :character Median : 12.922 Mode :character
## Mean : 21.492
## 3rd Qu.: 28.314
## Max. :875.581
##
## production_countries release_date revenue
## Length:4803 Min. :1916-09-04 Min. :0.000e+00
## Class :character 1st Qu.:1999-07-14 1st Qu.:0.000e+00
## Mode :character Median :2005-10-03 Median :1.917e+07
## Mean :2002-12-27 Mean :8.226e+07
## 3rd Qu.:2011-02-16 3rd Qu.:9.292e+07
## Max. :2017-02-03 Max. :2.788e+09
## NA's :1
## runtime spoken_languages status tagline
## Min. : 0 Length:4803 Length:4803 Length:4803
## 1st Qu.: 94 Class :character Class :character Class :character
## Median :104 Mode :character Mode :character Mode :character
## Mean :107
## 3rd Qu.:118
## Max. :338
## NA's :80
## title vote_average vote_count
## Length:4803 Min. : 0.000 Min. : 0.0
## Class :character 1st Qu.: 5.600 1st Qu.: 54.0
## Mode :character Median : 6.200 Median : 235.0
## Mean : 6.092 Mean : 690.2
## 3rd Qu.: 6.800 3rd Qu.: 737.0
## Max. :10.000 Max. :13752.0
##
class(credits)
## [1] "tbl_df" "tbl" "data.frame"
glimpse(credits)
## Observations: 4,803
## Variables: 4
## $ movie_id <int> 19995, 285, 206647, 49026, 49529, 559, 38757, 99861, ...
## $ title <chr> "Avatar", "Pirates of the Caribbean: At World's End",...
## $ cast <chr> "[{\"cast_id\": 242, \"character\": \"Jake Sully\", \...
## $ crew <chr> "[{\"credit_id\": \"52fe48009251416c750aca23\", \"dep...
summary(credits)
## movie_id title cast crew
## Min. : 5 Length:4803 Length:4803 Length:4803
## 1st Qu.: 9014 Class :character Class :character Class :character
## Median : 14629 Mode :character Mode :character Mode :character
## Mean : 57166
## 3rd Qu.: 58611
## Max. :459488
After loading the two provided files, “tmdb_5000_movies.csv” and “tmdb_5000_credits.csv”, we can see that the movie dataset has 20 features and 4,803 observations, while the credit dataset contains the same number of observations but only 4 attributes. Across the two datasets, the key feature and common identifier is the movie ID (id in the movie dataset, movie_id in the credit dataset). The credit dataset has two attributes, “cast” and “crew”, that are not present in the movie dataset. Here are some attributes and their characteristics:
-movie_id (credit dataset) or id (movie dataset): numeric, unique ID for the movie.
-budget: numeric, financial investment for the production of a movie.
-etc
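Because the identifier is named id in one file and movie_id in the other, combining the two tables needs an explicit key mapping. The following is a minimal sketch on toy stand-ins (`films_toy` and `credits_toy` are hypothetical, and the cast strings are placeholders rather than real data). It assumes dplyr is installed, and calls it with the `dplyr::` prefix so that masking by other packages does not matter.

```r
# Toy stand-ins for the two files; only the key columns matter here.
films_toy <- data.frame(
  id = c(19995, 285),
  title = c("Avatar", "Pirates of the Caribbean: At World's End"),
  budget = c(237000000, 300000000),
  stringsAsFactors = FALSE
)
credits_toy <- data.frame(
  movie_id = c(285, 19995),
  title = c("Pirates of the Caribbean: At World's End", "Avatar"),
  cast = c("placeholder cast for Pirates", "placeholder cast for Avatar"),
  stringsAsFactors = FALSE
)

# Drop the duplicated title column, then match id against movie_id.
joined <- dplyr::left_join(
  films_toy,
  dplyr::select(credits_toy, -title),
  by = c("id" = "movie_id")
)

joined
```

A left join keeps every film even if its credits were missing; whether that is the right choice for the full analysis is a design decision this sketch does not settle.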
As explained in the Kaggle overview of the competition, some of the columns are in JSON format. Looking at the output of the glimpse function, we can recognize them because their values start with “[{”: genres, keywords, production_companies, production_countries, and spoken_languages in the movie dataset, and cast and crew in the credit dataset.
Data Wrangling
To extract the information contained in the JSON-format attributes, we parse each of these columns.
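One possible way to do this parsing is sketched below; it is an illustration under stated assumptions, not necessarily the code used in this analysis. It assumes the jsonlite package, works on a toy data frame (`films_toy`) whose genres strings imitate the shape shown by glimpse, and uses a hypothetical helper, `parse_json_column()`, that expects each JSON object to have `id` and `name` fields.

```r
library(jsonlite)

# Toy rows in the same shape as the genres column (values are illustrative).
films_toy <- data.frame(
  id = c(19995, 285),
  genres = c(
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 12, "name": "Adventure"}]'
  ),
  stringsAsFactors = FALSE
)

# Parse every JSON string and tag the resulting rows with the movie id,
# producing one row per (movie, genre) pair.
parse_json_column <- function(ids, json_strings) {
  pieces <- Map(function(movie_id, txt) {
    parsed <- fromJSON(txt)
    # An empty array "[]" does not parse to a data frame, so skip it.
    if (!is.data.frame(parsed) || nrow(parsed) == 0) return(NULL)
    data.frame(
      movie_id = movie_id,
      genre_id = parsed$id,
      genre = parsed$name,
      stringsAsFactors = FALSE
    )
  }, ids, json_strings)
  pieces <- Filter(Negate(is.null), pieces)
  do.call(rbind, pieces)
}

genres_long <- parse_json_column(films_toy$id, films_toy$genres)
genres_long
```

The long format (one row per movie and genre) is convenient for counting and plotting; the same pattern would apply to keywords or production_companies as long as the field names inside the JSON objects are adjusted.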
comment of the results…. etc..
numeric variables: id, budget, revenue, popularity, runtime, vote_average, vote_count
nominal variables: genres, title, etc.
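As a rough cross-check of such a grouping, the storage type of each column can be inspected programmatically. The sketch below uses a toy data frame (`films_toy`, hypothetical) with a few of the columns named above. Note that storage type is only a first approximation: id is stored as a number but acts as an identifier rather than a quantity.

```r
# Toy data frame with a mix of numeric and character columns.
films_toy <- data.frame(
  id = c(19995, 285),
  budget = c(237000000, 300000000),
  vote_average = c(7.2, 6.9),
  title = c("Avatar", "Pirates of the Caribbean: At World's End"),
  original_language = c("en", "en"),
  stringsAsFactors = FALSE
)

# Split column names by storage type.
numeric_vars <- names(films_toy)[vapply(films_toy, is.numeric, logical(1))]
character_vars <- names(films_toy)[vapply(films_toy, is.character, logical(1))]

numeric_vars
character_vars
```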