Movie Success Criteria Analysis (Using IMDB datasets)

Data Import

IMDB 5000 Movie Dataset from Kaggle

This dataset contains 28 variables for 5,043 movies, spanning 100 years and 66 countries. There are 2,399 unique director names and thousands of actors and actresses.

Since Kaggle requires a login to download the dataset, I am sourcing the same data from a copy in my GitHub repository (the URL appears in the import code below).

More information about the dataset can be found on its Kaggle page.

Data Import Code:

library(tibble)
url <- "https://github.com/yash91sharma/projectX/raw/master/movie_metadata.csv"
# as_tibble() replaces the deprecated as_data_frame()
movie <- as_tibble(read.csv(url, stringsAsFactors = FALSE))
class(movie)
## [1] "tbl_df"     "tbl"        "data.frame"
colnames(movie)
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"
dim(movie)
## [1] 5043   28
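
The counts quoted in the dataset description (unique directors, countries, span of years) can be checked with a few lines of base R. The sketch below is only an illustration: it runs on a small made-up data frame named toy rather than on movie, and the helper n_distinct_nonblank is a hypothetical name introduced here. Whether the 2,399 figure counts blank director names is not stated, so the helper's choice to skip blanks is an assumption.

```r
# Toy stand-in for `movie` (made-up rows; only the column names match the dataset).
toy <- data.frame(
  director_name = c("A. Director", "B. Director", "A. Director", ""),
  country       = c("USA", "UK", "USA", "France"),
  title_year    = c(1994L, 2010L, 2001L, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count distinct values, ignoring NA and blank strings.
n_distinct_nonblank <- function(x) {
  x <- trimws(x)
  length(unique(x[!is.na(x) & x != ""]))
}

n_directors <- n_distinct_nonblank(toy$director_name)
n_countries <- n_distinct_nonblank(toy$country)

# Earliest and latest release year, skipping missing years.
year_range <- range(toy$title_year, na.rm = TRUE)

print(c(directors = n_directors, countries = n_countries))
print(year_range)
```

Replacing toy with movie would give the corresponding figures for the real data.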

Data Preview

Preview of the first 50 rows of the dataset:

library(DT)
datatable(head(movie,50))

Data Description

The table below lists the variable names, their data types, and a short description of each.

library(knitr)
# One class name per column, as a character vector
Variable.type <- sapply(movie, class)
# Descriptions, in the same order as colnames(movie)
Variable.desc <- c("Whether the movie is color or black & white",
"Name of movie director","Number of critic reviews",
"Duration of the movie (minutes)","Number of likes on director's FB page",
"Number of likes on 3rd actor's FB page","Name of second actor",
"Number of likes on 1st actor's FB page","Gross earning by the movie ($)",
"Genres of the movie","Name of the first actor",
"Title of the movie","Number of users who voted on IMDB",
"Total Facebook likes for all cast members","Name of the third actor",
"Number of faces featured in the movie poster",
"Keywords describing the movie plot","IMDB link of the movie",
"Number of users who gave a review","Language of the movie",
"Country the movie was produced in",
"Content rating of the movie","Budget of the movie ($)",
"Year the movie was released","Number of likes on 2nd actor's FB page",
"IMDB score for the movie (out of 10)","Aspect ratio the movie was made in",
"Number of Facebook likes for the movie")
Variable.name1 <- colnames(movie)
# Build the description table directly, with one row per variable
data.desc <- tibble("Variable Name" = Variable.name1,
                    "Data Type" = unname(Variable.type),
                    "Variable Description" = Variable.desc)
kable(data.desc)

| Variable Name | Data Type | Variable Description |
|:---|:---|:---|
| color | character | Whether the movie is color or black & white |
| director_name | character | Name of movie director |
| num_critic_for_reviews | integer | Number of critic reviews |
| duration | integer | Duration of the movie (minutes) |
| director_facebook_likes | integer | Number of likes on director's FB page |
| actor_3_facebook_likes | integer | Number of likes on 3rd actor's FB page |
| actor_2_name | character | Name of second actor |
| actor_1_facebook_likes | integer | Number of likes on 1st actor's FB page |
| gross | integer | Gross earning by the movie ($) |
| genres | character | Genres of the movie |
| actor_1_name | character | Name of the first actor |
| movie_title | character | Title of the movie |
| num_voted_users | integer | Number of users who voted on IMDB |
| cast_total_facebook_likes | integer | Total Facebook likes for all cast members |
| actor_3_name | character | Name of the third actor |
| facenumber_in_poster | integer | Number of faces featured in the movie poster |
| plot_keywords | character | Keywords describing the movie plot |
| movie_imdb_link | character | IMDB link of the movie |
| num_user_for_reviews | integer | Number of users who gave a review |
| language | character | Language of the movie |
| country | character | Country the movie was produced in |
| content_rating | character | Content rating of the movie |
| budget | numeric | Budget of the movie ($) |
| title_year | integer | Year the movie was released |
| actor_2_facebook_likes | integer | Number of likes on 2nd actor's FB page |
| imdb_score | numeric | IMDB score for the movie (out of 10) |
| aspect_ratio | numeric | Aspect ratio the movie was made in |
| movie_facebook_likes | integer | Number of Facebook likes for the movie |

Missing Values

There are 3,801 rows with no NA in any column. Missing values appear as blanks in character variables and as NAs in numeric variables. Because complete.cases() only checks for NA, rows whose only gaps are blank strings are still counted as complete here.

print(paste(sum(complete.cases(movie)),"Complete cases!"))
## [1] "3801 Complete cases!"
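
Since blanks in character columns are not picked up by complete.cases(), one option is to recode them as NA before counting. The sketch below shows this on a small made-up data frame named toy rather than on movie. The helper blank_to_na is a hypothetical name introduced here, and treating whitespace-only strings as missing is an assumption.

```r
# Toy stand-in for `movie` (made-up rows; only the column names match the dataset).
toy <- data.frame(
  director_name = c("A. Director", "", "C. Director", "D. Director"),
  color         = c("Color", "Color", " ", "Color"),
  gross         = c(1000L, NA, 2500L, 400L),
  imdb_score    = c(7.1, 6.4, 8.0, 5.5),
  stringsAsFactors = FALSE
)

# complete.cases() only looks at NA, so blank strings do not count as missing.
n_na_only <- sum(complete.cases(toy))

# Hypothetical helper: turn empty or whitespace-only strings into NA.
blank_to_na <- function(x) {
  if (is.character(x)) x[!is.na(x) & trimws(x) == ""] <- NA
  x
}

# Apply the helper to every column while keeping the data frame structure.
toy_clean <- toy
toy_clean[] <- lapply(toy, blank_to_na)

# Stricter count of complete rows, plus the number of missing values per column.
n_strict <- sum(complete.cases(toy_clean))
missing_per_column <- colSums(is.na(toy_clean))

print(c(na_only = n_na_only, strict = n_strict))
print(missing_per_column)
```

Applied to movie, the stricter count would be at most 3,801, since recoding blanks can only add missing values.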

Planned Analysis

I would like to analyze what kinds of movies are more successful, using the IMDB score of the movie as my dependent variable. It would be interesting to understand which characteristics make a movie more successful than others.

I would like to present the results of this analysis as intuitively as possible, using charts and figures. I will primarily use the ggplot2 and plotly packages in R.
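
As a rough illustration of the kind of comparison planned here, the sketch below computes the mean IMDB score per group and sorts the groups from highest to lowest. It runs on a small made-up data frame named toy, and the choice of content_rating as the grouping variable is only an assumption for the example, not a decision taken in this analysis.

```r
# Toy stand-in for `movie` (made-up rows; only the column names match the dataset).
toy <- data.frame(
  content_rating = c("PG-13", "R", "PG-13", "R", "PG"),
  imdb_score     = c(7.0, 6.0, 8.0, 6.5, 7.2),
  stringsAsFactors = FALSE
)

# Mean IMDB score for each content rating.
score_by_rating <- aggregate(imdb_score ~ content_rating, data = toy, FUN = mean)

# Sort so the highest-scoring group comes first.
score_by_rating <- score_by_rating[order(-score_by_rating$imdb_score), ]

print(score_by_rating)
```

A summary table like score_by_rating could then be passed to ggplot2, for example as a bar chart with geom_col().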