Data Wrangling Project

Movie Success Criteria Analysis (Using IMDB datasets)

Data Import

IMDB 5000 Movie Dataset from Kaggle

This dataset contains 28 variables for 5,043 movies, spanning 100 years and 66 countries. There are 2,399 unique director names and thousands of actors and actresses.

Since Kaggle requires a username and password to download the dataset, I am sourcing the same data from a copy in my GitHub repository (the URL appears in the import code below).

More information about the dataset can be found on the Kaggle page here.

Data Import Code:

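library(tibble)
# Read the CSV from GitHub and convert it to a tibble
# (as_tibble() replaces the deprecated as_data_frame())
url <- "https://github.com/yash91sharma/projectX/raw/master/movie_metadata.csv"
movie <- as_tibble(read.csv(url, stringsAsFactors = FALSE))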
class(movie)

## [1] "tbl_df"     "tbl"        "data.frame"

colnames(movie)

##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"

dim(movie)

## [1] 5043   28
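The headline figures quoted in the overview (number of movies, variables, unique directors, and countries) can be checked with a few lines of base R. The sketch below is only an illustration: `overview_counts` is a hypothetical helper, the toy data frame is made up, and it assumes the column names `director_name`, `country`, and `title_year` shown by `colnames(movie)`. It skips blank strings and NAs when counting distinct values; whether the quoted figures were computed that way is not stated here.

```r
# Summarise the headline figures of a movie data frame.
# Hypothetical helper; assumes columns director_name, country and title_year.
overview_counts <- function(df) {
  # Drop NA and blank entries before counting distinct values
  non_missing <- function(x) x[!is.na(x) & trimws(as.character(x)) != ""]
  list(
    n_movies    = nrow(df),
    n_variables = ncol(df),
    n_directors = length(unique(non_missing(df$director_name))),
    n_countries = length(unique(non_missing(df$country))),
    year_range  = range(non_missing(df$title_year))
  )
}

# Toy data, made up for illustration (not taken from the dataset)
toy <- data.frame(
  director_name = c("A", "B", "A", ""),
  country       = c("USA", "UK", "USA", "France"),
  title_year    = c(1999L, 2005L, NA, 2010L),
  stringsAsFactors = FALSE
)
counts <- overview_counts(toy)
str(counts)
```

Calling `overview_counts(movie)` after the import step would give the corresponding figures for the real data.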

Data Preview

Preview of the first 50 rows of the dataset:

library(DT)
datatable(head(movie,50))

Data Description

Below is a table containing the variable names, types, and short descriptions.

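# One class name per column, returned as a character vector rather than a list
Variable.type <- vapply(movie, function(x) class(x)[1], character(1))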
Variable.desc <- c("Specifies if it was color/black & white movie",
"Name of movie director","Number of critics who reviewed",
"Duration of the movie (minutes)","Number of likes on director's FB page",
"Number of likes on 3rd actor's FB page","Name of second actor",
"Number of likes on 1st actor's FB page","Gross earning by the movie ($)",
"Genres of the movie","Name of the first actor",
"Title of the movie","Number of users voted on IMDB",
"Total facebook likes for all cast members","Name of the third actor",
"Number of the actor who featured in the movie poster",
"Keywords describing the movie plot","IMDB link of the movie",
"Number of users who gave a review","Language of the movie",
"Country the movie was produced in",
"Content rating of the movie","Budget of the movie ($)",
"Year the movie released in","Number of facebook likes for actor 2",
"IMDB score for the movie (out of 10)","Aspect ratio the movie was made in",
"Number of facebook likes")
Variable.name1 <- colnames(movie)
# as_tibble() replaces the deprecated as_data_frame()
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)

Variable Name	Data Type	Variable Description
color	character	Specifies if it was color/black & white movie
director_name	character	Name of movie director
num_critic_for_reviews	integer	Number of critics who reviewed
duration	integer	Duration of the movie (minutes)
director_facebook_likes	integer	Number of likes on director’s FB page
actor_3_facebook_likes	integer	Number of likes on 3rd actor’s FB page
actor_2_name	character	Name of second actor
actor_1_facebook_likes	integer	Number of likes on 1st actor’s FB page
gross	integer	Gross earning by the movie ($)
genres	character	Genres of the movie
actor_1_name	character	Name of the first actor
movie_title	character	Title of the movie
num_voted_users	integer	Number of users voted on IMDB
cast_total_facebook_likes	integer	Total facebook likes for all cast members
actor_3_name	character	Name of the third actor
facenumber_in_poster	integer	Number of the actor who featured in the movie poster
plot_keywords	character	Keywords describing the movie plot
movie_imdb_link	character	IMDB link of the movie
num_user_for_reviews	integer	Number of users who gave a review
language	character	Language of the movie
country	character	Country the movie was produced in
content_rating	character	Content rating of the movie
budget	numeric	Budget of the movie ($)
title_year	integer	Year the movie released in
actor_2_facebook_likes	integer	Number of facebook likes for actor 2
imdb_score	numeric	IMDB score for the movie (out of 10)
aspect_ratio	numeric	Aspect ratio the movie was made in
movie_facebook_likes	integer	Number of facebook likes

Missing Values

There are 3,801 rows with no NA values. Missing character values appear as blank strings, while missing numeric values appear as NA. Because complete.cases() checks only for NA, rows whose only gaps are blank strings still count as complete.

print(paste(sum(complete.cases(movie)),"Complete cases!"))

## [1] "3801 Complete cases!"
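Since `complete.cases()` only looks for NA, blank strings in character columns are not counted as missing. A minimal sketch of one way to count both kinds of gaps follows. The helpers `is_missing`, `count_missing`, and `n_complete_rows` are hypothetical names introduced for illustration, and the toy data frame is made up.

```r
# TRUE where a value is missing: NA for any type, or a blank string for text
is_missing <- function(col) {
  if (is.character(col)) is.na(col) | trimws(col) == "" else is.na(col)
}

# Number of missing values in each column
count_missing <- function(df) {
  vapply(df, function(col) sum(is_missing(col)), integer(1))
}

# Number of rows with no NA and no blank string in any column
n_complete_rows <- function(df) {
  any_missing <- Reduce(`|`, lapply(df, is_missing))
  sum(!any_missing)
}

# Toy data, made up for illustration (not taken from the dataset)
toy <- data.frame(
  director_name = c("A", "", "C"),
  budget        = c(1e6, 2e6, NA),
  imdb_score    = c(7.1, 6.5, 8.0),
  stringsAsFactors = FALSE
)

print(count_missing(toy))
print(sum(complete.cases(toy)))  # NA-only definition of "complete"
print(n_complete_rows(toy))      # also treats blank strings as missing
```

On the toy data the two definitions disagree, because the second row has a blank director name but no NA. Running `count_missing(movie)` would show which columns of the real dataset are affected.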

Planned Analysis

I would like to analyze what kinds of movies are more successful, using the IMDB score of the movie as my dependent variable. It would be interesting to understand which characteristics make a movie more successful than others.

I would like to present the results of this analysis as intuitively as possible, using charts and figures. I will primarily use the ggplot2 and plotly packages in R.
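As a rough sketch of what a first step in this analysis might look like, the code below averages the IMDB score within each level of a grouping variable and, if ggplot2 is installed, builds a bar chart from the result. This is an assumption about one possible approach, not the final analysis: `score_by_group` is a hypothetical helper, `content_rating` is just one candidate grouping column, and the toy data is made up.

```r
# Mean score and movie count per level of a grouping column.
# Hypothetical helper; rows with a missing or blank group, or a missing score, are dropped.
score_by_group <- function(df, group_col, score_col = "imdb_score") {
  keep <- !is.na(df[[score_col]]) & !is.na(df[[group_col]]) & df[[group_col]] != ""
  d <- df[keep, , drop = FALSE]
  means  <- tapply(d[[score_col]], d[[group_col]], mean)
  counts <- tapply(d[[score_col]], d[[group_col]], length)
  out <- data.frame(
    level      = names(means),
    mean_score = as.numeric(means),
    n          = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Highest average score first
  out[order(-out$mean_score), ]
}

# Toy data, made up for illustration (not taken from the dataset)
toy <- data.frame(
  content_rating = c("PG", "R", "PG", "R", ""),
  imdb_score     = c(7.0, 6.0, 8.0, 5.0, 9.0),
  stringsAsFactors = FALSE
)
summary_tbl <- score_by_group(toy, "content_rating")
print(summary_tbl)

# Optional chart; only built when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(summary_tbl, ggplot2::aes(x = level, y = mean_score)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Content rating", y = "Mean IMDB score")
  # print(p) renders the chart
}
```

The same helper could be pointed at other character columns of `movie`, such as `country` or `language`, to compare average scores across groups.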