IMDB 5000 Movie Dataset from Kaggle
This dataset contains 28 variables for 5,043 movies, spanning 100 years and 66 countries. There are 2,399 unique director names and thousands of actors and actresses.
Because Kaggle requires a login to download the dataset, I am sourcing the same data from a copy in my GitHub repository.
More information about the dataset can be found on its Kaggle page.
Data Import Code:
library(tibble)
url <- "https://github.com/yash91sharma/projectX/raw/master/movie_metadata.csv"
# as_tibble() replaces the deprecated as_data_frame()
movie <- as_tibble(read.csv(url, stringsAsFactors = FALSE))
class(movie)
## [1] "tbl_df" "tbl" "data.frame"
colnames(movie)
## [1] "color" "director_name"
## [3] "num_critic_for_reviews" "duration"
## [5] "director_facebook_likes" "actor_3_facebook_likes"
## [7] "actor_2_name" "actor_1_facebook_likes"
## [9] "gross" "genres"
## [11] "actor_1_name" "movie_title"
## [13] "num_voted_users" "cast_total_facebook_likes"
## [15] "actor_3_name" "facenumber_in_poster"
## [17] "plot_keywords" "movie_imdb_link"
## [19] "num_user_for_reviews" "language"
## [21] "country" "content_rating"
## [23] "budget" "title_year"
## [25] "actor_2_facebook_likes" "imdb_score"
## [27] "aspect_ratio" "movie_facebook_likes"
dim(movie)
## [1] 5043 28
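The headline figures quoted in the introduction (unique directors, countries, span of years) can be checked once the data is loaded. The following is a minimal base R sketch, not the code used to produce those figures: `summarise_movies` is a hypothetical helper, the toy data frame only stands in for `movie`, and treating blank strings as missing is my assumption.

```r
# Hypothetical helper: count distinct non-missing values in a few key columns.
summarise_movies <- function(df) {
  # Treat blank strings as missing so they are not counted as a name.
  distinct_non_blank <- function(x) {
    x <- trimws(x)
    length(unique(x[!is.na(x) & x != ""]))
  }
  list(
    directors = distinct_non_blank(df$director_name),
    countries = distinct_non_blank(df$country),
    year_span = diff(range(df$title_year, na.rm = TRUE))
  )
}

# Toy stand-in for `movie`, using the same column names.
toy <- data.frame(
  director_name = c("A. Director", "B. Director", "", "A. Director"),
  country       = c("USA", "UK", "USA", NA),
  title_year    = c(1999L, 2010L, NA, 2016L),
  stringsAsFactors = FALSE
)
toy_summary <- summarise_movies(toy)
print(toy_summary)
```

Calling `summarise_movies(movie)` on the imported data should give the corresponding counts for the full dataset; the result may differ slightly from the figures above depending on whether blanks are counted as a value.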
Preview of the full dataset:
library(DT)
datatable(head(movie,50))
The table below lists the variable names, their types, and a short description of each.
# Column classes as a plain character vector (first class only)
Variable.type <- vapply(movie, function(x) class(x)[1], character(1))
Variable.desc <- c("Whether the movie is in color or black & white",
"Name of movie director","Number of critic reviews",
"Duration of the movie (minutes)","Number of likes on director's FB page",
"Number of likes on 3rd actor's FB page","Name of second actor",
"Number of likes on 1st actor's FB page","Gross earnings of the movie ($)",
"Genres of the movie","Name of the first actor",
"Title of the movie","Number of users who voted on IMDB",
"Total Facebook likes for all cast members","Name of the third actor",
"Number of actors' faces featured in the movie poster",
"Keywords describing the movie plot","IMDB link of the movie",
"Number of users who gave a review","Language of the movie",
"Country the movie was produced in",
"Content rating of the movie","Budget of the movie ($)",
"Year the movie was released","Number of likes on 2nd actor's FB page",
"IMDB score for the movie (out of 10)","Aspect ratio the movie was made in",
"Number of Facebook likes for the movie")
Variable.name1 <- colnames(movie)
# Build the description table directly from the three vectors
data.desc <- tibble(Variable.name1, unname(Variable.type), Variable.desc)
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| color | character | Whether the movie is in color or black & white |
| director_name | character | Name of movie director |
| num_critic_for_reviews | integer | Number of critic reviews |
| duration | integer | Duration of the movie (minutes) |
| director_facebook_likes | integer | Number of likes on director’s FB page |
| actor_3_facebook_likes | integer | Number of likes on 3rd actor’s FB page |
| actor_2_name | character | Name of second actor |
| actor_1_facebook_likes | integer | Number of likes on 1st actor’s FB page |
| gross | integer | Gross earnings of the movie ($) |
| genres | character | Genres of the movie |
| actor_1_name | character | Name of the first actor |
| movie_title | character | Title of the movie |
| num_voted_users | integer | Number of users who voted on IMDB |
| cast_total_facebook_likes | integer | Total Facebook likes for all cast members |
| actor_3_name | character | Name of the third actor |
| facenumber_in_poster | integer | Number of actors’ faces featured in the movie poster |
| plot_keywords | character | Keywords describing the movie plot |
| movie_imdb_link | character | IMDB link of the movie |
| num_user_for_reviews | integer | Number of users who gave a review |
| language | character | Language of the movie |
| country | character | Country the movie was produced in |
| content_rating | character | Content rating of the movie |
| budget | numeric | Budget of the movie ($) |
| title_year | integer | Year the movie was released |
| actor_2_facebook_likes | integer | Number of likes on 2nd actor’s FB page |
| imdb_score | numeric | IMDB score for the movie (out of 10) |
| aspect_ratio | numeric | Aspect ratio the movie was made in |
| movie_facebook_likes | integer | Number of Facebook likes for the movie |
Missing Values
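There are 3,801 rows with no NA values. In character variables, missing values appear as blank strings, while in numeric variables they appear as NA. Note that complete.cases() only detects NA, so rows with blank character values are still counted as complete.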
print(paste(sum(complete.cases(movie)),"Complete cases!"))
## [1] "3801 Complete cases!"
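Because blanks in character columns are not seen by `complete.cases()`, one option is to recode them as NA before counting. The following is a minimal base R sketch under that assumption: `blank_to_na` is a hypothetical helper, not part of the original analysis, and the toy data frame only stands in for `movie`.

```r
# Hypothetical helper: recode blank character values as NA so that
# is.na() and complete.cases() can see them.
blank_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    x[!is.na(x) & trimws(x) == ""] <- NA_character_
    x
  })
  df
}

# Toy stand-in for `movie` with one NA and two blank values.
toy <- data.frame(
  color    = c("Color", "", "Color"),
  language = c("English", "English", " "),
  budget   = c(1e6, NA, 5e5),
  stringsAsFactors = FALSE
)

sum(complete.cases(toy))        # rows with no NA before recoding
toy_clean <- blank_to_na(toy)
colSums(is.na(toy_clean))       # missing values per column
sum(complete.cases(toy_clean))  # rows with no NA after recoding
```

Applied to the real data as `blank_to_na(movie)`, the same steps would give a per-column count of missing values and a complete-case count that also accounts for blanks.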
I would like to analyze what kinds of movies are more successful, using the IMDB score of the movie as my dependent variable. It would be interesting to understand which characteristics make a movie more successful than others.
I would like to present the results of this analysis in as intuitive a way as possible, using charts and figures. I will primarily use the ggplot2 and plotly packages in R.