Data Wrangling Project

Analysis of TMDB Movies Data

Introduction

Problem statement: To analyze TMDB 5000 Movie data set obatined from Kaggle in order to provide insights into such as

Analyze trend over the years in terms of number produced, genres, runtime
Most popular genres, spoken languages
Top 10 movie producing countries and companies, Top 10 highest grossing movies of all time, Top 10 highest rated movies of all time
How runtime varies across genres, original languages, countries, with gross and average vote, categorization of movies based on runtime
How average vote has varied across years, original languages, different categories of movies
Difference in movies produced by different countries and production companies in terms of number produced, duration, gross revenue, etc.

Approach:

Preliminary analysis of data to see dimensions and stucture of data, number of missing and duplicate values
Cleaning of data - removal of duplicate observations, spurious characters in title
Data Manipulation - the data set has few columns in JSON format from which data has been extracted. The resulting data frames were joined to existing data set. New columns were created to store release year, gross (revenue - budget) and gross_flag (profit or loss).
Data Visualization - Using ggplots, word cloud to visualize data

How is it useful to consumers: Primary objective of producing movies is to make profits. Some movies generate very high revenue while others go into loss. An analysis of movies will help understand how factors such as runtime, languages, genres influence revenue.

Packages Required

Packages used:

Dplyr : For data manipulation
Jsonlite : To extract JSON data
DT : To create data table so as to view data
wordcloud : To plot a word cloud
Tidyverse : Tidyverse has several packages which makes it easy to install and load multiple tidyverse packages
Ggplot2 : To visualize data using a variety of charts
Lubridate : To extract year from release date
Plotly : To plot a 3D scatter plot

library(dplyr)
library(jsonlite)
library(DT)
library(wordcloud)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(plotly)

Data Preparation

Data Source

Source of data is Kaggle - TMDB 5000 Movie Data set. You can find it here

The data set was created to answer different questions about movies such as what can be said about success of a movie, have certain production companies found a formula for success of a movie, why does a film with very high budget fail at box office and many more such questions.

It contains 20 variables and 4803 observations across 88 different countries, 20 genres, 86 spoken languages.

Data Import code:

movies <- read.csv("tmdb_5000_movies.csv", header = TRUE, stringsAsFactors = FALSE)

class(movies)

## [1] "data.frame"

dim(movies)

## [1] 4803   20

colnames(movies)

##  [1] "budget"               "genres"               "homepage"            
##  [4] "id"                   "keywords"             "original_language"   
##  [7] "original_title"       "overview"             "popularity"          
## [10] "production_companies" "production_countries" "release_date"        
## [13] "revenue"              "runtime"              "spoken_languages"    
## [16] "status"               "tagline"              "title"               
## [19] "vote_average"         "vote_count"

colnames(movies)[4] <- "movie_id"    #Renaming id column as movie_id

Summary of Variables

The original data set has 20 variables and their data types are mentioned in brackets. They are as follows:

Budget (dbl) : Budget of a movie
Genres (chr) : Genre of a movie such as comedy, action, etc
Homepage (chr) : url of a movie
id (dbl) : ID of a movie
Keywords (dbl) : Major keywords describing a movie
Original_language (chr) : Original language in which a movie was made
Original_title (chr) : Original title of a movie
Overview (chr) : Brief description of a movie
Popularity (dbl) : Popularity of a movie
Production_companies (chr) : Names of the companies producing a movie
Production_countries (chr) : Names of the countries producing a movie
Release_date (dbl) : Release date of a movie
Revenue (dbl) : Revenue earned by a movie
Runtime (dbl) : Duration of a movie
Spoken_languages (chr) : Languages in which movie is available
Status (chr) : Status of a movie, if it’s released, in production or rumoured to be produced
Tagline (chr) : Tagline of a movie
Title (chr) : Actual title of a movie
Vote_average (dbl) : Average vote received for a movie in IMDB
Vote_count(dbl) : Number of votes received by a movie

Note: Original data set from Kaggle provides no information about unit of budget and revenue. I have assumed it to be in US dollars

Original data set

Preview of first 100 rows of original data:

datatable(head(movies, n = 100))

Data Cleaning

Missing Values:

There are several missing values in the data set. I have assigned NA to observations having blank, 0 and [] values. The number of missing cases in each column and total number of complete cases are as shown below:

#Assign NA to blank values
movies[movies == ""] <- NA
movies[movies == "[]"] <- NA
movies[movies == 0] <- NA

sum(complete.cases(movies))

## [1] 1225

#Counting number of missing values in each column
movies %>% summarise_all(funs(sum(is.na(.))))

##   budget genres homepage movie_id keywords original_language
## 1   1037     28     3091        0      412                 0
##   original_title overview popularity production_companies
## 1              0        3          1                  351
##   production_countries release_date revenue runtime spoken_languages
## 1                  174            1    1427      37               86
##   status tagline title vote_average vote_count
## 1      0     844     0           63         62

I am retaining all the observations for the analysis.

Duplicate Values:

Checking for duplicate movie titles using title column and removing the duplicate observations.

#Checking for duplicate movie title and removing duplicate values
movies <- movies[!duplicated(movies$title), ]
dim(movies)

## [1] 4800   20

There were three duplicate rows and they were removed.

Check for spurious characters:

Title column had spurious characters and they were removed during data cleaning.

#Removing spurious characters from movie title
movies$title <- sub(pattern = "Â", "", movies$title)

Adding new columns:

Year is extracted from release_date column and stored in a new column. Extracting year will help in performing trend analysis. Also, new columns are created to store difference between revenue and budget and an indicator, gross flag, which tells if the movie has made profit or is in loss.

movies$year <- movies$release_date %>%
                    as.POSIXlt(tz = "", format = "%m/%d/%Y") %>%
                    year()

#Adding new columns gross and gross_flag 
movies <- movies %>%
  mutate(gross = revenue - budget, gross_flag = ifelse(gross < 0, "Loss", "Profit"))

There are five columns in the original data set which have JSON values. They have been extracted and store in separate data frames. Columns containing JSON values were dropped from original data set and new columns were joined to the final data set which has multiple values stored in one column separated by comma.

#This code has been inspired from kernel 'Tidydata Movie Data set exploration' from Kaggle
#Creating a tibble, Keywords1, which stores keywords
keywords1 <- movies %>%    
  filter(nchar(keywords) > 2) %>%                 # fiter out blank keywords field
  mutate(                                         # create a new field 
    js = lapply(keywords, fromJSON)               # containing a LIST of keyword and value pairs
  ) %>%                                           # called id and name
  unnest(js) %>%                                  # turn each keyword/value pairs in the LIST into a row
  select(movie_id, title, keywords = name)

#Combining the keywords of a movie in a single column
keywords <- aggregate(keywords ~.,data = keywords1, paste, collapse = ",")

#Creating a tibble, genres1, which stores genres
genres1 <- movies %>%    
  filter(nchar(genres) > 2) %>%                   
  mutate(                                          
    js = lapply(genres, fromJSON)                 
  ) %>%                                           
  unnest(js) %>%                                  
  select(movie_id, title, genres = name) 

#Combining genres of a movie in a single column
genres <- aggregate(genres ~.,data = genres1, paste, collapse = ",")

#Creating a tibble, production_companies1, which stores production companies
production_companies1 <- movies %>%    
  filter(nchar(production_companies) > 2) %>%     
  mutate(                                         
    js = lapply(production_companies, fromJSON)   
  ) %>%                                           
  unnest(js) %>%                                  
  select(movie_id, title, production_companies = name)

#Combining production_companies of a movie in a single column
production_companies <- aggregate(production_companies ~.,data = production_companies1, paste, collapse = ",")

#Creating a tibble, production_countries1, which stores production countries
production_countries1 <- movies %>%    
  filter(nchar(production_countries) > 2) %>%     
  mutate(                                         
    js = lapply(production_countries, fromJSON)   
  ) %>%                                          
  unnest(js) %>%                                  
  select(movie_id, title, production_countries = name) 

#Combining production_countries of a movie in a single column
production_countries <- aggregate(production_countries ~.,data = production_countries1, paste, collapse = ",")


#Creating a tibble, spoken_languages1, which stores languages of the movies
spoken_languages1 <- movies %>%    
  filter(nchar(spoken_languages) > 2) %>%        
  mutate(                                         
    js = lapply(spoken_languages, fromJSON)      
  ) %>%                                          
  unnest(js) %>%                                 
  select(movie_id, title, spoken_languages = iso_639_1) 

#Combining spoken_languages of a movie in a single column
spoken_languages <- aggregate(spoken_languages ~.,data = spoken_languages1, paste, collapse = ",")

#Dropping existing columns - keywords, genres, production_companies, production_countries, spoken_languages
movies <- movies %>%
  select(budget, homepage, movie_id, original_language, original_title, overview, popularity, release_date,
         revenue, runtime, status, tagline, title, vote_average, vote_count, year, gross, gross_flag)

#Attaching columns - keywords, genres, production_companies, production_countries, spoken_languages using full_join in order to retain all observations.
movies <- movies %>%
            full_join(keywords, by = c("movie_id", "title")) %>%
            full_join(genres, by = c("movie_id", "title")) %>%
            full_join(production_companies, by = c("movie_id", "title")) %>%
            full_join(production_countries, by = c("movie_id", "title")) %>%
            full_join(spoken_languages, by = c("movie_id", "title"))

Data Preview

Preview of first 100 rows of cleaned data:

datatable(head(movies, n = 100))

Exploratory Data Analysis

Genre

WordCloud was used for basic genre analysis. From the word cloud, it is evident that top five genres are Drama, Comedy, Thriller, Action and Romance in the order.

#Function to count number of genres
number <- function(df, col) {
  df_count <- group_by(df, df[[col]]) %>%
    summarise(count = n()) %>%
    arrange(desc(count))
  
  names(df_count)[1] <- "df"                  #renaming column
  
  return(df_count)
}

genres_count <- number(genres1, 3)


#Generating word cloud
set.seed(42)
wordcloud(words = genres_count$df, freq = genres_count$count,
          random.order = FALSE, rot.per = 0.20, 
          colors = brewer.pal(8, "Dark2"))

Genre vs runtime - Looking at movies with runtime more than 200 minutes, we see that drama and history genres have more number of movies than any other genres. Animations and documentary genres have short movies. For rest of the genres, most of the movies are in regular category.

genre_df <- movies %>%
               select(movie_id, title, runtime, year, original_language, gross) %>%
               left_join(genres1, by = c("movie_id","title")) 

genre_df %>% 
  filter(!is.na(genres) == TRUE) %>%
  ggplot(aes(x = genres, y = runtime, color = genres)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))

Gross across genres and years: Heat map is used to see how gross varies across genres and years. Based on gross, popular genres across years include thriller, romance, mystery, horror, drama, adventure and action. Earliest movie belongs to drama genre.

genre_df %>%
  filter(!is.na(gross) == TRUE) %>%
  ggplot(aes(x = year, y = genres, fill = gross)) +
  geom_tile() +
  ggtitle("Heat Map") +
  xlab("Year") +
  ylab("Genre") +
  scale_fill_gradient(low = "light blue", high = "dark blue")

Genre vs Gross

genre_df %>%
  filter(!is.na(genres) == TRUE) %>%
  group_by(genres) %>%
  summarise(average_gross = mean(gross, na.rm = TRUE), movies_count = n(), total_gross = sum(gross, na.rm = TRUE)) %>%
  filter(movies_count >= 50) %>%
  arrange(desc(average_gross)) %>%
  head(n = 10)

## # A tibble: 10 x 4
##             genres average_gross movies_count  total_gross
##              <chr>         <dbl>        <int>        <dbl>
##  1       Animation     198090651          234  37241042418
##  2       Adventure     170714153          789 112842055285
##  3         Fantasy     162675510          424  55635024529
##  4          Family     157588117          512  57519662846
##  5 Science Fiction     126069497          533  54209883661
##  6          Action     114424359         1154 105041561757
##  7          Comedy      72323910         1721  80279539976
##  8        Thriller      69233753         1274  64733558825
##  9         Romance      64263807          894  36887425371
## 10         Mystery      63308118          348  16776651385

Language

The original and spoken languages are in ISO_639_1 codes. Their corresponding English names were found (source: https://www.loc.gov/standards/iso639-2/php/code_list.php) and a new data frame was created.

Bar chart and word cloud was used for basic language analysis. Top five languages in which movies were made include English, French, German, Spanish and Chinese. Note here that spoken language includes original language as well as language in which movies were dubbed.

#Importing language file
language <- read.csv("ISO_639_1.csv", header = TRUE)
colnames(language) <- c("language_code", "Language")

colnames(spoken_languages1)[3] <- "language_code"

spoken_languages1 <- left_join(spoken_languages1, language, by = "language_code")

#Obtaining count of languages using function 
language_count <- number(spoken_languages1, 4)

#Wordcloud of languages
wordcloud(words = language_count$df, freq = language_count$count, 
          min.freq = 5, random.order = FALSE, rot.per = 0.25, 
          colors = brewer.pal(8, "Dark2"))

#Bar chart showing top 20 spoken languages 
language_count %>%
  head(n = 20) %>%
  ggplot(aes(x = reorder(df, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  ggtitle("Frequency of languages") +
  xlab("Language") +
  ylab("Frequency")

In case of original languages:

#Top 10 original languages in which movies are made
movies %>%
  group_by(original_language) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(n = 10)

## # A tibble: 10 x 2
##    original_language count
##                <chr> <int>
##  1                en  4503
##  2                fr    70
##  3                es    32
##  4                de    27
##  5                zh    27
##  6                hi    19
##  7                ja    16
##  8                it    14
##  9                cn    12
## 10                ru    11

Language vs gross

language_df <- movies %>%
                 select(movie_id, title, runtime, year, gross) %>%
                 left_join(spoken_languages1, by = c("movie_id","title"))

language_df %>%
  filter(!is.na(Language) == TRUE) %>%
  group_by(Language) %>%
  summarise(average_gross = mean(gross, na.rm = TRUE), movies_count = n(), total_gross = sum(gross, na.rm = TRUE)) %>%
  filter(movies_count >= 50) %>%
  arrange(desc(total_gross)) %>%
  head(n = 10)

## # A tibble: 10 x 4
##    Language average_gross movies_count  total_gross
##      <fctr>         <dbl>        <int>        <dbl>
##  1  English      81717510         4483 256184393932
##  2   French      97197202          874  59095898846
##  3   German      92104198          524  32973302842
##  4  Spanish      89671419          351  24480297296
##  5  Chinese     109857819          214  17357535434
##  6  Italian     104228431          188  14904665658
##  7  Russian      91438311          185  13532869957
##  8 Japanese      91837875           97   7530705756
##  9   Arabic     112318856           67   6177537074
## 10    Latin     115211043           52   4954074842

Runtime

Average runtime of the movies over the years was analysed using scatter plot. We see that average runtime has come down over the years. after 1970, the average runtime has remained in the range of 100 to 125 minutes.

#Run time over the years
movies %>%
  group_by(year) %>%
  summarise(avg_runtime = mean(runtime)) %>%
  ggplot(aes(x = year, y = avg_runtime)) +
  geom_point() +
  xlab("Year") +
  ylab("Average Runtime")

Analysis of runtime across different languages is shown using boxplot and bar chart. The column ‘original_language’ was used for the analysis. Movies made in English language have an average runtime of around 110 minutes and several outliers were observed. Interestingly, the longest and shortest movies in the data set are also made in English language originally. Among the seven languages, which have average runtime of more than 125 minutes, most of them are Asian languages. in fact, two Indian languages - Telugu and Tamil, have average runtime of more than 150 minutes !!!

#runtime vs original_language
movies %>%
  ggplot(aes(x = original_language, y = runtime)) +
  geom_boxplot() +
  xlab("Original Language") +
  ylab("Runtime")

movies %>%
  ggplot(aes(x = original_language, y = runtime)) +
  stat_summary(fun.y = mean, geom = "bar")

Based on runtime, the movies were categorized into short, regular and long movies.

summary(movies$runtime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    14.0    94.0   104.0   107.7   118.0   338.0      37

#average movie length between 104 to 107 mins

quantile(movies$runtime, probs = 0.95, na.rm = TRUE)

## 95% 
## 144

quantile(movies$runtime, probs = 0.10, na.rm = TRUE)

## 10% 
##  87

#movies greater than 144 minutes are long movies. 95% of movies are shorter than 144 mins
#90% of movies are longer than 87 mins. So definig short movies as less than 87 mins

movies <- movies %>%
              mutate(Description = ifelse(runtime >= 144, "Long",
                                          ifelse(runtime <= 87, "Short",
                                                 "Regular")))

Gross vs categories of movies is shown using histogram. It is seen that highest grossing movies are long, followed by regular and then short.

#Gross vs Movie categories
movies %>%
  filter(!is.na(gross) == TRUE) %>%
  group_by(Description) %>%
  summarise(avg_gross = mean(gross),
           median_gross = median(gross))

## # A tibble: 3 x 3
##   Description avg_gross median_gross
##         <chr>     <dbl>        <dbl>
## 1        Long 185180243     56849135
## 2     Regular  75543543     25654109
## 3       Short  46511444     12909561

movies %>%
  filter(!is.na(gross) == TRUE) %>%
  ggplot(aes(x = gross, fill = Description)) +
  geom_histogram(bins = 200) +
  facet_wrap(~ Description, ncol = 1, scales = "free_y") +
  ggtitle("Histogram") +
  scale_x_continuous(labels = scales::comma)

3D plot of runtime vs average vote vs gross

plot_ly(movies, x = ~vote_average, y = ~runtime, z = ~gross/1000000, 
        color = ~gross_flag, colors = c('#BF382A', '#0C4B8E') ,size = I(3)) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Average vote'),
                      yaxis = list(title = 'Runtime'),
                      zaxis = list(title = 'Gross (million $)')),
         title = "INTERACTIVE 3D Scatter plot: Average vote vs Runtime vs Gross",
         showlegend = FALSE)

Country

Top 10 movie producing countries and number of movies produced by them

#Count of movies by production countries
country_count <- number(production_countries1, 3)

#Top 10 production countries
country_count %>%
  arrange(desc(count)) %>%
  head(n = 10) %>%
  ggplot(aes(x = reorder(df, -count), y = count)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Top 10 movie producing countries") +
  xlab("Country") +
  ylab("Count of Movies")

Country vs runtime: Top 10 countries having longest average runtime. Only those countries have been considered which have produced atleast 50 movies. India has the longest average runtime of all countries.

country_df <- movies %>%
               select(movie_id, title, runtime, year, original_language, gross) %>%
               left_join(production_countries1, by = c("movie_id","title")) 

country_df %>%
  filter(!is.na(runtime) == TRUE) %>%
  group_by(production_countries) %>%
  summarise(avg_runtime = mean(runtime), movies_count = n()) %>%
  filter(movies_count >= 50) %>%
  arrange(desc(avg_runtime)) %>%
  head(n = 10) %>%
  ggplot(aes(x = reorder(production_countries, -avg_runtime), y = avg_runtime, fill = production_countries)) +
  geom_bar(stat = "identity") +
  xlab("Country") +
  ylab("Average Runtime") +
  theme(axis.text.x = element_text(angle = 90))

Country vs average gross:

country_df %>%
  filter(!is.na(production_countries) == TRUE) %>%
  group_by(production_countries) %>%
  summarise(average_gross = mean(gross, na.rm = TRUE), movies_count = n(), total_gross = sum(gross, na.rm = TRUE)) %>%
  filter(movies_count >= 50) %>%
  arrange(desc(average_gross))

## # A tibble: 11 x 4
##        production_countries average_gross movies_count  total_gross
##                       <chr>         <dbl>        <int>        <dbl>
##  1                    China     118746291           59   4393612776
##  2           United Kingdom      92238534          636  40216000928
##  3                    Japan      91378382           58   3655135283
##  4 United States of America      86526542         3955 251619185559
##  5                Australia      65300096          110   5093407465
##  6                  Germany      60117363          324  14007345498
##  7                   Canada      56292196          261   9119335766
##  8                    India      51916896           54   1505589991
##  9                   France      40577027          306   7953097333
## 10                    Italy      37779516           72   1813416781
## 11                    Spain      29449043           71   1266308835

Production Companies

Top 10 production companies and number of movies produced by them.

#Count of movies by production_companies
company_count <- number(production_companies1, 3)

#Top 10 production companies
company_count %>%
  head(n = 10) %>%
  ggplot(aes(x = reorder(df, -count), y = count)) +
  geom_bar(stat = "identity", fill = "steel blue") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Top 10 movie producing companies") +
  xlab("Production companies") +
  ylab("Count of Movies")

Production companies vs Average gross:

company_df <- movies %>%
                select(movie_id, title, runtime, year, original_language, gross) %>%
                    left_join(production_companies1, by = c("movie_id","title")) 

company_df %>%
  filter(!is.na(production_companies) == TRUE) %>%
  group_by(production_companies) %>%
  summarise(average_gross = mean(gross, na.rm = TRUE), movies_count = n(),total_gross = sum(gross, na.rm = TRUE)) %>%
  filter(movies_count >= 50) %>%
  head(n = 10) %>%
  arrange(desc(total_gross))

## # A tibble: 10 x 4
##             production_companies average_gross movies_count total_gross
##                            <chr>         <dbl>        <int>       <dbl>
##  1             Columbia Pictures     107615329          201 17971759867
##  2                DreamWorks SKG     149731845           79 10032033626
##  3            Dune Entertainment     180153356           59  9728281204
##  4     Metro-Goldwyn-Mayer (MGM)      66372067          122  6106230207
##  5 Columbia Pictures Corporation      63071308           96  5424132526
##  6             Fox 2000 Pictures      98113296           55  4218871743
##  7                     Lionsgate      78141144           56  3828916050
##  8                        Canal+      54072819           75  2595495328
##  9      Fox Searchlight Pictures      45342996           64  2131120795
## 10               Dimension Films      46308979           54  1944977127

Vote

Average vote of movies was analysed over the years and only movies which had more than 1000 votes were included in the plot. It is observed that average vote of movies was highest in mid 1960s, after which it has steadily declined.

#vote_average and vote_count over the years
movies %>%
  filter(vote_count >= 1000) %>%
  group_by(year) %>%
  summarise(avg_vote = mean(vote_average)) %>%
  ggplot(aes(x = year, y = avg_vote)) +
  geom_point() +
  geom_smooth(stat = "smooth", position = "identity") +
  xlab("Year") +
  ylab("Average Vote")

A look of how average vote varies across languages. Original language was used for the analysis.

#vote_average vs original_language
movies %>%
  filter(vote_count >= 100) %>%
  ggplot(aes(x = original_language, y = vote_average)) +
  geom_boxplot()

Average vote vs gross across different categories of movies : It is seen that only a handful of movies have gross of over 1 billion dollars. Highest grossing movie belongs to long category but regular category has nearly double the number of movies with gross above 1 billion while there are no movies from short category in that range. We can also see that some movies have gross less than 0, i.e, the movies were in loss. Except a few movies, most of them with average rating above 7 have grossed less than 1 billion dollars. In fact, three regular movies with less than average rating of 7 have managed to gross above 1 billion dollars.

movies %>%
  filter(vote_count >= 1000) %>%
  ggplot(aes(x = gross, y = vote_average, color = Description)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ Description, ncol = 1) +
  scale_x_continuous(labels = scales::comma)

Trend over the years

Analysis of movies over the years is shown using line plot. Year 2009 saw the highest number with 247 movies. It is seen that number of movies steadily increased till year 1999, after which there have been ups and downs. As the data set contains information about movies till 2016, We see a sharp decline in number of movies after 2015.

#Trend analysis of movies over the years
movies %>%
  group_by(year) %>%
  summarise(movie_count = n()) %>%
  filter(movie_count >= 10)  %>%
  ggplot(aes(x = year, y = movie_count)) +
  geom_line() +
  geom_point() +
  xlab("Year") +
  ylab("Number of Movies") +
  theme_classic()

Top 10 highest grossing movies of all time

#Top 10 grossing movies
movies %>%
  select(title, gross) %>%
  arrange(desc(gross)) %>%
  head(n = 10) %>%
  ggplot(aes(x = reorder(title, -gross), y = gross, fill = title)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Top 10 highest grossing movies of all time") +
  xlab("Movie Title") +
  ylab("Gross (in $)") +
  scale_y_continuous(labels = scales::comma)

Top 10 movies with highest rating. Only movies which have more than 1000 vote counts have been considered.

#top 10 movies with highest rating
movies %>%
  filter(vote_count >= 1000) %>%
  select(title, vote_average) %>%
  arrange(desc(vote_average)) %>%
  head(n = 10) %>%
  ggplot(aes(x = reorder(title, -vote_average), y = vote_average, fill = title)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Top 10 highest rated movies of all time") +
  xlab("Movie Title") +
  ylab("Average Vote")

Summary

The points mentioned in problem statement were addressed in this project.

Functions in Dplyr package such as mutate, select, filter , arrange, sumarise, etc and jsonlite package were used for data manipulation, extracting data from JSON columns and creating new columns. For data analysis, various plots and charts such as scatter plot, bar chart, histogram, heat map, 3D scatter plot, word cloud, etc were used which were available in Ggplot, wordcloud and plotly packages.

Insights gained from data analysis are summarised below:

Number of movies produced has increased at a high rate after 1990, peaking at year 2009
USA produces highest number of movies at 3955, which is significantly high compared to next country, United Kingdom
Drama, Comedy, Thriller, Action and Romance are the top five genres as seen from word cloud
Animation and documentary genres have more number of short movies compared to other genres while history and drama genres have more long movies, few of them longer than 200 minutes
Earliest movie belongs to Drama genre. Genres such as Thriller, Romance, Mystery, Horror, Drama, Adventure and Action have been able to generate good gross revenue over the years
Adventure, action genres are able to fetch very high gross revenue. Till date, they fetched over 100 billion US dollars
In case of spoken langauges, i.e., the languages in which a movie is available, English is the most popular language for movies followed by French, German, Spanish and Chinese
Average runtime of the movies has reduced over the years and now it is in the range 100 to 125 minutes
Movies made in Asian languages tend to have higher average runtime than English language
95% of the movies have runtime less than 144 minutes. 10% of movies have run time less than 87 minutes.
Long movies (runtime more than 144 minutes) are the highest grossing movies followed by regular and short movies (runtime less than 87 minutes)
Most of the movies which have grossed over 500 million dollars have an average rating of atleast 6
Average gross revenue is highest in China, but overall gross revenue is highest for USA
Average gross revenue is highest for Dune Entertainment company but overall gross revenue is highest for Columbia Pictures
Interestingly, average gross revenue is highest for Latin language. However, English language has highest total gross revenue

Implications: Based on above analysis, movies in the following categories might be able to fetch high gross revenue:

Genres - Adventure, Animation, Fantasy, Action
Languages - English, French, German, Spanish, Chinese
Runtime - between 100 to 150 minutes

Limitations and future work:

The project is limited to exploratory data analysis using plots and charts. Statistical analysis has to be done in order to make any conclusions.
There could be more factors which influence gross revenue which have not been considered in the project.
Modeling can be done and the data set could be divided into train and test data sets in order to verify the model.