Packages used:
library(dplyr)
library(jsonlite)
library(DT)
library(wordcloud)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(plotly)
Source of data is Kaggle - TMDB 5000 Movie Data set. You can find it here
The data set was created to answer different questions about movies such as what can be said about success of a movie, have certain production companies found a formula for success of a movie, why does a film with very high budget fail at box office and many more such questions.
It contains 20 variables and 4803 observations across 88 different countries, 20 genres, 86 spoken languages.
Data Import code:
movies <- read.csv("tmdb_5000_movies.csv", header = TRUE, stringsAsFactors = FALSE)
class(movies)
## [1] "data.frame"
dim(movies)
## [1] 4803 20
colnames(movies)
## [1] "budget" "genres" "homepage"
## [4] "id" "keywords" "original_language"
## [7] "original_title" "overview" "popularity"
## [10] "production_companies" "production_countries" "release_date"
## [13] "revenue" "runtime" "spoken_languages"
## [16] "status" "tagline" "title"
## [19] "vote_average" "vote_count"
colnames(movies)[4] <- "movie_id" #Renaming id column as movie_id
The original data set has 20 variables and their data types are mentioned in brackets. They are as follows:
Note: Original data set from Kaggle provides no information about unit of budget and revenue. I have assumed it to be in US dollars
Preview of first 100 rows of original data:
datatable(head(movies, n = 100))
Missing Values:
There are several missing values in the data set. I have assigned NA to observations having blank, 0 and [] values. The number of missing cases in each column and total number of complete cases are as shown below:
#Assign NA to blank values
movies[movies == ""] <- NA
movies[movies == "[]"] <- NA
movies[movies == 0] <- NA
sum(complete.cases(movies))
## [1] 1225
#Counting number of missing values in each column
movies %>% summarise_all(funs(sum(is.na(.))))
## budget genres homepage movie_id keywords original_language
## 1 1037 28 3091 0 412 0
## original_title overview popularity production_companies
## 1 0 3 1 351
## production_countries release_date revenue runtime spoken_languages
## 1 174 1 1427 37 86
## status tagline title vote_average vote_count
## 1 0 844 0 63 62
I am retaining all the observations for the analysis.
Duplicate Values:
Checking for duplicate movie titles using title column and removing the duplicate observations.
#Checking for duplicate movie title and removing duplicate values
movies <- movies[!duplicated(movies$title), ]
dim(movies)
## [1] 4800 20
There were three duplicate rows and they were removed.
Check for spurious characters:
Title column had spurious characters and they were removed during data cleaning.
#Removing spurious characters from movie title
movies$title <- sub(pattern = "Â", "", movies$title)
Adding new columns:
Year is extracted from release_date column and stored in a new column. Extracting year will help in performing trend analysis. Also, new columns are created to store difference between revenue and budget and an indicator, gross flag, which tells if the movie has made profit or is in loss.
movies$year <- movies$release_date %>%
as.POSIXlt(tz = "", format = "%m/%d/%Y") %>%
year()
#Adding new columns gross and gross_flag
movies <- movies %>%
mutate(gross = revenue - budget, gross_flag = ifelse(gross < 0, "Loss", "Profit"))
There are five columns in the original data set which have JSON values. They have been extracted and store in separate data frames. Columns containing JSON values were dropped from original data set and new columns were joined to the final data set which has multiple values stored in one column separated by comma.
#This code has been inspired from kernel 'Tidydata Movie Data set exploration' from Kaggle
#Creating a tibble, Keywords1, which stores keywords
keywords1 <- movies %>%
filter(nchar(keywords) > 2) %>% # fiter out blank keywords field
mutate( # create a new field
js = lapply(keywords, fromJSON) # containing a LIST of keyword and value pairs
) %>% # called id and name
unnest(js) %>% # turn each keyword/value pairs in the LIST into a row
select(movie_id, title, keywords = name)
#Combining the keywords of a movie in a single column
keywords <- aggregate(keywords ~.,data = keywords1, paste, collapse = ",")
#Creating a tibble, genres1, which stores genres
genres1 <- movies %>%
filter(nchar(genres) > 2) %>%
mutate(
js = lapply(genres, fromJSON)
) %>%
unnest(js) %>%
select(movie_id, title, genres = name)
#Combining genres of a movie in a single column
genres <- aggregate(genres ~.,data = genres1, paste, collapse = ",")
#Creating a tibble, production_companies1, which stores production companies
production_companies1 <- movies %>%
filter(nchar(production_companies) > 2) %>%
mutate(
js = lapply(production_companies, fromJSON)
) %>%
unnest(js) %>%
select(movie_id, title, production_companies = name)
#Combining production_companies of a movie in a single column
production_companies <- aggregate(production_companies ~.,data = production_companies1, paste, collapse = ",")
#Creating a tibble, production_countries1, which stores production countries
production_countries1 <- movies %>%
filter(nchar(production_countries) > 2) %>%
mutate(
js = lapply(production_countries, fromJSON)
) %>%
unnest(js) %>%
select(movie_id, title, production_countries = name)
#Combining production_countries of a movie in a single column
production_countries <- aggregate(production_countries ~.,data = production_countries1, paste, collapse = ",")
#Creating a tibble, spoken_languages1, which stores languages of the movies
spoken_languages1 <- movies %>%
filter(nchar(spoken_languages) > 2) %>%
mutate(
js = lapply(spoken_languages, fromJSON)
) %>%
unnest(js) %>%
select(movie_id, title, spoken_languages = iso_639_1)
#Combining spoken_languages of a movie in a single column
spoken_languages <- aggregate(spoken_languages ~.,data = spoken_languages1, paste, collapse = ",")
#Dropping existing columns - keywords, genres, production_companies, production_countries, spoken_languages
movies <- movies %>%
select(budget, homepage, movie_id, original_language, original_title, overview, popularity, release_date,
revenue, runtime, status, tagline, title, vote_average, vote_count, year, gross, gross_flag)
#Attaching columns - keywords, genres, production_companies, production_countries, spoken_languages using full_join in order to retain all observations.
movies <- movies %>%
full_join(keywords, by = c("movie_id", "title")) %>%
full_join(genres, by = c("movie_id", "title")) %>%
full_join(production_companies, by = c("movie_id", "title")) %>%
full_join(production_countries, by = c("movie_id", "title")) %>%
full_join(spoken_languages, by = c("movie_id", "title"))
Preview of first 100 rows of cleaned data:
datatable(head(movies, n = 100))
WordCloud was used for basic genre analysis. From the word cloud, it is evident that top five genres are Drama, Comedy, Thriller, Action and Romance in the order.
#Function to count number of genres
number <- function(df, col) {
df_count <- group_by(df, df[[col]]) %>%
summarise(count = n()) %>%
arrange(desc(count))
names(df_count)[1] <- "df" #renaming column
return(df_count)
}
genres_count <- number(genres1, 3)
#Generating word cloud
set.seed(42)
wordcloud(words = genres_count$df, freq = genres_count$count,
random.order = FALSE, rot.per = 0.20,
colors = brewer.pal(8, "Dark2"))
Genre vs runtime - Looking at movies with runtime more than 200 minutes, we see that drama and history genres have more number of movies than any other genres. Animations and documentary genres have short movies. For rest of the genres, most of the movies are in regular category.
genre_df <- movies %>%
select(movie_id, title, runtime, year, original_language, gross) %>%
left_join(genres1, by = c("movie_id","title"))
genre_df %>%
filter(!is.na(genres) == TRUE) %>%
ggplot(aes(x = genres, y = runtime, color = genres)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90))
Gross across genres and years: Heat map is used to see how gross varies across genres and years. Based on gross, popular genres across years include thriller, romance, mystery, horror, drama, adventure and action. Earliest movie belongs to drama genre.
genre_df %>%
filter(!is.na(gross) == TRUE) %>%
ggplot(aes(x = year, y = genres, fill = gross)) +
geom_tile() +
ggtitle("Heat Map") +
xlab("Year") +
ylab("Genre") +
scale_fill_gradient(low = "light blue", high = "dark blue")
Genre vs Gross
genre_df %>%
filter(!is.na(genres) == TRUE) %>%
group_by(genres) %>%
summarise(average_gross = mean(gross, na.rm = TRUE), movies_count = n(), total_gross = sum(gross, na.rm = TRUE)) %>%
filter(movies_count >= 50) %>%
arrange(desc(average_gross)) %>%
head(n = 10)
## # A tibble: 10 x 4
## genres average_gross movies_count total_gross
## <chr> <dbl> <int> <dbl>
## 1 Animation 198090651 234 37241042418
## 2 Adventure 170714153 789 112842055285
## 3 Fantasy 162675510 424 55635024529
## 4 Family 157588117 512 57519662846
## 5 Science Fiction 126069497 533 54209883661
## 6 Action 114424359 1154 105041561757
## 7 Comedy 72323910 1721 80279539976
## 8 Thriller 69233753 1274 64733558825
## 9 Romance 64263807 894 36887425371
## 10 Mystery 63308118 348 16776651385
The original and spoken languages are in ISO_639_1 codes. Their corresponding English names were found (source: https://www.loc.gov/standards/iso639-2/php/code_list.php) and a new data frame was created.
Bar chart and word cloud was used for basic language analysis. Top five languages in which movies were made include English, French, German, Spanish and Chinese. Note here that spoken language includes original language as well as language in which movies were dubbed.
#Importing language file
language <- read.csv("ISO_639_1.csv", header = TRUE)
colnames(language) <- c("language_code", "Language")
colnames(spoken_languages1)[3] <- "language_code"
spoken_languages1 <- left_join(spoken_languages1, language, by = "language_code")
#Obtaining count of languages using function
language_count <- number(spoken_languages1, 4)
#Wordcloud of languages
wordcloud(words = language_count$df, freq = language_count$count,
min.freq = 5, random.order = FALSE, rot.per = 0.25,
colors = brewer.pal(8, "Dark2"))
#Bar chart showing top 20 spoken languages
language_count %>%
head(n = 20) %>%
ggplot(aes(x = reorder(df, -count), y = count)) +
geom_bar(stat = "identity", fill = "blue") +
coord_flip() +
ggtitle("Frequency of languages") +
xlab("Language") +
ylab("Frequency")
In case of original languages:
#Top 10 original languages in which movies are made
movies %>%
group_by(original_language) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
head(n = 10)
## # A tibble: 10 x 2
## original_language count
## <chr> <int>
## 1 en 4503
## 2 fr 70
## 3 es 32
## 4 de 27
## 5 zh 27
## 6 hi 19
## 7 ja 16
## 8 it 14
## 9 cn 12
## 10 ru 11
Language vs gross
language_df <- movies %>%
select(movie_id, title, runtime, year, gross) %>%
left_join(spoken_languages1, by = c("movie_id","title"))
language_df %>%
filter(!is.na(Language) == TRUE) %>%
group_by(Language) %>%
summarise(average_gross = mean(gross, na.rm = TRUE), movies_count = n(), total_gross = sum(gross, na.rm = TRUE)) %>%
filter(movies_count >= 50) %>%
arrange(desc(total_gross)) %>%
head(n = 10)
## # A tibble: 10 x 4
## Language average_gross movies_count total_gross
## <fctr> <dbl> <int> <dbl>
## 1 English 81717510 4483 256184393932
## 2 French 97197202 874 59095898846
## 3 German 92104198 524 32973302842
## 4 Spanish 89671419 351 24480297296
## 5 Chinese 109857819 214 17357535434
## 6 Italian 104228431 188 14904665658
## 7 Russian 91438311 185 13532869957
## 8 Japanese 91837875 97 7530705756
## 9 Arabic 112318856 67 6177537074
## 10 Latin 115211043 52 4954074842
Average runtime of the movies over the years was analysed using scatter plot. We see that average runtime has come down over the years. after 1970, the average runtime has remained in the range of 100 to 125 minutes.
#Run time over the years
movies %>%
group_by(year) %>%
summarise(avg_runtime = mean(runtime)) %>%
ggplot(aes(x = year, y = avg_runtime)) +
geom_point() +
xlab("Year") +
ylab("Average Runtime")
Analysis of runtime across different languages is shown using boxplot and bar chart. The column ‘original_language’ was used for the analysis. Movies made in English language have an average runtime of around 110 minutes and several outliers were observed. Interestingly, the longest and shortest movies in the data set are also made in English language originally. Among the seven languages, which have average runtime of more than 125 minutes, most of them are Asian languages. in fact, two Indian languages - Telugu and Tamil, have average runtime of more than 150 minutes !!!
#runtime vs original_language
movies %>%
ggplot(aes(x = original_language, y = runtime)) +
geom_boxplot() +
xlab("Original Language") +
ylab("Runtime")
movies %>%
ggplot(aes(x = original_language, y = runtime)) +
stat_summary(fun.y = mean, geom = "bar")
Based on runtime, the movies were categorized into short, regular and long movies.
summary(movies$runtime)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.0 94.0 104.0 107.7 118.0 338.0 37
#average movie length between 104 to 107 mins
quantile(movies$runtime, probs = 0.95, na.rm = TRUE)
## 95%
## 144
quantile(movies$runtime, probs = 0.10, na.rm = TRUE)
## 10%
## 87
#movies greater than 144 minutes are long movies. 95% of movies are shorter than 144 mins
#90% of movies are longer than 87 mins. So definig short movies as less than 87 mins
movies <- movies %>%
mutate(Description = ifelse(runtime >= 144, "Long",
ifelse(runtime <= 87, "Short",
"Regular")))
Gross vs categories of movies is shown using histogram. It is seen that highest grossing movies are long, followed by regular and then short.
#Gross vs Movie categories
movies %>%
filter(!is.na(gross) == TRUE) %>%
group_by(Description) %>%
summarise(avg_gross = mean(gross),
median_gross = median(gross))
## # A tibble: 3 x 3
## Description avg_gross median_gross
## <chr> <dbl> <dbl>
## 1 Long 185180243 56849135
## 2 Regular 75543543 25654109
## 3 Short 46511444 12909561
movies %>%
filter(!is.na(gross) == TRUE) %>%
ggplot(aes(x = gross, fill = Description)) +
geom_histogram(bins = 200) +
facet_wrap(~ Description, ncol = 1, scales = "free_y") +
ggtitle("Histogram") +
scale_x_continuous(labels = scales::comma)
3D plot of runtime vs average vote vs gross
plot_ly(movies, x = ~vote_average, y = ~runtime, z = ~gross/1000000,
color = ~gross_flag, colors = c('#BF382A', '#0C4B8E') ,size = I(3)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Average vote'),
yaxis = list(title = 'Runtime'),
zaxis = list(title = 'Gross (million $)')),
title = "INTERACTIVE 3D Scatter plot: Average vote vs Runtime vs Gross",
showlegend = FALSE)
Top 10 movie producing countries and number of movies produced by them
#Count of movies by production countries
country_count <- number(production_countries1, 3)
#Top 10 production countries
country_count %>%
arrange(desc(count)) %>%
head(n = 10) %>%
ggplot(aes(x = reorder(df, -count), y = count)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Top 10 movie producing countries") +
xlab("Country") +
ylab("Count of Movies")
Country vs runtime: Top 10 countries having longest average runtime. Only those countries have been considered which have produced atleast 50 movies. India has the longest average runtime of all countries.
country_df <- movies %>%
select(movie_id, title, runtime, year, original_language, gross) %>%
left_join(production_countries1, by = c("movie_id","title"))
country_df %>%
filter(!is.na(runtime) == TRUE) %>%
group_by(production_countries) %>%
summarise(avg_runtime = mean(runtime), movies_count = n()) %>%
filter(movies_count >= 50) %>%
arrange(desc(avg_runtime)) %>%
head(n = 10) %>%
ggplot(aes(x = reorder(production_countries, -avg_runtime), y = avg_runtime, fill = production_countries)) +
geom_bar(stat = "identity") +
xlab("Country") +
ylab("Average Runtime") +
theme(axis.text.x = element_text(angle = 90))
Country vs average gross:
country_df %>%
filter(!is.na(production_countries) == TRUE) %>%
group_by(production_countries) %>%
summarise(average_gross = mean(gross, na.rm = TRUE), movies_count = n(), total_gross = sum(gross, na.rm = TRUE)) %>%
filter(movies_count >= 50) %>%
arrange(desc(average_gross))
## # A tibble: 11 x 4
## production_countries average_gross movies_count total_gross
## <chr> <dbl> <int> <dbl>
## 1 China 118746291 59 4393612776
## 2 United Kingdom 92238534 636 40216000928
## 3 Japan 91378382 58 3655135283
## 4 United States of America 86526542 3955 251619185559
## 5 Australia 65300096 110 5093407465
## 6 Germany 60117363 324 14007345498
## 7 Canada 56292196 261 9119335766
## 8 India 51916896 54 1505589991
## 9 France 40577027 306 7953097333
## 10 Italy 37779516 72 1813416781
## 11 Spain 29449043 71 1266308835
Top 10 production companies and number of movies produced by them.
#Count of movies by production_companies
company_count <- number(production_companies1, 3)
#Top 10 production companies
company_count %>%
head(n = 10) %>%
ggplot(aes(x = reorder(df, -count), y = count)) +
geom_bar(stat = "identity", fill = "steel blue") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Top 10 movie producing companies") +
xlab("Production companies") +
ylab("Count of Movies")
Production companies vs Average gross:
company_df <- movies %>%
select(movie_id, title, runtime, year, original_language, gross) %>%
left_join(production_companies1, by = c("movie_id","title"))
company_df %>%
filter(!is.na(production_companies) == TRUE) %>%
group_by(production_companies) %>%
summarise(average_gross = mean(gross, na.rm = TRUE), movies_count = n(),total_gross = sum(gross, na.rm = TRUE)) %>%
filter(movies_count >= 50) %>%
head(n = 10) %>%
arrange(desc(total_gross))
## # A tibble: 10 x 4
## production_companies average_gross movies_count total_gross
## <chr> <dbl> <int> <dbl>
## 1 Columbia Pictures 107615329 201 17971759867
## 2 DreamWorks SKG 149731845 79 10032033626
## 3 Dune Entertainment 180153356 59 9728281204
## 4 Metro-Goldwyn-Mayer (MGM) 66372067 122 6106230207
## 5 Columbia Pictures Corporation 63071308 96 5424132526
## 6 Fox 2000 Pictures 98113296 55 4218871743
## 7 Lionsgate 78141144 56 3828916050
## 8 Canal+ 54072819 75 2595495328
## 9 Fox Searchlight Pictures 45342996 64 2131120795
## 10 Dimension Films 46308979 54 1944977127
Average vote of movies was analysed over the years and only movies which had more than 1000 votes were included in the plot. It is observed that average vote of movies was highest in mid 1960s, after which it has steadily declined.
#vote_average and vote_count over the years
movies %>%
filter(vote_count >= 1000) %>%
group_by(year) %>%
summarise(avg_vote = mean(vote_average)) %>%
ggplot(aes(x = year, y = avg_vote)) +
geom_point() +
geom_smooth(stat = "smooth", position = "identity") +
xlab("Year") +
ylab("Average Vote")
A look of how average vote varies across languages. Original language was used for the analysis.
#vote_average vs original_language
movies %>%
filter(vote_count >= 100) %>%
ggplot(aes(x = original_language, y = vote_average)) +
geom_boxplot()
Average vote vs gross across different categories of movies : It is seen that only a handful of movies have gross of over 1 billion dollars. Highest grossing movie belongs to long category but regular category has nearly double the number of movies with gross above 1 billion while there are no movies from short category in that range. We can also see that some movies have gross less than 0, i.e, the movies were in loss. Except a few movies, most of them with average rating above 7 have grossed less than 1 billion dollars. In fact, three regular movies with less than average rating of 7 have managed to gross above 1 billion dollars.
movies %>%
filter(vote_count >= 1000) %>%
ggplot(aes(x = gross, y = vote_average, color = Description)) +
geom_point(na.rm = TRUE) +
facet_wrap(~ Description, ncol = 1) +
scale_x_continuous(labels = scales::comma)
Analysis of movies over the years is shown using line plot. Year 2009 saw the highest number with 247 movies. It is seen that number of movies steadily increased till year 1999, after which there have been ups and downs. As the data set contains information about movies till 2016, We see a sharp decline in number of movies after 2015.
#Trend analysis of movies over the years
movies %>%
group_by(year) %>%
summarise(movie_count = n()) %>%
filter(movie_count >= 10) %>%
ggplot(aes(x = year, y = movie_count)) +
geom_line() +
geom_point() +
xlab("Year") +
ylab("Number of Movies") +
theme_classic()
Top 10 highest grossing movies of all time
#Top 10 grossing movies
movies %>%
select(title, gross) %>%
arrange(desc(gross)) %>%
head(n = 10) %>%
ggplot(aes(x = reorder(title, -gross), y = gross, fill = title)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Top 10 highest grossing movies of all time") +
xlab("Movie Title") +
ylab("Gross (in $)") +
scale_y_continuous(labels = scales::comma)
Top 10 movies with highest rating. Only movies which have more than 1000 vote counts have been considered.
#top 10 movies with highest rating
movies %>%
filter(vote_count >= 1000) %>%
select(title, vote_average) %>%
arrange(desc(vote_average)) %>%
head(n = 10) %>%
ggplot(aes(x = reorder(title, -vote_average), y = vote_average, fill = title)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Top 10 highest rated movies of all time") +
xlab("Movie Title") +
ylab("Average Vote")
The points mentioned in problem statement were addressed in this project.
Functions in Dplyr package such as mutate, select, filter , arrange, sumarise, etc and jsonlite package were used for data manipulation, extracting data from JSON columns and creating new columns. For data analysis, various plots and charts such as scatter plot, bar chart, histogram, heat map, 3D scatter plot, word cloud, etc were used which were available in Ggplot, wordcloud and plotly packages.
Insights gained from data analysis are summarised below:
Implications: Based on above analysis, movies in the following categories might be able to fetch high gross revenue:
Limitations and future work: