The Dataset Description

For this project I used a dataset about movies from the GitHub page of the TidyTuesday project. This dataset included 3401 observations and 9 variables. The relevant variables include:

Using this dataset as a film analyst, I tried to analyze how movie production, budgets, revenue have evolved over the years. I will try to answer some questions regarding movies produced between 1950 and 2019. First, I loaded the data using the tidy tuesday package and converted it to a data table format for ue.

movies <- tt_load('2018-10-23')

# convert it to a data table object
movies <- as.data.table(movies$movie_profit)

CUSTOM FUNCTION

I created a custom function that I thought would be useful for all the visualizations. This way my visualizations will look tidy and the presentation would be better.

# Create a custom graph theme applicable for all graphs
theme_movies <- function(){
  
  theme_bw() + 
    theme(plot.caption = element_text(hjust = 0, face = "italic"), 
          plot.title = element_text(hjust = 0.5, color = "black", size = 14, face = "bold"),
          plot.subtitle = element_text(hjust = 0.5, size = 11),
          panel.grid.minor = element_blank() ,
          panel.grid.major = element_blank())
  
}

DATA CLEANING

I checked the structure of the data and decided accordingly what transformations were required. I also got a data summary for all the variables giving the minimum, maximum values, missing values, unique values.Then I filtered the required columns and dropped rows where movie title was missing because without it our analysis about the movie would not make sense. To extract the year, I split the date column after converting it from an integer to date. I also dropped month and day as I did intend to use them. Lastly, I renamed the production budget column for easier use.

# glimpse of dataset
glimpse(movies)

# skim data 
skim(movies)

# checking format for all columns
str(movies)

#change date from character to date
movies$release_date <- mdy(movies$release_date)

# filter required columns
movies <- movies[ , .(release_date,movie,production_budget,domestic_gross,worldwide_gross,distributor,mpaa_rating,genre)]

# drop column where movie is missing
movies <- movies[is.na(movie) == FALSE]

# check summary
summary(movies)

# change year from integer to date
movies$release_date <- as.Date(movies$release_date)  

# split release date to get year
movies <- movies %>%
  mutate(year = year(release_date), date = day(release_date), month = month(release_date))

# drop month and day
movies <- movies[, -11:-13]

# rename columns
data.table::setnames(movies,'production_budget','budget')

DATA TRANSFORMATIONS

Although initially the dataset had only 8 variables that could be used. But I realized that with some additions this dataset could be very useful to understand how the movie business works and how it has changed over the years. Therefore, I added the domestic and worldwide gross revenue to calculate the total revenue made by the movie. Next, using the budget and total gross revenue I calculated the percentage of revenue according to the allotted budget. We also had enough information to calculate the profit which required subtracting budget from total gross revenue. I created a new column for the success rate of movies depending on the budget percentage revenue. A new column with a binary variable for movies success was created where a superhit movie was 1 otherwise it was 0.

# add total revenue (domestic+worldwide) 
movies <- movies[, `:=` ( # dplyr::mutate
  total_gross_rev = domestic_gross + worldwide_gross)]

# add budget 
movies <- movies[, `:=` ( #dplyr::mutate ==
  budget_percent_rev = (budget / total_gross_rev)*100,
  profit = total_gross_rev - budget,
  year = as.integer(year))]

# rate movies success by revenue
movies <- movies %>% 
  mutate(success_rate = case_when(budget_percent_rev > 90 ~ "Superhit",
                                         budget_percent_rev >= 70 ~ "Very good",
                                         budget_percent_rev > 50 ~ "Profitable",
                                         budget_percent_rev >= 0.1 ~ "flop"))

# rate movies out of 10 by profitability
movies <- movies %>% 
  mutate(ratings = case_when(budget_percent_rev > 90 ~ "10",
                                         budget_percent_rev >= 70 ~ "8",
                                         budget_percent_rev > 50 ~ "6",
                                         budget_percent_rev >= 0.1 ~ "4"))

# change ratings from character to numeric
movies$ratings <- as.integer(movies$ratings)  

# add superhit/flop binary variable
movies[, rating_binary := as.factor(ifelse(movies$success_rate == "Superhit", 1,0))]

# categorize movies by release date
movies <- movies %>% 
  mutate(time_period = case_when(year > 1900 ~ "2010s",
                                         year > 1999 ~ "2000s"))

EXPLORATORY DATA ANALYSIS

Now my data was fit to be used for some basic visualizations that would give me more details about my data. I used summary statistics to give this information. The results are given below. We can see that most of our variables are either left or right skewed. We can see that worldwide gross revenue and domestic gross revenue are following a similar pattern. We can also observe that most movies in our dataset were rated 4, which is fairly low, followed by 10 which is the maximum given to superhit movies. So most of our movies are either flop or superhit.

## EDA

# first I created some basic visualizations to give holistic understanding of the data before answering some analytical questions
movies %>%
  keep(is.numeric) %>% 
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free") +
  geom_histogram(color = "black", fill = "#fa9fb5")+
  theme_movies()

Finally, I took a look at the correlation of the important numeric variables. . The strongest correlation was between worldwide gross revenue and profit. However, the domestic gross revenue and profit also followed a similar pattern.

# correlation of evaluation variables
#install.packages("GGally")
library(GGally)
correlation <- movies %>%  select(c("profit","budget","domestic_gross","worldwide_gross"))
# Check their correlation
ggpairs(correlation)

Before formally starting my analysis by answering questions, I wanted to know the trajectory of movie production between years 1950 and 2019. What did the pattern look like? Below we can see that movie production took up pace after 1975. But there were a lot more movies released after 2000 compared to the previous years. This shows a growing trend, perhaps better technical facilities, better information and developing interest in this field resulted in such a scenario.

# check the distribution by year
ggplot(movies[, .N, by = year], aes(x= year, y=N)) +
  geom_col(colour="black", fill="#fa9fb5") +
  ggtitle('Number of movies released by year') +
  xlab("") +
  ylab("") +
  theme_movies()

VISUAL ANALYSIS

  1. Profit by mpaa ratings

I wanted to see how much profit was made by movies according to the mpaa ratings which are given below. For this I filtered the data by average profit earned by mpaa rating. Using the kable function from knitr to show how many observations for in each category. We can see that there are NA values which I dropped for the visualization.

# Number of observations in the two category
ratings_table <- movies[,list(observations = .N, avg_profit = (mean(profit))), by = mpaa_rating]
knitr::kable(ratings_table, caption="Appropriate age ratings movie by profit")
Appropriate age ratings movie by profit
mpaa_rating observations avg_profit
PG 573 165080935
PG-13 1092 127007554
G 85 215229463
R 1514 70883864
NA 137 18009642

The box plot below shows that most movies lie in the category G which means they are appropriate for all ages. However, we cannot assume that the average profit for these will also be higher. The case is different as we can see that the average profit for PG-13 movies is higher.

df <- movies[!is.na(mpaa_rating),.(profit),  by = mpaa_rating]
ggplot(df, aes(factor(mpaa_rating),profit)) + 
  geom_boxplot(color = "black", fill = c("#fa9fb5","#c994c7", "#bcbddc", "#8856a7")) +  labs(title = "Distribution of profits by appropriate age rating", x = "", y = "Average Profit") +
  theme_movies() 

2. Average profit earned by the year by genre

I was interested to see the the top 3 genres that made the maximum profit. In this case action, adventure and horror made the most profit. Next, I wanted to visualize this profit over the years.

# Amount of profit by genre
genre_table <- movies[,list(observations = .N, avg_profit = (mean(profit))), by = genre]
knitr::kable(genre_table, caption="Profit of movie by genre")
Profit of movie by genre
genre observations avg_profit
Comedy 813 80991031
Action 573 154676716
Adventure 481 224086193
Drama 1236 59637771
Horror 298 85202723

The time series animation below is the best way to see trends in profit. We can see that there is a drastic change in the average profit of horror movies as compared thr action and adventure. It is very high before 1980’s but tends to decline over time, picking up pace again after 1995. Nonetheless, its profits remain below that earned by action and adventure movies. Horror seems like a niche category, not liked by everyone but rather few people. For example, small children or older people prefer adventure or drama over horror movies. This could be due to the element of fiction involved in horror.

df1 <- movies[, .(avg_profit = mean(profit)), by=.(year, genre)]
df1 <- subset(df1, genre == c("Action","Adventure", "Horror"))
p3 <- ggplot(df1, aes(x = year , y = avg_profit , color = genre)) + 
    geom_line() + geom_point() +
    transition_reveal(year) + 
    labs(title = 'Profit Distribuition', 
                 subtitle = 'Average profit of movies separated by year and genre', 
                   y = 'Average Profit by Genre',
                   x = 'Year') +
     scale_x_continuous(limits = c(1970,2019), breaks = seq(1970,2019, by = 4)) +
     scale_y_continuous(breaks = seq(0,800000000, by = 50000000)) +
     theme_movies() +
    scale_colour_manual(values = c("#fa9fb5", "#c51b8a", "#c994c7"))
animate(p3, end_pause = 10, fps = 5)

  1. MDS for movies released between 1950-2019

I performed MDS on a random subsample of mocies, selecting a movie from every year between 1950 and 2019. In the plot below we can see which movies have similar features that are close to each other. However, we can observe a few exceptions. Using this plot we can determine which movies may be grouped in one genre or have similar characteristics.

# select a random movie from every year between 1950 and 2019
# create empty data frame where rows can be added
mds_df <- data.frame(matrix(ncol = ncol(movies), nrow = 0))
colnames(mds_df) <- colnames(movies) # add column names
# select the games
set.seed(8872)
for (i in 1950:2019){
  temp_df <- movies[year == i]
  mds_df <- rbind(mds_df, temp_df[sample(nrow(temp_df), 1),])
}
# put the names of the games as row names
mds_df <- column_to_rownames(mds_df, var = 'movie')
# filter numeric variables
mds_df <- mds_df[, lapply(mds_df, is.numeric) == TRUE]
# create distance matrix and perform MDS
mds_df <- cmdscale(dist(scale(mds_df)))
mds_df <- as.data.frame(mds_df) # convert to df
mds_df$movie <- rownames(mds_df) # add names of movies
rownames(mds_df) <- NULL
# create the plot
ggplot(data = mds_df, aes(x = V1, y = V2, label = movie)) +
  geom_text_repel(colour="#c51b8a") +
  labs( x = '', y = '', title = 'MDS for movies released between 1950 and 2019') +
  theme(axis.line=element_blank(),axis.text.x=element_blank(),
          axis.text.y=element_blank(),axis.ticks=element_blank()) +
   theme_movies()

4. Top 5 genres with most movies

This plot shows the top 5 genres with most movies in our dataset. In a chart above we saw the average profit made by movies in three different genres. Here I wanted to see which genre had the most movies and select the top 5 genres. Drama topped the list, followed by comedy and action. Perhaps drama category includes more because they are produced more. This could be due to public preference.

# check the genres with the most movies
# separate genres into rows so that each movie has as many rows as many genres it belongs to
df_cat <- tidyr::separate_rows(movies, genre, sep = ",")
# convert back to data table
df_cat <- as.data.table(df_cat)
# calculate number of movies in each genre and get top 5
df_cat <- df_cat[ , .(num = .N), by = genre][order(-num)][1:5]
# create plot
ggplot(df_cat, aes(x = reorder( genre, num ), y = num)) + 
  geom_segment(aes(x = reorder( genre, num ), xend = genre, y = 0, yend = num)) + 
  geom_point(size = 5, colour="#c51b8a") + 
  labs( x = "", y = "", title = "Top 5 genres based on number of movies") + 
  coord_flip() +
  geom_text(aes(label=num),hjust=-0.5, vjust=0.5) +
  theme(axis.line=element_blank(),axis.text.x=element_blank(),
  axis.ticks=element_blank())+
  theme_movies()

5. Worldwide and Domestic Gross Revenue Comparison

Another interesting visualization was using geom smooth function to compare the domestic gross revenue and worldwide revenue of each movie. The chart below shows that there are some movies whose domestic revenue is low but their worldwide revenue is higher. This could be because the movie was liked across the world but not so much within the United States.

# domestic and worldwide gross revenue comparison
ggplot(movies, aes(x = domestic_gross, y = worldwide_gross)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = 'lm', colour="#c51b8a") +
  labs( x='Domestic Gross Revenue', y='Worldwide Gross Revenue', title = 'Relationship between domestic and worldwide gross revenue') +
  geom_point(data = movies, colour = '#fa9fb5') +
  theme_movies()

  1. Number of Movies & Average Rating by Distributor

As you hover over the points you can see the names of the distributors responsible for the distribution of movies. We can see that the highest rating is contested by Illuminare, Montery Media, Alchemy & Regent Releasing. However, the most movies distributed are by Warner, but those movies have received a poor rating.

df3 <- tidyr::separate_rows(movies, distributor, sep = ",")
# convert back to data table
df3 <- as.data.table(df3)

# get number of games by mechanic and average rating
df3 <- df3[ , list(num = .N, avg_rating = mean(ratings)), by = distributor]
# create interactive plot
p1 <- ggplot(df3, aes(x=num, y=avg_rating, tooltip = distributor, data_id = distributor)) +
  geom_point_interactive() +
  theme_movies() +
  labs(x='Number of movies', y='Average rating', title = 'Relationship between rating and number of movie \n (Hover over points to see distributor)')
# customize tooltips
tooltip_css <- "background-color:gray;color:white;font-style:italic;padding:10px;border-radius:5px;"
girafe(ggobj = p1,
  options = list( opts_tooltip(css = tooltip_css),
                  opts_hover_inv(css = "opacity:0.5;"),
                  opts_hover(css = "fill:pink;")))
  1. Success Rate by Budget

To see how much budget was allocated to movies rated as superhit, very good, profitable and flop, I created a simple bar chart. We can see in the earlier years all types of films were allocated more or less the same budget but after 1980’s movies that became flop had more budget allocated than the rest, followed by superhit movies.

p4 <- ggplot(movies, aes(x = success_rate, y = budget, fill = success_rate)) + geom_bar(stat = 'identity') +
  theme_movies() +
  transition_states(year) +
  labs(title = 'Average budget allocated by success rate', 
       subtitle = 'year: {closest_state}',
       y = 'Average Budget') +
  scale_fill_manual(values = c("#fa9fb5", "#c51b8a", "#c994c7", "#9ebcda")) +
  theme(legend.position = "top" , legend.title = element_blank())
  
animate(p4, end_pause = 10, fps = 5)

  1. Average budget and year of release

To answer the question how budget has changed over the years for movies, I have divided the budget into two categories; below and above. The light pink shows movies that received a less than average budget in that particular movies. In 2007 movies were allocated less budget as compared to 2016. A reason for this could be the US financial crunch during that period and a resulting effect was the film industry having shortage of funds to produce movies due to the crisis.

# calculate average budget by year
df_budget <- movies[ , .(budget = mean(budget)), by = year]
# filter years because there are too many of them
df_budget <- df_budget[year >= 2000]
# add if an observation is above or below average
df_budget[ , type := ifelse(budget < 35000000, 'below', 'above')]
# sort values based on budget
df_budget <- df_budget[order(budget), ]
# convert year to factor so that order remains the same on the plot
df_budget <- df_budget[, year := factor(year, levels = year)]
# create diverging bar chart
ggplot(df_budget, aes(x=year, y=budget, label=budget)) + 
  geom_bar(stat='identity', aes(fill=type), width=.5)  +
  scale_fill_manual(name="Budget", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("above"="#c51b8a", "below"="#fa9fb5")) + 
  labs(title= "The change in budget of movies in 21st century", y="", x="") + 
  coord_flip() +
  theme_movies() +
  theme(axis.line=element_blank(),axis.text.x=element_blank(),axis.ticks=element_blank())

CONCLUSION

For an analyst interested in seeing trends, patterns and other interesting aspects of movies or any particular industry for that matter, visualziations are the best method. If I were to explain my analysis to a production house, regarding a movie that I wish to produce, rather than writing a 20 page report, I can simply show him these interactive charts that summarize all the important, filtered information for the funding company and would enable the producer to betetr explain his understanding of the film industry.

In this study, due to the variety of information given, we were interested in profit, budget and revenues of films by genres , distributors and ratings. This is a relatively general study and can be further used for analysis given more variables for example the production company, information about the cast etc. This could certainly serve as a foundation study, upon which much more analysis can be done.