IMDb Movie Analysis

Sylvia Gong

August 23, 17

Synopsis

IMDb describes itself as the world’s most popular and authoritative source for movie, TV and celebrity content.

Every year, thousands of movies are produced worldwide. Before going to the cinema, people tend to check IMDb to see whether their favourite actors and directors are involved, how highly the movie is rated and what the critics say about it. Then they decide whether or not to buy a ticket. This analysis gives an overview of movies on IMDb and focuses on the indicators of a successful movie.

Movie investors would like to predict how well a movie will perform before its release. Relevant criteria probably include duration, genre, actors, directors and marketing. This analysis examines to what extent such variables are related to IMDb scores and box office results, and is intended as a reference.

Packages

These are required packages for further analysis.

library(tidyverse) # Data manipulation and plotting
library(ggrepel) # Provides text and label geoms
library(formattable) # Provides formattable vectors and data frames
library(data.table) # provides an enhanced version of data.frames
library(plotly) # Makes interactive graphs and tables
library(readxl) # Read excel files
library(DT) # Show a preview
library(stringr) # Character Manipulation
library(tidytext) # Text Mining and processing
library(wordcloud) # Create a wordcloud
library(RColorBrewer) # Color schemes
library(tm) # Text management
library(corrgram) # Correlation plotting
library(ggpubr) # For a better display

Data preparation

Data loading

The original dataset, the IMDB 5000 Movie Dataset, is from Kaggle. It contains 28 variables for over 5000 movies, spanning 100 years and 66 countries, and was scraped from IMDb by Chuan Sun.

movies_raw <- read_csv("movie_metadata.csv")

Now we want to see the number of missing values and how they are distributed. The original dataset has 2702 missing values in total. I will deal with them in the data cleaning step below.

  # Detect missing values
colSums(is.na(movies_raw)) %>%
  sort(decreasing = TRUE)

##                     gross                    budget 
##                       884                       496 
##              aspect_ratio            content_rating 
##                       329                       303 
##             plot_keywords                title_year 
##                       153                       108 
##             director_name   director_facebook_likes 
##                       104                       104 
##    num_critic_for_reviews    actor_3_facebook_likes 
##                        50                        23 
##              actor_3_name      num_user_for_reviews 
##                        23                        21 
##                     color                  duration 
##                        19                        15 
##              actor_2_name      facenumber_in_poster 
##                        13                        13 
##    actor_2_facebook_likes                  language 
##                        13                        12 
##    actor_1_facebook_likes              actor_1_name 
##                         7                         7 
##                   country                    genres 
##                         5                         0 
##               movie_title           num_voted_users 
##                         0                         0 
## cast_total_facebook_likes           movie_imdb_link 
##                         0                         0 
##                imdb_score      movie_facebook_likes 
##                         0                         0
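To put the counts above in proportion, the same idea can be sketched on a small made-up data frame. The `toy` table and its values are assumptions for illustration only and are not taken from the dataset; only base R is used.

```r
# Toy data frame standing in for movies_raw (values are made up)
toy <- data.frame(
  gross  = c(100, NA, 300, NA),
  budget = c(50, 60, NA, 80),
  title  = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Number of missing values per column, largest first
na_count <- sort(colSums(is.na(toy)), decreasing = TRUE)

# Share of rows that are missing in each column, in percent
na_pct <- round(100 * na_count / nrow(toy), 1)

# Total number of missing cells in the whole table
na_total <- sum(is.na(toy))

print(data.frame(missing = na_count, percent = na_pct))
print(na_total)
```

Looking at the share of rows rather than the raw count makes it easier to judge how many movies would be lost if rows with missing values were dropped.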

Data Cleaning

The purpose of data cleaning in this project includes:

Remove instances which have at least one NA variable
Remove duplicate values in movie_title
Create new variables profit and ROI for profit analysis
Select useful variables and create a cleaned dataset

# Remove instances which have at least one NA variable
movies_raw <- movies_raw[complete.cases(movies_raw), ]
# Remove duplicate values in movie_title
movies_raw <- movies_raw[!duplicated(movies_raw$movie_title), ]

# Add variables profit and ROI (return on investment, in percent)
movies_raw1 <- movies_raw %>%
  mutate(profit = gross - budget, ROI = (profit / budget) * 100)
# Create a subset with the useful variables
movies_new <- movies_raw1 %>%
  select(director_name, duration, gross, genres, movie_title,
         movie_facebook_likes, plot_keywords, language, country,
         content_rating, budget, title_year, imdb_score, profit, ROI)

Now we have the cleaned dataset which will be used later. Here is a preview of what it looks like.

datatable(movies_new)
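As a quick check of the profit and ROI definitions used in the cleaning step, here is a minimal worked example in base R. The two films and their budget and gross figures are made up for illustration.

```r
# Two hypothetical films (figures are invented, in dollars)
toy <- data.frame(
  movie_title = c("Film A", "Film B"),
  budget = c(10e6, 50e6),
  gross  = c(40e6, 45e6),
  stringsAsFactors = FALSE
)

# profit = gross - budget
toy$profit <- toy$gross - toy$budget

# ROI = profit as a percentage of budget
toy$ROI <- toy$profit / toy$budget * 100

print(toy)
```

Film A earns four times its budget, so its ROI is 300 percent, while Film B loses a tenth of its budget and ends up with an ROI of -10 percent even though its gross is higher than Film A's.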

Data Analysis

This section contains exploratory data analysis and interpretation. Information includes titles, code, plots and findings. The main analyses I conducted are:

Movies released by year and country
Top Director analysis
Content rating analysis
Profit analysis
Duration analysis
Keywords analysis
Correlation analysis

Histogram of movies released since 1970

Since very few movies in the dataset were released before 1970, we keep only movies released since 1970 and look at the trend in movie production.

# histogram of years since 1970
movies_year <- subset(movies_new, title_year >= 1970)
ggplot(movies_year, aes(x = title_year)) +
  # one bin per year avoids the empty bins that 100 bins over fewer than 50 years would create
  geom_histogram(binwidth = 1, fill = "steelblue", alpha = 0.5, col = "black") +
  labs(title = "Histogram of Movies Released Each Year", x = "Year", y = "Frequency")

This plot shows the number of movies released per year. 2000-2010 seems to be a golden decade for movies, and there has been a decreasing trend since then. It also shows strong growth between 1990 and 2000. I was also curious about the proportion by country, so I went on to plot movies released by year with countries marked in different colors.

# bar chart of movies released by year and country
graph <- ggplot(movies_new, aes(x = title_year, fill = country)) +
  geom_bar() +
  labs(x = "Year", y = "Movies")
# countries other than USA
nousa <- movies_new %>%
  filter(country != "USA")
withoutusa <- ggplot(nousa, aes(x = title_year, fill = country)) +
  geom_bar() +
  labs(x = "Year", y = "Movies")
ggarrange(graph, withoutusa, labels = c("Worldwide", "Without USA"),
          common.legend = TRUE, legend = "bottom")

The two plots above show how many movies were released each year by country. The left one indicates that a significantly large proportion of the movies were released in the USA (in pinkish red). We can’t really tell the other countries apart, except for a relatively clear share from Europe (in green). So I created the right one without the USA, from which we can see the number of movies from other countries. The UK and other European countries make up a large proportion, but we also notice a certain number of movies from China and Canada (in dark yellow). Comparing the left plot (USA included) with the right one (without USA), I noticed a slight difference in distribution: the decreasing trend since 2010 is more obvious in the right one. So the number of USA movies has been relatively consistent in recent years, while far fewer movies per year come from other parts of the world.

Profit and Mean Score of Directors

For directors who have directed at least 3 movies, we want to know their movies’ average IMDb score and profit to see who ranks among the top directors.

# profit and mean score of directors
movies_new %>%
  group_by(director_name) %>%
  summarise(avg_rating = mean(imdb_score), avg_profit = mean(profit), num = n()) %>%
  filter(num >= 3, avg_rating > 7.5, avg_profit > 1e+5) %>%
  na.omit() %>%
  ggplot(aes(x = avg_rating, y = avg_profit, label = director_name)) +
  geom_point(color = "blue", fill = "white", size = 1.2, stroke = 2) +
  geom_text_repel() +
  labs(x = "Mean IMDB Score", y = "Average Profit",
       title = "Top Directors' Average IMDB Score and Profit")

This plot ranks directors by their movies’ average score and average profit. I only kept directors with at least 3 movies, and all of the names shown here have an average movie score above 7.5, which seems to be a good score to me. Average movie profit is marked on the y axis. So now we have some familiar names to look at. On the far right, Nolan has the highest average movie score, about 8.4, which shows the consistently high quality of his movies. His movies also make a fair profit. James Cameron is quite different from Nolan: his movies have an average score of around 7.85 but generate high profits. One of my favourite directors, Quentin Tarantino, scores 8.2 on average, but the profit of his movies is not as high.

Content Rating Analysis

I would like to know whether content rating is a key factor in IMDb score and which categories have higher scores. I also compared movie facebook likes and ROI across content rating groups.
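The per-director summary behind this plot can be sketched in base R on a small invented table. The director names, scores and profits below are hypothetical, and `aggregate()` is used here in place of `group_by()` and `summarise()` only so that the sketch runs without extra packages.

```r
# Hypothetical movies for three made-up directors
toy <- data.frame(
  director_name = c("Director A", "Director A", "Director A",
                    "Director B", "Director B",
                    "Director C", "Director C", "Director C"),
  imdb_score = c(8.0, 8.5, 7.5, 9.0, 9.0, 6.0, 7.0, 8.0),
  profit     = c(100e6, 200e6, 300e6, 50e6, 70e6, 10e6, 20e6, 30e6),
  stringsAsFactors = FALSE
)

# Mean score and mean profit per director
avg <- aggregate(cbind(imdb_score, profit) ~ director_name, data = toy, FUN = mean)
names(avg)[2:3] <- c("avg_rating", "avg_profit")

# Number of movies per director
n_movies <- table(toy$director_name)
avg$num <- as.integer(n_movies[avg$director_name])

# Keep directors with at least 3 movies, a mean score above 7.5
# and an average profit above 100,000
top <- avg[avg$num >= 3 & avg$avg_rating > 7.5 & avg$avg_profit > 1e5, ]
print(top)
```

Director B has the highest mean score but only two movies, and Director C has three movies but a mean score of 7, so only Director A passes all three filters.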

# boxplot by content_rating
set.seed(123)
content_score <- plot_ly(movies_new, x = ~imdb_score, color = ~content_rating, type = "box") %>%
  layout(title = "Movie Imdb Score by Content Rating",
  xaxis = list(title = "Imdb Score"),
  yaxis = list (title = "Content Rating"))
content_score

The boxplot suggests that content rating is related to movie score. Movies rated M have the highest average score, while movies rated X score relatively lower than the others. I also notice that the scores of R, PG-13 and PG movies are more dispersed, but their median scores are not noticeably different.

# scatterplot: score vs fb_likes by content rating
ggplot(movies_new, aes(imdb_score, movie_facebook_likes)) +
  geom_point(aes(color = content_rating)) +
  scale_x_continuous("Imdb score", breaks = seq(0, 10, 1.25)) +
  scale_y_continuous("fb likes", breaks = seq(0, 2e+5, by = 2e+4)) +
  theme_bw() +
  labs(title = "Scores and Fb likes") +
  facet_wrap(~ content_rating)

Some content rating groups have movies with high scores but very few facebook likes. As the plot shows, PG-13, R and PG movies have greater facebook popularity than the other content rating groups.

#heatmap to see ROI
  ggplot(movies_new, aes(language, content_rating))+
    geom_raster(aes(fill = ROI))+
    labs(title ="Heat Map", x = "Language", y = "Content Rating")+
    scale_fill_continuous(name = "ROI")

The colors in this heat map do not render as intended. I will fix this and update it later.
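One possible reason for the color problem, offered as an assumption rather than a diagnosis: each language and content rating cell contains many movies, and ROI is dominated by a few extreme values, so a raw fill scale has little contrast. A minimal sketch of one way around this is to reduce each cell to its median ROI first and use a diverging scale centred at zero. The `toy` data and the choice of the median are illustrative only.

```r
# Made-up movies; several fall into the same language / rating cell
toy <- data.frame(
  language       = c("English", "English", "English", "French", "French"),
  content_rating = c("R", "R", "PG", "R", "R"),
  ROI            = c(50, 7000, 20, -30, 10),
  stringsAsFactors = FALSE
)

# One value per cell: the median ROI of the movies in that cell
cell_roi <- aggregate(ROI ~ language + content_rating, data = toy, FUN = median)
print(cell_roi)

# Draw the heat map only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cell_roi, aes(language, content_rating, fill = ROI)) +
    geom_tile() +
    # diverging scale: losses in red, gains in blue, zero in white
    scale_fill_gradient2(low = "red", mid = "white", high = "steelblue",
                         midpoint = 0) +
    labs(title = "Median ROI by Language and Content Rating",
         x = "Language", y = "Content Rating")
  print(p)
}
```

With one value per cell, every tile has a single well-defined color, and the centred scale separates cells that lose money from cells that make money.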

Profit analysis

movies_new %>%
  filter( profit > 0) %>%
  na.omit() %>%
  top_n(20, ROI) %>%
  arrange(ROI) %>%
  ggplot( aes( x = budget /1000000, y = profit /1000000, col = genres))+
  geom_point()+
  geom_text_repel( aes( label = movie_title), nudge_x = 0, nudge_y = 5)+
  labs( title="Top 20 Profitable Movies and their genres")

To be honest, I had never heard of these movies before. They are the 20 profitable movies with the highest ROI in the dataset, which suggests that a movie does not need to be popular to make ends meet for its makers. So producers may want to look at these movies and see whether they have something in common that makes them good investment choices.

How Duration Affects IMDb Score

Though we’ve done this in class, I still want to include it here because duration is a key factor for movies. Essentially, I plot the duration of movies and then categorize them into long, regular and short movies.

# duration-histogram
ggplot( movies_new, aes(duration))+
  geom_histogram( binwidth = 5)+
  coord_cartesian( xlim = c(0, 60*4))+
  labs( title = "Histogram of Movie duration")

movies_duration <- movies_new %>%
  mutate(durationtype = ifelse(duration >= 125, "Long",
                              ifelse(duration <= 60, "Short",
                                     "Regular"))) %>% na.omit() %>%
  select( movie_title, durationtype, everything())
# duration vs score facet
dur_facet <- ggplot(movies_duration, aes(imdb_score, fill = durationtype)) +
  geom_histogram() +facet_wrap(~durationtype, ncol = 1, scales = "free_y") 
# duration vs score density
dur_density <- ggplot(movies_duration, aes(imdb_score, fill = durationtype)) +
  geom_density( alpha = .4)
ggarrange(dur_facet, dur_density,
          labels = c("Facet Plot by Duration", "Density by Duration"),
          common.legend = TRUE, legend = "bottom")

These graphs indicate that long movies tend to score higher than short and regular movies. Movie length peaks at around 100 minutes, so that is the most common length. Meanwhile, the density plot tells a slightly different story: among the three groups, the scores of short movies are the most densely concentrated, at around 7.5.
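The duration grouping can be checked on a few made-up values. The sketch below uses the same thresholds as the analysis (125 minutes and above is long, 60 minutes and below is short) and compares the median score per group. The durations and scores are invented for illustration.

```r
# Hypothetical durations (minutes) and scores
duration   <- c(45, 60, 61, 100, 124, 125, 180)
imdb_score <- c(7.5, 7.0, 6.0, 6.5, 7.0, 7.5, 8.0)

# Same rule as in the analysis: >= 125 is Long, <= 60 is Short, the rest Regular
durationtype <- ifelse(duration >= 125, "Long",
                       ifelse(duration <= 60, "Short", "Regular"))
print(durationtype)

# Median score within each duration group
med <- tapply(imdb_score, durationtype, median)
print(med)
```

Note how the boundary values behave: a 60-minute movie is still short and a 125-minute movie is already long, while 61 and 124 minutes both count as regular.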

Keywords analysis

Each movie has several plot keywords, which give a general idea of what the movie is about. Keywords are good indicators of a movie’s themes, so I did some analysis based on them.

# separate keywords: replace the "|" delimiter with spaces
movies_new$keywords <- gsub("|", " ", movies_new$plot_keywords, fixed = TRUE)
# keywords analysis: count single words across all keyword strings
keywords1 <- Corpus(VectorSource(movies_new$keywords))
keywords_txt <- DocumentTermMatrix(keywords1)
keywords_freq <- colSums(as.matrix(keywords_txt))
keywords_df <- data.frame(word = names(keywords_freq), freq = keywords_freq)
# keep the 30 most frequent words, then drop filler words
kw_top30 <- keywords_df %>%
  top_n(30, freq) %>%
  filter(!(word == "the" | word == "title" | word == "based"))

ggplot( kw_top30, aes(x = reorder( word, -freq), y = freq))+ 
  geom_bar( stat = "identity")+
  theme( axis.text.x = element_text( angle = 45, hjust = 1))+
  ggtitle("keywords Frequency Graph")+
  xlab("Keywords")+
  ylab("Frequency")

This graph shows the most frequent keywords; I ranked the top 30 here. The most frequent keyword is “female”. Maybe it has something to do with women’s rights, or it indicates movies targeted at female viewers. Either way, it suggests that women are important movie viewers and possibly also reviewers. Other top-ranking keywords such as “friend”, “love” and “school” appear to describe teenage movies. There are also “death”, “murder” and “police”, which seem to be used in detective or crime movies.
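The counts above are per word, so a multi-word keyword contributes each of its words separately. As an alternative view, whole keywords can be counted by splitting on the "|" delimiter. The sketch below uses base R only, and the three keyword strings are made up.

```r
# Hypothetical plot_keywords values, using "|" as the delimiter
plot_keywords <- c("love|friend|high school",
                   "murder|police|love",
                   "friend|love")

# Split on the literal "|" character.
# With fixed = TRUE the pattern is taken literally, so it must be "|";
# an escaped pattern such as "\\|" belongs with the default regex mode.
keyword_list <- strsplit(plot_keywords, "|", fixed = TRUE)

# Count whole keywords across all movies, most frequent first
keyword_freq <- sort(table(unlist(keyword_list)), decreasing = TRUE)
print(keyword_freq)
```

Here "high school" stays a single keyword instead of being counted once under "high" and once under "school".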

kw_top100 <- keywords_df %>% top_n(100, freq) %>%
  filter(!(word == "the" | word == "title" | word == "based"))
# wordcloud() draws the plot directly, so only its own arguments are passed
wordcloud(words = kw_top100$word, freq = kw_top100$freq, random.order = FALSE,
          min.freq = 1, colors = brewer.pal(8, "Dark2"))

# compare with keywords of top movies
movies_new$keywords <- gsub("|", " ", movies_new$plot_keywords, fixed = TRUE)
# the 100 movies with the highest IMDb score
top_movies <- movies_new %>% top_n(100, imdb_score)
keywords2 <- Corpus(VectorSource(top_movies$keywords))
keywords2_txt <- DocumentTermMatrix(keywords2)
keywords2_freq <- colSums(as.matrix(keywords2_txt))
keywords2_df <- data.frame(word = names(keywords2_freq), freq = keywords2_freq)
kw2_top100 <- keywords2_df %>% top_n(100, freq) %>%
  filter(!(word == "the" | word == "title" | word == "based"))
wordcloud(words = kw2_top100$word, freq = kw2_top100$freq, random.order = FALSE,
          min.freq = 1, colors = brewer.pal(8, "Dark2"))

The same information is shown in the two word clouds, where the size of each word indicates its frequency. The first word cloud is based on all movies, while the second contains only the 100 top-scoring movies. Comparing the two, we can see a slight difference in ranking: “police” replaces “female” as the most frequent keyword. There are also heavier topics such as “nazi”, “war” and “time”. Maybe these topics lend themselves to high-quality movies.

Correlation analysis

I was wondering whether variables such as facebook likes and movie score are related in some way, and whether a linear regression model for movie score would make sense. So I first drew a correlation plot to test all variables.

  corrgram( movies_new, order = TRUE, lower.panel = panel.shade,
           upper.panel = panel.pie, text.panel = panel.txt,
           main = "Correlation plot")

This corrgram shows the correlation between variables in pairs. Blue indicates a positive correlation and red a negative one. The most correlated variables are gross and profit. However, that’s not what we are interested in, because it’s obvious. Next come duration and score, which we have analyzed before. Another relatively strong pair is facebook likes and gross, so we may take a closer look at them. That said, I don’t believe a regression model would be meaningful, because few variables are correlated with score and it’s hard to build a simple linear regression for it.

# look at score and fb_likes
corr1 <- ggplot(movies_new, aes(x = movie_facebook_likes, y = imdb_score)) +
  geom_point(color = "blue") + labs(x = "fb_likes", y = "score") +
  stat_smooth(method = "lm", se = FALSE, color = "red") + theme_grey() +
  ggtitle(paste("cor:", 0.478)) + geom_smooth(color = "black")
# look at gross and fb_likes
corr2 <- ggplot(movies_new, aes(movie_facebook_likes, gross)) +
  geom_point(color = "blue") +
  geom_smooth() +
  coord_cartesian(xlim = c(0, 2e+5))
ggarrange(corr1, corr2,
          labels = c("Correlation between facebook likes and score",
                     "Correlation between facebook likes and gross"))

As I expected, movie facebook likes are correlated with score to some extent. But the correlation coefficient is 0.478, so they are not a very strong predictor: a movie with many facebook likes will not necessarily score high. The second plot shows the relationship between facebook likes and gross. The correlation seems even weaker, so facebook likes are not a good predictor of gross either.
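A correlation coefficient like the one quoted above can be computed with `cor()`. The sketch below uses made-up likes and scores, so its numbers say nothing about the real dataset. It also shows the rank-based Spearman coefficient as a hedged alternative, since facebook likes are heavily skewed and Pearson's coefficient only measures linear association.

```r
# Invented values: likes grow much faster than scores
likes <- c(0, 1000, 5000, 20000, 80000, 150000)
score <- c(5.5, 6.0, 6.2, 7.1, 7.4, 8.0)

# Pearson correlation: strength of the linear relationship
r_pearson <- cor(likes, score)

# Spearman correlation: based on ranks, so it only asks whether
# score rises whenever likes rise
r_spearman <- cor(likes, score, method = "spearman")

print(round(c(pearson = r_pearson, spearman = r_spearman), 3))
```

In this toy example the score always rises with likes, so the Spearman coefficient is exactly 1, while the Pearson coefficient is lower because the relationship is not a straight line.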

Summary

The analysis focused on several subjects and now we can summarize the findings for each analysis we conducted.

Movies released by year: 2000-2010 seems to be a golden decade for movies, and there has been a decreasing trend since then. There was also strong growth between 1990 and 2000.
Country analysis: The USA leads by far in the number of movies released. Among other countries, the UK and other European countries account for a large share, with a notable number of movies from China and Canada.
Top director: Some famous directors have a high average score while others have high movie profits. Nolan has the highest average movie score, about 8.4, which shows the consistently high quality of his movies. James Cameron’s movies have an average score of around 7.85 but generate high profits. Quentin Tarantino scores 8.2 on average, but the profit of his movies is not as high.
Content-Rating analysis: Movies rated M have the highest average score, while movies rated X score relatively lower than the others. The scores of R, PG-13 and PG movies are more dispersed, but their median scores are not noticeably different. In terms of facebook popularity, PG-13, R and PG movies are ahead of the other content rating groups.
Duration analysis: Long movies tend to score higher than short and regular movies. Movie length peaks at around 100 minutes, so that is the most common length. Meanwhile, among the three groups, the scores of short movies are the most densely concentrated, at around 7.5.
Keywords analysis: The most frequent keyword is “female”, which suggests that women are important movie viewers and possibly also reviewers. Other top-ranking keywords such as “friend”, “love” and “school” appear to describe teenage movies. There are also “death”, “murder” and “police”, which seem to be used in detective or crime movies. Furthermore, heavier topics such as “nazi”, “war” and “time” are frequent in high-scoring movies.
Correlation analysis: I did not detect very strong correlations between score and the other variables, so I did not pursue regression modeling. There is only a weak correlation between facebook likes and gross, and between facebook likes and score.