Import IMDB 5000 Dataset

In the analysis, we will be analyzing the dataset created by chuansun76, Please download the dataset before you start with any of the following analysis. https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.

The first step is to import the data and download the related packages(ggplot2,dplyr,GGally) we will be using later:

movie<-read.csv("movie_metadata.csv",header=T,stringsAsFactors = F)

library(ggplot2)
library(dplyr)
library(GGally)
library(MASS)
library(caTools)

Clean Up the Dataset

After importing the data, let’s start to remove the duplicated records and also remove any records with NA Values:

movie.nodup<-movie[!duplicated(movie),]
movie.final<-movie.nodup[complete.cases(movie.nodup),]
colSums(is.na(movie.final))

Noted in the intial investigation that the color variable is having null value, and the aspect_ratio and duration variable should be factor instead of num in datatype. the following step further transform the data according to the points above:

movie.final<-subset(movie.final,color == "Color" | color == " Black and White")
movie.final$aspect_ratio <- as.factor(movie.final$aspect_ratio)

movie.final$duration2 <- cut(movie.final$duration,breaks = 3)
movie.final$duration2 <- factor(movie.final$duration2,labels = c("short","medium","long"))

After all the cleaning up, it’s time to get a sense of what do we have in the dataset.

Here we hide the results, as it could be too messy. basically, we have 3767 row of completed records and 29 variables including variables about movie features, IMDB score, gross sales and social media impact:

str(movie.final)
summary(movie.final)

Basic Descriptive Analysis

IMDB Score

As the IMDB score will become the obvious predicted variable we are going to use, we are interested in seeing the overall distribution of IMDB score. the data shows IMDB score is concentrated around 5.9 ~ 7.2 (the inter-quarter range).And roughly from the graph, it is normally distributed:

summary(movie.final$imdb_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.900   6.600   6.467   7.200   9.300
IMDB_Dist<-ggplot(movie.final,aes(x=imdb_score))

IMDB_Dist_Plot<-IMDB_Dist + geom_histogram(colour="darkgreen",fill="white",binwidth = 0.5) + ggtitle("IMDB Score Distribution") + theme(plot.title = element_text(hjust = 0.5))

IMDB_Dist_Plot

IMDB Score by Color of Movies

We have black & white AND color movies categories in the dataset. Is there any difference between these two categories in terms of IMDB Score? W started to answer the question by examine the distribution:

table(movie.final$color)
## 
##  Black and White            Color 
##              128             3639
table(movie.final$color)/nrow(movie.final)
## 
##  Black and White            Color 
##       0.03397929       0.96602071
with(movie.final,tapply(imdb_score,color,summary))
## $` Black and White`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.500   7.100   7.103   7.725   8.900 
## 
## $Color
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   5.800   6.500   6.444   7.200   9.300
IMDB_Dist_Plot+facet_wrap(~color)

According the results, even though we only have about 3% of B&W movies, it averagely has higher IMDB score than Color movies. But is the difference significant? It’s time to apply a two-sample ttest now:

test_IMDB_Color <- t.test(movie.final$imdb_score[which(movie.final$color==" Black and White")],movie.final$imdb_score[which(movie.final$color=="Color")])

test_IMDB_Color$p.value
## [1] 2.214732e-13
test_IMDB_Color$estimate
## mean of x mean of y 
##  7.103125  6.444133

As the result shows, the difference is significant, which means we are confident to say B&W movies averagely have higher IMDB Scores than Color movies.

Movie Counts by Years

It could be interesting to see are we releasing more movies or less movies across all these years. Please note here as the dataset was built early 2016, the 2016 data should not be complete:

year_num<-as.data.frame(table(movie.final$title_year))
year_num_sorted<-arrange(year_num,desc(Freq))
colnames(year_num_sorted)<-c('Year','Count')
head(year_num_sorted,10)
##    Year Count
## 1  2006   189
## 2  2002   187
## 3  2009   185
## 4  2005   184
## 5  2004   182
## 6  2008   182
## 7  2001   176
## 8  2010   169
## 9  2011   168
## 10 2013   165
plot(year_num$Var1,year_num$Freq,ylab="Count",main="Movie Counts by Year")
lines(year_num$Var1,year_num$Freq,type="h",col='darkgreen')

It’s easy to see from above that the boom of movie release happened after year 2000, adn it drops a little after year 2014.

Movie Counts by Country

So which country is releasing most movies? Although the answer is obviously the USA, it’s more convicing to see it in data:

country_num<-as.data.frame(table(movie.final$country))
country_num<-arrange(country_num,desc(Freq))
colnames(country_num)<-c('Country','Count')
head(country_num,10)
##      Country Count
## 1        USA  2979
## 2         UK   317
## 3     France   103
## 4    Germany    81
## 5     Canada    62
## 6  Australia    40
## 7      Spain    22
## 8      Japan    17
## 9      China    13
## 10 Hong Kong    13
c<-ggplot(head(country_num,10),aes(x=reorder(factor(Country),Count),y=Count))

c+geom_bar(stat = "identity",fill="darkgreen")+coord_flip()+labs(x="Country",y="Number of movies") + ggtitle("Top 10 Most Movie Countries") + theme(plot.title = element_text(hjust = 0.5))

Movie Counts / IMDB Score / Duration by Genres

Movie Genres is an informative category in the dataset, which can be used to explore different metrics. Here we start by seeing the basic movie counts by genres:

genre_num<-as.data.frame(table(movie.final$genres))
genre_num<-arrange(genre_num,desc(Freq))
colnames(genre_num)<-c('Genres','Count')
head(genre_num,10)
##                         Genres Count
## 1                        Drama   149
## 2         Comedy|Drama|Romance   148
## 3                 Comedy|Drama   141
## 4                       Comedy   137
## 5               Comedy|Romance   131
## 6                Drama|Romance   115
## 7         Crime|Drama|Thriller    82
## 8        Action|Crime|Thriller    55
## 9  Action|Crime|Drama|Thriller    50
## 10     Action|Adventure|Sci-Fi    46
g<-ggplot(head(genre_num,10),aes(x=reorder(factor(Genres),Count),y=Count,alpha=Count))

g+geom_bar(stat = "identity",fill="darkgreen")+coord_flip()+labs(x="Genres",y="Number of movies") + ggtitle("Top 10 Most Movie Genres") + theme(plot.title = element_text(hjust = 0.5))

Obviously we can see Drama,Comedy,Romance and the combination of these three releases the most movies.

So how’s the IMDB Score varies between the top 10 movie genres? Let’s take a look:

genre_num_top<-head(genre_num,10)

db_g<-ggplot(movie.final[movie.final$genres%in%genre_num_top$Genres,c("genres","imdb_score")],aes(factor(genres),imdb_score))

db_g+geom_boxplot(aes(fill=factor(genres)))+labs(x="Genres",y="IMDB Score") + ggtitle("IMDB Score Distribution by Genres")+theme(plot.title = element_text(hjust = 0.5),axis.title.x=element_blank(),axis.text.x=element_blank(), axis.ticks.x=element_blank())

The difference obvious, but we can still see comedy related genres have averagely lower scores, while Drama related have the highest.

Another minor point might be fun to know - is there a duration variable across different genres? Let’s take a look:

d<-ggplot(movie.final[movie.final$genres%in%genre_num_top$Genres,c("genres","duration")],aes(factor(genres),duration))

d+geom_boxplot(aes(fill=factor(genres)))+labs(x="Genres",y="duration") + ggtitle("Duration Distribution by Genres") + theme(plot.title = element_text(hjust = 0.5),axis.title.x=element_blank(),axis.text.x=element_blank(), axis.ticks.x=element_blank())

It seems Action|Adventure|Sci-Fi averagely runs longer than Comedies. So probably next when you just want a short relax, start your browse from Comedies:)

IMDB Score by Duration

As we talked about duration, what’s the IMDB Score across different duration? Here is the clear result:

with(movie.final,tapply(imdb_score,duration2,summary))
## $short
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60    5.80    6.50    6.37    7.10    8.80 
## 
## $medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.700   7.300   7.278   7.800   9.300 
## 
## $long
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.300   6.900   8.000   7.627   8.300   8.500
Duration<-ggplot(movie.final,aes(factor(duration2),imdb_score))

Duration+geom_boxplot(aes(fill=factor(duration2)))+labs(x="duration",y="IMDB Score") + ggtitle("IMDB Score by duration")+theme(plot.title = element_text(hjust = 0.5))

It seems the longer the movie, the higher the score. But is it statistically significant? Let’s examine by ANOVA test:

fit_duration<-aov(imdb_score~duration2,data=movie.final)
summary(fit_duration)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## duration2      2    301  150.26   145.3 <2e-16 ***
## Residuals   3764   3893    1.03                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Clearly, the result shows the difference is significant.

Movie Counts / IMDB Score by Director / Actors

Who are the Directors with most movies? And who are the directors with averagely the highest IMDB score? The result might be indicative for investors when they are consider their invest in movies:

director_num<-as.data.frame(table(movie.final$director_name))
directors_num<-arrange(director_num,desc(Freq))
colnames(directors_num)<-c('Director','Count')
head(directors_num,10)
##             Director Count
## 1   Steven Spielberg    25
## 2     Clint Eastwood    19
## 3        Woody Allen    19
## 4       Ridley Scott    17
## 5    Martin Scorsese    16
## 6  Steven Soderbergh    16
## 7         Tim Burton    16
## 8       Renny Harlin    15
## 9          Spike Lee    15
## 10    Barry Levinson    13
director<-ggplot(head(directors_num,10),aes(x=reorder(factor(Director),Count),y=Count,alpha=Count))
director+geom_bar(stat = "identity",fill="darkgreen")+coord_flip()+labs(x="Directors",y="Number of movies") + ggtitle("Top 10 Most Movie Directors")+theme(plot.title = element_text(hjust = 0.5))

Steven Spielberg wins in amount! But how about the average IMDB Score? Before we do the analysis, we need to exclude directors with only one movie, as it’s not meaningful to put them into comparison:

director_num_order<-order(director_num$Freq,decreasing=T)
director_num_sorted<-director_num[director_num_order,]

dir_num_final <- director_num_sorted[director_num_sorted$Freq>=2,c("Var1","Freq")]

dire_num_subset<-movie.final[movie.final$director_name%in%dir_num_final$Var1,c("director_name","imdb_score")]

OK now let’s take a look at the IMDB Score by directors:

attach(dire_num_subset)
director_score<-as.data.frame(tapply(imdb_score,director_name,mean))
colnames(director_score)<-"mean_score"

director_score_sorted<-sort(director_score$mean_score,decreasing=T)
director_score_top <- as.data.frame(head(director_score_sorted,10))
director_score_top$director_name<-rownames(director_score_top)

colnames(director_score_top)<-c("mean_score","Director")

director_score_top
##                   mean_score          Director
## Sergio Leone        8.433333      Sergio Leone
## Christopher Nolan   8.425000 Christopher Nolan
## Pete Docter         8.233333       Pete Docter
## Hayao Miyazaki      8.225000    Hayao Miyazaki
## George Roy Hill     8.200000   George Roy Hill
## Quentin Tarantino   8.200000 Quentin Tarantino
## Victor Fleming      8.150000    Victor Fleming
## Milos Forman        8.133333      Milos Forman
## Akira Kurosawa      8.100000    Akira Kurosawa
## David Lean          8.000000        David Lean
detach(dire_num_subset)

How about the same analysis on actors? :

actor_num<-as.data.frame(table(movie.final$actor_1_name))
actor_num<-arrange(actor_num,desc(Freq))

colnames(actor_num)<-c('Actor','Count')
head(actor_num,10)
##                Actor Count
## 1     Robert De Niro    42
## 2        Johnny Depp    38
## 3       J.K. Simmons    31
## 4  Denzel Washington    30
## 5       Nicolas Cage    30
## 6       Bruce Willis    29
## 7         Matt Damon    28
## 8        Liam Neeson    26
## 9  Robert Downey Jr.    26
## 10     Harrison Ford    25
actor<-ggplot(head(actor_num,10),aes(x=reorder(factor(Actor),Count),y=Count,alpha=Count))
actor+geom_bar(stat = "identity",fill="darkgreen")+coord_flip()+labs(x="Actors",y="Number of movies") + ggtitle("Top 10 Most Movie Actors") + theme(plot.title = element_text(hjust = 0.5))

actor_num<-as.data.frame(table(movie.final$actor_1_name))
actor_num_order<-order(actor_num$Freq,decreasing=T)
actor_num_sorted<-actor_num[actor_num_order,]

act_num_final <- actor_num_sorted[actor_num_sorted$Freq>=2,c("Var1","Freq")]

act_num_subset<-movie.final[movie.final$actor_1_name%in%dir_num_final$Var1,c("actor_1_name","imdb_score")]
attach(act_num_subset)
actor_score<-as.data.frame(tapply(imdb_score,actor_1_name,mean))
colnames(actor_score)<-"mean_score"
actor_score_sorted<-sort(actor_score$mean_score,decreasing=T)
actor_score_top <- as.data.frame(head(actor_score_sorted,10))
actor_score_top$actor_name<-rownames(actor_score_top)
colnames(actor_score_top)<-c("mean_score","actor_name")

actor_score_top
##                       mean_score            actor_name
## Mathieu Kassovitz       8.400000     Mathieu Kassovitz
## Michael Moore           7.833333         Michael Moore
## John Cameron Mitchell   7.800000 John Cameron Mitchell
## Shane Black             7.800000           Shane Black
## Quentin Tarantino       7.625000     Quentin Tarantino
## Irvin Kershner          7.600000        Irvin Kershner
## Clint Eastwood          7.441667        Clint Eastwood
## Tom Hanks               7.425000             Tom Hanks
## Kevin Spacey            7.150000          Kevin Spacey
## Denzel Washington       7.083333     Denzel Washington
detach(act_num_subset)

It seems Robert De Niro wins in amount, while Mathieu Kassovitz wins in average score.

Correlation Exploration

As we have got a basic sense of the dataset, it’s time to see what are the most related variables, and build models on these relationships. Let’s start with taking a general look all the correlations:

ggcorr(movie.final,nbreaks = 6)+ggtitle("Correlation Overview")+theme(plot.title = element_text(hjust = 0.5))

The details data can be viewed using the following command:

corr_graph<-ggcorr(movie.final)
corr_data<-ggplot_build(corr_graph)
cor(movie.final[,corr_data$data[[2]]$label])

Based on the data, imdb_Score is related to num_voted_users + duration + num_of_critic_for_reviews.

Data transformation for Modeling IMDB Score

By the initial check for variable normality, it seems num_voted_users is highly right skewed. We transformed the variable by taking a log of the values:

attach(movie.final)
par(mfrow=c(1,1))
qqnorm(num_voted_users)
qqline(num_voted_users)

par(mfrow=c(1,2))
hist(num_voted_users,main="num_voted_users (original)",xlab = 'num_voted_users')
hist(log(num_voted_users),main="num_voted_users (log)",xlab = 'log(num_voted_users)')

par(mfrow=c(1,1))
qqnorm(log(num_voted_users))
qqline(log(num_voted_users))

detach(movie.final)

Modeling

Spilt Train/Validation dataset

To better understand the prediction power of the model, we first randomly split the data into 70/30 Train/validation sets, using Training set to build the model, and using validation set to validate the model:

set.seed(42)
split<-sample.split(movie.final,SplitRatio = 0.7)
movie.train<-movie.final[split,1:29]
movie.validation<-movie.final[!split,1:29]

dim(movie.final)
## [1] 3767   29
dim(movie.train)
## [1] 2598   29
dim(movie.validation)
## [1] 1169   29

Building Model Based on Most Correlated Variables (Stepwise)

By using the 3 most correlated variables, with one of them log transformed, we build the final model using stepwise method:

movie.fit<-lm(imdb_score ~ log(num_voted_users) + duration + num_critic_for_reviews ,data=movie.train)
movie.step<-stepAIC(movie.fit,direction = "both")
movie.step$anova
#stepwise suggest to keep all three variables
summary(movie.fit)
## 
## Call:
## lm(formula = imdb_score ~ log(num_voted_users) + duration + num_critic_for_reviews, 
##     data = movie.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9681 -0.5258  0.0711  0.6399  2.4505 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.5799783  0.1694208  15.228  < 2e-16 ***
## log(num_voted_users)   0.2302911  0.0171271  13.446  < 2e-16 ***
## duration               0.0116360  0.0008285  14.044  < 2e-16 ***
## num_critic_for_reviews 0.0007960  0.0001979   4.022 5.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.921 on 2594 degrees of freedom
## Multiple R-squared:  0.2633, Adjusted R-squared:  0.2625 
## F-statistic:   309 on 3 and 2594 DF,  p-value: < 2.2e-16

The r squared is about 26%. It indicates that IMDB_score is a complicated case to be predicted, and currently what we have in the data can only explain a little part of it.

Predict Power of the Model

So are we getting similar R squred in Prediction? Let’s caculate it:

movie.pred<-predict(movie.fit,movie.validation)
SSE <-sum((movie.pred-movie.validation$imdb_score)^2)
SST <-sum((mean(movie.final$imdb_score)-movie.validation$imdb_score)^2)
R2 <-1-SSE/SST
R2
## [1] 0.2396459

R squared is about 23%, it shows the out-of-sample prediction is less powerful than training.