In the analysis, we will be analyzing the dataset created by chuansun76, Please download the dataset before you start with any of the following analysis. https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.
The first step is to import the data and download the related packages(ggplot2,dplyr,GGally) we will be using later:
movie<-read.csv("movie_metadata.csv",header=T,stringsAsFactors = F)
library(ggplot2)
library(dplyr)
library(GGally)
library(MASS)
library(caTools)
After importing the data, let’s start to remove the duplicated records and also remove any records with NA Values:
movie.nodup<-movie[!duplicated(movie),]
movie.final<-movie.nodup[complete.cases(movie.nodup),]
colSums(is.na(movie.final))
Noted in the intial investigation that the color variable is having null value, and the aspect_ratio and duration variable should be factor instead of num in datatype. the following step further transform the data according to the points above:
movie.final<-subset(movie.final,color == "Color" | color == " Black and White")
movie.final$aspect_ratio <- as.factor(movie.final$aspect_ratio)
movie.final$duration2 <- cut(movie.final$duration,breaks = 3)
movie.final$duration2 <- factor(movie.final$duration2,labels = c("short","medium","long"))
After all the cleaning up, it’s time to get a sense of what do we have in the dataset.
Here we hide the results, as it could be too messy. basically, we have 3767 row of completed records and 29 variables including variables about movie features, IMDB score, gross sales and social media impact:
str(movie.final)
summary(movie.final)
As the IMDB score will become the obvious predicted variable we are going to use, we are interested in seeing the overall distribution of IMDB score. the data shows IMDB score is concentrated around 5.9 ~ 7.2 (the inter-quarter range).And roughly from the graph, it is normally distributed:
summary(movie.final$imdb_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 5.900 6.600 6.467 7.200 9.300
IMDB_Dist<-ggplot(movie.final,aes(x=imdb_score))
IMDB_Dist_Plot<-IMDB_Dist + geom_histogram(colour="darkgreen",fill="white",binwidth = 0.5) + ggtitle("IMDB Score Distribution") + theme(plot.title = element_text(hjust = 0.5))
IMDB_Dist_Plot
We have black & white AND color movies categories in the dataset. Is there any difference between these two categories in terms of IMDB Score? W started to answer the question by examine the distribution:
table(movie.final$color)
##
## Black and White Color
## 128 3639
table(movie.final$color)/nrow(movie.final)
##
## Black and White Color
## 0.03397929 0.96602071
with(movie.final,tapply(imdb_score,color,summary))
## $` Black and White`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.500 7.100 7.103 7.725 8.900
##
## $Color
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 5.800 6.500 6.444 7.200 9.300
IMDB_Dist_Plot+facet_wrap(~color)
According the results, even though we only have about 3% of B&W movies, it averagely has higher IMDB score than Color movies. But is the difference significant? It’s time to apply a two-sample ttest now:
test_IMDB_Color <- t.test(movie.final$imdb_score[which(movie.final$color==" Black and White")],movie.final$imdb_score[which(movie.final$color=="Color")])
test_IMDB_Color$p.value
## [1] 2.214732e-13
test_IMDB_Color$estimate
## mean of x mean of y
## 7.103125 6.444133
As the result shows, the difference is significant, which means we are confident to say B&W movies averagely have higher IMDB Scores than Color movies.
It could be interesting to see are we releasing more movies or less movies across all these years. Please note here as the dataset was built early 2016, the 2016 data should not be complete:
year_num<-as.data.frame(table(movie.final$title_year))
year_num_sorted<-arrange(year_num,desc(Freq))
colnames(year_num_sorted)<-c('Year','Count')
head(year_num_sorted,10)
## Year Count
## 1 2006 189
## 2 2002 187
## 3 2009 185
## 4 2005 184
## 5 2004 182
## 6 2008 182
## 7 2001 176
## 8 2010 169
## 9 2011 168
## 10 2013 165
plot(year_num$Var1,year_num$Freq,ylab="Count",main="Movie Counts by Year")
lines(year_num$Var1,year_num$Freq,type="h",col='darkgreen')
It’s easy to see from above that the boom of movie release happened after year 2000, adn it drops a little after year 2014.
So which country is releasing most movies? Although the answer is obviously the USA, it’s more convicing to see it in data:
country_num<-as.data.frame(table(movie.final$country))
country_num<-arrange(country_num,desc(Freq))
colnames(country_num)<-c('Country','Count')
head(country_num,10)
## Country Count
## 1 USA 2979
## 2 UK 317
## 3 France 103
## 4 Germany 81
## 5 Canada 62
## 6 Australia 40
## 7 Spain 22
## 8 Japan 17
## 9 China 13
## 10 Hong Kong 13
c<-ggplot(head(country_num,10),aes(x=reorder(factor(Country),Count),y=Count))
c+geom_bar(stat = "identity",fill="darkgreen")+coord_flip()+labs(x="Country",y="Number of movies") + ggtitle("Top 10 Most Movie Countries") + theme(plot.title = element_text(hjust = 0.5))
Movie Genres is an informative category in the dataset, which can be used to explore different metrics. Here we start by seeing the basic movie counts by genres:
genre_num<-as.data.frame(table(movie.final$genres))
genre_num<-arrange(genre_num,desc(Freq))
colnames(genre_num)<-c('Genres','Count')
head(genre_num,10)
## Genres Count
## 1 Drama 149
## 2 Comedy|Drama|Romance 148
## 3 Comedy|Drama 141
## 4 Comedy 137
## 5 Comedy|Romance 131
## 6 Drama|Romance 115
## 7 Crime|Drama|Thriller 82
## 8 Action|Crime|Thriller 55
## 9 Action|Crime|Drama|Thriller 50
## 10 Action|Adventure|Sci-Fi 46
g<-ggplot(head(genre_num,10),aes(x=reorder(factor(Genres),Count),y=Count,alpha=Count))
g+geom_bar(stat = "identity",fill="darkgreen")+coord_flip()+labs(x="Genres",y="Number of movies") + ggtitle("Top 10 Most Movie Genres") + theme(plot.title = element_text(hjust = 0.5))
Obviously we can see Drama,Comedy,Romance and the combination of these three releases the most movies.
So how’s the IMDB Score varies between the top 10 movie genres? Let’s take a look:
genre_num_top<-head(genre_num,10)
db_g<-ggplot(movie.final[movie.final$genres%in%genre_num_top$Genres,c("genres","imdb_score")],aes(factor(genres),imdb_score))
db_g+geom_boxplot(aes(fill=factor(genres)))+labs(x="Genres",y="IMDB Score") + ggtitle("IMDB Score Distribution by Genres")+theme(plot.title = element_text(hjust = 0.5),axis.title.x=element_blank(),axis.text.x=element_blank(), axis.ticks.x=element_blank())
The difference obvious, but we can still see comedy related genres have averagely lower scores, while Drama related have the highest.
Another minor point might be fun to know - is there a duration variable across different genres? Let’s take a look:
d<-ggplot(movie.final[movie.final$genres%in%genre_num_top$Genres,c("genres","duration")],aes(factor(genres),duration))
d+geom_boxplot(aes(fill=factor(genres)))+labs(x="Genres",y="duration") + ggtitle("Duration Distribution by Genres") + theme(plot.title = element_text(hjust = 0.5),axis.title.x=element_blank(),axis.text.x=element_blank(), axis.ticks.x=element_blank())
It seems Action|Adventure|Sci-Fi averagely runs longer than Comedies. So probably next when you just want a short relax, start your browse from Comedies:)
As we talked about duration, what’s the IMDB Score across different duration? Here is the clear result:
with(movie.final,tapply(imdb_score,duration2,summary))
## $short
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 5.80 6.50 6.37 7.10 8.80
##
## $medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.700 7.300 7.278 7.800 9.300
##
## $long
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.300 6.900 8.000 7.627 8.300 8.500
Duration<-ggplot(movie.final,aes(factor(duration2),imdb_score))
Duration+geom_boxplot(aes(fill=factor(duration2)))+labs(x="duration",y="IMDB Score") + ggtitle("IMDB Score by duration")+theme(plot.title = element_text(hjust = 0.5))
It seems the longer the movie, the higher the score. But is it statistically significant? Let’s examine by ANOVA test:
fit_duration<-aov(imdb_score~duration2,data=movie.final)
summary(fit_duration)
## Df Sum Sq Mean Sq F value Pr(>F)
## duration2 2 301 150.26 145.3 <2e-16 ***
## Residuals 3764 3893 1.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Clearly, the result shows the difference is significant.
Who are the Directors with most movies? And who are the directors with averagely the highest IMDB score? The result might be indicative for investors when they are consider their invest in movies:
director_num<-as.data.frame(table(movie.final$director_name))
directors_num<-arrange(director_num,desc(Freq))
colnames(directors_num)<-c('Director','Count')
head(directors_num,10)
## Director Count
## 1 Steven Spielberg 25
## 2 Clint Eastwood 19
## 3 Woody Allen 19
## 4 Ridley Scott 17
## 5 Martin Scorsese 16
## 6 Steven Soderbergh 16
## 7 Tim Burton 16
## 8 Renny Harlin 15
## 9 Spike Lee 15
## 10 Barry Levinson 13
director<-ggplot(head(directors_num,10),aes(x=reorder(factor(Director),Count),y=Count,alpha=Count))
director+geom_bar(stat = "identity",fill="darkgreen")+coord_flip()+labs(x="Directors",y="Number of movies") + ggtitle("Top 10 Most Movie Directors")+theme(plot.title = element_text(hjust = 0.5))
Steven Spielberg wins in amount! But how about the average IMDB Score? Before we do the analysis, we need to exclude directors with only one movie, as it’s not meaningful to put them into comparison:
director_num_order<-order(director_num$Freq,decreasing=T)
director_num_sorted<-director_num[director_num_order,]
dir_num_final <- director_num_sorted[director_num_sorted$Freq>=2,c("Var1","Freq")]
dire_num_subset<-movie.final[movie.final$director_name%in%dir_num_final$Var1,c("director_name","imdb_score")]
OK now let’s take a look at the IMDB Score by directors:
attach(dire_num_subset)
director_score<-as.data.frame(tapply(imdb_score,director_name,mean))
colnames(director_score)<-"mean_score"
director_score_sorted<-sort(director_score$mean_score,decreasing=T)
director_score_top <- as.data.frame(head(director_score_sorted,10))
director_score_top$director_name<-rownames(director_score_top)
colnames(director_score_top)<-c("mean_score","Director")
director_score_top
## mean_score Director
## Sergio Leone 8.433333 Sergio Leone
## Christopher Nolan 8.425000 Christopher Nolan
## Pete Docter 8.233333 Pete Docter
## Hayao Miyazaki 8.225000 Hayao Miyazaki
## George Roy Hill 8.200000 George Roy Hill
## Quentin Tarantino 8.200000 Quentin Tarantino
## Victor Fleming 8.150000 Victor Fleming
## Milos Forman 8.133333 Milos Forman
## Akira Kurosawa 8.100000 Akira Kurosawa
## David Lean 8.000000 David Lean
detach(dire_num_subset)
How about the same analysis on actors? :
actor_num<-as.data.frame(table(movie.final$actor_1_name))
actor_num<-arrange(actor_num,desc(Freq))
colnames(actor_num)<-c('Actor','Count')
head(actor_num,10)
## Actor Count
## 1 Robert De Niro 42
## 2 Johnny Depp 38
## 3 J.K. Simmons 31
## 4 Denzel Washington 30
## 5 Nicolas Cage 30
## 6 Bruce Willis 29
## 7 Matt Damon 28
## 8 Liam Neeson 26
## 9 Robert Downey Jr. 26
## 10 Harrison Ford 25
actor<-ggplot(head(actor_num,10),aes(x=reorder(factor(Actor),Count),y=Count,alpha=Count))
actor+geom_bar(stat = "identity",fill="darkgreen")+coord_flip()+labs(x="Actors",y="Number of movies") + ggtitle("Top 10 Most Movie Actors") + theme(plot.title = element_text(hjust = 0.5))
actor_num<-as.data.frame(table(movie.final$actor_1_name))
actor_num_order<-order(actor_num$Freq,decreasing=T)
actor_num_sorted<-actor_num[actor_num_order,]
act_num_final <- actor_num_sorted[actor_num_sorted$Freq>=2,c("Var1","Freq")]
act_num_subset<-movie.final[movie.final$actor_1_name%in%dir_num_final$Var1,c("actor_1_name","imdb_score")]
attach(act_num_subset)
actor_score<-as.data.frame(tapply(imdb_score,actor_1_name,mean))
colnames(actor_score)<-"mean_score"
actor_score_sorted<-sort(actor_score$mean_score,decreasing=T)
actor_score_top <- as.data.frame(head(actor_score_sorted,10))
actor_score_top$actor_name<-rownames(actor_score_top)
colnames(actor_score_top)<-c("mean_score","actor_name")
actor_score_top
## mean_score actor_name
## Mathieu Kassovitz 8.400000 Mathieu Kassovitz
## Michael Moore 7.833333 Michael Moore
## John Cameron Mitchell 7.800000 John Cameron Mitchell
## Shane Black 7.800000 Shane Black
## Quentin Tarantino 7.625000 Quentin Tarantino
## Irvin Kershner 7.600000 Irvin Kershner
## Clint Eastwood 7.441667 Clint Eastwood
## Tom Hanks 7.425000 Tom Hanks
## Kevin Spacey 7.150000 Kevin Spacey
## Denzel Washington 7.083333 Denzel Washington
detach(act_num_subset)
It seems Robert De Niro wins in amount, while Mathieu Kassovitz wins in average score.
As we have got a basic sense of the dataset, it’s time to see what are the most related variables, and build models on these relationships. Let’s start with taking a general look all the correlations:
ggcorr(movie.final,nbreaks = 6)+ggtitle("Correlation Overview")+theme(plot.title = element_text(hjust = 0.5))
The details data can be viewed using the following command:
corr_graph<-ggcorr(movie.final)
corr_data<-ggplot_build(corr_graph)
cor(movie.final[,corr_data$data[[2]]$label])
Based on the data, imdb_Score is related to num_voted_users + duration + num_of_critic_for_reviews.
By the initial check for variable normality, it seems num_voted_users is highly right skewed. We transformed the variable by taking a log of the values:
attach(movie.final)
par(mfrow=c(1,1))
qqnorm(num_voted_users)
qqline(num_voted_users)
par(mfrow=c(1,2))
hist(num_voted_users,main="num_voted_users (original)",xlab = 'num_voted_users')
hist(log(num_voted_users),main="num_voted_users (log)",xlab = 'log(num_voted_users)')
par(mfrow=c(1,1))
qqnorm(log(num_voted_users))
qqline(log(num_voted_users))
detach(movie.final)
To better understand the prediction power of the model, we first randomly split the data into 70/30 Train/validation sets, using Training set to build the model, and using validation set to validate the model:
set.seed(42)
split<-sample.split(movie.final,SplitRatio = 0.7)
movie.train<-movie.final[split,1:29]
movie.validation<-movie.final[!split,1:29]
dim(movie.final)
## [1] 3767 29
dim(movie.train)
## [1] 2598 29
dim(movie.validation)
## [1] 1169 29
So are we getting similar R squred in Prediction? Let’s caculate it:
movie.pred<-predict(movie.fit,movie.validation)
SSE <-sum((movie.pred-movie.validation$imdb_score)^2)
SST <-sum((mean(movie.final$imdb_score)-movie.validation$imdb_score)^2)
R2 <-1-SSE/SST
R2
## [1] 0.2396459
R squared is about 23%, it shows the out-of-sample prediction is less powerful than training.