========================================================

## Output is hidden
library(ggplot2)
library(dplyr)
library(GGally)
library(memisc)
library(gridExtra)
## install.packages("ellipse")
library(ellipse) ## for fancy correlation matrix

Analysis

Before we start

I want to say sorry for my bad english level. It’s my second language, so I have a lot of mistakes.

Load the Data

I decided to use IMDb database of movies to predict rating of a movie. It’s availabe in ggplot2 package so I’ll just load it. But still there are commands to load it from the file.

data(movies)
## To load from a file uncomment this:
#movies <- read.table("movies.tab", sep="\t", header=TRUE, quote="", comment="")

Summary of the Data Set

dim(movies)

## [1] 58788    24

names(movies)

##  [1] "title"       "year"        "length"      "budget"      "rating"     
##  [6] "votes"       "r1"          "r2"          "r3"          "r4"         
## [11] "r5"          "r6"          "r7"          "r8"          "r9"         
## [16] "r10"         "mpaa"        "Action"      "Animation"   "Comedy"     
## [21] "Drama"       "Documentary" "Romance"     "Short"

summary(movies)

##     title                year          length            budget         
##  Length:58788       Min.   :1893   Min.   :   1.00   Min.   :        0  
##  Class :character   1st Qu.:1958   1st Qu.:  74.00   1st Qu.:   250000  
##  Mode  :character   Median :1983   Median :  90.00   Median :  3000000  
##                     Mean   :1976   Mean   :  82.34   Mean   : 13412513  
##                     3rd Qu.:1997   3rd Qu.: 100.00   3rd Qu.: 15000000  
##                     Max.   :2005   Max.   :5220.00   Max.   :200000000  
##                                                      NA's   :53573      
##      rating           votes                r1                r2        
##  Min.   : 1.000   Min.   :     5.0   Min.   :  0.000   Min.   : 0.000  
##  1st Qu.: 5.000   1st Qu.:    11.0   1st Qu.:  0.000   1st Qu.: 0.000  
##  Median : 6.100   Median :    30.0   Median :  4.500   Median : 4.500  
##  Mean   : 5.933   Mean   :   632.1   Mean   :  7.014   Mean   : 4.022  
##  3rd Qu.: 7.000   3rd Qu.:   112.0   3rd Qu.:  4.500   3rd Qu.: 4.500  
##  Max.   :10.000   Max.   :157608.0   Max.   :100.000   Max.   :84.500  
##                                                                        
##        r3               r4                r5                r6       
##  Min.   : 0.000   Min.   :  0.000   Min.   :  0.000   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:  0.000   1st Qu.:  4.500   1st Qu.: 4.50  
##  Median : 4.500   Median :  4.500   Median :  4.500   Median :14.50  
##  Mean   : 4.721   Mean   :  6.375   Mean   :  9.797   Mean   :13.04  
##  3rd Qu.: 4.500   3rd Qu.:  4.500   3rd Qu.: 14.500   3rd Qu.:14.50  
##  Max.   :84.500   Max.   :100.000   Max.   :100.000   Max.   :84.50  
##                                                                      
##        r7               r8               r9               r10        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:  4.50   1st Qu.:  4.50   1st Qu.:  4.500   1st Qu.:  4.50  
##  Median : 14.50   Median : 14.50   Median :  4.500   Median : 14.50  
##  Mean   : 15.55   Mean   : 13.88   Mean   :  8.954   Mean   : 16.85  
##  3rd Qu.: 24.50   3rd Qu.: 24.50   3rd Qu.: 14.500   3rd Qu.: 24.50  
##  Max.   :100.00   Max.   :100.00   Max.   :100.000   Max.   :100.00  
##                                                                      
##     mpaa           Action          Animation           Comedy      
##       :53864   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  NC-17:   16   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  PG   :  528   Median :0.00000   Median :0.00000   Median :0.0000  
##  PG-13: 1003   Mean   :0.07974   Mean   :0.06277   Mean   :0.2938  
##  R    : 3377   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##                Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##                                                                    
##      Drama        Documentary         Romance           Short       
##  Min.   :0.000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.371   Mean   :0.05906   Mean   :0.0807   Mean   :0.1609  
##  3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##

head(movies)

##                      title year length budget rating votes   r1   r2  r3
## 1                        $ 1971    121     NA    6.4   348  4.5  4.5 4.5
## 2        $1000 a Touchdown 1939     71     NA    6.0    20  0.0 14.5 4.5
## 3   $21 a Day Once a Month 1941      7     NA    8.2     5  0.0  0.0 0.0
## 4                  $40,000 1996     70     NA    8.2     6 14.5  0.0 0.0
## 5 $50,000 Climax Show, The 1975     71     NA    3.4    17 24.5  4.5 0.0
## 6                    $pent 2000     91     NA    4.3    45  4.5  4.5 4.5
##     r4   r5   r6   r7   r8   r9  r10 mpaa Action Animation Comedy Drama
## 1  4.5 14.5 24.5 24.5 14.5  4.5  4.5           0         0      1     1
## 2 24.5 14.5 14.5 14.5  4.5  4.5 14.5           0         0      1     0
## 3  0.0  0.0 24.5  0.0 44.5 24.5 24.5           0         1      0     0
## 4  0.0  0.0  0.0  0.0  0.0 34.5 45.5           0         0      1     0
## 5 14.5 14.5  4.5  0.0  0.0  0.0 24.5           0         0      0     0
## 6 14.5 14.5 14.5  4.5  4.5 14.5 14.5           0         0      0     1
##   Documentary Romance Short
## 1           0       0     0
## 2           0       0     0
## 3           0       0     1
## 4           0       0     0
## 5           0       0     0
## 6           0       0     0

str(movies)

## 'data.frame':    58788 obs. of  24 variables:
##  $ title      : chr  "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
##  $ year       : int  1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
##  $ length     : int  121 71 7 70 71 91 93 25 97 61 ...
##  $ budget     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ rating     : num  6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
##  $ votes      : int  348 20 5 6 17 45 200 24 18 51 ...
##  $ r1         : num  4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
##  $ r2         : num  4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
##  $ r3         : num  4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
##  $ r4         : num  4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
##  $ r5         : num  14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
##  $ r6         : num  24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
##  $ r7         : num  24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
##  $ r8         : num  14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
##  $ r9         : num  4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
##  $ r10        : num  4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
##  $ mpaa       : Factor w/ 5 levels "","NC-17","PG",..: 1 1 1 1 1 1 5 1 1 1 ...
##  $ Action     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Animation  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Comedy     : int  1 1 0 1 0 0 0 0 0 0 ...
##  $ Drama      : int  1 0 0 0 0 1 1 0 1 0 ...
##  $ Documentary: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Short      : int  0 0 1 0 0 0 0 1 0 0 ...

Looks nice, but a lot of NA’s in budget. For the basic analysis I don’t need distribution of ratings, so I will delete r1-r10 variables with dplyr.

movies<-dplyr::select(movies,-(r1:r10)) 
## We have to specify package, because of select() fun in MASS package
head(movies)

##                      title year length budget rating votes mpaa Action
## 1                        $ 1971    121     NA    6.4   348           0
## 2        $1000 a Touchdown 1939     71     NA    6.0    20           0
## 3   $21 a Day Once a Month 1941      7     NA    8.2     5           0
## 4                  $40,000 1996     70     NA    8.2     6           0
## 5 $50,000 Climax Show, The 1975     71     NA    3.4    17           0
## 6                    $pent 2000     91     NA    4.3    45           0
##   Animation Comedy Drama Documentary Romance Short
## 1         0      1     1           0       0     0
## 2         0      1     0           0       0     0
## 3         1      0     0           0       0     1
## 4         0      1     0           0       0     0
## 5         0      0     0           0       0     0
## 6         0      0     1           0       0     0

And what about some artificial variables? Lets check how length of movie title reflects it’s rating. Also I want to create new variable - century in which movie was made.

movies<-mutate(movies,
  lenOfTitle=nchar(as.character(title)), ## Because title was a factor
  century=trunc(year/100))

Let’s take a final look, before we start our analysis:

str(movies)

## 'data.frame':    58788 obs. of  16 variables:
##  $ title      : chr  "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
##  $ year       : int  1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
##  $ length     : int  121 71 7 70 71 91 93 25 97 61 ...
##  $ budget     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ rating     : num  6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
##  $ votes      : int  348 20 5 6 17 45 200 24 18 51 ...
##  $ mpaa       : Factor w/ 5 levels "","NC-17","PG",..: 1 1 1 1 1 1 5 1 1 1 ...
##  $ Action     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Animation  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Comedy     : int  1 1 0 1 0 0 0 0 0 0 ...
##  $ Drama      : int  1 0 0 0 0 1 1 0 1 0 ...
##  $ Documentary: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Short      : int  0 0 1 0 0 0 0 1 0 0 ...
##  $ lenOfTitle : int  1 17 22 7 24 5 7 4 3 7 ...
##  $ century    : num  19 19 19 19 19 20 20 20 19 19 ...

summary(movies)

##     title                year          length            budget         
##  Length:58788       Min.   :1893   Min.   :   1.00   Min.   :        0  
##  Class :character   1st Qu.:1958   1st Qu.:  74.00   1st Qu.:   250000  
##  Mode  :character   Median :1983   Median :  90.00   Median :  3000000  
##                     Mean   :1976   Mean   :  82.34   Mean   : 13412513  
##                     3rd Qu.:1997   3rd Qu.: 100.00   3rd Qu.: 15000000  
##                     Max.   :2005   Max.   :5220.00   Max.   :200000000  
##                                                      NA's   :53573      
##      rating           votes             mpaa           Action       
##  Min.   : 1.000   Min.   :     5.0        :53864   Min.   :0.00000  
##  1st Qu.: 5.000   1st Qu.:    11.0   NC-17:   16   1st Qu.:0.00000  
##  Median : 6.100   Median :    30.0   PG   :  528   Median :0.00000  
##  Mean   : 5.933   Mean   :   632.1   PG-13: 1003   Mean   :0.07974  
##  3rd Qu.: 7.000   3rd Qu.:   112.0   R    : 3377   3rd Qu.:0.00000  
##  Max.   :10.000   Max.   :157608.0                 Max.   :1.00000  
##                                                                     
##    Animation           Comedy           Drama        Documentary     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000   Median :0.000   Median :0.00000  
##  Mean   :0.06277   Mean   :0.2938   Mean   :0.371   Mean   :0.05906  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.000   Max.   :1.00000  
##                                                                      
##     Romance           Short          lenOfTitle        century     
##  Min.   :0.0000   Min.   :0.0000   Min.   :  1.00   Min.   :18.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 11.00   1st Qu.:19.00  
##  Median :0.0000   Median :0.0000   Median : 15.00   Median :19.00  
##  Mean   :0.0807   Mean   :0.1609   Mean   : 16.58   Mean   :19.18  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.: 20.00   3rd Qu.:19.00  
##  Max.   :1.0000   Max.   :1.0000   Max.   :121.00   Max.   :20.00  
##

Individual variables

I will start from simple histograms of rating, year, length, votes and length of title (comments are after 4 plots):

ggplot(data=movies,aes(x = rating))+
  geom_histogram(binwidth=0.5)

ggplot(data=movies,aes(x = year))+
  geom_histogram(binwidth=1)

ggplot(data=movies,aes(x = length))+
  geom_histogram(binwidth=3)+
  xlim(c(0,quantile(movies$length,0.99)))

ggplot(data=movies,aes(x = votes))+
  geom_histogram(binwidth=15)+
  xlim(c(0,quantile(movies$votes,0.9))) ## Votes have VERY long tail

ggplot(data=movies,aes(x = lenOfTitle))+
  geom_histogram(binwidth=1)+
  xlim(c(0,quantile(movies$lenOfTitle,0.95)))

Conclusions:

Ratings distribution looks normal.
Interesting increasing patter on years distribution. We’ll investigate it later.
There are two peaks in lengt-of-movie distribution. Obviosly for Short films and for standart movies.
Distribution of votes is not intreasting - quasi-hyperbolic distribution.
Distribution of title lengths is single-peak-distribution.

Investigation of movies vs year histogram.

We can see huge increase in ~1930, and constant increase till now with few drops in it. Let’s try to explain such patter with real historical events correlated with movie industry.

What we can test first? Obviosly sound films! According to wikipedia first sound film was made in 1926. Let’s add vertical line to our plot.

ggplot(data=movies,aes(x=year))+
  geom_histogram(binwidth=1)+
  geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2) ## Sound films

Yep, looks close. Maybe something else? Colored movies? Mass production started in late 1920-s. Let’s try

ggplot(data=movies,aes(x=year))+
  geom_histogram(binwidth=1)+
  geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2)+ ## Sound films
  geom_vline(xintercept=1928,col=6,lwd=3,alpha=1/2) ## Colored films

And that line fits perfect. So we can say that sound and colors bring a lot to movoe industry. How much?

sum(movies$year==1932)/sum(movies$year==1928)

## [1] 3.779817

1932-1928

## [1] 4

It means that number of movies increased at 278% in just 4 years. Amazing. We need to understand that world population increaset only at 4.5% in these 4 years.

What’s next? Let’s check some great historical events. What about World War II from 1939 to 1945?

ggplot(data=movies,aes(x=year))+
  geom_histogram(binwidth=1)+
  geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2)+ ## Sound films
  geom_vline(xintercept=1928,col=6,lwd=3,alpha=1/2)+ ## Colored films
  geom_vline(xintercept=1943,col=2,lwd=4,alpha=1/2) ## World War II

Yes, WWII significantly decreased movie production. And finally: Black Monday

ggplot(data=movies,aes(x=year))+
  geom_histogram(binwidth=1)+
  geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2)+ ## Sound films
  geom_vline(xintercept=1928,col=6,lwd=3,alpha=1/2)+ ## Colored films
  geom_vline(xintercept=1943,col=2,lwd=4,alpha=1/2)+ ## World War II
  geom_vline(xintercept=1987,col=1,lwd=1,alpha=1) ## Black Monday

Looks great. We can also say that something happened in about 1960, but I do not know what exactly schoked movie industry. I’ll add labels in the Final plots part. And let’s continue with two variables analysis.

Relationships between two variables

First of all, let’s take a look at our correlation matrix with nice ellipse graph. But we need to drop “title” and “mpaa” variables, because they are not numerical. Also I’ll exclude budget, because there are only 5000 observations, and it’s incorrect to compare correlation coefficients for diffrent lengths vectors.

movies.num<-dplyr::select(movies,-c(title,mpaa,budget)) 
corMatrix<-cor(movies.num)
## Vector of colors, red for positive, blue for negative
cols <- ifelse(corMatrix>0,
  rgb(abs(corMatrix)^(1/2),0,0), ## Shades of red
  rgb(0,0,abs(corMatrix)^(1/2))) ## Shades of red
## I have used ^(1/2) to make colors more bright
plotcorr(corMatrix,col=cols,mar=c(0,0,0,0))

And here we get some insights. Few things can be noted:

length seems correlated with year and Drama.
Animation negatively correlated with length. A lot of of cartoons are short, so it looks like a truth.
Length of title seems to be higher for documentary movies. Also looks normal, I’ll not call a documentary movie “Interstellar”, but prefer to choose something long like “Cosmos: Space-time odyssey”
Rating seems to be positively correlated with Animation and Short, and negatively with year.

Also I want to get some ideas from GGally packege (I’ll make this plot big to see something):

set.seed(1984)
## Let's explain that Genre and cetury are factors
movies.factored<-movies
movies.factored[,8:14]<-lapply(movies[,8:14],factor)
samp<-movies.factored[sample(1:nrow(movies), 0.05*nrow(movies)),2:15] ## 5% sample
## My computer is not that fast to use 10%+ samples.
## I also excluded title and century from the analysis
ggpairs(samp, params = c(shape = I("."), outlier.shape = I(".")),axisLabels = "none")

Looks promissing. Let’s continue with insights from correlation matrix.

ggplot(data=movies,aes(x=year,y=length))+
  geom_point(alpha=1/15,position="jitter")+
  ylim(c(0,quantile(movies$length,.99)))

And that is the most interesting graph so far. Insights:

Defenetly here are some clusters. Very short movies (<10 min) in 1890-1910 for example.
Two clusters of short films: about 10 and about 20 minutes long produced in 1930-1970. And that’s quite strange clusters. It’s not stripes, so data is correct.
There almost weren’t short films in 1970-2000 years. Why did everybody decide to create a full 90 minutes long movies? I don’t know.
Wide cluster of Short films in 2000-2010. God bless, Short-film-makers came back from 30 years of rest.
In the huge cluster of full (not short) movies there is an upward trend.
And we still can see white beam related to WW II.
There are also stripes for modern movies and we can’t know: is it data imperfection, or moviemakers tries to creat round-number-of-minutes long movies.

Yes, that graph was insghtfull. I’ll certanly investigate it further in multivariable section.

ggplot(data=movies,
  aes(x=as.factor(Animation),y=length,fill=as.factor(Animation)))+
  geom_boxplot()+ylim(c(0,60*5))

## Considering only movies not longer than 5 hrs.
## Movie length median for Animation
median(movies$length[movies$Animation==1])

## [1] 7

## Movie length median for not-Animation
median(movies$length[movies$Animation==0])

## [1] 91

They are signifanctly different. As expected. Let’s check idea about title length of documentary films:

ggplot(data=movies,
  aes(x=as.factor(Documentary),y=lenOfTitle,fill=as.factor(Documentary)))+
  geom_boxplot()

Difference is not so significant, let’s check it with staistical tests:

t.test(lenOfTitle~Documentary,data=movies) ## Standart t-test

## 
##  Welch Two Sample t-test
## 
## data:  lenOfTitle by Documentary
## t = -24.8437, df = 3647.379, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.746041 -4.905447
## sample estimates:
## mean in group 0 mean in group 1 
##        16.26383        21.58957

wilcox.test(lenOfTitle~Documentary,data=movies) ## Nonparametrical analog

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  lenOfTitle by Documentary
## W = 72073436, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

And the both says that there is statistically significant difference between these groups. What about decrease in rating with time?

Let’s check these ideas with some nice plots:

ggplot(data=movies,aes(x=year,y=rating))+
  geom_point(alpha=1/20,position="jitter",scale=0.8)+ ## Fighting overfitting
  geom_line(stat="summary",fun.y=median,col=2,linetype=2) ## Adding median line

Okay, it’s downward trend in the huge time range from 1930 till 2000, but strong upward moving in the last few years. I can imagine such explanation: who does watch movies made in 1940-1980? Only fans and critics, but not mass audience. And we can expect high ratings for such classic movies. And older means more classic, and more classic means higher rating. But in early 2000-s began a massive movie production targeted exectly at mass audience. And it’s expected that such bestselelrs got high ratings from average people (low from critics, but who cares - there are few critics per thousad IMDb visitors).

So far so good. What about rating?

plot1<-ggplot(
  data=movies,
  aes(x=as.factor(Animation),y=rating,fill=as.factor(Animation)))+
  geom_boxplot()
plot2<-ggplot(
  data=movies,
  aes(x=as.factor(Short),y=rating,fill=as.factor(Short)))+
  geom_boxplot()
grid.arrange(plot1,plot2,ncol=2)

Looks like Short and Animated movies get higher ratings. Let’s check conclusion about short films:

ggplot(data=movies,aes(x=length,y=rating,color=as.factor(century)))+
  geom_point(position="jitter",alpha=1/5)+
  xlim(c(0,quantile(movies$length,0.99)))

It’s difficult to male any conclusions here. I am going to continue with three variables (like on previous plot)

Three or more variables

Let’s back to our length~year graph and bring some colors.

ggplot(data=movies,aes(x=year,y=length,color=as.factor(Animation)))+
    geom_point(alpha=1/10,position="jitter")+
      ylim(c(0,quantile(movies$length,.99)))

Whoa! One of our clusters is animated movies. Let’s remember some cartoons from 1930… Yep, that’s famous Disney’s Mickey! We are gooing to the Mickey Mouse movie list and find out that almost all of them are 7-8 minutes long! So, riddle of one cluster is solved! Thank you, Mickey (and other Disnaey’s characters).

text

But our aim is to analyze movie ratings, so let’s back to our rating~something graphs:

ggplot(data=movies,aes(x=year,y=rating,color=as.factor(mpaa)))+
  geom_point(alpha=1/10,position="jitter")

We could expect some correlation betwee MPAA rating and IMDb rating, but nothing interesing here. Let’s move on.

ggplot(data=movies,aes(x=year,y=rating,color=as.factor(Comedy)))+
  geom_point(alpha=1/10,position="jitter")

Here we can notice blue point after ~1925. It’s known effect of Great Depression. Everything was bad, so everybody wanted for comedies. I’ll mkae more precise plot in final part.

Predictions

It’s time to make some predictions now:

model1<-lm(data=movies,rating~length)
summary(model1)

## 
## Call:
## lm(formula = rating ~ length, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0182 -0.9268  0.1549  1.0786  4.0120 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.0214706  0.0135013 445.991  < 2e-16 ***
## length      -0.0010763  0.0001444  -7.455 9.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.552 on 58786 degrees of freedom
## Multiple R-squared:  0.0009446,  Adjusted R-squared:  0.0009276 
## F-statistic: 55.58 on 1 and 58786 DF,  p-value: 9.087e-14

Okay, R squared is tiny, but explanatory variable is significant. I’ll just add more variables:

model2<-update(model1,.~
  +year
  +budget
  +length
  +lenOfTitle
  +Action
  +Animation
  +Comedy
  +Drama
  +Documentary
  +Romance
  +Short
  +mpaa
  )
summary(model2)

## 
## Call:
## lm(formula = rating ~ year + budget + length + lenOfTitle + Action + 
##     Animation + Comedy + Drama + Documentary + Romance + Short + 
##     mpaa, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5655 -0.8097  0.0960  0.9050  4.8420 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.777e+01  2.119e+00   8.386  < 2e-16 ***
## year        -6.903e-03  1.070e-03  -6.453 1.20e-10 ***
## budget       1.317e-09  1.093e-09   1.206 0.228052    
## length       1.573e-02  9.028e-04  17.423  < 2e-16 ***
## lenOfTitle  -3.463e-03  2.356e-03  -1.470 0.141755    
## Action      -1.992e-01  5.686e-02  -3.502 0.000465 ***
## Animation    8.776e-01  1.301e-01   6.744 1.71e-11 ***
## Comedy       2.683e-01  4.456e-02   6.022 1.84e-09 ***
## Drama        6.683e-01  4.299e-02  15.544  < 2e-16 ***
## Documentary  1.279e+00  1.286e-01   9.941  < 2e-16 ***
## Romance      1.088e-01  5.685e-02   1.914 0.055653 .  
## Short        2.732e+00  1.017e-01  26.854  < 2e-16 ***
## mpaaNC-17   -3.078e-01  5.246e-01  -0.587 0.557398    
## mpaaPG      -2.681e-01  1.062e-01  -2.524 0.011647 *  
## mpaaPG-13   -2.198e-01  7.770e-02  -2.829 0.004688 ** 
## mpaaR       -1.122e-01  5.664e-02  -1.981 0.047640 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.385 on 5199 degrees of freedom
##   (53573 observations deleted due to missingness)
## Multiple R-squared:  0.2011, Adjusted R-squared:  0.1988 
## F-statistic: 87.25 on 15 and 5199 DF,  p-value: < 2.2e-16

Adjusted R squred is now almost 20%. It means that 20% of variance in rating could be explained by these simple variables. I think, this is great result, because rating is human decision based data, it’s always difficult to predict. Intresting that budget, length of title and romance genre are not significant. Based on this model I can creat The Best Movie EVER.

The Best Movie EVER production

We’ll just need:

Movie should be long: length coefficient is positive (=1.573e-02). Let’s make it 15 hrs long.
(coef = -0.0035) With a short title. Maybe just “DA” (stand for Data Analysis).
(coef = -0.2) No action at all. Everybody just sitting and talking whole 15 hrs.
(coef = 0.89) It should be a cartoon, Animation coefficient is positive. Now it became funny :)
(coef = 0.27) Comedy! 15 hrs of jokes!
(coef = 0.67) But Drama too. So, 10 hrs of jokes, 5 hrs of sad stories.
(coef = 1.28) And it should be somehow documentary.
(coef = 0.11) … with a love story.
(coef = 2.73) And it should be short. Wait. We’ll make two movies. 35 minutes Short film and a Directors’cut 15 hrs long film. (we’ll call them “DA” and “DA.Dc”)
(coef = 1.317e-09) Little unsignificant coefficient with budjet. But we are still gonna use it. Our budjet is 3$.
(coef = -3.078e-01 ) And there should not be any MPAA restrictions.

Let’s predict it’s rating:

BestMovieEVER<-data.frame(
  title=c("DA","DA.Dc"),
  year=2015,
  length=c(35,15*60),
  budget=3,
  mpaa="",
  Action=0,
  Animation=1,
  Comedy=1,
  Drama=1,
  Documentary=1,
  Romance=1,
  Short=c(1,0),
  lenOfTitle=c(2,5)
)
BestMovieEVER

##   title year length budget mpaa Action Animation Comedy Drama Documentary
## 1    DA 2015     35      3           0         1      1     1           1
## 2 DA.Dc 2015    900      3           0         1      1     1           1
##   Romance Short lenOfTitle
## 1       1     1          2
## 2       1     0          5

predict(model2,newdata = BestMovieEVER,interval="predict",level=0.9)

##        fit       lwr      upr
## 1 10.33558  8.031507 12.63965
## 2 21.19827 18.597548 23.79900

Just great! Both our movies about Data Analysis will get high ratings on IMDb. Short film will get 10/10 (with 90%+ probability it’ll get 8+/10)… And our director’s cut will get 21/10. That’s awesome! (even in worst ~5% it will get 18/10). They shoud shange IMDb rating system for such epic movie.

I thinks that is awesome insight for moviemaking industry. They shoud make a comedy-romance-drama-documentary 15 hrs-long cartoon about data analysis. But 15 hrs is too long even for Lars von Trier. :)

Final Plots and Summary

Plot One

ggplot(data=movies,aes(x=year))+
  geom_histogram(binwidth=1,fill="darkgrey",color="black")+
  geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2)+ ## Sound films
  geom_vline(xintercept=1928,col=6,lwd=3,alpha=1/2)+ ## Colored films
  geom_vline(xintercept=1943,col=2,lwd=4,alpha=1/2)+ ## World War II
  geom_vline(xintercept=1987,col=1,lwd=1,alpha=1)+ ## Black Monday
  annotate("text",x=1926-7,y=1300,label="Sound\nfilms",color=3)+ ## labels
  annotate("text",x=1928+7,y=1500,label="Color\nfilms",color=6)+
  annotate("text",x=1943+8,y=1300,label="WW II",color=2)+
  annotate("text",x=1987-9,y=1300,label="Black\nMonday",color=1)+
  ggtitle("Historical events impact on movie production")+ ## Title
  scale_y_continuous(name="Movies produced in year")+ ## y axis lable
  scale_x_continuous(name="Year") ## x axis lable

Description One

I think that this graph is extrmely interesting because we can see how real economical and political historical events impacted on movie industry. We have found impact of sound and colored movies inventions, second World Ward and US economic crysis started with Black Monday. But for sure we can see that somethig happend in 60-s. Maybe the Beatles decreased movie production? Maybe Cold War between USA and USSR? Or Yuri Gagarin and Niel Armstrong space achivements? It’s difficult to say, but I suppose Cold War as the most important factor here.

Plot Two

ggplot(data=movies,aes(x=year,y=length,color=as.factor(Animation)))+
  geom_point(alpha=1/12,position="jitter")+
  ggtitle("Length of animated and non-animated movies in time")+
  scale_color_manual( ## Nice legend
    values=c("#999999", "#E69F00"), ## Cartoons should be bright, othe  - grey
    name="Animated movie?",
    labels=c("No", "Yes"))+
  scale_y_continuous(
    name="Length of a movie",
    limits=c(0,quantile(movies$length,.99)))+
  scale_x_continuous(
    name="Year")

Description Two

As I mentioned above, we can explain one of these strange clusters with only one company - Disney. Also we can see a lot of different interesting anomalies: my favorite is the lack of short films between 1950 and 1990.

Plot Three

ggplot(data=movies,aes(x=year,y=rating))+
    geom_point(position="jitter",alpha=1/20)+
    ## Adding medians for Comedies and non-Comedies
    geom_line(stat="summary",fun.y=median,aes(color=as.factor(Comedy)),lwd=1.5)+
    ## Adding annotations, because legend is too mainstream
    annotate("text",x=1950,y=6.75,label="Comedy",
        col="#EE4000",fontface="bold")+
    annotate("text",x=1955,y=5.75,label="non-Comedy",
        col="#008B00",fontface="bold")+
    ## Adding some historical events with descriptions
    geom_vline(xintercept=1945,col=2)+
    annotate("text",x=1945+5,y=4.5,label="WW II",
        col=2,fontface="bold")+
    geom_vline(xintercept=1934,col="darkgreen",lwd=12,alpha=1/2)+
    annotate("text",x=1934-8,y=4.5,
        label="Great\nDepression",col="darkgreen",fontface="bold")+
    ## Nice title of course
    ggtitle("How do people liked comedies over the time?")+
    ## Scaling and axis lables
    scale_y_continuous(
        name="Movie rating",
        limits=c(4,8))+
    scale_x_continuous(
        name="Year",
        limits=c(1920,1980))+
    ## Removing the legend
    theme(legend.position="none")+
    ## Nice colors
    scale_color_manual( values=c("#008B00", "#EE4000"))

Description Two

Here I’ve plotted median of comedies rating over time, to see how to interest in this genre have changed. Points are not so important here, so I’ve made them very transparent. We can find here interesing historical pattern: after great depression comedies became prefered for a long time, especially after WW II: rating of average non-comedy droped down, rating of an average comedy remained the same (despite of overall downward trend). Interesting that in time of Cold War sitatuon was opposite: there were a lot of bad comedies, but non-comedies films were great. We’ll talk about it in reflection part.

Reflection

I have made some interesting conclusions about movie industry and it’s history:

There were some significant breakthroughs in movie making industry: when sound and colors came to movies. In 4 years industry grown up in almost 280%.
There were also some shocks like WW II, Cold War and Black Monday.
We can see some certain clusters on length~year chart. It means there there are some types of movies, and we can see how these types were changing over the time. Also one of these clusters corresponds to Disney company.
Political and economical events impact on audience mood. After the great depression and second World War there were significant increse in comidies quality. But during the Cold war situation were opposie: everybody wanted for some serious movies, because of stressful situation. I think that it’s good explanation, because 1945-1950 weren’t stressfull: Nazism was defeated, everything was destroyd (not in the US of course), but everything was Ok. So, if you live in a stressfull time it’s not good idea to create a comedy.
We have created a model for a super movie. Little bit impossible, but it’s rating will be over 10/10!
My personal hypotethis that length of movie title impacts it’s rating wasn’t pproved. My logic was: It has long title=>It’s difficult to share=>Low rating. Really, coefficient in our model is equal to -3.463e-03, so you will loose 0.35 rating points per every 100 letters in the title. But coefficient is unsignificant.

It was extremely interesting, not too hard and very insightful analysis. I think it could be enriched with further analysis in both main directions: prediction of ratings and looking at impact of historical events. I suppose using new variables could enhance prediction model (maybe some crazy variables such as length of a creenplay in charaters, size of movie cast, etc.)

References

Wikipedia
IMDb database

WolframAlpha
Book (in russian): “Наглядная статистика. Используем R!”, А. Б. Шипунов, Е. М. Балдин, П. А. Волкова, А. И. Коробейников, С. А. Назарова, С. В. Петров, В. Г. Суфиянов
Snapshot from Steamboat Willie, Disney.

IMDb exploratory data analysis project

Ilya Ezepov (Илья Езепов)

5 Jan 2015

Analysis

Before we start

Load the Data

Summary of the Data Set

Individual variables

Investigation of movies vs year histogram.

Relationships between two variables

Three or more variables

Predictions

The Best Movie EVER production

Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Two

Reflection

References