========================================================
## Output is hidden
library(ggplot2)
library(dplyr)
library(GGally)
library(memisc)
library(gridExtra)
## install.packages("ellipse")
library(ellipse) ## for fancy correlation matrix
I want to say sorry for my bad english level. It’s my second language, so I have a lot of mistakes.
I decided to use IMDb database of movies to predict rating of a movie. It’s availabe in ggplot2 package so I’ll just load it. But still there are commands to load it from the file.
data(movies)
## To load from a file uncomment this:
#movies <- read.table("movies.tab", sep="\t", header=TRUE, quote="", comment="")
dim(movies)
## [1] 58788 24
names(movies)
## [1] "title" "year" "length" "budget" "rating"
## [6] "votes" "r1" "r2" "r3" "r4"
## [11] "r5" "r6" "r7" "r8" "r9"
## [16] "r10" "mpaa" "Action" "Animation" "Comedy"
## [21] "Drama" "Documentary" "Romance" "Short"
summary(movies)
## title year length budget
## Length:58788 Min. :1893 Min. : 1.00 Min. : 0
## Class :character 1st Qu.:1958 1st Qu.: 74.00 1st Qu.: 250000
## Mode :character Median :1983 Median : 90.00 Median : 3000000
## Mean :1976 Mean : 82.34 Mean : 13412513
## 3rd Qu.:1997 3rd Qu.: 100.00 3rd Qu.: 15000000
## Max. :2005 Max. :5220.00 Max. :200000000
## NA's :53573
## rating votes r1 r2
## Min. : 1.000 Min. : 5.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 5.000 1st Qu.: 11.0 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 6.100 Median : 30.0 Median : 4.500 Median : 4.500
## Mean : 5.933 Mean : 632.1 Mean : 7.014 Mean : 4.022
## 3rd Qu.: 7.000 3rd Qu.: 112.0 3rd Qu.: 4.500 3rd Qu.: 4.500
## Max. :10.000 Max. :157608.0 Max. :100.000 Max. :84.500
##
## r3 r4 r5 r6
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 4.500 1st Qu.: 4.50
## Median : 4.500 Median : 4.500 Median : 4.500 Median :14.50
## Mean : 4.721 Mean : 6.375 Mean : 9.797 Mean :13.04
## 3rd Qu.: 4.500 3rd Qu.: 4.500 3rd Qu.: 14.500 3rd Qu.:14.50
## Max. :84.500 Max. :100.000 Max. :100.000 Max. :84.50
##
## r7 r8 r9 r10
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 4.50 1st Qu.: 4.50 1st Qu.: 4.500 1st Qu.: 4.50
## Median : 14.50 Median : 14.50 Median : 4.500 Median : 14.50
## Mean : 15.55 Mean : 13.88 Mean : 8.954 Mean : 16.85
## 3rd Qu.: 24.50 3rd Qu.: 24.50 3rd Qu.: 14.500 3rd Qu.: 24.50
## Max. :100.00 Max. :100.00 Max. :100.000 Max. :100.00
##
## mpaa Action Animation Comedy
## :53864 Min. :0.00000 Min. :0.00000 Min. :0.0000
## NC-17: 16 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## PG : 528 Median :0.00000 Median :0.00000 Median :0.0000
## PG-13: 1003 Mean :0.07974 Mean :0.06277 Mean :0.2938
## R : 3377 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## Drama Documentary Romance Short
## Min. :0.000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.371 Mean :0.05906 Mean :0.0807 Mean :0.1609
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.00000 Max. :1.0000 Max. :1.0000
##
head(movies)
## title year length budget rating votes r1 r2 r3
## 1 $ 1971 121 NA 6.4 348 4.5 4.5 4.5
## 2 $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5
## 3 $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0
## 4 $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0
## 5 $50,000 Climax Show, The 1975 71 NA 3.4 17 24.5 4.5 0.0
## 6 $pent 2000 91 NA 4.3 45 4.5 4.5 4.5
## r4 r5 r6 r7 r8 r9 r10 mpaa Action Animation Comedy Drama
## 1 4.5 14.5 24.5 24.5 14.5 4.5 4.5 0 0 1 1
## 2 24.5 14.5 14.5 14.5 4.5 4.5 14.5 0 0 1 0
## 3 0.0 0.0 24.5 0.0 44.5 24.5 24.5 0 1 0 0
## 4 0.0 0.0 0.0 0.0 0.0 34.5 45.5 0 0 1 0
## 5 14.5 14.5 4.5 0.0 0.0 0.0 24.5 0 0 0 0
## 6 14.5 14.5 14.5 4.5 4.5 14.5 14.5 0 0 0 1
## Documentary Romance Short
## 1 0 0 0
## 2 0 0 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
str(movies)
## 'data.frame': 58788 obs. of 24 variables:
## $ title : chr "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
## $ year : int 1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
## $ length : int 121 71 7 70 71 91 93 25 97 61 ...
## $ budget : int NA NA NA NA NA NA NA NA NA NA ...
## $ rating : num 6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
## $ votes : int 348 20 5 6 17 45 200 24 18 51 ...
## $ r1 : num 4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
## $ r2 : num 4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
## $ r3 : num 4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
## $ r4 : num 4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
## $ r5 : num 14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
## $ r6 : num 24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
## $ r7 : num 24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
## $ r8 : num 14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
## $ r9 : num 4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
## $ r10 : num 4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
## $ mpaa : Factor w/ 5 levels "","NC-17","PG",..: 1 1 1 1 1 1 5 1 1 1 ...
## $ Action : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Animation : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Comedy : int 1 1 0 1 0 0 0 0 0 0 ...
## $ Drama : int 1 0 0 0 0 1 1 0 1 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 1 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Short : int 0 0 1 0 0 0 0 1 0 0 ...
Looks nice, but a lot of NA’s in budget. For the basic analysis I don’t need distribution of ratings, so I will delete r1-r10 variables with dplyr.
movies<-dplyr::select(movies,-(r1:r10))
## We have to specify package, because of select() fun in MASS package
head(movies)
## title year length budget rating votes mpaa Action
## 1 $ 1971 121 NA 6.4 348 0
## 2 $1000 a Touchdown 1939 71 NA 6.0 20 0
## 3 $21 a Day Once a Month 1941 7 NA 8.2 5 0
## 4 $40,000 1996 70 NA 8.2 6 0
## 5 $50,000 Climax Show, The 1975 71 NA 3.4 17 0
## 6 $pent 2000 91 NA 4.3 45 0
## Animation Comedy Drama Documentary Romance Short
## 1 0 1 1 0 0 0
## 2 0 1 0 0 0 0
## 3 1 0 0 0 0 1
## 4 0 1 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 1 0 0 0
And what about some artificial variables? Lets check how length of movie title reflects it’s rating. Also I want to create new variable - century in which movie was made.
movies<-mutate(movies,
lenOfTitle=nchar(as.character(title)), ## Because title was a factor
century=trunc(year/100))
Let’s take a final look, before we start our analysis:
str(movies)
## 'data.frame': 58788 obs. of 16 variables:
## $ title : chr "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
## $ year : int 1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
## $ length : int 121 71 7 70 71 91 93 25 97 61 ...
## $ budget : int NA NA NA NA NA NA NA NA NA NA ...
## $ rating : num 6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
## $ votes : int 348 20 5 6 17 45 200 24 18 51 ...
## $ mpaa : Factor w/ 5 levels "","NC-17","PG",..: 1 1 1 1 1 1 5 1 1 1 ...
## $ Action : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Animation : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Comedy : int 1 1 0 1 0 0 0 0 0 0 ...
## $ Drama : int 1 0 0 0 0 1 1 0 1 0 ...
## $ Documentary: int 0 0 0 0 0 0 0 1 0 0 ...
## $ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Short : int 0 0 1 0 0 0 0 1 0 0 ...
## $ lenOfTitle : int 1 17 22 7 24 5 7 4 3 7 ...
## $ century : num 19 19 19 19 19 20 20 20 19 19 ...
summary(movies)
## title year length budget
## Length:58788 Min. :1893 Min. : 1.00 Min. : 0
## Class :character 1st Qu.:1958 1st Qu.: 74.00 1st Qu.: 250000
## Mode :character Median :1983 Median : 90.00 Median : 3000000
## Mean :1976 Mean : 82.34 Mean : 13412513
## 3rd Qu.:1997 3rd Qu.: 100.00 3rd Qu.: 15000000
## Max. :2005 Max. :5220.00 Max. :200000000
## NA's :53573
## rating votes mpaa Action
## Min. : 1.000 Min. : 5.0 :53864 Min. :0.00000
## 1st Qu.: 5.000 1st Qu.: 11.0 NC-17: 16 1st Qu.:0.00000
## Median : 6.100 Median : 30.0 PG : 528 Median :0.00000
## Mean : 5.933 Mean : 632.1 PG-13: 1003 Mean :0.07974
## 3rd Qu.: 7.000 3rd Qu.: 112.0 R : 3377 3rd Qu.:0.00000
## Max. :10.000 Max. :157608.0 Max. :1.00000
##
## Animation Comedy Drama Documentary
## Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.000 Median :0.00000
## Mean :0.06277 Mean :0.2938 Mean :0.371 Mean :0.05906
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.000 Max. :1.00000
##
## Romance Short lenOfTitle century
## Min. :0.0000 Min. :0.0000 Min. : 1.00 Min. :18.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 11.00 1st Qu.:19.00
## Median :0.0000 Median :0.0000 Median : 15.00 Median :19.00
## Mean :0.0807 Mean :0.1609 Mean : 16.58 Mean :19.18
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.: 20.00 3rd Qu.:19.00
## Max. :1.0000 Max. :1.0000 Max. :121.00 Max. :20.00
##
I will start from simple histograms of rating, year, length, votes and length of title (comments are after 4 plots):
ggplot(data=movies,aes(x = rating))+
geom_histogram(binwidth=0.5)
ggplot(data=movies,aes(x = year))+
geom_histogram(binwidth=1)
ggplot(data=movies,aes(x = length))+
geom_histogram(binwidth=3)+
xlim(c(0,quantile(movies$length,0.99)))
ggplot(data=movies,aes(x = votes))+
geom_histogram(binwidth=15)+
xlim(c(0,quantile(movies$votes,0.9))) ## Votes have VERY long tail
ggplot(data=movies,aes(x = lenOfTitle))+
geom_histogram(binwidth=1)+
xlim(c(0,quantile(movies$lenOfTitle,0.95)))
Conclusions:
We can see huge increase in ~1930, and constant increase till now with few drops in it. Let’s try to explain such patter with real historical events correlated with movie industry.
What we can test first? Obviosly sound films! According to wikipedia first sound film was made in 1926. Let’s add vertical line to our plot.
ggplot(data=movies,aes(x=year))+
geom_histogram(binwidth=1)+
geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2) ## Sound films
Yep, looks close. Maybe something else? Colored movies? Mass production started in late 1920-s. Let’s try
ggplot(data=movies,aes(x=year))+
geom_histogram(binwidth=1)+
geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2)+ ## Sound films
geom_vline(xintercept=1928,col=6,lwd=3,alpha=1/2) ## Colored films
And that line fits perfect. So we can say that sound and colors bring a lot to movoe industry. How much?
sum(movies$year==1932)/sum(movies$year==1928)
## [1] 3.779817
1932-1928
## [1] 4
It means that number of movies increased at 278% in just 4 years. Amazing. We need to understand that world population increaset only at 4.5% in these 4 years.
What’s next? Let’s check some great historical events. What about World War II from 1939 to 1945?
ggplot(data=movies,aes(x=year))+
geom_histogram(binwidth=1)+
geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2)+ ## Sound films
geom_vline(xintercept=1928,col=6,lwd=3,alpha=1/2)+ ## Colored films
geom_vline(xintercept=1943,col=2,lwd=4,alpha=1/2) ## World War II
Yes, WWII significantly decreased movie production. And finally: Black Monday
ggplot(data=movies,aes(x=year))+
geom_histogram(binwidth=1)+
geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2)+ ## Sound films
geom_vline(xintercept=1928,col=6,lwd=3,alpha=1/2)+ ## Colored films
geom_vline(xintercept=1943,col=2,lwd=4,alpha=1/2)+ ## World War II
geom_vline(xintercept=1987,col=1,lwd=1,alpha=1) ## Black Monday
Looks great. We can also say that something happened in about 1960, but I do not know what exactly schoked movie industry. I’ll add labels in the Final plots part. And let’s continue with two variables analysis.
First of all, let’s take a look at our correlation matrix with nice ellipse graph. But we need to drop “title” and “mpaa” variables, because they are not numerical. Also I’ll exclude budget, because there are only 5000 observations, and it’s incorrect to compare correlation coefficients for diffrent lengths vectors.
movies.num<-dplyr::select(movies,-c(title,mpaa,budget))
corMatrix<-cor(movies.num)
## Vector of colors, red for positive, blue for negative
cols <- ifelse(corMatrix>0,
rgb(abs(corMatrix)^(1/2),0,0), ## Shades of red
rgb(0,0,abs(corMatrix)^(1/2))) ## Shades of red
## I have used ^(1/2) to make colors more bright
plotcorr(corMatrix,col=cols,mar=c(0,0,0,0))
And here we get some insights. Few things can be noted:
Also I want to get some ideas from GGally packege (I’ll make this plot big to see something):
set.seed(1984)
## Let's explain that Genre and cetury are factors
movies.factored<-movies
movies.factored[,8:14]<-lapply(movies[,8:14],factor)
samp<-movies.factored[sample(1:nrow(movies), 0.05*nrow(movies)),2:15] ## 5% sample
## My computer is not that fast to use 10%+ samples.
## I also excluded title and century from the analysis
ggpairs(samp, params = c(shape = I("."), outlier.shape = I(".")),axisLabels = "none")
Looks promissing. Let’s continue with insights from correlation matrix.
ggplot(data=movies,aes(x=year,y=length))+
geom_point(alpha=1/15,position="jitter")+
ylim(c(0,quantile(movies$length,.99)))
And that is the most interesting graph so far. Insights:
Yes, that graph was insghtfull. I’ll certanly investigate it further in multivariable section.
ggplot(data=movies,
aes(x=as.factor(Animation),y=length,fill=as.factor(Animation)))+
geom_boxplot()+ylim(c(0,60*5))
## Considering only movies not longer than 5 hrs.
## Movie length median for Animation
median(movies$length[movies$Animation==1])
## [1] 7
## Movie length median for not-Animation
median(movies$length[movies$Animation==0])
## [1] 91
They are signifanctly different. As expected. Let’s check idea about title length of documentary films:
ggplot(data=movies,
aes(x=as.factor(Documentary),y=lenOfTitle,fill=as.factor(Documentary)))+
geom_boxplot()
Difference is not so significant, let’s check it with staistical tests:
t.test(lenOfTitle~Documentary,data=movies) ## Standart t-test
##
## Welch Two Sample t-test
##
## data: lenOfTitle by Documentary
## t = -24.8437, df = 3647.379, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.746041 -4.905447
## sample estimates:
## mean in group 0 mean in group 1
## 16.26383 21.58957
wilcox.test(lenOfTitle~Documentary,data=movies) ## Nonparametrical analog
##
## Wilcoxon rank sum test with continuity correction
##
## data: lenOfTitle by Documentary
## W = 72073436, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
And the both says that there is statistically significant difference between these groups. What about decrease in rating with time?
Let’s check these ideas with some nice plots:
ggplot(data=movies,aes(x=year,y=rating))+
geom_point(alpha=1/20,position="jitter",scale=0.8)+ ## Fighting overfitting
geom_line(stat="summary",fun.y=median,col=2,linetype=2) ## Adding median line
Okay, it’s downward trend in the huge time range from 1930 till 2000, but strong upward moving in the last few years. I can imagine such explanation: who does watch movies made in 1940-1980? Only fans and critics, but not mass audience. And we can expect high ratings for such classic movies. And older means more classic, and more classic means higher rating. But in early 2000-s began a massive movie production targeted exectly at mass audience. And it’s expected that such bestselelrs got high ratings from average people (low from critics, but who cares - there are few critics per thousad IMDb visitors).
So far so good. What about rating?
plot1<-ggplot(
data=movies,
aes(x=as.factor(Animation),y=rating,fill=as.factor(Animation)))+
geom_boxplot()
plot2<-ggplot(
data=movies,
aes(x=as.factor(Short),y=rating,fill=as.factor(Short)))+
geom_boxplot()
grid.arrange(plot1,plot2,ncol=2)
Looks like Short and Animated movies get higher ratings. Let’s check conclusion about short films:
ggplot(data=movies,aes(x=length,y=rating,color=as.factor(century)))+
geom_point(position="jitter",alpha=1/5)+
xlim(c(0,quantile(movies$length,0.99)))
It’s difficult to male any conclusions here. I am going to continue with three variables (like on previous plot)
Let’s back to our length~year graph and bring some colors.
ggplot(data=movies,aes(x=year,y=length,color=as.factor(Animation)))+
geom_point(alpha=1/10,position="jitter")+
ylim(c(0,quantile(movies$length,.99)))
Whoa! One of our clusters is animated movies. Let’s remember some cartoons from 1930… Yep, that’s famous Disney’s Mickey! We are gooing to the Mickey Mouse movie list and find out that almost all of them are 7-8 minutes long! So, riddle of one cluster is solved! Thank you, Mickey (and other Disnaey’s characters).
But our aim is to analyze movie ratings, so let’s back to our rating~something graphs:
ggplot(data=movies,aes(x=year,y=rating,color=as.factor(mpaa)))+
geom_point(alpha=1/10,position="jitter")
We could expect some correlation betwee MPAA rating and IMDb rating, but nothing interesing here. Let’s move on.
ggplot(data=movies,aes(x=year,y=rating,color=as.factor(Comedy)))+
geom_point(alpha=1/10,position="jitter")
Here we can notice blue point after ~1925. It’s known effect of Great Depression. Everything was bad, so everybody wanted for comedies. I’ll mkae more precise plot in final part.
It’s time to make some predictions now:
model1<-lm(data=movies,rating~length)
summary(model1)
##
## Call:
## lm(formula = rating ~ length, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0182 -0.9268 0.1549 1.0786 4.0120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0214706 0.0135013 445.991 < 2e-16 ***
## length -0.0010763 0.0001444 -7.455 9.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.552 on 58786 degrees of freedom
## Multiple R-squared: 0.0009446, Adjusted R-squared: 0.0009276
## F-statistic: 55.58 on 1 and 58786 DF, p-value: 9.087e-14
Okay, R squared is tiny, but explanatory variable is significant. I’ll just add more variables:
model2<-update(model1,.~
+year
+budget
+length
+lenOfTitle
+Action
+Animation
+Comedy
+Drama
+Documentary
+Romance
+Short
+mpaa
)
summary(model2)
##
## Call:
## lm(formula = rating ~ year + budget + length + lenOfTitle + Action +
## Animation + Comedy + Drama + Documentary + Romance + Short +
## mpaa, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5655 -0.8097 0.0960 0.9050 4.8420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.777e+01 2.119e+00 8.386 < 2e-16 ***
## year -6.903e-03 1.070e-03 -6.453 1.20e-10 ***
## budget 1.317e-09 1.093e-09 1.206 0.228052
## length 1.573e-02 9.028e-04 17.423 < 2e-16 ***
## lenOfTitle -3.463e-03 2.356e-03 -1.470 0.141755
## Action -1.992e-01 5.686e-02 -3.502 0.000465 ***
## Animation 8.776e-01 1.301e-01 6.744 1.71e-11 ***
## Comedy 2.683e-01 4.456e-02 6.022 1.84e-09 ***
## Drama 6.683e-01 4.299e-02 15.544 < 2e-16 ***
## Documentary 1.279e+00 1.286e-01 9.941 < 2e-16 ***
## Romance 1.088e-01 5.685e-02 1.914 0.055653 .
## Short 2.732e+00 1.017e-01 26.854 < 2e-16 ***
## mpaaNC-17 -3.078e-01 5.246e-01 -0.587 0.557398
## mpaaPG -2.681e-01 1.062e-01 -2.524 0.011647 *
## mpaaPG-13 -2.198e-01 7.770e-02 -2.829 0.004688 **
## mpaaR -1.122e-01 5.664e-02 -1.981 0.047640 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.385 on 5199 degrees of freedom
## (53573 observations deleted due to missingness)
## Multiple R-squared: 0.2011, Adjusted R-squared: 0.1988
## F-statistic: 87.25 on 15 and 5199 DF, p-value: < 2.2e-16
Adjusted R squred is now almost 20%. It means that 20% of variance in rating could be explained by these simple variables. I think, this is great result, because rating is human decision based data, it’s always difficult to predict. Intresting that budget, length of title and romance genre are not significant. Based on this model I can creat The Best Movie EVER.
We’ll just need:
Let’s predict it’s rating:
BestMovieEVER<-data.frame(
title=c("DA","DA.Dc"),
year=2015,
length=c(35,15*60),
budget=3,
mpaa="",
Action=0,
Animation=1,
Comedy=1,
Drama=1,
Documentary=1,
Romance=1,
Short=c(1,0),
lenOfTitle=c(2,5)
)
BestMovieEVER
## title year length budget mpaa Action Animation Comedy Drama Documentary
## 1 DA 2015 35 3 0 1 1 1 1
## 2 DA.Dc 2015 900 3 0 1 1 1 1
## Romance Short lenOfTitle
## 1 1 1 2
## 2 1 0 5
predict(model2,newdata = BestMovieEVER,interval="predict",level=0.9)
## fit lwr upr
## 1 10.33558 8.031507 12.63965
## 2 21.19827 18.597548 23.79900
Just great! Both our movies about Data Analysis will get high ratings on IMDb. Short film will get 10/10 (with 90%+ probability it’ll get 8+/10)… And our director’s cut will get 21/10. That’s awesome! (even in worst ~5% it will get 18/10). They shoud shange IMDb rating system for such epic movie.
I thinks that is awesome insight for moviemaking industry. They shoud make a comedy-romance-drama-documentary 15 hrs-long cartoon about data analysis. But 15 hrs is too long even for Lars von Trier. :)
ggplot(data=movies,aes(x=year))+
geom_histogram(binwidth=1,fill="darkgrey",color="black")+
geom_vline(xintercept=1926,col=3,lwd=1,alpha=1/2)+ ## Sound films
geom_vline(xintercept=1928,col=6,lwd=3,alpha=1/2)+ ## Colored films
geom_vline(xintercept=1943,col=2,lwd=4,alpha=1/2)+ ## World War II
geom_vline(xintercept=1987,col=1,lwd=1,alpha=1)+ ## Black Monday
annotate("text",x=1926-7,y=1300,label="Sound\nfilms",color=3)+ ## labels
annotate("text",x=1928+7,y=1500,label="Color\nfilms",color=6)+
annotate("text",x=1943+8,y=1300,label="WW II",color=2)+
annotate("text",x=1987-9,y=1300,label="Black\nMonday",color=1)+
ggtitle("Historical events impact on movie production")+ ## Title
scale_y_continuous(name="Movies produced in year")+ ## y axis lable
scale_x_continuous(name="Year") ## x axis lable
I think that this graph is extrmely interesting because we can see how real economical and political historical events impacted on movie industry. We have found impact of sound and colored movies inventions, second World Ward and US economic crysis started with Black Monday. But for sure we can see that somethig happend in 60-s. Maybe the Beatles decreased movie production? Maybe Cold War between USA and USSR? Or Yuri Gagarin and Niel Armstrong space achivements? It’s difficult to say, but I suppose Cold War as the most important factor here.
ggplot(data=movies,aes(x=year,y=length,color=as.factor(Animation)))+
geom_point(alpha=1/12,position="jitter")+
ggtitle("Length of animated and non-animated movies in time")+
scale_color_manual( ## Nice legend
values=c("#999999", "#E69F00"), ## Cartoons should be bright, othe - grey
name="Animated movie?",
labels=c("No", "Yes"))+
scale_y_continuous(
name="Length of a movie",
limits=c(0,quantile(movies$length,.99)))+
scale_x_continuous(
name="Year")
As I mentioned above, we can explain one of these strange clusters with only one company - Disney. Also we can see a lot of different interesting anomalies: my favorite is the lack of short films between 1950 and 1990.
ggplot(data=movies,aes(x=year,y=rating))+
geom_point(position="jitter",alpha=1/20)+
## Adding medians for Comedies and non-Comedies
geom_line(stat="summary",fun.y=median,aes(color=as.factor(Comedy)),lwd=1.5)+
## Adding annotations, because legend is too mainstream
annotate("text",x=1950,y=6.75,label="Comedy",
col="#EE4000",fontface="bold")+
annotate("text",x=1955,y=5.75,label="non-Comedy",
col="#008B00",fontface="bold")+
## Adding some historical events with descriptions
geom_vline(xintercept=1945,col=2)+
annotate("text",x=1945+5,y=4.5,label="WW II",
col=2,fontface="bold")+
geom_vline(xintercept=1934,col="darkgreen",lwd=12,alpha=1/2)+
annotate("text",x=1934-8,y=4.5,
label="Great\nDepression",col="darkgreen",fontface="bold")+
## Nice title of course
ggtitle("How do people liked comedies over the time?")+
## Scaling and axis lables
scale_y_continuous(
name="Movie rating",
limits=c(4,8))+
scale_x_continuous(
name="Year",
limits=c(1920,1980))+
## Removing the legend
theme(legend.position="none")+
## Nice colors
scale_color_manual( values=c("#008B00", "#EE4000"))
Here I’ve plotted median of comedies rating over time, to see how to interest in this genre have changed. Points are not so important here, so I’ve made them very transparent. We can find here interesing historical pattern: after great depression comedies became prefered for a long time, especially after WW II: rating of average non-comedy droped down, rating of an average comedy remained the same (despite of overall downward trend). Interesting that in time of Cold War sitatuon was opposite: there were a lot of bad comedies, but non-comedies films were great. We’ll talk about it in reflection part.
I have made some interesting conclusions about movie industry and it’s history:
It was extremely interesting, not too hard and very insightful analysis. I think it could be enriched with further analysis in both main directions: prediction of ratings and looking at impact of historical events. I suppose using new variables could enhance prediction model (maybe some crazy variables such as length of a creenplay in charaters, size of movie cast, etc.)