Introduction

Having a dataset of movies randomly sampled from www.imdb.com and www.rottentomatoes.com, I wondered how the proportion of R-rated movies has changed over the decades, from the 1970s to the 2010s. This report is the result of that investigation, done in R, and it includes all the code. While the statistics behind it are limited to exploratory data analysis and bootstrapping, the result is quite interesting.

According to the dataset, the proportion of R-rated movies increased over the 70s, 80s and 90s, to the point that the 90s show the highest proportion of all the decades. The trend then reversed: the proportion of R-rated movies decreased in the 2000s compared to the 90s, and again in the 2010s compared to the 2000s. The difference in proportions between each pair of consecutive decades is statistically significant at a significance level of 0.05, except for the 2000s and the 2010s.

One possible reason for this trend is the tendency of studios to produce movies for a broader audience, since doing so may positively affect revenue. Although I am no expert in this domain, some sources, such as this article by Tim Winter on Fox News, support my educated guess.

This report could be improved in several ways, of which replacing the current dataset with a more comprehensive one would probably be the most influential.

This analysis has been a hobby mini-project for me, and I decided to publish it on my LinkedIn profile to motivate myself to publish more mini-projects and to explore the capabilities of the platform. Finally, any critical comments on the technical and non-technical aspects are highly welcome.

R-rated movies through decades: A quantitative analysis

Has the proportion of R-rated movies changed during recent decades?

This report investigates the above question, for the sake of playing with data! In other words, this is a hobby mini-project done with a dataset of movies. The question came to my mind while I was working on the data to build a regression model, and I found it a good opportunity to do a bit of analysis and publish a short report on it.

The data is extracted from two well-known sources, www.imdb.com and www.rottentomatoes.com. The extraction and preparation were done by the authors of the Linear Regression and Modelling course, created by Duke University on the Coursera platform; thanks to Dr. Mine Çetinkaya-Rundel and her team!

The dataset can be downloaded from the course page, under the project files and Rubric sub-section.

The dataset is relatively small, with only 651 titles. However, the titles were chosen randomly, so the results of the inferential analysis can hopefully be generalized to the population. I assume the size of the dataset is sufficient for such inference, while keeping in mind that a more serious, comprehensive analysis would require a larger dataset prepared from the sources mentioned above.

The R rating is currently defined as: Under 17 requires accompanying parent or adult guardian. Contains some adult material. Parents are urged to learn more about the film before taking their young children with them. The rating is assigned under the Motion Picture Association of America (MPAA) film-rating system.

Let’s rephrase the research question:

Is there any relationship between the proportion of R-rated movies and time period? The time periods are 1970-80, 1980-90, 1990-2000, 2000-2010, and 2010-2014, since the latest movies in the dataset are from 2014.
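
As a side note, the grouping of release years into these periods can be written compactly with R's cut() function. The sketch below is just an equivalent illustration of the series of comparisons used in the Codes section at the end (it assumes the thtr_rel_year column of the dataset):

# illustrative only: bin theatrical release years into the five periods
movies$time_period <- cut(movies$thtr_rel_year,
                          breaks = c(-Inf, 1980, 1990, 2000, 2010, Inf),
                          labels = c("70s", "80s", "90s", "2000s", "2010s"),
                          right = FALSE)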


In the first step, let’s have a look at the proportions of R-rated movies over various decades.

Table 1 - Contingency table of MPAA rating categories and time periods

            70s   80s   90s   2000s   2010s
G             6     2     5       3       3
NC-17         1     1     0       0       0
PG           27    33    19      30       9
PG-13         0    13    35      60      25
R            16    51   101     115      46
Unrated       1     3     1      26      19
Total        51   103   161     234     102
Table 2 - The proportion of R-rated movies over the total movies of the corresponding decade

                 70s      80s      90s    2000s    2010s
Proportion    0.3137   0.4951   0.6273   0.4915    0.451
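
For what it's worth, the proportions in Table 2 can also be obtained directly from the contingency table with prop.table(); the snippet below is only an equivalent sketch of what the Codes section computes by dividing the R row by the column totals (ct is just a local name for the same contingency table):

ct <- table(movies$mpaa_rating, movies$time_period)
round(prop.table(ct, margin = 2)["R", ], 4)   # column-wise proportions, R row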

Now let's have a look at these proportions, literally! Figure 1 shows the proportions as bars over the time periods.

The change is conspicuous: the proportion of R-rated movies increased and then decreased, peaking in the 90s. But how strong is this evidence? Are the differences due to randomness, or are they statistically significant and meaningful? To answer this question, I am going to rely on frequentist statistics and the good old p-value!

What needs to be done is to compare the proportions of R-rated movies for each consecutive pair of periods. This can be done either with bootstrapping or with the CLT-based method for comparing two proportions. In either case, the hypotheses are as follows:

Null hypothesis: The difference between the two proportions is due to random chance (p1 = p2).

Alternative hypothesis: The proportions of the two periods are significantly different, i.e. they come from two different populations (p1 != p2).

In order to use the CLT-based method (calculating a Z-score and so on), the success-failure conditions must be checked. For the first pair, the 70s and 80s movies, these conditions are not met. So I wholeheartedly use bootstrapping to calculate the P-value, not only because the CLT conditions are not met, but also because of my personal preference!
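
For reference, this is roughly how the CLT-based comparison could be written in R if the conditions were satisfied, using the 70s and 80s counts from Table 1; prop.test() performs a chi-squared test that, for two groups and without continuity correction, is equivalent to the two-proportion Z-test. The report itself sticks to the simulation-based approach.

# 16 and 51 R-rated movies out of 51 and 103 titles in the 70s and 80s (Table 1)
prop.test(x = c(16, 51), n = c(51, 103), correct = FALSE)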

The rest of this report graphically shows the P-value, which is the probability of observing such a difference in proportions, or a more extreme one, under the null hypothesis. All the differences are statistically significant, except the difference between the 2000s and the 2010s. Then again, the 2010s are not over yet, so the data is incomplete and the result may change with more recent movies.

It should be noted that I have used two-tailed tests, as shown by the two red lines on the histograms. Maybe I should have used a one-tailed test here; however, the results would not change. I would be happy to hear your comments on this issue.
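
Since the same simulation is run for every pair of consecutive periods, it could also be wrapped in a small helper. The sketch below (compare_periods and reps are names introduced here, not part of the original code) mirrors the logic of the Codes section: shuffle the pooled R / non-R labels, recompute the difference in proportions, and take the share of simulated differences at least as extreme as the observed one as the two-tailed P-value.

compare_periods <- function(movies, period1, period2, reps = 10000) {
  sub  <- movies[movies$time_period %in% c(period1, period2), ]
  is_R <- sub$mpaa_rating == "R"
  grp  <- sub$time_period
  # observed difference in R-rated proportions (second period minus first)
  obs  <- mean(is_R[grp == period2]) - mean(is_R[grp == period1])
  # simulate the null hypothesis by permuting the R / non-R labels
  sims <- replicate(reps, {
    shuffled <- sample(is_R)
    mean(shuffled[grp == period2]) - mean(shuffled[grp == period1])
  })
  mean(abs(sims) >= abs(obs))   # two-tailed P-value
}

set.seed(7)
compare_periods(movies, "70s", "80s")

The numbers will differ slightly from the outputs below, since the shuffling order and the handling of ties are not identical to the original code.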

There are five periods, and therefore I will do four tests.

## [1] "for the first group the n1*poolHat is greater than 10? FALSE"
## [1] "for the second group the n2*poolHat is greater than 10? TRUE"
## [1] "So the success-failure condition is not met, and I can happily venture on using bootstrap method"
## [1] "The Pvalue of bootstrap method is 0.0248"

The P-value clearly shows that, under the null hypothesis, the probability of observing such a difference between the proportions of R-rated movies in the 70s and in the 80s is below 0.05. Therefore, we reject the null hypothesis in favor of the alternative hypothesis. Consequently, the difference between the proportions of R-rated movies in these two periods is not due to random chance but reflects some behaviour or structure. The difference is statistically significant.

The plot also shows the bootstrap distribution of proportion differences under the null hypothesis, along with the observed difference between the two periods.

As with these two periods, the bootstrap P-value is calculated for the next pairs of periods. To keep things consistent, I continue with bootstrapping.

## [1] "The Pvalue of bootstrap method is 0.0295"

The same result: the P-value is less than 0.05, so I reject the null hypothesis in favor of the alternative. The change in the proportion of R-rated movies between the 80s and the 90s is not due to chance.

## [1] "The Pvalue of bootstrap method is 0.00829999999999997"

Again the null hypothesis is rejected in favor of the alternative hypothesis. The difference between the proportions of R-rated movies in the 90s and the 2000s is not due to chance.

## [1] "The Pvalue of bootstrap method is 0.4765"

The P-value is much higher than the 0.05 threshold, so we fail to reject the null hypothesis. The difference between the proportions of the 2000s and the 2010s is not significant.

Overall, there seems to be a pattern in the proportions of R-rated movies produced since 1970: the proportion increased significantly until the start of the millennium, then decreased significantly in the first decade of the millennium, and since then it appears to have remained almost constant, with no significant change.

The next section includes the code I wrote for this analysis.

Code

require(ggplot2)
#require(xtable)
require(knitr)

# load the course dataset (adjust the path to your local copy of the .Rdata file)
load(file = "/Users/Shaahin/Dropbox/Statistics/Duke Specialisation/Course 3/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata")

# bin theatrical release years into five time periods
movies$time_period = movies$thtr_rel_year
movies$time_period[movies$thtr_rel_year < 1980] = "70s"
movies$time_period[movies$thtr_rel_year < 1990 & movies$thtr_rel_year >= 1980] = "80s"
movies$time_period[movies$thtr_rel_year < 2000 & movies$thtr_rel_year >= 1990] = "90s"
movies$time_period[movies$thtr_rel_year >= 2000 & movies$thtr_rel_year < 2010] = "2000s"
movies$time_period[movies$thtr_rel_year >= 2010] = "2010s"
movies$time_period = factor(movies$time_period, levels = c("70s", "80s", "90s", "2000s", "2010s"))
# contingency table of MPAA rating categories and time periods
r_table = as.matrix(table(movies$mpaa_rating, movies$time_period))
kable(rbind(r_table, Total = colSums(r_table)),
      caption = "Table 1 - Contingency table of MPAA rating categories and time periods")
# position of "R" among the rows of the contingency table
R_index = which(rownames(r_table) == "R")

# proportion of R-rated movies in each time period (R count / column total)
R_time_prop = r_table[R_index, ] / colSums(r_table)
prop_table = t(as.matrix(round(R_time_prop, 4)))
rownames(prop_table) = "Proportion"
kable(prop_table, caption = "Table 2 - The proportion of R-rated movies over the total movies of the corresponding decade")
par(mfrow=c(1,1))

r_prop_df = data.frame(decades = factor(names(R_time_prop),levels=c("70s","80s","90s","2000s","2010s")) , proportions = R_time_prop)
rownames(r_prop_df) = NULL

ggplot(data = r_prop_df , aes(decades,proportions )) + geom_bar(stat = "identity") + ggtitle("Figure 1: The proportions of R-rated movies over time periods")
pool_vector = apply(table(movies$mpaa_rating, movies$time_period), 2, sum)  # movies per period

# 70s vs 80s: check the success-failure conditions for the CLT-based test
pool = pool_vector[1] + pool_vector[2]                     # total movies in the two periods
n1 = table(movies$mpaa_rating, movies$time_period)[5, 1]   # R-rated movies in the 70s
n2 = table(movies$mpaa_rating, movies$time_period)[5, 2]   # R-rated movies in the 80s
prop_dif = (n2/pool_vector[2]) - (n1/pool_vector[1])       # observed difference in proportions
pool_hat = (n1 + n2)/pool                                  # pooled proportion of R-rated movies

print(paste("for the first group the n1*poolHat is greater than 10?", n1*pool_hat > 10))
print(paste("for the second group the n2*poolHat is greater than 10?", n2*pool_hat > 10))

print("So the success-failure condition is not met, and I can happily venture on using bootstrap method")

# pooled labels: all movies from the two periods, marked as R or non-R
bootstrap_pool = c(rep("R", n1 + n2), rep("nR", pool - n1 - n2))

set.seed(7)
prop_vector = vector(length = 10000)

# simulate the null hypothesis: shuffle the pooled labels (sampling without
# replacement), split them into two groups of the original sizes, and record
# the difference in R-rated proportions
for (i in 1:10000) {
  bootstrap_pool_t = sample(x = bootstrap_pool, size = pool, replace = FALSE)
  firstPeriod = bootstrap_pool_t[1:pool_vector[1]]
  secondPeriod = bootstrap_pool_t[(pool_vector[1] + 1):pool]
  firstProp = sum(firstPeriod == "R")/length(firstPeriod)
  secondProp = sum(secondPeriod == "R")/length(secondPeriod)
  boot_prop_dif = secondProp - firstProp
  prop_vector[i] = boot_prop_dif
}

# two-tailed P-value: fraction of simulated differences larger in magnitude
# than the observed difference
boot_Pvalue = 1 - sum(abs(prop_dif) >= abs(prop_vector))/length(prop_vector)
print(paste("The Pvalue of bootstrap method is", boot_Pvalue))

hist(prop_vector)
abline(v = prop_dif, col = "red")
abline(v = -prop_dif, col = "red")
# 80s vs 90s: same procedure as for the first pair
pool_vector = apply(table(movies$mpaa_rating, movies$time_period), 2, sum)

pool1 = pool_vector[2]                                     # total movies in the 80s
pool2 = pool_vector[3]                                     # total movies in the 90s
pool = pool1 + pool2
n1 = table(movies$mpaa_rating, movies$time_period)[5, 2]   # R-rated movies in the 80s
n2 = table(movies$mpaa_rating, movies$time_period)[5, 3]   # R-rated movies in the 90s
prop_dif = (n2/pool2) - (n1/pool1)
pool_hat = (n1 + n2)/pool

#print(paste("for the first group the n1*poolHat is greater than 10?",n1*pool_hat > 10)) 
#print(paste("for the second group the n2*poolHat is greater than 10?",n2*pool_hat > 10)) 


bootstrap_pool = c(rep("R",n1+n2),rep("nR",pool-n1-n2) )


set.seed(7)
prop_vector = vector(length = 10000)

for (i in 1:10000) {
  bootstrap_pool_t = sample(x = bootstrap_pool, size = pool, replace = FALSE)
  firstPeriod = bootstrap_pool_t[1:pool1]
  secondPeriod = bootstrap_pool_t[(pool1 + 1):pool]
  firstProp = sum(firstPeriod == "R")/length(firstPeriod)
  secondProp = sum(secondPeriod == "R")/length(secondPeriod)
  boot_prop_dif = secondProp - firstProp
  prop_vector[i] = boot_prop_dif
}

boot_Pvalue = 1-sum(abs(prop_dif)>=abs(prop_vector))/length(prop_vector)
print(paste("The Pvalue of bootstrap method is",boot_Pvalue))

hist(prop_vector)
abline(v=prop_dif, col = "red")
abline(v=-prop_dif, col = "red")
# 90s vs 2000s: same procedure
pool_vector = apply(table(movies$mpaa_rating, movies$time_period), 2, sum)

pool1 = pool_vector[3]                                     # total movies in the 90s
pool2 = pool_vector[4]                                     # total movies in the 2000s
pool = pool1 + pool2
n1 = table(movies$mpaa_rating, movies$time_period)[5, 3]   # R-rated movies in the 90s
n2 = table(movies$mpaa_rating, movies$time_period)[5, 4]   # R-rated movies in the 2000s
prop_dif = (n2/pool2) - (n1/pool1)
pool_hat = (n1 + n2)/pool

#print(paste("for the first group the n1*poolHat is greater than 10?",n1*pool_hat > 10)) 
#print(paste("for the second group the n2*poolHat is greater than 10?",n2*pool_hat > 10)) 


bootstrap_pool = c(rep("R",n1+n2),rep("nR",pool-n1-n2) )


set.seed(7)
prop_vector = vector(length = 10000)

for (i in 1:10000) {
  bootstrap_pool_t = sample(x = bootstrap_pool, size = pool, replace = FALSE)
  firstPeriod = bootstrap_pool_t[1:pool1]
  secondPeriod = bootstrap_pool_t[(pool1 + 1):pool]
  firstProp = sum(firstPeriod == "R")/length(firstPeriod)
  secondProp = sum(secondPeriod == "R")/length(secondPeriod)
  boot_prop_dif = secondProp - firstProp
  prop_vector[i] = boot_prop_dif
}
hist(prop_vector)
abline(v=prop_dif, col = "red")
abline(v=-prop_dif, col = "red")

boot_Pvalue = 1-sum(abs(prop_dif)>=abs(prop_vector))/length(prop_vector)
print(paste("The Pvalue of bootstrap method is",boot_Pvalue))
# 2000s vs 2010s: same procedure
pool_vector = apply(table(movies$mpaa_rating, movies$time_period), 2, sum)

pool1 = pool_vector[4]                                     # total movies in the 2000s
pool2 = pool_vector[5]                                     # total movies in the 2010s
pool = pool1 + pool2
n1 = table(movies$mpaa_rating, movies$time_period)[5, 4]   # R-rated movies in the 2000s
n2 = table(movies$mpaa_rating, movies$time_period)[5, 5]   # R-rated movies in the 2010s
prop_dif = (n2/pool2) - (n1/pool1)
pool_hat = (n1 + n2)/pool

# print(paste("for the first group the n1*poolHat is greater than 10?", n1*pool_hat > 10))
# print(paste("for the second group the n2*poolHat is greater than 10?", n2*pool_hat > 10))


bootstrap_pool = c(rep("R", n1 + n2), rep("nR", pool - n1 - n2))


set.seed(7)
prop_vector = vector(length = 10000)

for (i in 1:10000) {
    bootstrap_pool_t = sample(x = bootstrap_pool, size = pool, replace = FALSE)
    firstPeriod = bootstrap_pool_t[1:pool1]
    secondPeriod = bootstrap_pool_t[(pool1 + 1):pool]
    firstProp = sum(firstPeriod == "R")/length(firstPeriod)
    secondProp = sum(secondPeriod == "R")/length(secondPeriod)
    boot_prop_dif = secondProp - firstProp
    prop_vector[i] = boot_prop_dif
}
hist(prop_vector)
abline(v = prop_dif, col = "red")
abline(v = -prop_dif, col = "red")

boot_Pvalue = 1 - sum(abs(prop_dif) >= abs(prop_vector))/length(prop_vector)
print(paste("The Pvalue of bootstrap method is", boot_Pvalue))