The subject of the project is the rating of movies and series from Netflix, Prime Video, Disney+ and Hulu platforms on IMDB and Rotten Tomatoes.
This project will seek answers to the following questions:
Data used are: “Movies on Netflix, Prime Video, Hulu and Disney+. A collection of movies found on these streaming platforms.” This data has been compiled by Ruchi Bhatia https://www.kaggle.com/ruchi798 and is hosted on Kaggle.
The link to access the data is: https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney
The license is “CC0: Public Domain” and has the tags “arts and entertainment, online communities, movies and tv shows”
The dataset is an amalgamation of:
Data visualization of the above can be found on: https://public.tableau.com/profile/ruchi.bhatia#!/vizhome/Moviesavailableonstreamingplatforms/Moviesavailableonstreamingapplications
The data is in csv format, and is composed of 16700 rows and 17 columns.
Each row is a unique record of a movie or series and each column is a different variable. In summary:
Fifty percent of the items are realized between 2012 and 2020 and the rest between 1902 and 2020.
The distribution of registrations by platform is:
We import the data with the “read_csv” function. We select the columns useful for the analysis. Rename the variables.
We clean the variable Rotten_Tomatoes, eliminating the character “%”. In the variable “Age” we delete the character “+”. We multiply the IMDb score by 10 so that it has the same scale as Rotten Tomatoes.
#### Use read_csv() or another function
df <- read_csv("./Data/MoviesOnStreamingPlatforms_updated.csv")
### We delete the columns that are not going to be used
df <- select(df, -c(X1, Type))
### Rename the columns to eliminate spaces
df <- df %>% rename(Rotten_Tomatoes = "Rotten Tomatoes",
Prime_Video = "Prime Video",
Disney = "Disney+")
### We eliminate the characters "%" and "+" from the Rotten and Age variables.
df$Rotten_Tomatoes <- as.numeric(gsub("%", "", df$Rotten_Tomatoes))
df$Age <- gsub("\\+", "", df$Age)
### We multiply the IMDb rating by 10 to compare it with the Rotten Tomatoes scale.
df$IMDb <- df$IMDb*10
str(df)
## tibble [16,744 × 15] (S3: tbl_df/tbl/data.frame)
## $ ID : num [1:16744] 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : chr [1:16744] "Inception" "The Matrix" "Avengers: Infinity War" "Back to the Future" ...
## $ Year : num [1:16744] 2010 1999 2018 1985 1966 ...
## $ Age : chr [1:16744] "13" "18" "13" "7" ...
## $ IMDb : num [1:16744] 88 87 85 85 88 84 85 84 84 83 ...
## $ Rotten_Tomatoes: num [1:16744] 87 87 84 96 97 97 95 87 95 89 ...
## $ Netflix : num [1:16744] 1 1 1 1 1 1 1 1 1 1 ...
## $ Hulu : num [1:16744] 0 0 0 0 0 0 0 0 0 0 ...
## $ Prime_Video : num [1:16744] 0 0 0 0 1 0 1 0 0 0 ...
## $ Disney : num [1:16744] 0 0 0 0 0 0 0 0 0 0 ...
## $ Directors : chr [1:16744] "Christopher Nolan" "Lana Wachowski,Lilly Wachowski" "Anthony Russo,Joe Russo" "Robert Zemeckis" ...
## $ Genres : chr [1:16744] "Action,Adventure,Sci-Fi,Thriller" "Action,Sci-Fi" "Action,Adventure,Sci-Fi" "Adventure,Comedy,Sci-Fi" ...
## $ Country : chr [1:16744] "United States,United Kingdom" "United States" "United States" "United States" ...
## $ Language : chr [1:16744] "English,Japanese,French" "English" "English" "English" ...
## $ Runtime : num [1:16744] 148 136 149 116 161 117 150 165 115 153 ...
head(df)
## # A tibble: 6 x 15
## ID Title Year Age IMDb Rotten_Tomatoes Netflix Hulu Prime_Video Disney
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Ince… 2010 13 88 87 1 0 0 0
## 2 2 The … 1999 18 87 87 1 0 0 0
## 3 3 Aven… 2018 13 85 84 1 0 0 0
## 4 4 Back… 1985 7 85 96 1 0 0 0
## 5 5 The … 1966 18 88 97 1 0 1 0
## 6 6 Spid… 2018 7 84 97 1 0 0 0
## # … with 5 more variables: Directors <chr>, Genres <chr>, Country <chr>,
## # Language <chr>, Runtime <dbl>
summary(df)
## ID Title Year Age
## Min. : 1 Length:16744 Min. :1902 Length:16744
## 1st Qu.: 4187 Class :character 1st Qu.:2000 Class :character
## Median : 8372 Mode :character Median :2012 Mode :character
## Mean : 8372 Mean :2003
## 3rd Qu.:12558 3rd Qu.:2016
## Max. :16744 Max. :2020
##
## IMDb Rotten_Tomatoes Netflix Hulu
## Min. : 0.00 Min. : 2.00 Min. :0.0000 Min. :0.00000
## 1st Qu.:51.00 1st Qu.: 44.00 1st Qu.:0.0000 1st Qu.:0.00000
## Median :61.00 Median : 71.00 Median :0.0000 Median :0.00000
## Mean :59.03 Mean : 65.43 Mean :0.2126 Mean :0.05393
## 3rd Qu.:69.00 3rd Qu.: 88.00 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :93.00 Max. :100.00 Max. :1.0000 Max. :1.00000
## NA's :571 NA's :11586
## Prime_Video Disney Directors Genres
## Min. :0.0000 Min. :0.00000 Length:16744 Length:16744
## 1st Qu.:0.0000 1st Qu.:0.00000 Class :character Class :character
## Median :1.0000 Median :0.00000 Mode :character Mode :character
## Mean :0.7378 Mean :0.03368
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
##
## Country Language Runtime
## Length:16744 Length:16744 Min. : 1.00
## Class :character Class :character 1st Qu.: 82.00
## Mode :character Mode :character Median : 92.00
## Mean : 93.41
## 3rd Qu.: 104.00
## Max. :1256.00
## NA's :592
Let’s make a histogram and a box plot to answer the question: Are there significant differences between IMDB and Rotten Tomatoes scores?
We need to select the variables of IMDb and Rotten Tomatoes scores. We will filter out the rows containing NaN. We will transform the data frame to long format.
# We selected the variables "platforms" and the IMDb and Rotten Tomatoes scores.
data_1 <- df %>%
select(IMDb, Rotten_Tomatoes)
# We eliminate the rows with any Nan value.
data_1 <- data_1 %>% drop_na()
# We transform the wide format
fig_dat_1 <- data_1 %>%
pivot_longer(IMDb:Rotten_Tomatoes, names_to = "score", values_to = "value")
# gather(score, value, IMDb:Rotten_Tomatoes, factor_key=TRUE)
records_fig_1 <- dim(fig_dat_1)[1]
As can be seen in the summary table, the number of records has been reduced to 10312.
# str(fig_dat_1)
score_median <- fig_dat_1 %>%
group_by(score) %>%
summarise(median = median(value))
IMDb_median <- score_median[1,2]
Rotten_median <- score_median[2,2]
summary(fig_dat_1)
## score value
## Length:10312 Min. : 2.00
## Class :character 1st Qu.: 55.00
## Mode :character Median : 66.00
## Mean : 64.59
## 3rd Qu.: 77.00
## Max. :100.00
The median IMDb score is a 65 and the median Rotten Tomatoes score is a 71.
Is the difference in medians significant? Let’s plot the distribution of both scores.
The histogram shows that the IMDb scores seem to obey a Gaussian distribution, while the Rotten Tomatoes score is more flattened.
The box plot shows more clearly the difference between their medians and how the distribution of IMDb scores is narrower.
As a preview, the p-value of the Wilcoxon test is printed on the graph.
### from library(ggpubr)
fig1_1 <- fig_dat_1 %>% gghistogram(x = "value",
add = "median", rug = TRUE,
color = "score", fill = "score",
palette = c("#00AFBB", "#E7B800"),
title = "Histogram of IMDb and Rotten Tomatoes scores",
legend.title="")+
labs(x="score")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
fig1_2 <- fig_dat_1 %>% ggboxplot(x = "score", y = "value",
title = "Distribution of IMDb and Rotten Tomatoes scores",
color = "score",
palette = c("#00AFBB", "#E7B800"),
legend.title="") +
labs(x="Type of score",
y="score")+
# Add pairwise comparisons p-value
stat_compare_means(comparisons = c("IMDb",
"Rotten_Tomatoes")) +
# Add global p-value
stat_compare_means(label.y = 100)
ggarrange(fig1_1, fig1_2)
## Warning: Computation failed in `stat_signif()`:
## not enough 'y' observations
Applying the F-test we can check if the variances are equal. From the output of the test we can state that they are not equal.
var.test(data_1$IMDb, data_1$Rotten_Tomatoes)
##
## F test to compare two variances
##
## data: data_1$IMDb and data_1$Rotten_Tomatoes
## F = 0.14186, num df = 5155, denom df = 5155, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.1343198 0.1498190
## sample estimates:
## ratio of variances
## 0.1418579
ttest_fig_1 <- t.test(data_1$IMDb, data_1$Rotten_Tomatoes)
ttest_fig_1
##
## Welch Two Sample t-test
##
## data: data_1$IMDb and data_1$Rotten_Tomatoes
## t = -4.2392, df = 6588.7, p-value = 2.274e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4554389 -0.9025906
## sample estimates:
## mean of x mean of y
## 63.75233 65.43134
wilcox.test(data_1$IMDb, data_1$Rotten_Tomatoes)
##
## Wilcoxon rank sum test with continuity correction
##
## data: data_1$IMDb and data_1$Rotten_Tomatoes
## W = 11284972, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Let’s apply the two-tailed t-test and Wicox test, assuming unpaired samples and different variances. Since the confidence interval (-2.4554389, -0.9025906) does not contain zero and the p-value (2.2738753^{-5}) is less than 0.05, we can state that the IMDb and Rotten Tomatoes scores are different and have means of: 63.75, 65.43, respectively.
The data seems to indicate that the average score on Rotten Tomatoes is slightly higher than on IMDb.
Let’s make a box plot to answer the question: Which platform has the highest rated series/movies?
We are going to select only the IMDb scores as the Rotten Tomatoes score is limited to one third of the records. We will select the variables IMDb, Netflix, Hulu, Prime_Video and Disney. We will filter out the rows containing NaN. We will transform the data frame to long format.
#fig_dat2
### We selected the variables "platforms" and the IMDb scores.
data_2 <- df %>%
select(IMDb, Netflix, Hulu, Prime_Video, Disney)
data_2 <- data_2 %>% drop_na()
### We convert to long format
fig_dat_2 <- data_2 %>%
pivot_longer(Netflix:Disney, names_to = "platforms", values_to = "value") %>%
# gather(platforms, value, Netflix:Disney) %>%
filter(value==1) %>%
select(platforms, IMDb)
### We convert the variable "platforms" into a categorical variable.
fig_dat_2$platforms <- as.factor(fig_dat_2$platforms)
### Platform sorted by median IMDb rating
platforms_median <-fig_dat_2 %>%
group_by(platforms) %>%
summarise(median = median(IMDb)) %>%
arrange(desc(median))
# We reorder the categories (platforms) by median distribution.
fig_dat_2$platforms <- fct_reorder(fig_dat_2$platforms, fig_dat_2$IMDb, median)
knitr::kable(platforms_median, caption = "Platform sorted by median IMDb rating")
| platforms | median |
|---|---|
| Disney | 65 |
| Netflix | 63 |
| Hulu | 62 |
| Prime_Video | 59 |
The platform with the highest median is Disney with 65 and the platform with the lowest median is Amazon Video with 59. As can be seen in the table above.
The number of films is not homogeneous across the four platforms. Amazon Video has the most movies with more than twice as many as the one with the least (Disney), as can be seen in the following summary table.
# str(fig_dat_2)
summary(fig_dat_2)
## platforms IMDb
## Prime_Video:11908 Min. : 0.00
## Hulu : 892 1st Qu.:51.00
## Netflix : 3442 Median :61.00
## Disney : 563 Mean :59.12
## 3rd Qu.:69.00
## Max. :93.00
We will plot the distribution of IMDb scores across the four platforms. We have chosen a violin representation in which a box plot has been included.
A preliminary Kruskal-Wallis analysis by platform pairs has been performed and included in the graph. As is known, the Kruskal-Wallis test is the nonparametric alternative to the one-way ANOVA test for unpaired data.
From the p-values, we can infer that there are significant differences between all platform pairs.
Therefore we can state that Disney has the highest median scores, followed by Netflix and Hulu. Amazon Video has the lowest median.
### Violin plot of the distribution of scores by platform.
### We establish the comparison pairs.
my_comparisons <- list( c("Disney", "Hulu"), c("Disney", "Netflix"),
c("Disney", "Prime_Video"), c("Hulu", "Netflix"),
c("Hulu", "Prime_Video"), c("Netflix", "Prime_Video") )
fig_dat_2 %>% ggviolin(x = "platforms", y = "IMDb", fill="platforms",
title = "Distribution of IMDb scores by platform",
add = "boxplot", add.params = list(fill = "white"),
legend.title="",
palette = "jco") +
# Add pairwise comparisons p-value
stat_compare_means(comparisons = my_comparisons, abel = "p.signif") +
# Add global p-value
stat_compare_means(label.y = 100)+
labs(y="IMDb score")
To show the differences between the distribution of scores per platform, we have resorted to a density representation of the platform. Vertical lines have also been included to mark the median value of each distribution. In this way the differences between them can be better observed.
fig_dat_2 %>% ggdensity(x = "IMDb",
color = "platforms", fill = "platforms",
add = "median", rug = TRUE,
title = "Density of scores by streaming platform",
legend.title="",
palette = "jco") +
labs(x="IMDb score")
We are going to perform a scatter graph to answer the question: Does the length of a movie influence its score?
We are going to select only the IMDb and Runtime variables. We will delete the rows containing NaN.
data_3 <- df %>%
select(IMDb, Runtime) %>% drop_na()
records_data_3 <- dim(data_3)[1]
min_runtime <- min(data_3$Runtime)
max_runtime <- max(data_3$Runtime)
str(data_3)
## tibble [15,819 × 2] (S3: tbl_df/tbl/data.frame)
## $ IMDb : num [1:15819] 88 87 85 85 88 84 85 84 84 83 ...
## $ Runtime: num [1:15819] 148 136 149 116 161 117 150 165 115 153 ...
The records are reduced to 15819. Runtime values range from 1 to 1256. Both values seem unusual for a film, so let’s plot a histogram of the films duration.
summary(data_3)
## IMDb Runtime
## Min. : 0.00 Min. : 1.00
## 1st Qu.:51.00 1st Qu.: 83.00
## Median :61.00 Median : 92.00
## Mean :59.04 Mean : 93.77
## 3rd Qu.:69.00 3rd Qu.: 104.00
## Max. :93.00 Max. :1256.00
fig3_1 <- data_3 %>% gghistogram(x = "Runtime",
add = "median", rug = TRUE,
color = "#00AFBB", fill = "#00AFBB",
#palette = c("#00AFBB", "#E7B800"),
title = "Histogram of runtime movies",
legend.title="")+
labs(x= "runtime (min)")
fig3_1
The distribution has a very long tail on the right. These extreme values may condition the result of the least squares fit.
As can be seen in the following interactive figure the fit line is conditioned by these extreme points. From the distribution of the points we see that there seem to be two groups of points. Since most of the movies are shorter than 300 minutes, we will filter the data, leaving only movies with a duration of less than 300 minutes.
p <- data_3 %>%
ggplot(aes(x=Runtime, y=IMDb))+
geom_point() +
geom_smooth(method=lm)+
labs(y="IMDb score",
x= "runtime (min)",
title="\nIMDb score vs. runtime of movies")
fig_3_0 <- ggplotly(p)
## `geom_smooth()` using formula 'y ~ x'
fig_3_0
data_3_f <- data_3 %>%
filter(Runtime < 300)
records_data_3_f <- dim(data_3_f)[1]
data_3_lost <- round((1 - records_data_3_f / records_data_3) * 100, 2)
min_runtime_f <- min(data_3_f $Runtime)
max_runtime_f <- max(data_3_f $Runtime)
summary(data_3_f)
## IMDb Runtime
## Min. : 0.00 Min. : 1.00
## 1st Qu.:51.00 1st Qu.: 83.00
## Median :61.00 Median : 92.00
## Mean :59.04 Mean : 93.57
## 3rd Qu.:69.00 3rd Qu.:104.00
## Max. :93.00 Max. :270.00
0.04 % of the data have been eliminated. The records are reduced to 15812. Runtime values range from 1 to 270.
As we can see in the following graph, the distribution still has a bit of a tail on the right, but overall it is much more centered.
fig3_1_b <- data_3_f %>% gghistogram(x = "Runtime",
add = "median", rug = TRUE,
color = "#00AFBB", fill = "#00AFBB",
#palette = c("#00AFBB", "#E7B800"),
title = "Histogram of runtime movies < 300 min",
legend.title="")+
labs(x= "runtime (min)")
fig3_1_b
p_f <- data_3_f %>%
ggplot(aes(x=Runtime, y=IMDb))+
geom_point() +
geom_smooth(method=lm)+
labs(y="IMDb score",
x= "runtime (min)",
title="\nIMDb score vs. runtime of movies <300 min.")
fig_3_0_f <- ggplotly(p_f)
## `geom_smooth()` using formula 'y ~ x'
fig_3_0_f
runtime_lm <- lm(IMDb ~ Runtime, data = data_3_f)
runtime_lm_summary <- summary(runtime_lm)
fig_3_intercept <- runtime_lm$coefficients[1]
fig_3_slop <- runtime_lm$coefficients[2]
fig_3_r.squered <- runtime_lm_summary$r.squared
# runtime_lm_summary$coefficients[2,4]
runtime_lm_summary
##
## Call:
## lm(formula = IMDb ~ Runtime, data = data_3_f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.282 -8.109 1.778 9.778 35.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.134940 0.406439 133.19 <2e-16 ***
## Runtime 0.052449 0.004192 12.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 15810 degrees of freedom
## Multiple R-squared: 0.009807, Adjusted R-squared: 0.009744
## F-statistic: 156.6 on 1 and 15810 DF, p-value: < 2.2e-16
# runtime_lm_summary$coefficients
The linear adjustment complies with the following equation:
\[\ Score IMDb = 54.1349397 + 0.0524491 * Runtime \] \(\ R^2\) takes the value of 0.0098065, and therefore justifies a 0.98 % variability.
To answer the question Of the last 15 years, which one has the best score on IMDb?, we will use a graph of
In the following interactive graph we can find that the year with the best average is 2020 and the worst year is 2007.
data_4 <- df %>%
select(Year, IMDb) %>% drop_na() %>% filter(Year>2005)
fig_4_data <- data_4 %>%
group_by(Year) %>%
summarise(mean = mean(IMDb))
pf_4 <- fig_4_data %>% ggplot(aes(x=Year, y=mean)) +
geom_point() +
geom_line(color="blue") +
ylim(0, max(fig_4_data$mean))+
labs(y="Average IMDb rating",
title="\nAverage IMDb score vs. years since 2006")
fig_4 <- ggplotly(pf_4)
fig_4
We select the IMDb and Genres variables. A movie can belong to several genres. In the variable “Genres” the values are separated by commas. We separate the values by creating new rows with the score and each “genre”.
data_5_0 <- df %>%
select(IMDb, Genres)
data_5_0
## # A tibble: 16,744 x 2
## IMDb Genres
## <dbl> <chr>
## 1 88 Action,Adventure,Sci-Fi,Thriller
## 2 87 Action,Sci-Fi
## 3 85 Action,Adventure,Sci-Fi
## 4 85 Adventure,Comedy,Sci-Fi
## 5 88 Western
## 6 84 Animation,Action,Adventure,Family,Sci-Fi
## 7 85 Biography,Drama,Music,War
## 8 84 Drama,Western
## 9 84 Action,Adventure
## 10 83 Adventure,Drama,War
## # … with 16,734 more rows
data_5 <- df %>%
select(IMDb, Genres) %>%
separate_rows(Genres, sep=",", convert=TRUE) %>% drop_na()
#unique(data_5$Genres)
data_5$Genres <- as.factor(data_5$Genres)
str(data_5)
## tibble [38,376 × 2] (S3: tbl_df/tbl/data.frame)
## $ IMDb : num [1:38376] 88 88 88 88 87 87 85 85 85 85 ...
## $ Genres: Factor w/ 27 levels "Action","Adventure",..: 1 2 21 25 1 21 1 2 21 2 ...
From the following graph showing the histograms of IMDb scores by genre, ordered from lowest to highest, we can infer that the genre with the worst scores is horror and the genre with the best scores is news.
We can also observe that most of the films belong to the drama genre, followed by comedy.
p <- data_5 %>%
mutate(Genres = fct_reorder(Genres, IMDb)) %>%
ggplot( aes(x=IMDb, color=Genres, fill=Genres)) +
geom_histogram(alpha=0.6, binwidth = 5) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
theme_ipsum() +
theme(
legend.position="none",
panel.spacing = unit(0.5, "lines"),
strip.text.x = element_text(size = 8)
) +
xlab("IMDb score") +
ylab("Number of films") +
labs(title = "Histograms of movie numbers vs. IMDb score by genre") +
theme(axis.text.x = element_text(size=6),
axis.text.y = element_text(size=6),
axis.title.y = element_text(hjust=0.5),
axis.title.x = element_text(hjust=0.5)) +
facet_wrap(~Genres)
mean_data_5 <- data_5 %>%
group_by(Genres) %>%
summarise(mean_IMDb= round(mean(IMDb), 2))
knitr::kable(arrange(mean_data_5, mean_IMDb))
| Genres | mean_IMDb |
|---|---|
| Horror | 47.24 |
| Sci-Fi | 49.56 |
| Thriller | 53.52 |
| Action | 54.50 |
| Reality-TV | 56.00 |
| Adventure | 56.15 |
| Mystery | 56.46 |
| Fantasy | 56.85 |
| Talk-Show | 58.47 |
| Crime | 58.71 |
| Western | 59.01 |
| Comedy | 59.06 |
| Family | 60.23 |
| Romance | 60.45 |
| Drama | 60.75 |
| Game-Show | 61.33 |
| Musical | 62.37 |
| War | 62.50 |
| Animation | 62.52 |
| Film-Noir | 63.33 |
| Sport | 64.18 |
| Short | 64.41 |
| Music | 66.00 |
| History | 67.32 |
| Biography | 68.09 |
| Documentary | 69.43 |
| News | 71.10 |
p