1 SUMMARY

The subject of the project is the rating of movies and series from Netflix, Prime Video, Disney+ and Hulu platforms on IMDB and Rotten Tomatoes.

This project will seek answers to the following questions:

2 Importing data

2.1 Data Sources

Data used are: “Movies on Netflix, Prime Video, Hulu and Disney+. A collection of movies found on these streaming platforms.” This data has been compiled by Ruchi Bhatia https://www.kaggle.com/ruchi798 and is hosted on Kaggle.

The link to access the data is: https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney

The license is “CC0: Public Domain” and has the tags “arts and entertainment, online communities, movies and tv shows”

The dataset is an amalgamation of:

  • data that was scraped, which comprised a comprehensive list of movies available on various streaming platforms
  • IMDb dataset

Data visualization of the above can be found on: https://public.tableau.com/profile/ruchi.bhatia#!/vizhome/Moviesavailableonstreamingplatforms/Moviesavailableonstreamingapplications

2.2 Description of the Data

The data is in csv format, and is composed of 16700 rows and 17 columns.

Each row is a unique record of a movie or series and each column is a different variable. In summary:

  • [, 1] -> Index of a row
  • [, 2] -> ID: Unique movie ID
  • [, 3] -> Title
  • [, 4] -> Year: The year in which the movie was produced
  • [, 5] -> Age: Target age group
  • [, 6] -> IMDB: IMDb rating
  • [, 7] -> Rotten Tomatoes: Rotten Tomatoes %
  • [, 8] -> Netflix: Whether the movie is found on Netflix
  • [, 9] -> Hulu: Whether the movie is found on Hulu
  • [, 10] -> Prime Video: Whether the movie is found on Prime Video
  • [, 11] -> Disney+: Whether the movie is found on Disney+
  • [, 12] -> Type: Movie or TV series
  • [, 13] -> Directors
  • [, 14] -> Genres
  • [, 15] -> Country
  • [, 16] -> Language
  • [, 17] -> Runtime

Fifty percent of the items are realized between 2012 and 2020 and the rest between 1902 and 2020.

The distribution of registrations by platform is:

  • Netflix -> 21 %
  • Hulu -> 5 %
  • Prime Video -> 74 %
  • Disney+ -> 3%

2.3 Import in R

We import the data with the “read_csv” function. We select the columns useful for the analysis. Rename the variables.

We clean the variable Rotten_Tomatoes, eliminating the character “%”. In the variable “Age” we delete the character “+”. We multiply the IMDb score by 10 so that it has the same scale as Rotten Tomatoes.

#### Use read_csv() or another function
df <- read_csv("./Data/MoviesOnStreamingPlatforms_updated.csv")
### We delete the columns that are not going to be used
df <- select(df, -c(X1, Type))

### Rename the columns to eliminate spaces
df <- df %>% rename(Rotten_Tomatoes = "Rotten Tomatoes",
                    Prime_Video = "Prime Video",
                    Disney = "Disney+")

### We eliminate the characters "%" and "+" from the Rotten and Age variables. 
df$Rotten_Tomatoes <- as.numeric(gsub("%", "", df$Rotten_Tomatoes))
df$Age <- gsub("\\+", "", df$Age)

### We multiply the IMDb rating by 10 to compare it with the Rotten Tomatoes scale.
df$IMDb <- df$IMDb*10


str(df)
## tibble [16,744 × 15] (S3: tbl_df/tbl/data.frame)
##  $ ID             : num [1:16744] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Title          : chr [1:16744] "Inception" "The Matrix" "Avengers: Infinity War" "Back to the Future" ...
##  $ Year           : num [1:16744] 2010 1999 2018 1985 1966 ...
##  $ Age            : chr [1:16744] "13" "18" "13" "7" ...
##  $ IMDb           : num [1:16744] 88 87 85 85 88 84 85 84 84 83 ...
##  $ Rotten_Tomatoes: num [1:16744] 87 87 84 96 97 97 95 87 95 89 ...
##  $ Netflix        : num [1:16744] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Hulu           : num [1:16744] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Prime_Video    : num [1:16744] 0 0 0 0 1 0 1 0 0 0 ...
##  $ Disney         : num [1:16744] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Directors      : chr [1:16744] "Christopher Nolan" "Lana Wachowski,Lilly Wachowski" "Anthony Russo,Joe Russo" "Robert Zemeckis" ...
##  $ Genres         : chr [1:16744] "Action,Adventure,Sci-Fi,Thriller" "Action,Sci-Fi" "Action,Adventure,Sci-Fi" "Adventure,Comedy,Sci-Fi" ...
##  $ Country        : chr [1:16744] "United States,United Kingdom" "United States" "United States" "United States" ...
##  $ Language       : chr [1:16744] "English,Japanese,French" "English" "English" "English" ...
##  $ Runtime        : num [1:16744] 148 136 149 116 161 117 150 165 115 153 ...
head(df)
## # A tibble: 6 x 15
##      ID Title  Year Age    IMDb Rotten_Tomatoes Netflix  Hulu Prime_Video Disney
##   <dbl> <chr> <dbl> <chr> <dbl>           <dbl>   <dbl> <dbl>       <dbl>  <dbl>
## 1     1 Ince…  2010 13       88              87       1     0           0      0
## 2     2 The …  1999 18       87              87       1     0           0      0
## 3     3 Aven…  2018 13       85              84       1     0           0      0
## 4     4 Back…  1985 7        85              96       1     0           0      0
## 5     5 The …  1966 18       88              97       1     0           1      0
## 6     6 Spid…  2018 7        84              97       1     0           0      0
## # … with 5 more variables: Directors <chr>, Genres <chr>, Country <chr>,
## #   Language <chr>, Runtime <dbl>
summary(df)
##        ID           Title                Year          Age           
##  Min.   :    1   Length:16744       Min.   :1902   Length:16744      
##  1st Qu.: 4187   Class :character   1st Qu.:2000   Class :character  
##  Median : 8372   Mode  :character   Median :2012   Mode  :character  
##  Mean   : 8372                      Mean   :2003                     
##  3rd Qu.:12558                      3rd Qu.:2016                     
##  Max.   :16744                      Max.   :2020                     
##                                                                      
##       IMDb       Rotten_Tomatoes     Netflix            Hulu        
##  Min.   : 0.00   Min.   :  2.00   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:51.00   1st Qu.: 44.00   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :61.00   Median : 71.00   Median :0.0000   Median :0.00000  
##  Mean   :59.03   Mean   : 65.43   Mean   :0.2126   Mean   :0.05393  
##  3rd Qu.:69.00   3rd Qu.: 88.00   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :93.00   Max.   :100.00   Max.   :1.0000   Max.   :1.00000  
##  NA's   :571     NA's   :11586                                      
##   Prime_Video         Disney         Directors            Genres         
##  Min.   :0.0000   Min.   :0.00000   Length:16744       Length:16744      
##  1st Qu.:0.0000   1st Qu.:0.00000   Class :character   Class :character  
##  Median :1.0000   Median :0.00000   Mode  :character   Mode  :character  
##  Mean   :0.7378   Mean   :0.03368                                        
##  3rd Qu.:1.0000   3rd Qu.:0.00000                                        
##  Max.   :1.0000   Max.   :1.00000                                        
##                                                                          
##    Country            Language            Runtime       
##  Length:16744       Length:16744       Min.   :   1.00  
##  Class :character   Class :character   1st Qu.:  82.00  
##  Mode  :character   Mode  :character   Median :  92.00  
##                                        Mean   :  93.41  
##                                        3rd Qu.: 104.00  
##                                        Max.   :1256.00  
##                                        NA's   :592

3 Analysis

3.1 Are there significant differences between IMDB and Rotten Tomatoes scores?

Let’s make a histogram and a box plot to answer the question: Are there significant differences between IMDB and Rotten Tomatoes scores?

We need to select the variables of IMDb and Rotten Tomatoes scores. We will filter out the rows containing NaN. We will transform the data frame to long format.

# We selected the variables "platforms" and the IMDb and Rotten Tomatoes scores.
data_1 <- df %>%
    select(IMDb, Rotten_Tomatoes)
# We eliminate the rows with any Nan value.
data_1 <- data_1 %>% drop_na()
# We transform the wide format
fig_dat_1 <- data_1 %>%
    pivot_longer(IMDb:Rotten_Tomatoes, names_to = "score", values_to = "value")
    # gather(score, value, IMDb:Rotten_Tomatoes, factor_key=TRUE)
records_fig_1 <- dim(fig_dat_1)[1]

As can be seen in the summary table, the number of records has been reduced to 10312.

# str(fig_dat_1)
score_median <- fig_dat_1 %>%
    group_by(score) %>%
    summarise(median = median(value))
IMDb_median <- score_median[1,2]
Rotten_median <- score_median[2,2]
summary(fig_dat_1)
##     score               value       
##  Length:10312       Min.   :  2.00  
##  Class :character   1st Qu.: 55.00  
##  Mode  :character   Median : 66.00  
##                     Mean   : 64.59  
##                     3rd Qu.: 77.00  
##                     Max.   :100.00

The median IMDb score is a 65 and the median Rotten Tomatoes score is a 71.

Is the difference in medians significant? Let’s plot the distribution of both scores.

The histogram shows that the IMDb scores seem to obey a Gaussian distribution, while the Rotten Tomatoes score is more flattened.

The box plot shows more clearly the difference between their medians and how the distribution of IMDb scores is narrower.

As a preview, the p-value of the Wilcoxon test is printed on the graph.

### from library(ggpubr)


fig1_1 <- fig_dat_1 %>% gghistogram(x = "value",
                                   add = "median", rug = TRUE,
                                   color = "score", fill = "score",
                                   palette = c("#00AFBB", "#E7B800"),
                                   title = "Histogram of IMDb and Rotten Tomatoes scores",
                                   legend.title="")+ 
                                   labs(x="score")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
fig1_2 <- fig_dat_1 %>% ggboxplot(x = "score", y = "value",
                                  title = "Distribution of IMDb and Rotten Tomatoes scores",
                                  color = "score",
                                  palette = c("#00AFBB", "#E7B800"),
                                  legend.title="") +
                                  labs(x="Type of score",
                                       y="score")+
                        # Add pairwise comparisons p-value
                        stat_compare_means(comparisons = c("IMDb", 
                                                           "Rotten_Tomatoes")) +
                        # Add global p-value
                        stat_compare_means(label.y = 100)                   
ggarrange(fig1_1, fig1_2)
## Warning: Computation failed in `stat_signif()`:
## not enough 'y' observations

Applying the F-test we can check if the variances are equal. From the output of the test we can state that they are not equal.

var.test(data_1$IMDb, data_1$Rotten_Tomatoes)
## 
##  F test to compare two variances
## 
## data:  data_1$IMDb and data_1$Rotten_Tomatoes
## F = 0.14186, num df = 5155, denom df = 5155, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1343198 0.1498190
## sample estimates:
## ratio of variances 
##          0.1418579
ttest_fig_1 <- t.test(data_1$IMDb, data_1$Rotten_Tomatoes)
ttest_fig_1
## 
##  Welch Two Sample t-test
## 
## data:  data_1$IMDb and data_1$Rotten_Tomatoes
## t = -4.2392, df = 6588.7, p-value = 2.274e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4554389 -0.9025906
## sample estimates:
## mean of x mean of y 
##  63.75233  65.43134
wilcox.test(data_1$IMDb, data_1$Rotten_Tomatoes)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data_1$IMDb and data_1$Rotten_Tomatoes
## W = 11284972, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Let’s apply the two-tailed t-test and Wicox test, assuming unpaired samples and different variances. Since the confidence interval (-2.4554389, -0.9025906) does not contain zero and the p-value (2.2738753^{-5}) is less than 0.05, we can state that the IMDb and Rotten Tomatoes scores are different and have means of: 63.75, 65.43, respectively.

The data seems to indicate that the average score on Rotten Tomatoes is slightly higher than on IMDb.

3.2 Which platform has the highest rated series/movies?

Let’s make a box plot to answer the question: Which platform has the highest rated series/movies?

We are going to select only the IMDb scores as the Rotten Tomatoes score is limited to one third of the records. We will select the variables IMDb, Netflix, Hulu, Prime_Video and Disney. We will filter out the rows containing NaN. We will transform the data frame to long format.

#fig_dat2
### We selected the variables "platforms" and the IMDb scores.
data_2 <- df %>%
    select(IMDb, Netflix, Hulu, Prime_Video, Disney)

data_2 <- data_2 %>% drop_na()

### We convert to long format
fig_dat_2 <- data_2  %>%
  pivot_longer(Netflix:Disney, names_to = "platforms", values_to = "value") %>%
                # gather(platforms, value, Netflix:Disney) %>% 
                filter(value==1) %>%
                select(platforms, IMDb)
### We convert the variable "platforms" into a categorical variable.
fig_dat_2$platforms <- as.factor(fig_dat_2$platforms)
### Platform sorted by median IMDb rating
platforms_median <-fig_dat_2 %>%
                                group_by(platforms) %>%
                                summarise(median = median(IMDb)) %>%
                                arrange(desc(median))

# We reorder the categories (platforms) by median distribution. 
fig_dat_2$platforms <- fct_reorder(fig_dat_2$platforms, fig_dat_2$IMDb, median)

knitr::kable(platforms_median,  caption = "Platform sorted by median IMDb rating")
Platform sorted by median IMDb rating
platforms median
Disney 65
Netflix 63
Hulu 62
Prime_Video 59

The platform with the highest median is Disney with 65 and the platform with the lowest median is Amazon Video with 59. As can be seen in the table above.

The number of films is not homogeneous across the four platforms. Amazon Video has the most movies with more than twice as many as the one with the least (Disney), as can be seen in the following summary table.

# str(fig_dat_2)
summary(fig_dat_2)
##        platforms          IMDb      
##  Prime_Video:11908   Min.   : 0.00  
##  Hulu       :  892   1st Qu.:51.00  
##  Netflix    : 3442   Median :61.00  
##  Disney     :  563   Mean   :59.12  
##                      3rd Qu.:69.00  
##                      Max.   :93.00

We will plot the distribution of IMDb scores across the four platforms. We have chosen a violin representation in which a box plot has been included.

A preliminary Kruskal-Wallis analysis by platform pairs has been performed and included in the graph. As is known, the Kruskal-Wallis test is the nonparametric alternative to the one-way ANOVA test for unpaired data.

From the p-values, we can infer that there are significant differences between all platform pairs.

Therefore we can state that Disney has the highest median scores, followed by Netflix and Hulu. Amazon Video has the lowest median.

### Violin plot of the distribution of scores by platform.
### We establish the comparison pairs.
my_comparisons <- list( c("Disney", "Hulu"), c("Disney", "Netflix"), 
                        c("Disney", "Prime_Video"),  c("Hulu", "Netflix"),
                        c("Hulu", "Prime_Video"),  c("Netflix", "Prime_Video") )
fig_dat_2 %>% ggviolin(x = "platforms", y = "IMDb", fill="platforms",
                  title = "Distribution of IMDb scores by platform",
                  add = "boxplot", add.params = list(fill = "white"),
                  legend.title="",
                  palette = "jco") +
                  # Add pairwise comparisons p-value
                  stat_compare_means(comparisons = my_comparisons, abel = "p.signif") +
                  # Add global p-value
                 stat_compare_means(label.y = 100)+ 
                 labs(y="IMDb score")

To show the differences between the distribution of scores per platform, we have resorted to a density representation of the platform. Vertical lines have also been included to mark the median value of each distribution. In this way the differences between them can be better observed.

fig_dat_2 %>% ggdensity(x = "IMDb",
                        color = "platforms", fill = "platforms",
                       add = "median", rug = TRUE,
                       title = "Density of scores by streaming platform",
                       legend.title="",
                       palette = "jco") + 
                       labs(x="IMDb score")

3.3 Does the length of a movie influence its score?

We are going to perform a scatter graph to answer the question: Does the length of a movie influence its score?

We are going to select only the IMDb and Runtime variables. We will delete the rows containing NaN.

data_3 <- df %>%
    select(IMDb, Runtime) %>% drop_na()
records_data_3 <- dim(data_3)[1]
min_runtime <- min(data_3$Runtime)
max_runtime <- max(data_3$Runtime)
str(data_3)
## tibble [15,819 × 2] (S3: tbl_df/tbl/data.frame)
##  $ IMDb   : num [1:15819] 88 87 85 85 88 84 85 84 84 83 ...
##  $ Runtime: num [1:15819] 148 136 149 116 161 117 150 165 115 153 ...

The records are reduced to 15819. Runtime values range from 1 to 1256. Both values seem unusual for a film, so let’s plot a histogram of the films duration.

summary(data_3)
##       IMDb          Runtime       
##  Min.   : 0.00   Min.   :   1.00  
##  1st Qu.:51.00   1st Qu.:  83.00  
##  Median :61.00   Median :  92.00  
##  Mean   :59.04   Mean   :  93.77  
##  3rd Qu.:69.00   3rd Qu.: 104.00  
##  Max.   :93.00   Max.   :1256.00
fig3_1 <- data_3 %>% gghistogram(x = "Runtime",
                                   add = "median", rug = TRUE,
                                   color = "#00AFBB", fill = "#00AFBB",
                                   #palette = c("#00AFBB", "#E7B800"),
                                   title = "Histogram of runtime movies",
                                   legend.title="")+ 
                                   labs(x= "runtime (min)")
fig3_1 

The distribution has a very long tail on the right. These extreme values may condition the result of the least squares fit.

As can be seen in the following interactive figure the fit line is conditioned by these extreme points. From the distribution of the points we see that there seem to be two groups of points. Since most of the movies are shorter than 300 minutes, we will filter the data, leaving only movies with a duration of less than 300 minutes.

p <- data_3 %>% 
      ggplot(aes(x=Runtime, y=IMDb))+
      geom_point() +
      geom_smooth(method=lm)+ 
      labs(y="IMDb score",
           x= "runtime (min)",
           title="\nIMDb score vs. runtime of movies")
fig_3_0 <- ggplotly(p)
## `geom_smooth()` using formula 'y ~ x'
fig_3_0
data_3_f <- data_3 %>%
    filter(Runtime < 300)
records_data_3_f <- dim(data_3_f)[1]
data_3_lost <- round((1 - records_data_3_f / records_data_3) * 100, 2)
min_runtime_f <- min(data_3_f $Runtime)
max_runtime_f <- max(data_3_f $Runtime)
summary(data_3_f)
##       IMDb          Runtime      
##  Min.   : 0.00   Min.   :  1.00  
##  1st Qu.:51.00   1st Qu.: 83.00  
##  Median :61.00   Median : 92.00  
##  Mean   :59.04   Mean   : 93.57  
##  3rd Qu.:69.00   3rd Qu.:104.00  
##  Max.   :93.00   Max.   :270.00

0.04 % of the data have been eliminated. The records are reduced to 15812. Runtime values range from 1 to 270.

As we can see in the following graph, the distribution still has a bit of a tail on the right, but overall it is much more centered.

fig3_1_b <- data_3_f %>% gghistogram(x = "Runtime",
                                   add = "median", rug = TRUE,
                                   color = "#00AFBB", fill = "#00AFBB",
                                   #palette = c("#00AFBB", "#E7B800"),
                                   title = "Histogram of runtime movies < 300 min",
                                   legend.title="")+ 
                                   labs(x= "runtime (min)")
fig3_1_b

p_f <- data_3_f %>% 
      ggplot(aes(x=Runtime, y=IMDb))+
      geom_point() +
      geom_smooth(method=lm)+ 
      labs(y="IMDb score",
           x= "runtime (min)",
           title="\nIMDb score vs. runtime of movies <300 min.")
fig_3_0_f <- ggplotly(p_f)
## `geom_smooth()` using formula 'y ~ x'
fig_3_0_f
runtime_lm <- lm(IMDb ~ Runtime, data = data_3_f)

runtime_lm_summary <- summary(runtime_lm)
fig_3_intercept <- runtime_lm$coefficients[1]
fig_3_slop <- runtime_lm$coefficients[2]

fig_3_r.squered <- runtime_lm_summary$r.squared
# runtime_lm_summary$coefficients[2,4]
runtime_lm_summary
## 
## Call:
## lm(formula = IMDb ~ Runtime, data = data_3_f)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.282  -8.109   1.778   9.778  35.718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54.134940   0.406439  133.19   <2e-16 ***
## Runtime      0.052449   0.004192   12.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.4 on 15810 degrees of freedom
## Multiple R-squared:  0.009807,   Adjusted R-squared:  0.009744 
## F-statistic: 156.6 on 1 and 15810 DF,  p-value: < 2.2e-16
# runtime_lm_summary$coefficients

The linear adjustment complies with the following equation:

\[\ Score IMDb = 54.1349397 + 0.0524491 * Runtime \] \(\ R^2\) takes the value of 0.0098065, and therefore justifies a 0.98 % variability.

3.4 Of the last 15 years, which one has the best score on IMDb?

To answer the question Of the last 15 years, which one has the best score on IMDb?, we will use a graph of

In the following interactive graph we can find that the year with the best average is 2020 and the worst year is 2007.

data_4 <- df %>%
    select(Year, IMDb) %>% drop_na() %>% filter(Year>2005)
fig_4_data <- data_4 %>% 
                        group_by(Year) %>%
                        summarise(mean = mean(IMDb))
pf_4 <- fig_4_data %>% ggplot(aes(x=Year, y=mean)) +
                        geom_point() +
                        geom_line(color="blue") +
                        ylim(0, max(fig_4_data$mean))+
                        labs(y="Average IMDb rating",
                             title="\nAverage IMDb score vs. years since 2006")

fig_4 <- ggplotly(pf_4)
fig_4

3.5 What is the genre with the best average score on IMDb? And the worst?

We select the IMDb and Genres variables. A movie can belong to several genres. In the variable “Genres” the values are separated by commas. We separate the values by creating new rows with the score and each “genre”.

data_5_0 <- df %>%
    select(IMDb, Genres)
data_5_0
## # A tibble: 16,744 x 2
##     IMDb Genres                                  
##    <dbl> <chr>                                   
##  1    88 Action,Adventure,Sci-Fi,Thriller        
##  2    87 Action,Sci-Fi                           
##  3    85 Action,Adventure,Sci-Fi                 
##  4    85 Adventure,Comedy,Sci-Fi                 
##  5    88 Western                                 
##  6    84 Animation,Action,Adventure,Family,Sci-Fi
##  7    85 Biography,Drama,Music,War               
##  8    84 Drama,Western                           
##  9    84 Action,Adventure                        
## 10    83 Adventure,Drama,War                     
## # … with 16,734 more rows
data_5 <- df %>%
    select(IMDb, Genres) %>%
    separate_rows(Genres, sep=",", convert=TRUE) %>% drop_na()
#unique(data_5$Genres)

data_5$Genres <- as.factor(data_5$Genres)
str(data_5)
## tibble [38,376 × 2] (S3: tbl_df/tbl/data.frame)
##  $ IMDb  : num [1:38376] 88 88 88 88 87 87 85 85 85 85 ...
##  $ Genres: Factor w/ 27 levels "Action","Adventure",..: 1 2 21 25 1 21 1 2 21 2 ...

From the following graph showing the histograms of IMDb scores by genre, ordered from lowest to highest, we can infer that the genre with the worst scores is horror and the genre with the best scores is news.

We can also observe that most of the films belong to the drama genre, followed by comedy.

p <- data_5 %>%
  mutate(Genres = fct_reorder(Genres, IMDb)) %>%
  ggplot( aes(x=IMDb, color=Genres, fill=Genres)) +
    geom_histogram(alpha=0.6, binwidth = 5) +
    scale_fill_viridis(discrete=TRUE) +
    scale_color_viridis(discrete=TRUE) +
    theme_ipsum() +
    theme(
      legend.position="none",
      panel.spacing = unit(0.5, "lines"),
      strip.text.x = element_text(size = 8)
    ) +
    xlab("IMDb score") +
    ylab("Number of films") +
    labs(title = "Histograms of movie numbers vs. IMDb score by genre") +
    theme(axis.text.x = element_text(size=6),
          axis.text.y = element_text(size=6),
          axis.title.y = element_text(hjust=0.5),
          axis.title.x = element_text(hjust=0.5)) +
    facet_wrap(~Genres)
mean_data_5 <- data_5 %>%
  group_by(Genres) %>%
  summarise(mean_IMDb= round(mean(IMDb), 2))


knitr::kable(arrange(mean_data_5, mean_IMDb))
Genres mean_IMDb
Horror 47.24
Sci-Fi 49.56
Thriller 53.52
Action 54.50
Reality-TV 56.00
Adventure 56.15
Mystery 56.46
Fantasy 56.85
Talk-Show 58.47
Crime 58.71
Western 59.01
Comedy 59.06
Family 60.23
Romance 60.45
Drama 60.75
Game-Show 61.33
Musical 62.37
War 62.50
Animation 62.52
Film-Noir 63.33
Sport 64.18
Short 64.41
Music 66.00
History 67.32
Biography 68.09
Documentary 69.43
News 71.10
p