IMDB rating

Homework 1

Sara Bračun Duhovnik

About the data:

I analysed data on the Top 1000 movies by IMDb rating. The dataset comes from Kaggle.com (source: https://www.kaggle.com/datasets/milanvaddoriya/imdb-movie-rating?select=imdb.csv). I filtered and modified the data beforehand, because I am interested only in movies from the Comedy and Action genres.

library(readxl)
# Import the pre-filtered spreadsheet (Comedy and Action movies only)
mydata <- read_excel("~/Desktop/IMDB_rating.xlsx")
# Assign short, consistent column names
colnames(mydata) <- c("ID", "Name", "Runtime", "Metascore", "GrossEarning", "Rating", "Genre")
head(mydata, 10)

## # A tibble: 10 × 7
##       ID Name              Runtime Metascore GrossEarning Rating Genre
##    <dbl> <chr>               <dbl> <chr>     <chr>        <chr>  <chr>
##  1     1 The Dark Knight       152 84.0      534.86       9.0    Acti…
##  2     2 The Lord of the …     201 94.0      377.85       9.0    Acti…
##  3     3 Inception             148 74.0      292.58       8.8    Acti…
##  4     4 The Lord of the …     179 87.0      342.55       8.8    Acti…
##  5     5 The Lord of the …     178 92.0      315.54       8.8    Acti…
##  6     6 The Matrix            136 73.0      171.48       8.7    Acti…
##  7     7 Star Wars: Episo…     124 82.0      290.48       8.7    Acti…
##  8     8 Life Is Beautiful     116 59.0      57.60        8.6    Come…
##  9     9 Terminator 2: Ju…     137 75.0      204.84       8.6    Acti…
## 10    10 Star Wars: Episo…     121 90.0      322.74       8.6    Acti…

Description:

Unit of observation: one movie
Sample size: 297 observations

Definitions of all variables:

ID: identification number of the movie in the dataset
Name: title of the movie
Runtime: length of the movie in minutes
Metascore: average rating of the movie based on reviews by the world’s most respected critics, on a scale from 0 to 100
GrossEarning: gross earnings of the movie in million $
Rating: average rating of the movie based on the individual votes of registered IMDb users, on a scale from 0 to 10
Genre: type or category of the movie's content or style

Data manipulation:

Before analysing the data, I removed all observations with missing values and converted the categorical variable Genre to a factor. From the remaining Comedy and Action movies, I then took a simple random sample of 80 units to keep the analysis simple.

library(tidyr)
mydata1 <- drop_na(mydata)

mydata1$Genre <- factor(mydata1$Genre,
        levels = c("Action", "Comedy"),
        labels = c("Action", "Comedy"))
head(mydata1)

## # A tibble: 6 × 7
##      ID Name               Runtime Metascore GrossEarning Rating Genre
##   <dbl> <chr>                <dbl> <chr>     <chr>        <chr>  <fct>
## 1     1 The Dark Knight        152 84.0      534.86       9.0    Acti…
## 2     2 The Lord of the R…     201 94.0      377.85       9.0    Acti…
## 3     3 Inception              148 74.0      292.58       8.8    Acti…
## 4     4 The Lord of the R…     179 87.0      342.55       8.8    Acti…
## 5     5 The Lord of the R…     178 92.0      315.54       8.8    Acti…
## 6     6 The Matrix             136 73.0      171.48       8.7    Acti…

set.seed(1)
mydata2 <- mydata1[sample(nrow(mydata1), 80), ]
library(psych)
describe(mydata2[ , c(-1, -2, -7)])

##               vars  n   mean    sd median trimmed   mad min max range
## Runtime          1 80 120.64 20.42  120.0  119.80 19.27  80 178    98
## Metascore*       2 80  18.69  8.64   19.0   18.75  7.41   1  36    35
## GrossEarning*    3 80  40.50 23.24   40.5   40.50 29.65   1  80    79
## Rating*          4 80   4.11  2.51    4.0    3.88  2.97   1  10     9
##                skew kurtosis   se
## Runtime        0.39    -0.08 2.28
## Metascore*    -0.09    -0.60 0.97
## GrossEarning*  0.00    -1.25 2.60
## Rating*        0.75    -0.50 0.28

Interpretation:

In my sample of 80 observations, the average runtime is 120.64 minutes. Half of the movies run 120 minutes or less and half run longer. The standard deviation is 20.42 minutes, and runtime ranges from 80 to 178 minutes. The distribution of runtime is slightly skewed to the right (skewness 0.39) and marginally flatter than a normal distribution (kurtosis -0.08).

The rows marked with an asterisk (Metascore*, GrossEarning* and Rating*) cannot be interpreted in the variables' own units. At this stage the three columns were still stored as text, so describe() converted them to integer category codes and summarised the codes. For GrossEarning*, for example, the minimum of 1, maximum of 80, mean and median of 40.5 and skewness of 0 describe the codes 1 to 80 assigned to 80 distinct text values, not earnings in million $. Likewise, the Rating* range from 1 to 10 reflects ten distinct rating values in the sample rather than the 0 to 10 rating scale, and the Metascore* mean of 18.69 is a mean of codes, not of critic scores.

The three variables are converted to numeric before the tests below, where the group means of Metascore are reported on the original 0 to 100 scale.
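The effect of summarising text columns can be reproduced on a small example. The sketch below uses a made-up data frame (the object `toy` and its values are hypothetical, not taken from the dataset) and base R only. It shows that the mean of the category codes differs from the mean of the values, and that converting the columns with as.numeric() first gives summaries in the original units.

```r
# Hypothetical toy data: numbers stored as text, as in the imported file
toy <- data.frame(
  Runtime   = c(152, 201, 148, 116),
  Metascore = c("84.0", "94.0", "74.0", "59.0"),
  Rating    = c("9.0", "9.0", "8.8", "8.6"),
  stringsAsFactors = FALSE
)

# Summarising the integer codes of a text column describes ranks, not scores
codes <- as.integer(factor(toy$Metascore))
mean(codes)

# Convert the text columns to numeric first, then summarise
text_cols <- c("Metascore", "Rating")
toy[text_cols] <- lapply(toy[text_cols], as.numeric)

summary_table <- sapply(toy, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x))
})
print(round(summary_table, 2))
```

Applied to the real data, the same idea would mean converting Metascore, GrossEarning and Rating before calling describe().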

Research question: Is there a difference in Metascore ratings between Action and Comedy movies?

- H0: The average Metascore ratings for movies in comedy and action genre are the same.

- H1: The average Metascore ratings for movies in comedy and action genre are not the same.

My research question concerns the difference between two population arithmetic means. Because I compare the Metascore ratings of Comedy and Action movies, and each movie is measured only once, I use the independent samples t-test. This is a parametric test. Parametric tests have greater statistical power than non-parametric tests, but can be used only if certain assumptions are met.

Assumptions:

The variable is numeric: Metascore rating is a numeric variable
The distribution of the variable is normal in both populations (this will be checked with the Shapiro-Wilk normality test)
The data need to come from two independent populations: the Comedy and Action groups contain different movies, so they are independent
The variable has the same variance in both populations (this is often violated, so I will use the Welch correction)
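To make the Welch correction concrete, the sketch below computes the Welch t statistic, its degrees of freedom and the two-sided p-value by hand and compares them with t.test(). The two vectors are simulated, hypothetical scores (group sizes, means and standard deviations are arbitrary assumptions), so the numbers do not relate to the movie data.

```r
set.seed(42)
# Hypothetical scores for two independent groups of unequal size
group_a <- rnorm(30, mean = 72, sd = 11)
group_b <- rnorm(25, mean = 72, sd = 10)

n1 <- length(group_a); n2 <- length(group_b)
v1 <- var(group_a);    v2 <- var(group_b)

# Standard error without assuming equal variances
se <- sqrt(v1 / n1 + v2 / n2)
t_manual <- (mean(group_a) - mean(group_b)) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Built-in test with var.equal = FALSE for comparison
res <- t.test(group_a, group_b, var.equal = FALSE)
print(c(t = t_manual, df = df_manual, p = p_manual))
```

The degrees of freedom are generally not an integer, which is why the output of a Welch test reports a value such as 72.812.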

mydata2$Runtime <- as.numeric(mydata2$Runtime)
mydata2$GrossEarning <- as.numeric(mydata2$GrossEarning)
mydata2$Metascore <- as.numeric(mydata2$Metascore)
mydata2$Rating <- as.numeric(mydata2$Rating)

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata2, aes(x = Metascore, fill = Genre)) +
  geom_histogram(position = "dodge", bins = 10) +
  # Name each color explicitly so the legend matches the factor levels
  scale_fill_manual(values = c("Action" = "cornflowerblue", "Comedy" = "pink")) +
  labs(title = "Distribution of Metascore for movies by genre",
       x = "Metascore", y = "Frequency", fill = "Genre") +
  theme_classic()

Interpretation:

This histogram shows the frequency of Metascore ratings separately for the Action genre (blue) and the Comedy genre (pink). The shapes roughly resemble a normal distribution. A histogram alone cannot establish normality or a difference between the genres, so further testing is needed.

Test of normality:

I used the Shapiro-Wilk normality test. Before running the test, I split the data into the Comedy and Action genres and stated the hypotheses, which are tested separately for each genre.

H0: The Metascore ratings for Comedy and Action genre are normally distributed.
H1: The Metascore ratings for Comedy and Action genre are not normally distributed.

Significance level: 5% (confidence level: 95%)

comedy_data <- mydata2[mydata2$Genre == "Comedy", ]
action_data <- mydata2[mydata2$Genre == "Action", ]

shapiro.test(comedy_data$Metascore)

## 
##  Shapiro-Wilk normality test
## 
## data:  comedy_data$Metascore
## W = 0.96462, p-value = 0.3474

shapiro.test(action_data$Metascore)

## 
##  Shapiro-Wilk normality test
## 
## data:  action_data$Metascore
## W = 0.97398, p-value = 0.3724

Interpretation:

Based on the results of the Shapiro-Wilk normality test, I cannot reject H0 for either genre, because both p-values (0.3474 for Comedy and 0.3724 for Action) are higher than 0.05. The data show no evidence against the normality assumption, so I treat Metascore as approximately normally distributed in both genres and proceed with the t-test.
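The two calls to shapiro.test() can also be written as one step that runs the test per genre and collects the p-values. This is a minimal sketch on simulated, hypothetical data (the data frame `toy` and its values are assumptions), using only base R.

```r
set.seed(1)
# Hypothetical data: one numeric score and a two-level grouping factor
toy <- data.frame(
  Genre     = factor(rep(c("Action", "Comedy"), times = c(30, 25))),
  Metascore = c(rnorm(30, mean = 72, sd = 11), rnorm(25, mean = 72, sd = 10))
)

# Split the scores by genre and run the Shapiro-Wilk test on each part
groups   <- split(toy$Metascore, toy$Genre)
p_values <- sapply(groups, function(x) shapiro.test(x)$p.value)
print(round(p_values, 4))

# Genres for which H0 (normality) is not rejected at the 5% level
names(p_values)[p_values > 0.05]
```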

T-test:

Recalling the hypotheses:

H0: The average Metascores for movies in comedy and action genre are the same.
H1: The average Metascores for movies in comedy and action genre are not the same.

Significance level: 5% (confidence level: 95%)

t.test(mydata2$Metascore ~ mydata2$Genre,
       paired = FALSE,
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydata2$Metascore by mydata2$Genre
## t = -0.11536, df = 72.812, p-value = 0.9085
## alternative hypothesis: true difference in means between group Action and group Comedy is not equal to 0
## 95 percent confidence interval:
##  -5.090588  4.533528
## sample estimates:
## mean in group Action mean in group Comedy 
##             72.08511             72.36364

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(mydata2$Metascore ~ mydata2$Genre,
                     pooled_sd = FALSE)

## Cohen's d |        95% CI
## -------------------------
## -0.03     | [-0.47, 0.42]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.03, rules = "sawilowsky2009")

## [1] "tiny"
## (Rules: sawilowsky2009)

Interpretation:

Based on the sample data, I cannot reject H0, because the p-value (0.9085) is well above 0.05. The p-value is this high because the sample means are very close together (72.085 for Action and 72.364 for Comedy), and the 95% confidence interval for the difference (-5.09 to 4.53) includes 0. The effect size is also tiny (d = -0.03), so even if a difference existed, it would be of little practical importance. I was therefore unable to demonstrate that the average Metascore rating differs between the Comedy and Action genres.
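For reference, Cohen's d with an un-pooled standard deviation can be computed by hand. The sketch below assumes that the un-pooled SD is the square root of the average of the two group variances, which is my understanding of what pooled_sd = FALSE does in effectsize. The two vectors are hypothetical values chosen for easy arithmetic.

```r
# Hypothetical scores for two groups (not taken from the movie data)
group_a <- c(70, 72, 74, 76, 78)
group_b <- c(71, 73, 75, 77, 79)

# Un-pooled SD: square root of the mean of the two variances (assumed definition)
sd_unpooled <- sqrt((var(group_a) + var(group_b)) / 2)

# Cohen's d: difference in means expressed in SD units
d <- (mean(group_a) - mean(group_b)) / sd_unpooled
print(round(d, 2))
```

The sign of d only reflects the order of the groups, which is why the absolute value is passed to interpret_cohens_d().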

Robustness check:

As a robustness check, I also performed the corresponding non-parametric test for independent samples, the Wilcoxon rank sum test, which does not require the normality assumption.

H0: The distribution location of Metascore rating for movies in the Comedy genre is the same as for movies in the Action genre.
H1: The distribution location of Metascore rating for movies in the Comedy genre is not the same as for movies in the Action genre.

Significance level: 5% (confidence level: 95%)

wilcox.test(mydata2$Metascore ~ mydata2$Genre,
            paired = FALSE,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydata2$Metascore by mydata2$Genre
## W = 755, p-value = 0.841
## alternative hypothesis: true location shift is not equal to 0

effectsize(wilcox.test(mydata2$Metascore ~ mydata2$Genre,
                       paired = FALSE,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## -0.03             | [-0.28, 0.23]

interpret_rank_biserial(0.03)

## [1] "tiny"
## (Rules: funder2019)

Interpretation:

The Wilcoxon rank sum test gives a p-value of 0.841, so I cannot reject H0 (p-value > 0.05). Based on the sample data, I found no evidence that the distribution location of Metascore rating differs between movies in the Comedy and Action genres. The effect size is tiny (rank biserial r = -0.03).
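The rank biserial correlation is directly related to the W statistic: with group sizes n1 and n2 and no ties, r = 2W / (n1 * n2) - 1, where W counts the pairs in which the first group's value exceeds the second group's value. The sketch below checks this on two small hypothetical vectors.

```r
# Hypothetical scores without ties (not taken from the movie data)
x <- c(60, 65, 70, 80)
y <- c(62, 68, 75, 85, 90)

# W from the test: number of (x, y) pairs with x greater than y
w_test <- unname(wilcox.test(x, y, exact = FALSE, correct = FALSE)$statistic)

# The same count obtained directly from all pairwise comparisons
w_count <- sum(outer(x, y, ">"))

# Rank biserial correlation derived from W
r_rb <- 2 * w_test / (length(x) * length(y)) - 1
print(c(W = w_test, W_count = w_count, r = r_rb))
```

A value of r close to 0 means that values from either group are about equally likely to be the larger one in a random pair.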

Final conclusion:

In my analysis, I examined whether the average Metascore ratings of movies in the Comedy and Action genres differ. I found no evidence of a difference in the selected sample, because I could not reject H0 in either the parametric Welch t-test (p = 0.9085) or the non-parametric Wilcoxon rank sum test (p = 0.841). In both tests the p-value was higher than 0.05. The tiny effect sizes support this finding, as the difference between the Metascore ratings of the two genres in the sample is extremely small. Therefore, genre does not appear to have a statistically significant effect on the average Metascore rating in this sample of Top 1000 movies by IMDb.

2024-01-08
