I decided to analyse data on the Top 1000 movies by IMDb rating. The dataset comes from Kaggle.com (source: https://www.kaggle.com/datasets/milanvaddoriya/imdb-movie-rating?select=imdb.csv). I filtered and modified the data beforehand, as I am interested only in movies from the Comedy and Action genres.
data(package = .packages(all.available = TRUE))
library(readxl)
mydata <- read_excel("~/Desktop/IMDB_rating.xlsx")
colnames(mydata) <- c("ID", "Name", "Runtime", "Metascore", "GrossEarning", "Rating", "Genre")
head(mydata, 10)
## # A tibble: 10 × 7
## ID Name Runtime Metascore GrossEarning Rating Genre
## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 1 The Dark Knight 152 84.0 534.86 9.0 Acti…
## 2 2 The Lord of the … 201 94.0 377.85 9.0 Acti…
## 3 3 Inception 148 74.0 292.58 8.8 Acti…
## 4 4 The Lord of the … 179 87.0 342.55 8.8 Acti…
## 5 5 The Lord of the … 178 92.0 315.54 8.8 Acti…
## 6 6 The Matrix 136 73.0 171.48 8.7 Acti…
## 7 7 Star Wars: Episo… 124 82.0 290.48 8.7 Acti…
## 8 8 Life Is Beautiful 116 59.0 57.60 8.6 Come…
## 9 9 Terminator 2: Ju… 137 75.0 204.84 8.6 Acti…
## 10 10 Star Wars: Episo… 121 90.0 322.74 8.6 Acti…
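Definitions of all variables:
ID: identifier of the movie
Name: title of the movie
Runtime: length of the movie in minutes
Metascore: Metascore rating of the movie
GrossEarning: gross earnings of the movie in million $
Rating: user rating of the movie
Genre: genre of the movie (Action or Comedy)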
Before analysing the data, I removed all observations that include missing values. I then converted the categorical Genre variable to a factor. From the remaining Comedy and Action movies I took a random sample of 80 units to keep the analysis simple.
library(tidyr)
mydata1 <- drop_na(mydata)
mydata1$Genre <- factor(mydata1$Genre,
                        levels = c("Action", "Comedy"),
                        labels = c("Action", "Comedy"))
head(mydata1)
## # A tibble: 6 × 7
## ID Name Runtime Metascore GrossEarning Rating Genre
## <dbl> <chr> <dbl> <chr> <chr> <chr> <fct>
## 1 1 The Dark Knight 152 84.0 534.86 9.0 Acti…
## 2 2 The Lord of the R… 201 94.0 377.85 9.0 Acti…
## 3 3 Inception 148 74.0 292.58 8.8 Acti…
## 4 4 The Lord of the R… 179 87.0 342.55 8.8 Acti…
## 5 5 The Lord of the R… 178 92.0 315.54 8.8 Acti…
## 6 6 The Matrix 136 73.0 171.48 8.7 Acti…
set.seed(1)
mydata2 <- mydata1[sample(nrow(mydata1), 80), ]
library(psych)
describe(mydata2[ , c(-1, -2, -7)])
## vars n mean sd median trimmed mad min max range
## Runtime 1 80 120.64 20.42 120.0 119.80 19.27 80 178 98
## Metascore* 2 80 18.69 8.64 19.0 18.75 7.41 1 36 35
## GrossEarning* 3 80 40.50 23.24 40.5 40.50 29.65 1 80 79
## Rating* 4 80 4.11 2.51 4.0 3.88 2.97 1 10 9
## skew kurtosis se
## Runtime 0.39 -0.08 2.28
## Metascore* -0.09 -0.60 0.97
## GrossEarning* 0.00 -1.25 2.60
## Rating* 0.75 -0.50 0.28
In my selected sample of 80 observations, the average runtime is 120.64 minutes and the median is 120 minutes, so half of the sampled movies are shorter than 120 minutes and half are longer. The standard deviation of runtime is 20.42 minutes, and runtime ranges between 80 and 178 minutes. The distribution of runtime is slightly skewed to the right (skewness 0.39) and marginally flatter than a normal distribution (kurtosis -0.08).
The statistics for Metascore, GrossEarning and Rating cannot be interpreted in their original units. The asterisk that describe() adds to these names shows that the columns were stored as text (<chr> in the output of head()), so they were converted to integer codes before being summarised. The reported values therefore describe those codes rather than the variables themselves. GrossEarning*, for example, runs from 1 to 80 with a mean of 40.5 and a skewness of 0, which is simply what 80 distinct codes produce, and the Metascore* mean of 18.69 is far from the group means of about 72 reported by the t-test below. Metascore and Rating are two different movie rating scales. Meaningful descriptive statistics for them and for GrossEarning require converting the columns to numeric first, as is done before the assumption checks.
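The asterisks in the describe() output mark columns that were not numeric when they were summarised. The following is a minimal sketch of converting such columns before describing them. The data frame `toy` and the helper name `num_cols` are made up for illustration; the four rows are copied from the head() output printed earlier, and base R summaries stand in for psych::describe().

```r
# Illustrative data frame: four rows copied from the head() output above,
# with the score columns stored as text, as read_excel() returned them
toy <- data.frame(
  Runtime      = c(152, 201, 148, 116),
  Metascore    = c("84.0", "94.0", "74.0", "59.0"),
  GrossEarning = c("534.86", "377.85", "292.58", "57.60"),
  Rating       = c("9.0", "9.0", "8.8", "8.6"),
  stringsAsFactors = FALSE
)

# Convert the text columns that hold numbers to numeric
num_cols <- c("Metascore", "GrossEarning", "Rating")
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

# Every column is numeric now, so summaries refer to the actual values
col_types <- sapply(toy, class)
col_means <- colMeans(toy)
print(col_types)
print(round(col_means, 2))
```

Applied to the sample, the same conversion would have to run before describe() for the Metascore, GrossEarning and Rating rows to be interpretable.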
My research question concerns the difference between two population arithmetic means. As I compare the Metascore rating of Comedy and Action movies, and each movie is measured only once, I use the independent samples t-test. I chose a parametric test because parametric tests have greater statistical power than non-parametric tests, but they can be used only if certain assumptions are met.
Assumptions:
mydata2$Runtime <- as.numeric(mydata2$Runtime)
mydata2$GrossEarning <- as.numeric(mydata2$GrossEarning)
mydata2$Metascore <- as.numeric(mydata2$Metascore)
mydata2$Rating <- as.numeric(mydata2$Rating)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata2, aes(x = Metascore, fill = Genre)) +
  geom_histogram(position = "dodge", bins = 10) +
  # Named values tie each colour to its genre, so the legend cannot be mislabelled
  scale_fill_manual(values = c("Comedy" = "cornflowerblue", "Action" = "pink")) +
  labs(title = "Distribution of Metascore for movies by genre",
       x = "Metascore", y = "Frequency", fill = "Genre") +
  theme_classic()
This histogram shows the frequency of Metascore ratings separately for the Comedy and Action genres. Both distributions roughly resemble a normal distribution. The selected Comedy movies, coloured in blue, appear to have ratings equal to or higher than those of Action movies, coloured in pink. A formal test is needed to assess normality.
I decided to use the Shapiro-Wilk normality test. Before running the test, I split the data into the Comedy and Action genres and stated the hypotheses.
H0: Metascore is normally distributed in the given genre.
H1: Metascore is not normally distributed in the given genre.
Confidence level: 95% (significance level α = 0.05)
comedy_data <- mydata2[mydata2$Genre == "Comedy", ]
action_data <- mydata2[mydata2$Genre == "Action", ]
shapiro.test(comedy_data$Metascore)
##
## Shapiro-Wilk normality test
##
## data: comedy_data$Metascore
## W = 0.96462, p-value = 0.3474
shapiro.test(action_data$Metascore)
##
## Shapiro-Wilk normality test
##
## data: action_data$Metascore
## W = 0.97398, p-value = 0.3724
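Based on the Shapiro-Wilk normality test I cannot reject H0, because the p-values for both genres (0.3474 for Comedy and 0.3724 for Action) are higher than 0.05. The test gives no evidence against normality, so I treat the assumption that Metascore is normally distributed in both genres as met and proceed with the t-test.
Recalling the hypotheses:
H0: the mean Metascore of Action movies equals the mean Metascore of Comedy movies.
H1: the mean Metascore of Action movies differs from the mean Metascore of Comedy movies.
Confidence level: 95% (significance level α = 0.05)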
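The per-genre normality check can also be run in one step instead of filtering each genre by hand. This is a minimal sketch on simulated scores: the data frame `sim`, the group sizes, the means, the standard deviations and the seed are all made up for illustration and are not taken from the dataset.

```r
set.seed(42)  # arbitrary seed so the simulated example is reproducible

# Simulated (hypothetical) scores; sizes and parameters are invented
sim <- data.frame(
  Genre     = factor(rep(c("Action", "Comedy"), times = c(40, 30))),
  Metascore = c(rnorm(40, mean = 72, sd = 11), rnorm(30, mean = 72, sd = 10))
)

# Split the scores by genre and run shapiro.test() on each part,
# keeping the W statistic and the p-value
sw <- sapply(split(sim$Metascore, sim$Genre), function(x) {
  res <- shapiro.test(x)
  c(W = unname(res$statistic), p.value = res$p.value)
})
print(round(sw, 4))
```

The result is a small matrix with one column per genre, which makes it easy to compare each p-value with 0.05.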
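# Independent (unpaired) samples are the default, so `paired` is not passed;
# newer R versions reject `paired` in the formula interface
t.test(mydata2$Metascore ~ mydata2$Genre,
       var.equal = FALSE,
       alternative = "two.sided")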
##
## Welch Two Sample t-test
##
## data: mydata2$Metascore by mydata2$Genre
## t = -0.11536, df = 72.812, p-value = 0.9085
## alternative hypothesis: true difference in means between group Action and group Comedy is not equal to 0
## 95 percent confidence interval:
## -5.090588 4.533528
## sample estimates:
## mean in group Action mean in group Comedy
## 72.08511 72.36364
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata2$Metascore ~ mydata2$Genre,
                     pooled_sd = FALSE)
## Cohen's d | 95% CI
## -------------------------
## -0.03 | [-0.47, 0.42]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.03, rules = "sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)
Based on the sample data, I cannot reject H0, because the p-value is high (0.9085 > 0.05). This is because the sample means are very close together (72.085 for Action and 72.364 for Comedy). In addition, the effect size is tiny (d = -0.03), so the difference in Metascore ratings between Comedy and Action movies is negligible in practical terms as well. I was therefore unable to demonstrate that the average Metascore rating differs between the Comedy and Action genres.
As a robustness check, I also performed the corresponding non-parametric test for independent samples, which does not require the normality assumption to be met. I used the Wilcoxon rank sum test.
H0: the distribution location of Metascore is the same in the Comedy and Action genres.
H1: the distribution location of Metascore differs between the Comedy and Action genres.
Confidence level: 95% (significance level α = 0.05)
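To make the reported t, df and Cohen's d less of a black box, the sketch below recomputes them by hand and compares the first two with t.test(). The vectors `action` and `comedy` are simulated with made-up parameters, not taken from the dataset. Taking the un-pooled standard deviation as the square root of the average of the two group variances is my assumption about what pooled_sd = FALSE corresponds to.

```r
set.seed(7)  # arbitrary seed for a reproducible illustration

# Simulated (hypothetical) Metascore values for the two genres
action <- rnorm(40, mean = 72, sd = 12)
comedy <- rnorm(30, mean = 73, sd = 9)

m1 <- mean(action);   m2 <- mean(comedy)
v1 <- var(action);    v2 <- var(comedy)
n1 <- length(action); n2 <- length(comedy)

# Welch t statistic: difference in means over the unpooled standard error
se       <- sqrt(v1 / n1 + v2 / n2)
t_manual <- (m1 - m2) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual <- (v1 / n1 + v2 / n2)^2 /
  ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))

# Cohen's d with an un-pooled SD (assumed: root of the mean of both variances)
d_manual <- (m1 - m2) / sqrt((v1 + v2) / 2)

# Reference values from the built-in Welch test
welch <- t.test(action, comedy, var.equal = FALSE)
print(c(t = t_manual, df = df_manual, d = d_manual))
print(c(t = unname(welch$statistic), df = unname(welch$parameter)))
```

The fractional degrees of freedom in the Welch output (72.812 in the test above) come from this approximation, which does not assume equal variances in the two genres.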
# `paired` is omitted: unpaired is the default, and newer R versions
# reject `paired` in the formula interface
wilcox.test(mydata2$Metascore ~ mydata2$Genre,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata2$Metascore by mydata2$Genre
## W = 755, p-value = 0.841
## alternative hypothesis: true location shift is not equal to 0
# Same test as above, passed to effectsize() to obtain the rank-biserial correlation
effectsize(wilcox.test(mydata2$Metascore ~ mydata2$Genre,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.03 | [-0.28, 0.23]
interpret_rank_biserial(0.03)
## [1] "tiny"
## (Rules: funder2019)
The Wilcoxon rank sum test gave a p-value of 0.841, so I cannot reject H0 (p-value > 0.05). Based on the sample data, I found no evidence that the Metascore rating differs between Comedy and Action movies. The difference in the location of the Metascore distributions of the two genres is tiny (r = -0.03).
In my analysis, I examined whether the average Metascore ratings of Comedy and Action movies are the same. I found no evidence of a difference: I could not reject H0 in either the parametric t-test or the non-parametric Wilcoxon rank sum test, as both p-values were higher than 0.05. The tiny effect sizes point the same way, as the difference between Metascore ratings in the selected sample is extremely small. Therefore, genre does not appear to have a statistically significant effect on the average Metascore rating in the selected sample of the Top 1000 movies by IMDb.
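The rank-biserial correlation can be related directly to the W statistic. The sketch below uses simulated vectors `action` and `comedy` with made-up parameters. It computes r as 2W / (n1 * n2) - 1 and checks it against the share of (Action, Comedy) pairs in which the Action score is higher, minus the share in which it is lower. That this matches the sign convention of effectsize is my understanding, not something the output above confirms.

```r
set.seed(11)  # arbitrary seed for a reproducible illustration

# Simulated (hypothetical) Metascore values for the two genres
action <- rnorm(40, mean = 72, sd = 12)
comedy <- rnorm(30, mean = 73, sd = 9)
n1 <- length(action)
n2 <- length(comedy)

# W counts the (action, comedy) pairs in which the action value is larger
w <- wilcox.test(action, comedy, correct = FALSE, exact = FALSE)
W <- unname(w$statistic)

# Rank-biserial correlation derived from W
r_rb <- 2 * W / (n1 * n2) - 1

# The same quantity from all pairwise differences
diffs   <- outer(action, comedy, "-")
r_pairs <- mean(diffs > 0) - mean(diffs < 0)

print(c(W = W, r_from_W = r_rb, r_from_pairs = r_pairs))
```

A value near 0 means that an Action movie scores higher than a Comedy movie about as often as it scores lower, which is how the reported r of -0.03 can be read.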