Does the movie’s budget (column: budget_x) significantly impact its revenue (column: revenue)?
Does the genre of a movie (column: genre) significantly impact its IMDb score (column: score)?
Null Hypothesis (H0): There is no significant difference in movie revenue between different budget levels.
Alternative Hypothesis (H1): There is a significant difference in movie revenue between different budget levels.
Null Hypothesis (H0): There is no significant difference in IMDb scores between different movie genres.
Alternative Hypothesis (H1): There is a significant difference in IMDb scores between different movie genres.
# Perform linear regression
lm_result <- lm(revenue ~ budget_x, data = data)
# Print the summary of the linear regression
summary(lm_result)
##
## Call:
## lm(formula = revenue ~ budget_x, data = data)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -1.155e+09 -9.555e+07 -4.019e+07  8.152e+07  2.106e+09
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.036e+07  3.081e+06   13.10   <2e-16 ***
## budget_x    3.280e+00  3.565e-02   91.99   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 205300000 on 10176 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.454
## F-statistic: 8463 on 1 and 10176 DF, p-value: < 2.2e-16
The p-value for the budget_x coefficient (< 2e-16) is far below the chosen significance level (α = 0.05), so we reject the null hypothesis (H0) and conclude that budget is significantly associated with revenue. The coefficient estimate for budget_x is 3.280, indicating that, on average, each additional dollar of budget is associated with about $3.28 of additional revenue. The model explains roughly 45% of the variance in revenue (R-squared = 0.454).
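To make the coefficient interpretation concrete, the following sketch plugs the rounded estimates from the summary above into the fitted line. The budget values are hypothetical and chosen only for illustration, and the interval uses a normal approximation rather than the output of confint().

```r
# Coefficients as reported in the regression summary (rounded values).
intercept <- 4.036e7
slope     <- 3.280
slope_se  <- 3.565e-2

# Hypothetical budgets in dollars, for illustration only.
budgets <- c(1e7, 1e8, 2e8)

# Predicted revenue from the fitted line: intercept + slope * budget.
predicted_revenue <- intercept + slope * budgets
print(data.frame(budget = budgets, predicted_revenue = predicted_revenue))

# Approximate 95% confidence interval for the slope (normal approximation).
slope_ci <- slope + c(-1, 1) * qnorm(0.975) * slope_se
cat("Approximate 95% CI for the slope:", round(slope_ci, 3), "\n")
```

With the fitted model object available, predict(lm_result, newdata = ...) and confint(lm_result) would give the same quantities without rounding.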
# Perform ANOVA test
anova_result <- aov(score ~ genre, data = data)
# Summary of ANOVA results
summary(anova_result)
##               Df  Sum Sq Mean Sq F value Pr(>F)
## genre       2303  571557   248.2   1.511 <2e-16 ***
## Residuals   7874 1293385   164.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value from the ANOVA test is below 2e-16, far under the 0.05 significance level, so we reject the null hypothesis (H0) and conclude that mean IMDb scores differ between at least some movie genres. Note that the genre column has 2,304 distinct values (2,303 degrees of freedom), so the F statistic itself is modest (1.511) even though the p-value is extremely small.
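The ANOVA table can be checked by hand. The sketch below recomputes the F statistic and its p-value from the reported sums of squares and degrees of freedom, and adds eta-squared as an effect-size measure, which the original analysis does not report. Because the inputs are the rounded table values, the results are approximate.

```r
# Values copied from the ANOVA table above.
ss_genre <- 571557
ss_resid <- 1293385
df_genre <- 2303
df_resid <- 7874

# F is the ratio of the two mean squares.
f_value <- (ss_genre / df_genre) / (ss_resid / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_genre, df_resid, lower.tail = FALSE)

# Eta-squared: share of total variation in score attributed to genre.
eta_sq <- ss_genre / (ss_genre + ss_resid)

cat("F value:", round(f_value, 3), "\n")
cat("p-value:", format(p_value, digits = 3), "\n")
cat("Eta-squared:", round(eta_sq, 3), "\n")
```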
For both hypothesis tests:
Alpha Level (α): 0.05 (5% level of significance). We chose the conventional threshold of 0.05, which caps the Type I error rate at 5% while keeping the Type II error rate manageable.
Power Level (1 - β): 0.80 (80% power). We selected a power of 0.80 so that a real effect of the chosen size would be detected 80% of the time. This is a widely accepted cutoff for hypothesis testing.
Minimum Effect Size (Cohen’s d): 0.30. We chose a minimum effect size of 0.30 based on practical significance. By Cohen’s conventions (0.2 small, 0.5 medium) this is a small-to-medium effect, which we treat as the smallest meaningful difference in movie revenue between budget levels and in IMDb scores between genres.
alpha <- 0.05
power <- 0.80
effect_size <- 0.30
# Two-sided critical value, matching the abs(t_stat) comparison below
critical_value <- qnorm(1 - alpha / 2)
# Required n for a two-sided test at the chosen alpha, power, and Cohen's d (normal approximation)
required_sample_size <- (critical_value + qnorm(power))^2 / effect_size^2
observed_sample_size <- length(data$budget_x)
# Compare mean revenue with mean budget, scaled by the standard error of mean revenue
t_stat <- (mean(data$revenue) - mean(data$budget_x)) / (sd(data$revenue) / sqrt(observed_sample_size))
# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = observed_sample_size - 1)
if (observed_sample_size >= required_sample_size) {
  cat("Sample size is sufficient for Neyman-Pearson test.\n")
  cat("Observed t-statistic:", t_stat, "\n")
  cat("Observed p-value:", p_value, "\n")
  if (abs(t_stat) > critical_value) {
    cat("Reject the null hypothesis (H0).\n")
    cat("There is a significant difference in movie revenue between different budget levels.\n")
  } else {
    cat("Fail to reject the null hypothesis (H0).\n")
    cat("There is no significant difference in movie revenue between different budget levels.\n")
  }
} else {
  cat("Sample size is insufficient for Neyman-Pearson test.\n")
  cat("Consider increasing the sample size to achieve the desired power.\n")
}
## Sample size is sufficient for Neyman-Pearson test.
## Observed t-statistic: 68.37077
## Observed p-value: 0
## Reject the null hypothesis (H0).
## There is a significant difference in movie revenue between different budget levels.
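As a cross-check on the required sample size computed above, the normal-approximation formula can be compared with base R's power.t.test(), which solves for n using the t distribution. This is a minimal sketch; treating the design as a one-sample, two-sided test is an assumption made here to mirror the manual formula, not something the analysis states.

```r
alpha <- 0.05
power <- 0.80
effect_size <- 0.30

# Normal approximation: n = (z_{1 - alpha/2} + z_{power})^2 / d^2
n_normal <- (qnorm(1 - alpha / 2) + qnorm(power))^2 / effect_size^2

# t-based calculation; sd = 1 so that delta is expressed in Cohen's d units.
n_exact <- power.t.test(delta = effect_size, sd = 1, sig.level = alpha,
                        power = power, type = "one.sample",
                        alternative = "two.sided")$n

cat("Normal approximation, rounded up:", ceiling(n_normal), "\n")
cat("power.t.test, rounded up:", ceiling(n_exact), "\n")
```

The t-based figure is slightly larger because it accounts for estimating the standard deviation from the sample. Either way the requirement is far below the roughly 10,000 films in the data.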
alpha_level <- 0.05
power_level <- 0.80
effect_size <- 0.30
# Standard error of the overall mean score, used below as the denominator of the t-statistic
se_mean_difference <- sd(data$score) / sqrt(length(data$score))
# Two-sided critical value
critical_value <- qnorm(1 - alpha_level / 2)
# Required n at the chosen alpha, power, and Cohen's d (same normal approximation as for Hypothesis 1)
required_sample_size <- (critical_value + qnorm(power_level))^2 / effect_size^2
current_sample_size <- length(data$score)
if (current_sample_size >= required_sample_size) {
  cat("Sample size is sufficient for Neyman-Pearson test.\n")
  # Difference in mean score between films labelled exactly "Action" and all other films
  t_statistic <- (mean(data$score[data$genre == "Action"]) - mean(data$score[data$genre != "Action"])) / se_mean_difference
  # Two-sided p-value
  p_value <- 2 * pt(-abs(t_statistic), df = current_sample_size - 2)
  cat("Observed t-statistic:", t_statistic, "\n")
  cat("Observed p-value:", p_value, "\n")
  if (p_value <= alpha_level) {
    cat("Reject the null hypothesis (H0).\n")
    cat("There is a significant difference in IMDb scores between different movie genres.\n")
  } else {
    cat("Fail to reject the null hypothesis (H0).\n")
    cat("There is no significant difference in IMDb scores between different movie genres.\n")
  }
} else {
  cat("Sample size is not sufficient for Neyman-Pearson test.\n")
}
## Sample size is sufficient for Neyman-Pearson test.
## Observed t-statistic: -27.77
## Observed p-value: 1.141364e-163
## Reject the null hypothesis (H0).
## There is a significant difference in IMDb scores between different movie genres.
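The manual t-statistic above divides the difference in group means by the standard error of the overall mean. One common alternative is a Welch two-sample t-test, whose standard error combines the two group variances. The sketch below shows that calculation on synthetic data; the demo data frame, its group sizes, and its score values are hypothetical and are not drawn from the movie dataset.

```r
set.seed(1)

# Synthetic stand-in for the movie data (hypothetical values).
demo <- data.frame(
  genre = rep(c("Action", "Other"), times = c(200, 800)),
  score = c(rnorm(200, mean = 60, sd = 12), rnorm(800, mean = 65, sd = 13))
)

action_scores <- demo$score[demo$genre == "Action"]
other_scores  <- demo$score[demo$genre != "Action"]

# t.test() performs Welch's test by default (var.equal = FALSE).
welch <- t.test(action_scores, other_scores)

# The same statistic by hand: the standard error of the difference
# adds the per-group variances of the means.
se_diff  <- sqrt(var(action_scores) / length(action_scores) +
                 var(other_scores) / length(other_scores))
t_manual <- (mean(action_scores) - mean(other_scores)) / se_diff

cat("t from t.test():", unname(welch$statistic), "\n")
cat("t by hand:", t_manual, "\n")
cat("p-value:", welch$p.value, "\n")
```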
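To perform a Fisher-style significance test for both hypotheses, we report the p-value from an analysis of variance (ANOVA) as a measure of evidence against the null hypothesis, without fixing a power level or effect size in advance. For Hypothesis 1, budget is first binned into quartiles so that revenue can be compared across budget levels.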
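# Bin budgets into quartiles; include.lowest = TRUE keeps the minimum budget in the first bin
anova_result_hypothesis_1 <- aov(revenue ~ cut(budget_x, breaks = quantile(budget_x), include.lowest = TRUE), data = data)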
p_value_hypothesis_1 <- summary(anova_result_hypothesis_1)[[1]]$`Pr(>F)`[1]
print(p_value_hypothesis_1)
## [1] 0
anova_result_hypothesis_2 <- aov(score ~ genre, data = data)
p_value_hypothesis_2 <- summary(anova_result_hypothesis_2)[[1]]$`Pr(>F)`[1]
print(p_value_hypothesis_2)
## [1] 1.198442e-37
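When a variable is binned with cut() and quantile() breaks, it is worth checking how the smallest value is treated. By default cut() builds intervals that are open on the left, so the minimum falls outside the first bin and becomes NA unless include.lowest = TRUE is passed. The toy vector below is hypothetical and only illustrates that behaviour.

```r
# Hypothetical values standing in for budgets.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
breaks <- quantile(x)

# Default: intervals are (a, b], so the minimum is not covered.
default_bins <- cut(x, breaks = breaks)

# include.lowest = TRUE closes the first interval on the left: [a, b].
closed_bins <- cut(x, breaks = breaks, include.lowest = TRUE)

cat("NA count with defaults:", sum(is.na(default_bins)), "\n")
cat("NA count with include.lowest = TRUE:", sum(is.na(closed_bins)), "\n")
print(table(closed_bins))
```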
For Hypothesis 1 (Movie Revenue vs. Budget Levels):
The p-value prints as 0, meaning it is too small for R to represent, and is far below 0.05. We reject the null hypothesis. There is a significant difference in movie revenue between budget levels.
For Hypothesis 2 (IMDb Scores vs. Movie Genres):
The p-value is 1.198442e-37, far below 0.05. We reject the null hypothesis. There is a significant difference in IMDb scores among different movie genres.
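A p-value that prints as exactly 0 has underflowed double precision rather than reached zero. The sketch below shows two ways to report such values: format.pval(), which prints a bound instead of 0, and the log.p argument of pf(), which returns the logarithm of the tail probability. The F statistic and degrees of freedom used here are hypothetical placeholders, not values from the analysis.

```r
# Hypothetical F statistic and degrees of freedom, for illustration only.
f_demo   <- 2000
df1_demo <- 3
df2_demo <- 10174

# The raw upper-tail probability underflows to 0.
p_raw <- pf(f_demo, df1_demo, df2_demo, lower.tail = FALSE)
cat("Raw p-value:", p_raw, "\n")

# format.pval() reports values below machine epsilon as a bound.
cat("Formatted p-value:", format.pval(p_raw), "\n")

# log.p = TRUE returns the natural log of the tail probability instead.
log_p <- pf(f_demo, df1_demo, df2_demo, lower.tail = FALSE, log.p = TRUE)
cat("log(p):", log_p, "\n")
```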
# Load necessary libraries (if not already loaded)
library(ggplot2)
# Create a bar plot for Hypothesis 1 (include.lowest = TRUE keeps the minimum budget in the first bin)
ggplot(data, aes(x = cut(budget_x, breaks = quantile(budget_x), include.lowest = TRUE), y = revenue)) +
  geom_bar(stat = "summary", fun = "mean", fill = "blue") +
  labs(title = "Mean Movie Revenue by Budget Level", x = "Budget Level", y = "Mean Revenue") +
  theme_minimal()
# Create a bar plot for Hypothesis 2
ggplot(data, aes(x = genre, y = score)) +
  geom_bar(stat = "summary", fun = "var", fill = "green") +
  labs(title = "Variance of IMDb Scores by Movie Genre", x = "Genre", y = "Variance of IMDb Scores") +
  theme_minimal()
## Warning: Removed 1426 rows containing missing values (`position_stack()`).
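The warning is consistent with genres that contain a single film, since var() of one value is NA and such bars cannot be drawn. That cause is an assumption rather than something the output confirms. The sketch below reproduces the effect on a hypothetical toy dataset and shows one way to keep only genres with at least two films before computing variances.

```r
# Hypothetical scores and genres, for illustration only.
scores <- c(70, 65, 80, 55, 90)
genre  <- c("Drama", "Drama", "Comedy", "Comedy", "Horror")

# Variance per genre: the single-film genre yields NA.
variances <- tapply(scores, genre, var)
print(variances)

# Keep only genres with at least two films.
counts <- table(genre)
keep   <- genre %in% names(counts[counts >= 2])
variances_filtered <- tapply(scores[keep], genre[keep], var)
print(variances_filtered)
```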