After exploring my dataset over the past few weeks, the questions I have are:

  1. Does the movie’s budget (column: budget_x) significantly impact its revenue (column: revenue)?

  2. Does the genre of a movie (column: genre) significantly impact its IMDb score (column: score)?

Devise at least two different null hypotheses based on two different aspects (e.g., columns) of your data

Null Hypothesis (H0): There is no significant difference in movie revenue between different budget levels.

Alternative Hypothesis (H1): There is a significant difference in movie revenue between different budget levels.

Null Hypothesis (H0): There is no significant difference in IMDb scores between different movie genres.

Alternative Hypothesis (H1): There is a significant difference in IMDb scores between different movie genres.
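
Stated in terms of the quantities the tests below actually examine, the two pairs of hypotheses can be written as follows. The symbols are notation introduced here for illustration: \beta_1 is the slope of revenue on budget_x, \mu_g is the mean score of genre g, and k is the number of distinct genre values.

```latex
\begin{aligned}
\text{Budget and revenue:}\quad & H_0 : \beta_1 = 0
  \quad\text{vs.}\quad H_1 : \beta_1 \neq 0 \\
\text{Genre and score:}\quad & H_0 : \mu_1 = \mu_2 = \dots = \mu_k
  \quad\text{vs.}\quad H_1 : \mu_i \neq \mu_j \ \text{for some } i \neq j
\end{aligned}
```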

Two Hypothesis Tests

# Perform linear regression
lm_result <- lm(revenue ~ budget_x, data = data)

# Print the summary of the linear regression
summary(lm_result)
## 
## Call:
## lm(formula = revenue ~ budget_x, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.155e+09 -9.555e+07 -4.019e+07  8.152e+07  2.106e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.036e+07  3.081e+06   13.10   <2e-16 ***
## budget_x    3.280e+00  3.565e-02   91.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 205300000 on 10176 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.454 
## F-statistic:  8463 on 1 and 10176 DF,  p-value: < 2.2e-16

The linear regression shows that the p-value for the budget_x coefficient (< 2e-16) is far below the chosen significance level (alpha = 0.05). We can therefore reject the null hypothesis (H0) and conclude that movie revenue differs significantly across budget levels. The coefficient estimate for budget_x is 3.280, meaning that each additional $1 of budget is associated with an average increase of about $3.28 in revenue. The R-squared of 0.454 indicates that budget explains roughly 45% of the variance in revenue.
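
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below hard-codes the rounded estimates from the summary output rather than refitting the model. The helper name predict_revenue and the example budget of 100 million are assumptions chosen for illustration.

```r
# Rounded estimates copied from the regression summary above
intercept <- 4.036e7
slope <- 3.280

# Hypothetical helper: fitted revenue for a given budget
predict_revenue <- function(budget) {
  intercept + slope * budget
}

example_budget <- 1e8  # assumed example value, not taken from the data
fitted_revenue <- predict_revenue(example_budget)
cat("Fitted revenue:", format(fitted_revenue, big.mark = ","), "\n")
```

For a budget of 100 million the fitted revenue is roughly 368 million, of which about 40 million comes from the intercept.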

# Perform ANOVA test
anova_result <- aov(score ~ genre, data = data)

# Summary of ANOVA results
summary(anova_result)
##               Df  Sum Sq Mean Sq F value Pr(>F)    
## genre       2303  571557   248.2   1.511 <2e-16 ***
## Residuals   7874 1293385   164.3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value from the ANOVA test is extremely small (< 2e-16), far below alpha = 0.05, so we can reject the null hypothesis (H0) and conclude that there is a statistically significant difference in mean IMDb scores between movie genres. Note that genre has 2,303 degrees of freedom, i.e. 2,304 distinct values in the column, so the groups contain fewer than five movies on average, and the F value itself (1.511) is modest.
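
Because a very small p-value says little about how large the genre effect is, it can help to compute an effect size from the same table. The sketch below derives eta squared (the share of score variance attributable to genre) from the sums of squares printed above. The variable names are illustrative, and the inputs are the rounded values from the table.

```r
# Values copied from the ANOVA table above
ss_genre <- 571557
ss_residual <- 1293385
df_genre <- 2303
df_residual <- 7874

# Eta squared: between-group sum of squares over total sum of squares
eta_squared <- ss_genre / (ss_genre + ss_residual)

# Recompute the F value as a consistency check against the table
f_value <- (ss_genre / df_genre) / (ss_residual / df_residual)

cat("Eta squared:", round(eta_squared, 3), "\n")
cat("F value:", round(f_value, 3), "\n")
```

Eta squared tends to be inflated when there are many small groups, so with 2,304 genre values it is best read as an upper-end estimate.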

alpha level, power level, and minimum effect size, and explain why you chose each value.

For both hypothesis tests:

Alpha Level (α): 0.05 (5% level of significance). We picked an alpha level of 0.05 because it is the conventional significance threshold: it caps the Type I error rate at 5% without making Type II errors excessively likely.

Power Level (1 - β): 0.80 (80% power). We selected a power level of 0.80 to have a reasonable probability of detecting a real effect if one exists. A power of 0.80 is a commonly accepted target in hypothesis testing.

Minimum Effect Size (Cohen’s d): 0.30. We chose a minimum effect size of 0.30 based on practical significance. By Cohen’s conventions (0.2 small, 0.5 medium) this is a small-to-medium effect, which we treat as the smallest meaningful difference in movie revenue across budget levels and in IMDb scores across movie genres.
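
One way to see what these three choices imply is to ask how many observations a two-group comparison would need. The sketch below uses power.t.test from R's built-in stats package. It assumes a two-sample, two-sided t-test with the effect expressed in standard-deviation units (sd = 1); that test design is an assumption for illustration, not something fixed by the choices above.

```r
alpha <- 0.05
power <- 0.80
effect_size <- 0.30

# Solve for n per group given effect size, alpha, and power
res <- power.t.test(delta = effect_size, sd = 1, sig.level = alpha,
                    power = power, type = "two.sample",
                    alternative = "two.sided")
n_per_group <- ceiling(res$n)

# Normal-approximation counterpart: 2 * (z_{1-alpha/2} + z_{power})^2 / d^2
n_approx <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 / effect_size^2

cat("Required n per group (t-test):", n_per_group, "\n")
cat("Required n per group (normal approximation):", ceiling(n_approx), "\n")
```

Both figures are far below the roughly ten thousand rows in the dataset, so an effect of this size would be detected with power well above 0.80 in an overall comparison.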

Neyman-Pearson hypothesis test.

alpha <- 0.05  
power <- 0.80
effect_size <- 0.30 
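# Two-sided critical value, consistent with the abs(t_stat) comparison below
critical_value <- qnorm(1 - alpha / 2)
# (z_{1 - alpha/2} + z_{power})^2, the numerator of the sample-size formula
lambda <- (critical_value + qnorm(power))^2

# Minimum n for a one-sample z-test to detect a standardized effect of effect_size
required_sample_size <- lambda / effect_size^2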
observed_sample_size <- length(data$budget_x)

# Note: this statistic compares mean revenue with mean budget, treating the
# mean budget as a fixed reference value
t_stat <- (mean(data$revenue) - mean(data$budget_x)) / (sd(data$revenue) / sqrt(observed_sample_size))
# Two-sided p-value; pt(t_stat, ...) alone returns the lower tail, which is ~1 for a large positive t
p_value <- 2 * pt(-abs(t_stat), df = observed_sample_size - 1)

if (observed_sample_size >= required_sample_size) {
  cat("Sample size is sufficient for Neyman-Pearson test.\n")
  cat("Observed t-statistic:", t_stat, "\n")
  cat("Observed p-value:", p_value, "\n")
  
  if (abs(t_stat) > critical_value) {
    cat("Reject the null hypothesis (H0).\n")
    cat("There is a significant difference in movie revenue between different budget levels.\n")
  } else {
    cat("Fail to reject the null hypothesis (H0).\n")
    cat("There is no significant difference in movie revenue between different budget levels.\n")
  }
} else {
  cat("Sample size is insufficient for Neyman-Pearson test.\n")
  cat("Consider increasing the sample size to achieve the desired power.\n")
}
## Sample size is sufficient for Neyman-Pearson test.
## Observed t-statistic: 68.37077 
## Observed p-value: 0 
## Reject the null hypothesis (H0).
## There is a significant difference in movie revenue between different budget levels.
alpha_level <- 0.05
power_level <- 0.80
effect_size <- 0.30
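# Standard error of the overall mean score, used below as an approximation;
# a two-sample test would use the standard error of the group difference
se_mean_difference <- sd(data$score) / sqrt(length(data$score))
critical_value <- qnorm(1 - alpha_level / 2)  # two-sided, matching the p-value below
# Total n for a two-group comparison with equal groups: 2 groups * 2 * (z_a + z_b)^2 / d^2
required_sample_size <- 4 * (critical_value + qnorm(power_level))^2 / effect_size^2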
current_sample_size <- length(data$score)
if (current_sample_size >= required_sample_size) {
  cat("Sample size is sufficient for Neyman-Pearson test.\n")
  
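  # Compares "Action" against all other genres only; any genre value that is not exactly "Action" counts as "other"
  t_statistic <- (mean(data$score[data$genre == "Action"]) - mean(data$score[data$genre != "Action"])) / se_mean_difference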
  p_value <- 2 * pt(-abs(t_statistic), df = current_sample_size - 2)
  
  cat("Observed t-statistic:", t_statistic, "\n")
  cat("Observed p-value:", p_value, "\n")
  if (p_value <= alpha_level) {
    cat("Reject the null hypothesis (H0).\n")
    cat("There is a significant difference in IMDb scores between different movie genres.\n")
  } else {
    cat("Fail to reject the null hypothesis (H0).\n")
    cat("There is no significant difference in IMDb scores between different movie genres.\n")
  }
} else {
  cat("Sample size is not sufficient for Neyman-Pearson test.\n")
}
## Sample size is sufficient for Neyman-Pearson test.
## Observed t-statistic: -27.77 
## Observed p-value: 1.141364e-163 
## Reject the null hypothesis (H0).
## There is a significant difference in IMDb scores between different movie genres.
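
The statistic above divides the group difference by the standard error of the overall mean. A two-sample test instead uses the standard error of the difference between the two groups, which t.test computes directly. The sketch below runs Welch's t-test on a small synthetic data frame. The scores, group sizes, and the is_action flag are made up for illustration and are not drawn from the movie dataset.

```r
set.seed(42)

# Synthetic stand-in for the movie data (values are invented)
toy <- data.frame(
  genre = rep(c("Action", "Drama", "Comedy"), times = c(40, 60, 50)),
  score = c(rnorm(40, mean = 62, sd = 12),
            rnorm(60, mean = 66, sd = 12),
            rnorm(50, mean = 64, sd = 12)),
  stringsAsFactors = FALSE
)
toy$is_action <- toy$genre == "Action"

# Welch's two-sample t-test (var.equal = FALSE is the default)
welch <- t.test(score ~ is_action, data = toy)

# Neyman-Pearson style decision at a pre-set alpha
alpha_level <- 0.05
decision <- if (welch$p.value <= alpha_level) "Reject H0" else "Fail to reject H0"

cat("t statistic:", unname(welch$statistic), "\n")
cat("p-value:", welch$p.value, "\n")
cat("Decision:", decision, "\n")
```

On the real data the same call would take the form t.test(score ~ is_action, data = data) after adding the flag column. The resulting statistic would generally differ from the one printed above because the denominator changes.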

Perform a Fisher’s style test for significance, and interpret the p-value.

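To perform a Fisher’s style test for both hypotheses, we can use the analysis of variance (ANOVA) test and read the p-value directly as a measure of evidence against H0, rather than comparing a test statistic with a pre-set critical value.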

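# include.lowest = TRUE keeps movies at the minimum budget in the first bin instead of turning them into NA
anova_result_hypothesis_1 <- aov(revenue ~ cut(budget_x, breaks = quantile(budget_x), include.lowest = TRUE), data = data)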

p_value_hypothesis_1 <- summary(anova_result_hypothesis_1)[[1]]$`Pr(>F)`[1]

print(p_value_hypothesis_1)
## [1] 0
anova_result_hypothesis_2 <- aov(score ~ genre, data = data)

p_value_hypothesis_2 <- summary(anova_result_hypothesis_2)[[1]]$`Pr(>F)`[1]

print(p_value_hypothesis_2)
## [1] 1.198442e-37

For Hypothesis 1 (Movie Revenue vs. Budget Levels):

The p-value is printed as 0, meaning it is too small to be represented numerically, and is far below 0.05. We reject the null hypothesis: there is a significant difference in movie revenue between budget levels.

For Hypothesis 2 (IMDb Scores vs. Movie Genres):

The p-value is 1.198e-37, far below 0.05. We reject the null hypothesis: there is a significant difference in IMDb scores among different movie genres.
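
A p-value that prints as 0 has underflowed rather than reached zero. One way to keep such values comparable is to work on the log scale: pf accepts log.p = TRUE and returns the natural log of the tail probability. The sketch below applies this to the genre ANOVA, using the rounded sums of squares and degrees of freedom from the ANOVA table earlier in this report, so the result only approximately matches the printed p-value.

```r
# F value rebuilt from the rounded ANOVA table entries for genre
f_value <- (571557 / 2303) / (1293385 / 7874)

# Upper-tail probability on the natural scale
p_direct <- pf(f_value, df1 = 2303, df2 = 7874, lower.tail = FALSE)

# Same probability on the log scale; this stays finite even when
# the natural-scale value would underflow to 0
log_p <- pf(f_value, df1 = 2303, df2 = 7874, lower.tail = FALSE, log.p = TRUE)
log10_p <- log_p / log(10)

cat("p-value:", p_direct, "\n")
cat("log10(p-value):", log10_p, "\n")
```

The same approach applies to the budget-level ANOVA, where the natural-scale p-value is not representable but its logarithm is.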

Build two visualizations that best illustrate the results from the two pairs of hypothesis tests

# Load necessary libraries (if not already loaded)
library(ggplot2)

# Create a bar plot for Hypothesis 1
ggplot(data, aes(x = cut(budget_x, breaks = quantile(budget_x), include.lowest = TRUE), y = revenue)) +
  geom_bar(stat = "summary", fun = "mean", fill = "blue") +
  labs(title = "Mean Movie Revenue by Budget Level", x = "Budget Level", y = "Mean Revenue") +
  theme_minimal()

# Create a bar plot for Hypothesis 2
# var() returns NA for genres with a single movie, which is the likely source of the warning below
ggplot(data, aes(x = genre, y = score)) +
  geom_bar(stat = "summary", fun = "var", fill = "green") +
  labs(title = "Variance of IMDb Scores by Movie Genre", x = "Genre", y = "Variance of IMDb Scores") +
  theme_minimal()
## Warning: Removed 1426 rows containing missing values (`position_stack()`).
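
With thousands of distinct genre values, a bar per genre is hard to read, and single-movie genres have no defined variance. One option is to restrict the plot to the most frequent genres and summarise the mean score, which is the quantity the hypothesis concerns. The sketch below shows that filtering step in base R on a small synthetic data frame. The data, the top_n cutoff, and the variable names are assumptions for illustration.

```r
set.seed(1)

# Synthetic stand-in for the movie data (values are invented)
toy <- data.frame(
  genre = sample(c("Drama", "Comedy", "Action", "Horror", "Drama, Romance", "Western"),
                 size = 200, replace = TRUE,
                 prob = c(0.30, 0.25, 0.20, 0.15, 0.07, 0.03)),
  score = round(rnorm(200, mean = 65, sd = 10)),
  stringsAsFactors = FALSE
)

# Keep only the top_n most frequent genre values
top_n <- 4
genre_counts <- sort(table(toy$genre), decreasing = TRUE)
top_genres <- names(genre_counts)[seq_len(top_n)]
toy_top <- toy[toy$genre %in% top_genres, ]

# Mean score per remaining genre
mean_by_genre <- tapply(toy_top$score, toy_top$genre, mean)
print(round(mean_by_genre, 1))
```

A filtered data frame like toy_top could then be passed to the same ggplot call with fun = "mean" in place of fun = "var".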