data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(pwr)
## Warning: package 'pwr' was built under R version 4.3.3

Approach 1:

Hypothesis 1:

  • Null Hypothesis (H₀): The average runtime of movies released in the 1970s is equal to the average runtime of movies released in the 1980s.

  • Alternative Hypothesis (H₁): The average runtime of movies released in the 1970s differs from the average runtime of movies released in the 1980s.

Data Preparation:

  • Here we convert the release_date column into a standard date format and extract the release year. We then keep films released from 1970 through 1989 and split their runtimes into two groups, one per decade.
data$release_date <- as.Date(data$release_date, format="%Y-%m-%d")
data$release_year <- as.numeric(format(data$release_date, "%Y"))

runtime_1970s <- na.omit(data$runtime[data$release_year >= 1970 & data$release_year < 1980])
runtime_1980s <- na.omit(data$runtime[data$release_year >= 1980 & data$release_year < 1990])
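As a minimal sketch of the same preparation step, the snippet below runs the date parsing and decade split on a small made-up data frame. The `toy` data and the `in_1970s`/`in_1980s` helpers are hypothetical and only illustrate how unparseable dates and missing runtimes drop out; using `which()` is one possible way to keep NA years out of the index, not necessarily what the report itself does.

```r
# Hypothetical toy data; the real analysis reads movies_metadata.csv.
toy <- data.frame(
  release_date = c("1975-06-20", "1982-06-11", "not a date",
                   "1989-12-31", "1990-01-01"),
  runtime = c(124, 115, 90, NA, 100),
  stringsAsFactors = FALSE
)

# Strings that do not match the format become NA instead of raising an error.
toy$release_date <- as.Date(toy$release_date, format = "%Y-%m-%d")
toy$release_year <- as.numeric(format(toy$release_date, "%Y"))

# which() drops rows whose year is NA, so no NA indices reach the subset.
in_1970s <- which(toy$release_year >= 1970 & toy$release_year < 1980)
in_1980s <- which(toy$release_year >= 1980 & toy$release_year < 1990)

# na.omit() then removes missing runtimes; as.numeric() strips its attributes.
toy_1970s <- as.numeric(na.omit(toy$runtime[in_1970s]))
toy_1980s <- as.numeric(na.omit(toy$runtime[in_1980s]))
```

In this toy example the film dated 1990-01-01 falls outside both groups, because the upper bound of each decade is exclusive.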

Defining Parameters:

  • Alpha: We choose an alpha level of 0.01 so that we reject the null hypothesis only on strong evidence.

  • Power: We choose a power of 80%, which leaves a 20% chance of missing a real difference of size delta.

  • Delta: We set delta to 10 minutes, the smallest difference in average runtime that we consider meaningful.

alpha <- 0.01
power <- 0.80
delta <- 10

Calculate Required Sample Size:

  • Here we calculate the sample size required for the test to reach the chosen power. The required size is 421 movies per group.

  • We then compare the actual sample sizes in the data to the required size.

runtime_std <- sd(c(runtime_1970s, runtime_1980s))

effect_size <- delta / runtime_std

required_n <- ceiling(pwr.t.test(d = effect_size, power = power, sig.level = alpha, type = "two.sample")$n)

print(paste("Required Sample Size Per Group:", required_n))
## [1] "Required Sample Size Per Group: 421"
n_1970s <- length(runtime_1970s)
n_1980s <- length(runtime_1980s)

if (n_1970s >= required_n && n_1980s >= required_n) {
  print("We have enough data to perform the test!")
  print(paste("Sample Size (1970s):", n_1970s))
  print(paste("Sample Size (1980s):", n_1980s))
} else {
  print("Not enough data to perform the test.")
  print(paste("Sample Size (1970s):", n_1970s))
  print(paste("Sample Size (1980s):", n_1980s))
}
## [1] "We have enough data to perform the test!"
## [1] "Sample Size (1970s): 3446"
## [1] "Sample Size (1980s): 3913"
  • Both groups contain well over the required 421 movies, so we can proceed with the test.
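The sample-size step can be cross-checked without the pwr package. The sketch below uses `power.t.test()` from base R's stats package together with the usual normal-approximation formula. The value `assumed_sd` is a hypothetical stand-in for `runtime_std`, so the numbers it produces are illustrative and will not match the 421 reported above unless the real standard deviation is substituted.

```r
# Assumed pooled standard deviation in minutes (hypothetical; the report
# computes runtime_std from the data instead).
assumed_sd <- 40
alpha <- 0.01
power <- 0.80
delta <- 10

# power.t.test() solves for n when n is left unspecified.
res <- power.t.test(delta = delta, sd = assumed_sd, sig.level = alpha,
                    power = power, type = "two.sample",
                    alternative = "two.sided")
n_per_group <- ceiling(res$n)

# Normal approximation: n is roughly 2 * ((z_(1 - alpha/2) + z_power) / d)^2,
# where d = delta / sd is the standardized effect size.
d <- delta / assumed_sd
n_approx <- 2 * ((qnorm(1 - alpha / 2) + qnorm(power)) / d)^2

print(paste("Per-group n from power.t.test:", n_per_group))
print(paste("Per-group n from normal approximation:", ceiling(n_approx)))
```

The t-based answer comes out slightly larger than the normal approximation because the t distribution has heavier tails. Since the required n scales with 1/d², halving delta roughly quadruples the required sample size.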

Performing the Two-Sample t-Test:

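  • Since both samples are large, the sample means are approximately normal by the central limit theorem, so we skip a formal normality check and proceed directly with the t-test. Setting var.equal = FALSE requests Welch's t-test, which does not assume equal variances.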
test_result <- t.test(runtime_1970s, runtime_1980s, var.equal = FALSE)

print(paste("Test Type: Two-Sample t-test"))
## [1] "Test Type: Two-Sample t-test"
print(paste("Test Statistic:", round(test_result$statistic, 3)))
## [1] "Test Statistic: 0.639"
print(paste("P-Value:", round(test_result$p.value, 3)))
## [1] "P-Value: 0.523"
# Decision rule
if (test_result$p.value < alpha) {
  print("Decision: Reject H₀ → The average runtime has significantly changed over time.")
} else {
  print("Decision: Fail to Reject H₀ → No significant change in average runtime.")
}
## [1] "Decision: Fail to Reject H₀ → No significant change in average runtime."
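To make the test statistic less of a black box, the following sketch recomputes a Welch two-sample t-test by hand and compares it with `t.test(..., var.equal = FALSE)`. The vectors `x` and `y` are simulated, not the movie runtimes, and the `*_manual` names are illustrative.

```r
set.seed(1)  # simulated data only; not the movie runtimes
x <- rnorm(200, mean = 105, sd = 30)
y <- rnorm(250, mean = 104, sd = 40)

# Per-group variance of the sample mean.
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch statistic: difference in means over the unpooled standard error.
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Reference result from the built-in test.
welch <- t.test(x, y, var.equal = FALSE)
```

The manual values should agree with `welch$statistic`, `welch$parameter`, and `welch$p.value` up to floating-point tolerance. The sign of the statistic follows the argument order: a positive value means the first sample has the larger mean.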

Results:

  • Test Statistic: 0.639, meaning the difference in means is small relative to its standard error.

  • P-Value: 0.523, which is much higher than our alpha of 0.01, so we fail to reject H₀.

  • Conclusion: The test shows no statistically significant difference in average runtime between the 1970s and 1980s.

Visualization:

runtime_df <- data.frame(
  runtime = c(runtime_1970s, runtime_1980s),
  decade = c(rep("1970s", length(runtime_1970s)), rep("1980s", length(runtime_1980s)))
)

ggplot(runtime_df, aes(x = runtime, fill = decade)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 50, color = "black") +  
  scale_fill_manual(values = c("1970s" = "red", "1980s" = "blue")) +  
  labs(title = "Distribution of Movie Runtimes (1970s vs. 1980s)",
       x = "Runtime (Minutes)", y = "Count") +
  theme_minimal()

From the histogram we can see that the distribution of runtimes is very similar across both decades, which is consistent with the test result: there is no dramatic change between the two decades. We fail to reject the null hypothesis.

Conclusion:

The idea that '80s movies were longer than '70s movies is not supported by statistical evidence. While some individual films may have been longer, the overall trend does not show a significant shift in movie runtimes during this period.

Approach 2:

Hypothesis 2:

  • H₀ (Null Hypothesis): The average budget of movies in the 1970s is equal to the average budget of movies in the 1980s.

  • H₁ (Alternative Hypothesis): The average budget of movies in the 1970s is different from the average budget of movies in the 1980s.

Data Preparation:

  • Here we convert the budget column to numeric; entries that cannot be parsed become NA, which produces the coercion warning below. Using the release year extracted earlier, we then keep films released from 1970 through 1989 and split their budgets into two groups, one per decade.
data$budget <- as.numeric(data$budget)
## Warning: NAs introduced by coercion
budget_1970s <- na.omit(data$budget[data$release_year >= 1970 & data$release_year < 1980])
budget_1980s <- na.omit(data$budget[data$release_year >= 1980 & data$release_year < 1990])
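The coercion warning above is easy to reproduce. The sketch below uses a made-up `raw_budget` vector to show how `as.numeric()` treats unparseable strings and how one might count the values lost before dropping them. The sample values are hypothetical and are not taken from the dataset.

```r
# Hypothetical raw values; the second and fourth are not valid numbers.
raw_budget <- c("30000000", "unknown", "0", "")

# as.numeric() returns NA for strings it cannot parse. The
# "NAs introduced by coercion" warning is silenced here on purpose.
budget_num <- suppressWarnings(as.numeric(raw_budget))

# Count how many non-missing inputs were lost to coercion.
n_lost <- sum(is.na(budget_num) & !is.na(raw_budget))

# Drop the NAs; as.numeric() strips the attributes that na.omit() adds.
budget_clean <- as.numeric(na.omit(budget_num))

print(paste("Values lost to coercion:", n_lost))
```

Note that a literal "0" parses successfully and stays in the data. Whether a zero budget is a real value or a placeholder for a missing one is a separate question that this sketch does not answer.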

Fisher’s Significance Test:

test_result <- t.test(budget_1970s, budget_1980s, var.equal = FALSE)

print(paste("Test Type: Two-Sample t-test (Fisher’s Significance Testing)"))
## [1] "Test Type: Two-Sample t-test (Fisher’s Significance Testing)"
print(paste("Test Statistic:", round(test_result$statistic, 3)))
## [1] "Test Statistic: -14.884"
print(paste("P-Value:", round(test_result$p.value, 3)))
## [1] "P-Value: 0"
if (test_result$p.value < 0.05) {
  print("Decision: Reject H₀ → The average budget of movies in the 1970s and 1980s is significantly different.")
} else {
  print("Decision: Fail to Reject H₀ → No significant difference in average budgets between the 1970s and 1980s.")
}
## [1] "Decision: Reject H₀ → The average budget of movies in the 1970s and 1980s is significantly different."
budget_df <- data.frame(
    budget = c(budget_1970s, budget_1980s),
    decade = c(rep("1970s", length(budget_1970s)), rep("1980s", length(budget_1980s)))
  )
 
ggplot(budget_df, aes(x = decade, y = budget, fill = decade)) +
  geom_boxplot(alpha = 0.6) +
  scale_fill_manual(values = c("1970s" = "red", "1980s" = "blue")) +
  labs(title = "Budget Comparison: 1970s vs. 1980s",
       x = "Decade", y = "Budget ($)") +
  theme_minimal()

  • The test indicates that average budgets in the 1980s were higher than in the 1970s (the negative test statistic means the 1970s mean is the lower one). This may be due to the rise of blockbuster films.

  • The main bodies of the box plots look similar, suggesting that while the highest budgets increased, the budgets of most films may have stayed at a similar level.

  • Conclusion: Statistically, average movie budgets in the 1980s were significantly different from those in the 1970s, with an overall trend toward higher budgets. The p-value (which rounds to 0 at three decimal places) is below our significance level of 0.05, so we reject the null hypothesis.
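A p-value printed as exactly 0 is a rounding artifact of `round(p, 3)`, not a true zero. As a small sketch of two alternatives, the snippet below formats a very small p-value with `signif()` and with `format.pval()` from base R. The value `p_tiny` is hypothetical and is not the p-value from the budget test.

```r
p_tiny <- 3.2e-48  # hypothetical value, not the p-value from the test above

# Rounding to three decimals collapses any p-value below 0.0005 to 0.
rounded <- round(p_tiny, 3)

# signif() keeps three significant digits instead of three decimals.
sci <- signif(p_tiny, 3)

# format.pval() reports values below eps as "less than eps".
label <- format.pval(p_tiny, eps = 0.001)

print(paste("P-Value (rounded):", rounded))
print(paste("P-Value (significant digits):", sci))
print(paste("P-Value (thresholded):", label))
```

Reporting the thresholded form avoids suggesting that the probability is literally zero.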