data <- read.csv("movies_metadata.csv", stringsAsFactors = FALSE)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(pwr)

Approach:

We aim to test whether the average runtime of movies changed significantly between the 1970s and 1980s.
Having grown up hearing comparisons between classic films from my older siblings, I've often heard that movies from the '70s were shorter, while '80s films started getting longer with evolving storytelling and effects. This test will determine if that shift is real or just perception.
Using the Neyman-Pearson framework, we will:
- Set intentional error thresholds instead of using arbitrary values.
- Ensure we have enough data by calculating the required sample size.
- Choose the best statistical test based on our data.
This will ensure our conclusion is based on our data and not on assumptions.

Hypothesis 1:

Null Hypothesis (H₀): The average runtime of movies released in the 1970s is equal to the average runtime of movies released in the 1980s.
Alternative Hypothesis (H₁): The average runtime of movies released in the 1970s differs from the average runtime of movies released in the 1980s.

Data Preparation:

Here we convert the release_date column to R's Date format and extract the release year. We then keep films released from 1970 through 1989 and split their runtimes into two groups, one per decade.

data$release_date <- as.Date(data$release_date, format="%Y-%m-%d")
data$release_year <- as.numeric(format(data$release_date, "%Y"))

runtime_1970s <- na.omit(data$runtime[data$release_year >= 1970 & data$release_year < 1980])
runtime_1980s <- na.omit(data$runtime[data$release_year >= 1980 & data$release_year < 1990])
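The filtering step can be illustrated on a small stand-in table. This is a minimal sketch, assuming a hypothetical data frame `toy` with the same two columns as the real file; the rows are invented for illustration. It uses explicit logical masks rather than `na.omit()`, which gives the same groups while making it visible that rows with an unparseable date or a missing runtime are dropped.

```r
# Hypothetical toy data standing in for movies_metadata.csv (values invented)
toy <- data.frame(
  release_date = c("1975-06-20", "1982-06-11", "1989-12-31", "", "1990-01-01"),
  runtime      = c(124, 115, NA, 90, 100),
  stringsAsFactors = FALSE
)

# Same conversion as above: an empty or malformed date becomes NA
toy$release_date <- as.Date(toy$release_date, format = "%Y-%m-%d")
toy$release_year <- as.numeric(format(toy$release_date, "%Y"))

# Explicit masks: a row must have a known year inside the decade
in_1970s <- !is.na(toy$release_year) & toy$release_year >= 1970 & toy$release_year < 1980
in_1980s <- !is.na(toy$release_year) & toy$release_year >= 1980 & toy$release_year < 1990

# Keep only rows with a recorded runtime
toy_1970s <- toy$runtime[in_1970s & !is.na(toy$runtime)]
toy_1980s <- toy$runtime[in_1980s & !is.na(toy$runtime)]

print(toy_1970s)
print(toy_1980s)
```

In this toy table, the 1989 film is excluded because its runtime is missing, and the 1990 film falls outside both decades.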

Defining Parameters:

Alpha: We choose an alpha level of 0.01 so that we reject the null hypothesis only on strong evidence.
Power: We choose an 80% power level, which means a 20% chance of missing a true difference of the size given by delta.
Delta: We set delta to 10 minutes, the smallest difference in average runtime that we consider meaningful.

alpha <- 0.01
power <- 0.80
delta <- 10

Calculate Required Sample Size:

Here we calculate the sample size required in each group for the test to reach the chosen power. The effect size is delta divided by the standard deviation of the combined runtimes. We find that the required size is 421 movies per group.
We then compare the actual sample sizes to the required size.

runtime_std <- sd(c(runtime_1970s, runtime_1980s))

effect_size <- delta / runtime_std

required_n <- ceiling(pwr.t.test(d = effect_size, power = power, sig.level = alpha, type = "two.sample")$n)

print(paste("Required Sample Size Per Group:", required_n))

## [1] "Required Sample Size Per Group: 421"

n_1970s <- length(runtime_1970s)
n_1980s <- length(runtime_1980s)

if (n_1970s >= required_n && n_1980s >= required_n) {
  print("We have enough data to perform the test!")
  print(paste("Sample Size (1970s):", n_1970s))
  print(paste("Sample Size (1980s):", n_1980s))
} else {
  print("Not enough data to perform the test.")
  print(paste("Sample Size (1970s):", n_1970s))
  print(paste("Sample Size (1980s):", n_1980s))
}

## [1] "We have enough data to perform the test!"
## [1] "Sample Size (1970s): 3446"
## [1] "Sample Size (1980s): 3913"

We find that we have well over the required amount so we can proceed with the test.
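As a rough cross-check of the sample-size step, the same kind of calculation can be done with base R alone. This is a sketch under stated assumptions: `sd_assumed` is a hypothetical standard deviation chosen for illustration, not the value computed from the movie data, so the numbers it prints are not expected to match 421. It compares the textbook normal approximation with `power.t.test()` from the stats package.

```r
# Error thresholds and minimum meaningful difference, as defined above
alpha <- 0.01
power <- 0.80
delta <- 10

# Hypothetical standard deviation in minutes (an assumption, not from the data)
sd_assumed <- 40

# Normal approximation for a two-sided, two-sample comparison:
# n per group ~ 2 * ((z_{1 - alpha/2} + z_{power}) * sd / delta)^2
z_alpha  <- qnorm(1 - alpha / 2)
z_beta   <- qnorm(power)
n_approx <- 2 * ((z_alpha + z_beta) * sd_assumed / delta)^2

# Exact calculation based on the noncentral t distribution
n_exact <- power.t.test(delta = delta, sd = sd_assumed,
                        sig.level = alpha, power = power,
                        type = "two.sample",
                        alternative = "two.sided")$n

print(paste("Approximate n per group:", ceiling(n_approx)))
print(paste("Exact n per group:", ceiling(n_exact)))
```

The approximation is slightly smaller than the exact value because it ignores the extra uncertainty from estimating the variance.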

Performing the Two-Sample t-Test:

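Since both samples are large, the central limit theorem makes the t-test robust to non-normal runtimes, so we skip a normality check and proceed directly with the test. We use Welch's version (var.equal = FALSE), which does not assume equal variances in the two decades.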

test_result <- t.test(runtime_1970s, runtime_1980s, var.equal = FALSE)

print(paste("Test Type: Two-Sample t-test"))

## [1] "Test Type: Two-Sample t-test"

print(paste("Test Statistic:", round(test_result$statistic, 3)))

## [1] "Test Statistic: 0.639"

print(paste("P-Value:", round(test_result$p.value, 3)))

## [1] "P-Value: 0.523"

# Decision rule
if (test_result$p.value < alpha) {
  print("Decision: Reject H₀ → The average runtime has significantly changed over time.")
} else {
  print("Decision: Fail to Reject H₀ → No significant change in average runtime.")
}

## [1] "Decision: Fail to Reject H₀ → No significant change in average runtime."
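To make the reported statistic and p-value less of a black box, the sketch below recomputes Welch's t-test by hand and compares it with `t.test()`. The two samples `x` and `y` are simulated and purely hypothetical; they only stand in for the two runtime vectors.

```r
set.seed(1)

# Hypothetical samples with unequal sizes and variances
x <- rnorm(200, mean = 100, sd = 20)
y <- rnorm(250, mean = 100, sd = 30)

nx <- length(x)
ny <- length(y)
vx <- var(x) / nx   # squared standard error contribution of x
vy <- var(y) / ny   # squared standard error contribution of y

# Welch statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Reference result from the built-in test
welch <- t.test(x, y, var.equal = FALSE)

print(c(manual = t_manual, builtin = unname(welch$statistic)))
print(c(manual = p_manual, builtin = welch$p.value))
```

The decision rule then compares the p-value with the alpha chosen before looking at the data.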

Results:

Test Statistic: 0.639, meaning the observed difference in means is small relative to its standard error.
P-Value: 0.523, which is much higher than our alpha of 0.01, so we fail to reject H₀.
Conclusion: The test shows no statistically significant difference in average runtime between the 1970s and 1980s.

Visualization:

runtime_df <- data.frame(
  runtime = c(runtime_1970s, runtime_1980s),
  decade = c(rep("1970s", length(runtime_1970s)), rep("1980s", length(runtime_1980s)))
)

ggplot(runtime_df, aes(x = runtime, fill = decade)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 50, color = "black") +  
  scale_fill_manual(values = c("1970s" = "red", "1980s" = "blue")) +  
  labs(title = "Distribution of Movie Runtimes (1970s vs. 1980s)",
       x = "Runtime (Minutes)", y = "Count") +
  theme_minimal()

From the histogram we can see that the distribution of runtimes is very similar across both decades, which is consistent with the test result: there is no dramatic change between the decades. We fail to reject the null hypothesis.

Conclusion:

The idea that '80s movies were longer than '70s movies is not supported by statistical evidence. While some individual films may have been longer, the overall trend does not show a significant shift in movie runtimes during this period.

Approach 2:

We aim to test whether movie budgets significantly changed between the 1970s and 1980s.
Growing up with several older siblings, I watched many classic films. My siblings often told me that the 1980s marked the rise of the blockbuster era, with titles like Star Wars, Indiana Jones, and E.T. dominating the box office. This led me to believe that movies from the ’80s had higher budgets than those of the ’70s, but is this perception backed by data?
Using Fisher's Significance Testing Framework, we will:
- Compare the average movie budgets of the ’70s and ’80s.
- Draw a conclusion based solely on the data.

Hypothesis 2:

H₀ (Null Hypothesis): The average budget of movies in the 1970s is equal to the average budget of movies in the 1980s.
H₁ (Alternative Hypothesis): The average budget of movies in the 1970s is different from the average budget of movies in the 1980s.

Data Preparation:

Here we convert the budget column to numeric; entries that cannot be parsed become NA, which triggers the coercion warning below. We then reuse the release year extracted earlier to split the budgets of films released from 1970 through 1989 into two groups, one per decade.

data$budget <- as.numeric(data$budget)

## Warning: NAs introduced by coercion

budget_1970s <- na.omit(data$budget[data$release_year >= 1970 & data$release_year < 1980])
budget_1980s <- na.omit(data$budget[data$release_year >= 1980 & data$release_year < 1990])

Fisher’s Significance Test:

test_result <- t.test(budget_1970s, budget_1980s, var.equal = FALSE)

print(paste("Test Type: Two-Sample t-test (Fisher’s Significance Testing)"))

## [1] "Test Type: Two-Sample t-test (Fisher’s Significance Testing)"

print(paste("Test Statistic:", round(test_result$statistic, 3)))

## [1] "Test Statistic: -14.884"

print(paste("P-Value:", round(test_result$p.value, 3)))

## [1] "P-Value: 0"

if (test_result$p.value < 0.05) {
  print("Decision: Reject H₀ → The average budget of movies in the 1970s and 1980s is significantly different.")
} else {
  print("Decision: Fail to Reject H₀ → No significant difference in average budgets between the 1970s and 1980s.")
}

## [1] "Decision: Reject H₀ → The average budget of movies in the 1970s and 1980s is significantly different."
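A p-value printed as 0 is an effect of rounding to three decimals, not an exact zero. One possible way to report such values is base R's `format.pval()`, sketched below with two hypothetical p-values (one moderate, one extremely small); the `eps` cutoff of 0.001 is an illustrative choice.

```r
# Hypothetical p-values: one moderate, one far below display precision
p_values <- c(0.523, 1e-40)

# Plain rounding collapses the tiny value to 0
rounded <- round(p_values, 3)
print(rounded)

# format.pval() reports values below eps as an upper bound instead
formatted <- format.pval(p_values, digits = 3, eps = 0.001)
print(formatted)
```

Reporting a bound such as "less than 0.001" avoids suggesting that the probability is literally zero.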

budget_df <- data.frame(
    budget = c(budget_1970s, budget_1980s),
    decade = c(rep("1970s", length(budget_1970s)), rep("1980s", length(budget_1980s)))
  )
 
ggplot(budget_df, aes(x = decade, y = budget, fill = decade)) +
  geom_boxplot(alpha = 0.6) +
  scale_fill_manual(values = c("1970s" = "red", "1980s" = "blue")) +
  labs(title = "Budget Comparison: 1970s vs. 1980s",
       x = "Decade", y = "Budget ($)") +
  theme_minimal()

The test indicates that average budgets in the 1980s were higher than in the 1970s, which is likely due to blockbuster films.
The main bodies of the box plots are close, suggesting that while the highest budgets increased, the majority of films may have had similar budgets.
Conclusion: Statistically, movie budgets in the 1980s were significantly different from those in the 1970s, with an overall trend toward higher budgets. The p-value, reported as 0 after rounding to three decimals, is far below the 0.05 threshold, so we reject the null hypothesis.
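The observation that the box-plot bodies look similar while the means differ can be illustrated with a small constructed example. The two vectors below are hypothetical budgets, not values from the dataset: most films cost the same in both groups, and only a handful of expensive titles differ.

```r
# Hypothetical budgets: 95 ordinary films plus 5 expensive ones per group
budget_a <- c(rep(1e6, 95), rep(5e6, 5))   # modest top end
budget_b <- c(rep(1e6, 95), rep(5e7, 5))   # much larger top end

# Compare the mean (sensitive to the tail) with the median (typical film)
summary_tbl <- data.frame(
  group  = c("A", "B"),
  mean   = c(mean(budget_a), mean(budget_b)),
  median = c(median(budget_a), median(budget_b))
)
print(summary_tbl)
```

In this constructed case the medians are identical while the means differ, which is the pattern a t-test on means can pick up even when the typical film is unchanged.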

Data Dive Week 7

2025-02-27
