MTH 411-Final

Introduction

The film industry is a complex ecosystem where various factors—from distribution logistics to creative choices—influence a movie’s success and profile. This report utilizes a dataset of the top 100 most-viewed films to investigate how metadata characteristics, such as global availability and release timing, correlate with viewership and film duration.

Initially, I will first explore whether a film’s status as “globally available” serves as a predictor for higher viewership. This involves testing the differences in view counts between films with restricted versus worldwide access.

I examine the relationship between a film’s release month and its runtime. By categorizing movies by their release window, I aim to identify if specific times of the year (such as the summer blockbuster season or the year-end awards season) are associated with significantly longer or shorter film lengths.

# Sorting the data by views in descending order

# Creating a table of movies that rank in the top 100 most viewed in this data set. 
top_100 <- movies %>%
  arrange(desc(views)) %>%
  slice(1:100)

# Check the count
table(top_100$available_globally)

## 
##  No Yes 
##  31  69

# The "available globally" category is independent as there are only 2 possible answers (Yes,No) as One movie being globally available doesnt directly cause another movie to be available globally.

## # A tibble: 2 × 4
##   available_globally     n mean_views  sd_views
##   <chr>              <int>      <dbl>     <dbl>
## 1 No                    31  57235484. 10432339.
## 2 Yes                   69  70675362. 28345211.

# box plot comparing the spread of views across the movies that are available globally and those that aren't.
ggplot(top_100, aes(x = available_globally, y = views, fill = available_globally)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red") +
  labs(title = "Views by Global Availability (Top 100 Movies)",
       subtitle = "Red dot represents the mean") +
  theme_classic()

# Normality Check 
# If the dots fall on the diagonal line, it's likely Normally distributed
library(ggplot2)
ggplot(top_100, aes(sample = views, color = available_globally)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~available_globally) +
  labs(title = "Normal Q-Q Plot by Global Status")

# Levene's Test to check for equal variance
library(car)
# Assuming variance is equal as the null hypothesis. However, if the resulting # p value is below a 0.05 significance, I reject the null. 
leveneTest(views ~ available_globally, data = top_100)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value   Pr(>F)   
## group  1  7.8351 0.006172 **
##       98                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# The P value is smaller than 0.05, therefore the variance of the two categories are unequal.

# Attempting a log transformation for the data to potentially pass the
# requirements for any statistical test
# Create a new column for Log Views
top_100 <- top_100 %>%
  mutate(log_views = log(views))

# Check the new Q-Q plot
ggplot(top_100, aes(sample = log_views, color = available_globally)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~available_globally) +
  labs(title = "Q-Q Plot of Log-Transformed Views",
       subtitle = "The dots should stay closer to the line now")

# Values resemble more of a normal distribution

# Shapiro Test to confirm log data is normally distributed
# Test the "Yes" group
shapiro.test(top_100$log_views[top_100$available_globally == "Yes"])

## 
##  Shapiro-Wilk normality test
## 
## data:  top_100$log_views[top_100$available_globally == "Yes"]
## W = 0.91445, p-value = 0.000169

# Test the "No" group
shapiro.test(top_100$log_views[top_100$available_globally == "No"])

## 
##  Shapiro-Wilk normality test
## 
## data:  top_100$log_views[top_100$available_globally == "No"]
## W = 0.95184, p-value = 0.1755

Interpretation

Initial analysis of viewership data revealed significant positive skewness and heteroscedasticity (Levene’s Test, p = 0.006), which could not be fully corrected via log transformation (p < 0.001 for the ‘Yes’ group). To ensure a robust analysis, the focus was shifted to Movie Runtime, which exhibited more stable variance across groups.

# n=100 
# Can use the previous table and make a few edits 
# Check the means and standard deviations
top_100 %>%
  group_by(available_globally) %>%
  summarise(n = n(),
            mean_run = mean(runtime, na.rm = TRUE),
            sd_run = sd(runtime, na.rm = TRUE))

## Warning: There were 4 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `mean_run = mean(runtime, na.rm = TRUE)`.
## ℹ In group 1: `available_globally = "No"`.
## Caused by warning in `mean.default()`:
## ! argument is not numeric or logical: returning NA
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

## # A tibble: 2 × 4
##   available_globally     n mean_run sd_run
##   <chr>              <int>    <dbl>  <dbl>
## 1 No                    31       NA     NA
## 2 Yes                   69       NA     NA

To ensure statistical integrity, the raw data underwent significant cleaning. A primary challenge involved the runtime variable, which was originally recorded as a character string (e.g., “1H 54M”). I utilized the lubridate and dplyr packages to transform these entries into a uniform numeric format (minutes), allowing for mathematical comparison.

The following sections detail my exploratory data analysis (EDA), the testing of statistical assumptions (Normality and Homogeneity of Variance), and the subsequent application of ANOVA and Post-Hoc tests to identify significant trends.

library(dplyr)
library(lubridate) # Essential for handling 'Period' types


top_100 <- top_100 %>%
  mutate(
    # 1. Parse the character string (e.g., "1H 54M") into a duration
    # 2. Convert that duration into numeric minutes
    runtime_mins = as.numeric(duration(runtime), "minutes")
  )

# summary should work now
top_100_clean_stats <- top_100 %>%
  group_by(available_globally) %>%
  summarise(
    n = n(),
    mean_run = mean(runtime_mins, na.rm = TRUE),
    sd_run = sd(runtime_mins, na.rm = TRUE)
  )

print(top_100_clean_stats)

## # A tibble: 2 × 4
##   available_globally     n mean_run sd_run
##   <chr>              <int>    <dbl>  <dbl>
## 1 No                    31     94.5   9.23
## 2 Yes                   69    107.   18.8

Next Steps

With this data, I aim toward doing a two sample t test which requires the assumptions of independence, random sampling, normality, and equal variance.

The groups are independent as the answers “Yes” and “No” are mutually exclusive therefore, cannot be occurring at the same time and don’t interfere with each other.

# Allows for the month to be extracted from the "release_date" 
top_100 <- top_100 %>%
  mutate(release_month = month(release_date, label = TRUE, abbr = TRUE))

# Box plot to allow for visualization of the spread of movie runtime within each month
ggplot(top_100, aes(x = release_month, y = runtime_mins, fill = release_month)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Movie Runtime by Release Month (Top 100)",
       x = "Month",
       y = "Runtime (minutes)") +
  theme_minimal()

month_summary <- top_100 %>%
  group_by(release_month) %>%
  summarise(
    mean_run = mean(runtime_mins, na.rm = TRUE),
    se_run = sd(runtime_mins, na.rm = TRUE) / sqrt(n())
  )

ggplot(month_summary, aes(x = release_month, y = mean_run, fill = release_month)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  # Add Error Bars (Mean +/- 1 SE)
  geom_errorbar(aes(ymin = mean_run - se_run, ymax = mean_run + se_run), width = 0.2) +
  labs(title = "Average Movie Runtime by Month",
       subtitle = "Error bars represent +/- 1 Standard Error",
       y = "Mean Runtime (minutes)", x = "Month") +
  theme_classic()

## Visual Analysis The Boxplot and Bar Chart highlight a distinct seasonal trend in movie lengths. While most months maintain a consistent median, October shows a notable decrease in runtime. In contrast, December and September feature the longest average runtimes and the greatest internal variety (spread). The error bars in the bar chart confirm that the dip in October is statistically distinct from the year-end peaks, suggesting a shift in content type during the fall.

## STATISTICAL TESTS 

# Fit the ANOVA model
month_anova <- aov(runtime_mins ~ release_month, data = top_100)

# See the results
summary(month_anova)

##               Df Sum Sq Mean Sq F value Pr(>F)  
## release_month 11   7111   646.5   2.247  0.023 *
## Residuals     60  17265   287.8                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 28 observations deleted due to missingness

# Checks the normality of the residuals
shapiro.test(residuals(month_anova))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(month_anova)
## W = 0.93369, p-value = 0.0009251

# 2. Run the test (Response ~ Predictor) to check for equal variance 
levene_results <- leveneTest(runtime_mins ~ release_month, data = top_100)

# 3. Print the results
print(levene_results)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group 11  0.7302 0.7054
##       60

Result of Statistical Tests

While the Shapiro-Wilk test suggested a departure from normality (p < 0.001), the ANOVA was maintained due to the Central Limit Theorem’s application to my sample size (n = 72) and the high homogeneity of variance (p = 0.71). In such cases, the F-statistic remains a reliable indicator of group differences.

The ANOVA model (P= 0.023) passes a significance of (P<0.05) therefore there is a statistically significant difference in the mean runtime of movies based on the month they were released. At least one month is physically “longer” or “shorter” than the others.

# Running a Turkey HSD test to see the pairwise comparisons to find which months are driving the p value from the ANOVA toward significance.

tukey_output <- TukeyHSD(month_anova)

# 2. Convert the matrix to a data frame so it can be filtered
tukey_df <- as.data.frame(tukey_output$release_month)

# 3. Filter for rows where the adjusted p-value is less than 0.05
significant_comparisons <- tukey_df[tukey_df$`p adj` < 0.05, ]

# 4. Print the result
print(significant_comparisons)

##              diff        lwr       upr       p adj
## Oct-Sep -38.13333 -73.058358 -3.208309 0.020972178
## Dec-Oct  44.80000   9.874976 79.725024 0.002782854

Results

ANOVA Results: A p-value of 0.023 indicates that mean runtimes are not equal across all months.Tukey HSD (Post-Hoc): The significant differences are driven specifically by October. On average, October films are 38 minutes shorter than September releases and 45 minutes shorter than December releases. Assumptions: Despite slight non-normality, the high homogeneity of variance (p = 0.71) and sample size (n = 72) support the validity of these results.

Conclusion

My study of the top 100 films reveals that release month is a significant predictor of movie runtime. The data highlights a “Holiday Peak,” where December releases are significantly longer—likely reflecting high-budget prestige or family epics. Conversely, the significantly shorter runtimes in October suggest a seasonal preference for faster-paced genres. These findings demonstrate that film length is strategically aligned with specific theatrical release windows. Analysis based on the 72 observations with complete release data.

References

Richmond, Jen, and TidyTuesday. “Netflix Engagement Data (2023-2025).” GitHub, 29 July 2025, https://github.com/rfordatascience/tidytuesday/tree/main/data/2025/2025-07-29.