2026-06-03

Comedy Film Analysis (Simple Linear Regression)

  • Goal: To analyze the trends and factors influencing comedy movies.
  • Why: Comedy is a major genre that has evolved significantly over the decades.
  • Tools: R, ggplot2, plotly, and simple linear regression
  • Importance:
    • Provides evidence to support or challenge assumptions.
    • Enables data-driven decision-making for filmmakers and analysts.
    • Demonstrates the practical application of statistical modeling.

Simple Linear Regression

  • What is it: A statistical method to model the relationship between two continuous variables.
    • Independent Variable: The predictor (e.g., Year, Movie Length).
    • Dependent Variable: The outcome (e.g., Rating).
  • Crux: Calculate the regression line by minimizing the sum of squared residuals.
  • Why it matters:
    • It allows us to quantify the impact of one variable on another.
    • It provides a framework to make predictions and test hypotheses.
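
As a minimal sketch of this idea in R, the snippet below fits a simple linear regression with `lm()` and uses it for one prediction. The data frame `toy` and its values are made up for illustration; they are not the comedy data set used in these slides.

```r
# Hypothetical toy data (illustration only, not the comedy data set)
toy <- data.frame(
  year   = c(1960, 1970, 1980, 1990, 2000),
  rating = c(7.1, 6.4, 6.2, 5.4, 5.1)
)

# Fit rating = beta0 + beta1 * year by ordinary least squares
fit <- lm(rating ~ year, data = toy)

# Estimated intercept and slope; the slope is the change in rating per year
coef(fit)

# Predicted rating for a new value of the predictor
pred_2010 <- predict(fit, newdata = data.frame(year = 2010))
pred_2010
```

The slope answers the "how much" question (change in the outcome per unit of the predictor), while `predict()` covers the prediction use case.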

Mathematical Foundation

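  • We model the relationship between variables using the simple linear regression equation: \[y = \beta_0 + \beta_1x + \epsilon\] where \(\epsilon\) is the random error term. The fitted line is \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\).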
  • To identify the regression line, we minimize the sum of squared residuals: \[SSR = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1x_i))^2\]
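
To make the SSR criterion concrete, here is a small sketch that evaluates it for two candidate lines. The helper `ssr()` and the vectors `x` and `y` are hypothetical names and made-up values chosen for illustration.

```r
# Sum of squared residuals for a candidate line y = b0 + b1 * x
ssr <- function(b0, b1, x, y) {
  sum((y - (b0 + b1 * x))^2)
}

# Hypothetical toy data (illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Two candidate lines: a flat line at y = 6 and a line with slope 2
ssr_flat  <- ssr(b0 = 6, b1 = 0, x = x, y = y)
ssr_steep <- ssr(b0 = 0, b1 = 2, x = x, y = y)

# The line that tracks the points more closely has the smaller SSR
c(flat = ssr_flat, steep = ssr_steep)
```

The regression line is the choice of `b0` and `b1` for which this quantity is as small as possible.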

Ordinary Least Squares

  • The objective is to find the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals.
  • To minimize the function, we take the partial derivatives with respect to \(\beta_0\) and \(\beta_1\) and set each to zero: \[\frac{\partial SSR}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1x_i) = 0\] \[\frac{\partial SSR}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1x_i)x_i = 0\]
  • Solving these two equations (the normal equations) gives the estimates: \[\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\]
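
The closed-form estimates can be checked numerically. The sketch below computes them by hand on made-up data (the vectors `x` and `y` are hypothetical) and compares them with what `lm()` returns. It also confirms that the residuals satisfy the two first-order conditions.

```r
# Hypothetical toy data (illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form OLS estimates of slope and intercept
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

# The same estimates from R's built-in least-squares fit
fit <- lm(y ~ x)
matches_lm <- isTRUE(all.equal(unname(coef(fit)), c(b0_hat, b1_hat)))

# First-order conditions: residuals sum to zero and are orthogonal to x
res <- y - (b0_hat + b1_hat * x)
foc <- c(sum(res), sum(res * x))

matches_lm
foc
```

Both entries of `foc` should be zero up to floating-point error, which is exactly what setting the two partial derivatives to zero requires.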

3D View: Year, Length, and Rating

Year vs. Rating

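library(ggplot2)

# Scatter plot of rating against release year; geom_smooth(method = "lm")
# overlays the least-squares line with its default confidence band
ggplot(comedy_clean, aes(x = year, y = rating)) +
  geom_point(alpha = 0.3, color = "red") +
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
  labs(title = "Comedy Ratings Over Time", x = "Year", y = "Rating")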


Length vs. Rating

# Scatter plot of rating against movie length with the fitted least-squares line
ggplot(comedy_clean, aes(x = length, y = rating)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_smooth(method = "lm", formula = y ~ x, color = "orange") +
  labs(title = "Comedy Ratings by Length", x = "Length", y = "Rating")
