HW3 - Linear Regression

06/09/2026

What is Linear Regression?

Linear regression is a statistical method used to model the relationship between two variables by fitting a straight line through the data.

Dependent variable (Y): What we want to predict
Independent variable (X): What we use to make predictions

Two real-world examples using the ggplot2movies dataset:

Budget → Votes: Can a movie’s production budget predict how many audience votes it receives? Higher spending often means a wider release and more viewers.
Budget → Rating: Does a bigger budget lead to a higher audience rating? Not always — this tests whether money buys quality.

Visualizing the Examples

Each point is a film. The black line is the fitted regression line.
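
A static version of the second example (Budget → Rating) could be drawn with ggplot2 roughly as follows. This is a minimal sketch, not the code behind the original figure: it assumes the same cleaning steps used in The R Code section, and the object name p_rating and the plot title are illustrative choices.

```r
library(ggplot2)
library(ggplot2movies)

# Same cleaning as the votes model: known, positive budget and at least one vote
movies_clean <- movies[!is.na(movies$budget) &
                         movies$budget > 0 & movies$votes > 0, ]
movies_clean$budget_millions <- movies_clean$budget / 1e6
movies_sub <- movies_clean[movies_clean$budget_millions < 50, ]

# Scatter plot of rating against budget with a least-squares line (no confidence band)
p_rating <- ggplot(movies_sub, aes(x = budget_millions, y = rating)) +
  geom_point(color = "#8C1D40", alpha = 0.45, size = 1) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  labs(title = "Linear Regression: Budget Predicting Rating",
       x = "Budget (Millions $)", y = "Rating")

p_rating
```

Here geom_smooth(method = "lm") fits and draws the line in one step, so no separate predict() call is needed for the plot.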

The Math Behind Linear Regression

The general form of the simple linear regression equation is:

\[\hat{Y} = b_0 + b_1 X\]

Symbol	Meaning
$\hat{Y}$	Predicted value of the outcome
$b_0$	Intercept — value of $\hat{Y}$ when $X = 0$
$b_1$	Slope — change in $\hat{Y}$ per 1-unit increase in $X$
$X$	The predictor (independent variable)
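
For reference, the coefficients are chosen by least squares, which is the criterion lm() uses. Written with the sample means $\bar{X}$ and $\bar{Y}$ and observations $(X_i, Y_i)$, $i = 1, \dots, n$ (notation introduced here, not used elsewhere in these slides), the estimates are:

\[b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}\]

The second formula shows that the fitted line always passes through the point $(\bar{X}, \bar{Y})$.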

Applied to our movie example:

\[\widehat{Votes} = b_0 + b_1 \times Budget_{(millions)}\]

The slope $b_1$ answers: for each extra $1M in budget, how many more votes does a movie tend to get?

Budget Predicting Votes

Each point is a film. The black line is the fitted regression line from lm(votes ~ budget_millions).

Estimated Regression Equation

Our model, fitted to the ggplot2movies films with budgets under $50 million (the movies_sub subset in the R code), is:

\[\widehat{Votes} = 1614.71 + 285.46 \times Budget_{(millions)}\]

Interpreting the coefficients:

Intercept ($b_0$ = 1614.71): A movie with a $0 budget is predicted to receive about 1,615 votes. The fitted data only include budgets above $0, so this value is an extrapolation rather than a realistic prediction.
Slope ($b_1$ = 285.46): Each additional $1 million in budget is associated with about 285 more votes on average.
R² = 0.1084: Budget alone explains ~10.8% of the variation in votes. The remaining ~89.2% is not explained by this model and reflects other factors.
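
To see how the fitted equation is used, the sketch below plugs a budget into it by hand. The coefficients are copied from the equation above; the helper name predict_votes and the $20 million example budget are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the fitted equation above
b0 <- 1614.71   # intercept (votes)
b1 <- 285.46    # slope (votes per $1 million of budget)

# Hypothetical helper: predicted votes for a budget given in millions of dollars
predict_votes <- function(budget_millions) {
  b0 + b1 * budget_millions
}

predict_votes(20)   # 1614.71 + 285.46 * 20 = 7323.91
```

Since R² is only about 0.11, an individual film's actual vote count can differ widely from a prediction like this.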

The R Code

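# ggplot2movies supplies the movies data; plotly draws the chart and provides %>%
library(ggplot2)
library(ggplot2movies)
library(plotly)

data(movies)

# Keep films with a known, positive budget and at least one vote
movies_clean <- movies[!is.na(movies$budget) &
                         movies$budget > 0 & movies$votes > 0, ]

# Express budget in millions of dollars so the slope reads "votes per $1M"
movies_clean$budget_millions <- movies_clean$budget / 1e6

# Restrict to films with budgets under $50 million
movies_sub <- movies_clean[movies_clean$budget_millions < 50, ]

# Fit the simple linear regression and store the fitted value for each film
fit <- lm(votes ~ budget_millions, data = movies_sub)
movies_sub$predicted <- predict(fit, movies_sub)

# Evenly spaced budgets across the observed range, used to draw the line
line_data <- data.frame(
  budget_millions = seq(min(movies_sub$budget_millions),
                        max(movies_sub$budget_millions),
                        length.out = 100)
)
line_data$predicted <- predict(fit, newdata = line_data)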

plot_ly() %>%
  add_markers(
    data = movies_sub,
    x = ~budget_millions, y = ~votes,
    marker = list(size = 4, color = "#8C1D40", opacity = 0.45),
    hoverinfo = "skip",
    name = "Movies"
  ) %>%
  add_lines(
    data = line_data,
    x = ~budget_millions, y = ~predicted,
    line = list(color = "black", width = 3),
    hoverinfo = "skip",
    name = "Regression line"
  ) %>%
  layout(
    title = "Linear Regression: Budget Predicting Votes",
    xaxis = list(title = "Budget (Millions $)"),
    yaxis = list(title = "Votes"),
    showlegend = FALSE
  ) %>%
  config(staticPlot = TRUE, displayModeBar = FALSE)

summary(fit)
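
summary(fit) prints the coefficients and R² together. The sketch below shows one way to pull those numbers out individually and to recompute R² by hand. It uses a small simulated data set (an assumption, so that it runs without the movies data); the names toy, toy_fit, r2_summary, and r2_manual are illustrative, and the same calls should apply to the fit object above.

```r
# Simulated toy data (hypothetical, not the movies data)
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 5 + 2 * toy$x + rnorm(30, sd = 4)

toy_fit <- lm(y ~ x, data = toy)

# Intercept (b0) and slope (b1) as a named numeric vector
coef(toy_fit)

# R-squared as reported by summary()
r2_summary <- summary(toy_fit)$r.squared

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(toy_fit)^2)
ss_tot <- sum((toy$y - mean(toy$y))^2)
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_summary, r2_manual)   # TRUE
```

The hand calculation makes the meaning of R² concrete: it is the share of the total variation in the outcome that the fitted line accounts for.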

Summary

What we covered:

Linear regression models the straight-line relationship between a predictor $X$ and an outcome $Y$.
Formula: $\hat{Y} = b_0 + b_1 X$.
Example (budget predicting votes): Every extra $1M in budget is associated with about 285.46 more votes on average.
Tools: ggplot2 for static plots · plotly for the 2D non-interactive regression chart.
