06/09/2026

What is Linear Regression?

Linear regression is a statistical method that, in its simplest form, models the relationship between two variables by fitting a straight line through the data.

  • Dependent variable (Y): What we want to predict
  • Independent variable (X): What we use to make predictions

Two real-world examples using the ggplot2movies dataset:

  1. Budget → Votes: Can a movie’s production budget predict how many audience votes it receives? Higher spending often means a wider release and more viewers.

  2. Budget → Rating: Does a bigger budget go with a higher audience rating? Not necessarily; this example tests whether money buys quality.

Visualizing the Examples

Each point is a film. The black line is the fitted regression line.

The Math Behind Linear Regression

The generic simple linear regression formula is:

\[\hat{Y} = b_0 + b_1 X\]

Symbol | Meaning
\(\hat{Y}\) | Predicted value of the outcome
\(b_0\) | Intercept: value of \(\hat{Y}\) when \(X = 0\)
\(b_1\) | Slope: change in \(\hat{Y}\) per 1-unit increase in \(X\)
\(X\) | The predictor (independent variable)
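
The formula does not say how \(b_0\) and \(b_1\) are chosen. R's lm() estimates them by ordinary least squares, which for a single predictor has the closed form \(b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(b_0 = \bar{y} - b_1 \bar{x}\). The sketch below checks that hand calculation against lm(). The x and y values are invented for illustration and are not taken from the movie data.

```r
# Toy data, invented for illustration only (not from ggplot2movies).
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Closed-form least-squares estimates for one predictor.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() should recover the same intercept and slope.
fit_toy <- lm(y ~ x)

print(c(b0 = b0, b1 = b1))
print(coef(fit_toy))
```

With these toy numbers the slope works out to 6 / 10 = 0.6 and the intercept to 4 - 0.6 × 3 = 2.2.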

Applied to our movie example:

\[\widehat{Votes} = b_0 + b_1 \times Budget_{(millions)}\]

The slope \(b_1\) answers: for each extra $1M in budget, how many more votes does a movie tend to get?

Budget Predicting Votes

Each point is a film. The black line is the fitted regression line from lm(votes ~ budget_millions).

Estimated Regression Equation

Fitted to the ggplot2movies films with a known, positive budget under $50 million, our model is:

\[\widehat{Votes} = 1614.71 + 285.46 \times Budget_{(millions)}\]

Interpreting the coefficients:

  • Intercept (\(b_0\) = 1614.71): A movie with a $0 budget is predicted to receive about 1,615 votes. No film in the fitted data has a $0 budget, so this is an extrapolation rather than a realistic prediction.

  • Slope (\(b_1\) = 285.46): Each additional $1 million in budget is associated with about 285 more votes on average.

  • R² = 0.1084: Budget alone explains ~10.8% of the variation in votes. The remaining ~89.2% is not explained by this model and reflects other factors and noise.
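
To make the equation concrete, here is a minimal sketch that plugs a budget into the fitted line. The coefficients are the two reported above. The helper name predict_votes is hypothetical and introduced only for this illustration. Because the model was fitted on budgets under $50 million, inputs outside that range would be extrapolations.

```r
# Coefficients as reported in the fitted equation.
b0 <- 1614.71   # intercept
b1 <- 285.46    # slope: extra votes per additional $1M of budget

# Hypothetical helper: predicted votes for a budget given in millions of dollars.
predict_votes <- function(budget_millions) {
  b0 + b1 * budget_millions
}

# A $20M film: 1614.71 + 285.46 * 20 = 7323.91 predicted votes.
print(predict_votes(20))

# Two films that differ by $1M differ by exactly the slope.
print(predict_votes(21) - predict_votes(20))
```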

The R Code

library(ggplot2)
library(ggplot2movies)
library(plotly)

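# Load the movies data; keep films with a known, positive budget and at least one vote
data(movies)
movies_clean <- movies[!is.na(movies$budget) &
                         movies$budget > 0 & movies$votes > 0, ]
# Express budget in millions of dollars so the slope reads as "votes per $1M"
movies_clean$budget_millions <- movies_clean$budget / 1e6
# Restrict the analysis to films with budgets under $50M
movies_sub <- movies_clean[movies_clean$budget_millions < 50, ]

# Fit the simple linear regression of votes on budget and store the fitted values
fit <- lm(votes ~ budget_millions, data = movies_sub)
movies_sub$predicted <- predict(fit, movies_sub)

# Evenly spaced budgets across the observed range, used to draw the regression line
line_data <- data.frame(
  budget_millions = seq(min(movies_sub$budget_millions),
                        max(movies_sub$budget_millions),
                        length.out = 100)
)
line_data$predicted <- predict(fit, newdata = line_data)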

plot_ly() %>%
  add_markers(
    data = movies_sub,
    x = ~budget_millions, y = ~votes,
    marker = list(size = 4, color = "#8C1D40", opacity = 0.45),
    hoverinfo = "skip",
    name = "Movies"
  ) %>%
  add_lines(
    data = line_data,
    x = ~budget_millions, y = ~predicted,
    line = list(color = "black", width = 3),
    hoverinfo = "skip",
    name = "Regression line"
  ) %>%
  layout(
    title = "Linear Regression: Budget Predicting Votes",
    xaxis = list(title = "Budget (Millions $)"),
    yaxis = list(title = "Votes"),
    showlegend = FALSE
  ) %>%
  config(staticPlot = TRUE, displayModeBar = FALSE)

summary(fit)
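
summary(fit) prints the coefficients and R² together. When those numbers are needed in later code, they can be pulled out of the fitted object directly. The sketch below shows one way to do that and recomputes R² from its definition, \(R^2 = 1 - SS_{res} / SS_{tot}\). It uses a small invented data frame so that it runs on its own; fit_toy stands in for the fit object above.

```r
# Toy data, invented for illustration only (not from ggplot2movies).
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2, 4, 5, 4, 5))
fit_toy <- lm(y ~ x, data = toy)

# Named vector of estimates: "(Intercept)" and "x".
coefs <- coef(fit_toy)

# R-squared as stored in the model summary.
r2 <- summary(fit_toy)$r.squared

# R-squared from its definition: 1 - residual SS / total SS.
ss_res <- sum(residuals(fit_toy)^2)
ss_tot <- sum((toy$y - mean(toy$y))^2)
r2_manual <- 1 - ss_res / ss_tot

print(coefs)
print(c(r2 = r2, r2_manual = r2_manual))
```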

Summary

What we covered:

  • Linear regression models the straight-line relationship between a predictor \(X\) and an outcome \(Y\).

  • Formula: \(\hat{Y} = b_0 + b_1 X\).

  • Example: Budget predicting votes. Every extra $1M in budget is associated with about 285 more votes on average, although budget alone explains only ~10.8% of the variation in votes.

  • Tools: ggplot2 for static plots · plotly for the 2D non-interactive regression chart.