2026-06-07

What is Simple Linear Regression?

  • Statistical method that models the relationship between two variables:
    • Predictor (x): the input/independent variable
    • Response (y): the output/dependent variable
  • The goal is to find the straight line that best describes how X influences Y.

Examples:

  • Does movie length predict ratings?
  • Does study time predict exam scores?

Equations

\[Y = \beta_0 + \beta_1 X + \epsilon\]

  • \(Y\) is the response variable

  • \(X\) is the predictor variable

  • \(\beta_0\) is the intercept

  • \(\beta_1\) is the slope

  • \(\epsilon\) is the error term
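
The model above can be illustrated with simulated data. The following is a minimal sketch, assuming made-up values for the intercept (2), the slope (0.5), the sample size, and the noise level; none of these come from the examples in these slides. It also checks that, for a single predictor, the least-squares slope equals cov(x, y) / var(x), which should agree with what lm() reports.

```r
set.seed(42)                          # make the simulation reproducible

n     <- 200                          # hypothetical sample size
beta0 <- 2                            # hypothetical true intercept
beta1 <- 0.5                          # hypothetical true slope

x   <- runif(n, min = 0, max = 10)    # predictor
eps <- rnorm(n, mean = 0, sd = 1)     # error term
y   <- beta0 + beta1 * x + eps        # response generated by the model

# Least-squares estimates computed by hand
b1_hat <- cov(x, y) / var(x)          # slope
b0_hat <- mean(y) - b1_hat * mean(x)  # intercept

# The same estimates from lm()
fit <- lm(y ~ x)
coef(fit)
c(b0_hat, b1_hat)
```

The estimates will not equal 2 and 0.5 exactly, because the error term adds noise, but they should be close for a sample of this size.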

Goodness of Fit

  • After fitting a regression line, we want to know how well it actually fits the data.
  • To measure this, we use R², the proportion of the variation in Y that the model explains:

\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\]

Where:

  • \(SST\) = total sum of squares: the total variation in Y around its mean

  • \(SSE\) = sum of squared errors: the variation left unexplained by the model

  • \(SST - SSE\) = the variation explained by the model
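
The formula can be checked by hand in R. This is a small sketch using hypothetical study-time data (the eight values of hours and score below are invented for illustration); it computes SSE and SST directly and compares the result with the R² that summary() reports.

```r
# Hypothetical data: hours studied and exam score for 8 students
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
score <- c(52, 55, 61, 60, 68, 71, 75, 80)

fit   <- lm(score ~ hours)
y_hat <- fitted(fit)                  # predicted scores

sse <- sum((score - y_hat)^2)         # unexplained variation
sst <- sum((score - mean(score))^2)   # total variation
r2  <- 1 - sse / sst

r2
summary(fit)$r.squared                # should match r2
```

An R² near 1 means the line accounts for most of the variation in Y; an R² near 0 means it accounts for very little.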

Example 1: Does Movie Length Predict Rating?

Example 2: Does Study Time Predict Exam Scores?

Fitting a Linear Model in R

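# Assumes a data frame `movies` with columns `length` and `rating` is already loaded.
# Keep only movies whose length is between 60 and 180 (drops very short and very long films)
movies_clean <- movies[movies$length >= 60 & movies$length <= 180, ]
# Fit rating as a linear function of length
model_movies <- lm(rating ~ length, data = movies_clean)
# Print coefficients, standard errors, and R-squared
summary(model_movies)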
## 
## Call:
## lm(formula = rating ~ length, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7420 -0.9187  0.1402  1.0224  4.6929 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.1838120  0.0377631  110.79   <2e-16 ***
## length      0.0170546  0.0003895   43.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.492 on 47578 degrees of freedom
## Multiple R-squared:  0.03874,    Adjusted R-squared:  0.03872 
## F-statistic:  1917 on 1 and 47578 DF,  p-value: < 2.2e-16
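
To read the output above, it can help to plug the estimated coefficients back into the regression equation. The sketch below copies the two estimates from the Coefficients table by hand; the movie length of 120 is a hypothetical value chosen only for illustration.

```r
# Coefficients copied from the summary() output above
b0 <- 4.1838120   # intercept: predicted rating when length is 0
b1 <- 0.0170546   # slope: change in predicted rating per one-unit increase in length

# Predicted rating for a hypothetical movie of length 120
b0 + b1 * 120     # 6.230364

# Predicted change in rating for 60 extra units of length
b1 * 60           # 1.023276
```

The slope is statistically significant (p-value < 2e-16), but the R-squared of 0.03874 means length explains only about 3.9% of the variation in rating, so individual predictions remain very uncertain (residual standard error 1.492).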

Multiple Linear Regression -> a 3D Model