- A statistical method that models the relationship between two variables:
- Predictor (X): the input, or independent, variable
- Response (Y): the output, or dependent, variable
- The goal is to find the linear relationship that best describes how X influences Y.
2026-06-07
\[Y = \beta_0 + \beta_1 X + \epsilon\]
\(Y\) is the response variable
\(X\) is the predictor variable
\(\beta_0\) is the intercept
\(\beta_1\) is the slope
\(\epsilon\) is the error term
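As a minimal sketch of what the model says, the R snippet below simulates data from a line with a known intercept and slope, adds noise, and lets `lm()` estimate the two coefficients. The values of `beta0`, `beta1`, the sample size, and the normally distributed noise are illustrative assumptions, not taken from any dataset in these notes.

```r
set.seed(42)                      # make the simulation reproducible

# Hypothetical "true" parameters, chosen only for illustration
beta0 <- 2.0                      # intercept
beta1 <- 0.5                      # slope
n     <- 200                      # number of observations

x       <- runif(n, min = 0, max = 10)   # predictor X
epsilon <- rnorm(n, mean = 0, sd = 1)    # error term (assumed normal here)
y       <- beta0 + beta1 * x + epsilon   # response: Y = beta0 + beta1 * X + epsilon

fit <- lm(y ~ x)                  # least-squares estimates of beta0 and beta1
coef(fit)                         # estimates should land near 2.0 and 0.5
```

Because of the error term, the estimates will not equal the true values exactly; they get closer as `n` grows or the noise shrinks.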
\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\]
Where:
\(SST\) = total sum of squares: the total variation in Y around its mean
\(SSE\) = sum of squared errors: the variation left unexplained by the model
\(SST - SSE\) = the variation explained by the model
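The formula can be checked by hand. The sketch below computes \(SSE\) and \(SST\) directly and compares the result with the value `lm()` reports. It uses R's built-in `cars` dataset purely as a stand-in, which is an assumption made so the example runs without any extra data.

```r
# Built-in `cars` data (stand-in example): stopping distance versus speed
fit <- lm(dist ~ speed, data = cars)

y     <- cars$dist
y_hat <- fitted(fit)              # model predictions, one per observation

sse <- sum((y - y_hat)^2)         # variation left unexplained by the model
sst <- sum((y - mean(y))^2)       # total variation of y around its mean

r2_manual <- 1 - sse / sst
r2_lm     <- summary(fit)$r.squared

all.equal(r2_manual, r2_lm)       # TRUE when the two values agree
```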
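# Keep only movies whose length is between 60 and 180
movies_clean <- movies[movies$length >= 60 & movies$length <= 180, ]
# Fit a simple linear regression of rating on length
model_movies <- lm(rating ~ length, data = movies_clean)
summary(model_movies)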
##
## Call:
## lm(formula = rating ~ length, data = movies_clean)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.7420 -0.9187  0.1402  1.0224  4.6929
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1838120  0.0377631  110.79   <2e-16 ***
## length      0.0170546  0.0003895   43.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.492 on 47578 degrees of freedom
## Multiple R-squared:  0.03874, Adjusted R-squared:  0.03872
## F-statistic:  1917 on 1 and 47578 DF,  p-value: < 2.2e-16
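One way to read this output is to plug the estimated coefficients back into the fitted line. The sketch below copies the two estimates from the table above; `predict_rating` is a hypothetical helper introduced only for illustration, and the lengths used are arbitrary values inside the 60 to 180 range the model was fitted on.

```r
# Coefficients copied from the summary output above
b0 <- 4.1838120   # (Intercept)
b1 <- 0.0170546   # slope for length

# Hypothetical helper: fitted rating for a given length
predict_rating <- function(len) b0 + b1 * len

predict_rating(100)                        # 4.1838120 + 0.0170546 * 100 = 5.889272
predict_rating(160) - predict_rating(100)  # 60 extra units of length add about 1.02 rating points

# The t value is the estimate divided by its standard error
b1 / 0.0003895                             # about 43.79, matching the table
```

The slope is highly significant, yet the multiple R-squared of 0.03874 says that length accounts for only about 3.9% of the variation in rating, so individual predictions from this line remain very imprecise.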