2026-02-07
Linear regression models how a predictor variable \(x\) helps explain variation in a response variable \(y\).
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
\(\beta_0\) = theoretical y‑intercept
\(\beta_1\) = theoretical slope
\(\varepsilon\) = error term (also called the disturbance or noise). It captures everything the model cannot explain: measurement error, randomness, natural variability, or variables omitted from the model.
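To make these pieces concrete, here is a minimal simulation sketch in R. The coefficient values and error standard deviation are illustrative assumptions, not estimates from any real dataset:

```r
set.seed(42)                           # for reproducibility
n     <- 100
x     <- runif(n, min = 0, max = 10)   # predictor values
beta0 <- 2                             # assumed theoretical intercept
beta1 <- 3                             # assumed theoretical slope
eps   <- rnorm(n, mean = 0, sd = 1.5)  # error term: what the model can't explain
y     <- beta0 + beta1 * x + eps       # response generated from the model
```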
Linear regression relies on several key assumptions about the relationship between the predictor, the response, and the error term. The first is the linearity assumption, shown below:
\(\mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x\)
\(\mathbb{E}[Y \mid X = x]\) = the expected value of \(Y\) when the predictor \(X\) takes a specific value \(x\).
\(\varepsilon \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\)
i.i.d. = The errors are independent (one error does not influence another) and identically distributed (all errors come from the same distribution).
\(N(0, \sigma^2)\) = The errors follow a normal distribution with mean \(0\) and constant variance \(\sigma^2\). The constant-variance part is known as the homoscedasticity (constant variance) assumption.
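These error assumptions are usually checked visually. Reusing the simulated `x` and `y` from the sketch above, base R's built-in diagnostic plots give a quick look at constant variance and normality:

```r
fit <- lm(y ~ x)      # fit a simple linear regression to the simulated data

plot(fit, which = 1)  # residuals vs. fitted: a flat, even band suggests
                      # linearity and constant variance
plot(fit, which = 2)  # normal Q-Q plot: points along the line support
                      # normally distributed errors
```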
```r
library(MASS)  # provides the Boston housing dataset

model <- lm(medv ~ rm, data = Boston)
summary(model)
```
```
## 
## Call:
## lm(formula = medv ~ rm, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.346  -2.547   0.090   2.986  39.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -34.671      2.650  -13.08   <2e-16 ***
## rm             9.102      0.419   21.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.616 on 504 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4825 
## F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16
```
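The estimates in the summary can also be pulled out programmatically with base R accessors. A short sketch (in the Boston data, `medv` is recorded in thousands of dollars):

```r
coef(model)     # named vector with the (Intercept) and rm estimates
confint(model)  # 95% confidence intervals for both coefficients

# Per the summary above, each additional room is associated with an
# increase of roughly 9.1 (about $9,100) in median home value.
```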
```r
# Sort observations by rm so the fitted line can be drawn smoothly
sorted <- Boston[order(Boston$rm), ]
predicted_values <- predict(model, newdata = sorted)
```
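A sketch of how the interactive figure below could be built with the plotly package; the styling here is assumed, not a reproduction of the original plot:

```r
library(plotly)

plot_ly(Boston, x = ~rm, y = ~medv,
        type = "scatter", mode = "markers", name = "Observed") %>%
  add_lines(x = sorted$rm, y = predicted_values, name = "Fitted line") %>%
  layout(title = "Average Number of Rooms vs. Median Housing Value",
         xaxis = list(title = "Average number of rooms (rm)"),
         yaxis = list(title = "Median home value (medv, $1000s)"))
```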
Figure: Average Number of Rooms vs. Median Housing Value. This plotly plot shows how room count relates to home value.