What is Linear Regression?
- reflects the relationship between two quantifiable variables
- assumes the data are normally distributed and that the relationship is linear
- focus: statistical relationships
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
where:
\(y_i\) = dependent variable, often the observed result
\(x_i\) = independent variable, often the thing we change
\(\beta_0\) = intercept (when \(x\) is 0, what is the observed result, \(y\)?)
\(\beta_1\) = slope (change in \(y\) for a unit change in \(x\))
\(\varepsilon_i\) = the error term for the equation (noise, variation in the data)
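To make these symbols concrete, here is a minimal sketch (not part of the original slides) that fits this equation to R's built-in `trees` data, the same data used in the plots below; `lm()` returns the estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\):

```r
# Fit Volume as a linear function of Girth on the built-in trees data.
fit <- lm(Volume ~ Girth, data = trees)
coef(fit)  # estimated intercept (beta_0 hat) and slope (beta_1 hat)
```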
```r
# ggplot is used to plot graphs. aes stands for aesthetic, and can
# be used to specify your axes from the specified data. geom_point
# plots your data values, and geom_smooth is your line of fit (lm
# means linear model). se represents standard error, but we'll get
# back to that.
library(ggplot2)

ggplot(data = trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkorchid", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black")
```
## `geom_smooth()` using formula = 'y ~ x'
The black line represents the simple linear regression model: \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \] (the fitted line itself has no error term; the error describes how the observed points scatter around it).
In this case, our x value is girth.
The other values in the equation:
\(\hat{\beta}_0\): intercept - what is the predicted volume when girth is 0?
\(\hat{\beta}_1\): slope - what is the average rate of change of volume with increasing values of girth?
\(\hat{y}\): output - the predicted volume of the tree for a given girth value (see the sketch just after this list).
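As a quick illustration of the intercept and slope in action (my own sketch; the girth value of 10 is arbitrary), `predict()` turns a girth into the model's estimated volume \(\hat{y}\):

```r
fit <- lm(Volume ~ Girth, data = trees)
# Predicted volume (y hat) for a hypothetical tree with girth = 10
predict(fit, newdata = data.frame(Girth = 10))
```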
\(\varepsilon_i\)?
When we fit a line to data, not all points lie exactly on the line. The more scattered these points are, the more noise there is in the data.
The differences between the actual data and the predicted values on the line are called residuals.
\[ \varepsilon_i = y_i - \hat{y}_i \]
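A small check of this identity in R (a sketch of my own, not from the original slides): the residuals reported by `residuals()` are exactly the observed volumes minus the fitted volumes.

```r
fit <- lm(Volume ~ Girth, data = trees)
head(residuals(fit))  # first few residuals
# Residuals equal observed minus predicted values
all.equal(unname(residuals(fit)), unname(trees$Volume - predict(fit)))  # TRUE
```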
A confidence interval shows the range of values within which the true regression line is likely to lie.
For the purposes of this presentation, linear models typically use a 95% confidence interval.
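For the coefficients themselves, the 95% interval can be pulled out directly with `confint()` (an aside of mine, not shown in the original slides):

```r
fit <- lm(Volume ~ Girth, data = trees)
confint(fit, level = 0.95)  # 95% confidence intervals for the intercept and slope
```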
But let's look at the example on the next slide to see what that really means.
```r
model <- lm(Volume ~ Girth, data = trees)  # specifying our model again
trees$predicted <- predict(model)          # the points on the line
trees$residuals <- residuals(model)        # the differences between actual and predicted

ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkorchid", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  geom_segment(aes(xend = Girth, yend = predicted), color = "cornflowerblue")
```
## `geom_smooth()` using formula = 'y ~ x'
As you can see, a linear line of best fit works reasonably well here. But is linear the best choice? What other kinds of best-fit models could work?
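One alternative worth sketching (my own example, not from the original presentation) is a quadratic fit, which `geom_smooth()` can overlay by passing a different formula:

```r
library(ggplot2)
# Straight-line fit (black) versus a quadratic alternative (cornflower blue)
ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkorchid", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE,
              color = "cornflowerblue")
```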
https://www.scribbr.com/statistics/simple-linear-regression/