2026-02-11

Simple Linear Regression Defined

Scenario

Given a scatter plot of observed points, we can fit a line of best fit to predict a theoretical y-value for any x-value.

Required Assumptions

Before using linear regression, two assumptions about the sample must hold. The first is independence of errors: the error of one point cannot depend on the error of any other point. The second is that the errors should be centered around 0 in a general bell shape, what is generally called a normal distribution.

For every point:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
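
To make these assumptions concrete, here is a minimal sketch in R that simulates data obeying the model above; the intercept of 3, slope of 1.5, and error spread are made-up values purely for illustration.

set.seed(42)
x <- runif(50, min = 0, max = 10)   # predictor values
eps <- rnorm(50, mean = 0, sd = 2)  # errors: independent, centered at 0, bell-shaped
y <- 3 + 1.5 * x + eps              # beta_0 = 3 and beta_1 = 1.5 are assumed for illustration
head(data.frame(x, y))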

Example: MTCars Dataset - Code

Let’s graph the relationship between cars’ efficiency (MPG) and their weight. First, we obtain a linear model for mpg as a function of wt using the lm() function. Then we can use plot_ly() to create a new scatter plot, and add the line of best fit using the fitted values from the linear model.

library(plotly)

# Fit the linear model: mpg as a function of wt
regressioninfo <- lm(mpg ~ wt, data = mtcars)

wtvsmpgplot <- plot_ly(
  data = mtcars,
  x = ~wt, y = ~mpg,
  type = "scatter", mode = "markers", name = "Plotted Points"
) %>%
  # Overlay the fitted values from the linear model as a line
  add_lines(x = ~wt, y = fitted(regressioninfo), name = "Line of Best Fit") %>%
  layout(
    title = "MPG vs Weight",
    xaxis = list(title = "Weight"),
    yaxis = list(title = "Miles Per Gallon"),
    showlegend = TRUE
  )
wtvsmpgplot

Example: MTCars Dataset - Graph

Notice how the line of best fit doesn’t necessarily prioritize passing through every point; it simply minimizes the sum of squared errors across ALL points.
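
We can check this numerically. As a sketch, the fitted line’s residual sum of squares is smaller than that of any alternative line; the 0.1 slope nudge here is an arbitrary example.

# Residual sum of squares (RSS) of the fitted line
sum(residuals(regressioninfo)^2)

# Nudging the slope by an arbitrary 0.1 yields a larger RSS
b <- coef(regressioninfo)
sum((mtcars$mpg - (b[1] + (b[2] + 0.1) * mtcars$wt))^2)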

How Then Does lm() Find the Right Y-Intercept and Slope?

In DAT300, starting from the per-point linear equation seen previously, we could isolate the error and use partial derivatives to find the values of the y-intercept AND slope at which the sum of squared errors is smallest. \[ \small \text{Partial for intercept:} \quad \frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \\ \small \text{Partial for slope:} \quad \frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \] After setting the partials to 0, we can solve the resulting system of equations for both the y-intercept and slope.
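
Solving that system gives the closed-form estimates \( \hat{\beta}_1 = \operatorname{cov}(x, y)/\operatorname{var}(x) \) and \( \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \). As a sketch, we can compute them by hand and confirm they match what lm() reports:

# Normal-equation solution computed by hand for mpg ~ wt
b1 <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)  # slope
b0 <- mean(mtcars$mpg) - b1 * mean(mtcars$wt)      # y-intercept
c(intercept = b0, slope = b1)

coef(lm(mpg ~ wt, data = mtcars))                  # should match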

Linear Regression Shown Via ggplot2 Package - Code

There are some input parameters in ggplot2 that represent aspects of linear regression that plotly can’t show. For instance, geom_smooth() is a layer alongside the others; adding se = TRUE shades a band based on the standard error, something that I couldn’t find in plotly.

library(ggplot2)

ggplotmtvscars <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3) +
  # se = TRUE shades the standard-error band around the fitted line
  geom_smooth(method = "lm", se = TRUE, color = "orange") +
  labs(title = "MPG VS Weight", x = "Weight", y = "Miles per Gallon") +
  theme_light(base_size = 16)
ggplotmtvscars

Linear Regression Shown Via ggplot2 Package - Plot

While we did find the line with the least squared error, that line is only an estimate from one sample. Thus, there is a level of uncertainty to be calculated, representing a range in which the real values could be seen.
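
As a sketch of that range, predict() can report a confidence interval around the fitted line; the wt value of 3 here is an arbitrary example input.

# Confidence interval for mean mpg at a weight of 3 (i.e., 3,000 lbs)
predict(regressioninfo, newdata = data.frame(wt = 3),
        interval = "confidence")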

Linear Regression Using factor() Function

Alongside showing the standard error, we can organize the graph more easily in ggplot2 than in plotly to show other factors, such as the number of car cylinders.
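
A minimal sketch of what that could look like, grouping points and fits by factor(cyl); the title and sizing are my own choices:

# Sketch: color points and fit separate lines by cylinder count
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "MPG vs Weight by Cylinders",
       x = "Weight", y = "Miles per Gallon", color = "Cylinders") +
  theme_light(base_size = 16)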