2026-02-04

Simple Linear Regression

Simple linear regression is a statistical approach used to model the relationship between one dependent variable and one independent variable. To do this, we fit a straight line to the data, choosing the line that minimizes the sum of squared differences between the predicted values and the actual data points (ordinary least squares).

An extension of this approach, called multiple linear regression, works with several independent variables instead of just one, but for now we’ll focus on simple linear regression.

Linear Regression Equation

The equation used for linear regression is: \({Y = \beta_{0} + \beta_{1}X + \epsilon}\)

The terms in this equation are as follows:

  • \({Y}\) is the dependent variable
  • \({X}\) is the independent variable
  • \({\beta_{0}}\) is the y-intercept
  • \({\beta_{1}}\) is the slope of the regression line
  • \({\epsilon}\) is the error term
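
To make the role of each term concrete, here is a minimal sketch in R that builds a few values of Y directly from the equation. The intercept of 2, the slope of 3, and the small error values are made up purely for illustration and do not come from any dataset.

```r
# Hypothetical values chosen only for illustration
beta0 <- 2                            # y-intercept
beta1 <- 3                            # slope of the line
x     <- 1:5                          # independent variable
eps   <- c(0.5, -0.5, 0, 0.5, -0.5)   # error term for each observation

# Each observation is the point on the line plus its own error
y <- beta0 + beta1 * x + eps
print(y)
```

Without the error term, every point would sit exactly on the line; the errors are what scatter the observations around it.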

Using Linear Regression in R

Using built-in functions in R, we can fit a linear regression to a given dataset, as long as we provide an independent variable and a dependent variable.

We’ll take the built-in cars dataset as an example. First, let’s use the head() function to get an idea of what the data looks like.

head(cars)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
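
Since head() only shows the first six rows, a quick look at the size of the data and at how strongly the two columns move together can also help before fitting anything. The following is a small sketch using base R only; the variable name r is just for illustration.

```r
# cars ships with base R, so no packages are needed
dim(cars)    # number of rows and columns
str(cars)    # column names and types

# Linear correlation between the two columns
r <- cor(cars$speed, cars$dist)
print(r)
```

A correlation close to 1 suggests that a straight line is a reasonable first model for this data.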

Linear Regression with lm()

As we can see, the cars dataset has two columns: speed and dist (distance). The lm() function in R fits a linear model given a formula and a dataset. In this example, let’s use distance as the dependent variable and speed as the independent variable:

lmcars <- lm(formula = dist ~ speed, data = cars)
summary(lmcars)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
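
The estimates in the Coefficients table do not have to be copied from the printout. As a rough sketch, coef() pulls them out of the fitted model, and for a single predictor they can also be recomputed from the usual closed-form least squares formulas (slope = covariance divided by variance of the predictor). The names est, b0, and b1 are illustrative only.

```r
lmcars <- lm(dist ~ speed, data = cars)

# Named vector with entries "(Intercept)" and "speed"
est <- coef(lmcars)
print(est)

# Closed-form least squares estimates for one predictor
b1 <- cov(cars$speed, cars$dist) / var(cars$speed)
b0 <- mean(cars$dist) - b1 * mean(cars$speed)

# Both routes should agree up to floating-point error
all.equal(unname(est), c(b0, b1))
```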

Linear Regression with lm() (cont.)

So, using the results of lm(), we can fill in the coefficients of our equation \({Y = \beta_{0} + \beta_{1}X + \epsilon}\) with their estimates:

  • \({\beta_{0} = -17.5791}\)
  • \({\beta_{1} = 3.9324}\)
  • \({\epsilon}\) cannot be observed directly, so lm() does not report it; the residuals shown in the summary are its estimates.
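
As a small sketch of what the residuals are, the code below takes them from the model with residuals() and compares them with the observed distances minus the fitted values. The names res and manual are illustrative.

```r
lmcars <- lm(dist ~ speed, data = cars)

res    <- residuals(lmcars)             # one residual per row of cars
manual <- cars$dist - fitted(lmcars)    # observed minus fitted

all.equal(unname(res), unname(manual))
range(res)   # should match Min and Max in the summary output
sum(res)     # essentially zero when the model includes an intercept
```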

So, the fitted regression equation for this example is \({\hat{Y} = -17.5791 + 3.9324X}\), where \({\hat{Y}}\) is the predicted distance for a given speed. However, the best way to see linear regression in action is through a plot, and luckily R has a few ways of doing just that.
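
Before plotting, the equation can also be used to predict a distance for a given speed. The sketch below does this with predict() and again by plugging the rounded coefficients into the equation by hand; the speed of 10 is an arbitrary value chosen for illustration.

```r
lmcars <- lm(dist ~ speed, data = cars)

new_speed <- data.frame(speed = 10)              # hypothetical input
pred <- predict(lmcars, newdata = new_speed)     # uses the unrounded coefficients

by_hand <- -17.5791 + 3.9324 * 10                # rounded coefficients from the summary
print(c(predict = unname(pred), by_hand = by_hand))
```

The two numbers should differ only by the rounding of the coefficients.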

Linear Regression with plot_ly()

The plotly package can be used to create a scatter plot of our data and add our regression line to it, as long as we provide the data and model. For example, using the cars dataset and the model we came up with a couple of slides back, we can create the following plot:

library(plotly)

plot_ly(data = cars, x = ~speed, y = ~dist, type = "scatter", mode = "markers") %>%
  add_lines(x = ~speed, y = fitted(lmcars)) %>%
  layout(xaxis = list(title = "Speed"), yaxis = list(title = "Distance"))

Linear Regression with ggplot()

The ggplot2 package can also be used to plot linear regression lines along with scatter plots. In fact, in this case, it’s even simpler, as you don’t even have to provide the linear model first; all you need is the dataset, the independent variable, and the dependent variable.

library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE) + theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Linear Regression with ggplot() (cont.)

Lastly, if we want to include the fitted regression equation with the graph, we can use the stat_regline_equation() function from the ggpubr package.

library(ggplot2)
library(ggpubr)
ggplot(cars, aes(x = speed, y = dist)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE) + theme_bw() + stat_regline_equation()
## `geom_smooth()` using formula = 'y ~ x'

Note that the equation displayed by stat_regline_equation() matches the fitted equation we obtained from lm() a few slides earlier, apart from the rounding of the coefficients.
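
If a label with a chosen number of decimals is preferred, one option is to build the equation text from the model coefficients with base R. This is only a sketch: the two-decimal format and the name eq_label are assumptions for illustration, not something the ggpubr function requires.

```r
lmcars <- lm(dist ~ speed, data = cars)
est <- coef(lmcars)

# Format intercept and slope with two decimals
eq_label <- sprintf("y = %.2f + %.2f x", est[["(Intercept)"]], est[["speed"]])
print(eq_label)
```

Note that this simple format would print "+ -" for a negative slope, so it would need adjusting for other datasets.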

The End

Any questions?