2024-11-18

Definition

Simple linear regression is a statistical technique for studying the relationship between one dependent variable (denoted y) and one independent variable (denoted x). This relationship is summarized by the “best fitting line,” which is described by the equation \(\hat{y} = \beta_0 + \beta_1 x\) where:

  • \(\hat{y}\) is the expected value of the dependent variable,
  • \(\beta_0\) is the best fitting line’s intercept with the y axis,
  • \(\beta_1\) is the best fitting line’s slope, and
  • x is the value of the independent variable.
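
As a minimal sketch of how these quantities map onto R, the following fits a line to a handful of made-up points and reads off the intercept and slope. The vectors `x` and `y` are hypothetical illustration data, not taken from any dataset; `lm()`, `coef()`, and `predict()` are base R functions.

```r
# Hypothetical illustration data (an assumption, not from the original example)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit y = beta_0 + beta_1 * x by least squares
fit <- lm(y ~ x)

beta_0 <- unname(coef(fit)["(Intercept)"])  # intercept with the y axis
beta_1 <- unname(coef(fit)["x"])            # slope of the line

# Expected value of y at a new x, computed directly from the equation
x_new <- 6
y_hat <- beta_0 + beta_1 * x_new

# The same value obtained through predict()
y_hat_predict <- unname(predict(fit, newdata = data.frame(x = x_new)))
```

Computing `y_hat` by hand and through `predict()` should give the same number, which is a quick check that the coefficients are being read correctly.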

Best Fitting Line Identification

Many different lines could be drawn through the same data. To choose among candidate lines, the sum of squared errors is used to measure each line’s total deviation from the observed values, where each error is the vertical distance between an observed value of y and the value the line predicts for it. The line with the lowest sum of squares, indicating the least deviation from the data provided, is used as the best fitting line.

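The sum of squares is calculated using the formula \(\sum_{i} (y_i - \hat{y}_i)^2\), where \(y_i\) is the i-th observed value and \(\hat{y}_i\) is the value the line predicts for it.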
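
The comparison between candidate lines can be sketched in R as follows. The data vectors and the two candidate lines are hypothetical and chosen only for illustration; `sum_of_squares()` is a helper defined here, not a library function.

```r
# Hypothetical illustration data (an assumption, not from the original example)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared errors for the candidate line y = b0 + b1 * x
sum_of_squares <- function(b0, b1) {
  errors <- y - (b0 + b1 * x)
  sum(errors^2)
}

# Two arbitrary candidate lines
ss_candidate_a <- sum_of_squares(0, 2)
ss_candidate_b <- sum_of_squares(1, 1.5)

# The least-squares line found by lm()
fit <- lm(y ~ x)
ss_fitted <- sum_of_squares(coef(fit)[[1]], coef(fit)[[2]])
```

Because `lm()` minimizes this quantity, `ss_fitted` should be no larger than the value for any other candidate line.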

Miles per Gallon vs. Gross Horsepower

An example of simple linear regression can be created using the mtcars dataset, which ships with R in the datasets package. Applying the previous equation, the best fitting line is defined as \(\widehat{MPG} = \beta_0 + \beta_1 \cdot Horsepower\).
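
A minimal sketch of estimating the two coefficients for this model is shown here. The 150-horsepower car is a hypothetical input used only to show how the equation is applied. On the standard mtcars data the intercept comes out near 30.1 and the slope near -0.068.

```r
data(mtcars)

# Regress miles per gallon on gross horsepower
mod <- lm(mpg ~ hp, data = mtcars)

# coef() returns beta_0 (intercept) and beta_1 (slope for hp)
betas <- coef(mod)
print(betas)

# Expected MPG for a hypothetical car with 150 horsepower
mpg_150 <- betas[["(Intercept)"]] + betas[["hp"]] * 150
```

The negative slope means each additional unit of horsepower is associated with a slightly lower expected MPG.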

The data and the fitted line can be plotted with the plotly library. The code for the plot is:

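library(plotly)  # provides plot_ly(), add_lines(), layout(), and %>%

# Fit a simple linear regression of miles per gallon on horsepower
data(mtcars)
y <- mtcars$mpg
x <- mtcars$hp
mod <- lm(y ~ x)

# Axis titles and fonts
xax <- list(
  title = "Gross Horsepower",
  titlefont = list(family = "Modern Computer Roman")
)

yax <- list(
  title = "Miles Per Gallon",
  titlefont = list(family = "Modern Computer Roman")
)

# Scatter plot of the data with the fitted regression line overlaid
plot_ly(x = x, y = y, type = "scatter", mode = "markers") %>%
  add_lines(x = x, y = fitted(mod)) %>%
  layout(xaxis = xax, yaxis = yax)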

Positive and Negative Correlation

The dependent and independent variables can be negatively correlated, as with miles per gallon and horsepower in the previous example, or positively correlated. The following code plots horsepower against displacement using ggplot2 and demonstrates a positive correlation:

library(ggplot2)
# Scatter plot of horsepower against displacement with a fitted regression line
g <- ggplot(data = mtcars, aes(x = disp, y = hp)) + geom_point()
g + geom_smooth(method = "lm") + xlab("Displacement") + ylab("Gross Horsepower")
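
The direction of each relationship can also be checked numerically. The sketch below uses base R's `cor()` and `lm()`; the variable names are introduced here for illustration.

```r
data(mtcars)

# Pearson correlation between horsepower and miles per gallon
r_hp_mpg <- cor(mtcars$hp, mtcars$mpg)

# Pearson correlation between displacement and horsepower
r_disp_hp <- cor(mtcars$disp, mtcars$hp)

# The slope of a simple linear regression has the same sign as the correlation
slope_hp_mpg <- coef(lm(mpg ~ hp, data = mtcars))[["hp"]]
slope_disp_hp <- coef(lm(hp ~ disp, data = mtcars))[["disp"]]
```

A negative value of `r_hp_mpg` and a positive value of `r_disp_hp` correspond to the downward and upward sloping lines in the two plots.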

Strong Relationship

Some pairs of variables exhibit a strong linear relationship, with points that cluster closely around the fitted line and a narrow confidence interval.
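
One way to quantify the strength of a linear relationship is the R-squared of the fitted model. The variables behind the original plot are not recorded here, so the pair used below (displacement and weight from mtcars) is an assumption chosen for illustration.

```r
data(mtcars)

# Assumed example pair: displacement regressed on weight
strong_fit <- lm(disp ~ wt, data = mtcars)

# R-squared: share of the variance in disp explained by the line
strong_r2 <- summary(strong_fit)$r.squared

# In simple linear regression, R-squared equals the squared correlation
strong_r <- cor(mtcars$wt, mtcars$disp)
```

An R-squared close to 1 indicates that the line accounts for most of the variation in the dependent variable.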

Weak Relationship

Alternatively, there may not be a strong linear relationship between two variables, which appears in a plot as a wide confidence interval around the fitted line and outlying values far from it.
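
A weak relationship can be sketched numerically with a low R-squared and a wide confidence interval for the slope. The variables behind the original plot are not recorded here, so the pair used below (quarter-mile time and rear axle ratio from mtcars) is an assumption chosen for illustration.

```r
data(mtcars)

# Assumed example pair: quarter-mile time regressed on rear axle ratio
weak_fit <- lm(qsec ~ drat, data = mtcars)

# R-squared: share of the variance in qsec explained by the line
weak_r2 <- summary(weak_fit)$r.squared

# 95% confidence interval for the slope
slope_ci <- confint(weak_fit, "drat", level = 0.95)

# TRUE when the interval contains zero
slope_ci_includes_zero <- slope_ci[1] < 0 && slope_ci[2] > 0
```

When the slope interval spans zero, the data do not rule out a flat line, which is consistent with a weak linear relationship.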