2025-03-16

Linear Regression

Linear regression estimates a line of the form \(y = m\cdot x + b\) that best fits the given data, where the slope \(m\) and intercept \(b\) are estimated according to the following equations.

\[m = \frac{S_{xy}}{S_{xx}} = \frac{\Sigma(x_i - \bar{x})(y_i-\bar{y})}{\Sigma(x_i-\bar{x})^2}\] \[b = \bar{y} - m\cdot\bar{x}\] \[\bar{y} = \frac{\Sigma y_i}{n} \quad ; \quad \bar{x} = \frac{\Sigma x_i}{n}\]

Thus, \(\hat{y} = m\cdot x + b\), where \(\hat{y}\) is the value of \(y\) estimated by the model.

Additionally, \(y_i = m\cdot x_i + b + \epsilon_i\), where \(\epsilon_i\) is the error of the model for observation \(i\).
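
As a check on the formulas above, the slope and intercept can be computed by hand and compared with the result of lm(). This is a minimal sketch, assuming R's built-in mtcars data with cyl as \(x\) and mpg as \(y\); the names s_xy, s_xx, m, b, and fit are chosen here for illustration only.

```r
# Sketch: least-squares slope and intercept computed from the formulas
x <- mtcars$cyl   # predictor (assumed for illustration)
y <- mtcars$mpg   # response

x_bar <- mean(x)
y_bar <- mean(y)

s_xy <- sum((x - x_bar) * (y - y_bar))  # numerator of the slope
s_xx <- sum((x - x_bar)^2)              # denominator of the slope

m <- s_xy / s_xx        # slope
b <- y_bar - m * x_bar  # intercept

# lm() returns the intercept first, then the slope
fit <- lm(mpg ~ cyl, data = mtcars)
all.equal(c(b, m), unname(coef(fit)))  # TRUE when the two approaches agree
```

The comparison with all.equal() allows for floating-point rounding, so it is a more reliable check than testing with ==.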

Example using mtcars

We can use linear regression to find the extent of linear relationships between variables.

Using the coefficient of determination \(r^2\), which is computed from sums of squares, we can see which variable the miles per gallon (MPG) of cars depends on most strongly. The variables tested will be:

  1. Number of Cylinders

  2. Horsepower

  3. Drat (rear axle ratio)

  4. Weight

Linear Dependence

Linear dependence can be measured using the model’s \(r^2\) value, which is calculated as follows:

\[r^2 = 1 - \frac{SS_{residual}}{SS_{Total}}=1-\frac{\Sigma (y_i-\hat{y}_i)^2}{\Sigma (y_i - \bar{y})^2}\]

The closer \(r^2\) is to 1, the stronger the linear dependence. As a general guide:

\(r^2 > 0.7\): strong linear dependence

\(0.7 > r^2 > 0.4\): moderate linear dependence

\(0.4 > r^2\): weak linear dependence
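
The \(r^2\) formula can be reproduced directly from the residual and total sums of squares. The following is a small sketch, assuming the mpg vs. cyl model on the built-in mtcars data; the names ss_residual, ss_total, and r2 are illustrative.

```r
# Sketch: r^2 computed by hand from the sums of squares
fit <- lm(mpg ~ cyl, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(fit)  # model estimates for each observation

ss_residual <- sum((y - y_hat)^2)    # variation left unexplained by the model
ss_total    <- sum((y - mean(y))^2)  # total variation around the mean

r2 <- 1 - ss_residual / ss_total

# Compare with the value reported by summary()
all.equal(r2, summary(fit)$r.squared)
```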

R code

Here is the R code that fits the linear regression and plots it using ggplot2:

library(ggplot2) # provides ggplot(), geom_point(), geom_smooth(), and annotate()

data(mtcars) # load data

model <- lm(mpg ~ cyl, data = mtcars) # store linear model of mpg vs cyl in "model"

# This code obtains the slope, intercept, and r^2 value for labeling on the plot
info <- substitute(y == b + m %.% x*","~~r^2~"="~rsquare,
        list(b = format(unname(coef(model)[1]), digits = 2),
             m = format(unname(coef(model)[2]), digits = 2),
             rsquare = format(summary(model)$r.squared, digits = 3)))
eq <- as.character(as.expression(info)) # coerce into character for labeling

## ggplot can also create a linear regression model of mpg vs cyl
ggplot(mtcars, aes(cyl, mpg)) +                             # define data set, x, and y
  geom_point() +                                            # scatter plot
  geom_smooth(method = 'lm') +                              # linear model
  annotate("text", x = 8, y = 30, label = eq, parse = TRUE) # draw the label once, not once per row
## `geom_smooth()` using formula = 'y ~ x'

Cylinders vs. Miles per Gallon with Regression

Horsepower vs. Miles per Gallon with Regression

Drat (rear axle ratio) vs. Miles per Gallon with Regression

Weight vs. Miles per Gallon with Regression
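
The \(r^2\) values behind the four plots can also be collected in one pass. This is a sketch only, assuming the four tested variables correspond to the mtcars columns cyl, hp, drat, and wt; the names predictors and r2_values are illustrative.

```r
# Sketch: fit one single-variable model per predictor and collect r^2
predictors <- c("cyl", "hp", "drat", "wt")

r2_values <- sapply(predictors, function(p) {
  # reformulate() builds the formula mpg ~ <p> from the column name
  fit <- lm(reformulate(p, response = "mpg"), data = mtcars)
  summary(fit)$r.squared
})

# Named vector of r^2 values, strongest linear dependence first
round(sort(r2_values, decreasing = TRUE), 3)
```

Sorting the named vector makes it easy to see which predictors fall into the strong, moderate, or weak range.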

Miles per gallon dependencies

  • Miles per gallon is most strongly linearly dependent on number of cylinders and weight, with \(r^2\) values of 0.726 and 0.753, respectively.
  • Regressing on two independent variables at once (3-dimensional linear regression) can yield a more accurate model than 2-dimensional regression, and will be done with the independent variables “cylinders” and “weight”.
    • The two single-variable models should not be combined directly, because the independent variables may themselves be related (a possible confounding relationship).
  • 3-dimensional regression finds a plane of the form \(z = m_1\cdot x + m_2\cdot y + b\), here with \(z\) as MPG, \(x\) as cylinders, and \(y\) as weight, and can be done using lm() in R.
  • The regression can be viewed as a plane on a 3-dimensional plot from plotly.
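
Before plotting, the two-variable fit can be inspected numerically. The sketch below assumes the same mtcars columns (cyl and wt); the name fit2 is illustrative, and the correlation is offered only as a rough indication of how much the two independent variables overlap.

```r
# Sketch: fit the plane mpg = m1*cyl + m2*wt + b
fit2 <- lm(mpg ~ cyl + wt, data = mtcars)

coef(fit2)               # intercept b, then m1 (cyl) and m2 (wt)
summary(fit2)$r.squared  # r^2 of the two-variable model

# Correlation between the two independent variables; a value far from 0
# suggests they carry overlapping information about mpg
cor(mtcars$cyl, mtcars$wt)
```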

Multiple Regression Plot

Multiple linear regression plotting code

library(plotly) # provides plot_ly(), add_surface(), add_trace(), and %>%

model <- lm(mpg ~ cyl + wt, data = mtcars) # fit mpg against cylinders and weight
coefs <- unname(coef(model))               # intercept, cyl slope, wt slope
rsquare <- summary(model)$r.squared
# The old technique doesn't work, as plotly cannot parse the equation like ggplot,
# so the equation and r^2 are built as plain text from the fitted model
eq <- sprintf("%.1f*Cylinders %+.1f*Weight %+.0f ; r^2 = %.3f",
              coefs[2], coefs[3], coefs[1], rsquare)

x <- seq(4, 8, by = 0.2); y <- seq(1, 6, by = 0.25) # Vectors of same length
Cylinders <- matrix(rep(x, length(y)), nrow = length(x), byrow = TRUE)
Wt <- matrix(rep(y, length(x)), ncol = length(y), byrow = FALSE) # Create mesh grids
MPG <- Cylinders * coef(model)[2] + Wt * coef(model)[3] + coef(model)[1]
xax <- list(title = "Cylinders"); yax <- list(title = "Weight (1000 lbs)")
zax <- list(title = "MPG")
fig <- plot_ly(x = Cylinders, y = Wt, z = MPG, showscale = FALSE) %>%
  layout(scene = list(xaxis = xax, yaxis = yax, zaxis = zax))
fig <- fig %>% add_surface(z = MPG) # add plane
fig <- fig %>%
  add_trace(data = mtcars, x = mtcars$cyl, y = mtcars$wt, z = mtcars$mpg,
            mode = "markers", type = "scatter3d",
            marker = list(size = 4, color = "red", symbol = 10)) %>% # add data points
  layout(title = 'MPG vs Weight vs Cylinders')
fig <- fig %>% add_annotations(x = 1.15, y = .9, z = 15, text = eq, showarrow = FALSE) # label
fig

Conclusions

The MPG of a car appears to be strongly linearly dependent on both number of cylinders and weight. Accounting for both variables in a multiple linear regression strengthens the fit: the \(r^2\) value rises from 0.726 (cylinders alone) and 0.753 (weight alone) to 0.830.