What Is Linear Regression?

  • A model that estimates the relationship between a dependent variable and one or more independent variables by fitting a linear equation to data
  • Used to predict the outcome for new, unseen observations
  • Example: An analyst could use a data set of employee salaries to model how years of work experience (independent variable) affect a person’s salary (dependent variable), then predict people’s salaries from their experience

Simple Linear Regression Formula

  • The formula for simple linear regression is shown below \[ y = \beta_0 + \beta_1 x \]

  • \(y\) = dependent variable

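  • \(\beta_0\) = intercept (the value of \(y\) when \(x = 0\))

  • \(\beta_1\) = regression coefficient (the slope: the change in \(y\) for a one-unit increase in \(x\))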

  • \(x\) = independent variable
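As a minimal sketch of how the symbols map onto R output, the snippet below fits the formula with base R's lm() on the built-in mtcars data, taking mpg as \(y\) and hp as \(x\). The 150-horsepower car is a hypothetical input chosen only for illustration.

```r
# Fit mpg = beta_0 + beta_1 * hp by least squares
fit <- lm(mpg ~ hp, data = mtcars)

# coef() returns a named vector: "(Intercept)" is beta_0, "hp" is beta_1
beta  <- coef(fit)
beta0 <- beta[["(Intercept)"]]
beta1 <- beta[["hp"]]

# Predict mpg for a hypothetical 150-horsepower car in two equivalent ways
by_hand    <- beta0 + beta1 * 150
by_predict <- predict(fit, newdata = data.frame(hp = 150))
```

Both by_hand and by_predict hold the same value, since predict() simply evaluates the fitted formula at the new \(x\).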

Sample Code for Linear Regression Model

Using ggplot2’s stat_smooth(method = "lm") to fit and draw the regression line

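library(ggplot2)

# mtcars ships with base R (datasets package)
data(mtcars)

# Keep only the two columns used here and drop any rows with missing values
clean_mtcars <- na.omit(mtcars[, c("mpg", "hp")])

# Scatter plot only
fig_no_model <- ggplot(clean_mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles Per Gallon")

# Same scatter plot with a least-squares line (and its confidence band) fitted by lm()
fig_model <- ggplot(clean_mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", color = "red") +
  labs(x = "Horsepower", y = "Miles Per Gallon")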

Plot Without Fitted Model

fig_no_model

Plot With Fitted Model

fig_model
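The red line in the fitted plot comes from ordinary least squares, which is what lm() computes when stat_smooth(method = "lm") calls it. As a rough check, and assuming the standard one-predictor closed-form estimates, the slope and intercept can be reproduced by hand:

```r
x <- mtcars$hp
y <- mtcars$mpg

# Closed-form ordinary least squares estimates for a single predictor
beta1_hat <- cov(x, y) / var(x)              # slope
beta0_hat <- mean(y) - beta1_hat * mean(x)   # intercept

# The same model fitted with lm()
fit <- lm(mpg ~ hp, data = mtcars)

# Compare the hand-computed estimates with lm()'s coefficients
same <- all.equal(unname(coef(fit)), c(beta0_hat, beta1_hat))
```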

Multiple Linear Regression

The formula for multiple linear regression is shown below: \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \]

  • \(y\) = dependent variable
  • \(\beta_0\) = intercept
  • \(\beta_1, \beta_2, \dots, \beta_n\) = regression coefficients
  • \(x_1, x_2, \dots, x_n\) = independent variables
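With several predictors, the coefficients can be estimated all at once in matrix form. The sketch below assumes ordinary least squares and solves the normal equations \((X^\top X)\beta = X^\top y\) on the built-in mtcars data; the choice of hp and wt as predictors is only for illustration.

```r
# Design matrix: a column of 1s for the intercept, then one column per predictor
X <- cbind(1, mtcars$hp, mtcars$wt)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y for the coefficient vector
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() fits the same model (it uses a QR decomposition, which is numerically more stable)
fit <- lm(mpg ~ hp + wt, data = mtcars)
```

Here beta_hat is a 3 x 1 matrix holding \(\beta_0\), \(\beta_1\), and \(\beta_2\) in that order, and it agrees with coef(fit) up to rounding error.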

Code for Multiple Linear Regression Model

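library(MASS)    # Boston housing data
library(dplyr)   # %>% pipe
library(plotly)  # interactive 3D plot

data(Boston)

# medv = median home value ($1000s), rm = average rooms per dwelling,
# age = % of owner-occupied units built before 1940
clean_Boston <- na.omit(Boston[, c("medv", "rm", "age")])

# Fit medv as a linear function of rm and age
model <- lm(medv ~ rm + age, data = clean_Boston)

# Build a 40 x 40 grid covering the observed range of both predictors
grid_rm  <- seq(min(clean_Boston$rm),  max(clean_Boston$rm), length.out = 40)
grid_age <- seq(min(clean_Boston$age), max(clean_Boston$age), length.out = 40)
grid <- expand.grid(rm = grid_rm, age = grid_age)

# Predicted medv at every grid point (this is the regression plane)
grid$pred_medv <- predict(model, newdata = grid)

# 3D scatter plot of the observed data
fig <- plot_ly(
  data = clean_Boston,
  x = ~rm,
  y = ~age,
  z = ~medv,
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 3, color = ~medv, opacity = 0.6)
)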

Code for Multiple Linear Regression Model Continued

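# Overlay the fitted regression plane; inherit = FALSE keeps the
# scatter-only settings (mode, marker) from being passed to the mesh trace
fig <- fig %>% add_trace(
  x = grid$rm,
  y = grid$age,
  z = grid$pred_medv,
  type = "mesh3d",
  opacity = 0.5,
  inherit = FALSE
) %>%
layout(
  scene = list(
    # Axis titles match the plotted variables: x = rm, y = age, z = medv
    xaxis = list(title = "Avg # of Rooms Per Dwelling"),
    yaxis = list(title = "% of Units Built Before 1940"),
    zaxis = list(title = "Median Home Value ($1000s)")
  ),
  title = "Multiple Linear Regression: Median Boston Home Value vs Rooms & Age"
)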
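The regression plane is just predict() evaluated over a grid, so the same call works for a single point. A minimal sketch, assuming a hypothetical tract with an average of 6 rooms per dwelling and 50% of units built before 1940:

```r
library(MASS)  # provides the Boston data set

model <- lm(medv ~ rm + age, data = Boston)

# Hypothetical tract, chosen only for illustration
new_tract <- data.frame(rm = 6, age = 50)

# Predicted median home value (in $1000s) from the fitted model
pred <- predict(model, newdata = new_tract)

# The same number computed directly from the multiple regression formula
b <- coef(model)
by_hand <- b[["(Intercept)"]] + b[["rm"]] * 6 + b[["age"]] * 50
```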

Interactive Plot of Multiple Linear Regression Model

Applications of Linear Regression

  • Serves as a simple, interpretable baseline model in machine learning
  • Useful for forecasting and predicting numeric outcomes
  • Used to quantify relationships between variables, such as the effect of fertilizer amount on crop yield
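To make the fertilizer example concrete, here is a small sketch on simulated data. The numbers are made up for illustration (a true effect of 0.3 yield units per kg of fertilizer plus random noise) and are not real field-trial results.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated (not real) trial: yield rises 0.3 units per kg of fertilizer, plus noise
fertilizer <- seq(0, 50, length.out = 50)
crop_yield <- 20 + 0.3 * fertilizer + rnorm(50, sd = 2)
trial <- data.frame(fertilizer, crop_yield)

fit <- lm(crop_yield ~ fertilizer, data = trial)

# Estimated effect of one extra kg of fertilizer, with a 95% confidence interval
slope <- coef(fit)[["fertilizer"]]
ci    <- confint(fit, "fertilizer", level = 0.95)
```

The slope estimates the relationship, and the confidence interval indicates how precisely the data pin it down.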