2024-10-19

Simple Linear Regression Intro

  • Simple linear regression models the expected relationship between two variables, an independent variable X and a dependent variable Y
  • The relationship is modeled by fitting a straight line to the Y vs. X data
  • It is often used to make predictions and to understand how the two variables are related
  • Because it is simpler than other relationship-modeling techniques, it is a fundamental and widely used technique

Math Behind Simple Linear Regression (Least Squares)

  • The goal is to find a linear equation, where \(\hat{y}\) is the estimated value of \(y\), \(x\) is the independent variable, \(m\) is the slope, and \(b\) is the y-intercept, such that the sum of the squares of \(\hat{y} - y\) over all observed \(x\) values is minimized (the least squares model)

\(\hat{y} = mx + b\)

\(b = \bar{y} - m\bar{x}\)

\(m = {\sum(x_i-\bar{x})(y_i-\bar{y}) \over \sum(x_i-\bar{x})^2}\)
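
  • A sketch of where these formulas come from, assuming the sum of squared errors is written as \(S(m, b)\) (a symbol introduced here only for illustration): set both partial derivatives of \(S\) to zero and solve

\(S(m, b) = \sum (y_i - (mx_i + b))^2\)

\({\partial S \over \partial b} = -2\sum (y_i - mx_i - b) = 0 \;\Rightarrow\; b = \bar{y} - m\bar{x}\)

\({\partial S \over \partial m} = -2\sum x_i(y_i - mx_i - b) = 0\)

  • Substituting \(b = \bar{y} - m\bar{x}\) into the second condition gives \(\sum x_i((y_i - \bar{y}) - m(x_i - \bar{x})) = 0\), so

\(m = {\sum x_i(y_i-\bar{y}) \over \sum x_i(x_i-\bar{x})}\)

  • Because \(\sum(y_i-\bar{y}) = 0\) and \(\sum(x_i-\bar{x}) = 0\), replacing \(x_i\) with \((x_i-\bar{x})\) in both sums leaves them unchanged, which yields the slope formula above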

Example of Math Using Least Squares

  • Let's walk through the following data set, Salary vs. Age
##   Salary Age
## 1  50000  23
## 2  80000  28
## 3 110000  36
  • First, let's find \(m\), the slope of the linear regression model, where \(\bar{x} = 29\) (mean age) and \(\bar{y} = 80000\) (mean salary)

\(m = {\sum(x_i-\bar{x})(y_i-\bar{y}) \over \sum(x_i-\bar{x})^2}\)

\(= {(23 - 29)(50000-80000) + (28-29)(80000-80000)+(36-29)(110000 - 80000) \over (23 - 29)^2+(28-29)^2+(36-29)^2}\)

\(= {180000+0+210000 \over 36+1+49} = {390000 \over 86} \approx 4534.88\)

  • cont. next page

Example of Math Cont.

  • Next, let's find \(b\), the y-intercept of the linear regression model, where \(\bar{x} = 29\), \(\bar{y} = 80000\), and \(m \approx 4534.88\)

\(b = \bar{y} - m\bar{x} = 80000 - 4534.88(29) \approx -51512\)

  • Now that we have found \(b\) and \(m\), we can write out our simple linear regression equation

\(\hat{y} = 4534.88x - 51512\)
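
  • The hand calculation can be cross-checked in R. This is a minimal sketch: the vector names age and salary and the object names m, b, and fit are chosen here for illustration, and the data are the three rows of the Salary vs Age table

```r
# Data from the Salary vs Age table
age    <- c(23, 28, 36)
salary <- c(50000, 80000, 110000)

# Slope and intercept from the least squares formulas
m <- sum((age - mean(age)) * (salary - mean(salary))) / sum((age - mean(age))^2)
b <- mean(salary) - m * mean(age)

# The same fit using R's built-in lm()
fit <- lm(salary ~ age)

print(c(slope = m, intercept = b))
print(coef(fit))
```

  • Both print statements should report the same slope and intercept, since lm() minimizes the same sum of squares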

Now Let's Use R and R Libraries to Present Linear Regression Models

  • In this slide and the next, we will use the ggplot2 library to plot a linear regression model
  • We will use mpg, a data set built into ggplot2, and model highway MPG vs. engine displacement

Understanding the Code Behind the Plot

library(ggplot2)
g <- ggplot(data = mpg, aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Engine Displacement (Liters)") +
  ylab("Highway MPG (Miles/Gallon)")
print(g)
  • First we load the ggplot2 library
  • Next we create a ggplot object and set the data, x axis, and y axis
  • Then we use geom_point() to add the data points to the graph, creating a scatter plot
  • After that, geom_smooth() adds a fitted line using the lm method (linear regression); se = FALSE hides the confidence band around the line
  • Finally we add axis labels and display the plot
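
  • The plot shows the fitted line but not its equation. As a rough sketch, the same model can be fitted directly with lm() to read off the intercept and slope; the object name fit is chosen here for illustration, and the sketch assumes geom_smooth() is using its default formula y ~ x

```r
library(ggplot2)  # provides the mpg data set

# Fit hwy as a linear function of displ, the same relationship
# that geom_smooth(method = "lm") draws on the plot
fit <- lm(hwy ~ displ, data = mpg)

# Intercept (b) and slope (m) of the fitted line
print(coef(fit))
```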

Let's Explore Some More Features of ggplot and Linear Regression

  • In this slide we will spice up a simple linear regression plot by mapping other variables to point color and shape, which can give more insight into trends in the data

Understanding the Code Behind the Plot

library(ggplot2)
g <- ggplot(data = mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = cty, shape = drv)) +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Engine Displacement (Liters)") +
  ylab("Highway MPG (Miles/Gallon)") +
  scale_color_gradient(low = "red", high = "blue")
print(g)
  • The largest difference in the code behind this plot is in the geom_point() function
  • This time we map point color to the cty values and point shape to the drv values
  • We also specify the color scale using the scale_color_gradient() function
  • This graph provides insight into the City MPG and Drive Train data
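
  • Why a gradient for cty but shapes for drv? A quick sketch for inspecting the two mapped columns, assuming only that ggplot2 is installed: cty is numeric, so it suits a continuous color gradient, while drv is categorical, so each distinct value gets its own point shape

```r
library(ggplot2)  # provides the mpg data set

# cty (city MPG) is numeric, so color can vary along a continuous gradient
print(class(mpg$cty))

# drv (drive train) is categorical, so each distinct value maps to one shape
print(unique(mpg$drv))
```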

Introduction to Simple Linear Regression with Plotly

  • So far we have only looked at simple linear regression with ggplot. There are other ways to do it in R; another common one is the plotly library
  • In the next slide we will show an example of simple linear regression implemented using the plotly library

Simple Linear Regression Using Plotly

Understanding the Code of a Plotly Linear Regression

library(plotly)
library(dplyr)
data(Orange)

# Fit circumference as a linear function of age
mod <- lm(circumference ~ age, data = Orange)

# Axis titles and title fonts
xax <- list(title = "Age of Tree", titlefont = list(family="Modern COmputer Roman"))
yax <- list(title = "Circumference of Tree", titlefont = list(family="Modern COmputer Roman"))

# Scatter plot of the data with the fitted regression line drawn on top
fig <- plot_ly(x = Orange$age, y = Orange$circumference, type = 'scatter', mode = 'markers', name = 'Circumference VS. Age') %>%
  add_lines(x = Orange$age, y = fitted(mod), name = 'Linear Regression Line') %>%
  layout(xaxis = xax, yaxis = yax, title = 'Tree Circumference VS. Age')
print(fig)

Understanding the Code of a Plotly Linear Regression Cont.

  • First we fit the linear regression model using lm()
  • Next we set up the formatting for the x and y axes
  • Then we draw the scatter plot by calling plot_ly() with the type set to scatter and passing in the point values
  • After that we call the add_lines() function to add the linear regression line on top of the scatter plot
  • Finally we apply the xaxis and yaxis settings that we previously defined and print the plot
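
  • The regression line is drawn from fitted(mod), so it helps to see what the model object holds. A minimal sketch using only base R (Orange ships with the datasets package); the 1000-day age in the prediction is a made-up value for illustration

```r
# Fit circumference as a linear function of age
mod <- lm(circumference ~ age, data = Orange)

# fitted() returns one predicted circumference per row of Orange,
# so it lines up with Orange$age when both are passed to add_lines()
stopifnot(length(fitted(mod)) == nrow(Orange))

# Intercept and slope of the regression line
print(coef(mod))

# Predicted circumference for a hypothetical tree aged 1000 days
print(predict(mod, newdata = data.frame(age = 1000)))
```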