Linear Regression

2024-09-21

Simple Linear Regression

Simple linear regression models the relationship between two variables, \(x\) and \(y\). The relationship between the independent variable (\(x\)) and the dependent variable (\(y\)) is assumed to be linear, up to a random error term.

Determining this relationship is useful because it allows us to predict \(y\) from a given value of \(x\).

Visualizing the Data

Plotting the data gives a good idea of whether there is a linear relationship between \(x\) and \(y\). A positive relationship means that as \(x\) increases, so does \(y\). An inverse (negative) relationship means that as \(x\) increases, \(y\) decreases.
The three graphs below use the mtcars dataset to demonstrate the types of linear relationships.
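
The direction of a relationship can also be checked numerically before plotting. The following is a minimal sketch using base R's cor(); the variable pairs (wt with disp, disp with mpg) are chosen here for illustration and are not necessarily the ones shown in the graphs.

```r
# Pearson correlation: the sign indicates the direction of a linear relationship.
# The variable pairs below are illustrative choices from mtcars.
r_positive <- cor(mtcars$wt, mtcars$disp)   # weight vs. displacement
r_negative <- cor(mtcars$disp, mtcars$mpg)  # displacement vs. miles per gallon

r_positive > 0   # TRUE: as wt increases, disp tends to increase
r_negative < 0   # TRUE: as disp increases, mpg tends to decrease

# A quick base-R scatter plot to inspect one pair visually
plot(mtcars$disp, mtcars$mpg, xlab = "Displacement", ylab = "Miles Per Gallon")
```

A correlation near 0 would suggest little or no linear relationship, although a plot is still worth checking for non-linear patterns.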

Simple Linear Regression Formula

The formula for a simple linear regression is:

\(y = \beta_0 + \beta_1\cdot x + \varepsilon\), where \(\varepsilon \sim \mathcal{N}(\mu = 0,\ \sigma^2)\)

Where:
\(y\) = dependent variable
\(\beta_0\) = intercept
\(\beta_1\) = regression coefficient (slope)
\(x\) = independent variable
\(\varepsilon\) = error

We can use this equation to help us predict \(y\) based on a value for \(x\). A line of best fit is used to represent a series of predictions. This regression line can be plotted on the graph with the data.
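
To make the roles of the terms concrete, the sketch below simulates data from the formula. The parameter values (2, 0.5, and 1) and the sample size are made up for illustration and do not come from mtcars.

```r
# Hypothetical parameters, chosen only for this illustration
beta_0 <- 2     # intercept
beta_1 <- 0.5   # slope
sigma  <- 1     # standard deviation of the error term

set.seed(42)    # fixed seed so the simulated values are reproducible
x       <- runif(200, min = 0, max = 10)
epsilon <- rnorm(200, mean = 0, sd = sigma)   # epsilon ~ N(0, sigma^2)

y_line <- beta_0 + beta_1 * x   # value on the line: the prediction for each x
y      <- y_line + epsilon      # observed value: prediction plus random error

# The deviations from the line average out near 0 and have spread near sigma
mean(y - y_line)
sd(y - y_line)
```

The line gives the prediction for each \(x\), and \(\varepsilon\) accounts for the scatter of the observed points around it.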

Calculating Simple Linear Regression

To estimate \(\beta_0\) and \(\beta_1\), the function lm() can be used. The errors \(\varepsilon\) are not observed directly; lm() reports the residuals, which estimate them. Here is an example that regresses mpg on disp from the mtcars dataset.

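# Fit mpg as a linear function of disp using the built-in mtcars dataset
dvm = lm(mpg ~ disp, data = mtcars)
# Print the coefficients, residual summary, and fit statistics
summary(dvm)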

Call:
lm(formula = mpg ~ disp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8922 -2.2022 -0.9631  1.6272  7.2305 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
disp        -0.041215   0.004712  -8.747 9.38e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.251 on 30 degrees of freedom
Multiple R-squared:  0.7183,    Adjusted R-squared:  0.709 
F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10
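
The numbers in this printout can also be pulled out of the fitted model directly, which avoids copying them by hand. This is a small sketch using standard accessors from base R's stats package; the variable names coef_table and slope_p_value are chosen here for illustration.

```r
dvm <- lm(mpg ~ disp, data = mtcars)

# Estimated intercept and slope as a named numeric vector
coef(dvm)

# Full coefficient table: estimates, standard errors, t values, and p-values
coef_table <- summary(dvm)$coefficients

slope_p_value <- coef_table["disp", "Pr(>|t|)"]   # p-value for the slope
slope_p_value < 0.05                               # TRUE for this model

sigma(dvm)               # residual standard error
summary(dvm)$r.squared   # multiple R-squared
```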

From the summary we can see that:
\(p\text{-value} = 9.38 \times 10^{-10}\)
\(\beta_0 = 29.600\)
\(\beta_1 = -0.041\)
Residual standard error \(= 3.251\) (the estimate of \(\sigma\), the standard deviation of \(\varepsilon\))

The first value to examine is the \(p\)-value, which indicates whether a relationship exists between the variables. A relationship exists if we can reject the null hypothesis that there is no relationship between \(x\) and \(y\) (\(\beta_1 = 0\)). If the \(p\)-value is less than \(\alpha = 0.05\), we can reject the null hypothesis and say the relationship is statistically significant. In this case, \(9.38 \times 10^{-10} < 0.05\), meaning that a statistically significant relationship does exist.

Therefore, an equation for the simple linear regression of disp vs. mpg can be made:

\(y = 29.600 - 0.041\cdot x + \varepsilon\)

The error \(\varepsilon\) is a random term rather than a single number; the value \(0.004712\) in the Std. Error column is the standard error of the slope estimate, not \(\varepsilon\). Dropping \(\varepsilon\) then gives the equation for the regression line:

\(y = 29.600 - 0.041\cdot x\)

The negative \(\beta_1\) indicates that disp and mpg are inversely related: each additional unit of disp lowers the predicted mpg by about \(0.041\). When plotted, the data will slope downward. The \(y\)-intercept is \(29.600\), meaning that when \(x=0\), the predicted \(y\) is \(29.600\).
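
As an illustration of using the regression line for prediction, the sketch below evaluates it at a few displacement values. The values 100, 200, and 300 are hypothetical inputs picked for this example; with the rounded coefficients, a disp of 200 gives \(29.600 - 0.041\cdot 200 \approx 21.4\).

```r
dvm <- lm(mpg ~ disp, data = mtcars)

# Hypothetical new cars, described only by their displacement
new_cars <- data.frame(disp = c(100, 200, 300))

# predict() evaluates the fitted line at the new x values
predicted_mpg <- predict(dvm, newdata = new_cars)
predicted_mpg

# The same numbers, computed by hand from the coefficients
b <- coef(dvm)
by_hand <- b[["(Intercept)"]] + b[["disp"]] * new_cars$disp
all.equal(unname(predicted_mpg), by_hand)
```

The column name in newdata must match the predictor name used in the model formula (disp), otherwise predict() cannot find the variable.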

Graphing Simple Linear Regression

This code plots the relationship between disp and mpg. A regression line is also included to represent our predictions for \(y\) based on values for \(x\).

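library(plotly)  # provides plot_ly(), add_lines(), layout(), and the %>% pipe

# Fit the simple linear regression of mpg on disp
dvm = lm(mpg ~ disp, data = mtcars)

# Axis settings
xax = list(
  title = "Displacement")

yax = list(
  title = "Miles Per Gallon",
  range = c(5, 40))

# Scatter plot of the data with the fitted regression line drawn on top
plot_ly(x = mtcars$disp, y = mtcars$mpg, type = "scatter", mode = "markers",
        name = "Data", width = 600, height = 322) %>%
  add_lines(x = mtcars$disp, y = fitted(dvm), name = "Line of Best Fit") %>%
  layout(xaxis = xax, yaxis = yax)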

This graph shows the negative relationship between disp and mpg: as disp increases, mpg decreases.

As expected, the negative slope of the regression line matches the calculated slope of \(-0.041\). Note that while this line offers a good prediction of \(y\) based on \(x\), the data points do not fall perfectly along it. The vertical distance between each point and the line is a residual, which is an estimate of the error (\(\varepsilon\)) for that point.
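
The gaps between the points and the line can be computed directly. The sketch below, using only base R functions, shows that the residuals are the observed values minus the fitted values and that the residual standard error reported by summary() is derived from them. The names res_by_hand and rse are chosen here for illustration.

```r
dvm <- lm(mpg ~ disp, data = mtcars)

# Residual = observed value minus the value predicted by the line
res_by_hand <- mtcars$mpg - fitted(dvm)
all.equal(unname(res_by_hand), unname(residuals(dvm)))

# Smallest and largest residual, matching Min and Max in summary(dvm)
range(residuals(dvm))

# Residual standard error: sqrt(sum of squared residuals / residual degrees of freedom)
rse <- sqrt(sum(residuals(dvm)^2) / df.residual(dvm))
all.equal(rse, sigma(dvm))
```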

Furthering Regression Analysis

Multiple linear regression extends the analysis. Instead of one independent variable, several independent variables are used to predict \(y\).

Multiple linear regression equation:

\(y = \beta_0 + \beta_1\cdot x_1 + \beta_2\cdot x_2 + \dots + \beta_q\cdot x_q + \varepsilon\)

Where:
\(y\) = dependent variable
\(\beta_0\) = intercept
\(\beta_1\) = first regression coefficient
\(x_1\) = first independent variable
\(\beta_2\) = second regression coefficient
\(x_2\) = second independent variable
\(\beta_q\) = \(q\)-th regression coefficient
\(x_q\) = \(q\)-th independent variable
\(\varepsilon\) = error

This equation is the same as the one for simple linear regression, but with the added independent variables.
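
In R, extra independent variables are added to the lm() formula with +. The sketch below assumes hp (horsepower) from mtcars as the second independent variable; the model name dvm2 is chosen here for illustration.

```r
# Simple model from the earlier example, and a model with hp as a second predictor
dvm  <- lm(mpg ~ disp, data = mtcars)
dvm2 <- lm(mpg ~ disp + hp, data = mtcars)   # dvm2 is a name chosen for this sketch

coef(dvm2)   # one intercept plus one coefficient per independent variable

# R-squared never decreases when a predictor is added, so compare adjusted R-squared too
summary(dvm)$r.squared
summary(dvm2)$r.squared
summary(dvm2)$adj.r.squared
```

Each coefficient in the larger model describes the change in predicted \(y\) for a one-unit change in that variable while the other variables are held fixed.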

Graphing Multiple Linear Regression

Like simple linear regression, multiple linear regression can also be graphed. Continuing the example from dataset mtcars of disp vs. mpg, a second independent variable hp is added as the \(z\) variable.