2023-10-15

Overview

  • This presentation will explain:
      • What simple linear regression is
      • Methods of finding the regression function for a given dataset
      • How to plot these functions over a dataset in R
  • This presentation will utilize the publicly available trees dataset (built into R) for all examples.

Basis of Linear Regression

  • Simple linear regression is a parametric test and thus makes the following assumptions about the data (see the diagnostic sketch below):
      • Homogeneity of variance: the size of the error is consistent throughout the dataset
      • Independence of observations: sampling methods are valid and there are no hidden relationships among the variables in the data
      • Normality: the data follow a normal distribution
  • An additional assumption is added for linear regression specifically:
      • A linear model is the most appropriate fit for the data
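  • These assumptions can be checked in R before relying on a fit. Below is a minimal diagnostic sketch (not from the original slides) using base R's standard diagnostic plots.
fit <- lm(Volume ~ Girth, data = trees)
# Four standard diagnostic plots: residuals vs. fitted (variance/linearity),
# normal Q-Q (normality), scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(fit)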

Linear Regression Equation Breakdown

  • \(y = \beta_0 + \beta_1 x + \epsilon\)
      • \(x\): independent variable
      • \(y\): predicted value of the dependent variable
      • \(\beta_0\): intercept; predicted value of the dependent variable when the independent variable equals 0
      • \(\beta_1\): regression coefficient (the slope of the line)
      • \(\epsilon\): error of the estimate; the variation in the estimate of the regression coefficient
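  • As a concrete instance, the fitted equation derived later in this presentation for the trees dataset takes exactly this form:
      • \(\text{Volume} = -36.9435 + 5.0659 \cdot \text{Girth} \pm 0.2474\)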

Examining Data

  • The data will first be examined to select suitable variables for a simple linear regression test.
  • The plot on this slide illustrates the relationships among the three variables of the trees dataset: Girth, Height, and Volume.
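  • The slide's plotting code is not shown; one way to reproduce such a 3-D scatterplot is with the scatterplot3d package (a sketch, assuming that package is installed).
# Hypothetical reconstruction of the 3-D plot; assumes the scatterplot3d package
library(scatterplot3d)
scatterplot3d(trees$Girth, trees$Height, trees$Volume,
              xlab = "Girth", ylab = "Height", zlab = "Volume",
              main = "trees: Girth, Height, and Volume")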

Conclusions from 3-D Plot

  • An exploration of the 3-D plot makes it clear that, of the three possible variable pairings, Girth and Volume have the strongest linear relationship.
  • As such, Girth and Volume will be utilized as the two variables in this example of simple linear regression, with Girth as the independent variable and Volume as the dependent variable.
  • As further evidence of their relationship, the correlation coefficient between these two variables is high, at 0.9671194. It can be calculated using the following code:
cor(trees$Girth, trees$Volume)
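  • To compare all three pairings at once, the full correlation matrix can also be printed (a quick check, not on the original slide):
# Pairwise correlations for Girth, Height, and Volume;
# the Girth-Volume entry is the largest off-diagonal value
cor(trees)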

Scatterplot

  • Here is an initial look at the distribution of points when plotting Girth against Volume for the trees dataset.
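  • The slide's plotting code is not shown; here is a minimal ggplot2 sketch that would produce such a scatterplot.
# Hypothetical reconstruction of the scatterplot; assumes ggplot2 is installed
library(ggplot2)
ggplot(trees, aes(Girth, Volume)) +
  geom_point() +
  labs(title = "Volume vs Girth in Trees")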

Data Coefficients

  • A summary table is useful for extracting the formula for the line of best fit in a simple linear regression.
  • Note the Estimate value for (Intercept), the Estimate value for Girth, and the Std. Error value for Girth. These are the intercept, the regression coefficient, and the error of the estimate, respectively.
  • The formula for this data is thus Volume = -36.9435 + 5.0659 * Girth ± 0.2474
  • Here is the code to produce the summary table for the trees dataset.
girth.volume.lm <- lm(Volume ~ Girth, data = trees)
summary(girth.volume.lm)
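  • The key values can also be pulled out programmatically rather than read off the printed table (a convenience, not on the original slides):
coef(girth.volume.lm)                  # intercept and regression coefficient
summary(girth.volume.lm)$coefficients  # full table, including Std. Error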

Alternative Method

  • Beyond utilizing R’s summary data, it is also possible to generate the line of best fit of a simple linear regression manually using the least-squares method (where \(n\) = total number of observations):
      • \(\beta_1 = \frac{n\Sigma(xy) - \Sigma(x)\Sigma(y)}{n\Sigma(x^2) - (\Sigma(x))^2}\)
      • \(\beta_0 = \frac{\Sigma(y) - \beta_1\Sigma(x)}{n}\)
  • The resulting function for the trees dataset is Volume = -36.9435 + 5.0659 * Girth, matching the coefficients reported by lm(), as expected since lm() also fits by least squares.

Plotting Line(s) of Best Fit

  • The following plot illustrates the data with the “R-Generated Line” (blue) and the “Least-Squares Line” (red); because both are least-squares fits, the two lines coincide.

Key Code to Remember

  • Here is the code used to generate the plot on the previous slide.
ggplot(trees, aes(Girth, Volume)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, col = "blue") +      # R-generated line
  geom_abline(intercept = -36.9435, slope = 5.0659, col = "red") + # least-squares line
  labs(title = "Volume vs Girth in Trees")
  • And here is the code to calculate the slope and intercept through the least squares method.
# slope: (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - (sum(x))^2)
n <- nrow(trees)
m <- (n * sum(trees$Girth * trees$Volume) - sum(trees$Girth) * sum(trees$Volume)) /
  (n * sum(trees$Girth^2) - (sum(trees$Girth))^2)
# intercept: (sum(y) - slope * sum(x)) / n
b <- (sum(trees$Volume) - m * sum(trees$Girth)) / n
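  • As a quick check (not on the original slides), the manual estimates can be compared against the coefficients from lm(); both should print an intercept of -36.9435 and a slope of 5.0659.
c(intercept = b, slope = m)
coef(lm(Volume ~ Girth, data = trees))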
