2023-10-15

Overview

  • This presentation will explain:
      • What simple linear regression is
      • Methods of finding the regression function for a given dataset
      • How to plot these functions over a dataset in R
  • This presentation will utilize the publicly available trees dataset (built into R) for all examples.

Basis of Linear Regression

  • Simple linear regression is a parametric test and thus makes the following assumptions about the data (see the diagnostic sketch below):
      • Homogeneity of variance: the size of the error is consistent throughout the dataset
      • Independence of observations: sampling methods are valid and there are no hidden relationships among the variables in the data
      • Normality: the data follow a normal distribution
  • An additional assumption is added for linear regression specifically:
      • A linear model is the most appropriate fit for the data
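  • These assumptions can be checked in R before relying on a fit. Below is a minimal diagnostic sketch (not from the original slides) using base R's standard diagnostic plots.
fit <- lm(Volume ~ Girth, data = trees)
# Four standard diagnostic plots: residuals vs. fitted (variance/linearity),
# normal Q-Q (normality), scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(fit)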

Linear Regression Equation Breakdown

  • \(y = \beta_0 + \beta_1 x + \epsilon\)
      • \(x\): independent variable
      • \(y\): predicted value of the dependent variable
      • \(\beta_0\): intercept; predicted value of the dependent variable when the independent variable equals 0
      • \(\beta_1\): regression coefficient (the slope of the line)
      • \(\epsilon\): error of the estimate; the variation in the estimate of the regression coefficient
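  • As a concrete instance, the fitted equation derived later in this presentation for the trees dataset takes exactly this form:
      • \(\text{Volume} = -36.9435 + 5.0659 \cdot \text{Girth} \pm 0.2474\)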

Examining Data

  • The data will first be examined to select suitable variables for a simple linear regression test.
  • The plot on this slide illustrates the relationships among the three variables of the trees dataset: Girth, Height, and Volume.
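  • The slide's plotting code is not shown; one way to reproduce such a 3-D scatterplot is with the scatterplot3d package (a sketch, assuming that package is installed).
# Hypothetical reconstruction of the 3-D plot; assumes the scatterplot3d package
library(scatterplot3d)
scatterplot3d(trees$Girth, trees$Height, trees$Volume,
              xlab = "Girth", ylab = "Height", zlab = "Volume",
              main = "trees: Girth, Height, and Volume")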

Conclusions from 3-D Plot

  • An exploration of the 3-D plot makes it clear that, of the three possible variable pairings, Girth and Volume have the strongest linear relationship.
  • As such, Girth and Volume will be utilized as the two variables in this example of simple linear regression, with Girth as the independent variable and Volume as the dependent variable.
  • As further evidence of their relationship, the correlation coefficient between these two variables is high, at 0.9671194. It can be calculated using the following code:
cor(trees$Girth, trees$Volume)
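  • To compare all three pairings at once, the full correlation matrix can also be printed (a quick check, not on the original slide):
# Pairwise correlations for Girth, Height, and Volume;
# the Girth-Volume entry is the largest off-diagonal value
cor(trees)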

Scatterplot

  • Here is an initial look at the distribution of points when plotting Girth against Volume for the trees dataset.
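  • The slide's plotting code is not shown; here is a minimal ggplot2 sketch that would produce such a scatterplot.
# Hypothetical reconstruction of the scatterplot; assumes ggplot2 is installed
library(ggplot2)
ggplot(trees, aes(Girth, Volume)) +
  geom_point() +
  labs(title = "Volume vs Girth in Trees")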

Data Coefficients

  • A summary table is useful for extracting the formula for the line of best fit in a simple linear regression.
  • Note the Estimate value for (Intercept), the Estimate value for Girth, and the Std. Error value for Girth. These are the intercept, the regression coefficient, and the error of the estimate, respectively.
  • The formula for this data is thus Volume = -36.9435 + 5.0659 * Girth ± 0.2474
  • Here is the code to produce the summary table for the trees dataset.
girth.volume.lm <- lm(Volume ~ Girth, data = trees)
summary(girth.volume.lm)
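  • The key values can also be pulled out programmatically rather than read off the printed table (a convenience, not on the original slides):
coef(girth.volume.lm)                  # intercept and regression coefficient
summary(girth.volume.lm)$coefficients  # full table, including Std. Error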

Alternative Method

  • Beyond utilizing R’s summary data, it is also possible to generate the line of best fit of a simple linear regression manually using the least-squares method (where \(n\) = total number of observations):
      • \(\beta_1 = \frac{n\Sigma(xy) - \Sigma(x)\Sigma(y)}{n\Sigma(x^2) - (\Sigma(x))^2}\)
      • \(\beta_0 = \frac{\Sigma(y) - \beta_1\Sigma(x)}{n}\)
  • The resulting function for the trees dataset is Volume = -36.9435 + 5.0659 * Girth, matching the coefficients reported by lm(), as expected since lm() also fits by least squares.

Plotting Line(s) of Best Fit

  • The following plot illustrates the data with the “R-Generated Line” (blue) and the “Least-Squares Line” (red); because both are least-squares fits, the two lines coincide.

Key Code to Remember

  • Here is the code used to generate the plot on the previous slide.
ggplot(trees, aes(Girth, Volume)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, col = "blue") +      # R-generated line
  geom_abline(intercept = -36.9435, slope = 5.0659, col = "red") + # least-squares line
  labs(title = "Volume vs Girth in Trees")
  • And here is the code to calculate the slope and intercept through the least squares method.
# slope: (n * sum(x*y) - sum(x) * sum(y)) / (n * sum(x^2) - (sum(x))^2)
n <- nrow(trees)
m <- (n * sum(trees$Girth * trees$Volume) - sum(trees$Girth) * sum(trees$Volume)) /
  (n * sum(trees$Girth^2) - (sum(trees$Girth))^2)
# intercept: (sum(y) - slope * sum(x)) / n
b <- (sum(trees$Volume) - m * sum(trees$Girth)) / n
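  • As a quick check (not on the original slides), the manual estimates can be compared against the coefficients from lm(); both should print an intercept of -36.9435 and a slope of 5.0659.
c(intercept = b, slope = m)
coef(lm(Volume ~ Girth, data = trees))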
