2023-04-17

Simple Linear Regression Information

  • Simple linear regression analyzes the linear relationship between two variables: a dependent variable (y) and an independent variable (x).
  • Linear regression models statistical relationships, in which the data points do not fall exactly on a line. Deterministic relationships (such as converting Fahrenheit to Celsius and vice versa) are exact, so they are not regression problems.
  • R’s statistical capabilities make it easy to fit simple linear regressions, using either base R or add-on packages. In this presentation, two data sets built into R will be analyzed using simple linear regression to form predictive equations.

Old Faithful Dataset

R has many data sets built into it. This presentation will first use the “faithful” data set, which records waiting time and eruption length, both in minutes, for the famous Old Faithful geyser in Yellowstone National Park. Here are the first four rows of the data set.

eruptions waiting
3.6 79
1.8 54
3.333 74
2.283 62
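As a minimal sketch, the rows above can be reproduced with base R alone. The variable name `first_rows` is only an illustrative choice.

```r
# "faithful" ships with base R (datasets package), so nothing needs to be installed
first_rows <- head(faithful, 4)  # first four observations
first_rows

# Column types and number of observations
str(faithful)
```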

Let’s create a simple linear regression plot in ggplot2 and see what it shows.
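One way such a plot could be built is sketched below. It assumes the ggplot2 package is installed; the title, axis labels, and the object name `p` are illustrative choices rather than the ones used for the original slide.

```r
library(ggplot2)

# Scatter plot of the raw data with a least squares line drawn by stat_smooth()
p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x) +
  labs(
    title = "Old Faithful: Eruption Length vs Waiting Time",  # illustrative title
    x = "Waiting time (minutes)",
    y = "Eruption length (minutes)"
  )
p
```

By default `stat_smooth()` also shades a confidence band around the line; passing `se = FALSE` would hide it.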

Simple Linear Regression Plot

There appears to be a positive correlation between eruption length and waiting time (as waiting time increases, so does eruption length). To turn this pattern into an equation that predicts the length of an eruption, the regression output is needed. Let’s take a look at it.

Faithful Dataset Regression Data

## 
## Call:
## lm(formula = eruptions ~ waiting, data = faithful)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
## waiting      0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
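The `Call:` line in the output shows the model formula, so a summary like the one above could be produced roughly as follows. The object name `fit` is an assumption made for this sketch.

```r
# Fit eruption length as a linear function of waiting time
fit <- lm(eruptions ~ waiting, data = faithful)

# Print residual quantiles, the coefficient table, R-squared, and the F-statistic
fit_summary <- summary(fit)
fit_summary
```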

Coefficient Data Explanation

Simple linear regression attempts to create an equation of the form y = mx + b, where m is the slope and b is the intercept, that predicts the value of the dependent variable from the independent variable based on previously observed data. Here is the coefficient data from the regression output in a neatly formatted matrix:

##                Estimate  Std. Error   t-value       p-value
## (Intercept) -1.87401599 0.160143302 -11.70212  7.359171e-26
## waiting      0.07562795 0.002218541  34.08904 8.129959e-100
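A matrix like this can be pulled out of the model summary. The sketch below is one possible way to do it; renaming the last two columns to "t-value" and "p-value" is a guess at how the headers shown above were obtained.

```r
fit <- lm(eruptions ~ waiting, data = faithful)

# coef() on a summary.lm object returns the coefficient table as a numeric matrix
coef_table <- coef(summary(fit))

# Assumed renaming step to match the column headers shown above
colnames(coef_table) <- c("Estimate", "Std. Error", "t-value", "p-value")
coef_table
```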

The estimates are the coefficients calculated by the least squares method. A t-value that is larger in absolute value indicates stronger evidence that the coefficient isn’t zero, as does a smaller p-value. Based on the calculated coefficients, the linear regression equation would be:

\[eruption\;(minutes) = -1.87401599 + 0.07562795(waiting)\]

For every additional minute of waiting, the predicted eruption length increases by 0.07562795 minutes (about 4.5 seconds).
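To illustrate how the equation is used, the sketch below predicts the eruption length for a waiting time of 80 minutes. That value is a hypothetical input chosen for this example, and the prediction is computed both with `predict()` and by hand from the coefficients.

```r
fit <- lm(eruptions ~ waiting, data = faithful)

# Hypothetical input: a waiting time of 80 minutes
new_wait <- data.frame(waiting = 80)

# Prediction from the fitted model
pred <- unname(predict(fit, newdata = new_wait))

# Same prediction written out as intercept + slope * waiting
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["waiting"]] * 80

pred
by_hand
```

Both values come out at roughly 4.18 minutes, which matches plugging 80 into the equation above.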

Distance To Stop

Another data set built into R, called “cars”, relates the distance a car takes to stop to the speed at which it is traveling. The data were recorded in the 1920s.

  • speed: Speed (mph)
  • dist: Distance to stop (feet)
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Let’s chart distance to stop versus speed to try to predict how much distance a car needs to stop based on its speed.

Car Stopping Distance Chart

library(ggplot2)

# Pass the data frame itself; stat_smooth() fits and draws the least squares line
ggplot(cars, aes(x = speed, y = dist)) +
  stat_smooth(method = "lm", formula = y ~ x, color = "orange") +
  geom_point() +
  ggtitle("Distance To Stop In ggplot2") +
  labs(x = "Speed (mph)", y = "Distance to Stop (feet)")

Distance To Stop Regression Equation

##               Estimate Std. Error   t-value      p-value
## (Intercept) -17.579095  6.7584402 -2.601058 1.231882e-02
## speed         3.932409  0.4155128  9.463990 1.489836e-12

\[distance = -17.579095 + 3.932409(speed)\]

For every additional 1 mph of speed, the predicted stopping distance increases by 3.932409 feet. For instance, let’s do the math for a car traveling 80 mph:

\[\begin{align*} distance\;to\;stop\;(feet) &= -17.579095 + 3.932409(80) \\ &= -17.579095 + 314.59272 \\ &= 297.013625\;feet \end{align*}\]

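A car traveling 80 mph in the 1920s is predicted to take around 297 feet to fully stop. Note that the speeds recorded in “cars” only range from 4 to 25 mph, so this prediction is an extrapolation well beyond the observed data and should be treated with caution.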
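The same arithmetic can be checked in R. The following is a minimal sketch; the object names `fit_cars`, `pred_80`, and `speed_range` are illustrative.

```r
fit_cars <- lm(dist ~ speed, data = cars)

# Predicted stopping distance (feet) for a car traveling 80 mph
pred_80 <- unname(predict(fit_cars, newdata = data.frame(speed = 80)))
pred_80

# Range of speeds actually present in the data, to compare against 80 mph
speed_range <- range(cars$speed)
speed_range
```

Comparing the requested speed with `speed_range` is a quick way to see whether a prediction falls inside or outside the observed data.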

Interactive Plotly Cars Chart

This is an interactive scatter plot of the “cars” data set with a regression line, built with plotly.
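A chart of this kind might be built along the following lines. This is a sketch that assumes the plotly package is installed; the trace names, title, and axis labels are illustrative and not taken from the original chart.

```r
library(plotly)

# Fit the model first so the fitted values can be drawn as a line
fit_cars <- lm(dist ~ speed, data = cars)

p <- plot_ly(cars, x = ~speed, y = ~dist,
             type = "scatter", mode = "markers", name = "Observed") %>%
  # Overlay the least squares fit as a separate line trace
  add_lines(x = ~speed, y = fitted(fit_cars), name = "Least squares fit") %>%
  layout(
    title = "Distance To Stop In plotly",  # illustrative title
    xaxis = list(title = "Speed (mph)"),
    yaxis = list(title = "Distance to Stop (feet)")
  )
p
```

Hovering over a point in the rendered chart shows its speed and stopping distance.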