Simple Linear Regression

What is Simple Linear Regression?

The main idea behind simple linear regression is to describe the relationship between two variables by fitting a straight line through the data points, where:

Variable X is the predictor or independent variable
Variable Y is the response or dependent variable

Simple Linear Regression Model Equation

Here is the corresponding general SLR Regression Model Equation:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

This formula is what formalizes the empirical model!

We know that:

\(y\) is our response variable
\(x\) is the predictor variable
\(\beta_0\) (the intercept) and \(\beta_1\) (the slope) are the parameters of the function, which are unknown and must be estimated from the data
\(\varepsilon\) is the random error
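
To make the roles of these terms concrete, here is a minimal simulation sketch of the model equation. The parameter values (`beta0 = 2`, `beta1 = 0.5`), the sample size, and the choice of normally distributed errors with standard deviation `sigma = 1` are all hypothetical assumptions made for illustration; none of them come from the cars example.

```r
# Simulate data from y = beta0 + beta1 * x + epsilon
# All numeric values below are hypothetical, chosen only for illustration
set.seed(1)                            # make the random draws reproducible
n     <- 100                           # number of simulated observations
beta0 <- 2                             # assumed intercept
beta1 <- 0.5                           # assumed slope
sigma <- 1                             # assumed standard deviation of the error

x   <- runif(n, min = 0, max = 10)     # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)  # random error (normality is an assumption)
y   <- beta0 + beta1 * x + eps         # response built from the model equation

head(data.frame(x, y))
```

In real data we only observe `x` and `y`; the parameters and the errors are hidden, which is why they need to be estimated.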

How to estimate the coefficients?

The coefficients are the unknown parameters defined in the SLR equation, \(\beta_0\) and \(\beta_1\). We can estimate them using the OLS method, or Ordinary Least Squares. OLS finds the line that minimizes the sum of squared vertical distances between the observed points and the fitted line. (These vertical distances are what we call residuals!) This gives us four equations to use:

Residuals: \(e_{i} = Y_{i} - \hat{Y}_{i}\)
Least squares criterion (the quantity to minimize): \(\sum_{i=1}^{n} (Y_{i} - \hat{\beta}_0 - \hat{\beta}_1 X_{i})^2\)
Slope estimate: \(\hat{\beta}_1 = \frac{\sum (X_{i} - \bar X)(Y_{i} - \bar Y)}{\sum (X_{i} - \bar X)^2}\)
Intercept estimate: \(\hat{\beta}_0 = \bar Y - \hat{\beta}_1 \bar X\)
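
The formulas above can be applied directly in base R. The sketch below uses the built-in cars dataset, with `speed` as X and `dist` as Y; the variable names (`b1_hat`, `b0_hat`, `residuals_hand`, `sse`) are illustrative choices, not part of any library.

```r
# Compute the OLS estimates by hand from the closed-form equations
x <- cars$speed   # predictor X
y <- cars$dist    # response Y

# Slope: sum of cross-deviations divided by sum of squared X deviations
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: mean of Y minus slope times mean of X
b0_hat <- mean(y) - b1_hat * mean(x)

# Residuals: observed values minus fitted values
residuals_hand <- y - (b0_hat + b1_hat * x)

# Value of the least squares criterion at the estimates
sse <- sum(residuals_hand^2)

c(intercept = b0_hat, slope = b1_hat)
```

These hand-computed values should agree with what `lm(dist ~ speed, data = cars)` reports, since `lm()` fits the same least squares line.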

Incorporation and Introduction of data

The \(\textbf{cars}\) dataset, which is built into R, is a convenient choice for demonstrating simple linear regression.

# Shows the first six observations of the cars dataset
head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

nrow(cars)

## [1] 50

cars Dataset Components

From these points of data, we can see that there are two variables to make note of:

speed (miles per hour), which is the speed at which the car is moving
dist (feet), which is the distance that the car travels before it stops

In our case, we know that speed will be considered as the predictor variable, or independent variable of the dataset. Fittingly, we then know that distance will be considered as the response variable, or dependent variable of the dataset.

Taking these components into account, our goal is to predict the distance that the car travels before it stops based on how fast the car is moving.

ggplot2 Scatterplot

Example R Code for Scatterplot

# Load ggplot2 for plotting
library(ggplot2)

# Create a scatter plot of the cars dataset, mapping speed (predictor)
# to x and dist (response) to y, and coloring the points by speed
ggplot(cars, aes(x=speed, y=dist, color=speed)) + 
  # Set size of points
  geom_point(size=3) +
  # Establish labels
  labs(title="Speed against Stopping Distance", 
       x="Speed (mph)", 
       y="Stopping Distance (ft)",
       color="Speed (mph)") +
  # Create a gradient color scale for the legend
  scale_color_gradient(low="blue", high="red") +
  theme_bw()

ggplot2 Regression Line Plot
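
One way such a plot could be drawn is sketched below. This is an assumed reconstruction, not necessarily the original code: it overlays the least squares line on the scatter plot using `geom_smooth(method = "lm")`, and the object name `reg_plot`, the colors, and the labels are illustrative choices.

```r
library(ggplot2)

# Scatter plot of the cars data with the fitted least squares line on top
reg_plot <- ggplot(cars, aes(x = speed, y = dist)) +
  # Observed points
  geom_point(size = 3) +
  # Straight line fitted by lm(); se = FALSE hides the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(title = "Stopping Distance vs. Speed with Regression Line",
       x = "Speed (mph)",
       y = "Stopping Distance (ft)") +
  theme_bw()

reg_plot
```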

ggplot2 Residual Plot
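
A residual plot can be sketched along the following lines. Again, this is an assumption about how the plot might be produced rather than the original code: it plots the residuals against the fitted values and adds a dashed reference line at zero. The names `fit`, `resid_df`, and `resid_plot` are illustrative.

```r
library(ggplot2)

# Fit the model and collect fitted values and residuals in a data frame
fit <- lm(dist ~ speed, data = cars)
resid_df <- data.frame(fitted = fitted(fit), residual = residuals(fit))

# Residuals vs. fitted values, with a horizontal reference line at zero
resid_plot <- ggplot(resid_df, aes(x = fitted, y = residual)) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. Fitted Values",
       x = "Fitted Stopping Distance (ft)",
       y = "Residual (ft)") +
  theme_bw()

resid_plot
```

If the linear model is adequate, the points should scatter around the zero line without an obvious pattern.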

Plotly Scatterplot

The Results!

fittedModel <- lm(dist ~ speed, data = cars)  # fit dist on speed by OLS
summary(fittedModel)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Interpreting these results

After using the summary function on the fitted model of the cars dataset, we can read off estimates of the unknown parameters that were described in the original formula:

We can see that the estimate of \(\beta_0\), or the intercept, is equal to -17.579. According to the formula, this is the predicted stopping distance when the speed is 0 mph. A negative distance is not physically meaningful, and 0 mph lies outside the observed speeds (4 to 25 mph), so the intercept should not be interpreted literally.
We can also see that the estimate of \(\beta_1\), or the slope, is equal to 3.932. This indicates that the predicted stopping distance increases by about 3.932 feet for every 1 mph increase in speed.
Two other values worth noting are \(R^2 = 0.6511\) and the p-value of 1.49e-12
The R-Squared value tells us that roughly 65% of the variability in the stopping distance is explained by its linear relationship with speed (with other factors and random error accounting for the rest). This describes association, not causation.
The p-value in the summary answers the question of whether the independent variable is significant in predicting the dependent variable: it tests the null hypothesis that the slope \(\beta_1\) is zero. (A typical alpha level is 0.05, so if the resulting p-value is less than that, we reject the null hypothesis and conclude that the independent variable is significant.)
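
To see where these two numbers come from, the sketch below recomputes \(R^2\) from the residuals and pulls the slope's p-value out of the coefficient table. It refits the model so that it runs on its own; the names `fit`, `sse`, `sst`, `r_squared`, and `slope_p` are illustrative.

```r
# Refit the model so this example is self-contained
fit <- lm(dist ~ speed, data = cars)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
sse <- sum(residuals(fit)^2)
sst <- sum((cars$dist - mean(cars$dist))^2)
r_squared <- 1 - sse / sst

# The coefficient table is a matrix; pick out the p-value for the slope
coef_table <- coef(summary(fit))
slope_p <- coef_table["speed", "Pr(>|t|)"]

r_squared         # matches the Multiple R-squared in the summary
slope_p < 0.05    # TRUE means the slope is significant at alpha = 0.05
```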

Conclusion and Wrapping Up

Simple linear regression is a powerful tool for describing the relationship between two variables and for visualizing that relationship through plots. Using the cars dataset and the results from the summary() call, we can write the fitted model equation as:

\[\widehat{Y} = -17.579 + 3.932X\]

This fitted model equation tells us what happens when we change the x-value, or speed: for every 1 mph increase in speed, the predicted stopping distance increases by about 3.932 feet!
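
As a short usage sketch, the fitted equation can be evaluated at new speeds with `predict()`. The speeds chosen here (10, 15, and 20 mph) are hypothetical inputs that fall within the observed range of the data; `fit`, `new_speeds`, and `predicted` are illustrative names.

```r
# Refit the model so this example is self-contained
fit <- lm(dist ~ speed, data = cars)

# Hypothetical speeds (mph) at which to predict stopping distance
new_speeds <- data.frame(speed = c(10, 15, 20))

# Predicted stopping distances (ft) from the fitted equation
predicted <- predict(fit, newdata = new_speeds)
predicted

# Each 5 mph step raises the prediction by 5 times the slope
diff(predicted)
```

Predictions are most trustworthy for speeds inside the range covered by the data; extrapolating far beyond it relies on the straight-line relationship continuing to hold.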