Basics of Simple Linear Regression

2024-06-02

Introduction

This presentation covers simple linear regression, which summarizes the relationship between two quantitative variables. One of these variables is the “predictor”, or independent variable. The other is the “response”, or dependent variable.

What is Simple Linear Regression?

Mathematically, a simple linear regression model can be written as: \[ Y = \beta_0 + \beta_1 X + \epsilon \] where:

\(Y\) is the dependent variable (mpg)
\(X\) is the independent variable (hp)
\(\beta_0\) is the intercept
\(\beta_1\) is the slope
\(\epsilon\) is the error term
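
To make the role of each term concrete, here is a minimal sketch that simulates data from the model and then fits a line to it. The parameter values (intercept 5, slope 2), the sample size, and the seed are made up purely for illustration; they do not come from the example data used later.

```r
# Simulate data from Y = beta_0 + beta_1 * X + epsilon with made-up parameters
set.seed(42)                           # arbitrary seed, for reproducibility only
n       <- 200                         # hypothetical sample size
beta_0  <- 5                           # hypothetical true intercept
beta_1  <- 2                           # hypothetical true slope
x       <- runif(n, min = 0, max = 10) # predictor values
epsilon <- rnorm(n, mean = 0, sd = 1)  # random error term
y       <- beta_0 + beta_1 * x + epsilon

# Fitting a line to the simulated data should give estimates near 5 and 2
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimates will not match 5 and 2 exactly, because the error term adds noise to every observation.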

How to find the Slope and Intercept

In simple linear regression, the slope (\(\beta_1\)) and intercept (\(\beta_0\)) are estimated by the least squares method as:

\[ \beta_1 = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sum{(X_i - \bar{X})^2}} \]

\[ \beta_0 = \bar{Y} - \beta_1 \bar{X} \]

where:

\(X_i\) and \(Y_i\) are the individual data points
\(\bar{X}\) and \(\bar{Y}\) are the means of \(X\) and \(Y\)
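
The two formulas can be checked by hand on a small data set. The five data points below are made up for illustration; the sketch applies the formulas directly and then compares the result with R's built-in lm() function.

```r
# Tiny made-up data set (illustrative values only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

x_bar <- mean(x)   # 3
y_bar <- mean(y)   # 4

# Slope: sum of cross-deviations divided by sum of squared X deviations
beta_1 <- sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)   # 6 / 10 = 0.6

# Intercept: forces the line through the point (x_bar, y_bar)
beta_0 <- y_bar - beta_1 * x_bar                                # 4 - 0.6 * 3 = 2.2

c(intercept = beta_0, slope = beta_1)

# lm() should return the same two numbers
coef(lm(y ~ x))
```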

The Example Data

To demonstrate how linear regression works, I will be using the “mtcars” data set that ships with R. This data set has only 32 rows (one per car) and 11 columns of car attributes such as horsepower, gears, MPG, etc. For this presentation I will be looking at MPG as the response and horsepower (HP) as the predictor.

Plotting MPG vs HP

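
The code behind this slide's plot is not shown, so the following is only a sketch of how such a scatter plot could be produced. It assumes ggplot2 with default styling; the title and axis labels are assumptions.

```r
library(ggplot2)

# Scatter plot of the raw data: one point per car in mtcars
scatter <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs HP",          # assumed title
       x = "Horsepower",
       y = "Miles Per Gallon")

scatter   # printing the object draws the plot
```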

Creating the model

Creating a simple linear regression model in R is very straightforward. In a formula, you name the response variable on the left of the ~ and the predictor on the right. Once you have created a model you can have R provide its summary. A later slide displays the summary of the MPG vs HP example. The summary is where you can check whether the predictor you have chosen is statistically significant and how much of the variation in the response, in this case the MPG, it explains.

Here is how we fit a linear regression model in R:

model <- lm(mpg ~ hp, data = mtcars)
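
Once the model is fitted, the estimated intercept and slope can be pulled out with coef(). As a quick sanity check, this sketch also recomputes them from the slope and intercept formulas given earlier; the variable names other than model are my own.

```r
model <- lm(mpg ~ hp, data = mtcars)

# Estimated intercept and slope (roughly 30.099 and -0.068)
coef(model)

# The same numbers from the least squares formulas
x <- mtcars$hp
y <- mtcars$mpg
slope_by_hand     <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept_by_hand <- mean(y) - slope_by_hand * mean(x)

c(intercept = intercept_by_hand, slope = slope_by_hand)
```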

How to Plot The Regression Line:

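library(ggplot2)

# Scatter plot of mpg against hp with the fitted least squares line on top
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # se = FALSE hides the confidence band
  labs(title = "Regression Line of MPG vs HP",
       x = "Horsepower",
       y = "Miles Per Gallon")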

Important Things to Note While Plotting Regression Line:

Putting method = "lm" explicitly tells geom_smooth() that you want the line to come from a linear regression fit
"lm" stands for linear model, which uses the least squares method to fit a straight line to the data
You can customize your regression visualization by setting se = TRUE, which adds a confidence interval as a shaded region around the line. The wider the region is, the less certain the model is about the estimated relationship between the variables
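
As a sketch of the se = TRUE option, the plot below draws the same regression line with a shaded confidence band. The 95% level is geom_smooth()'s default and is written out only for clarity; the title is an assumption.

```r
library(ggplot2)

# Same regression line, now with a shaded 95% confidence band around it
band_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95) +
  labs(title = "Regression Line of MPG vs HP with Confidence Band",  # assumed title
       x = "Horsepower",
       y = "Miles Per Gallon")

band_plot
```

Passing formula = y ~ x explicitly does not change the fit; it only silences the message ggplot2 prints when it has to pick the formula itself.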

Understanding the Regression Line

The regression line in the previous slide shows the “best fit” linear relationship between the MPG and HP variables in the data set. Each point on the plot represents a car from the data. The slope indicates the average change in the dependent variable (MPG) for every one-unit increase in the independent variable (HP). As you may have noticed, the slope from our model is negative, meaning the two variables have a negative linear relationship. In this case, our model suggests that as horsepower increases, miles per gallon tends to decrease.

Another idea that is important to understanding simple linear regression models is “goodness of fit”. If you look at how close (or far) each point on the plot is from the regression line, you can gauge the goodness of fit. The closer the points are to the regression line, the better the fit. This is another way of checking how strong the relationship is between the variables. A good fit is especially important if your goal is to make predictions, because the more closely your model fits the data, the more accurate your predictions tend to be.
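
To connect the slope to actual predictions, here is a small sketch using predict(). The two horsepower values (100 and 200) are hypothetical cars chosen for illustration, not rows from the data.

```r
model <- lm(mpg ~ hp, data = mtcars)

# Hypothetical cars with 100 and 200 horsepower
new_cars <- data.frame(hp = c(100, 200))
predicted_mpg <- predict(model, newdata = new_cars)
predicted_mpg   # roughly 23.3 and 16.5 mpg

# The gap between the two predictions is 100 times the slope
predicted_mpg[[2]] - predicted_mpg[[1]]
100 * coef(model)[["hp"]]

# Residuals show how far each observed car sits from the line
res <- residuals(model)
range(res)
```

Small residuals across the board would indicate a close fit; a wide range means individual predictions can be off by several miles per gallon.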

What Can the Summary of the Model Tell Us?

The summary is an important part of understanding how well your simple linear regression model fits. It tells us things like residuals, coefficient values, residual standard error, R-Squared values, F-Statistic, and the P-Value.

Here is how to interpret each item:

Residuals: A residual is the vertical distance between a point on the plot and the regression line. The summary reports the most negative and most positive residuals (Min and Max), along with the quartiles and the median.
Coefficients: The slope estimate represents the change in the mean value of the response for a one-unit increase in the predictor. Negative suggests a decrease and positive suggests an increase. The intercept estimate is the predicted response when the predictor equals zero
Residual Standard Error: This estimates the standard deviation of the residuals, that is, the typical size of a prediction error in the units of the response
R-Squared: The proportion of the variance in the response variable explained by the predictor variable
F-Stat & P-Value: These measure the overall significance of the regression model. A small P-Value indicates that the regression model as a whole is statistically significant
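
Each of these items can also be pulled out of the summary object individually, which is handy when you want to reuse a number rather than read it off the printout. A minimal sketch (the name s is my own):

```r
model <- lm(mpg ~ hp, data = mtcars)
s <- summary(model)

s$coefficients    # matrix of estimates, standard errors, t values, p-values
s$sigma           # residual standard error
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
s$fstatistic      # F-statistic with its numerator and denominator df

quantile(residuals(model))   # the five-number summary shown under "Residuals"
```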

Summary of The Model

## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
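
To see where the last three lines of this output come from, the sketch below recomputes them from the residuals. The variable names are my own; the formulas are the standard ones for a model with one predictor.

```r
model <- lm(mpg ~ hp, data = mtcars)

res <- residuals(model)                        # observed mpg minus fitted mpg
n   <- nrow(mtcars)                            # 32 cars
rss <- sum(res^2)                              # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

rse       <- sqrt(rss / (n - 2))               # residual standard error, 30 df
r_squared <- 1 - rss / tss                     # multiple R-squared
f_stat    <- (tss - rss) / (rss / (n - 2))     # F-statistic on 1 and 30 DF

round(c(rse = rse, r_squared = r_squared, f_stat = f_stat), 4)
```

With a single predictor, the F-statistic is the square of the slope's t value, which is why both tests report the same p-value here.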

Explore Possibility of Creating a More Complex Linear Regression Model

After exploring the relationships revealed by your simple linear regression model, perhaps you would like to make a linear regression model with more than one predictor (this would no longer be considered a “Simple Linear Regression” model). You could first explore the data visually by creating a 3D plot and looking for any relationships that may be apparent. The advantages of such a plot include, but are not limited to: seeing multivariate relationships, identifying patterns or trends, and exploring possible interactions among the variables. The next slide shows how to create such a plot.
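
As a rough sketch of what a model with more than one predictor could look like, the code below adds weight (wt) as a second predictor. Choosing wt is an assumption made for illustration, picked because it is the third variable in the 3D plot that follows; any other column of mtcars could be tried the same way.

```r
simple_model <- lm(mpg ~ hp, data = mtcars)
multi_model  <- lm(mpg ~ hp + wt, data = mtcars)   # wt is an assumed second predictor

coef(multi_model)   # one intercept plus one slope per predictor

# Adjusted R-squared allows a fairer comparison between models of different size
summary(simple_model)$adj.r.squared
summary(multi_model)$adj.r.squared
```

Each slope in the larger model is read as the average change in MPG for a one-unit increase in that predictor while the other predictor is held fixed.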

Plotting in 3D for Different Relationships

library(plotly)
plot_ly(mtcars, x = ~hp, y = ~wt, z = ~mpg, 
        type = "scatter3d", mode = "markers")