Introduction to Simple Linear Regression

10/17/2024

Building and Understanding a Linear Model

In this introduction, we’ll explore simple linear regression using the built-in mtcars dataset, which includes attributes like car weight and fuel efficiency.

Our goal is to examine the relationship between car weight (wt) and miles per gallon (mpg). Why is this relationship important? Understanding how car weight affects fuel efficiency can provide insights into vehicle performance and energy consumption. If you’re in the market for a brand new car, this is data that could very well concern you. Let’s explore these insights using simple linear regression.

We start by loading the dataset and fitting a linear regression model where mpg is the dependent variable and wt is the independent variable.

# Load the 'mtcars' dataset
data("mtcars")

# Fit the linear model
model <- lm(mpg ~ wt, data = mtcars)

Let’s look at a summary of the model that we’ve constructed.

The Model Summary

# Display the summary of the model
summary(model)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

There’s a lot to look at here. We’ll break it down in the next slide. Feel free to bounce back and forth as you begin to understand the summary.

The Model Summary Explained

This model summary provides detailed statistics about the linear regression model we just built. Let’s break it down:

Residuals: These are the differences between the observed mpg values and the values predicted by the model. In a well-fitting model, the residuals should be scattered around zero with no systematic pattern.
Coefficients: The table of coefficients shows the intercept and slope of the regression line:
- The intercept is the estimated mpg when car weight is zero (which may not have a practical meaning in this context).
- The slope for wt is \(-5.34\). Because wt is recorded in units of 1000 lbs, this indicates that for every 1000 lbs increase in car weight, mpg decreases by approximately 5.34 miles per gallon.
Significance: The p-value for the slope (wt) is extremely small (\(1.29 \times 10^{-10}\)), indicating that the relationship between car weight and mpg is statistically significant.
R-squared: The R-squared value is 0.75, which means that about 75% of the variation in mpg can be explained by car weight.

These metrics confirm that weight is a strong predictor of fuel efficiency.
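As a rough check on the definitions above, the residuals and the R-squared value can be recomputed by hand and compared with what lm() reports. This is only a sketch: the names manual_resid, ss_res, ss_tot, and manual_r2 are made up for illustration.

```r
# Refit the model so this block runs on its own
model <- lm(mpg ~ wt, data = mtcars)

# Residual = observed mpg minus the mpg predicted by the model
manual_resid <- mtcars$mpg - fitted(model)
all.equal(unname(manual_resid), unname(resid(model)))

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(manual_resid^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
manual_r2 <- 1 - ss_res / ss_tot
all.equal(manual_r2, summary(model)$r.squared)
```

Both all.equal() calls should agree with the values in the summary, which is a quick way to convince yourself that the summary is not doing anything mysterious.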

The Model Summary: Simplified

We’ve covered a lot of technical details, but for the purpose of this presentation, focus on these key points and only this portion of the summary:

# Display only the coefficients and R-squared of the model summary
summary(model)$coefficients

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 37.285126   1.877627 19.857575 8.241799e-19
## wt          -5.344472   0.559101 -9.559044 1.293959e-10

summary(model)$r.squared

## [1] 0.7528328

The Slope (\(\beta_1\)): This tells us that for every 1000 lbs increase in car weight, the fuel efficiency decreases by approximately 5.34 miles per gallon.
P-Value for \(\beta_1\): The extremely low p-value (\(1.29 \times 10^{-10}\)) shows that this relationship between car weight and fuel efficiency is statistically significant.
R-squared Value: The R-squared value of 0.75 means that about 75% of the variation in fuel efficiency can be explained by car weight alone.
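To make the slope concrete, here is a small sketch that asks the model for a prediction. The 3000 lb and 4000 lb cars are hypothetical examples, and new_car, predicted_mpg, and heavier_mpg are illustrative names.

```r
# Refit the model so this block runs on its own
model <- lm(mpg ~ wt, data = mtcars)

# A hypothetical car weighing 3000 lbs (wt is recorded in 1000 lbs)
new_car <- data.frame(wt = 3)
predicted_mpg <- predict(model, newdata = new_car)
predicted_mpg  # roughly 21.25 mpg, i.e. 37.285 - 5.344 * 3

# A hypothetical car that is 1000 lbs heavier
heavier_mpg <- predict(model, newdata = data.frame(wt = 4))

# The difference between the two predictions is exactly the slope
heavier_mpg - predicted_mpg
```

The last line returns the wt coefficient, which is the "5.34 mpg per 1000 lbs" statement expressed as a calculation.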

Key Takeaways from the Model

So, in simple terms: As cars get heavier, their fuel efficiency drops significantly. And this relationship is both strong and statistically significant.

Now that we understand what the model is saying, let’s understand more about how the model gets built in the first place.

Understanding the Linear Regression Model for Car Weight and MPG

A simple linear regression model attempts to find the best-fitting line that explains the relationship between the dependent variable (mpg) and the independent variable (wt). The model is represented by this equation:

\[ Y = \beta_0 + \beta_1X + \epsilon \]

Where:

\(Y\) is the dependent variable (Miles per Gallon, mpg),
\(X\) is the independent variable (Weight of the car, wt),
\(\beta_0\) is the intercept,
\(\beta_1\) is the slope, and
\(\epsilon\) is the error term that accounts for deviations from the line.

In the case of the mtcars dataset, \(\beta_1\) represents the expected change in MPG for every 1000 lbs increase in car weight (negative here, so a decrease), while the intercept \(\beta_0\) represents the estimated MPG when the car weight is zero (though this may not have a meaningful physical interpretation).
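Plugging in the estimates from the model summary shown earlier (rounded to two decimals), the fitted line can be written as follows. The hat notation for a predicted value is an addition for illustration and is not used elsewhere in this presentation.

\[ \widehat{\text{mpg}} = 37.29 - 5.34 \times \text{wt} \]

For example, a car with wt = 2 (2000 lbs) has a predicted fuel efficiency of roughly \(37.29 - 5.34 \times 2 \approx 26.6\) mpg.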

The Slope Coefficient Formula

Now, we delve a little deeper into how the slope of the regression line, \(\beta_1\), is calculated. The slope tells us how much the dependent variable (mpg) changes for a one-unit change in the independent variable (wt). The least-squares estimate of \(\beta_1\) is computed with the following formula:

\[ \beta_1 = \frac{ \sum{(X_i - \bar{X})(Y_i - \bar{Y})} }{ \sum{(X_i - \bar{X})^2} } \]

Where:

\(X_i\) and \(Y_i\) are individual sample points,
\(\bar{X}\) and \(\bar{Y}\) are the means of \(X\) and \(Y\), respectively.
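The formula above can be applied directly to the mtcars columns and compared with the coefficient that lm() returns. This is a minimal sketch: x, y, b1, and b0 are illustrative names, and the intercept line uses the standard least-squares identity \(\beta_0 = \bar{Y} - \beta_1\bar{X}\), which is not derived in this presentation.

```r
# X is car weight, Y is miles per gallon
x <- mtcars$wt
y <- mtcars$mpg

# Slope: sum of cross-deviations divided by sum of squared X deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept (assumed identity): the line passes through the point of means
b0 <- mean(y) - b1 * mean(x)

# Compare with the coefficients estimated by lm()
model <- lm(mpg ~ wt, data = mtcars)
c(intercept = b0, slope = b1)
coef(model)
```

The two printed vectors should match the Estimate column of the coefficient table.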

Understanding this formula helps explain the relationship we’re modeling. Now that we’ve established the theoretical foundation of our linear regression model, let’s visualize how the model fits the data from the mtcars dataset. We’ll use an interactive plot to explore the relationship between car weight and miles per gallon.

Visualizing Our Model

Let’s now visualize our regression model. In this interactive Plotly plot, each point represents a car from the mtcars dataset, with its weight on the x-axis and miles per gallon on the y-axis. The regression line indicates the relationship between the two variables. The slope of the line shows how fuel efficiency decreases as car weight increases.
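The code behind the interactive plot is not shown in this presentation, so the following is only one possible way to build it. It assumes the plotly package is installed; the cars data frame, its name and fitted_mpg columns, and the axis titles are choices made for this sketch.

```r
library(plotly)

# Refit the model so this block runs on its own
model <- lm(mpg ~ wt, data = mtcars)

# Copy the data and add the car names and fitted values as columns
cars <- mtcars
cars$name <- rownames(mtcars)
cars$fitted_mpg <- fitted(model)

# Scatter of observed cars plus the fitted regression line
p <- plot_ly(cars, x = ~wt) %>%
  add_markers(y = ~mpg, text = ~name, name = "Observed") %>%
  add_lines(y = ~fitted_mpg, name = "Fitted line") %>%
  layout(
    xaxis = list(title = "Weight (1000 lbs)"),
    yaxis = list(title = "Miles per gallon")
  )

p  # printing the object renders the interactive widget
```

Hovering over a point should show the car's name alongside its weight and mpg.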

Note: If the interactive plot is not working, a static plot modeling the same data will be provided later.

Looking at the Data Through a Different Lens

In case the interactive plot does not load, here is a static scatter plot built with ggplot2 that includes the fitted regression line. It offers another clear view of the relationship between car weight and miles per gallon, reinforcing the negative correlation we’ve observed.
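The plotting code itself is not included here, so this is a hedged sketch of how such a static plot could be produced. It assumes the ggplot2 package is installed; the object name p_static and the labels are illustrative.

```r
library(ggplot2)

# Scatter plot of weight vs. mpg with a least-squares line on top
p_static <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Fuel efficiency vs. car weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

p_static  # printing the object draws the plot
```

Setting se = FALSE hides the confidence band so that only the fitted line is drawn; remove it to see the uncertainty around the line.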

Observing Residuals

To assess the quality of our model, we can look at the residuals, which are the differences between the observed values and the values predicted by the model. A residuals vs. fitted plot helps us evaluate how well the regression model fits the data. Ideally, the residuals should be randomly scattered around zero, which would indicate that the model assumptions hold and the fit is good.
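A residuals vs. fitted plot like the one described can be drawn with base R graphics. This is a minimal sketch rather than the exact plot used in the presentation; the axis labels and title are illustrative.

```r
# Refit the model so this block runs on its own
model <- lm(mpg ~ wt, data = mtcars)

# Fitted values on the x-axis, residuals on the y-axis
plot(
  fitted(model), resid(model),
  xlab = "Fitted mpg",
  ylab = "Residuals",
  main = "Residuals vs. fitted"
)

# Dashed reference line at zero
abline(h = 0, lty = 2)
```

R also provides a built-in version of this diagnostic through plot(model, which = 1), which adds a smoothed trend line to help spot curvature.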

Interpreting the Results

The residuals help us identify any systematic deviations between the predicted and observed values. If patterns emerge, it could indicate that the model is not properly capturing the relationship between the variables or that an assumption is violated: a curved pattern suggests the relationship is not linear, while a funnel shape suggests the residual variance is not constant (a violation of homoscedasticity).

So from our regression model, we can draw several important conclusions. First, the negative slope of the regression line confirms that as the weight of a car increases, its fuel efficiency (measured in miles per gallon) decreases. The model summary provides further insights, such as the statistical significance of the relationship, the value of the intercept, and the slope of the regression line.

The R-squared value from the model summary indicates how much of the variance in mpg is explained by wt. While the relationship is statistically significant, further model diagnostics or additional predictors could be introduced to improve the model’s accuracy.
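As an illustration of what adding a predictor might look like, the sketch below adds horsepower (hp) to the model. The choice of hp is an arbitrary assumption made for this example, not a recommendation from the analysis, and the object names are illustrative.

```r
# Original one-predictor model
model <- lm(mpg ~ wt, data = mtcars)

# Hypothetical extension: weight plus horsepower
model_wt_hp <- lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared penalizes extra predictors, so it is the fairer comparison
summary(model)$adj.r.squared
summary(model_wt_hp)$adj.r.squared
```

Plain R-squared can never decrease when a predictor is added, which is why the adjusted version is compared here.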

Key Takeaways

Negative Correlation: We found a strong negative correlation between car weight and miles per gallon (MPG).
Slope Interpretation: For every additional 1000 lbs of car weight, the expected MPG decreases by approximately 5.34 (the estimated slope \(\beta_1\) is about \(-5.34\)).
Model Diagnostics: The residuals plot shows that the model assumptions are reasonably met, with no strong patterns in the residuals.
Model Fit: The R-squared value of about 0.75 indicates that the model explains roughly 75% of the variance in MPG.

Conclusion

In conclusion, we’ve successfully applied a simple linear regression model to investigate the relationship between car weight and fuel efficiency. The results confirm a significant inverse relationship between these two variables. In future analyses, we could consider extending the model to include other variables from the mtcars dataset to further improve our predictions.

I hope you learned something today about making and using simple linear regression models. Have a good one!