Code-Through: Explaining a Simple Linear Regression in R

Introduction

This analysis uses simple linear regression to examine how vehicle weight is related to miles per gallon (MPG) in the mtcars dataset. After fitting the model, I evaluate the statistical significance of the predictor, interpret the coefficients substantively, and assess the validity of the model using standard diagnostic plots. These steps help ensure that the regression results are both meaningful and statistically reliable.
This type of workflow is similar to the regression models used in our labs on neighborhood change, but applied here to a simpler and self-contained dataset.

Step 1: Load Packages and Data

# Load packages (only base R is really needed for this example)
# but we will use ggplot2 for nicer plots.
library(ggplot2)

# mtcars is built into R, so we can access it directly
data("mtcars")

# Take a quick look at the first few rows
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Interpretation

We will focus on two variables:
mpg – miles per gallon (fuel efficiency), our outcome or dependent variable
wt – car weight (in 1000s of pounds), our predictor or independent variable
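
Before modeling, a quick numeric summary of these two variables can confirm their ranges and that nothing is missing. The following is a minimal sketch using only base R; the object name `vars` is introduced here purely for illustration.

```r
# Keep only the outcome (mpg) and the predictor (wt)
vars <- mtcars[, c("mpg", "wt")]

# Minimum, quartiles, mean, and maximum for each variable
summary(vars)

# Count missing values per column (a complete column shows 0)
colSums(is.na(vars))

# Number of cars (rows) available for the regression
nrow(vars)
```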

Step 2: Explore the Relationship Visually

Before running a regression, it is good practice to look at the data.

# Basic scatterplot of mpg vs. weight

plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon (mpg)",
     main = "Scatterplot of MPG vs. Vehicle Weight",
     pch = 19, col = "darkgray")

# Add a simple linear trend line using base R

abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)

Interpretation

This plot shows that as vehicle weight increases, fuel efficiency generally decreases. The red line is the best-fitting straight line summarizing this pattern.
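
The visual impression of a negative relationship can also be summarized with a single number. As a small sketch (the variable name `r` is an illustrative choice), the Pearson correlation between weight and mpg should come out negative and fairly close to -1 if the downward pattern is as strong as it looks.

```r
# Pearson correlation between weight and fuel efficiency
r <- cor(mtcars$wt, mtcars$mpg)
r

# cor.test() adds a confidence interval and a test of zero correlation
cor.test(mtcars$wt, mtcars$mpg)
```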

Step 3: Fit a Linear Regression Model

Now we formally estimate the relationship using the lm() function.

# Fit a simple linear regression: mpg = b0 + b1 * wt + error

model1 <- lm(mpg ~ wt, data = mtcars)

# View the model summary

summary(model1)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Interpretation

The key pieces of the output are:

Coefficients:
(Intercept) – the expected mpg when weight is zero (a theoretical value, but useful for the equation).
wt – the slope, representing the change in mpg for a one-unit increase in weight (1000 lbs).
R-squared – the proportion of variation in mpg explained by weight.
p-values – tests of whether each coefficient is statistically different from zero.
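
Each of these pieces can also be pulled out of the fitted model programmatically rather than read off the printed summary. The sketch below refits the same model so that it runs on its own; `coef_table`, `slope_p`, and `ci` are illustrative names, not part of the original analysis.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# Coefficient table as a numeric matrix:
# columns are Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(model1))
coef_table

# Pull out a single value by row and column name
slope_p <- coef_table["wt", "Pr(>|t|)"]
slope_p

# 95% confidence intervals for the intercept and the slope
ci <- confint(model1, level = 0.95)
ci
```

If the confidence interval for wt lies entirely below zero, that agrees with the small p-value for the slope.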

Step 4: Interpret the Results

We can extract and restate the coefficients in plain language.

coef(model1)

## (Intercept)          wt 
##   37.285126   -5.344472

Interpretation

Rounded coefficients:
Intercept (b0) ≈ 37.29
Slope for weight (b1) ≈ -5.34

This means a car with a weight of 0 lbs (a hypothetical case outside the range of the data) is predicted to get about 37.3 mpg. For each additional 1000 lbs of weight, the model predicts that fuel efficiency drops by about 5.3 mpg, on average.

In words: Heavier cars tend to get lower gas mileage, and the relationship is approximately linear.
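
To make the equation concrete, here is a short sketch that predicts mpg for a hypothetical 3,000 lb car (wt = 3), first by hand from the coefficients and then with predict(). The car and the names `new_car`, `by_hand`, and `by_predict` are assumptions made for illustration.

```r
model1 <- lm(mpg ~ wt, data = mtcars)
b <- coef(model1)

# Hypothetical car weighing 3,000 lbs (wt is measured in 1000s of pounds)
new_car <- data.frame(wt = 3)

# Prediction computed by hand from mpg = b0 + b1 * wt
by_hand <- unname(b["(Intercept)"] + b["wt"] * new_car$wt)

# Same prediction using predict()
by_predict <- unname(predict(model1, newdata = new_car))

c(by_hand = by_hand, by_predict = by_predict)
```

Both approaches give the same value, roughly 37.29 - 5.34 × 3 ≈ 21.3 mpg.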

We can also check the model fit:

summary(model1)$r.squared

## [1] 0.7528328

Interpretation

R² ≈ 0.75, so about 75% of the variation in fuel efficiency across cars is explained by weight alone.
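
It can help to see where this number comes from. The sketch below rebuilds R² from the residual and total sums of squares; the names `ss_res`, `ss_tot`, and `r2_by_hand` are illustrative.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# Residual sum of squares: variation left unexplained by the model
ss_res <- sum(resid(model1)^2)

# Total sum of squares: variation of mpg around its mean
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R-squared = 1 - unexplained / total
r2_by_hand <- 1 - ss_res / ss_tot
r2_by_hand

# With a single predictor, R-squared equals the squared correlation
cor(mtcars$wt, mtcars$mpg)^2
```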

Step 5: Visualize the Fitted Line with ggplot2

We can build a more polished plot with ggplot2.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "gray30", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
  labs(
    title = "Linear Regression of MPG on Vehicle Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon (mpg)"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Interpretation

The blue line is the fitted regression line, and the shaded area is the 95% confidence band around it (the default for geom_smooth()). This visualization reinforces the negative relationship between weight and fuel efficiency.
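
The shaded band describes uncertainty about the average mpg at each weight, not about an individual car. As a rough sketch in base R, the same band can be computed with predict() and compared with the wider prediction interval for a single new car; the grid of weights and the names `wt_grid`, `conf_band`, and `pred_band` are illustrative assumptions.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# A few weights spanning the observed range
wt_grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 5))

# Confidence interval: uncertainty about the average mpg at each weight
conf_band <- predict(model1, newdata = wt_grid,
                     interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about one new car's mpg (always wider)
pred_band <- predict(model1, newdata = wt_grid,
                     interval = "prediction", level = 0.95)

cbind(wt_grid, conf_band)
cbind(wt_grid, pred_band)
```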

Step 6: Check Residuals (Model Diagnostics)

Finally, we briefly examine residuals to see whether the linear model is reasonable.

par(mfrow = c(1, 2))

# Residuals vs. fitted values

plot(fitted(model1), resid(model1),
     xlab = "Fitted Values (Predicted MPG)",
     ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, col = "red", lwd = 2)

# Normal Q-Q plot of residuals

qqnorm(resid(model1),
       main = "Normal Q-Q Plot of Residuals")
qqline(resid(model1), col = "red", lwd = 2)

par(mfrow = c(1, 1))

Interpretation of the Residuals vs. Fitted Plot and Normal Q–Q Plot

After estimating the linear regression model, I examined two key diagnostic plots—the Residuals vs. Fitted plot and the Normal Q–Q plot—to evaluate whether the assumptions underlying ordinary least squares (OLS) regression were reasonably satisfied. These plots help determine whether the model is appropriate, unbiased, and reliable for inference.

Residuals vs. Fitted Plot

The Residuals vs. Fitted plot assesses two major assumptions: (a) linearity, and (b) homoscedasticity (constant error variance).

In the plot, the residuals are scattered around the horizontal zero line without forming a strong pattern. There is no pronounced curvature, clustering, or funnel shape. This indicates that:

The linearity assumption is reasonably met. There is no strong evidence that the relationship between weight and mpg is non-linear.
The variance of the residuals appears roughly constant. The spread of residuals remains fairly consistent across the range of fitted values, suggesting homoscedasticity.

While a few points fall farther from the zero line than others, these do not create a visible pattern, nor do they appear numerous enough to undermine model reliability.

Conclusion: The model does not exhibit meaningful violations of linearity or constant variance.
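
A visual reading of constant variance can be backed up with a simple numeric check. The sketch below is not part of the original analysis: it computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R, regressing the squared residuals on weight and comparing n × R² from that auxiliary regression to a chi-squared distribution with 1 degree of freedom. The names `aux_data`, `aux`, `bp_stat`, and `bp_p` are illustrative.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# Auxiliary data: squared residuals alongside the predictor
aux_data <- data.frame(sq_resid = resid(model1)^2, wt = mtcars$wt)

# Auxiliary regression: do the squared residuals depend on weight?
aux <- lm(sq_resid ~ wt, data = aux_data)

# Test statistic: n times the R-squared of the auxiliary regression
n <- nrow(aux_data)
bp_stat <- n * summary(aux)$r.squared

# Compare to chi-squared with df = number of predictors (1 here)
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A large p-value is consistent with constant variance; a small one would suggest that the spread of the residuals changes with weight.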

Normal Q–Q Plot of Residuals

The Q–Q plot evaluates the assumption of normally distributed residuals, which matters for the accuracy of hypothesis tests and confidence intervals, especially in small samples.

In this plot:

Most of the points fall very close to the diagonal reference line. This indicates that the central distribution of residuals is approximately normal.
The extreme points at both ends deviate slightly from the line. These indicate minor heavy tails, meaning the model has a few larger-than-expected residuals.

However, the deviations are modest and are common in regressions fit to small samples such as this one (n = 32). They do not substantially threaten the validity of the model.
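
For a numeric complement to the Q–Q plot, one option (an addition here, not something the original analysis ran) is the Shapiro-Wilk test from base R, applied to the residuals. The object name `sw` is illustrative.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(model1))
sw

# The p-value alone, for reporting
sw$p.value
```

A p-value above the chosen significance level means the test does not reject normality. With only 32 observations the test has limited power, so it is best read alongside the plot rather than instead of it.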

Conclusion: Residuals are approximately normal, with only slight deviations in the tails, which is acceptable for OLS regression.

Overall, the regression model is statistically appropriate, reasonably well-behaved, and suitable for inference. The OLS assumptions are met well enough that the coefficient estimates, p-values, and confidence intervals can be interpreted with reasonable confidence.

Recommendations for improving the model

Although the model performs reasonably well, several refinements could improve its predictive accuracy and adherence to assumptions:

Check for Influential Points

Use Cook’s distance or leverage plots to identify cars, such as the Maserati Bora or Ferrari Dino, that may unduly affect the model.
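
A minimal sketch of this check in base R follows. The 4/n cutoff is a common rule of thumb rather than a formal test, and the names `cooks`, `lev`, `cutoff`, and `flagged` are illustrative.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# Cook's distance: how much the fit changes when each car is dropped
cooks <- cooks.distance(model1)

# Leverage (hat values): how unusual each car's weight is
lev <- hatvalues(model1)

# Five cars with the largest Cook's distance
head(sort(cooks, decreasing = TRUE), 5)

# Rule-of-thumb cutoff of 4/n (a convention, not a strict test)
cutoff <- 4 / nrow(mtcars)
flagged <- names(cooks)[cooks > cutoff]
flagged

# Leverage of the flagged cars
lev[flagged]
```

Cars flagged here are worth inspecting. One simple follow-up is to refit the model without them and see whether the slope changes noticeably.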

Consider Nonlinear Terms

Engine performance relationships are often nonlinear. Adding the following terms may capture curvature not visible in the basic model:

squared terms (e.g., wt^2)
interactions (e.g., horsepower × weight)
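
As a sketch of how these terms could be specified in R: inside a formula, a squared term needs to be wrapped in I() so that ^ is treated as arithmetic, and wt * hp expands to both main effects plus their interaction. The model names `model_quad` and `model_int` are illustrative, and hp is the horsepower column in mtcars.

```r
model1 <- lm(mpg ~ wt, data = mtcars)

# Quadratic term: I() makes ^ mean "wt squared" inside the formula
model_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# Interaction: wt * hp expands to wt + hp + wt:hp
model_int <- lm(mpg ~ wt * hp, data = mtcars)

# F-test: does the quadratic term improve on the straight line?
anova(model1, model_quad)

# Adjusted R-squared for the three models
sapply(list(linear = model1, quadratic = model_quad, interaction = model_int),
       function(m) summary(m)$adj.r.squared)
```

Adjusted R² is used for the comparison because plain R² never decreases when terms are added.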

Standardize Predictors

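Standardization puts predictors measured in different units on a common scale, which makes their coefficients easier to compare once more predictors are added. Centering also reduces the collinearity between a predictor and its own squared or interaction terms, although it does not remove collinearity between distinct predictors.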
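
To illustrate what standardization does in this simple model, the sketch below converts both variables to z-scores with scale() and refits the regression. The names `std`, `mpg_z`, `wt_z`, and `model_std` are illustrative.

```r
# scale() centers each variable and divides by its standard deviation;
# it returns a matrix, so as.numeric() converts back to a plain vector
std <- data.frame(
  mpg_z = as.numeric(scale(mtcars$mpg)),
  wt_z  = as.numeric(scale(mtcars$wt))
)

# Regression on standardized variables
model_std <- lm(mpg_z ~ wt_z, data = std)
coef(model_std)

# With one predictor, the standardized slope equals the correlation
cor(mtcars$wt, mtcars$mpg)
```

The slope now reads as the expected change in mpg, in standard deviations, for a one standard deviation increase in weight. The intercept is zero up to rounding error because both variables are centered.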

Try Alternative Models

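Ridge and LASSO regression add a penalty that stabilizes linear models with many correlated predictors, while decision tree-based models may capture nonlinear patterns and interactions.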

Conclusion

This analysis demonstrates a solid application of simple linear regression. Weight behaves as expected, the model fits well, and the diagnostics support the validity of the results. While there is room to refine the model through additional terms or influence diagnostics, the current model provides a meaningful and interpretable picture of how vehicle weight relates to fuel efficiency.

Marvellous Egberuare

2025-12-04
