# Load packages. Base R is enough for the regression itself,
# but we use ggplot2 for nicer plots.
library(ggplot2)
# mtcars is built into R, so we can access it directly
data("mtcars")
# Take a quick look at the first few rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Before running a regression, it is good practice to look at the data.
# Basic scatterplot of mpg vs. weight
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon (mpg)",
     main = "Scatterplot of MPG vs. Vehicle Weight",
     pch = 19, col = "darkgray")
# Add a simple linear trend line using base R
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)
Now we formally estimate the relationship using the lm() function.
# Fit a simple linear regression: mpg = b0 + b1 * wt + error
model1 <- lm(mpg ~ wt, data = mtcars)
# View the model summary
summary(model1)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The key pieces of the output are:
- Coefficients:
  - (Intercept) – the expected mpg when weight is zero (a theoretical value, but useful for the equation).
  - wt – the slope, representing the change in mpg for a one-unit increase in weight (1000 lbs).
- R-squared – the proportion of variation in mpg explained by weight.
- p-values – tests of whether each coefficient is statistically different from zero.
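To make the link between the coefficient table and its p-values concrete, the sketch below recomputes the t value and two-sided p-value for `wt` by hand and adds 95% confidence intervals. The names `ctab`, `t_wt`, and `p_wt` are illustrative only, and the model is refit so the block runs on its own.

```r
# Refit the model so this block is self-contained
model1 <- lm(mpg ~ wt, data = mtcars)

# Coefficient table from the summary, as a numeric matrix
ctab <- coef(summary(model1))

# t value = estimate divided by its standard error
t_wt <- ctab["wt", "Estimate"] / ctab["wt", "Std. Error"]

# Two-sided p-value from a t distribution with the residual degrees of freedom
p_wt <- 2 * pt(-abs(t_wt), df = df.residual(model1))

c(t = t_wt, p = p_wt)

# 95% confidence intervals for the intercept and slope
confint(model1, level = 0.95)
```

The hand-computed values should agree with the `t value` and `Pr(>|t|)` columns printed by `summary(model1)`.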
We can extract and restate the coefficients in plain language.
coef(model1)
## (Intercept)          wt
##   37.285126   -5.344472
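For a simple regression with one predictor, these two numbers can also be reproduced from the closed-form least-squares formulas: the slope is cov(x, y) / var(x) and the intercept is mean(y) minus the slope times mean(x). The following is a minimal sketch; the variable names `x`, `y`, `b0`, and `b1` are chosen here for illustration.

```r
# Predictor and response as plain vectors
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form OLS estimates for simple linear regression
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

c(b0 = b0, b1 = b1)

# Compare with the estimates returned by lm()
all.equal(unname(coef(lm(mpg ~ wt, data = mtcars))), c(b0, b1))
```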
### Interpretation
- Intercept (b0) ≈ 37.29
- Slope for weight (b1) ≈ -5.34

This means a car with weight 0 is predicted to get about 37.3 mpg. That value is hypothetical, because no car weighs zero and the prediction lies far outside the range of the data. For each additional 1000 lbs of weight, the model predicts that mpg decreases by about 5.3 on average.
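The fitted equation can be applied to new weights with `predict()`. For example, a 3000 lb car (wt = 3) is predicted to get roughly 37.29 − 5.34 × 3 ≈ 21.3 mpg. The data frame `new_cars` and its weights are hypothetical values chosen for illustration.

```r
# Refit the model so this block is self-contained
model1 <- lm(mpg ~ wt, data = mtcars)

# Hypothetical cars at 2500, 3000, and 3500 lbs (wt is in units of 1000 lbs)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# Point predictions from the fitted line
pred <- predict(model1, newdata = new_cars)
pred

# Predictions with 95% confidence intervals for the mean mpg at each weight
predict(model1, newdata = new_cars, interval = "confidence", level = 0.95)
```

Because the weights are 0.5 units apart, consecutive predictions differ by half the slope.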
We can also check the model fit:
summary(model1)$r.squared
## [1] 0.7528328
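### Interpretation
An R-squared of about 0.75 means that weight alone accounts for roughly 75% of the variation in mpg across the 32 cars in the data.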
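R-squared can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. In a simple regression it also equals the squared correlation between the predictor and the response. The sketch below checks both; `ss_res`, `ss_tot`, `r2`, and `r2_cor` are illustrative names.

```r
# Refit the model so this block is self-contained
model1 <- lm(mpg ~ wt, data = mtcars)

# Residual and total sums of squares
ss_res <- sum(resid(model1)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R-squared as the share of variation explained by the model
r2 <- 1 - ss_res / ss_tot

# With a single predictor, R-squared is the squared correlation
r2_cor <- cor(mtcars$wt, mtcars$mpg)^2

c(by_hand = r2, squared_cor = r2_cor, from_summary = summary(model1)$r.squared)
```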
We can build a more polished plot with ggplot2.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "gray30", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
  labs(
    title = "Linear Regression of MPG on Vehicle Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon (mpg)"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Finally, we briefly examine residuals to see whether the linear model
is reasonable.
# Show two plots side by side and remember the previous settings
op <- par(mfrow = c(1, 2))
# Residuals vs. fitted values
plot(fitted(model1), resid(model1),
     xlab = "Fitted Values (Predicted MPG)",
     ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, col = "red", lwd = 2)
# Normal Q-Q plot of residuals
qqnorm(resid(model1),
       main = "Normal Q-Q Plot of Residuals")
qqline(resid(model1), col = "red", lwd = 2)
# Restore the previous plotting layout
par(op)
The Residuals vs. Fitted plot assesses two major assumptions: (a) linearity, and (b) homoscedasticity (constant error variance).
In the plot, the residuals are scattered around the horizontal zero line without forming a strong pattern. There is no pronounced curvature, clustering, or funnel shape. This suggests that:
- The linearity assumption is reasonably met. There is no strong evidence that the relationship between weight and mpg is non-linear.
- The variance of the residuals appears roughly constant. The spread of residuals remains fairly consistent across all predicted values, suggesting homoscedasticity.

While a few points fall farther from the zero line than others, they do not create a visible pattern, nor are they numerous enough to undermine model reliability.
In the Normal Q-Q plot:
- Most of the points fall very close to the diagonal reference line. This indicates that the central part of the residual distribution is approximately normal.
- The extreme points at both ends deviate slightly from the line. These indicate mildly heavy tails, meaning the model has a few larger-than-expected residuals.
- However, the deviations are modest and common in regressions on small-to-moderate samples such as this one (n = 32). They do not substantially threaten the validity of the model.

Conclusion: Residuals are mostly normal, with only slight deviations in the tails, which is acceptable for OLS regression.
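The visual reading of the Q-Q plot can be complemented with a formal check. The sketch below uses the Shapiro-Wilk test from base R (`shapiro.test()`), which is one common choice and not something the analysis above depends on; `sw` is an illustrative name.

```r
# Refit the model so this block is self-contained
model1 <- lm(mpg ~ wt, data = mtcars)

# Shapiro-Wilk test of normality applied to the residuals
sw <- shapiro.test(resid(model1))
sw

# The null hypothesis is that the residuals are normally distributed,
# so a small p-value (for example below 0.05) is evidence against normality
sw$p.value
```

With only 32 observations the test has limited power, so it is best read alongside the Q-Q plot and not in place of it.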
The regression model is statistically appropriate, reasonably well-behaved, and suitable for inference. The assumptions underlying OLS are met well enough that coefficient estimates, p-values, and confidence intervals can be trusted.
Although the model performs reasonably well, several refinements could improve its predictive accuracy and adherence to assumptions:
Engine performance relationships are often nonlinear. Adding:
- squared terms (e.g., wt^2)
- interactions (e.g., horsepower × weight)

may capture curvature not visible in the basic model.
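A minimal sketch of both ideas follows, using the `hp` column of mtcars for horsepower. The model names `model_quad` and `model_int` are illustrative. Note that a squared term must be wrapped in `I()` inside a formula, because a bare `wt^2` is interpreted as formula crossing and not as arithmetic.

```r
# Baseline model, refit so this block is self-contained
model1 <- lm(mpg ~ wt, data = mtcars)

# Quadratic term in weight; I() protects the arithmetic inside the formula
model_quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# Weight, horsepower, and their interaction (wt * hp expands to wt + hp + wt:hp)
model_int <- lm(mpg ~ wt * hp, data = mtcars)

# F-tests comparing each extended model with the nested baseline
anova(model1, model_quad)
anova(model1, model_int)

# Adjusted R-squared penalizes the extra terms, so it is a fairer comparison
sapply(list(linear = model1, quadratic = model_quad, interaction = model_int),
       function(m) summary(m)$adj.r.squared)
```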
Standardization helps compare predictors measured in different units, and centering reduces the collinearity between a predictor and its own squared or interaction terms.
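As a sketch of what standardization does and does not change, the block below fits the same two-predictor model on raw and on standardized predictors using `scale()`, then compares the correlation between weight and its square before and after centering. The choice of `hp` as a second predictor and all object names are assumptions made for illustration.

```r
# Same model on raw and on standardized (mean 0, sd 1) predictors
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_std <- lm(mpg ~ scale(wt) + scale(hp), data = mtcars)

# Standardized slopes are in mpg per one standard deviation of each predictor,
# which makes their magnitudes directly comparable
coef(fit_raw)
coef(fit_std)

# Rescaling does not change the fitted values
all.equal(fitted(fit_raw), fitted(fit_std))

# Centering lowers the correlation between a predictor and its square
wt_c <- mtcars$wt - mean(mtcars$wt)
c(raw = cor(mtcars$wt, mtcars$wt^2), centered = cor(wt_c, wt_c^2))
```

Standardizing leaves the correlation between two different predictors unchanged, so it does not by itself remove collinearity between, say, weight and horsepower.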
Ridge or LASSO regression can stabilize the estimates once several correlated predictors are included, and decision tree–based models may capture more complex patterns.
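A minimal ridge regression sketch is given below. It assumes the recommended MASS package is installed and uses its `lm.ridge()` function. The predictor set (`wt`, `hp`, `disp`) and the penalty grid `lambdas` are arbitrary choices for illustration, not a tuned model.

```r
# Assumption: the MASS package (shipped with most R installations) is available
lambdas <- seq(0, 10, by = 0.5)   # illustrative grid of ridge penalties

# Ridge regression with several correlated predictors
ridge_fit <- MASS::lm.ridge(mpg ~ wt + hp + disp, data = mtcars, lambda = lambdas)

# Pick the penalty with the smallest generalized cross-validation score
best_idx <- which.min(ridge_fit$GCV)
best_lambda <- lambdas[best_idx]
best_lambda

# Coefficients on the original scale at the selected penalty
coef(ridge_fit)[best_idx, ]
```

A penalty of 0 corresponds to ordinary least squares, so the first row of `coef(ridge_fit)` gives the unpenalized fit for comparison.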