In this activity, we will learn how to predict the fuel efficiency of a car based on its weight.
The mtcars
data frame is a classic go-to R data frame. It was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 design and performance features for 32 automobiles (1973–74 models).
We can create a table of the entire data set in a new tab with the View()
function.
# Check out the full data set
View(mtcars)
Each row of mtcars
is an automobile, and each column is a performance feature. For example:
mpg
is miles per gallonwt
is weightdrat
is rear axle ratioScatterplots plot one variable against another. They work best for continuous data.
# Make a data matrix for continuous variable
data <- mtcars[,c(1,3:7)]
# Do an all-on-all scatter plot
pairs(data)
In many cases we can see come kind of “linear” relationship: One variable is proportional to another.
We can see in greater detail which features have relationships with others. Scatterplots help us find correlations.
Correlations are perhaps better visualized with corrplot()
.
# load the corrplot package
library(corrplot)
## corrplot 0.84 loaded
# all-on-all correlation plot
corrplot(cor(data), method = "circle")
By comparing the scatterplot and correlation plot for mpg
and wt
, we see that a scatterplot with a negative slope corresponds to a correlation of close to -1 (dark red circle).
We will use linear regression to model the relationship between a dependent variable, such as mpg
, and independent (predictor) variables, such as wt
and drat
.
In the scatterplots, it looks like mpg
is linearly dependent on wt
and drat
and that:
mpg
decreases as wt
increases (negative correlation)mpg
increases as drat
increases (positive correlation)lm()
functionThe function lm()
calculates the coefficients (intercept and slopes) that best fit a linear equation to the data.
The first argument to lm()
is the formula for our equation: mpg ~ wt
We want to model mpg
as a function of wt
.
The second argument to lm()
is the data frame, mtcars
.
# the first argument is the formula
# lm() takes a data frame as input
model_mpg_wt <- lm(mpg ~ wt, data = mtcars)
mpg_wt_summary <- summary(model_mpg_wt)
mpg_wt_summary
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
lm()
the p-value, which tells us how statistically significant our model is, and
the slope, which tells us the degree and direction of dependence: Does the dependent variable increase or decrease with the independent variable.
Here is how we interpret the linear regression results for our example:
The coefficient for wt (weight) is -5.34447, which means that mpg
decreases as wt
increases in the ratio of 5.34 mpg per 1000 lbs.
The p-value is less than 0.05, indicating that the relationship between wt
and mpg
is statistically significant.
The R-squared value is 0.7528, which means that 75% of the variation in mpg
is explained by wt
.
The correlation coefficient is the square root of 0.7528 (multiple R-squared) = 0.868. But this misses the direction.
# Calculate the correlation between mpg and wt
cor(mtcars$mpg,mtcars$wt)
## [1] -0.8676594
This is the same value, and the minus sign indicates that the correlation is inverse.
Aside: A p-value of 0.01 means that, on average, we will a similar result in 100% x (0.01) = 1% of random cases.
We are usually interested in only the slope and the p-value, and we can get this information with the function coefficients()
.
# View the summary results of our model output
View(coefficients(mpg_wt_summary))
# Pull the coefficients from our model
coef_mpg_wt = coefficients(mpg_wt_summary)
# Get the slope and p-value
slope_mpg_wt = coef_mpg_wt[2, 1]
pval_mpg_wt = coef_mpg_wt[2, 4]
# We can also get R-squared
r_squared_value <- mpg_wt_summary$r.squared
slope_mpg_wt
## [1] -5.344472
r_squared_value
## [1] 0.7528328
pval_mpg_wt
## [1] 1.293959e-10
What do our results tell us?
First, because the slope is negative, there is a linear inverse relationship between mpg
and wt
.
Second, our model explains 75% of the variance in mpg
with wt
as the independent variable.
Third, we have a very low p-value, so our result is statistically significant. Our model is doing a good job of explaining the data.
Just using numbers without visualizing them in a graph is not fun. We can plot the regression line in a graph using plot()
.
# make a scatterplot of mpg versus wt
plot(mtcars$mpg ~ mtcars$wt,
ylab="Miles per gallon", xlab="Weight")
# add the line from our model and color it red
abline(model_mpg_wt, col = "red")
This is the same relationship we obtained in the pairwise scatterplots and correlation plot.
We’ll use the function predict()
to make predictions based on our linear regression model.
# Let's predict mpg for two new weight values
# remember, wt is in units of 1,000 lbs
# Predict mpg for 1,000 lbs and 6,000 lbs
pred_mpg_wt <- predict.lm(model_mpg_wt, newdata=data.frame(wt=c(1,6)))
pred_mpg_wt
## 1 2
## 31.940655 5.218297
The argument newdata
specifies the values of wt
for which we want to make predictions. The values are passed in as a data frame using the data.frame()
function.
We get the results:
Keep in mind that based on the correlation we observe, we are assuming that wt
helps determined mpg
. This is reasonable based on our experiences and physical intuition.
However, it is not always the case that correlation and the ability to construct a predictive model tell us about how things work!
library(imager)
## Loading required package: magrittr
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
## add
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:base':
##
## save.image
corr_img<-load.image("/home/data/Comics_Vets.jpg")
plot(corr_img,axes=FALSE)
mpg
on drat
drat
is the rear axle ratio. You can learn more about drat
here.
# the first argument is the formula
# lm() takes a data frame as input
# Uncomment and replace FORMULA and FUNCTION in the code below
model_mpg_drat <- lm(mpg ~ drat, data = mtcars)
mpg_drat_summary <- summary(model_mpg_drat)
mpg_drat_summary
##
## Call:
## lm(formula = mpg ~ drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0775 -2.6803 -0.2095 2.2976 9.0225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.525 5.477 -1.374 0.18
## drat 7.678 1.507 5.096 1.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.485 on 30 degrees of freedom
## Multiple R-squared: 0.464, Adjusted R-squared: 0.4461
## F-statistic: 25.97 on 1 and 30 DF, p-value: 1.776e-05
We are usually interested in only the slope and the p-value.
# Uncomment and replace FUNCTION to get the coefficients
coef_mpg_drat = coefficients(mpg_drat_summary)
slope_mpg_drat = coef_mpg_drat[2, 1]
pval_mpg_drat = coef_mpg_drat[2, 4]
r_squared_value <- mpg_drat_summary$r.squared
slope_mpg_drat
## [1] 7.678233
r_squared_value
## [1] 0.4639952
pval_mpg_drat
## [1] 1.77624e-05
What does the value of slope_mpg_drat
indicate?
# make a scatterplot of mpg versus drat
# Uncomment and replace FUNCTION, FEATURE and "Feature Name" in the code:
plot(mtcars$mpg ~ mtcars$drat,
ylab="Miles per gallon", xlab="Rear axle ratio")
# Uncomment and replace OBJECT to add the line from our model and color it red
abline(model_mpg_drat, col = "red")
Let’s use our model to predict mpg
for new drat
values.
# Let's predict mpg for two new drat values
# what are the units for drat???
# no units
# check out some drat values in the mtcars data frame
pred_mpg_drat <- predict(model_mpg_drat, newdata=data.frame(drat=c(2,5)))
pred_mpg_drat
## 1 2
## 7.831847 30.866545
What are your conclusions? For a drat of 2.0, the fuel efficiency is 7.83 and for a drat of 5.0, the fuel efficiency is 30.87.