In this activity, we will learn how to predict the fuel efficiency of a car based on its weight.
The mtcars data frame is a classic go-to R data frame. It was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 design and performance features for 32 automobiles (1973–74 models).
We can create a table of the entire data set in a new tab with the View() function.
# Check out the full data set
View(mtcars)
Each row of mtcars is an automobile, and each column is a performance feature. For example:
mpg is miles per gallonwt is weightdrat is rear axle ratioScatterplots plot one variable against another. They work best for continuous data.
# Make a data matrix for continuous variable
data <- mtcars[,c(1,3:7)]
# Do an all-on-all scatter plot
pairs(data)
In many cases we can see come kind of “linear” relationship: One variable is proportional to another.
We can see in greater detail which features have relationships with others. Scatterplots help us find correlations.
Correlations are perhaps better visualized with corrplot().
# load the corrplot package
library(corrplot)
## corrplot 0.84 loaded
# all-on-all correlation plot
corrplot(cor(data), method = "circle")
By comparing the scatterplot and correlation plot for mpg and wt, we see that a scatterplot with a negative slope corresponds to a correlation of close to -1 (dark red circle).
We will use linear regression to model the relationship between a dependent variable, such as mpg, and independent (predictor) variables, such as wt and drat.
In the scatterplots, it looks like mpg is linearly dependent on wt and drat and that:
mpg decreases as wt increases (negative correlation)mpg increases as drat increases (positive correlation)lm() functionThe function lm() calculates the coefficients (intercept and slopes) that best fit a linear equation to the data.
The first argument to lm() is the formula for our equation: mpg ~ wt We want to model mpg as a function of wt.
The second argument to lm() is the data frame, mtcars.
# the first argument is the formula
# lm() takes a data frame as input
model_mpg_wt <- lm(mpg ~ wt, data = mtcars)
mpg_wt_summary <- summary(model_mpg_wt)
mpg_wt_summary
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
lm()the p-value, which tells us how statistically significant our model is, and
the slope, which tells us the degree and direction of dependence: Does the dependent variable increase or decrease with the independent variable.
Here is how we interpret the linear regression results for our example:
The coefficient for wt (weight) is -5.34447, which means that mpg decreases as wt increases in the ratio of 5.34 mpg per 1000 lbs.
The p-value is less than 0.05, indicating that the relationship between wt and mpg is statistically significant.
The R-squared value is 0.7528, which means that 75% of the variation in mpg is explained by wt.
The correlation coefficient is the square root of 0.7528 (multiple R-squared) = 0.868. But this misses the direction.
# Calculate the correlation between mpg and wt
cor(mtcars$mpg,mtcars$wt)
## [1] -0.8676594
This is the same value, and the minus sign indicates that the correlation is inverse.
Aside: A p-value of 0.01 means that, on average, we will a similar result in 100% x (0.01) = 1% of random cases.
We are usually interested in only the slope and the p-value, and we can get this information with the function coefficients().
# View the summary results of our model output
View(coefficients(mpg_wt_summary))
# Pull the coefficients from our model
coef_mpg_wt = coefficients(mpg_wt_summary)
# Get the slope and p-value
slope_mpg_wt = coef_mpg_wt[2, 1]
pval_mpg_wt = coef_mpg_wt[2, 4]
# We can also get R-squared
r_squared_value <- mpg_wt_summary$r.squared
slope_mpg_wt
## [1] -5.344472
r_squared_value
## [1] 0.7528328
pval_mpg_wt
## [1] 1.293959e-10
What do our results tell us?
First, because the slope is negative, there is a linear inverse relationship between mpg and wt.
Second, our model explains 75% of the variance in mpg with wt as the independent variable.
Third, we have a very low p-value, so our result is statistically significant. Our model is doing a good job of explaining the data.
Just using numbers without visualizing them in a graph is not fun. We can plot the regression line in a graph using plot().
# make a scatterplot of mpg versus wt
plot(mtcars$mpg ~ mtcars$wt,
ylab="Miles per gallon", xlab="Weight")
# add the line from our model and color it red
abline(model_mpg_wt, col = "red")
This is the same relationship we obtained in the pairwise scatterplots and correlation plot.
We’ll use the function predict() to make predictions based on our linear regression model.
# Let's predict mpg for two new weight values
# remember, wt is in units of 1,000 lbs
# Predict mpg for 1,000 lbs and 6,000 lbs
pred_mpg_wt <- predict.lm(model_mpg_wt, newdata=data.frame(wt=c(1,6)))
pred_mpg_wt
## 1 2
## 31.940655 5.218297
The argument newdata specifies the values of wt for which we want to make predictions. The values are passed in as a data frame using the data.frame() function.
We get the results:
Keep in mind that based on the correlation we observe, we are assuming that wt helps determined mpg. This is reasonable based on our experiences and physical intuition.
However, it is not always the case that correlation and the ability to construct a predictive model tell us about how things work!
library(imager)
## Loading required package: magrittr
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
## add
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:base':
##
## save.image
corr_img<-load.image("/home/data/Comics_Vets.jpg")
plot(corr_img,axes=FALSE)
mpg on dratdrat is the rear axle ratio. You can learn more about drat here.
# the first argument is the formula
# lm() takes a data frame as input
model_mpg_drat <- lm(mpg ~ drat, data = mtcars)
mpg_drat_summary <- summary(model_mpg_drat)
mpg_drat_summary
##
## Call:
## lm(formula = mpg ~ drat, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0775 -2.6803 -0.2095 2.2976 9.0225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.525 5.477 -1.374 0.18
## drat 7.678 1.507 5.096 1.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.485 on 30 degrees of freedom
## Multiple R-squared: 0.464, Adjusted R-squared: 0.4461
## F-statistic: 25.97 on 1 and 30 DF, p-value: 1.776e-05
We are usually interested in only the slope and the p-value.
coef_mpg_drat = coefficients(mpg_drat_summary)
slope_mpg_drat = coef_mpg_drat[2, 1]
pval_mpg_drat = coef_mpg_drat[2, 4]
r_squared_value <- mpg_drat_summary$r.squared
slope_mpg_drat
## [1] 7.678233
r_squared_value
## [1] 0.4639952
pval_mpg_drat
## [1] 1.77624e-05
What does the value of slope_mpg_drat indicate?
# make a scatterplot of mpg versus drat
plot(mtcars$mpg ~ mtcars$drat,
ylab="Miles per gallon", xlab="Rear Axle Ratio")
abline(model_mpg_drat, col = "red")
Let’s use our model to predict mpg for new drat values.
# Let's predict mpg for two new drat values
# what are the units for drat???
# check out some drat values in the mtcars data frame
pred_mpg_drat <- predict(model_mpg_drat, newdata=data.frame(drat=c(2,5)))
pred_mpg_drat
## 1 2
## 7.831847 30.866545
What are your conclusions? A smaller rear axle ratio leads to a lighter weight car. The data shows that a lighter car will consume gas more efficiently.