Introduction

In this activity, we will learn how to predict the fuel efficiency of a car based on its weight.


mtcars data

The mtcars data frame is a classic go-to R data frame. It was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 design and performance features for 32 automobiles (1973–74 models).

We can create a table of the entire data set in a new tab with the View() function.

# Check out the full data set
View(mtcars)

Each row of mtcars is an automobile, and each column is a performance feature. For example:


Scatterplots

Scatterplots plot one variable against another. They work best for continuous data.

  • mgp
  • disp
  • hp
  • drat
  • wt
  • qsec
# Make a data matrix for continuous variable
data <- mtcars[,c(1,3:7)]

# Do an all-on-all scatter plot
pairs(data)

In many cases we can see come kind of “linear” relationship: One variable is proportional to another.

We can see in greater detail which features have relationships with others. Scatterplots help us find correlations.


Correlation plots

Correlations are perhaps better visualized with corrplot().

# load the corrplot package
library(corrplot)
## corrplot 0.84 loaded
# all-on-all correlation plot
corrplot(cor(data), method = "circle")

By comparing the scatterplot and correlation plot for mpg and wt, we see that a scatterplot with a negative slope corresponds to a correlation of close to -1 (dark red circle).


Linear regression

We will use linear regression to model the relationship between a dependent variable, such as mpg, and independent (predictor) variables, such as wt and drat.

In the scatterplots, it looks like mpg is linearly dependent on wt and drat and that:


The lm() function

The function lm() calculates the coefficients (intercept and slopes) that best fit a linear equation to the data.

The first argument to lm() is the formula for our equation: mpg ~ wt We want to model mpg as a function of wt.

The second argument to lm() is the data frame, mtcars.

# the first argument is the formula
# lm() takes a data frame as input

model_mpg_wt <- lm(mpg ~ wt, data = mtcars)

mpg_wt_summary <- summary(model_mpg_wt)

mpg_wt_summary
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Output of lm()

  • the p-value, which tells us how statistically significant our model is, and

  • the slope, which tells us the degree and direction of dependence: Does the dependent variable increase or decrease with the independent variable.

Here is how we interpret the linear regression results for our example:

  • The coefficient for wt (weight) is -5.34447, which means that mpg decreases as wt increases in the ratio of 5.34 mpg per 1000 lbs.

  • The p-value is less than 0.05, indicating that the relationship between wt and mpg is statistically significant.

  • The R-squared value is 0.7528, which means that 75% of the variation in mpg is explained by wt.

The correlation coefficient is the square root of 0.7528 (multiple R-squared) = 0.868. But this misses the direction.

# Calculate the correlation between mpg and wt

cor(mtcars$mpg,mtcars$wt)
## [1] -0.8676594

This is the same value, and the minus sign indicates that the correlation is inverse.

Aside: A p-value of 0.01 means that, on average, we will a similar result in 100% x (0.01) = 1% of random cases.


Slope and p-value

We are usually interested in only the slope and the p-value, and we can get this information with the function coefficients().

# View the summary results of our model output

View(coefficients(mpg_wt_summary))
# Pull the coefficients from our model

coef_mpg_wt = coefficients(mpg_wt_summary)

# Get the slope and p-value
slope_mpg_wt = coef_mpg_wt[2, 1]
pval_mpg_wt = coef_mpg_wt[2, 4]

# We can also get R-squared
r_squared_value <- mpg_wt_summary$r.squared

slope_mpg_wt
## [1] -5.344472
r_squared_value
## [1] 0.7528328
pval_mpg_wt
## [1] 1.293959e-10

What do our results tell us?

First, because the slope is negative, there is a linear inverse relationship between mpg and wt.

Second, our model explains 75% of the variance in mpg with wt as the independent variable.

Third, we have a very low p-value, so our result is statistically significant. Our model is doing a good job of explaining the data.


Plot the regression line

Just using numbers without visualizing them in a graph is not fun. We can plot the regression line in a graph using plot().

# make a scatterplot of mpg versus wt
plot(mtcars$mpg ~ mtcars$wt, 
     ylab="Miles per gallon", xlab="Weight")

# add the line from our model and color it red
abline(model_mpg_wt, col = "red")

This is the same relationship we obtained in the pairwise scatterplots and correlation plot.


Prediction

We’ll use the function predict() to make predictions based on our linear regression model.

# Let's predict mpg for two new weight values 
# remember, wt is in units of 1,000 lbs
# Predict mpg for 1,000 lbs and 6,000 lbs

pred_mpg_wt <- predict.lm(model_mpg_wt, newdata=data.frame(wt=c(1,6)))
pred_mpg_wt
##         1         2 
## 31.940655  5.218297

The argument newdata specifies the values of wt for which we want to make predictions. The values are passed in as a data frame using the data.frame() function.

We get the results:


Correlation is not causation!

Keep in mind that based on the correlation we observe, we are assuming that wt helps determined mpg. This is reasonable based on our experiences and physical intuition.

However, it is not always the case that correlation and the ability to construct a predictive model tell us about how things work!

library(imager)
## Loading required package: magrittr
## 
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
## 
##     add
## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum
## The following object is masked from 'package:graphics':
## 
##     frame
## The following object is masked from 'package:base':
## 
##     save.image
corr_img<-load.image("/home/data/Comics_Vets.jpg")

plot(corr_img,axes=FALSE)

Exercise: Model mpg on drat

drat is the rear axle ratio. You can learn more about drat here.

# the first argument is the formula
# lm() takes a data frame as input
# Uncomment and replace FORMULA and FUNCTION in the code below

model_mpg_drat <- lm(mpg ~ drat, data = mtcars)

mpg_drat_summary <- summary(model_mpg_drat)

mpg_drat_summary
## 
## Call:
## lm(formula = mpg ~ drat, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0775 -2.6803 -0.2095  2.2976  9.0225 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.525      5.477  -1.374     0.18    
## drat           7.678      1.507   5.096 1.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.485 on 30 degrees of freedom
## Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
## F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05

Slope and p-value

We are usually interested in only the slope and the p-value.

# Uncomment and replace FUNCTION to get the coefficients

coef_mpg_drat = coefficients(mpg_drat_summary)

slope_mpg_drat = coef_mpg_drat[2, 1]

pval_mpg_drat = coef_mpg_drat[2, 4]

r_squared_value <- mpg_drat_summary$r.squared

slope_mpg_drat
## [1] 7.678233
r_squared_value
## [1] 0.4639952
pval_mpg_drat
## [1] 1.77624e-05

What does the value of slope_mpg_drat indicate?

Plot your model

# make a scatterplot of mpg versus drat
# Uncomment and replace FUNCTION, FEATURE and "Feature Name" in the code: 

plot(mtcars$mpg ~ mtcars$drat, 
    ylab="Miles per gallon", xlab="Rear axle ratio")

# Uncomment and replace OBJECT to add the line from our model and color it red
abline(model_mpg_drat, col = "red")


Prediction

Let’s use our model to predict mpg for new drat values.

# Let's predict mpg for two new drat values 
# what are the units for drat???  
# no units
# check out some drat values in the mtcars data frame

pred_mpg_drat <- predict(model_mpg_drat, newdata=data.frame(drat=c(2,5)))
pred_mpg_drat
##         1         2 
##  7.831847 30.866545

What are your conclusions? For a drat of 2.0, the fuel efficiency is 7.83 and for a drat of 5.0, the fuel efficiency is 30.87.