M. Drew LaMar
February 10, 2020
Male lizards in the species Crotaphytus collaris use their jaws as weapons during territorial interactions. Lappin and Husak (2005) tested whether weapon performance (bite force) predicted territory size in this species.
Computing best-fit line using R
(biteRegression <- lm(territory.area ~ bite, data = biteData))
Call:
lm(formula = territory.area ~ bite, data = biteData)
Coefficients:
(Intercept) bite
-31.54 11.68
Bonus!!! With lm
, can add best-fit line to plot.
# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))
# Scatter plot
plot(biteData, pch=16, col="firebrick", cex=1.5, cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)))
# Add in the best-fit line
abline(biteRegression, lwd=3)
Bonus!!! With lm
, can add best-fit line to plot.
Definition: The
predicted value of \( Y \) (denoted \( \hat{Y} \), or \( \mu_{Y\, |\, X} \)) from a regression line estimates the mean value of \( Y \) for all individuals having a given value of \( X \).
Definition:
Residuals measure the scatter of points above and below the least-squares regression line, and are denoted by
\[ r_{i} = \hat{Y}_{i} - Y_{i}, \]
where \( \hat{Y}_{i} = a + bX_{i} \).
We can predict what the mean value of \( Y \) is for values of the explanatory variable \( X \) not represented in our data, as long as we are within the range of values of the data.
The function predict
accomplishes this, and even gives us a standard error for our estimate.
(pred_5.1 <- predict(biteRegression, data.frame(bite = 5.1), se.fit = TRUE))
$fit
1
28.01492
$se.fit
[1] 2.163259
$df
[1] 9
$residual.scale
[1] 5.788413
Definition: a
residual plot is a scatter plot of the residuals \( (\hat{Y}_{i}-Y_{i}) \) against the \( X_{i} \), the values of the explanatory variable.
# Get residuals from regression output
biteData$res = resid(biteRegression)
# Plot residuals
plot(res ~ bite, data=biteData, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Bite force (N)", ylab="Residuals")
# Add a horizontal line at zero
abline(h=0, lty=2)
hist(biteData$res)
qqnorm(biteData$res)
summary(biteRegression)
Call:
lm(formula = territory.area ~ bite, data = biteData)
Residuals:
Min 1Q Median 3Q Max
-7.0472 -4.5101 -0.5504 3.6689 10.2237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.539 23.513 -1.341 0.2127
bite 11.677 4.848 2.409 0.0393 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared: 0.3919, Adjusted R-squared: 0.3244
F-statistic: 5.801 on 1 and 9 DF, p-value: 0.03934
\( P \)-value is less than 0.05, so we can reject the null hypothesis that the slope \( \beta = 0 \).
We can measure how well the line “fits” the data by estimating the \( R^{2} \) value, i.e.
\[ R^{2} = \frac{\sigma^{2}_{\mathrm{regression}}}{\sigma^{2}_{\mathrm{response}}} = \frac{\sigma^{2}_{\mathrm{response}} - \sigma^{2}_{\mathrm{residual}}}{\sigma^{2}_{\mathrm{response}}}. \]
This also can be said to measure the fraction of variation in \( Y \) that is “explained” by \( X \).
Basic idea is:
For the lizard example,
biteRegSummary <- summary(biteRegression)
biteRegSummary$r.squared
[1] 0.3919418
Thus, 39% of the variation in territory area is explained by bite force.
# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))
# Scatter plot
plot(territory.area ~ bite, pch=16, col="firebrick", cex=1.5, cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)), data=biteData)
# Add in the best-fit line
abline(biteRegression, lwd=3)
# Text
text(5, 30.5, expression(R^{2} ~ "=" ~ 0.39), cex=1.5)
text(5.14, 31.7, expression(y ~ "=" -31.54 + 11.68), cex=1.5)
summary(biteRegression)
Call:
lm(formula = territory.area ~ bite, data = biteData)
Residuals:
Min 1Q Median 3Q Max
-7.0472 -4.5101 -0.5504 3.6689 10.2237
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -31.539 23.513 -1.341 0.2127
bite 11.677 4.848 2.409 0.0393 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared: 0.3919, Adjusted R-squared: 0.3244
F-statistic: 5.801 on 1 and 9 DF, p-value: 0.03934
Definition: Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted \( x_1 \), \( x_2 \), \( x_3 \), …).
The method is motivated by scenarios where many variables may be simultaneously connected to an output.
We're going to look at auction data for the Mario Kart Wii game.
mario_kart <- marioKart %>%
dplyr::select(price = totalPr,
cond,
stock_photo = stockPhoto,
duration, wheels) %>%
mutate(cond = forcats::fct_relevel(cond, c("used", "new"))) %>%
filter(price < 100)
str(mario_kart)
'data.frame': 141 obs. of 5 variables:
$ price : num 51.5 37 45.5 44 71 ...
$ cond : Factor w/ 2 levels "used","new": 2 1 2 2 2 2 1 2 1 1 ...
$ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
$ duration : int 3 7 3 3 1 3 1 1 3 7 ...
$ wheels : int 1 1 1 1 2 0 0 2 1 1 ...
mario_kart %>%
ggplot(aes(x = cond, y = price)) +
geom_point(position = position_jitter(width = 0.1))
summary(mdl_cond)
Call:
lm(formula = price ~ cond, data = mario_kart)
Residuals:
Min 1Q Median 3Q Max
-13.8911 -5.8311 0.1289 4.1289 22.1489
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.871 0.814 52.668 < 2e-16 ***
condnew 10.900 1.258 8.662 1.06e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.371 on 139 degrees of freedom
Multiple R-squared: 0.3506, Adjusted R-squared: 0.3459
F-statistic: 75.03 on 1 and 139 DF, p-value: 1.056e-14
A multiple regression model is a linear model with many predictors. In general, we write the model as \[ \hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} \] when there are \( k \) predictors.
mdl_full <- lm(price ~ cond + stock_photo + duration + wheels, data = mario_kart)
summary(mdl_full)
Call:
lm(formula = price ~ cond + stock_photo + duration + wheels,
data = mario_kart)
Residuals:
Min 1Q Median 3Q Max
-11.3788 -2.9854 -0.9654 2.6915 14.0346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.21097 1.51401 23.917 < 2e-16 ***
condnew 5.13056 1.05112 4.881 2.91e-06 ***
stock_photoyes 1.08031 1.05682 1.022 0.308
duration -0.02681 0.19041 -0.141 0.888
wheels 7.28518 0.55469 13.134 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.7108
F-statistic: 87.01 on 4 and 136 DF, p-value: < 2.2e-16