Regression and Correlation

M. Drew LaMar
April 20, 2022


https://xkcd.com/605/

Regression

Definition: Regression is the method used to predict values of one numerical variable (response) from values of another (explanatory).

Note: Regression can be done on data from an observational or experimental study.

We will discuss 3 types:

  • Linear regression
  • Nonlinear regression
  • Logistic regression

Linear regression

Definition: Linear regression draws a straight line through the data to predict the response variable from the explanatory variable.

Slope determines rate of change of response with explanatory - humans lose 0.076 units of genetic diversity with every 10,000 km from East Africa.

Formula for the line

Definition: For the population, the regression line is

\[ Y = \alpha + \beta X, \]
where \( \alpha \) (the intercept) and \( \beta \) (the slope) are population parameters.

Definition: For a sample, the regression line is

\[ Y = a + b X, \]
where \( a \) and \( b \) are estimates of \( \alpha \) and \( \beta \), respectively.

Graph of the line

  • \( a \): intercept
  • \( b \): slope

Assumptions of linear regression

Note: At each value of \( X \), there is a population of \( Y \)-values whose mean lies on the true regression line (this is the linear assumption).

Assumptions of linear regression

  • Linearity
  • Residuals are normally distributed
  • Constant variance of residuals
  • Independent observations

alt text

Linear regression is a statistical model

Linear regression is a model formulation

Usually (but not always) it is reserved for situations where you assert evidence of causation (e.g. A causes B)

Correlation, in contrast, describes relationships (e.g. A and B are positively correlated)

Linear correlation coefficient

Variables: For a correlation, our data consist of two numerical variables (continuous or discrete).

Definition: The (linear) correlation coefficient \( \rho \) measures the strength and direction of the association between two numerical variables in a population.

The linear (Pearson) correlation coefficient measures the tendency of two numerical variables to co-vary in a linear way.

The symbol \( r \) denotes a sample estimate of \( \rho \).

Sample correlation coefficient

\[ r = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\sum_{i}(Y_{i}-\bar{Y})^2}} \]

\[ -1 \leq r \leq 1 \]

Sample correlation coefficient

\[ r = \frac{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum_{i}(Y_{i}-\bar{Y})^2}} \]

Sample correlation coefficient

\[ r = \frac{\mathrm{Covariance}(X,Y)}{s_{X}s_{Y}} \]

Spurious correlations

Important!

Technically, the linear regression equation is

\[ \mu_{Y\, |\, X=X^{*}} = \alpha + \beta X^{*}, \]

were \( \mu_{Y\, |\, X=X^{*}} \) is the mean of \( Y \) in the sub-population with \( X=X^{*} \) (called predicted values).

You are predicting the mean of Y given X.

How do you find the "best fit" line?

Method of least squares

Definition: The least-squares regression line is the line for which the sum of all the squared deviations in \( Y \) is smallest.

How do you find the "best fit" line?

The method of least-squares leads to the following estimates for intercept and slope:

\[ \begin{align} b & = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i}(X_{i}-\bar{X})^2} \\ a & = \bar{Y}-b\bar{X} \end{align} \]

Note:

\[ b = \frac{\mathrm{Covariance(X,Y)}}{s_{X}^2} = r\frac{s_{Y}}{s_{X}}, \]

where \( r \) is the correlation coefficient!

Example: Biting lizards

Example: Biting lizards

Example: Biting lizards

Example: Biting lizards

Male lizards in the species Crotaphytus collaris use their jaws as weapons during territorial interactions. Lappin and Husak (2005) tested whether weapon performance (bite force) predicted territory size in this species.

Example: Biting lizards

plot of chunk unnamed-chunk-2

Example: Biting lizards

Compute best-fit line: Slope

\[ b = \frac{\mathrm{Covariance(X,Y)}}{s_{X}^2} \]

# Slope
(b <- cov(biteData$bite, biteData$territory.area)/var(biteData$bite))
[1] 11.6773

Example: Biting lizards

Compute best-fit line: Intercept

\[ a = \bar{Y}-b\bar{X} \]

# Intercept
(a <- mean(biteData$territory.area) - b*mean(biteData$bite))
[1] -31.53929

Example: Biting lizards

Faster!!! Use lm

(biteRegression <- lm(territory.area ~ bite, data = biteData))

Call:
lm(formula = territory.area ~ bite, data = biteData)

Coefficients:
(Intercept)         bite  
     -31.54        11.68  

Example: Biting lizards

Bonus!!! With lm, can add best-fit line to plot.

# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))

# Scatter plot
plot(biteData, pch=16, col="firebrick", cex=1.5, cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)))

# Add in the best-fit line
abline(biteRegression, lwd=3)

Example: Biting lizards

Bonus!!! With lm, can add best-fit line to plot.

plot of chunk unnamed-chunk-7

Predicted values and residuals

Definition: The predicted value of \( Y \) (denoted \( \hat{Y} \), or \( \mu_{Y\, |\, X} \)) from a regression line estimates the mean value of \( Y \) for all individuals having a given value of \( X \).

Definition: Residuals measure the scatter of points above and below the least-squares regression line, and are denoted by

\[ r_{i} = \hat{Y}_{i} - Y_{i}, \]
where \( \hat{Y}_{i} = a + bX_{i} \).

Predicted values and residuals

Prediction values

We can predict what the mean value of \( Y \) is for values of the explanatory variable \( X \) not represented in our data, as long as we are within the range of values of the data.

The function predict accomplishes this, and even gives us a standard error for our estimate.

(pred_5.1 <- predict(biteRegression, data.frame(bite = 5.1), se.fit = TRUE))
$fit
       1 
28.01492 

$se.fit
[1] 2.163259

$df
[1] 9

$residual.scale
[1] 5.788413

Prediction values

plot of chunk unnamed-chunk-9

Prediction values - Extrapolation

Definition: Extrapolation is the prediction of the value of a response variable outside the range of \( X \)-values in the data.

Regression should not be used to predict the value of the response values for an \( X \)-value that lies well outside the range of the data.

Residual plot

Definition: a residual plot is a scatter plot of the residuals \( (\hat{Y}_{i}-Y_{i}) \) against the \( X_{i} \), the values of the explanatory variable.

plot of chunk unnamed-chunk-10

Residual plots

# Get residuals from regression output
biteData$res = resid(biteRegression)

# Plot residuals
plot(res ~ bite, data=biteData, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Bite force (N)", ylab="Residuals")

# Add a horizontal line at zero
abline(h=0, lty=2)

Residual plots to check assumptions

hist(biteData$res)

plot of chunk unnamed-chunk-12

qqnorm(biteData$res)

plot of chunk unnamed-chunk-13

Linear regression

summary(biteRegression)

Linear regression


Call:
lm(formula = territory.area ~ bite, data = biteData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0472 -4.5101 -0.5504  3.6689 10.2237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -31.539     23.513  -1.341   0.2127  
bite          11.677      4.848   2.409   0.0393 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared:  0.3919,    Adjusted R-squared:  0.3244 
F-statistic: 5.801 on 1 and 9 DF,  p-value: 0.03934

\( P \)-value is less than 0.05, so we can reject the null hypothesis that the slope \( \beta = 0 \).

Variation explained by explanatory variable

We can measure how well the line “fits” the data by estimating the \( R^{2} \) value, i.e.

\[ R^{2} = \frac{\sigma^{2}_{\mathrm{regression}}}{\sigma^{2}_{\mathrm{response}}} = \frac{\sigma^{2}_{\mathrm{response}} - \sigma^{2}_{\mathrm{residual}}}{\sigma^{2}_{\mathrm{response}}}. \]

This also can be said to measure the fraction of variation in \( Y \) that is “explained” by \( X \).

Variation explained by explanatory variable

Basic idea is:

  • If \( R^{2} \) is close to 1, then \( X \) is explaining most of the variation in \( Y \), and any other variation which could be caused by other sources is negligible in comparison.
  • If \( R^{2} \) is close to 0, then \( X \) is explaining very little of the variation in \( Y \), and the remaining variation is caused by other sources not accounted for or measured in the system of study.

Variation explained by explanatory variable

For the lizard example,

biteRegSummary <- summary(biteRegression)
biteRegSummary$r.squared
[1] 0.3919418

Thus, 39% of the variation in territory area is explained by bite force.

Annotate plot

# Need to adjust margins to see axis labels
par(mar=c(4.5,5.0,2,2))

# Scatter plot
plot(territory.area ~ bite, pch=16, col="firebrick", cex=1.5, cex.lab=1.5, xlab="Bite force (N)", ylab=expression("Territory area" ~ (m^2)), data=biteData)

# Add in the best-fit line
abline(biteRegression, lwd=3)

# Text
text(5, 30.5, expression(R^{2} ~ "=" ~ 0.39), cex=1.5)
text(5.14, 31.7, expression(y ~ "=" -31.54 + 11.68), cex=1.5)

Annotate plot

plot of chunk unnamed-chunk-18

Summary of a regression in R

summary(biteRegression)

Summary of a regression in R


Call:
lm(formula = territory.area ~ bite, data = biteData)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.0472 -4.5101 -0.5504  3.6689 10.2237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -31.539     23.513  -1.341   0.2127  
bite          11.677      4.848   2.409   0.0393 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.788 on 9 degrees of freedom
Multiple R-squared:  0.3919,    Adjusted R-squared:  0.3244 
F-statistic: 5.801 on 1 and 9 DF,  p-value: 0.03934

Logistic regression

Definition: Logistic regression is nonlinear regression developed for a categorical response variable with two levels.

Caption: Mortality of guppies in relation to duration of exposure to a temperature of 5\( ^{\circ} \) C. (Estimating probability of an event)

Logistic regression

Properties:

  • Outcomes at every \( X \) have a binomial distribution (not a normal distribution).
  • Probability of an event is given by corresponding predicted value on the regression curve.
  • To correct for differences in variance of residuals at different \( X \) values, residuals are weighted by estimated variance obtained from binomial distribution.

Logistic regression

The following curve is fit in logistic regression:

\[ \textrm{log-odds}(Y) = \alpha + \beta X, \]

where

\[ \textrm{log-odds}(Y) = \ln\left[\frac{Y}{1-Y}\right] \]

and \( \alpha, \beta \) are population parameters.

Note: \( \textrm{Log-odds}(Y) \) is called the logit link function.

Logistic regression

The estimates \( a \) and \( b \) for \( \alpha \) and \( \beta \), respectively, describe the linear relationship between \( X \) and the predicted log odds of \( Y \), i.e.

\[ \textrm{log-odds}(\hat{Y}) = a + b X. \]

To obtain the original predicted values, we need to solve for \( \hat{Y} \), obtaining

\[ \hat{Y} = \frac{e^{a+bX}}{1+e^{a+bX}} = \frac{1}{1 + e^{-(a + bX)}}. \]

This is a sigmoidal (S-shaped) function!! When \( b > 0 \), function is increasing. When \( b < 0 \), function is decreasing. When \( b = 0 \), function is constant.

Example: Logistic regression

The threat of bioterrorism makes it necessary to quantify the risk of exposure to infectious agents such as anthrax (Bacillus anthracis). Hans (2002) measured the mortality of rhesus monkeys in an exposure chamber to aerosolized anthrax spores of varying concentrations.

Example: Logistic regression

Look at data

anthrax <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter17/chap17q35AnthraxMortality.csv")
str(anthrax)
'data.frame':   9 obs. of  3 variables:
 $ anthraxConcentration: int  29300 32100 45300 57300 64800 67000 100000 125000 166000
 $ survived            : int  7 4 3 2 3 5 0 1 0
 $ died                : int  1 4 5 6 5 3 8 7 8

Example: Logistic regression

From wide table to long

str(anthrax_long)
'data.frame':   72 obs. of  2 variables:
 $ anthraxConcentration: int  29300 32100 32100 32100 32100 45300 45300 45300 45300 45300 ...
 $ mortality           : num  0 0 0 0 0 0 0 0 0 0 ...

Example: Logistic regression

Show data

plot(jitter(mortality, 
            factor = 0.1) ~ 
     jitter(anthraxConcentration, 
            factor = 6), 
     data = anthrax_long, 
     pch=16, 
     cex=1.5, 
     cex.lab=1.5, 
     col="firebrick", 
     xlab="Anthrax concentration (spores/l)", 
     ylab="Mortality")

Example: Logistic regression

Show data

plot of chunk unnamed-chunk-24

Example: Logistic regression

Generalized linear models

anthraxGlm <- glm(mortality ~ anthraxConcentration, 
                  data = anthrax_long, 
                  family = binomial(link = logit))

glm stands for generalized linear model.

Notice the family = binomial(link = logit) argument.

Example: Logistic regression

Add regression curve

# Plot data
plot(jitter(mortality, factor = 0.1) ~ jitter(anthraxConcentration, factor = 6), data = anthrax_long, pch=16, cex=1.5, cex.lab=1.5, col="firebrick", xlab="Anthrax concentration (spores/l)", ylab="Mortality")

# Plot fit
xpts <- seq(min(anthrax_long$anthraxConcentration), 
            max(anthrax_long$anthraxConcentration), 
            length.out = 100)
ypts <- predict(anthraxGlm, 
                newdata = data.frame(anthraxConcentration = xpts), 
                type = "response")
lines(xpts, ypts)

Example: Logistic regression

Add regression curve

plot of chunk unnamed-chunk-27

Example: Logistic regression

Table of parameter estimates

summary(anthraxGlm)

Example: Logistic regression

Table of parameter estimates


Call:
glm(formula = mortality ~ anthraxConcentration, family = binomial(link = logit), 
    data = anthrax_long)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4750  -0.9292  -0.3420   0.9161   2.3951  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)   
(Intercept)           1.745e+00  6.921e-01   2.521  0.01171 * 
anthraxConcentration -3.643e-05  1.119e-05  -3.255  0.00113 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 92.982  on 71  degrees of freedom
Residual deviance: 73.962  on 70  degrees of freedom
AIC: 77.962

Number of Fisher Scoring iterations: 5