Lecture 3 - Correlation and linear regression

Author

Jan C Greyling

Getting started

Go to the Teams site, download the possum data and load it into R.

Code
remove(list=ls()) #clear all from memory
require("data.table")
require("ggplot2")
require("stargazer")


dat <- fread("Data/possum.csv")

Figure 1: A possum!

Correlation between body and tail length

Using ggplot, plot the relationship between tail length and total length.

  1. plot both male and female animals simultaneously
  2. add a linear trend without a confidence interval
  3. change its colour to red
  4. make the line dotted
  5. correct the axis titles
  6. add a black and white theme
  7. fix the facet names

From the figure, there is a positive correlation between body and tail length, but is it different between male and female animals? Are you basing your answer on the fact that the slopes of the respective lines are different? That’s a start, but let’s quantify it.

Figure 2: Pearsons correlation coefficient values

To test if there is a statistical difference, you will need to calculate the Pearson correlation coefficient for each group. The Pearson correlation coefficient is defined as follows:

\[\ P_{XY} = \frac{cov(X,Y)}{\sigma_X\sigma_Y} \]

where variance is defined as \[\ Var(X) = E[(X-\mu)^2] = E[(X-\overline{X})^2]\]

and standard deviation is defined as \[\ \sigma_X = \sqrt{Var(X)}\]

where covariance is defined as \[\ Cov(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[(X-\overline{X})(Y-\overline{Y})]\]

mu(\(\mu\)) is the average, mean or expected value defined as \[\overline{X} = \mu_X = \frac{1}{n} \sum_{i=1}^{n}X_i = \sum_{i=1}^{n}X_ip_i\]

Now we can calculate the Pearson correlation coefficient between body and tail length of male and female possums using the cor function. Are they the same? How should we interpret the values?

Code
dat[, list(pearson = cor(x = tailL, y = totalL, method = c("pearson"))), by = .(sex)]
   sex   pearson
1:   m 0.5524124
2:   f 0.5921158

Predicting tail lengths

Models

To predict the tail length of male and female possums, we have to construct a regression model.

Code
m.female <- dat[sex=="f", lm(totalL ~ tailL)]
summary(m.female)

Call:
lm(formula = totalL ~ tailL)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7076 -2.0154 -0.1758  2.7346  6.2635 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.7190    10.6800   3.532  0.00104 ** 
tailL         1.3526     0.2875   4.705 2.88e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.411 on 41 degrees of freedom
Multiple R-squared:  0.3506,    Adjusted R-squared:  0.3348 
F-statistic: 22.14 on 1 and 41 DF,  p-value: 2.883e-05

How good is the fit of this model? The adjusted R-squared is 0.33, meaning that tail length explains 33% of the variation of total body length.

How do you interpret the coefficients of the female model’s intercept and tail length values? The average body length of the possums is 37.72cm with the predicted body length, and total body length increases by 1.35 cm per 1 additional cm of tail length measured.

In other words total body length = 37.72 + 1.35(tail length).

I wonder if adding head length will improve the model.

Code
m.female1 <- dat[sex=="f", lm(totalL ~ tailL + headL)]
summary(m.female1)

Call:
lm(formula = totalL ~ tailL + headL)

Residuals:
   Min     1Q Median     3Q    Max 
-5.644 -1.264  0.010  1.635  4.067 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40.2713    12.8030  -3.145  0.00313 ** 
tailL         0.8271     0.2036   4.063  0.00022 ***
headL         1.0580     0.1447   7.309 6.99e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.26 on 40 degrees of freedom
Multiple R-squared:  0.7219,    Adjusted R-squared:  0.708 
F-statistic: 51.93 on 2 and 40 DF,  p-value: 7.629e-12

Yes! The adjusted R-squared jumps to 0.708! It makes one wonder how easy it is to know where the head ends and the body begins…

I wonder if controlling for skull width would improve the model.

Code
m.female2 <- dat[sex=="f", lm(totalL ~ tailL + skullW)]
summary(m.female2)

Call:
lm(formula = totalL ~ tailL + skullW)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6925 -2.1581 -0.6501  2.5832  6.3612 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.0651    12.4920   1.286 0.205827    
tailL         1.0594     0.2852   3.714 0.000622 ***
skullW        0.5749     0.2033   2.828 0.007279 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.153 on 40 degrees of freedom
Multiple R-squared:  0.4588,    Adjusted R-squared:  0.4318 
F-statistic: 16.96 on 2 and 40 DF,  p-value: 4.643e-06

OK, so skull width is statistically significant thus, we can conclude that the relationship between total length and skull width is caused by something other than chance. However, given the smaller R-squared value, skull width is a poorer predictor of body length than head length. Now lets combine all the variables.

Code
m.female3 <- dat[sex=="f", lm(totalL ~ tailL + headL + skullW)]
summary(m.female3)

Call:
lm(formula = totalL ~ tailL + headL + skullW)

Residuals:
   Min     1Q Median     3Q    Max 
-5.735 -1.253  0.030  1.688  3.953 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40.33275   12.94871  -3.115 0.003444 ** 
tailL         0.81203    0.21070   3.854 0.000423 ***
headL         1.02962    0.16898   6.093 3.85e-07 ***
skullW        0.05709    0.17011   0.336 0.738947    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.285 on 39 degrees of freedom
Multiple R-squared:  0.7228,    Adjusted R-squared:  0.7014 
F-statistic: 33.89 on 3 and 39 DF,  p-value: 5.959e-11

O, now skull width is not statistically significant and the R-squared is not much better than the model with only head length. When you have coefficients that “flip” like this, a possible cause is multicollinearity, a statistical phenomenom where independent variables are correlated to each other. Read more about multicollinearity here.

Multicollinearity

Do you think there is a relationship between head length and width? Let’s test for it; you can read more here.

Code
library(GGally)
ggpairs(dat[, .(tailL,headL,skullW)])

Good, it is clear that head length and skull width is strongly correlated (0.71) - hence we should not use it for our prediction. The best model thus is m.female1.

Predicting body length

To predict body length, you have two options: create a function to calculate it or use the predict predict function. Let’s start by defining a function which you could also use in the case of male possums. To do so we will have to extract the coefficients directly from the regression model.

Pull the coefficients out, we can use the coefficient function.

Code
coef(m.female1)
(Intercept)       tailL       headL 
-40.2712950   0.8270772   1.0579608 

Note that it prints a list of all coefficients; the get a specific one we have to state its position. For example

Code
coef(m.female1)[1]
(Intercept) 
   -40.2713 

Now we can define a function to predict body length and apply it to the female subset of the data.

Code
body.pred <- function(model,x1,x2){
  
  coef(model)[1] + coef(model)[2]*x1 + coef(model)[3]*x2
}

dat[sex == "f", bpred.1 := body.pred(m.female1,tailL,headL)]

head(dat)
   site pop sex age headL skullW totalL tailL  bpred.1
1:    1 Vic   m   8  94.1   60.4   89.0  36.0       NA
2:    1 Vic   f   6  92.5   57.6   91.5  36.5 87.77840
3:    1 Vic   f   6  94.0   60.0   95.5  39.0 91.43303
4:    1 Vic   f   6  93.2   57.1   92.0  38.0 89.75959
5:    1 Vic   f   2  91.5   56.3   85.5  36.0 86.30690
6:    1 Vic   f   1  93.1   54.8   90.5  35.5 87.58610

The easy way to predict body length it to use the predict function. We can also calculate the difference between the actual and predicted values.

Code
dat[sex == "f", bpred.2 := predict(m.female1)]
head(dat)
   site pop sex age headL skullW totalL tailL  bpred.1  bpred.2
1:    1 Vic   m   8  94.1   60.4   89.0  36.0       NA       NA
2:    1 Vic   f   6  92.5   57.6   91.5  36.5 87.77840 87.77840
3:    1 Vic   f   6  94.0   60.0   95.5  39.0 91.43303 91.43303
4:    1 Vic   f   6  93.2   57.1   92.0  38.0 89.75959 89.75959
5:    1 Vic   f   2  91.5   56.3   85.5  36.0 86.30690 86.30690
6:    1 Vic   f   1  93.1   54.8   90.5  35.5 87.58610 87.58610

Let’s test to see if my function and the predict function yield the same results:

Code
dat[!is.na(bpred.1), bpred.1==bpred.2]
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Yes!

Let’s plot the actuals vs the predicted

Code
ggplot(dat,                                     # Draw plot using ggplot2 package
       aes(x = bpred.1,
           y = totalL)) +
  geom_point() +
  geom_smooth(method = lm,se = F,color="red") +
  labs(x="Predicted",y="Observed")

Mmm, is it good that the actuals and predicted values are correlated? To read more about linear regression and diagnostics, click here.

Code
par(mfrow = c(2, 2))
plot(m.female1)

What do you think?

Summarising regression results

You can use the stargazer package for summarising your results.

Code
library("stargazer")

stargazer(m.female, m.female1,m.female2,m.female3, type = "html", 
          title            = "Predicting body length",
          covariate.labels = c("Tail length","Head length","Age","Intercept"),
          dep.var.caption  = "Model",
          column.labels = c("Model 1", "Model 2","Model 3","Model 4"),
          dep.var.labels   = "Total body length",
           column.sep.width = "+15pt",
           digits = 2,
           font.size = "small",
          align = TRUE,
          header = FALSE)
Predicting body length
Model
Total body length
Model 1 Model 2 Model 3 Model 4
(1) (2) (3) (4)
Tail length 1.35*** 0.83*** 1.06*** 0.81***
(0.29) (0.20) (0.29) (0.21)
Head length 1.06*** 1.03***
(0.14) (0.17)
Age 0.57*** 0.06
(0.20) (0.17)
Intercept 37.72*** -40.27*** 16.07 -40.33***
(10.68) (12.80) (12.49) (12.95)
Observations 43 43 43 43
R2 0.35 0.72 0.46 0.72
Adjusted R2 0.33 0.71 0.43 0.70
Residual Std. Error 3.41 (df = 41) 2.26 (df = 40) 3.15 (df = 40) 2.29 (df = 39)
F Statistic 22.14*** (df = 1; 41) 51.93*** (df = 2; 40) 16.96*** (df = 2; 40) 33.89*** (df = 3; 39)
Note: p<0.1; p<0.05; p<0.01

Assignment

Before next week’s class, please answer the following questions using a new script and email the script to me.

  1. construct a model wherein you estimate a model for the length of male possums using their tail and head length

  2. Is tail length a better or worse indicator of possum body length in male compared to female possums?

  3. Predict the male possum’s body length using a function defined by you (thus, you may not use the predict function).

  4. calculate the difference between the predicted and actual body lengths and plot these on an appropriate scatter plot.