Getting started

Go to SUNLearn, download the possum data and load it into R.

## Loading required package: data.table

## Loading required package: ggplot2

Figure 1: A possum!

Correlation between body and tail length

Using ggplot, plot the relationship between tail length and total length

plot both male and female animals simultaneously
add a linear trend without a confidence interval
change its colour to red
make the line dotted
correct the axis titles
add a black and white theme
fix the facet names

From the figure it is clear that there is a positive correlation between body and tail length, but is it different between male and female animals? Are you basing your answer on the fact that the slopes of the respective lines are different? If so, you are wrong.

Figure 2: Pearsons correlation coefficient values

To test if there is a difference you will need to calculate the Pearson correlation coefficient for each group.Recall that the Pearson correlation coefficient is defined as follow:

\[\ P_{XY} = \frac{cov(X,Y)}{\sigma_X\sigma_Y} \]

where variance is defined as \[\ Var(X) = E[(X-\mu)^2] = E[(X-\overline{X})^2]\]

and standard deviation is defined as \[\ \sigma_X = \sqrt{Var(X)}\]

where covariance is defined as \[\ Cov(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[(X-\overline{X})(Y-\overline{Y})]\]

mu(\(\mu\)) is the average, mean or expected value defined as \[\overline{X} = \mu_X = \frac{1}{n} \sum_{i=1}^{n}X_i = \sum_{i=1}^{n}X_ip_i\]

Now we can calculate the Pearson correlation coefficient between body and tail length of male and female possums using the cor function.

##    sex   pearson
## 1:   m 0.5524124
## 2:   f 0.5921158

Predicting tail lengths

To predict the tail length of male and female possums we have to construct a regression model.

m.female <- dat[sex=="f", lm(totalL ~ tailL)]
summary(m.female)

## 
## Call:
## lm(formula = totalL ~ tailL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7076 -2.0154 -0.1758  2.7346  6.2635 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.7190    10.6800   3.532  0.00104 ** 
## tailL         1.3526     0.2875   4.705 2.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.411 on 41 degrees of freedom
## Multiple R-squared:  0.3506, Adjusted R-squared:  0.3348 
## F-statistic: 22.14 on 1 and 41 DF,  p-value: 2.883e-05

How good is the fit of this model? The adjusted R-squared is 0.33 which means that tail length explains 33% of the variation of total body length.

How do you interpret the coefficients of the intercept and tail length values of the female model? The average body length of the possums is 37.72cm with the predicted body length and total body length increases by 1.35 cm per 1 additional cm of tail length measured.

In other words total body length = 37.72 + 1.35(tail length).

What is the difference between the actual and predicted values for body length? To calculate this you have two options, you can either create a function to calculate it or you can use the predict predict function. Lets start by defining a formula which you would also be able to use in the case of male possums, to do so we will have to pull the coefficients directly from the model.

Pull the coefficients out we can use the coefficient function

coef(m.female)

## (Intercept)       tailL 
##   37.719006    1.352606

Note that it prints a list of all coefficients, the get a specific one we have to state its position

coef(m.female)[1]

## (Intercept) 
##    37.71901

Now we can define a function to predict body length and apply it to the female subset of the data.

body.pred <- function(model,x){
  
  coef(model)[1] + coef(model)[2]*x
}

dat[sex == "f", bpred.1 := body.pred(m.female,tailL)]

head(dat)

##    site pop sex age headL skullW totalL tailL  bpred.1
## 1:    1 Vic   m   8  94.1   60.4   89.0  36.0       NA
## 2:    1 Vic   f   6  92.5   57.6   91.5  36.5 87.08912
## 3:    1 Vic   f   6  94.0   60.0   95.5  39.0 90.47064
## 4:    1 Vic   f   6  93.2   57.1   92.0  38.0 89.11803
## 5:    1 Vic   f   2  91.5   56.3   85.5  36.0 86.41282
## 6:    1 Vic   f   1  93.1   54.8   90.5  35.5 85.73652

The easy way to predict body length it to use the predict function. We can also calculate the difference between the actual and predicted values.

dat[sex == "f", bpred.2 := predict(m.female)]
head(dat)

##    site pop sex age headL skullW totalL tailL  bpred.1  bpred.2
## 1:    1 Vic   m   8  94.1   60.4   89.0  36.0       NA       NA
## 2:    1 Vic   f   6  92.5   57.6   91.5  36.5 87.08912 87.08912
## 3:    1 Vic   f   6  94.0   60.0   95.5  39.0 90.47064 90.47064
## 4:    1 Vic   f   6  93.2   57.1   92.0  38.0 89.11803 89.11803
## 5:    1 Vic   f   2  91.5   56.3   85.5  36.0 86.41282 86.41282
## 6:    1 Vic   f   1  93.1   54.8   90.5  35.5 85.73652 85.73652

dat[, diff := totalL-bpred.2]
head(dat)

##    site pop sex age headL skullW totalL tailL  bpred.1  bpred.2      diff
## 1:    1 Vic   m   8  94.1   60.4   89.0  36.0       NA       NA        NA
## 2:    1 Vic   f   6  92.5   57.6   91.5  36.5 87.08912 87.08912  4.410878
## 3:    1 Vic   f   6  94.0   60.0   95.5  39.0 90.47064 90.47064  5.029363
## 4:    1 Vic   f   6  93.2   57.1   92.0  38.0 89.11803 89.11803  2.881969
## 5:    1 Vic   f   2  91.5   56.3   85.5  36.0 86.41282 86.41282 -0.912819
## 6:    1 Vic   f   1  93.1   54.8   90.5  35.5 85.73652 85.73652  4.763484

I wonder if adding head length will improve the model?

m.female1 <- dat[sex=="f", lm(totalL ~ tailL + headL)]
summary(m.female1)

## 
## Call:
## lm(formula = totalL ~ tailL + headL)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.644 -1.264  0.010  1.635  4.067 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -40.2713    12.8030  -3.145  0.00313 ** 
## tailL         0.8271     0.2036   4.063  0.00022 ***
## headL         1.0580     0.1447   7.309 6.99e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.26 on 40 degrees of freedom
## Multiple R-squared:  0.7219, Adjusted R-squared:  0.708 
## F-statistic: 51.93 on 2 and 40 DF,  p-value: 7.629e-12

Yes! The adjusted R-squared jumps to 0.708! Makes one wonder how easy it is to know where the head ends and the body begins…

I wonder if controlling for age would improve the model?

m.female2 <- dat[sex=="f", lm(totalL ~ tailL + headL + age)]
summary(m.female2)

## 
## Call:
## lm(formula = totalL ~ tailL + headL + age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.836 -1.436 -0.027  1.659  3.776 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.8196    13.2424  -2.705 0.010076 *  
## tailL         0.7649     0.2087   3.665 0.000735 ***
## headL         1.0246     0.1465   6.995 2.17e-08 ***
## age           0.2328     0.1913   1.217 0.231046    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.246 on 39 degrees of freedom
## Multiple R-squared:  0.7321, Adjusted R-squared:  0.7115 
## F-statistic: 35.53 on 3 and 39 DF,  p-value: 3.067e-11

Nope, age is not statistically significant thus even though its coefficient is positive as expected we cannot prove statistically that it has impact of body length if we control for both head and tail length. Note that the adjusted R-squared did improve a little bit by including the age of the possums. Also note tht age would become statistically significant if it is the only varialbe included in the model.

m.female3 <- dat[sex=="f", lm(totalL ~ age)]
summary(m.female3)

## 
## Call:
## lm(formula = totalL ~ age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4090  -2.4178  -0.2657   2.1539   9.4127 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  84.5699     1.3653  61.944  < 2e-16 ***
## age           0.8392     0.3091   2.715  0.00965 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.897 on 41 degrees of freedom
## Multiple R-squared:  0.1524, Adjusted R-squared:  0.1317 
## F-statistic: 7.371 on 1 and 41 DF,  p-value: 0.009654

Note that you can use the stargazer package to summarise your results.

require("stargazer")

## Loading required package: stargazer

## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

stargazer(m.female, m.female1,m.female2,m.female3, type = "html", 
          title            = "Predicting possum length",
          covariate.labels = c("Intercept",
                               "Tail length (cm)",
                               "Head length (cm)",
                               "Age (years)"),
          dep.var.caption  = "Possum length (cm)",
          column.labels = c("Model 1", "Model 2","Model 3","Model 4"),
          dep.var.labels   = "Relative yield",
          header = FALSE)

## 
## <table style="text-align:center"><caption><strong>Predicting possum length</strong></caption>
## <tr><td colspan="5" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="4">Possum length (cm)</td></tr>
## <tr><td></td><td colspan="4" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td colspan="4">Relative yield</td></tr>
## <tr><td style="text-align:left"></td><td>Model 1</td><td>Model 2</td><td>Model 3</td><td>Model 4</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td><td>(4)</td></tr>
## <tr><td colspan="5" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Intercept</td><td>1.353<sup>***</sup></td><td>0.827<sup>***</sup></td><td>0.765<sup>***</sup></td><td></td></tr>
## <tr><td style="text-align:left"></td><td>(0.287)</td><td>(0.204)</td><td>(0.209)</td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Tail length (cm)</td><td></td><td>1.058<sup>***</sup></td><td>1.025<sup>***</sup></td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.145)</td><td>(0.146)</td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Head length (cm)</td><td></td><td></td><td>0.233</td><td>0.839<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.191)</td><td>(0.309)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Age (years)</td><td>37.719<sup>***</sup></td><td>-40.271<sup>***</sup></td><td>-35.820<sup>**</sup></td><td>84.570<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(10.680)</td><td>(12.803)</td><td>(13.242)</td><td>(1.365)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td><td></td></tr>
## <tr><td colspan="5" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>43</td><td>43</td><td>43</td><td>43</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.351</td><td>0.722</td><td>0.732</td><td>0.152</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.335</td><td>0.708</td><td>0.712</td><td>0.132</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>3.411 (df = 41)</td><td>2.260 (df = 40)</td><td>2.246 (df = 39)</td><td>3.897 (df = 41)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>22.135<sup>***</sup> (df = 1; 41)</td><td>51.929<sup>***</sup> (df = 2; 40)</td><td>35.529<sup>***</sup> (df = 3; 39)</td><td>7.371<sup>***</sup> (df = 1; 41)</td></tr>
## <tr><td colspan="5" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="4" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>

Assignment

Before next weeks class, answer the following questions using a new script and email the script to me.

construct a model wherein you estimate a model for length of male possums using their tail and head length
Is tail length a better or worse indicator of possom body lenght in male compared to female possums?
Predict the male possum body length using a function defined by you (thus you may not use the predict function).
calculate the difference between the predicted and actual body lengths and plot these on an appropriate scatter plot.

Lecture 3 - Correlation and linear regression

Jan C Greyling

Getting started

Correlation between body and tail length

Predicting tail lengths

Assignment