Go to SUNLearn, download the possum data and load it into R.
## Loading required package: data.table
## Loading required package: ggplot2
Figure 1: A possum!
Using ggplot, plot the relationship between tail length and total length
From the figure it is clear that there is a positive correlation between body and tail length, but is it different between male and female animals? Are you basing your answer on the fact that the slopes of the respective lines are different? If so, you are wrong.
Figure 2: Pearsons correlation coefficient values
To test if there is a difference you will need to calculate the Pearson correlation coefficient for each group.Recall that the Pearson correlation coefficient is defined as follow:
\[\ P_{XY} = \frac{cov(X,Y)}{\sigma_X\sigma_Y} \]
where variance is defined as \[\ Var(X) = E[(X-\mu)^2] = E[(X-\overline{X})^2]\]
and standard deviation is defined as \[\ \sigma_X = \sqrt{Var(X)}\]
where covariance is defined as \[\ Cov(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[(X-\overline{X})(Y-\overline{Y})]\]
mu(\(\mu\)) is the average, mean or expected value defined as \[\overline{X} = \mu_X = \frac{1}{n} \sum_{i=1}^{n}X_i = \sum_{i=1}^{n}X_ip_i\]
Now we can calculate the Pearson correlation coefficient between body and tail length of male and female possums using the cor function.
## sex pearson
## 1: m 0.5524124
## 2: f 0.5921158
To predict the tail length of male and female possums we have to construct a regression model.
m.female <- dat[sex=="f", lm(totalL ~ tailL)]
summary(m.female)
##
## Call:
## lm(formula = totalL ~ tailL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7076 -2.0154 -0.1758 2.7346 6.2635
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.7190 10.6800 3.532 0.00104 **
## tailL 1.3526 0.2875 4.705 2.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.411 on 41 degrees of freedom
## Multiple R-squared: 0.3506, Adjusted R-squared: 0.3348
## F-statistic: 22.14 on 1 and 41 DF, p-value: 2.883e-05
How good is the fit of this model? The adjusted R-squared is 0.33 which means that tail length explains 33% of the variation of total body length.
How do you interpret the coefficients of the intercept and tail length values of the female model? The average body length of the possums is 37.72cm with the predicted body length and total body length increases by 1.35 cm per 1 additional cm of tail length measured.
In other words total body length = 37.72 + 1.35(tail length).
What is the difference between the actual and predicted values for body length? To calculate this you have two options, you can either create a function to calculate it or you can use the predict predict function. Lets start by defining a formula which you would also be able to use in the case of male possums, to do so we will have to pull the coefficients directly from the model.
Pull the coefficients out we can use the coefficient function
coef(m.female)
## (Intercept) tailL
## 37.719006 1.352606
Note that it prints a list of all coefficients, the get a specific one we have to state its position
coef(m.female)[1]
## (Intercept)
## 37.71901
Now we can define a function to predict body length and apply it to the female subset of the data.
body.pred <- function(model,x){
coef(model)[1] + coef(model)[2]*x
}
dat[sex == "f", bpred.1 := body.pred(m.female,tailL)]
head(dat)
## site pop sex age headL skullW totalL tailL bpred.1
## 1: 1 Vic m 8 94.1 60.4 89.0 36.0 NA
## 2: 1 Vic f 6 92.5 57.6 91.5 36.5 87.08912
## 3: 1 Vic f 6 94.0 60.0 95.5 39.0 90.47064
## 4: 1 Vic f 6 93.2 57.1 92.0 38.0 89.11803
## 5: 1 Vic f 2 91.5 56.3 85.5 36.0 86.41282
## 6: 1 Vic f 1 93.1 54.8 90.5 35.5 85.73652
The easy way to predict body length it to use the predict function. We can also calculate the difference between the actual and predicted values.
dat[sex == "f", bpred.2 := predict(m.female)]
head(dat)
## site pop sex age headL skullW totalL tailL bpred.1 bpred.2
## 1: 1 Vic m 8 94.1 60.4 89.0 36.0 NA NA
## 2: 1 Vic f 6 92.5 57.6 91.5 36.5 87.08912 87.08912
## 3: 1 Vic f 6 94.0 60.0 95.5 39.0 90.47064 90.47064
## 4: 1 Vic f 6 93.2 57.1 92.0 38.0 89.11803 89.11803
## 5: 1 Vic f 2 91.5 56.3 85.5 36.0 86.41282 86.41282
## 6: 1 Vic f 1 93.1 54.8 90.5 35.5 85.73652 85.73652
dat[, diff := totalL-bpred.2]
head(dat)
## site pop sex age headL skullW totalL tailL bpred.1 bpred.2 diff
## 1: 1 Vic m 8 94.1 60.4 89.0 36.0 NA NA NA
## 2: 1 Vic f 6 92.5 57.6 91.5 36.5 87.08912 87.08912 4.410878
## 3: 1 Vic f 6 94.0 60.0 95.5 39.0 90.47064 90.47064 5.029363
## 4: 1 Vic f 6 93.2 57.1 92.0 38.0 89.11803 89.11803 2.881969
## 5: 1 Vic f 2 91.5 56.3 85.5 36.0 86.41282 86.41282 -0.912819
## 6: 1 Vic f 1 93.1 54.8 90.5 35.5 85.73652 85.73652 4.763484
I wonder if adding head length will improve the model?
m.female1 <- dat[sex=="f", lm(totalL ~ tailL + headL)]
summary(m.female1)
##
## Call:
## lm(formula = totalL ~ tailL + headL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.644 -1.264 0.010 1.635 4.067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.2713 12.8030 -3.145 0.00313 **
## tailL 0.8271 0.2036 4.063 0.00022 ***
## headL 1.0580 0.1447 7.309 6.99e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.26 on 40 degrees of freedom
## Multiple R-squared: 0.7219, Adjusted R-squared: 0.708
## F-statistic: 51.93 on 2 and 40 DF, p-value: 7.629e-12
Yes! The adjusted R-squared jumps to 0.708! Makes one wonder how easy it is to know where the head ends and the body begins…
I wonder if controlling for age would improve the model?
m.female2 <- dat[sex=="f", lm(totalL ~ tailL + headL + age)]
summary(m.female2)
##
## Call:
## lm(formula = totalL ~ tailL + headL + age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.836 -1.436 -0.027 1.659 3.776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.8196 13.2424 -2.705 0.010076 *
## tailL 0.7649 0.2087 3.665 0.000735 ***
## headL 1.0246 0.1465 6.995 2.17e-08 ***
## age 0.2328 0.1913 1.217 0.231046
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.246 on 39 degrees of freedom
## Multiple R-squared: 0.7321, Adjusted R-squared: 0.7115
## F-statistic: 35.53 on 3 and 39 DF, p-value: 3.067e-11
Nope, age is not statistically significant thus even though its coefficient is positive as expected we cannot prove statistically that it has impact of body length if we control for both head and tail length. Note that the adjusted R-squared did improve a little bit by including the age of the possums. Also note tht age would become statistically significant if it is the only varialbe included in the model.
m.female3 <- dat[sex=="f", lm(totalL ~ age)]
summary(m.female3)
##
## Call:
## lm(formula = totalL ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4090 -2.4178 -0.2657 2.1539 9.4127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.5699 1.3653 61.944 < 2e-16 ***
## age 0.8392 0.3091 2.715 0.00965 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.897 on 41 degrees of freedom
## Multiple R-squared: 0.1524, Adjusted R-squared: 0.1317
## F-statistic: 7.371 on 1 and 41 DF, p-value: 0.009654
Note that you can use the stargazer package to summarise your results.
require("stargazer")
## Loading required package: stargazer
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(m.female, m.female1,m.female2,m.female3, type = "html",
title = "Predicting possum length",
covariate.labels = c("Intercept",
"Tail length (cm)",
"Head length (cm)",
"Age (years)"),
dep.var.caption = "Possum length (cm)",
column.labels = c("Model 1", "Model 2","Model 3","Model 4"),
dep.var.labels = "Relative yield",
header = FALSE)
##
## <table style="text-align:center"><caption><strong>Predicting possum length</strong></caption>
## <tr><td colspan="5" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="4">Possum length (cm)</td></tr>
## <tr><td></td><td colspan="4" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td colspan="4">Relative yield</td></tr>
## <tr><td style="text-align:left"></td><td>Model 1</td><td>Model 2</td><td>Model 3</td><td>Model 4</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td><td>(4)</td></tr>
## <tr><td colspan="5" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Intercept</td><td>1.353<sup>***</sup></td><td>0.827<sup>***</sup></td><td>0.765<sup>***</sup></td><td></td></tr>
## <tr><td style="text-align:left"></td><td>(0.287)</td><td>(0.204)</td><td>(0.209)</td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Tail length (cm)</td><td></td><td>1.058<sup>***</sup></td><td>1.025<sup>***</sup></td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.145)</td><td>(0.146)</td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Head length (cm)</td><td></td><td></td><td>0.233</td><td>0.839<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.191)</td><td>(0.309)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Age (years)</td><td>37.719<sup>***</sup></td><td>-40.271<sup>***</sup></td><td>-35.820<sup>**</sup></td><td>84.570<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(10.680)</td><td>(12.803)</td><td>(13.242)</td><td>(1.365)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td><td></td></tr>
## <tr><td colspan="5" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>43</td><td>43</td><td>43</td><td>43</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.351</td><td>0.722</td><td>0.732</td><td>0.152</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.335</td><td>0.708</td><td>0.712</td><td>0.132</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>3.411 (df = 41)</td><td>2.260 (df = 40)</td><td>2.246 (df = 39)</td><td>3.897 (df = 41)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>22.135<sup>***</sup> (df = 1; 41)</td><td>51.929<sup>***</sup> (df = 2; 40)</td><td>35.529<sup>***</sup> (df = 3; 39)</td><td>7.371<sup>***</sup> (df = 1; 41)</td></tr>
## <tr><td colspan="5" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="4" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
Before next weeks class, answer the following questions using a new script and email the script to me.
construct a model wherein you estimate a model for length of male possums using their tail and head length
Is tail length a better or worse indicator of possom body lenght in male compared to female possums?
Predict the male possum body length using a function defined by you (thus you may not use the predict function).
calculate the difference between the predicted and actual body lengths and plot these on an appropriate scatter plot.