Code
remove(list=ls()) #clear all from memory
require("data.table")
require("ggplot2")
require("stargazer")
dat <- fread("Data/possum.csv")Go to the Teams site, download the possum data and load it into R.
remove(list=ls()) #clear all from memory
require("data.table")
require("ggplot2")
require("stargazer")
dat <- fread("Data/possum.csv")Using ggplot, plot the relationship between tail length and total length.
From the figure, there is a positive correlation between body and tail length, but is it different between male and female animals? Are you basing your answer on the fact that the slopes of the respective lines are different? That’s a start, but let’s quantify it.
To test if there is a statistical difference, you will need to calculate the Pearson correlation coefficient for each group. The Pearson correlation coefficient is defined as follows:
\[\ P_{XY} = \frac{cov(X,Y)}{\sigma_X\sigma_Y} \]
where variance is defined as \[\ Var(X) = E[(X-\mu)^2] = E[(X-\overline{X})^2]\]
and standard deviation is defined as \[\ \sigma_X = \sqrt{Var(X)}\]
where covariance is defined as \[\ Cov(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[(X-\overline{X})(Y-\overline{Y})]\]
mu(\(\mu\)) is the average, mean or expected value defined as \[\overline{X} = \mu_X = \frac{1}{n} \sum_{i=1}^{n}X_i = \sum_{i=1}^{n}X_ip_i\]
Now we can calculate the Pearson correlation coefficient between body and tail length of male and female possums using the cor function. Are they the same? How should we interpret the values?
dat[, list(pearson = cor(x = tailL, y = totalL, method = c("pearson"))), by = .(sex)] sex pearson
1: m 0.5524124
2: f 0.5921158
To predict the tail length of male and female possums, we have to construct a regression model.
m.female <- dat[sex=="f", lm(totalL ~ tailL)]
summary(m.female)
Call:
lm(formula = totalL ~ tailL)
Residuals:
Min 1Q Median 3Q Max
-8.7076 -2.0154 -0.1758 2.7346 6.2635
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.7190 10.6800 3.532 0.00104 **
tailL 1.3526 0.2875 4.705 2.88e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.411 on 41 degrees of freedom
Multiple R-squared: 0.3506, Adjusted R-squared: 0.3348
F-statistic: 22.14 on 1 and 41 DF, p-value: 2.883e-05
How good is the fit of this model? The adjusted R-squared is 0.33, meaning that tail length explains 33% of the variation of total body length.
How do you interpret the coefficients of the female model’s intercept and tail length values? The average body length of the possums is 37.72cm with the predicted body length, and total body length increases by 1.35 cm per 1 additional cm of tail length measured.
In other words total body length = 37.72 + 1.35(tail length).
I wonder if adding head length will improve the model.
m.female1 <- dat[sex=="f", lm(totalL ~ tailL + headL)]
summary(m.female1)
Call:
lm(formula = totalL ~ tailL + headL)
Residuals:
Min 1Q Median 3Q Max
-5.644 -1.264 0.010 1.635 4.067
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40.2713 12.8030 -3.145 0.00313 **
tailL 0.8271 0.2036 4.063 0.00022 ***
headL 1.0580 0.1447 7.309 6.99e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.26 on 40 degrees of freedom
Multiple R-squared: 0.7219, Adjusted R-squared: 0.708
F-statistic: 51.93 on 2 and 40 DF, p-value: 7.629e-12
Yes! The adjusted R-squared jumps to 0.708! It makes one wonder how easy it is to know where the head ends and the body begins…
I wonder if controlling for skull width would improve the model.
m.female2 <- dat[sex=="f", lm(totalL ~ tailL + skullW)]
summary(m.female2)
Call:
lm(formula = totalL ~ tailL + skullW)
Residuals:
Min 1Q Median 3Q Max
-6.6925 -2.1581 -0.6501 2.5832 6.3612
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.0651 12.4920 1.286 0.205827
tailL 1.0594 0.2852 3.714 0.000622 ***
skullW 0.5749 0.2033 2.828 0.007279 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.153 on 40 degrees of freedom
Multiple R-squared: 0.4588, Adjusted R-squared: 0.4318
F-statistic: 16.96 on 2 and 40 DF, p-value: 4.643e-06
OK, so skull width is statistically significant thus, we can conclude that the relationship between total length and skull width is caused by something other than chance. However, given the smaller R-squared value, skull width is a poorer predictor of body length than head length. Now lets combine all the variables.
m.female3 <- dat[sex=="f", lm(totalL ~ tailL + headL + skullW)]
summary(m.female3)
Call:
lm(formula = totalL ~ tailL + headL + skullW)
Residuals:
Min 1Q Median 3Q Max
-5.735 -1.253 0.030 1.688 3.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40.33275 12.94871 -3.115 0.003444 **
tailL 0.81203 0.21070 3.854 0.000423 ***
headL 1.02962 0.16898 6.093 3.85e-07 ***
skullW 0.05709 0.17011 0.336 0.738947
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.285 on 39 degrees of freedom
Multiple R-squared: 0.7228, Adjusted R-squared: 0.7014
F-statistic: 33.89 on 3 and 39 DF, p-value: 5.959e-11
O, now skull width is not statistically significant and the R-squared is not much better than the model with only head length. When you have coefficients that “flip” like this, a possible cause is multicollinearity, a statistical phenomenom where independent variables are correlated to each other. Read more about multicollinearity here.
Do you think there is a relationship between head length and width? Let’s test for it; you can read more here.
library(GGally)
ggpairs(dat[, .(tailL,headL,skullW)])Good, it is clear that head length and skull width is strongly correlated (0.71) - hence we should not use it for our prediction. The best model thus is m.female1.
To predict body length, you have two options: create a function to calculate it or use the predict predict function. Let’s start by defining a function which you could also use in the case of male possums. To do so we will have to extract the coefficients directly from the regression model.
Pull the coefficients out, we can use the coefficient function.
coef(m.female1)(Intercept) tailL headL
-40.2712950 0.8270772 1.0579608
Note that it prints a list of all coefficients; the get a specific one we have to state its position. For example
coef(m.female1)[1](Intercept)
-40.2713
Now we can define a function to predict body length and apply it to the female subset of the data.
body.pred <- function(model,x1,x2){
coef(model)[1] + coef(model)[2]*x1 + coef(model)[3]*x2
}
dat[sex == "f", bpred.1 := body.pred(m.female1,tailL,headL)]
head(dat) site pop sex age headL skullW totalL tailL bpred.1
1: 1 Vic m 8 94.1 60.4 89.0 36.0 NA
2: 1 Vic f 6 92.5 57.6 91.5 36.5 87.77840
3: 1 Vic f 6 94.0 60.0 95.5 39.0 91.43303
4: 1 Vic f 6 93.2 57.1 92.0 38.0 89.75959
5: 1 Vic f 2 91.5 56.3 85.5 36.0 86.30690
6: 1 Vic f 1 93.1 54.8 90.5 35.5 87.58610
The easy way to predict body length it to use the predict function. We can also calculate the difference between the actual and predicted values.
dat[sex == "f", bpred.2 := predict(m.female1)]
head(dat) site pop sex age headL skullW totalL tailL bpred.1 bpred.2
1: 1 Vic m 8 94.1 60.4 89.0 36.0 NA NA
2: 1 Vic f 6 92.5 57.6 91.5 36.5 87.77840 87.77840
3: 1 Vic f 6 94.0 60.0 95.5 39.0 91.43303 91.43303
4: 1 Vic f 6 93.2 57.1 92.0 38.0 89.75959 89.75959
5: 1 Vic f 2 91.5 56.3 85.5 36.0 86.30690 86.30690
6: 1 Vic f 1 93.1 54.8 90.5 35.5 87.58610 87.58610
Let’s test to see if my function and the predict function yield the same results:
dat[!is.na(bpred.1), bpred.1==bpred.2] [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Yes!
Let’s plot the actuals vs the predicted
ggplot(dat, # Draw plot using ggplot2 package
aes(x = bpred.1,
y = totalL)) +
geom_point() +
geom_smooth(method = lm,se = F,color="red") +
labs(x="Predicted",y="Observed")Mmm, is it good that the actuals and predicted values are correlated? To read more about linear regression and diagnostics, click here.
par(mfrow = c(2, 2))
plot(m.female1)What do you think?
You can use the stargazer package for summarising your results.
library("stargazer")
stargazer(m.female, m.female1,m.female2,m.female3, type = "html",
title = "Predicting body length",
covariate.labels = c("Tail length","Head length","Age","Intercept"),
dep.var.caption = "Model",
column.labels = c("Model 1", "Model 2","Model 3","Model 4"),
dep.var.labels = "Total body length",
column.sep.width = "+15pt",
digits = 2,
font.size = "small",
align = TRUE,
header = FALSE)| Model | ||||
| Total body length | ||||
| Model 1 | Model 2 | Model 3 | Model 4 | |
| (1) | (2) | (3) | (4) | |
| Tail length | 1.35*** | 0.83*** | 1.06*** | 0.81*** |
| (0.29) | (0.20) | (0.29) | (0.21) | |
| Head length | 1.06*** | 1.03*** | ||
| (0.14) | (0.17) | |||
| Age | 0.57*** | 0.06 | ||
| (0.20) | (0.17) | |||
| Intercept | 37.72*** | -40.27*** | 16.07 | -40.33*** |
| (10.68) | (12.80) | (12.49) | (12.95) | |
| Observations | 43 | 43 | 43 | 43 |
| R2 | 0.35 | 0.72 | 0.46 | 0.72 |
| Adjusted R2 | 0.33 | 0.71 | 0.43 | 0.70 |
| Residual Std. Error | 3.41 (df = 41) | 2.26 (df = 40) | 3.15 (df = 40) | 2.29 (df = 39) |
| F Statistic | 22.14*** (df = 1; 41) | 51.93*** (df = 2; 40) | 16.96*** (df = 2; 40) | 33.89*** (df = 3; 39) |
| Note: | p<0.1; p<0.05; p<0.01 | |||
Before next week’s class, please answer the following questions using a new script and email the script to me.
construct a model wherein you estimate a model for the length of male possums using their tail and head length
Is tail length a better or worse indicator of possum body length in male compared to female possums?
Predict the male possum’s body length using a function defined by you (thus, you may not use the predict function).
calculate the difference between the predicted and actual body lengths and plot these on an appropriate scatter plot.