The variable ABUND is the density of birds in 56 forest patches. The explanatory variables are size of the forest patches (AREA), distance to the nearest forest patch (DIST), distance to the nearest larger forest patch (LDIST), year of isolation of the patch (YR.ISOL), agricultural grazing intensity at each patch (GRAZE) and altitude (ALT). The underlying aim of the research is to find a relationship between bird densities and the explanatory variables.
Inspect the file. What do we have?
names(Birds)
## [1] "Site" "ABUND" "AREA" "DIST" "LDIST" "YR.ISOL" "GRAZE"
## [8] "ALT"
str(Birds)
## 'data.frame': 56 obs. of 8 variables:
## $ Site : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ABUND : num 5.3 2 1.5 17.1 13.8 14.1 3.8 2.2 3.3 3 ...
## $ AREA : num 0.1 0.5 0.5 1 1 1 1 1 1 1 ...
## $ DIST : int 39 234 104 66 246 234 467 284 156 311 ...
## $ LDIST : int 39 234 311 66 246 285 467 1829 156 571 ...
## $ YR.ISOL: int 1968 1920 1900 1966 1918 1965 1955 1920 1965 1900 ...
## $ GRAZE : int 2 5 5 3 5 3 5 5 4 5 ...
## $ ALT : int 160 60 140 160 140 130 90 60 130 130 ...
#First some elementary R commands.
#1 How do you acces variables in an object like Birds?
Birds #All data
head(Birds) #First 6 rows
head(Birds, 10) #First 10 rows
Birds$ABUND #ABUND variable
Birds[,1] #First column
Birds[,2] #Second coloumn
Birds[,"ABUND"] #ABUND variable
Birds[1,"ABUND"] #First row of ABUND variable
Birds[1:10,"ABUND"] #First 10 rows of ABUND
c("ABUND", "AREA") #Two characters concatenated
Birds[, c("ABUND", "AREA")] #ABUND and AREA variables
MyVar <- c("ABUND", "AREA") #Same as last two steps
Birds[, MyVar]
qplot(x=as.factor(1), y=ABUND, data=Birds, geom=c("boxplot", "jitter"))
#A Outliers in X
c("AREA", "DIST", "LDIST", "YR.ISOL","ALT", "GRAZE" ) %>%
lapply(function(colName){
qplot(factor(1), Birds[,colName], geom="jitter")+xlab(colName)
}) %>%
do.call(grid.arrange, .)
#Apply transformations
Birds %<>% mutate(LOGAREA = log10(AREA))
Birds %<>% mutate(LOGDIST = log10(DIST))
Birds %<>% mutate(LOGLDIST = log10(LDIST))
qplot(ABUND, LOGAREA, data=Birds) + facet_wrap(~GRAZE) +
stat_smooth(method = "lm", formula = y ~ x, na.rm=T)
sum(Birds$ABUND == 0)
## [1] 0
100 * sum(Birds$ABUND == 0) / nrow(Birds)
## [1] 0
M1 = lm(ABUND ~ LOGAREA, data = Birds)
ABUND_i = alpha + beta * LOGAREA_i + eps_i
eps_i ~ N(0, sigma^2)
where i = 1..56
ABUND_i is the abundance at site i
E(ABUND_i) = mu_i = 10.40 + 9.77 * LOGAREA_i
eps ~ N(0, 7.28^2)
Summary results from model1
ABUND = alpha + beta * LOGAREA + eps
Dependent variable: | |
ABUND | |
LOGAREA | 9.778*** |
(1.209) | |
Constant | 10.401*** |
(1.489) | |
Observations | 56 |
R2 | 0.548 |
Adjusted R2 | 0.539 |
Residual Std. Error | 7.286 (df = 54) |
F Statistic | 65.377*** (df = 1; 54) |
Note: | p<0.1; p<0.05; p<0.01 |
H0: beta = 0
H1: beta <> 0
F_1,54 = 65.38 (p<0.001)
Or:
t_n-1 t_55 = 8.08 (p < 0.001)
Text in paper: A t-value indicated a significant effect (t = 8.08; df = 55, p < 0.001)
#ziggy: best to calculate by hand (lme or lmer sometimes include or doesnt include)
E1 <- resid(M1)
#Better: corrected by leverage
E1 <- rstandard(M1)
F1 <- fitted(M1)
plot(x = F1,
y = E1,
xlab = "Fitted values",
ylab = "Residuals",
main = "Homogeneity?")
abline(h = 0, v = 0, lty = 2)
Dependence due to model misfit. Check by plotting residuals versus covariates
But plot residuals also versus covariates NOT in the model!!!!!!!
#if you dont see a linear pattern, cannot do linear
plot(x = Birds$LOGAREA,
y = E1)
abline(0,0,lty=2)
boxplot(E1 ~ factor(Birds$GRAZE)) #here there is a grazing effect on residuals - means grazing should have been fitted in model
abline(h = 0, lty = 2)
Most are OK except YR.ISOL - there is a pattern there, so should be included in model. In fact, grazing and years since isolation were colinear (see Ex1)
#ggplot version
qplot(
x = YR.ISOL,
y = E1,
color = factor(GRAZE),
data=Birds
)+scale_color_brewer("Graze", type="qual")+xlab("Year")
and also for all the other covariates!!!
hist(E1, main = "Normality", breaks=10)
#Or qq-plot aka normality check
qqnorm(E1)
#qqline(E1)
range(Birds$LOGAREA)
## [1] -1.000000 3.248219
#Create a data frame that contains x numbers of LOGAREA values
MyData <- data.frame(LOGAREA = seq(from= -1,
to = 3.25,
length = 25))
P1 <- predict(M1, newdata = MyData)
plot(x = Birds$LOGAREA,
y = Birds$ABUND,
xlab = "Log transformed area",
ylab = "Bird abundance")
lines(x = MyData$LOGAREA,
y = P1,
lwd = 3,
lty = 1,
col = 1)
plotBare= ggplot(Birds, aes(x=ABUND, y=LOGAREA)) +
geom_point() +
geom_smooth(method="lm", formula = y ~ I(log(x)), na.rm=T) +
labs(
x = "Log transformed area",
y = "Bird abundance"
)
plotBare + coord_trans(x = "log10")
#Base plot Version
PP1 <- predict(M1, newdata = MyData, se = TRUE, interval = "confidence")
SeConf.Up <- PP1$fit[,3]
SeConf.Low <- PP1$fit[,2]
plot(x = Birds$LOGAREA,
y = Birds$ABUND)
lines(x = MyData$LOGAREA,
y = PP1$fit[,1],
lwd = 3,
lty = 1,
col = 1)
#Useful for later...when we have multiple lines
legend("topleft",
legend=c("Line 1"),
col = c(1),
lty = c(1),
lwd = c(1))
Now, model ABUND as a function of a categorical covariate (factor Graze)
M2 <- lm(ABUND ~ factor(GRAZE), data=Birds)
Dependent variable: | |
ABUND | |
factor(GRAZE)2 | -6.673* |
(3.379) | |
factor(GRAZE)3 | -7.336** |
(2.850) | |
factor(GRAZE)4 | -8.052** |
(3.526) | |
factor(GRAZE)5 | -22.331*** |
(2.950) | |
Constant | 28.623*** |
(2.086) | |
Observations | 56 |
R2 | 0.545 |
Adjusted R2 | 0.509 |
Residual Std. Error | 7.520 (df = 51) |
F Statistic | 15.267*** (df = 4; 51) |
Note: | p<0.1; p<0.05; p<0.01 |
Run annova on the model to check how well this model explains the Y in this case ABUND.
Statistic | N | Mean | St. Dev. | Min | Max |
Df | 2 | 27.500 | 33.234 | 4 | 51 |
Sum Sq | 2 | 3,168.964 | 402.655 | 2,884.244 | 3,453.685 |
Mean Sq | 2 | 459.987 | 570.541 | 56.554 | 863.421 |
F value | 1 | 15.267 | 15.267 | 15.267 | |
Pr(> F) | 1 | 0.00000 | 0.00000 | 0.00000 | |
Apply model as a function of all covariates But no interactions
M3 <- lm(ABUND ~ LOGAREA + LOGDIST + LOGLDIST +
YR.ISOL + ALT + factor(GRAZE),
data = Birds)
Dependent variable: | |
ABUND | |
LOGAREA | 6.833*** |
(1.503) | |
LOGDIST | 0.333 |
(2.748) | |
LOGLDIST | 0.798 |
(2.138) | |
YR.ISOL | -0.013 |
(0.058) | |
ALT | 0.011 |
(0.024) | |
factor(GRAZE)2 | 0.529 |
(3.252) | |
factor(GRAZE)3 | 0.066 |
(2.959) | |
factor(GRAZE)4 | -1.249 |
(3.198) | |
factor(GRAZE)5 | -12.473** |
(4.778) | |
Constant | 36.680 |
(115.163) | |
Observations | 56 |
R2 | 0.729 |
Adjusted R2 | 0.677 |
Residual Std. Error | 6.105 (df = 46) |
F Statistic | 13.784*** (df = 9; 46) |
Note: | p<0.1; p<0.05; p<0.01 |
We individually drop each of the variables
Statistic | N | Mean | St. Dev. | Min | Max |
Df | 6 | 1.500 | 1.225 | 1 | 4 |
Sum of Sq | 6 | 199.753 | 323.907 | 0.547 | 770.014 |
RSS | 7 | 1,885.647 | 305.172 | 1,714.430 | 2,484.443 |
AIC | 7 | 213.798 | 7.634 | 209.621 | 230.377 |
F value | 6 | 3.973 | 8.245 | 0.015 | 20.660 |
Pr(> F) | 6 | 0.523 | 0.400 | 0.00004 | 0.904 |
Now, as a function of the interaction between a continuous and a categorical covariate
#this is = to LOGAREA + factor(GRAZE) + LOGAREA : factor(GRAZE)
M4 <- lm(ABUND ~ LOGAREA * factor(GRAZE),
data = Birds)
Dependent variable: | |
ABUND | |
LOGAREA | 4.144** |
(2.057) | |
factor(GRAZE)2 | -6.165 |
(4.842) | |
factor(GRAZE)3 | -7.215 |
(4.820) | |
factor(GRAZE)4 | -17.910** |
(6.701) | |
factor(GRAZE)5 | -17.043*** |
(4.406) | |
LOGAREA:factor(GRAZE)2 | 4.368 |
(3.108) | |
LOGAREA:factor(GRAZE)3 | 4.989 |
(3.531) | |
LOGAREA:factor(GRAZE)4 | 15.235** |
(5.925) | |
LOGAREA:factor(GRAZE)5 | 1.996 |
(3.650) | |
Constant | 21.243*** |
(3.987) | |
Observations | 56 |
R2 | 0.767 |
Adjusted R2 | 0.721 |
Residual Std. Error | 5.666 (df = 46) |
F Statistic | 16.827*** (df = 9; 46) |
Note: | p<0.1; p<0.05; p<0.01 |
Dropping one var each in M4:
Statistic | N | Mean | St. Dev. | Min | Max |
Df | 1 | 4.000 | 4 | 4 | |
Sum of Sq | 1 | 253.771 | 253.771 | 253.771 | |
RSS | 2 | 1,603.513 | 179.443 | 1,476.628 | 1,730.399 |
AIC | 2 | 203.682 | 0.623 | 203.241 | 204.122 |
F value | 1 | 1.976 | 1.976 | 1.976 | |
Pr(> F) | 1 | 0.114 | 0.114 | 0.114 | |
Dependent variable: | |
ABUND | |
LOGAREA | 7.247*** |
(1.255) | |
factor(GRAZE)2 | 0.383 |
(2.912) | |
factor(GRAZE)3 | -0.189 |
(2.550) | |
factor(GRAZE)4 | -1.592 |
(2.976) | |
factor(GRAZE)5 | -11.894*** |
(2.931) | |
Constant | 15.716*** |
(2.767) | |
Observations | 56 |
R2 | 0.727 |
Adjusted R2 | 0.700 |
Residual Std. Error | 5.883 (df = 50) |
F Statistic | 26.627*** (df = 5; 50) |
Note: | p<0.1; p<0.05; p<0.01 |
Dropping one var each in M5
Statistic | N | Mean | St. Dev. | Min | Max |
Df | 2 | 2.500 | 2.121 | 1 | 4 |
Sum of Sq | 2 | 1,145.195 | 12.234 | 1,136.545 | 1,153.845 |
RSS | 3 | 2,493.862 | 661.235 | 1,730.399 | 2,884.244 |
AIC | 3 | 219.751 | 13.900 | 204.122 | 230.733 |
F value | 2 | 20.775 | 17.770 | 8.210 | 33.340 |
Pr(> F) | 2 | 0.00002 | 0.00003 | 0.00000 | 0.00004 |