Changes

  1. Switched to using ggplot2
  2. formatted this into a Rmarkdown file.

Underlying question and task

The variable ABUND is the density of birds in 56 forest patches. The explanatory variables are size of the forest patches (AREA), distance to the nearest forest patch (DIST), distance to the nearest larger forest patch (LDIST), year of isolation of the patch (YR.ISOL), agricultural grazing intensity at each patch (GRAZE) and altitude (ALT). The underlying aim of the research is to find a relationship between bird densities and the explanatory variables.

Data Exploration

Inspect the file. What do we have?

names(Birds)
## [1] "Site"    "ABUND"   "AREA"    "DIST"    "LDIST"   "YR.ISOL" "GRAZE"  
## [8] "ALT"
str(Birds)
## 'data.frame':    56 obs. of  8 variables:
##  $ Site   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ABUND  : num  5.3 2 1.5 17.1 13.8 14.1 3.8 2.2 3.3 3 ...
##  $ AREA   : num  0.1 0.5 0.5 1 1 1 1 1 1 1 ...
##  $ DIST   : int  39 234 104 66 246 234 467 284 156 311 ...
##  $ LDIST  : int  39 234 311 66 246 285 467 1829 156 571 ...
##  $ YR.ISOL: int  1968 1920 1900 1966 1918 1965 1955 1920 1965 1900 ...
##  $ GRAZE  : int  2 5 5 3 5 3 5 5 4 5 ...
##  $ ALT    : int  160 60 140 160 140 130 90 60 130 130 ...

Questions to answer in the following

  1. Outliers in Y / Outliers in X
  2. Collinearity X
  3. Relationships Y vs X
  4. Spatial/temporal aspects of sampling design (not relevant here)
  5. Interactions (is the quality of the data good enough to include them?)
  6. Zero inflation Y
  7. Are categorical covariates balanced?
#First some elementary R commands.
#1 How do you acces variables in an object like Birds?
Birds            #All data
head(Birds)      #First 6 rows
head(Birds, 10)  #First 10 rows
Birds$ABUND      #ABUND variable
Birds[,1]        #First column
Birds[,2]        #Second coloumn
Birds[,"ABUND"]  #ABUND variable 
Birds[1,"ABUND"] #First row of ABUND variable
Birds[1:10,"ABUND"]  #First 10 rows of ABUND
c("ABUND", "AREA")   #Two characters concatenated
Birds[, c("ABUND", "AREA")] #ABUND and AREA variables
MyVar <- c("ABUND", "AREA")  #Same as last two steps
Birds[, MyVar]

1 Outliers in Y

qplot(x=as.factor(1), y=ABUND, data=Birds, geom=c("boxplot", "jitter"))

#A Outliers in X

c("AREA", "DIST", "LDIST", "YR.ISOL","ALT", "GRAZE" ) %>%
lapply(function(colName){
qplot(factor(1), Birds[,colName], geom="jitter")+xlab(colName)
                    }) %>%
do.call(grid.arrange, .)

#Apply transformations
Birds %<>% mutate(LOGAREA  = log10(AREA))
Birds %<>% mutate(LOGDIST  = log10(DIST))
Birds %<>% mutate(LOGLDIST = log10(LDIST))

2 Collinearity X

3. Investigating relationships bet Y against X

5. Interactions

qplot(ABUND, LOGAREA, data=Birds) + facet_wrap(~GRAZE) + 
  stat_smooth(method = "lm", formula = y ~ x, na.rm=T)

6. Zero inflation

sum(Birds$ABUND == 0)
## [1] 0
100 * sum(Birds$ABUND == 0) / nrow(Birds)
## [1] 0

7. Are categorical covariates balanced? Works on Factors only (categorical)

% latex table generated in R 3.1.1 by xtable 1.7-4 package % Mon May 4 15:29:06 2015
M1 = lm(ABUND ~ LOGAREA, data = Birds)
What is the model that we are fitting?
 ABUND_i = alpha + beta * LOGAREA_i + eps_i
 eps_i ~ N(0, sigma^2)
 where i = 1..56
 ABUND_i is the abundance at site i
What is the fitted model
  E(ABUND_i) = mu_i = 10.40 + 9.77 * LOGAREA_i 
  eps ~ N(0,  7.28^2)
Is everything significant?

Summary results from model1

ABUND = alpha + beta * LOGAREA + eps
Dependent variable:
ABUND
LOGAREA 9.778***
(1.209)
Constant 10.401***
(1.489)
Observations 56
R2 0.548
Adjusted R2 0.539
Residual Std. Error 7.286 (df = 54)
F Statistic 65.377*** (df = 1; 54)
Note: p<0.1; p<0.05; p<0.01

Interpretation of F and t values

H0: beta = 0
H1: beta <> 0

F_1,54 = 65.38 (p<0.001)
Or:
t_n-1    t_55 = 8.08 (p < 0.001)
Text in paper: A t-value indicated a significant effect (t = 8.08; df = 55, p < 0.001)

Model validation (independence is most important)

  1. Homogeneity
  2. Independence

A Homogeneity

#ziggy: best to calculate by hand (lme or lmer sometimes include or doesnt include)

E1 <- resid(M1)
#Better: corrected by leverage
E1 <- rstandard(M1) 
F1 <- fitted(M1)

plot(x = F1,
     y = E1,
     xlab = "Fitted values",
     ylab = "Residuals", 
     main = "Homogeneity?")
abline(h = 0, v = 0, lty = 2)

B Independence

Dependence due to model misfit. Check by plotting residuals versus covariates

But plot residuals also versus covariates NOT in the model!!!!!!!

#if you dont see a linear pattern, cannot do linear
plot(x = Birds$LOGAREA,
     y = E1)
abline(0,0,lty=2)

boxplot(E1 ~ factor(Birds$GRAZE)) #here there is a grazing effect on residuals - means grazing should have been fitted in model
abline(h = 0, lty = 2)

Most are OK except YR.ISOL - there is a pattern there, so should be included in model. In fact, grazing and years since isolation were colinear (see Ex1)

#ggplot version
qplot(
     x = YR.ISOL,
     y = E1, 
     color = factor(GRAZE),
     data=Birds
     )+scale_color_brewer("Graze", type="qual")+xlab("Year")

and also for all the other covariates!!!

Normality

hist(E1, main = "Normality", breaks=10)

#Or qq-plot aka normality check
qqnorm(E1)

#qqline(E1)

Plot

range(Birds$LOGAREA)
## [1] -1.000000  3.248219
#Create a data frame that contains x numbers of LOGAREA values
MyData <- data.frame(LOGAREA = seq(from= -1,
                                   to = 3.25,
                                   length = 25))

P1 <- predict(M1, newdata = MyData)
plot(x = Birds$LOGAREA, 
     y = Birds$ABUND,
     xlab = "Log transformed area",
     ylab = "Bird abundance")

lines(x = MyData$LOGAREA, 
      y = P1,
      lwd = 3,
      lty = 1,
      col = 1)

plotBare= ggplot(Birds, aes(x=ABUND, y=LOGAREA))                       +
            geom_point()                                               +
            geom_smooth(method="lm", formula = y ~ I(log(x)), na.rm=T) +
            labs(
     x = "Log transformed area",
     y = "Bird abundance"
     )
plotBare + coord_trans(x = "log10")

#Base plot Version
PP1 <- predict(M1, newdata = MyData, se = TRUE, interval = "confidence")

SeConf.Up  <- PP1$fit[,3]
SeConf.Low <- PP1$fit[,2]

plot(x = Birds$LOGAREA,
     y = Birds$ABUND)

lines(x = MyData$LOGAREA,
      y = PP1$fit[,1],
      lwd = 3,
      lty = 1,
      col = 1)


#Useful for later...when we have multiple lines
legend("topleft",
       legend=c("Line 1"),
       col = c(1),
       lty = c(1),
       lwd = c(1))

Now, model ABUND as a function of a categorical covariate (factor Graze)

Model2

M2 <- lm(ABUND ~ factor(GRAZE), data=Birds)
Dependent variable:
ABUND
factor(GRAZE)2 -6.673*
(3.379)
factor(GRAZE)3 -7.336**
(2.850)
factor(GRAZE)4 -8.052**
(3.526)
factor(GRAZE)5 -22.331***
(2.950)
Constant 28.623***
(2.086)
Observations 56
R2 0.545
Adjusted R2 0.509
Residual Std. Error 7.520 (df = 51)
F Statistic 15.267*** (df = 4; 51)
Note: p<0.1; p<0.05; p<0.01

Run annova on the model to check how well this model explains the Y in this case ABUND.

Statistic N Mean St. Dev. Min Max
Df 2 27.500 33.234 4 51
Sum Sq 2 3,168.964 402.655 2,884.244 3,453.685
Mean Sq 2 459.987 570.541 56.554 863.421
F value 1 15.267 15.267 15.267
Pr(> F) 1 0.00000 0.00000 0.00000

Apply model as a function of all covariates But no interactions

Model3

M3 <- lm(ABUND ~ LOGAREA + LOGDIST + LOGLDIST +
           YR.ISOL + ALT + factor(GRAZE),
         data = Birds)

Is everything significant?

Dependent variable:
ABUND
LOGAREA 6.833***
(1.503)
LOGDIST 0.333
(2.748)
LOGLDIST 0.798
(2.138)
YR.ISOL -0.013
(0.058)
ALT 0.011
(0.024)
factor(GRAZE)2 0.529
(3.252)
factor(GRAZE)3 0.066
(2.959)
factor(GRAZE)4 -1.249
(3.198)
factor(GRAZE)5 -12.473**
(4.778)
Constant 36.680
(115.163)
Observations 56
R2 0.729
Adjusted R2 0.677
Residual Std. Error 6.105 (df = 46)
F Statistic 13.784*** (df = 9; 46)
Note: p<0.1; p<0.05; p<0.01

We individually drop each of the variables

Statistic N Mean St. Dev. Min Max
Df 6 1.500 1.225 1 4
Sum of Sq 6 199.753 323.907 0.547 770.014
RSS 7 1,885.647 305.172 1,714.430 2,484.443
AIC 7 213.798 7.634 209.621 230.377
F value 6 3.973 8.245 0.015 20.660
Pr(> F) 6 0.523 0.400 0.00004 0.904

Model4

Now, as a function of the interaction between a continuous and a categorical covariate

#this is = to LOGAREA + factor(GRAZE) + LOGAREA : factor(GRAZE)
M4 <- lm(ABUND ~ LOGAREA * factor(GRAZE), 
         data = Birds)
Dependent variable:
ABUND
LOGAREA 4.144**
(2.057)
factor(GRAZE)2 -6.165
(4.842)
factor(GRAZE)3 -7.215
(4.820)
factor(GRAZE)4 -17.910**
(6.701)
factor(GRAZE)5 -17.043***
(4.406)
LOGAREA:factor(GRAZE)2 4.368
(3.108)
LOGAREA:factor(GRAZE)3 4.989
(3.531)
LOGAREA:factor(GRAZE)4 15.235**
(5.925)
LOGAREA:factor(GRAZE)5 1.996
(3.650)
Constant 21.243***
(3.987)
Observations 56
R2 0.767
Adjusted R2 0.721
Residual Std. Error 5.666 (df = 46)
F Statistic 16.827*** (df = 9; 46)
Note: p<0.1; p<0.05; p<0.01

Dropping one var each in M4:

Statistic N Mean St. Dev. Min Max
Df 1 4.000 4 4
Sum of Sq 1 253.771 253.771 253.771
RSS 2 1,603.513 179.443 1,476.628 1,730.399
AIC 2 203.682 0.623 203.241 204.122
F value 1 1.976 1.976 1.976
Pr(> F) 1 0.114 0.114 0.114

Model5

Dependent variable:
ABUND
LOGAREA 7.247***
(1.255)
factor(GRAZE)2 0.383
(2.912)
factor(GRAZE)3 -0.189
(2.550)
factor(GRAZE)4 -1.592
(2.976)
factor(GRAZE)5 -11.894***
(2.931)
Constant 15.716***
(2.767)
Observations 56
R2 0.727
Adjusted R2 0.700
Residual Std. Error 5.883 (df = 50)
F Statistic 26.627*** (df = 5; 50)
Note: p<0.1; p<0.05; p<0.01

Dropping one var each in M5

Statistic N Mean St. Dev. Min Max
Df 2 2.500 2.121 1 4
Sum of Sq 2 1,145.195 12.234 1,136.545 1,153.845
RSS 3 2,493.862 661.235 1,730.399 2,884.244
AIC 3 219.751 13.900 204.122 230.733
F value 2 20.775 17.770 8.210 33.340
Pr(> F) 2 0.00002 0.00003 0.00000 0.00004