Changes

Switched to using ggplot2
formatted this into a Rmarkdown file.

Underlying question and task

The variable ABUND is the density of birds in 56 forest patches. The explanatory variables are size of the forest patches (AREA), distance to the nearest forest patch (DIST), distance to the nearest larger forest patch (LDIST), year of isolation of the patch (YR.ISOL), agricultural grazing intensity at each patch (GRAZE) and altitude (ALT). The underlying aim of the research is to find a relationship between bird densities and the explanatory variables.

Data Exploration

Inspect the file. What do we have?

names(Birds)

## [1] "Site"    "ABUND"   "AREA"    "DIST"    "LDIST"   "YR.ISOL" "GRAZE"  
## [8] "ALT"

str(Birds)

## 'data.frame':    56 obs. of  8 variables:
##  $ Site   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ABUND  : num  5.3 2 1.5 17.1 13.8 14.1 3.8 2.2 3.3 3 ...
##  $ AREA   : num  0.1 0.5 0.5 1 1 1 1 1 1 1 ...
##  $ DIST   : int  39 234 104 66 246 234 467 284 156 311 ...
##  $ LDIST  : int  39 234 311 66 246 285 467 1829 156 571 ...
##  $ YR.ISOL: int  1968 1920 1900 1966 1918 1965 1955 1920 1965 1900 ...
##  $ GRAZE  : int  2 5 5 3 5 3 5 5 4 5 ...
##  $ ALT    : int  160 60 140 160 140 130 90 60 130 130 ...

Questions to answer in the following

Outliers in Y / Outliers in X
Collinearity X
Relationships Y vs X
Spatial/temporal aspects of sampling design (not relevant here)
Interactions (is the quality of the data good enough to include them?)
Zero inflation Y
Are categorical covariates balanced?

#First some elementary R commands.
#1 How do you acces variables in an object like Birds?
Birds            #All data
head(Birds)      #First 6 rows
head(Birds, 10)  #First 10 rows
Birds$ABUND      #ABUND variable
Birds[,1]        #First column
Birds[,2]        #Second coloumn
Birds[,"ABUND"]  #ABUND variable 
Birds[1,"ABUND"] #First row of ABUND variable
Birds[1:10,"ABUND"]  #First 10 rows of ABUND
c("ABUND", "AREA")   #Two characters concatenated
Birds[, c("ABUND", "AREA")] #ABUND and AREA variables
MyVar <- c("ABUND", "AREA")  #Same as last two steps
Birds[, MyVar]

1 Outliers in Y

qplot(x=as.factor(1), y=ABUND, data=Birds, geom=c("boxplot", "jitter"))

#A Outliers in X

c("AREA", "DIST", "LDIST", "YR.ISOL","ALT", "GRAZE" ) %>%
lapply(function(colName){
qplot(factor(1), Birds[,colName], geom="jitter")+xlab(colName)
                    }) %>%
do.call(grid.arrange, .)

#Apply transformations
Birds %<>% mutate(LOGAREA  = log10(AREA))
Birds %<>% mutate(LOGDIST  = log10(DIST))
Birds %<>% mutate(LOGLDIST = log10(LDIST))

2 Collinearity X

3. Investigating relationships bet Y against X

5. Interactions

qplot(ABUND, LOGAREA, data=Birds) + facet_wrap(~GRAZE) + 
  stat_smooth(method = "lm", formula = y ~ x, na.rm=T)

6. Zero inflation

sum(Birds$ABUND == 0)

## [1] 0

100 * sum(Birds$ABUND == 0) / nrow(Birds)

## [1] 0

7. Are categorical covariates balanced? Works on Factors only (categorical)

% latex table generated in R 3.1.1 by xtable 1.7-4 package % Mon May 4 15:29:06 2015

M1 = lm(ABUND ~ LOGAREA, data = Birds)

What is the model that we are fitting?

 ABUND_i = alpha + beta * LOGAREA_i + eps_i
 eps_i ~ N(0, sigma^2)
 where i = 1..56
 ABUND_i is the abundance at site i

What is the fitted model

  E(ABUND_i) = mu_i = 10.40 + 9.77 * LOGAREA_i 
  eps ~ N(0,  7.28^2)

Is everything significant?

Summary results from model1

ABUND = alpha + beta * LOGAREA + eps


	Dependent variable:

	ABUND

LOGAREA	9.778^***
	(1.209)

Constant	10.401^***
	(1.489)


Observations	56
R²	0.548
Adjusted R²	0.539
Residual Std. Error	7.286 (df = 54)
F Statistic	65.377^*** (df = 1; 54)

Note:	p<0.1; p<0.05; p<0.01

Interpretation of F and t values

H0: beta = 0
H1: beta <> 0

F_1,54 = 65.38 (p<0.001)
Or:
t_n-1    t_55 = 8.08 (p < 0.001)
Text in paper: A t-value indicated a significant effect (t = 8.08; df = 55, p < 0.001)

Model validation (independence is most important)

Homogeneity
Independence

A Homogeneity

#ziggy: best to calculate by hand (lme or lmer sometimes include or doesnt include)

E1 <- resid(M1)
#Better: corrected by leverage
E1 <- rstandard(M1) 
F1 <- fitted(M1)

plot(x = F1,
     y = E1,
     xlab = "Fitted values",
     ylab = "Residuals", 
     main = "Homogeneity?")
abline(h = 0, v = 0, lty = 2)

B Independence

Dependence due to model misfit. Check by plotting residuals versus covariates

But plot residuals also versus covariates NOT in the model!!!!!!!

#if you dont see a linear pattern, cannot do linear
plot(x = Birds$LOGAREA,
     y = E1)
abline(0,0,lty=2)

boxplot(E1 ~ factor(Birds$GRAZE)) #here there is a grazing effect on residuals - means grazing should have been fitted in model
abline(h = 0, lty = 2)

Most are OK except YR.ISOL - there is a pattern there, so should be included in model. In fact, grazing and years since isolation were colinear (see Ex1)

#ggplot version
qplot(
     x = YR.ISOL,
     y = E1, 
     color = factor(GRAZE),
     data=Birds
     )+scale_color_brewer("Graze", type="qual")+xlab("Year")

and also for all the other covariates!!!

Normality

hist(E1, main = "Normality", breaks=10)

#Or qq-plot aka normality check
qqnorm(E1)

#qqline(E1)

Plot

range(Birds$LOGAREA)

## [1] -1.000000  3.248219

#Create a data frame that contains x numbers of LOGAREA values
MyData <- data.frame(LOGAREA = seq(from= -1,
                                   to = 3.25,
                                   length = 25))

P1 <- predict(M1, newdata = MyData)
plot(x = Birds$LOGAREA, 
     y = Birds$ABUND,
     xlab = "Log transformed area",
     ylab = "Bird abundance")

lines(x = MyData$LOGAREA, 
      y = P1,
      lwd = 3,
      lty = 1,
      col = 1)

plotBare= ggplot(Birds, aes(x=ABUND, y=LOGAREA))                       +
            geom_point()                                               +
            geom_smooth(method="lm", formula = y ~ I(log(x)), na.rm=T) +
            labs(
     x = "Log transformed area",
     y = "Bird abundance"
     )
plotBare + coord_trans(x = "log10")

#Base plot Version
PP1 <- predict(M1, newdata = MyData, se = TRUE, interval = "confidence")

SeConf.Up  <- PP1$fit[,3]
SeConf.Low <- PP1$fit[,2]

plot(x = Birds$LOGAREA,
     y = Birds$ABUND)

lines(x = MyData$LOGAREA,
      y = PP1$fit[,1],
      lwd = 3,
      lty = 1,
      col = 1)


#Useful for later...when we have multiple lines
legend("topleft",
       legend=c("Line 1"),
       col = c(1),
       lty = c(1),
       lwd = c(1))

Now, model ABUND as a function of a categorical covariate (factor Graze)

Model2

M2 <- lm(ABUND ~ factor(GRAZE), data=Birds)


	Dependent variable:

	ABUND

factor(GRAZE)2	-6.673^*
	(3.379)

factor(GRAZE)3	-7.336^**
	(2.850)

factor(GRAZE)4	-8.052^**
	(3.526)

factor(GRAZE)5	-22.331^***
	(2.950)

Constant	28.623^***
	(2.086)


Observations	56
R²	0.545
Adjusted R²	0.509
Residual Std. Error	7.520 (df = 51)
F Statistic	15.267^*** (df = 4; 51)

Note:	p<0.1; p<0.05; p<0.01

Run annova on the model to check how well this model explains the Y in this case ABUND.


Statistic	N	Mean	St. Dev.	Min	Max

Df	2	27.500	33.234	4	51
Sum Sq	2	3,168.964	402.655	2,884.244	3,453.685
Mean Sq	2	459.987	570.541	56.554	863.421
F value	1	15.267		15.267	15.267
Pr(> F)	1	0.00000		0.00000	0.00000

Apply model as a function of all covariates But no interactions

Model3

M3 <- lm(ABUND ~ LOGAREA + LOGDIST + LOGLDIST +
           YR.ISOL + ALT + factor(GRAZE),
         data = Birds)

Is everything significant?


	Dependent variable:

	ABUND

LOGAREA	6.833^***
	(1.503)

LOGDIST	0.333
	(2.748)

LOGLDIST	0.798
	(2.138)

YR.ISOL	-0.013
	(0.058)

ALT	0.011
	(0.024)

factor(GRAZE)2	0.529
	(3.252)

factor(GRAZE)3	0.066
	(2.959)

factor(GRAZE)4	-1.249
	(3.198)

factor(GRAZE)5	-12.473^**
	(4.778)

Constant	36.680
	(115.163)


Observations	56
R²	0.729
Adjusted R²	0.677
Residual Std. Error	6.105 (df = 46)
F Statistic	13.784^*** (df = 9; 46)

Note:	p<0.1; p<0.05; p<0.01

We individually drop each of the variables


Statistic	N	Mean	St. Dev.	Min	Max

Df	6	1.500	1.225	1	4
Sum of Sq	6	199.753	323.907	0.547	770.014
RSS	7	1,885.647	305.172	1,714.430	2,484.443
AIC	7	213.798	7.634	209.621	230.377
F value	6	3.973	8.245	0.015	20.660
Pr(> F)	6	0.523	0.400	0.00004	0.904

Model4

Now, as a function of the interaction between a continuous and a categorical covariate

#this is = to LOGAREA + factor(GRAZE) + LOGAREA : factor(GRAZE)
M4 <- lm(ABUND ~ LOGAREA * factor(GRAZE), 
         data = Birds)


	Dependent variable:

	ABUND

LOGAREA	4.144^**
	(2.057)

factor(GRAZE)2	-6.165
	(4.842)

factor(GRAZE)3	-7.215
	(4.820)

factor(GRAZE)4	-17.910^**
	(6.701)

factor(GRAZE)5	-17.043^***
	(4.406)

LOGAREA:factor(GRAZE)2	4.368
	(3.108)

LOGAREA:factor(GRAZE)3	4.989
	(3.531)

LOGAREA:factor(GRAZE)4	15.235^**
	(5.925)

LOGAREA:factor(GRAZE)5	1.996
	(3.650)

Constant	21.243^***
	(3.987)


Observations	56
R²	0.767
Adjusted R²	0.721
Residual Std. Error	5.666 (df = 46)
F Statistic	16.827^*** (df = 9; 46)

Note:	p<0.1; p<0.05; p<0.01

Dropping one var each in M4:


Statistic	N	Mean	St. Dev.	Min	Max

Df	1	4.000		4	4
Sum of Sq	1	253.771		253.771	253.771
RSS	2	1,603.513	179.443	1,476.628	1,730.399
AIC	2	203.682	0.623	203.241	204.122
F value	1	1.976		1.976	1.976
Pr(> F)	1	0.114		0.114	0.114

Model5


	Dependent variable:

	ABUND

LOGAREA	7.247^***
	(1.255)

factor(GRAZE)2	0.383
	(2.912)

factor(GRAZE)3	-0.189
	(2.550)

factor(GRAZE)4	-1.592
	(2.976)

factor(GRAZE)5	-11.894^***
	(2.931)

Constant	15.716^***
	(2.767)


Observations	56
R²	0.727
Adjusted R²	0.700
Residual Std. Error	5.883 (df = 50)
F Statistic	26.627^*** (df = 5; 50)

Note:	p<0.1; p<0.05; p<0.01

Dropping one var each in M5


Statistic	N	Mean	St. Dev.	Min	Max

Df	2	2.500	2.121	1	4
Sum of Sq	2	1,145.195	12.234	1,136.545	1,153.845
RSS	3	2,493.862	661.235	1,730.399	2,884.244
AIC	3	219.751	13.900	204.122	230.733
F value	2	20.775	17.770	8.210	33.340
Pr(> F)	2	0.00002	0.00003	0.00000	0.00004

LM

Wesley GOI, Ezequiel (Ziggy) Marzinelli

Changes

Underlying question and task

Data Exploration

Questions to answer in the following

1 Outliers in Y

2 Collinearity X

3. Investigating relationships bet Y against X

5. Interactions

6. Zero inflation

7. Are categorical covariates balanced? Works on Factors only (categorical)

What is the model that we are fitting?

What is the fitted model

Is everything significant?

Interpretation of F and t values

Model validation (independence is most important)

A Homogeneity

B Independence

Normality

Plot

Model2

Model3

Is everything significant?

Model4

Model5