Portfolio

ANOVA

library(tidyverse)
library(multcomp)
rm(list=ls())

Question: Is there evidence that natural selection has resulted in mean cuckoo egg size being related to the local dominant host.

The null hypothesis is that there is no difference in the variation of cuckoo egg sizes found in different species nests.

df_cuckoo<- read.csv("Cuckoo.csv", fileEncoding = "UTF-8-BOM")
glimpse(df_cuckoo)

## Observations: 120
## Variables: 2
## $ Host      <fct> MDW_PIPIT, MDW_PIPIT, MDW_PIPIT, MDW_PIPIT, MDW_PIPI...
## $ EggLength <dbl> 19.65, 20.05, 20.65, 20.85, 21.65, 21.65, 21.65, 21....

levels(df_cuckoo$Host)

## [1] "HDGE_SPRW"  "MDW_PIPIT"  "PIED_WTAIL" "ROBIN"      "TREE_PIPIT"
## [6] "WREN"

There are 6 different hosts for the cuckoo eggs.

We are interested in finding out if the mean cuckoo egg size is affected by the host species the cuckoo has layed its eggs in.

First we will visualise the spread of the data using a boxplot.

ggplot(df_cuckoo, aes(x = Host, y = EggLength)) + 
  geom_boxplot() + 
  labs(x = "Host Nest Species", y = "Cuckoo Egg Length (mm)")

The above box plot shows the spread of the Cuckoo egg length data (mm) for the different host species nests the cuckoo eggs were found in. The dark horizontal bar shows the median egg length for that host. The box contains 50% of the data, with the upper and lower edges representing the upper and lower mid-quartile values. The whiskers represnt the top and bottom recorded length with the open circles showing any outliers in the data. From this figure, we can suspect that the Wren hosts may be causing an effect on the cuckoo egg sizes found in those nests. The Meadow Pipit may also be statistically different from the Hedge Sparrow egg size, as there is very little overlap in the plots, however it seems as though the outliers in the Meadow Pipit data have outliers causing the data to have a large spread, while most of the data points are culstering within 2mm (betweem 21.5 and 23.5mm). The Wren data appears to be the most different to the other 5 data sets.

To perform an ANOVA we must assume normality in the data. The boxplots show the data all look relatively symmetrical, and have a similar sized spread of variance, so we can assume a normal distribution.

We should fit an ANOVA model to try and best fit the data:

fit <- aov(EggLength ~ 1 + Host, data = df_cuckoo) 
anova(fit)

## Analysis of Variance Table
## 
## Response: EggLength
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Host        5 42.940  8.5879  10.388 3.152e-08 ***
## Residuals 114 94.248  0.8267                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject the null hypothesis that the host nest species is not having a significant effect on cuckoo egg length in this model (as indicated by the low p-value). We now need to determine which of the group means differ from one another.

We can perform a Tukey test to perform multi-comparisons between all host variables and select the species of interest based on the output in the plots created earlier.

library(multcomp)

summary(glht(fit, linfct = mcp(Host = "Tukey")))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = EggLength ~ 1 + Host, data = df_cuckoo)
## 
## Linear Hypotheses:
##                              Estimate Std. Error t value Pr(>|t|)    
## MDW_PIPIT - HDGE_SPRW == 0   -0.82254    0.27825  -2.956   0.0415 *  
## PIED_WTAIL - HDGE_SPRW == 0  -0.21810    0.33789  -0.645   0.9868    
## ROBIN - HDGE_SPRW == 0       -0.54643    0.33275  -1.642   0.5658    
## TREE_PIPIT - HDGE_SPRW == 0  -0.03143    0.33789  -0.093   1.0000    
## WREN - HDGE_SPRW == 0        -1.99143    0.33789  -5.894   <0.001 ***
## PIED_WTAIL - MDW_PIPIT == 0   0.60444    0.27109   2.230   0.2276    
## ROBIN - MDW_PIPIT == 0        0.27611    0.26466   1.043   0.8994    
## TREE_PIPIT - MDW_PIPIT == 0   0.79111    0.27109   2.918   0.0462 *  
## WREN - MDW_PIPIT == 0        -1.16889    0.27109  -4.312   <0.001 ***
## ROBIN - PIED_WTAIL == 0      -0.32833    0.32678  -1.005   0.9130    
## TREE_PIPIT - PIED_WTAIL == 0  0.18667    0.33201   0.562   0.9930    
## WREN - PIED_WTAIL == 0       -1.77333    0.33201  -5.341   <0.001 ***
## TREE_PIPIT - ROBIN == 0       0.51500    0.32678   1.576   0.6093    
## WREN - ROBIN == 0            -1.44500    0.32678  -4.422   <0.001 ***
## WREN - TREE_PIPIT == 0       -1.96000    0.33201  -5.903   <0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

The Tukey comparison is used to determine which means amongst a set of means differ from the rest. Here we can see that the mean egg length in the Wren nest is strongly significanly different to all the means found in each other species nest. We also see that the eggs found in the Meadow Pipet nests are significantly smaller than the Hedge Sparrow and Tree Pipit eggs. All other comparisons did not show a significant difference.

So to answer the question of there being evidence that natural selection has resulted in mean cuckoo egg size being related to the local dominant host species; Yes, there is some evidence that the mean cuckoo egg length can vary depending on the host species the eggs are laid in. However, it would be benefitial to have a reference mean for the host species own eggs, and a mean for cuckoo eggs, to see if the mean cuckoo egg length in a host species nest is sized more similar to the host species mean egg length, and less like the normal cuckoo egg length.

Linear regression

library(tidyverse)
df_alligator<- read.csv("Alligator.csv", fileEncoding = "UTF-8-BOM")
glimpse(df_alligator)

## Observations: 19
## Variables: 2
## $ SnoutVent   <dbl> 1.10, 1.19, 1.13, 1.15, 0.96, 1.19, 1.06, 0.70, 0....
## $ PelvicCanal <dbl> 7.62, 8.20, 8.00, 9.60, 6.50, 8.17, 7.20, 4.65, 5....

Question 1: Is pelvic canal width a good predictor of snout-vent length for A. mississipiensis?

Null hypothesis is there is no correlation between the width of the pelvic canal, and the snout length in A. mississipiensis.

First we should have a look at how the data look:

plot(x = df_alligator$PelvicCanal, y = df_alligator$SnoutVent, xlab = "Pelvic Canal Width (cm)",
  ylab = "Snout-Vent Length (m)")

From this scatter plot, we can see there seems to be some sort of an association between Pelvic Canal Width and Snout-Vent Length. It looks as though Snout-Vent Length is increasing as Pelvic Canal Width increases in a positive correlation. Because there is no clear causation between these two traits, and the data are continuous, we can perform a Persson’s Correlation Test.

cor.test(df_alligator$PelvicCanal, df_alligator$SnoutVent)

## 
##  Pearson's product-moment correlation
## 
## data:  df_alligator$PelvicCanal and df_alligator$SnoutVent
## t = 7.6355, df = 17, p-value = 6.846e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7091013 0.9531712
## sample estimates:
##       cor 
## 0.8799091

There appears to be a strong positive correlation between Pelvic Canal Width and Snout-Vent Length, supported by the p-value, so Pelvic Canal Width is a good predictor for Snout-Vent Length in A. mississipiensis Alligators, and we can reject the null hypothesis.

Question 2: Suppose you measured the pelvic canal width of a fossil to be 7.62 cm. Assuming that alligators are a neontological model of the fossil species (i.e., a similarly proportioned extant species), what is your estimated snout-vent length for the fossil specimen?

To predict the snout-length for this fossil specimen based on the pelvic canal width measurment, we need to perform a linear regression to then estimate the snout-length from the line of best-fit.

LinearMod <- lm(SnoutVent ~ PelvicCanal, data = df_alligator)
summary(LinearMod)

## 
## Call:
## lm(formula = SnoutVent ~ PelvicCanal, data = df_alligator)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19515 -0.04335  0.00711  0.06006  0.14431 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1970     0.1213   1.624    0.123    
## PelvicCanal   0.1176     0.0154   7.636 6.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09467 on 17 degrees of freedom
## Multiple R-squared:  0.7742, Adjusted R-squared:  0.761 
## F-statistic:  58.3 on 1 and 17 DF,  p-value: 6.846e-07

The linear regression model has calculated the Y intercept (c) to be 0.197 and the slope (m) is 0.1176.

Now we need to test our null hypothesis that the real slope of the line is zero.

anova(LinearMod)

## Analysis of Variance Table
## 
## Response: SnoutVent
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## PelvicCanal  1 0.52250 0.52250  58.301 6.846e-07 ***
## Residuals   17 0.15235 0.00896                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is not enough support for the null hypothesis, so we reject the null.

confint(LinearMod)

##                   2.5 %    97.5 %
## (Intercept) -0.05896178 0.4529820
## PelvicCanal  0.08508317 0.1500558

The equation to explain the line of best-fit is y = mx + c. Here we can work out the estimation of snout-vent length (y).

predict(LinearMod, interval = "prediction", newdata = data.frame(PelvicCanal=7.62))

##       fit       lwr      upr
## 1 1.09289 0.8879237 1.297855

Using this linear regression we can predict that if the fossil pelvic canal width is 7.62 cm, the best estimation for that alligator’s snout-vent length is 1.09 m with a confidence interval of 0.88 to 1.29 m.

General Linear Models

library(datasets)
data("ToothGrowth")
glimpse(ToothGrowth)

## Observations: 60
## Variables: 3
## $ len  <dbl> 4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0, 16....
## $ supp <fct> VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, V...
## $ dose <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1....

First make sure R knows that dose is a factor:

ToothGrowth$dose <- factor(ToothGrowth$dose)

We want to know what the effect of the dose and delivery method of vitamin C suppliments as either Orange Juice (OJ) or ascorbic acid (VC) has on odontoblast cells growth (tooth growth).

First we’ll calculate the mean tooth growth length as grouped by the 6 dose and suppliment pairings.

df_summary <- ToothGrowth %>%
  group_by(dose, supp) %>%
  summarise(len = mean(len))

df_summary

## # A tibble: 6 x 3
## # Groups:   dose [?]
##   dose  supp    len
##   <fct> <fct> <dbl>
## 1 0.5   OJ    13.2 
## 2 0.5   VC     7.98
## 3 1     OJ    22.7 
## 4 1     VC    16.8 
## 5 2     OJ    26.1 
## 6 2     VC    26.1

Now we can plot the data and overlay the means to visualise the trends.

ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
  geom_point() + 
  geom_line(data = df_summary,
    aes(x = dose, y = len, color = supp, group = supp)) + 
  labs(x = "Vitamin C dose (mg per day)", y = "Tooth Growth (mm)", group = "Suppliment") +
  theme_bw()

It looks as though OJ is having less of an effect on tooth growth for doses 0.5 and 1mg/day, but a more similar effect at the 2mg/day dosage.

Next we need to fit a Linear Model to the data and determine if there is an interaction between Vitamin C dose and suppliment type.

fit_with_interaction <- lm(len ~ 1 + dose + supp + dose:supp, data = ToothGrowth)

summary(fit_with_interaction)

## 
## Call:
## lm(formula = len ~ 1 + dose + supp + dose:supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.20  -2.72  -0.27   2.65   8.27 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    13.230      1.148  11.521 3.60e-16 ***
## dose1           9.470      1.624   5.831 3.18e-07 ***
## dose2          12.830      1.624   7.900 1.43e-10 ***
## suppVC         -5.250      1.624  -3.233  0.00209 ** 
## dose1:suppVC   -0.680      2.297  -0.296  0.76831    
## dose2:suppVC    5.330      2.297   2.321  0.02411 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.631 on 54 degrees of freedom
## Multiple R-squared:  0.7937, Adjusted R-squared:  0.7746 
## F-statistic: 41.56 on 5 and 54 DF,  p-value: < 2.2e-16

The estimated model parameters indicate that there is a positive relationship between dose and suppliment VC as indicated by the estimated coefficient.

drop1(fit_with_interaction, test = "F")

## Single term deletions
## 
## Model:
## len ~ 1 + dose + supp + dose:supp
##           Df Sum of Sq    RSS    AIC F value  Pr(>F)  
## <none>                 712.11 160.43                  
## dose:supp  2    108.32 820.43 164.93   4.107 0.02186 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significant F statistic indicates there is an interaction between dose amount and suppliment type.

Now we can fit a model without the dose:supp interaction factor:

fit_without_interaction <- lm(len ~ 1 + dose + supp, data = ToothGrowth)

summary(fit_without_interaction)

## 
## Call:
## lm(formula = len ~ 1 + dose + supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.085 -2.751 -0.800  2.446  9.650 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.4550     0.9883  12.603  < 2e-16 ***
## dose1         9.1300     1.2104   7.543 4.38e-10 ***
## dose2        15.4950     1.2104  12.802  < 2e-16 ***
## suppVC       -3.7000     0.9883  -3.744 0.000429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.828 on 56 degrees of freedom
## Multiple R-squared:  0.7623, Adjusted R-squared:  0.7496 
## F-statistic: 59.88 on 3 and 56 DF,  p-value: < 2.2e-16

drop1(fit_without_interaction, test = "F")

## Single term deletions
## 
## Model:
## len ~ 1 + dose + supp
##        Df Sum of Sq    RSS    AIC F value    Pr(>F)    
## <none>               820.4 164.93                      
## dose    2   2426.43 3246.9 243.47  82.811 < 2.2e-16 ***
## supp    1    205.35 1025.8 176.33  14.017 0.0004293 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because there is a significant interaction between dose and suppliment type, we will need to seperate out the data by suppliment type.

To examine the effect of dose on tooth growth for both suppliment types, we will look at the OJ and VC data seperately:

df_oj <- ToothGrowth[which(ToothGrowth$supp== "OJ"),]

df_vc <- ToothGrowth[which(ToothGrowth$supp== "VC"),]

First we will perform an ANOVA of the OJ data:

fit_oj <- aov(len~ 1 + dose, data= df_oj)
anova(fit_oj)

## Analysis of Variance Table
## 
## Response: len
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## dose       2 885.26  442.63  31.442 8.887e-08 ***
## Residuals 27 380.11   14.08                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA is telling us that dose has a significant effect on tooth growth for the OJ data set as indicated by the F-value 31.442, p-value <0.005 with 2 df.

summary(glht(fit_oj, linfct = mcp(dose = "Tukey")))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = len ~ 1 + dose, data = df_oj)
## 
## Linear Hypotheses:
##              Estimate Std. Error t value Pr(>|t|)    
## 1 - 0.5 == 0    9.470      1.678   5.644   <0.001 ***
## 2 - 0.5 == 0   12.830      1.678   7.646   <0.001 ***
## 2 - 1 == 0      3.360      1.678   2.002    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Tukey result tells us for OJ delivery method, tooth growth in the 1mg/day dose is significantly higher than the 0.5mg/day dose (difference in means of 9.47 length), but no difference between 1mg and 2mg/day.

Now we can do the same for the ascorbic acid data:

fit_vc <- aov(len~ 1 + dose, data= df_vc)
anova(fit_vc)

## Analysis of Variance Table
## 
## Response: len
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## dose       2 1649.5  824.74  67.072 3.357e-11 ***
## Residuals 27  332.0   12.30                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(glht(fit_vc, linfct = mcp(dose = "Tukey")))

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = len ~ 1 + dose, data = df_vc)
## 
## Linear Hypotheses:
##              Estimate Std. Error t value Pr(>|t|)    
## 1 - 0.5 == 0    8.790      1.568   5.605 1.48e-05 ***
## 2 - 0.5 == 0   18.160      1.568  11.580  < 1e-05 ***
## 2 - 1 == 0      9.370      1.568   5.975  < 1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

All dose amounts of VC suppliment are significantly different from one another.

Now we will perform seperate T-tests on the doses:

df_0.5 <- ToothGrowth[which(ToothGrowth$dose== "0.5"),]
df_1 <- ToothGrowth[which(ToothGrowth$dose== "1"),]
df_2 <- ToothGrowth[which(ToothGrowth$dose== "2"),]

t.test(subset(df_0.5, supp=="OJ")$len, subset(df_0.5, supp=="VC")$len,  paired= FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  subset(df_0.5, supp == "OJ")$len and subset(df_0.5, supp == "VC")$len
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean of x mean of y 
##     13.23      7.98

Mean tooth length is 13.23 for OJ and 7.90 for VC for the 0.5mg/day dose. The different delivery methods have significant difference for tooth growth for the 0.5mg/day delivery method.

t.test(subset(df_1, supp=="OJ")$len, subset(df_1, supp=="VC")$len,  paired= FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  subset(df_1, supp == "OJ")$len and subset(df_1, supp == "VC")$len
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean of x mean of y 
##     22.70     16.77

Mean tooth growth is 16.77 for VC, and 22.70 for OJ for the 1mg/day dose. Again, the different delivery methods have a significant difference for tooth growth at a 1mg/day dosage.

t.test(subset(df_2, supp=="OJ")$len, subset(df_2, supp=="VC")$len,  paired= FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  subset(df_2, supp == "OJ")$len and subset(df_2, supp == "VC")$len
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean of x mean of y 
##     26.06     26.14

Mean tooth growth for VC is 20.06, and 26.14 for OJ at dosage level of 2mg/day. At a 2mg/day dosage level, delivery method does not have a significant effect on tooth growth.

This highlights the importance of seperating out the different dosage levels to compare the delivery methods effects properly.

Chi-Squared

Question: is there an association between the two variables - stream site and electrofisher pass?

The null hypothesis is that there is no association between the stream site that the fish were caught, and the electrofisher pass in which they were caught.

m_catches <- matrix(
  data = c(45, 11, 18, 8, 24, 13, 5, 3), 
  nrow = 2, ncol = 4, byrow = TRUE)
# add names to the rows and columns
rownames(m_catches) <- c("UNM", "HAY")
colnames(m_catches) <- c("First", "Second", "Third", "Fourth")

res <- chisq.test(m_catches, correct = FALSE)

## Warning in chisq.test(m_catches, correct = FALSE): Chi-squared
## approximation may be incorrect

res$observed

##     First Second Third Fourth
## UNM    45     11    18      8
## HAY    24     13     5      3

res

## 
##  Pearson's Chi-squared test
## 
## data:  m_catches
## X-squared = 5.8998, df = 3, p-value = 0.1166

Here we see from the X^2 value and p-value>0.05 that the rate of capture in the different passes is not significantly different between the two streams. In otherwords, the number of fish caught in each pass is not differentially changing between the stream sites. There is no significant stream site impact on the number of fish caught in each pass.

Overdispersion

Question: Is there statistical evidence that bee activity is affected by temperature, wind, or rain?

Temperature, wind and rain are all continuous measurments.

df_bees<- read.csv("Bees.csv", fileEncoding = "UTF-8-BOM")
glimpse(df_bees)

## Observations: 100
## Variables: 5
## $ tags     <int> 897, 711, 919, 129, 94, 896, 576, 437, 784, 1096, 168...
## $ detected <int> 7, 10, 6, 1, 0, 7, 0, 2, 3, 3, 0, 0, 5, 18, 1, 0, 5, ...
## $ temp     <dbl> 14.7, 12.4, 14.7, 19.8, 14.6, 16.6, 13.2, 12.6, 14.7,...
## $ wind     <dbl> 1.7, 0.9, 1.2, 1.0, 1.6, 1.4, 5.2, 3.6, 1.4, 1.7, 1.9...
## $ rain     <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.8, 0.4, 0.0, 0.0, 0.0...

If we assume bees detected outside of the hive are the active bees, and the rest that are tagged are being treated as inactive:

bees.tagged <- factor(df_bees$tags) 
bees.active <- factor(df_bees$detected)

model1 <- glm(cbind(bees.tagged, bees.active) ~1 + temp + wind + rain, family = binomial(link = "logit"), data = df_bees)

summary(model1)

## 
## Call:
## glm(formula = cbind(bees.tagged, bees.active) ~ 1 + temp + wind + 
##     rain, family = binomial(link = "logit"), data = df_bees)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.3754  -0.7097  -0.0862   0.9315   2.7537  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.15290    0.34731  11.957  < 2e-16 ***
## temp        -0.15361    0.02345  -6.551 5.71e-11 ***
## wind         0.12560    0.05724   2.194   0.0282 *  
## rain         0.03884    0.01694   2.292   0.0219 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 282.67  on 99  degrees of freedom
## Residual deviance: 192.22  on 96  degrees of freedom
## AIC: 464
## 
## Number of Fisher Scoring iterations: 5

When residual deviance is over double the degrees of freedom, as there is in these data, there is overdispersion in the data. Without treating the data as overdispersed, all three variables are showing a significant effect on bee activity.

Random effects: Here we use a binomial drstribution when fitting a model to the data.

df_bees$OD <- factor(1:100)

library(lme4)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:tidyr':
## 
##     expand

model2 <- glmer(cbind(bees.active, bees.tagged) ~ 1 + temp + wind + rain + (1|OD),
  family=binomial(link=logit), data = df_bees)
summary(model2)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(bees.active, bees.tagged) ~ 1 + temp + wind + rain + (1 |  
##     OD)
##    Data: df_bees
## 
##      AIC      BIC   logLik deviance df.resid 
##    428.1    441.2   -209.1    418.1       95 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.13757 -0.40718  0.09068  0.47016  3.03051 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  OD     (Intercept) 0.3447   0.5871  
## Number of obs: 100, groups:  OD, 100
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.31594    0.51854  -8.323  < 2e-16 ***
## temp         0.16102    0.03439   4.682 2.83e-06 ***
## wind        -0.12891    0.08968  -1.438    0.151    
## rain        -0.03019    0.01994  -1.514    0.130    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr) temp   wind  
## temp -0.918              
## wind -0.285 -0.069       
## rain -0.300  0.384 -0.349

Then we test for the main effects of temperature, wind and rain using Drop1 on the model.

drop1(model2, test = "Chisq")

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge with max|grad| = 0.00137014 (tol =
## 0.001, component 1)

## Single term deletions
## 
## Model:
## cbind(bees.active, bees.tagged) ~ 1 + temp + wind + rain + (1 | 
##     OD)
##        Df    AIC     LRT   Pr(Chi)    
## <none>    428.15                      
## temp    1 446.36 20.2115 6.933e-06 ***
## wind    1 428.23  2.0850    0.1487    
## rain    1 428.56  2.4162    0.1201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only temperature has significant effect, whereas before (when the data was not treated as over dispersed) all factors seemed to have a sugnigicant effect on bee activity.

Logistic Regression

Question1: Are altitute and soil moisture affecting the distribution of Eucalyptus amygdalina accross Tasmania? Null hypothesis is that there are no effects of altitude and soil moisture on the distribution of E. amygdalina.

df_euc<- read.csv("E_amygdalina.csv", fileEncoding = "UTF-8-BOM")
glimpse(df_euc)

## Observations: 100
## Variables: 4
## $ site      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ elevation <int> 0, 27, 352, 436, 30, 536, 186, 71, 112, 373, 297, 19...
## $ moisture  <fct> well_drained, poorly_drained, poorly_drained, well_d...
## $ present   <int> 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1...

summary(df_euc$moisture)

## poorly_drained   well_drained 
##             23             77

ggplot(df_euc)

ggplot(data = df_euc, aes(x = elevation, y = present, color = moisture)) +
  geom_point() + # plot points in out graph
   ylab("Presence of Euc (0= no, 1 = yes)") + xlab("Site elevation (m)") +
  theme_bw()

ggplot(data = df_euc, aes(x = elevation, y = present, color = moisture)) +
  geom_point() + 
  facet_wrap(~ moisture) +
   ylab("Presence of Euc (0= no, 1 = yes)") + xlab("Site elevation (m)") +
  theme_bw()

This plot shows that both random variables have effects on the presence of E. amygdalina. Now we need to test if there is an interaciton between the two variables, site elevation and soil moisture by adding the interaction as a factor.

model_euc <- glm(present ~ 1 + elevation + moisture +elevation:moisture,
                 family = binomial(link = "logit"),
                 data = df_euc)

summary(model_euc)

## 
## Call:
## glm(formula = present ~ 1 + elevation + moisture + elevation:moisture, 
##     family = binomial(link = "logit"), data = df_euc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2408  -0.9677  -0.4836   1.1660   1.9771  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)
## (Intercept)                    -0.80553    1.12642  -0.715    0.475
## elevation                      -0.02076    0.01942  -1.069    0.285
## moisturewell_drained            0.95340    1.17541   0.811    0.417
## elevation:moisturewell_drained  0.01820    0.01947   0.935    0.350
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.21  on 99  degrees of freedom
## Residual deviance: 111.22  on 96  degrees of freedom
## AIC: 119.22
## 
## Number of Fisher Scoring iterations: 8

drop1(model_euc, test= "Chisq")

## Single term deletions
## 
## Model:
## present ~ 1 + elevation + moisture + elevation:moisture
##                    Df Deviance    AIC    LRT Pr(>Chi)
## <none>                  111.22 119.22                
## elevation:moisture  1   112.88 118.88 1.6642    0.197

Drop1 compares two models, with and without interaction of elevation and souil moisture. The likelihood ratio test statistic is 1.6642, so there is no significant interaction present (p-value 0.197).

model_euc2 <- glm(present ~ 1 + elevation + moisture,
                 family = binomial(link = "logit"),
                 data = df_euc)

summary(model_euc2)

## 
## Call:
## glm(formula = present ~ 1 + elevation + moisture, family = binomial(link = "logit"), 
##     data = df_euc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2619  -0.9588  -0.4785   1.1510   2.0978  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          -1.946917   0.760371  -2.560  0.01045 * 
## elevation            -0.002836   0.001338  -2.120  0.03404 * 
## moisturewell_drained  2.143319   0.785549   2.728  0.00636 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 128.21  on 99  degrees of freedom
## Residual deviance: 112.88  on 97  degrees of freedom
## AIC: 118.88
## 
## Number of Fisher Scoring iterations: 5

drop1(model_euc2, test = "Chisq")

## Single term deletions
## 
## Model:
## present ~ 1 + elevation + moisture
##           Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>         112.88 118.88                      
## elevation  1   118.13 122.13  5.2467 0.0219886 *  
## moisture   1   124.07 128.07 11.1851 0.0008246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using Drop1, we can determine that moisture is the main factor on presence of E. amygdilina, however elevation also having an effect.

x <- seq(from = min(df_euc$elevation), to = max(df_euc$elevation), length.out = 800)

beta.0 <- model_euc2$coefficients[1]
beta.1 <- model_euc2$coefficients[2]

p<- beta.0 + beta.1 * x
p<- exp(p)/ (1 + exp(p)) 

df_euc.fitted <-data_frame(elevation = x, prob_present = p)
glimpse(df_euc.fitted)

## Observations: 800
## Variables: 2
## $ elevation    <dbl> 0.0000000, 0.9974969, 1.9949937, 2.9924906, 3.989...
## $ prob_present <dbl> 0.1248899, 0.1245811, 0.1242729, 0.1239654, 0.123...

ggplot(data = df_euc, aes(x = elevation, y = present, color = moisture)) +
  geom_point() + # plot the observed data
  geom_line(data = df_euc.fitted, 
            aes(x = elevation, y = prob_present), 
            color = "black") +
       ylab("Presence of Euc (0= no, 1 = yes)") + xlab("Site elevation (m)") +
  theme_bw()

This plot shows us that as elevation decreases, the probability of E. amygdalina presence decreases.

Question2: What is the predicted probability of E. amygdalina at sea level?

newdata = data.frame(elevation = 0, moisture = factor(c("poorly_drained", "well_drained")))
predict(model_euc2, newdata, type = "response")

##         1         2 
## 0.1248899 0.5489432

We predict that there is a 12.5% chance of seeing a E. amygdalina plant in poorly-drained soil at sealevel, and a 54% chance of seeing a plant at sea-level in well-drained soil.

KMA711 Assignment1

KPicard