Data Analysis for Environmental Professionals

BEMA 5003H – Fall 2021

Trent University

Modular Assignment 3 - Regression

Overview:

Write a report detailing the methods you used in the scenarios below, providing the R code, output, as well as the resulting plots, to address the questions and make your conclusion. Be sure to address the questions in words (don’t just provide R code and output). It may be helpful to think of this assignment as a report that you would give to a stakeholder who is keen not only to know the answer to these questions, but also to see how you addressed them. Produce your report using RMarkdown, and submit your report as a PDF.

The point value for each question is indicated in [ ].

Total possible points: 9 (9% of your final grade)

HINT: to run a linear regression in R, use the following syntax:

Model = lm(Response ~ Predictor1 + Predictor2, data = DATA)

Question 1:

You are an avid gardener and enjoy planting perennials because they produce beautiful flowers year after year without re-planting. The persistence of your perennials, however, depends on successful pollination each year. This year, you’re going to build a new flower bed, but are uncertain how big to make it. Using the pollinator dataset (pollinators.csv), estimate the size your flower bed should be to ensure that you achieve the median level of pollination success. Given the results of your regression, how confident are you that constructing a flower bed within the range of error provided by your estimate will result in a median level of pollination success? [3 points]

HINTs:

  • AREA is in square meters
  • You can assume a linear relationship between pollination success and area of the flower bed
  • Only include the variables of interest
  • Consider using the predict() function. Specifying the argument se.fit = TRUE is one way to get a standard error around your estimate.
  • Consider how you should specify your response and predictor variables to answer the question that is being asked
  • Don’t forget to test your model assumptions!
  • It is important that your variables are in their original units when making estimations
  • To add lines to a plot, use the function abline()
  • Calculate the standard error or a 95% confidence interval to quantify how confident you that you will achieve the median level of pollination success
  • Look at the \(R^2\) value to evaluate the strength of the relationship between AREA and success
setwd("/Users/brendandickson/Desktop/R working Directroy/Assignment 3")
pollinators <- read.table("pollinators(5).csv", header= TRUE, sep= ",")
str(pollinators)
## 'data.frame':    30 obs. of  4 variables:
##  $ success : int  19 32 35 17 21 7 57 27 87 45 ...
##  $ INSECT  : int  8 15 19 6 11 3 11 11 32 49 ...
##  $ AREA    : int  10 22 32 17 11 11 54 44 50 45 ...
##  $ DISTANCE: int  11 74 42 39 21 19 91 62 91 57 ...
# Calculate median success

psm <- median(pollinators$success)
psm
## [1] 34.5
# Plot relationship to see if a linear model might be appropriate
par(mfrow=c(2,2))
hist(pollinators$success)
hist(pollinators$AREA)

par(mfrow=c(1,1))

plot(AREA ~ success, data = pollinators, main= "The Effect Pollination Success has on Area Patch Size", xlab = "Pollination Success", ylab=" Area Patch Size")

The histograms of both success and AREA from the pollinators dataset demonstrate a normal distribution of the data, showing for the most part a bell shape. Similarly the scatter plots for both success and AREA data demonstrate a normal distribution of data. The scatter plot is normally distributed and is not cone shaped, and heteroskedasticity is not present in the data. Therefore a linear model would be appropriate between success and AREA data in the pollinators dataset.

# Fit a linear regression
Model.lm = lm(AREA ~ success , data = pollinators)

# Look at model summary
summary(Model.lm)
## 
## Call:
## lm(formula = AREA ~ success, data = pollinators)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.3950 -12.0980   0.7554  10.5330  24.8757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  18.0541     5.2396   3.446  0.00182 **
## success       0.2882     0.1105   2.609  0.01442 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.32 on 28 degrees of freedom
## Multiple R-squared:  0.1955, Adjusted R-squared:  0.1668 
## F-statistic: 6.805 on 1 and 28 DF,  p-value: 0.01442
## Assess model performance
anova(Model.lm)
## Analysis of Variance Table
## 
## Response: AREA
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## success    1 1394.5 1394.51  6.8046 0.01442 *
## Residuals 28 5738.2  204.94                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model.lm.anova <- anova(Model.lm)
Model.lm.tot.ss <- sum(Model.lm.anova$`Sum Sq`)
success.R2 <- Model.lm.anova[1,2]/Model.lm.tot.ss
success.R2
## [1] 0.1955088
library(fmsb)
success.VIF <- VIF(lm(AREA ~ success, data=pollinators))
success.VIF
## [1] 1.243022
# Produce a plot
par(mfrow=c(2,2))
plot(Model.lm)

Based on the annalysis of the regression, it was determined that there is a moderate correlation between pollination success and AREA within the data set, with a VIF = 1.24. It was determined that success had a positive effect on AREA, for every unit of success that increases, AREA increased by 0.29 meters^2 (t = 2.69, d.f. = 28, p = 0.014). As well the relationship between the variables was found to be significant with a p value < 0.05.

# Use the predict function to generate an estimate of patch size,
# with SE of estimate, for a median level of pollination success

new.data <- data.frame(success = psm)
predict(Model.lm, newdata = new.data, se.fit= TRUE, interval = 'prediction')
## $fit
##        fit       lwr      upr
## 1 27.99773 -1.848514 57.84398
## 
## $se.fit
## [1] 2.71348
## 
## $df
## [1] 28
## 
## $residual.scale
## [1] 14.31557
# State AREA of flowerbed required to achive median pollination success, 
  # Include a measure of your confidence in this estimate (e.g., and SE)
# Can also have estimate AREA from the model coefficients
  # Look at model summary

Based on the meadian success value of psm = 34.5, it was determined that within the linear regression of Model.lm, that a flower bed area of 27.99 meters^2 was required to achieve meadian pollination success. The 95% confidence interval lands within a range of -1.85 on the lower end, and 57.85 on the upper end, meaning the fit estimate of a garden bed area of 27.99 meters^2 will achieve median pollination success. As well, the AREA data was found to explain 19.55% of the variation within the data, with an adjusted r-squared = 0.1668.

Question 2:

A client is concerned that colony collapse disorder - a phenomenon in which a majority of worker bees disappear, abandoning their queen, food, and immature bees - will influence whether their native plants get pollinated. Using the data we have on pollinators (pollinators.bin.csv), examine the relationship between the number of insect visitors and pollination success (binary: yes/no). Should your client be concerned about potential reductions in honeybee populations? [3 points]

HINTs:

  • Do not use a linear model with a binary response variable
  • Visually or statistically report the direction and size of effect
  • Only include the variables of interest
# Read in dataset
pollinators.bin <- read.table("pollinators.bin.csv", header= TRUE, sep= ",")
str(pollinators.bin)
## 'data.frame':    30 obs. of  4 variables:
##  $ success.bin: int  0 0 1 0 0 0 1 0 1 1 ...
##  $ INSECT     : int  8 15 19 6 11 3 11 11 32 49 ...
##  $ AREA       : int  10 22 32 17 11 11 54 44 50 45 ...
##  $ DISTANCE   : int  11 74 42 39 21 19 91 62 91 57 ...
# Fit a model
bin.glm <- glm(success.bin ~ INSECT, data = pollinators.bin, family="binomial")


anova(bin.glm, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: success.bin
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                      29     41.589              
## INSECT  1   12.954        28     28.635 0.0003193 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mcf.r2<- with(summary(bin.glm), 1- deviance/null.deviance)
mcf.r2
## [1] 0.3114694
# Look at model summary
summary(bin.glm)
## 
## Call:
## glm(formula = success.bin ~ INSECT, family = "binomial", data = pollinators.bin)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -2.88276    1.12649  -2.559  0.01050 * 
## INSECT       0.15801    0.06115   2.584  0.00977 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.589  on 29  degrees of freedom
## Residual deviance: 28.635  on 28  degrees of freedom
## AIC: 32.635
## 
## Number of Fisher Scoring iterations: 5

The relationship between pollinator success and pollinator insect numbers is significant p < 0.01. The relationship has a positive effect on pollination success with a positive slope = 0.15801, indicating that as insect numbers increases pollination success is achieved. Based on the MacFaddens R-squared, insect numbers explain 31.1% of the variation in pollination success.

# A plot will help determine direction and size of this effect

# Make a plot 
par(mfrow=c(2,2))
plot(bin.glm)

par(mfrow=c(1,2))
hist(pollinators.bin$success.bin)
hist(pollinators.bin$INSECT)

par(mfrow=c(1,1))
plot( pollinators.bin$INSECT, pollinators.bin$success.bin, main = "Effect Insect Visitation Numbers has on Pollination Success", xlab = "Insect Visitation Numbers", ylab = "Pollination Success", pch = 19, frame=FALSE)
abline(bin.glm, col= "blue")

[Tell us whether the client should be concerned, and why!]

  • After analizing the data presented in the pollinators.bin data set, and examining the relationship between pollination success and pollinator numbers, it is apparent from the data that there is a threshold associated with pollination success and pollinator numbers. From the data and the scatter plot of the relationship, between the range of 11:17 within the pollinator count data, the number of pollinators present has a direct effect on the success rate of pollination. Testing this assumption based on a General Linerar Model(GLM) of the relationship between (success~INSECT, data=pollinator.bin), pollinator numbers == 40 & 20 were predicted to achieve pollinator success. Based on the data, and the GLM, it could be condlued that collony collapse could have a drastic effect on pollination success of plants depenent on pollinator populations.
# Calculate what would happen with with 40 visiting pollinators

INS <- data.frame(INSECT=c(40))

predict.glm(bin.glm, newdata =INS, se.fit = TRUE, type = "response")
## $fit
##         1 
## 0.9688584 
## 
## $se.fit
##          1 
## 0.04497185 
## 
## $residual.scale
## [1] 1
# If that is halved to 20 pollinators, the probability of pollination is:

INS2 <- data.frame(INSECT = c(20))

predict.glm(bin.glm, newdata =INS2, se.fit = TRUE, type = "response")
## $fit
##         1 
## 0.5689103 
## 
## $se.fit
##         1 
## 0.1224824 
## 
## $residual.scale
## [1] 1

Question 3:

Multiple factors likely influence pollination success. Determine the relative influences of insect visitation and patch area on pollination success. Which has a stronger effect on pollination success - the number of insect visitors or patch size? [3 points]

HINTs:

  • Don’t standardize your response variable
  • You need to standardize the predictor variables if you want to compare them directly
  • Only include the variables of interest
  • P-values are NOT a measure of effect size! (Never forget this!)
# Read in dataset
pollinators <- read.table("pollinators(5).csv", header= TRUE, sep= ",")
str(pollinators)
## 'data.frame':    30 obs. of  4 variables:
##  $ success : int  19 32 35 17 21 7 57 27 87 45 ...
##  $ INSECT  : int  8 15 19 6 11 3 11 11 32 49 ...
##  $ AREA    : int  10 22 32 17 11 11 54 44 50 45 ...
##  $ DISTANCE: int  11 74 42 39 21 19 91 62 91 57 ...
# Standardize INSECT and AREA variables, as you want to directly compare their effects
pollinators$INSECT.stand <- scale(pollinators$INSECT, center =TRUE, scale = TRUE)
pollinators$AREA.stand <- scale(pollinators$AREA, center =TRUE, scale = TRUE)
pollinators
##    success INSECT AREA DISTANCE INSECT.stand  AREA.stand
## 1       19      8   10       11  -0.94666220 -1.26889328
## 2       32     15   22       74  -0.37375424 -0.50373150
## 3       35     19   32       42  -0.04637826  0.13390331
## 4       17      6   17       39  -1.11035018 -0.82254891
## 5       21     11   11       21  -0.70113021 -1.20512980
## 6        7      3   11       19  -1.35588217 -1.20512980
## 7       57     11   54       91  -0.70113021  1.53669990
## 8       27     11   44       62  -0.70113021  0.89906509
## 9       87     32   50       91   1.01759366  1.28164597
## 10      45     49   45       57   2.40894155  0.96282857
## 11      31     43   40       43   1.91787759  0.64401116
## 12      28     10   51       34  -0.78297421  1.34540945
## 13      35     15   31       65  -0.37375424  0.07013983
## 14      74     33   18       22   1.09943765 -0.75878543
## 15      22     15    3       34  -0.37375424 -1.71523765
## 16      11      5   32       17  -1.19219418  0.13390331
## 17       7      2    3       11  -1.43772616 -1.71523765
## 18      17     17   16       50  -0.21006625 -0.88631239
## 19      86     19   22       91  -0.04637826 -0.50373150
## 20      34     14   41       54  -0.45559823  0.70777464
## 21      45     22   40       31   0.19915372  0.64401116
## 22      75     29   33       65   0.77206168  0.19766679
## 23      64     33   16       32   1.09943765 -0.88631239
## 24      54     17   27       93  -0.21006625 -0.18491410
## 25      29     12   36       32  -0.61928622  0.38895724
## 26      60     24   34       17   0.36284171  0.26143027
## 27      43     34   51       78   1.18128165  1.34540945
## 28      56     22   44       97   0.19915372  0.89906509
## 29      88     41   52      143   1.75418960  1.40917294
## 30      27     15   11       46  -0.37375424 -1.20512980
# Fit the regression model
Model.var.lm <- lm(success~ INSECT.stand + AREA.stand, data= pollinators)
par(mfrow=c(2,2))
plot(Model.var.lm)

par(mfrow=c(1,1))
plot(pollinators$INSECT.stand, pollinators$success, pch = 19, main= "Effect Insect Visititation Numbers Has On Pollination Success", xlab = "Insect Visitation Numbers", ylab="Pollination Success" )

plot(pollinators$AREA.stand, pollinators$success, pch = 19, main= "Effect Area Patch Size Has On Pollination Success", xlab= "Area Patch Size", ylab = "Pollination Success")

library(fmsb)
INSECT.VIF <- VIF(lm(success ~ INSECT, data=pollinators))
INSECT.VIF
## [1] 1.715814
AREA.VIF <- VIF(lm(success ~ AREA, data= pollinators))
AREA.VIF
## [1] 1.243022
# Look at model summary
summary(Model.var.lm)
## 
## Call:
## lm(formula = success ~ INSECT.stand + AREA.stand, data = pollinators)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.967  -8.339  -3.842  12.655  47.904 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    41.100      3.381  12.155 1.85e-12 ***
## INSECT.stand   13.465      3.828   3.517  0.00156 ** 
## AREA.stand      4.724      3.828   1.234  0.22782    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.52 on 27 degrees of freedom
## Multiple R-squared:  0.4483, Adjusted R-squared:  0.4074 
## F-statistic: 10.97 on 2 and 27 DF,  p-value: 0.0003258
Model.var.lm0 <- lm(success~ INSECT + AREA, data= pollinators)
summary(Model.var.lm0)
## 
## Call:
## lm(formula = success ~ INSECT + AREA, data = pollinators)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.967  -8.339  -3.842  12.655  47.904 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  10.5302     7.9360   1.327  0.19566   
## INSECT        1.1020     0.3133   3.517  0.00156 **
## AREA          0.3012     0.2441   1.234  0.22782   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.52 on 27 degrees of freedom
## Multiple R-squared:  0.4483, Adjusted R-squared:  0.4074 
## F-statistic: 10.97 on 2 and 27 DF,  p-value: 0.0003258
# Look at coefficient estimates for the answer
anova(Model.var.lm)
## Analysis of Variance Table
## 
## Response: success
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## INSECT.stand  1 7003.2  7003.2 20.4171 0.000111 ***
## AREA.stand    1  522.4   522.4  1.5229 0.227816    
## Residuals    27 9261.2   343.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model.var.lm.anova <- anova(Model.var.lm)
var.total.ss <- sum(Model.var.lm.anova$`Sum Sq`)

INSECT.stand.R2 <- Model.var.lm.anova[1,2]/var.total.ss
INSECT.stand.R2
## [1] 0.4171863
AREA.stand.R2 <- Model.var.lm.anova[2,2]/var.total.ss
AREA.stand.R2
## [1] 0.03111754
Model.var.lm.R2 <-Model.var.lm.anova[1,2]/var.total.ss + Model.var.lm.anova[2,2]/var.total.ss
Model.var.lm.R2
## [1] 0.4483038

Insect visitation numbers and Area patch size had a positive effect on the pollination success. It was determined that there is no colinearity between success ~ Insect visitation VIF = 1.72, and success ~ Area patch size data VIF = 1.24. For every insect count increase of 1 Insect visitation, pollination success increased by a factor of 1.10 units (t= 3.517, d.f= 27, p<0.001). Similarly, for every 1m^2 increase in patch size, pollination success increased by a factor of 0.30 units (t= 1.234, d.f = 27, p = 0.22). Based on the data produced by the linear model, it can be determined that Insect visitation had a greater effect on pollination success within the model because it produced a t value = 3.517, compared to Area patch size with a t value = 1.234.