Homework 5

Introduction
Methods
Results
References

Introduction

Sarracenia, commonly known as pitcher plants, are carnivorous plants that mainly consume small prey like flies, ants, bees, and even snails. They are native to northeastern North America and are the largest group of carnivorous plants in the world. Sarracenia is already rare in the plant kingdom with them being carnivores and learning more about them can give us better insight into plant evolution. Sarracenia has also been found to be useful for medicinal purposes. The roots of Sarracenia contain high levels of anticancer activities, antioxidant capacity, and antibacterial activities (Huang et al., 2022). The leaf extract of Sarracenia has also been used to treat type II diabetes (Leduc et al., 2005). Understanding how we can predict their biomass from their physiological traits can help us find ways to conserve them by ensuring that they have the conditions to grow to the right biomass. This is important because the biomass of Sarracenia has been found to be a significant determinant of their prey capture ability (Bhattari and Horner, 2009). The objective is for us to understand how Sarracenia physiology data can be used to predict their individual biomass. We hypothesize that the Sarracenia species, feed level, and chlorophyll content are predictors of individual biomass.

# should haves (from last week)
library(tidyverse)
library(here)
library(janitor)
library(ggeffects)
library(performance)
library(naniar) # or equivalent
library(flextable) # or equivalent
library(car)
library(broom)
# would be nice to have
library(corrplot)
library(AICcmodavg)
library(GGally)

#use here package to organize files
here("data", "knb-lter-hfr.109.18")

## [1] "/Users/williamyip/github/ENVS-193DS_homework-05/data/knb-lter-hfr.109.18"

plant <- read_csv("~/github/ENVS-193DS_homework-05/data/knb-lter-hfr.109.18/hf109-01-sarracenia.csv") %>%
  #make column names cleaner
  clean_names() %>%
  #selecting columns of interest
  select(totmass, species, feedlevel, sla, chlorophyll, amass, num_lvs, num_phylls)

Methods

We took 8 different species of sarracenia and assigned 2 plants from each species group and assigned them different feeding levels from one to six. These levels were determined by the various weights of finely ground wasps that were fed to the plants. 0-0.25g for small species, 0-0.5g for medium species, and 0-1g for large species. Total biomass, chlorophyll content, total number of pitchers and phyllodes are measured after the feeding period (Ellison and Farnsworth, 2021). We collected our data by measuring the physiological and morphological attributes of the sarracenia and input the data into a dataset that we called plants.

We ran multiple visualizations of the dataset to better understand it and the relationships that it’s variables had with each other. As shown in figure 1, there seems to be a significant number of missing data in our original dataset. More importantly, the variables that have missing data are potential predictors that we may use for creating models. We subset the plant dataset to eliminate the missing data. As shown in figure 2, we calculated Pearson’s correlation to determine relationships between numerical values in our dataset. We found that there was a mix of positive and negative correlations between variables, however, all of these correlations were weak. The strongest correlation was between photosynthetic rate and specific leaf area at 0.32 which is still considered a weak correlation. As shown in figure 3, we visually assessed the relationships between variables using figure 3. We found that there were some variables that had a significant but weak Pearson’s correlation while there seemed to be no relationship between variables based on visual assessment of the density plot and scatter plot.

In order for us to determine how species and physiological characteristics predict biomass, we fit multiple linear models. We create a null model where we do not have any predictor variables and we also create a full model where we include all the predictor variables. These variables are: species, chlorophyll content, grams of ground Hymenoptera fed per week, specific leaf area, total number of pitchers, and the total number of phyllodes produced. As shown in figure 4, we visually assessed the normality and homoskestacitiy of residuals using a series of diagnostic plot for the full model. We found that the visual assumption check of the plots show that the model does not conform to linearity or normality. We also used the Shapiro-Wilk test (null hypothesis: variable of interest (i.e the residuals) are normally distributed) to test normality and the Breush-Pagan test (null hypothesis: variable of interest has constant variance) to test for heteroskedasticity. We found that the model had non-normality of residuals and heteroskedasticity which means it does not conform to the assumptions of linear regression. We used a log10 of each observation to transform the response variable. This transform residuals to normal and allows it to conform to the assumptions of linear regression. We evaluated multicollinearity of the full model by calculating the generalized variance inflation factor and determined that the model did not display multicollinearity. This is because the generalized variance inflation factor values were all below 5 which indicates that there was no multicollinearity in the model.

We then made 3 additional models to compare to the null and full models. For model 2, we choose the predictor variable of species because it was reasonable to believe that different species of Sarracenia grow to different sizes. Different species of Sarracenia likely grow in different regions or have different adaptations that can result in biomass that differ from other species. For model 3, We choose the predictors variables of species and chlorophyll content. The species variable was chosen because we used it as a predictor in our previous model and wanted to build on it. We also look at the predictor variable chlorophyll content because it indicates how much energy the plants can obtain through photosynthesis which could potentially correlate to biomass. For model 4, We choose the predictor variables of species and feed level. Again, the species variable was chosen for us to build upon it. We also choose the predictor variable feed level which is the grams of finely ground wasps that are fed to the Sarracenia. This is because we assume that higher feed levels will result in larger biomass. To compare the models that we created, we used Alkaline’s Information criterion (AIC) values to determine which model was the best one. The AIC values tell us which model is the least complex model that predicts the most variance.

#create missing data visual
gg_miss_var(plant)

Figure 1. Visualization of missing observations in plant dataset.
Variables are shown on the y-axis and the number of data missing is shown on the x-axis. The lines and dots represent how many missing observations are in each variable.

#drop the NAs
plant_subset <- plant %>%
  drop_na(sla, chlorophyll, num_lvs, num_phylls, amass)

#create dataset for correlation plot
plant_cor <- plant_subset %>%
  select(feedlevel:num_phylls) %>%
  cor(method = "pearson")

#create correlation plot
corrplot(plant_cor, 
         method = "ellipse",
         addCoef.col = "black"
         )

Figure 2. Pearson’s correlation between predictor variables.
The scale on the right indicates the strength of correlation and each cell has color shading to indicate the strength of correlation. Blue represents a positive correlation and red represents a negative correlation. The diagonals represent the correlation between the same variables.

#create plots to visually assess relationship between variables
plant_subset %>%
  select(species:num_phylls) %>%
  ggpairs()

Figure 3. Relationships between variables.
Density plot are shown along the diagonal and below the diagonal are scatterplots. Above the diagonal, there are Pearson’s correlations and the asterisks indicate significant relationships between variables. The bar plot at the top left shows the frequency of each species and the boxplot along the top line show the quantitative variable compared with the categorical variable species.

#create null and full model
null <- lm(totmass ~ 1, data = plant_subset) #start with nothing in there
full <- lm(totmass ~ species + feedlevel + sla + chlorophyll + amass + num_lvs + num_phylls, data = plant_subset) #everything in there

#create plots for visual assessment
par(mfrow = c(2,2))
plot(full)

Figure 4. Assumption checks for the full model.
Four plots used to determine the normality of full mode. Residual vs Fitted show residuals on y-axis and fitted values on x-axis. Used to indicate linear relationship. Normal Q-Q plot had standardized residuals on the y-axis and theoretical quantities on the x-axis. Used to determine normality of residuals. Scale-location show the square root of standardized residuals on the y-axis and fitted values on the x-axis. Used to look at homogeneity of variance of residuals. Residuals vs leverage show standardized residuals on the y-axis, leverage on x-axis, and cooke’s distance represented by dotted lines. Used to check how influencial outliers are.

#check assumptions for full model
check_normality(full)

## Warning: Non-normality of residuals detected (p < .001).

check_heteroscedasticity(full)

## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

#create log transformation of full model
full_log <- lm(log(totmass) ~ species + feedlevel + sla + chlorophyll + amass + num_lvs + num_phylls, data = plant_subset)
#check assumptions
par(mfrow = c(2,2))
plot(full_log)

Figure 5. Assumption checks for the full log model.
Four plots used to determine the normality of full mode. Residual vs Fitted show residuals on y-axis and fitted values on x-axis. Used to indicate linear relationship. Normal Q-Q plot had standardized residuals on the y-axis and theoretical quantities on the x-axis. Used to determine normality of residuals. Scale-location show the square root of standardized residuals on the y-axis and fitted values on the x-axis. Used to look at homogeneity of variance of residuals. Residuals vs leverage show standardized residuals on the y-axis, leverage on x-axis, and cooke’s distance represented by dotted lines. Used to check how influencial outliers are.

#check assumptions for full model log transformation
check_normality(full_log)

## OK: residuals appear as normally distributed (p = 0.107).

check_heteroskedasticity(full_log)

## OK: Error variance appears to be homoscedastic (p = 0.071).

#create log transformations of null and full model
null_log <- lm(log(totmass) ~ 1, data = plant_subset)
full_log <- lm(log(totmass) ~ species + feedlevel + sla + chlorophyll + amass + num_lvs + num_phylls, data = plant_subset)
#check assumptions
par(mfrow = c(2,2))
plot(full_log)

check_normality(full_log)

## OK: residuals appear as normally distributed (p = 0.107).

check_heteroscedasticity(full_log)

## OK: Error variance appears to be homoscedastic (p = 0.071).

#evaluate multicollinearity
vif(full_log)

##                  GVIF Df GVIF^(1/(2*Df))
## species     42.351675  9        1.231351
## feedlevel    1.621993  1        1.273575
## sla          1.999989  1        1.414210
## chlorophyll  1.949828  1        1.396362
## amass        2.872084  1        1.694722
## num_lvs      2.813855  1        1.677455
## num_phylls   2.995510  1        1.730754

#create other models
model2_log <- lm(log(totmass) ~ species, data = plant_subset)
model3_log <- lm(log(totmass) ~ species + chlorophyll, data = plant_subset)
model4_log <- lm(log(totmass) ~ feedlevel + species, data = plant_subset)

#check assumptions for model 2
par(mfrow = c(2,2))
plot(model2_log)

Figure 6. Assumption checks for model 2.
Four plots used to determine the normality of full mode. Residual vs Fitted show residuals on y-axis and fitted values on x-axis. Used to indicate linear relationship. Normal Q-Q plot had standardized residuals on the y-axis and theoretical quantities on the x-axis. Used to determine normality of residuals. Scale-location show the square root of standardized residuals on the y-axis and fitted values on the x-axis. Used to look at homogeneity of variance of residuals. Residuals vs leverage show standardized residuals on the y-axis, leverage on x-axis, and cooke’s distance represented by dotted lines. Used to check how influencial outliers are.

#check assumptions for model 2
check_normality(model2_log)

## OK: residuals appear as normally distributed (p = 0.374).

check_heteroskedasticity(model2_log)

## OK: Error variance appears to be homoscedastic (p = 0.100).

#check assumptions for model 3
par(mfrow = c(2,2))
plot(model3_log)

Figure 7. Assumption checks for model 3.
Four plots used to determine the normality of full mode. Residual vs Fitted show residuals on y-axis and fitted values on x-axis. Used to indicate linear relationship. Normal Q-Q plot had standardized residuals on the y-axis and theoretical quantities on the x-axis. Used to determine normality of residuals. Scale-location show the square root of standardized residuals on the y-axis and fitted values on the x-axis. Used to look at homogeneity of variance of residuals. Residuals vs leverage show standardized residuals on the y-axis, leverage on x-axis, and cooke’s distance represented by dotted lines. Used to check how influencial outliers are.

#check assumptions for model 3
check_normality(model3_log)

## OK: residuals appear as normally distributed (p = 0.134).

check_heteroskedasticity(model3_log)

## OK: Error variance appears to be homoscedastic (p = 0.327).

#check assumptions for model 4
par(mfrow = c(2,2))
plot(model4_log)

Figure 8. Assumption checks for model 4.
Four plots used to determine the normality of full mode. Residual vs Fitted show residuals on y-axis and fitted values on x-axis. Used to indicate linear relationship. Normal Q-Q plot had standardized residuals on the y-axis and theoretical quantities on the x-axis. Used to determine normality of residuals. Scale-location show the square root of standardized residuals on the y-axis and fitted values on the x-axis. Used to look at homogeneity of variance of residuals. Residuals vs leverage show standardized residuals on the y-axis, leverage on x-axis, and cooke’s distance represented by dotted lines. Used to check how influencial outliers are.

#check assumptions for model 4
check_normality(model4_log)

## OK: residuals appear as normally distributed (p = 0.339).

check_heteroskedasticity(model4_log)

## OK: Error variance appears to be homoscedastic (p = 0.110).

#Compare models using Akaline's Information criterion values
AICc(full_log)

## [1] 133.9424

AICc(model2_log)

## [1] 157.5751

AICc(null_log)

## [1] 306.0028

AICc(model3_log)

## [1] 146.0276

AICc(model4_log)

## [1] 159.6218

Results

For the best model, we choose the full model which compared the biomass to every predictor variable. We choose this model because it had the lowest AIC score compared to the other models. The model results show that it has a p-value of less than 0.001, F-statistic of 38.38, degree of freedom of 87, and an adjusted R-squared value of 0.8461. The full log shows the grams of food, chlorophyll content, surface leaf area, number of phyllodes, number of pitchers, and photosynthetic rate are all significant predictors of total biomass. This model shows that there are a multitude of biological factors that can be a predictor of plant biomass. This means that our original hypothesis was incorrect. The chlorophyll content, photosynthetic rate, surface leaf area, and number of phyllodes can have an effect on the energy that Sarracenia can produce from photosynthesis which can predict how large it can grow. However, since the Sarracenia are carnovirous plants, the number of pitchers and feedlevel can also be a predictor of the size that Sarracenia can grow into. The number of pitchers can affect the rate at which Sarracenia are able to catch prey in the wild and the feed level is directly related to the amount of food that Sarracenia consume which can predict biomass. What the full model tells us it that many biological components of Sarracenia can affect their biomass due to them being both plants and carnivorous.

#show results of model
summary(full_log)

## 
## Call:
## lm(formula = log(totmass) ~ species + feedlevel + sla + chlorophyll + 
##     amass + num_lvs + num_phylls, data = plant_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88872 -0.20811  0.02825  0.24218  0.78287 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.339043   0.597727  -2.240 0.027624 *  
## speciesalata        1.113163   0.184021   6.049 3.56e-08 ***
## speciesflava        1.404562   0.262955   5.341 7.29e-07 ***
## speciesjonesii      0.319652   0.196426   1.627 0.107281    
## speciesleucophylla  1.709035   0.227608   7.509 4.88e-11 ***
## speciesminor        0.389310   0.187903   2.072 0.041239 *  
## speciespsittacina  -1.645198   0.207035  -7.946 6.36e-12 ***
## speciespurpurea    -0.364348   0.254380  -1.432 0.155643    
## speciesrosea       -0.947383   0.260495  -3.637 0.000467 ***
## speciesrubra        0.875342   0.196361   4.458 2.46e-05 ***
## feedlevel          -0.474255   0.234493  -2.022 0.046199 *  
## sla                -0.002493   0.001160  -2.149 0.034430 *  
## chlorophyll         0.004368   0.001189   3.672 0.000414 ***
## amass               0.002338   0.002988   0.782 0.436166    
## num_lvs             0.091764   0.022413   4.094 9.46e-05 ***
## num_phylls         -0.039585   0.051714  -0.765 0.446068    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.413 on 87 degrees of freedom
## Multiple R-squared:  0.8687, Adjusted R-squared:  0.8461 
## F-statistic: 38.38 on 15 and 87 DF,  p-value: < 2.2e-16

#use ggpredict() to backtransform estimates

model_pred <- ggpredict(full_log, terms = "species", back.transform = TRUE)

plot(ggpredict(full_log, terms = "species", back.transform = TRUE), add.data = TRUE)

Figure 9. Species as a predictor of total biomass.
Vertical lines represent the error bars and the dark-shaded dots are the mean biomass values of each species. The light-shaded dots are all the different biomass values for each species.

#show ggpredict results
model_pred

## # Predicted values of totmass
## 
## species     | Predicted |        95% CI
## ---------------------------------------
## alabamensis |      2.78 | [2.11,  3.65]
## alata       |      8.45 | [6.58, 10.86]
## flava       |     11.31 | [7.57, 16.89]
## jonesii     |      3.82 | [2.78,  5.26]
## minor       |      4.10 | [3.15,  5.33]
## psittacina  |      0.54 | [0.37,  0.77]
## purpurea    |      1.93 | [1.28,  2.91]
## rubra       |      6.66 | [5.03,  8.82]
## 
## Adjusted for:
## *   feedlevel =   0.18
## *         sla = 129.27
## * chlorophyll = 471.29
## *       amass =  35.26
## *     num_lvs =   7.08
## *  num_phylls =   0.58

References

Bhattarai, G. P., & Horner, J. D. (2009). The Importance of Pitcher Size in Prey Capture in the Carnivorous Plant, Sarracenia alata Wood (Sarraceniaceae). The American Midland Naturalist, 161(2), 264–272. http://www.jstor.org/stable/20491437
Ellison, A. and E. Farnsworth. 2021. Effects of Prey Availability on Sarracenia Physiology at Harvard Forest 2005 ver 18. Environmental Data Initiative. https://doi.org/10.6073/pasta/26b22d09279e62fd729ffc35f9ef0174 (Accessed 2023-06-05).
Huang, Y. H., Chiang, W. Y., Chen, P. J., Lin, E. S., & Huang, C. Y. (2022). Anticancer and Antioxidant Activities of the Root Extract of the Carnivorous Pitcher Plant Sarracenia purpurea. Plants (Basel, Switzerland), 11(13), 1668. https://doi.org/10.3390/plants11131668
Leduc, C., Coonishish, J., Haddad, P., & Cuerrier, A. (2006). Plants used by the Cree Nation of Eeyou Istchee (Quebec, Canada) for the treatment of diabetes: A novel approach in quantitative ethnobotany. Journal of ethnopharmacology, 105(1-2), 55–63. https://doi.org/10.1016/j.jep.2005.09.038

Homework 5

William Yip

2023-06-06

Introduction

Methods

Results

References