This document summarizes our attempts to develop allometric equations for Phragmites australis and Typha x glauca in our study area. Work is shown, results are explained where possible, and findings are summarized at the end.
I selected all Phragmites data for 2019, as all basins were pre-treatment at the time of data collection, but I used only the control plots from 2020. This arrangement provides the greatest number of untreated samples for analysis.
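Note: main.df is assumed to have been loaded earlier in the document. The analysis also relies on several packages that were presumably attached in a setup chunk not shown here; judging from the functions used below (ncvTest, crPlots, melt, ddply, separate, %>%, select), that setup likely looks something like:
# assumed setup chunk (not shown in the original document)
library(car)       # ncvTest(), crPlots()
library(reshape2)  # melt()
library(plyr)      # ddply()
library(dplyr)     # %>% and select()
library(tidyr)     # separate()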
phrag <- subset(main.df, species == "phrag" &
((date == "2020" & treatment == "control") |
(date == "2019")))
summary(phrag)
linear model and QQ plots:
# building linear model
LMphrag <- lm(DryWeight_g ~ HeightClass, data = phrag)
# plot linear model residuals
qqnorm(LMphrag$residuals)
qqline(LMphrag$residuals)
Shapiro-Wilk Test
Ho: Data are drawn from a normal distribution (fail to reject when p > 0.001)
Ha: Data are NOT drawn from a normal distribution
shapiro.test(LMphrag$residuals)
##
## Shapiro-Wilk normality test
##
## data: LMphrag$residuals
## W = 0.9614, p-value = 0.0004067
The QQ plot is unconvincing and our Shapiro-Wilk test generated a p-value below 0.001. Data are not normally distributed and a transformation is required.
I hid the output to keep the section tidy, but p-values are listed under each transformation test.
LOGphrag <- lm(log10(DryWeight_g) ~ HeightClass, data = phrag) # lm with log10 transformation
SQRTphrag <- lm(sqrt(DryWeight_g) ~ HeightClass, data = phrag) # lm with sqrt transformation
X2phrag <- lm((DryWeight_g)^2 ~ HeightClass, data = phrag) # lm with x^2 transformation
shapiro.test(LOGphrag$residuals)
# p = 0.194, Ho accepted, DATA NORMAL
shapiro.test(SQRTphrag$residuals)
# p = 0.002986, Ho accepted, DATA NORMAL
shapiro.test(X2phrag$residuals)
# p = 0.000000002497, Ho rejected, data not normal
Of the three tested data transformations, both the log10 transformation and the square root transformation passed the Shapiro-Wilk Test.
Here are their QQ plots. The log10 transformation has a stronger QQ plot, but other tests may give a better idea of which transformation to choose.
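The plots can be reproduced with the same qqnorm/qqline pattern used elsewhere in this document:
qqnorm(LOGphrag$residuals, main = "log10 transformation")
qqline(LOGphrag$residuals)
qqnorm(SQRTphrag$residuals, main = "square root transformation")
qqline(SQRTphrag$residuals)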
Now that two sets of data meet normality assumptions, we can run equal variance tests of residuals.
plot(LOGphrag, 1, main = "Residuals vs Fitted, log10 transformation")
plot(SQRTphrag, 1, main = "Residuals vs Fitted, square root transformation")
The square root transformation is more convincing here.
# Linearity Test
crPlots(LOGphrag, main = "log10 transformation")
crPlots(SQRTphrag, main = "square root transformation")
# Outlier Test
plot(LOGphrag, 5, main = "log10 transformation")
plot(SQRTphrag, 5, main = "square root transformation")
For the non-constant variance test, the hypotheses are:
Ho: Data have constant variance (fail to reject when p > 0.05)
Ha: Data have non-constant variance
ncvTest(LOGphrag)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5.870782, Df = 1, p = 0.015394
ncvTest(SQRTphrag)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 3.889406, Df = 1, p = 0.048592
The log10 transformation does not have constant variance (p = 0.015394).
The square root transformation is borderline: its p-value of 0.048592 sits just below the 0.05 threshold. If we treat that as close enough, we meet the constant-variance assumption, and the square root model's residual plots are more convincing on the whole. Are we okay with this?
summary(SQRTphrag)
##
## Call:
## lm(formula = sqrt(DryWeight_g) ~ HeightClass, data = phrag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4977 -0.5752 -0.1201 0.4342 2.3289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.093546 0.309873 -3.529 0.00056 ***
## HeightClass 0.112853 0.005217 21.631 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8263 on 144 degrees of freedom
## Multiple R-squared: 0.7647, Adjusted R-squared: 0.763
## F-statistic: 467.9 on 1 and 144 DF, p-value: < 2.2e-16
Based on the summary, the y-intercept is -1.093546 and the slope is 0.112853, so our allometric equation is y = 0.112853x - 1.093546, where x is height class and y is the square root of dry weight in grams (predictions must be squared to return to grams).
The following code can determine our root mean squared error:
RSS <- c(crossprod(SQRTphrag$residuals))
MSE <- RSS / length(SQRTphrag$residuals)
RMSE <- sqrt(MSE)
RMSE # ~68% of observed values fall within the prediction +/- RMSE
## [1] 0.8206188
RMSE * 2 # ~95% of observed values fall within the prediction +/- 2*RMSE
## [1] 1.641238
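Because the model predicts the square root of dry weight, both the prediction and its error bounds have to be squared to return to grams. A minimal sketch, using a hypothetical height class (not a field value) and one possible way to carry the RMSE through the back-transformation:
# hypothetical height class, for illustration only
h <- 50
pred <- 0.112853 * h - 1.093546                 # prediction on the square-root scale
pred^2                                          # estimated dry weight, grams
c(lower = pred - RMSE, upper = pred + RMSE)^2   # approximate 68% bounds, back-transformed to grams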
Bringing in data collected from the field:
field.df <- read.csv("Sam_Counts_Cleaning.csv")
field.df$year <- as.factor(field.df$year)
field.df$basin <- as.factor(field.df$basin)
This data set is not in a format that is easy to analyze, so the next chunk of code prepares data for manipulation and analysis.
BasinPlots <- melt(field.df, id = c("date", "year", "basin", "plot", "notes"))
BasinPlots$value <- as.numeric(BasinPlots$value)
# clean up data frame, remove NA values
analysis <- subset(BasinPlots, (variable != "phrag_sum") & # drop the rows melted from the *_sum summary columns ("!=" means "is not equal to")
(variable != "typha_sum") &
(variable != "inf_sum"))
analysis <- na.omit(analysis) # removes NA values from subset
# create appropriate columns
TollwayDB <- analysis %>% separate(variable, c("species", "height_class")) # breaks apart the species and height class
TollwayDB$species <- as.factor(TollwayDB$species) # reminds R that species is a factor
TollwayDB$height_class <- as.numeric(TollwayDB$height_class)
TollwayDB$weighted_height <- TollwayDB$height_class * TollwayDB$value
# will divide this column from stem_total below to get mean height per plot
head(TollwayDB)
## date year basin plot notes species height_class
## 1507 8.14.19 2019 183 phrag_center phrag 7
## 1730 8.13.20 2020 140 phrag_240 phrag 8
## 1883 8.13.20 2020 140 phrag_120 32: either 1 or 2 phrag 9
## 1978 8.22.19 2019 236 phrag_240 phrag 10
## 2002 8.22.19 2019 236 phrag_center phrag 10
## 2056 8.13.20 2020 140 phrag_240 phrag 10
## value weighted_height
## 1507 1 7
## 1730 1 8
## 1883 1 9
## 1978 2 20
## 2002 2 20
## 2056 1 10
# create data frame with the total stem count and summed weighted heights for each plot and species
PlotTotals <- ddply(TollwayDB, c("date", "basin", "plot", "species"), summarize,
stem_total = sum(value, na.rm = T),
sum_weighted = sum(weighted_height, na.rm = T) ) # all heights added together (numerator of mean!)
# reminder: stem_total is a new column made of the sum of values
# create column for mean height of each plot
PlotTotals$mean_height <- PlotTotals$sum_weighted / PlotTotals$stem_total
head(PlotTotals)
## date basin plot species stem_total sum_weighted mean_height
## 1 8.11.20 184 phrag_120 phrag 28 1254 44.78571
## 2 8.11.20 184 phrag_120 typha 21 855 40.71429
## 3 8.11.20 184 phrag_240 phrag 139 3976 28.60432
## 4 8.11.20 184 phrag_240 typha 8 356 44.50000
## 5 8.11.20 184 phrag_360 phrag 140 4175 29.82143
## 6 8.11.20 184 phrag_center phrag 76 2685 35.32895
PlotTotals contains the stem total and mean height of all plots for all species. This “master” dataset will be used for both the Phragmites and Typha estimates.
For Phragmites:
phragPlots <- subset(PlotTotals, species == "phrag")
# subset of all plots in which phragmites is present
# apply the allometric equation (result is on the square-root scale)
phragPlots$biomass_raw <- 0.112853*phragPlots$mean_height - 1.093546
# back-transform by squaring; value is the estimated biomass (g) of a stem of average height in the plot
phragPlots$mean_biomass <- (phragPlots$biomass_raw)^2
# insert column of total estimated biomass in each plot
phragPlots$total_plot_biomass <- phragPlots$mean_biomass * phragPlots$stem_total
head(phragPlots)
## date basin plot species stem_total sum_weighted mean_height
## 1 8.11.20 184 phrag_120 phrag 28 1254 44.78571
## 3 8.11.20 184 phrag_240 phrag 139 3976 28.60432
## 5 8.11.20 184 phrag_360 phrag 140 4175 29.82143
## 6 8.11.20 184 phrag_center phrag 76 2685 35.32895
## 8 8.11.20 184 typha_120 phrag 5 197 39.40000
## 11 8.11.20 184 typha_240 phrag 6 211 35.16667
## biomass_raw mean_biomass total_plot_biomass
## 1 3.960656 15.686798 439.23033
## 3 2.134537 4.556248 633.31846
## 5 2.271892 5.161492 722.60885
## 6 2.893432 8.371947 636.26797
## 8 3.352862 11.241685 56.20842
## 11 2.875118 8.266303 49.59782
# basin-wide biomass per plot estimate
phragBasin <- ddply(phragPlots, c("date", "basin", "species"), summarize,
avg_biomass = mean(total_plot_biomass, na.rm = T))
head(phragBasin)
## date basin species avg_biomass
## 1 8.11.20 184 phrag 369.6386
## 2 8.12.20 28 phrag 618.9667
## 3 8.12.20 32 phrag 304.3385
## 4 8.13.19 184 phrag 1618.6949
## 5 8.13.20 140 phrag 232.8805
## 6 8.13.20 141 phrag 445.4147
We now have the average phrag biomass per m^2 for every basin. To get the total biomass, I pulled in Drew's 2019 acreage estimates from NearMap. I would like to update these values with 2020 aerial imagery.
# import of basin acreage data
acreage.df <- read.csv("basin_species_acreage.csv")
acreage.df$basin <- as.factor(acreage.df$basin)
# summarize total phrag area of each basin
speciesAreas <- ddply(acreage.df, c("species", "basin"), summarize,
total_acreage = sum(acreage, na.rm = T))
# Convert to m^2. Reminder: 1 acre = 4046.86 m^2
speciesAreas$area_in_m2 <- speciesAreas$total_acreage * 4046.86
head(speciesAreas)
## species basin total_acreage area_in_m2
## 1 phrag 28 0.6161340 2493.4080
## 2 phrag 32 0.1072954 434.2095
## 3 phrag 140 0.5493020 2222.9483
## 4 phrag 141 0.8745370 3539.1288
## 5 phrag 153 0.8505230 3441.9475
## 6 phrag 158 1.5082621 6103.7256
Similar to PlotTotals, we can use speciesAreas for both species analyses. The next step is to merge the phrag subset of speciesAreas with the per-basin phragBasin estimates. The merge() worked, but the columns required a little cleanup, and I spot-checked the numbers to make sure they were still accurate afterward.
phragArea <- subset(speciesAreas, species == "phrag")
phragFinal <- merge(phragArea, phragBasin, by = "basin") # puts avg biomass and total m^2 in same dataframe
phragFinal <- select(phragFinal, -c(species.x, species.y, total_acreage)) # drop unnecessary columns
names(phragFinal)
## [1] "basin" "area_in_m2" "date" "avg_biomass"
col_order <- c("basin", "date", "avg_biomass", "area_in_m2") # reorder columns for easier viewing
phragFinal <- phragFinal[, col_order]
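As a quick sanity check on the merge (a sketch; the original check may have looked different), one basin can be compared across the pre- and post-merge data frames:
# spot-check basin 184 before and after the merge
subset(phragBasin, basin == "184")
subset(phragArea, basin == "184")
subset(phragFinal, basin == "184")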
The final step is to add a column that multiplies biomass per m^2 by the total m^2 in the basin.
phragFinal$total_biomass <- phragFinal$avg_biomass * phragFinal$area_in_m2
We now have total biomass estimates per basin. Note that basin 140 has one more row than expected: a look back shows it was sampled on two different dates in 2019 (8.15.19 and 8.21.19), and I still need to decide how to correct for that.
phragFinal
## basin date avg_biomass area_in_m2 total_biomass
## 1 140 8.13.20 232.8805 2222.9483 517681.4
## 2 140 8.15.19 990.2061 2222.9483 2201177.0
## 3 140 8.21.19 1715.5844 2222.9483 3813655.4
## 4 141 8.13.20 445.4147 3539.1288 1576379.9
## 5 141 8.16.19 780.1900 3539.1288 2761192.9
## 6 153 8.16.19 1213.5758 3441.9475 4177064.2
## 7 158 8.17.20 482.9721 6103.7256 2947929.1
## 8 158 8.23.19 1252.4274 6103.7256 7644473.4
## 9 159 8.21.19 709.8142 2542.1727 1804470.4
## 10 159 8.19.20 1057.9005 2542.1727 2689365.7
## 11 183 8.14.19 797.4651 1350.3603 1076865.2
## 12 183 8.18.20 485.0438 1350.3603 654983.9
## 13 184 8.13.19 1618.6949 6924.1491 11208084.8
## 14 184 8.11.20 369.6386 6924.1491 2559432.6
## 15 188 8.19.20 1334.8108 15959.9251 21303480.4
## 16 188 8.19.19 1674.2044 15959.9251 26720177.6
## 17 236 8.22.19 965.6221 4494.6855 4340167.8
## 18 237 8.17.20 1220.0602 12325.4814 15037829.7
## 19 237 8.22.19 1883.5831 12325.4814 23216068.2
## 20 28 8.15.19 644.4606 2493.4080 1606903.3
## 21 28 8.12.20 618.9667 2493.4080 1543336.5
## 22 32 8.12.20 304.3385 434.2095 132146.7
## 23 32 8.14.19 730.1464 434.2095 317036.5
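A quick way to flag this kind of duplication (a sketch; it assumes the date strings parse as month.day.year):
# count sampling dates per basin and year; any cell > 1 means repeat sampling within a year
sample_year <- format(as.Date(phragFinal$date, format = "%m.%d.%y"), "%Y")
table(phragFinal$basin, sample_year)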
For Typha, I selected only non-inflorescence control plots from 2020, as the 2019 data combined inflorescence and non-inflorescence samples. Looking at the data, inflorescences will need to be calculated separately to make the estimates accurate, but for now these 2020 data will suffice until next year's collection.

#### Building subset of typha data
typha20 <- subset(main.df, species == "typha" &
(date == "2020" & treatment == "control"))
summary(typha20)
linear model and QQ plots:
# linear model
LMtypha20 <- lm(DryWeight_g ~ HeightClass, data = typha20)
# testing residuals
qqnorm(LMtypha20$residuals)
qqline(LMtypha20$residuals)
Shapiro-Wilk Test
Ho: Data are drawn from a normal distribution (fail to reject when p > 0.001)
Ha: Data are NOT drawn from a normal distribution
shapiro.test(LMtypha20$residuals)
##
## Shapiro-Wilk normality test
##
## data: LMtypha20$residuals
## W = 0.88391, p-value = 0.0003104
The QQ plot is unconvincing and our Shapiro-Wilk test generated a p-value below 0.001. Data are not normally distributed and a transformation is required.
I hid the output to keep the section tidy, but p-values are listed under each transformation test.
LOGtypha20 <- lm(log10(DryWeight_g) ~ HeightClass, data = typha20) # linear model with log10 transformation
SQRTtypha20 <- lm(sqrt(DryWeight_g) ~ HeightClass, data = typha20) # lm with sqrt transformation
X2typha20 <- lm((DryWeight_g)^2 ~ HeightClass, data = typha20) # lm w/ x^2 transformation
shapiro.test(LOGtypha20$residuals)
# p-value is 0.1976, DATA NORMAL
shapiro.test(SQRTtypha20$residuals)
# p-value is 0.02009, DATA NORMAL
shapiro.test(X2typha20$residuals)
# p-value is 0.00000008287, data not normal
Of the three tested data transformations, both the log10 and the square root transformations passed the Shapiro-Wilk test. I chose the log10 transformation for further analysis here, but may want to revisit this choice, since the square root transformation tended to be stronger in other scenarios and the transformation should ideally be consistent with the one used for Phragmites.
Here is the log10 transformation QQ plot:
qqnorm(LOGtypha20$residuals)
qqline(LOGtypha20$residuals)
Now that the data meet normality assumptions, we can run an equal variance test of the residuals.
plot(LOGtypha20, 1)
For the non-constant variance test, the hypotheses are:
Ho: Data have constant variance (fail to reject when p > 0.05)
Ha: Data have non-constant variance
ncvTest(LOGtypha20)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2141622, Df = 1, p = 0.64352
Here, the p-value is 0.64352. We fail to reject the null hypothesis; our data have constant variance.
Next, linearity and outlier tests.
# Linearity Test
crPlots(LOGtypha20)
# Outlier Test ==================
plot(LOGtypha20, 5)
Now that we have a model that meets our assumptions, we can build a biomass equation.
summary(LOGtypha20)
##
## Call:
## lm(formula = log10(DryWeight_g) ~ HeightClass, data = typha20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38283 -0.16112 0.00088 0.10863 0.43361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.112070 0.161810 0.693 0.492
## HeightClass 0.023742 0.003504 6.776 2.72e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2013 on 43 degrees of freedom
## Multiple R-squared: 0.5164, Adjusted R-squared: 0.5051
## F-statistic: 45.91 on 1 and 43 DF, p-value: 2.721e-08
The relevant outputs are the intercept (0.112070) and the HeightClass coefficient (0.023742), which give the y-intercept and slope. The equation is y = 0.023742x + 0.112070 +/- E, where x is height class, y is log10 of dry weight in grams, and E is the root mean squared error we will now calculate.
# Root mean squared error
RSS <- c(crossprod(LOGtypha20$residuals))
MSE <- RSS / length(LOGtypha20$residuals)
RMSE <- sqrt(MSE)
RMSE # ~68% of observed values fall within the prediction +/- RMSE
## [1] 0.1967814
RMSE * 2 # ~95% of observed values fall within the prediction +/- 2*RMSE
## [1] 0.3935628
Note: the RMSE must also be back-transformed. Both the model's predictions and the RMSE are on the log10 scale, so 10^x returns them to grams.
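A minimal sketch of that back-transformation, again using a hypothetical height class (not a field value):
# hypothetical height class, for illustration only
h <- 50
pred_log10 <- 0.023742 * h + 0.112070                       # prediction on the log10 scale
10^pred_log10                                               # estimated dry weight, grams
10^c(lower = pred_log10 - RMSE, upper = pred_log10 + RMSE)  # approximate 68% bounds, grams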
I selected all Typha data for 2019, as all basins were pre-treatment at the time of data collection, but I used only the control plots from 2020. This arrangement provides the greatest number of untreated samples for analysis.
typha <- subset(main.df, species == "typha" &
((date == "2020" & treatment == "control") |
(date == "2019")))
summary(typha)
linear model and QQ plots:
# linear model
LMtypha <- lm(DryWeight_g ~ HeightClass, data = typha)
# testing residuals
qqnorm(LMtypha$residuals)
qqline(LMtypha$residuals)
Shapiro-Wilk Test
Ho: Data are drawn from a normal distribution (fail to reject when p > 0.001)
Ha: Data are NOT drawn from a normal distribution
shapiro.test(LMtypha$residuals)
##
## Shapiro-Wilk normality test
##
## data: LMtypha$residuals
## W = 0.88309, p-value = 3.76e-09
The QQ plot is unconvincing and our Shapiro-Wilk test generated a p-value well below 0.001. Data are not normally distributed and a transformation is required.
I hid the output to keep the section tidy, but p-values are listed under each transformation test.
LOGtypha <- lm(log10(DryWeight_g) ~ HeightClass, data = typha) # linear model with log10 transformation
SQRTtypha <- lm(sqrt(DryWeight_g) ~ HeightClass, data = typha) # lm with sqrt transformation
X2typha <- lm((DryWeight_g)^2 ~ HeightClass, data = typha) # lm w/ x^2 transformation
shapiro.test(LOGtypha$residuals)
# p-value is 0.2202, DATA NORMAL
shapiro.test(SQRTtypha$residuals)
# p-value is 0.00002206, below the 0.001 threshold; data not normal
shapiro.test(X2typha$residuals)
# p-value is 0.000000000000025, data not normal
Of the three tested transformations, only the log10 transformation passed the Shapiro-Wilk test at the 0.001 threshold; the square root transformation fell well below it (p = 0.00002206), and the x^2 transformation failed badly. Both the log10 and square root models are still compared below.
Here are their QQ plots. The log10 transformation has a stronger QQ plot, but other tests may give a better idea of which transformation to choose.
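As before, the plots can be reproduced with:
qqnorm(LOGtypha$residuals, main = "log10 transformation")
qqline(LOGtypha$residuals)
qqnorm(SQRTtypha$residuals, main = "square root transformation")
qqline(SQRTtypha$residuals)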
We can now run equal variance tests of residuals on both candidate models.
plot(LOGtypha, 1, main = "Residuals vs Fitted, log10 transformation")
plot(SQRTtypha, 1, main = "Residuals vs Fitted, square root transformation")
The log10 transformation is more convincing here.
# Linearity Test
crPlots(LOGtypha, main = "log10 transformation")
crPlots(SQRTtypha, main = "square root transformation")
# Outlier Test
plot(LOGtypha, 5, main = "log10 transformation")
plot(SQRTtypha, 5, main = "square root transformation")
For the non-constant variance test, the hypotheses are:
Ho: Data have constant variance (fail to reject when p > 0.05)
Ha: Data have non-constant variance
ncvTest(LOGtypha)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 14.5631, Df = 1, p = 0.00013554
ncvTest(SQRTtypha)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.302147, Df = 1, p = 0.1292
The log10 transformation does not have constant variance (p = 0.00013554).
The square root transformation does have constant variance (p = 0.1292). Neither model meets every assumption (the log10 model fails the variance test, the square root model failed the normality test), but I proceed with the square root model here, which also keeps the transformation consistent with the Phragmites equation.
summary(SQRTtypha)
##
## Call:
## lm(formula = sqrt(DryWeight_g) ~ HeightClass, data = typha)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6087 -1.0065 -0.1491 0.7086 4.2744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2064 0.6104 -0.338 0.736
## HeightClass 0.1074 0.0130 8.268 9.83e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.261 on 139 degrees of freedom
## Multiple R-squared: 0.3296, Adjusted R-squared: 0.3248
## F-statistic: 68.35 on 1 and 139 DF, p-value: 9.834e-14
Based on the summary, the y-intercept is -0.2064 and the slope is 0.1074, so our allometric equation is y = 0.1074x - 0.2064, where x is height class and y is the square root of dry weight in grams (predictions must be squared to return to grams).
The following code can determine our root mean squared error:
RSS <- c(crossprod(SQRTtypha$residuals))
MSE <- RSS / length(SQRTtypha$residuals)
RMSE <- sqrt(MSE)
RMSE # ~68% of observed values fall within the prediction +/- RMSE
## [1] 1.252277
RMSE * 2 # ~95% of observed values fall within the prediction +/- 2*RMSE
## [1] 2.504555