Exploring data patterns of sugar maple and red maple trees
This project uses a data set from the research study “Maple
Reproduction and Sap Flow at Harvard Forest since 2011”. The data sets
provided by the study can be found at
https://doi.org/10.6073/pasta/7c2ddd7b75680980d84478011c5fbba9
or through the full citation below. The project tries to answer the
question “Does the non-masting red maple species exhibit muted dynamics
compared to the masting sugar maple species?”
Because the data set is incomplete, there is far too little data to
say conclusively that red maples exhibit muted dynamics compared to
sugar maples. Instead, I look for the best key indicators, so that with
a more complete data set it would be easier to see whether the red
maple species exhibits muted dynamics.
Rapp, J., E. Crone, and K. Stinson. 2023. Maple Reproduction and Sap
Flow at Harvard Forest since 2011 ver 6. Environmental Data Initiative.
https://doi.org/10.6073/pasta/7c2ddd7b75680980d84478011c5fbba9
(Accessed 2024-12-11)
This notebook uses tidyverse for general R commands, ggplot2 for
graphing and data visualization, dplyr for data manipulation, and
modelr for working with statistical models and their summaries.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(modelr)
Below we read in each data set.
library(readr)
hf285_01_maple_tap <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-01-maple-tap.csv")
hf285_02_maple_sap <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-02-maple-sap.csv")
hf285_03_maple_flower_qual <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-03-maple-flower-qual.csv")
hf285_04_maple_flower <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-04-maple-flower.csv")
hf285_05_maple_spring_branch <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-05-maple-spring-branch.csv")
hf285_06_maple_fall_branch <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-06-maple-fall-branch.csv")
hf285_07_maple_seedfilling <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-07-maple-seedfilling.csv")
hf285_08_maple_pollen_excl <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-08-maple-pollen-excl.csv")
hf285_09_maple_seed_count <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-09-maple-seed-count.csv")
hf285_10_leaf_removal <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-10-leaf-removal.csv")
hf285_11_leaf_removal_branches <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-11-leaf-removal-branches.csv")
hf285_12_leaf_seed_removal <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-12-leaf-seed-removal.csv")
hf285_13_leaf_seed_removal_branches <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-13-leaf-seed-removal-branches.csv")
hf285_14_perm_branches <- read_csv("C:/Users/Admin/Desktop/R stats homework/Final Project/knb-lter-hfr.285.6/hf285-14-perm-branches.csv")
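The fourteen read_csv() calls above depend on one absolute Windows path. As a hedged alternative, the sketch below reads every CSV in a folder into a named list. The folder name data_dir and the helper read_all_csv() are hypothetical and not part of the original notebook, and base R's read.csv() is swapped in for read_csv() so the sketch has no package dependencies.

```r
# Hypothetical folder name; point this at the unzipped data package.
data_dir <- "knb-lter-hfr.285.6"

# Hypothetical helper: read every .csv file in `dir` into a named list.
read_all_csv <- function(dir) {
  paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  # Name each table after its file, e.g. "hf285-02-maple-sap"
  names(tables) <- tools::file_path_sans_ext(basename(paths))
  tables
}

# Example use (commented out because the folder may not exist here):
# maple <- read_all_csv(data_dir)
# head(maple[["hf285-02-maple-sap"]])
```

With this layout a single table is accessed by file name from the list, and moving the project to another machine only requires changing data_dir.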
Masting and Non-Masting
First we need to better understand our problem and learn a few key
definitions. A masting tree is a tree that produces seeds in large
quantities at irregular intervals. Our sugar maple trees are masting
trees, and we want to know if the red maples show muted dynamics when
compared to the sugar maples. To better understand our sugar maples we
first need to see which years were masting years. We do this by
examining the total seed count collected for each tree in each year of
the data.
tree_seed_data <- hf285_09_maple_seed_count %>%
filter(!is.na(total.count)) %>%
mutate(year = as.numeric(format(date, "%Y"))) %>%
group_by(tree, year) %>%
summarise(Total_Seeds = sum(total.count, na.rm = TRUE), .groups = "drop")
ggplot(tree_seed_data, aes(x = factor(year), y = tree, fill = Total_Seeds)) +
geom_tile(color = "black", linewidth = 0.5) +
scale_fill_viridis_c() +
labs(
title = "Yearly Seed Count by Tree",
x = "Year",
y = "Tree ID",
fill = "Seed Count"
)

Based on the heat map, a few years stick out: 2011, 2013, 2017, and
2019. These high seed counts are the best indicators of the years in
which masting took place in our sugar maples.
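The masting years above were picked by eye from the heat map. The sketch below shows one way such years could be flagged with an explicit rule instead. Both the toy numbers and the cutoff (more than one standard deviation above the across-year mean) are assumptions made for illustration; they are not taken from the study or from the Harvard Forest data.

```r
# Toy yearly mean seed counts (made-up numbers, not the Harvard Forest data)
yearly <- data.frame(
  year = 2011:2020,
  mean_seeds = c(130, 10, 150, 8, 12, 15, 140, 9, 11, 14)
)

# Assumed rule: a year is a candidate masting year when its mean seed count
# is more than one standard deviation above the mean across all years.
cutoff <- mean(yearly$mean_seeds) + sd(yearly$mean_seeds)
yearly$candidate_mast <- yearly$mean_seeds > cutoff

# Years flagged by the rule
yearly$year[yearly$candidate_mast]
```

A rule like this makes the choice of years reproducible, although the cutoff itself remains a judgment call.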
Next we add the masting years we found to our sap collection and
sugar content data set for the appropriate trees, which we can use
later to attempt to build a model that predicts masting seasons.
hf285_02_maple_sap <- hf285_02_maple_sap %>%
mutate(date = as.Date(date))
hf285_02_maple_sap <- hf285_02_maple_sap %>%
mutate(year = year(date))
hf285_02_maple_sap_masting <- hf285_02_maple_sap %>%
mutate(masting = ifelse(year %in% c(2011, 2013, 2017, 2019), "Yes", "No"))
head(hf285_02_maple_sap_masting)
Breaking down key indicators
Here we look for key indicators in our sugar maples that could
predict when masting might occur. These key predictors could help us
better compare the two species and understand whether our red maples
are exhibiting signs of muted dynamics. Below we graph the average sap
sugar content of both the sugar and red maples for the years in which
data was collected. HF stands for sugar maple and AR for red maple.
grouped_data <- hf285_02_maple_sap %>%
filter(!is.na(sugar)) %>%
mutate(Tree_Group = substr(tree, 1, 2)) %>%
group_by(date, Tree_Group) %>%
summarise(Average_Sugar = mean(sugar, na.rm = TRUE), .groups = "drop")
ggplot(grouped_data, aes(x = date, y = Average_Sugar, color = Tree_Group, group = Tree_Group)) +
geom_line() +
geom_point() +
labs(
title = "Average Sugar Content Over Time",
x = "Date",
y = "Average Sugar Content",
color = "Tree Group"
)

While we can see that the average sugar content is much lower for the
red maple, that doesn’t answer our question right away. The data
collected for red maples is very small, covering only about 4-5 years,
so it simply isn’t conclusive enough to answer our question.
Here is where we can start to dig in and see whether the yearly
average sugar content of the sugar maple trees (HF) and the average
seed count of the sugar maples follow similar patterns.
Average seed count per year compared to average sugar content of sugar maples
hf285_02_maple_sap_HF <- hf285_02_maple_sap %>%
filter(grepl("^HF", tree), !is.na(sugar))
hf285_09_maple_seed_count_HF <- hf285_09_maple_seed_count %>%
filter(grepl("^HF", tree), !is.na(total.count))
hf285_02_maple_sap_HF <- hf285_02_maple_sap_HF %>%
mutate(year = year(date))
hf285_09_maple_seed_count_HF <- hf285_09_maple_seed_count_HF %>%
mutate(year = year(date))
average_data <- hf285_02_maple_sap_HF %>%
group_by(year) %>%
summarise(average_sugar = mean(sugar, na.rm = TRUE))
average_seeds <- hf285_09_maple_seed_count_HF %>%
group_by(year) %>%
summarise(average_seeds = mean(total.count, na.rm = TRUE))
merged_avg_data <- left_join(average_data, average_seeds, by = "year")
merged_avg_data_clean <- merged_avg_data %>%
filter(!is.na(average_sugar) & !is.na(average_seeds))
sugar_content_HF <- ggplot(merged_avg_data_clean, aes(x = year, y = average_sugar)) +
geom_line(color = "blue", linewidth = 1) +
geom_point(color = "red", size = 3) +
labs(
title = "Average Sugar Content for HF Trees by Year",
x = "Year",
y = "Average Sugar Content"
) +
theme_minimal()
seed_count_HF <- ggplot(merged_avg_data_clean, aes(x = year, y = average_seeds)) +
geom_line(color = "green", linewidth = 1) +
geom_point(color = "orange", size = 3) +
labs(
title = "Average Seed Count for HF Trees by Year",
x = "Year",
y = "Average Seed Count"
) +
theme_minimal()
print(sugar_content_HF)

print(seed_count_HF)

Here we can see that while the graphs do have a similar shape, the
overall patterns don’t match exactly. To better find our key
predictors, we can use the masting column we added earlier to create a
linear model.
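Before moving to a model, the visual impression that two yearly series have a "similar shape" can be put into a number with a correlation coefficient. The sketch below uses made-up yearly averages, not the values in merged_avg_data_clean, and it is only meant to show the calculation.

```r
# Toy yearly averages (made-up numbers, not the Harvard Forest data)
yearly <- data.frame(
  year = 2012:2017,
  average_sugar = c(2.4, 2.9, 2.5, 2.6, 2.3, 3.0),
  average_seeds = c(5, 60, 8, 12, 4, 75)
)

# Pearson correlation: strength of the linear relationship
r_pearson <- cor(yearly$average_sugar, yearly$average_seeds)

# Spearman correlation: based on ranks, less sensitive to extreme mast years
r_spearman <- cor(yearly$average_sugar, yearly$average_seeds,
                  method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

With only a handful of years, any such coefficient is very uncertain, so it works best as a summary of the plots rather than as evidence on its own.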
Building a model with our data
Below we use the masting column we added earlier to attempt to
predict sugar maple masting years from the sugar content of the trees’
sap.
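# Note: no tree filter is applied here, so despite the _HF name the model
# is fit on every tree in the sap table.
hf285_02_maple_sap_HF <- hf285_02_maple_sap_masting %>%
mutate(masting_binary = ifelse(masting == "Yes", 1, 0))
# Linear model of the 0/1 masting indicator on sap sugar content
lm_model <- lm(masting_binary ~ sugar, data = hf285_02_maple_sap_HF)
summary(lm_model)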
Call:
lm(formula = masting_binary ~ sugar, data = hf285_02_maple_sap_HF)
Residuals:
Min 1Q Median 3Q Max
-1.1457 -0.3395 -0.3146 0.6480 0.7228
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.231458 0.020481 11.301 < 2e-16 ***
sugar 0.041555 0.008005 5.191 2.14e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.471 on 8011 degrees of freedom
(1009 observations deleted due to missingness)
Multiple R-squared: 0.003352, Adjusted R-squared: 0.003228
F-statistic: 26.95 on 1 and 8011 DF, p-value: 2.143e-07
Here we break down our model and look at its key values. Right off
the bat we can see the sugar coefficient of 0.041555, so for every
1-unit increase in sugar the predicted probability of a masting year
rises by about 4.2 percentage points. Next, our RSE of 0.471 is quite
high for an outcome that only takes the values 0 and 1, which shows
that most of the variability remains unexplained. Our R-squared value
of just 0.003352 shows that sugar alone is not a strong predictor: it
explains about 0.3% of the variance. Lastly, our p-value is an
incredibly small 2.143e-07. So while sugar isn’t a great solo
predictor, its relationship with masting is statistically significant.
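Because the outcome here is a 0/1 indicator, a straight-line fit can produce predicted probabilities below 0 or above 1. A common alternative is logistic regression. The sketch below is not the method used in this notebook. It swaps in glm() with a binomial family and runs on simulated data, so the variable values and the relationship between sugar and masting are assumptions for illustration only.

```r
set.seed(1)

# Simulated toy data (not the Harvard Forest measurements)
n <- 200
sugar <- runif(n, min = 1, max = 4)
# Made-up relationship, used only to generate example outcomes
p_true <- plogis(-2 + 0.6 * sugar)
toy <- data.frame(
  sugar = sugar,
  masting_binary = rbinom(n, size = 1, prob = p_true)
)

# Logistic regression: models the log-odds of masting as linear in sugar
logit_model <- glm(masting_binary ~ sugar, data = toy, family = binomial)
summary(logit_model)

# Predicted probabilities always fall between 0 and 1
new_sugar <- data.frame(sugar = c(1, 2.5, 4))
predicted <- predict(logit_model, newdata = new_sugar, type = "response")
predicted
```

The coefficients of a logistic model are on the log-odds scale, so they are not read as percentage-point changes the way the lm() coefficient is.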
With sugar content helping us work toward a stronger model, we can
look at other possible predictors we might have overlooked. Below I
take the hf285_03_maple_flower_qual data set and assign numeric values
to flowering.intensity based on the ranges provided by Harvard Forest,
which can be found here https://harvardforest1.fas.harvard.edu/exist/apps/datasets/showData.html?id=hf285.
The documentation for hf285_03_maple_flower_qual gives these ranges:
“flowering.intensity: qualitative evaluation of whole-tree flowering
low: generally <1,000 flowering buds medium: generally 1,000-10,000
flowering buds high: generally >10,000 flowering buds none: no
flowering buds”
hf285_03_maple_flower_qual_edit <- hf285_03_maple_flower_qual %>%
mutate(flowering_value = case_when(
flowering.intensity == "none" ~ 0,
flowering.intensity == "low" ~ 999,
flowering.intensity == "medium" ~ 5000,
flowering.intensity == "high" ~ 10000,
TRUE ~ NA_real_
))
head(hf285_03_maple_flower_qual_edit)
Assigning a single number to each group is admittedly overreaching,
but the data set itself gives us very little to work with. Putting
numbers to the ranges that Harvard Forest provides is the best we can
do to get a better idea of our predictors.
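One way to avoid inventing bud counts is to keep only the order of the categories. The sketch below is an alternative encoding, not the one used in this notebook: it turns the four labels into ranks 0 to 3 with an ordered factor. The example vector is made up, and the name flowering_rank is hypothetical.

```r
# Made-up example labels, including a missing value
intensity <- c("none", "low", "high", "medium", "low", NA)

# Ordered factor: records that none < low < medium < high
intensity_ord <- factor(
  intensity,
  levels = c("none", "low", "medium", "high"),
  ordered = TRUE
)

# Integer ranks starting at 0; NA stays NA
flowering_rank <- as.integer(intensity_ord) - 1L
flowering_rank
# 0 1 3 2 1 NA
```

Ranks still assume the steps between categories are equally spaced, so this trades one assumption for another rather than removing it.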
Now that we have values for our flowering intensity, we can graph
them and compare the result to our heat map of masting years.
average_flowering_data <- hf285_03_maple_flower_qual_edit %>%
group_by(year) %>%
summarise(average_flowering = mean(flowering_value, na.rm = TRUE))
ggplot(average_flowering_data, aes(x = factor(year), y = average_flowering)) +
geom_point(color = "blue", size = 3) +
labs(
title = "Average Flowering Intensity for the Forest Over Time",
x = "Year",
y = "Average Flowering Intensity"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Here we can see a strong pattern: there are large spikes in our
masting years 2011, 2013, 2017, and 2019. This is important because it
closely matches our heat map of masting years. But just looking
similar isn’t enough, so we test it with a linear model to see whether
the number of flowers is a good predictor.
Below we test whether the number of flowers is an accurate predictor
of whether sugar maple trees will have a masting year, first on its own
and then combined with the sugar content from earlier.
average_flowering_data_masting <- hf285_03_maple_flower_qual_edit %>%
mutate(
masting = ifelse(year %in% c(2011, 2013, 2017, 2019), "Yes", "No"),
masting_binary = ifelse(masting == "Yes", 1, 0)
)
lm_model_flowering <- lm(masting_binary ~ flowering_value, data = average_flowering_data_masting)
summary(lm_model_flowering)
Call:
lm(formula = masting_binary ~ flowering_value, data = average_flowering_data_masting)
Residuals:
Min 1Q Median 3Q Max
-0.62634 -0.10057 -0.04222 0.37366 0.95778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.222e-02 3.625e-02 1.165 0.245
flowering_value 5.841e-05 5.669e-06 10.304 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3843 on 230 degrees of freedom
Multiple R-squared: 0.3158, Adjusted R-squared: 0.3129
F-statistic: 106.2 on 1 and 230 DF, p-value: < 2.2e-16
First off, the flowering_value coefficient has a p-value below 2e-16,
so flowering is statistically significant. Second, the coefficient
itself shows that each additional flower increases the predicted
probability of a masting year by 5.841e-05. On the surface that
doesn’t seem like much, but because these blossoms can range from a
few hundred to well over ten thousand per tree, it adds up. When we
look at the RSE, 0.384 is an improvement over the sugar model’s 0.471,
which shows us this model is a better fit. Our R-squared of 0.316
tells us that about 32% of the variance in the masting binary outcome
is explained by flowering_value. Lastly, the overall p-value
(< 2.2e-16) shows that the model is statistically significant.
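To see how a per-flower coefficient "adds up", the fitted value can be worked out by hand for each of the four flowering_value codes. The sketch below only does arithmetic with the rounded intercept and slope printed in the summary above, so results can differ from R's own fitted values in the last decimal place.

```r
# Rounded estimates copied from the printed model summary
intercept <- 4.222e-02
slope <- 5.841e-05

# The four numeric codes assigned to flowering.intensity
flowering_value <- c(none = 0, low = 999, medium = 5000, high = 10000)

# Fitted value of the 0/1 masting indicator for each category
fitted_p <- intercept + slope * flowering_value
round(fitted_p, 3)
# none 0.042, low 0.101, medium 0.334, high 0.626
```

Moving from "none" to "high" therefore raises the fitted value by about 0.58, which is the full effect of the 10,000-unit difference in the assigned codes.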
view(hf285_02_maple_sap_masting)
combined_data <- left_join(
average_flowering_data_masting,
hf285_02_maple_sap_masting %>%
group_by(year) %>%
summarise(average_sugar = mean(sugar, na.rm = TRUE)),
by = "year"
)
view(average_flowering_data_masting)
combined_lm <- lm(masting_binary ~ flowering_value + average_sugar, data = combined_data)
summary(combined_lm)
Call:
lm(formula = masting_binary ~ flowering_value + average_sugar,
data = combined_data)
Residuals:
Min 1Q Median 3Q Max
-0.66676 -0.22594 -0.03772 0.34683 0.83780
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.725e-01 3.400e-01 -2.860 0.00465 **
flowering_value 4.897e-05 5.899e-06 8.302 1.09e-14 ***
average_sugar 4.202e-01 1.393e-01 3.017 0.00285 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3765 on 217 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared: 0.2951, Adjusted R-squared: 0.2886
F-statistic: 45.42 on 2 and 217 DF, p-value: < 2.2e-16
First off, the flowering_value p-value (1.09e-14) is incredibly
small, so flowering is highly statistically significant. Second, its
coefficient shows that each additional flower increases the predicted
probability of a masting year, but only by 4.897e-05. On the surface
that doesn’t seem like much, but because these blossoms can range from
a few hundred to well over ten thousand per tree, it adds up. Our
average sugar has changed too: its coefficient is now 0.420 with a
p-value of 0.00285, making it a statistically significant predictor of
masting. When we look at the RSE, 0.3765 is an improvement over the
sugar model (0.471) and slightly lower than the flowering-only model
(0.3843). Our R-squared of 0.295 shows that about 30% of the variance
in the masting binary outcome is explained by flowering_value and
average_sugar together. This is slightly below the flowering-only
R-squared of 0.316, although the two models are not fit on the same
rows because 12 observations were dropped for missing sugar values.
Lastly, the overall p-value (< 2.2e-16) shows that this model is
highly statistically significant. Even so, explaining only about 30%
of the variance means the model needs more data to better predict
masting seasons.
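When one model drops rows for missing values and another does not, their R-squared and RSE values are computed on different data and are hard to compare directly. A minimal sketch of a like-for-like comparison is to refit both models on the same complete rows. The data below is simulated, and the column names simply mirror the ones used above, so none of the numbers reflect the Harvard Forest results.

```r
set.seed(42)

# Simulated toy data (not the Harvard Forest measurements)
n <- 120
toy <- data.frame(
  flowering_value = sample(c(0, 999, 5000, 10000), n, replace = TRUE),
  average_sugar = rnorm(n, mean = 2.5, sd = 0.3)
)
# Made-up relationship, used only to generate example outcomes
toy$masting_binary <- rbinom(n, 1, plogis(-2 + 0.0003 * toy$flowering_value))
# Knock out some sugar values to mimic missing data
toy$average_sugar[sample(n, 10)] <- NA

# Keep only rows where every model variable is present
complete <- toy[complete.cases(toy), ]

# Fit both models on exactly the same rows
m_flower <- lm(masting_binary ~ flowering_value, data = complete)
m_both <- lm(masting_binary ~ flowering_value + average_sugar, data = complete)

# Adjusted R-squared penalizes the extra predictor
c(flower = summary(m_flower)$adj.r.squared,
  both = summary(m_both)$adj.r.squared)

# F-test for whether adding average_sugar improves the fit
anova(m_flower, m_both)
```

Because both fits use the same observations, any difference in adjusted R-squared comes from the added predictor and not from a change in the sample.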
In conclusion
Overall, while there is not enough data to accurately state whether
the red maple species exhibits muted dynamics, I believe that the two
best key predictors in this limited data set are sap sugar content and
the number of flowers in bloom. If the data were more complete, I
believe these two predictors could be used to better see whether red
maple trees exhibit muted dynamics when compared to sugar maple trees.
With the current data set limitations, the data is inconclusive.
