This report seeks to answer the following question:
Does the non-masting red maple species exhibit muted dynamics compared to the masting sugar maple species? (In botany, masting is when trees of the same species periodically produce an overabundance of seeds in certain years, which helps the species survive. Mast can be the fruits, seeds, or nuts of trees. In this report, dynamics means how alive and healthy a tree is: how many leaves and flowers and how much sap it produces, and how big the tree and its nuts and branches are. Muted dynamics are the opposite: signs that a tree is less vigorous and less healthy.)
We will be using a data set called
maple_tap_w_sap_collection_data_full which I created from
the original maple_tapping_data and
sap_collection_data. The two original data sets were
obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4.
The maple_tap_w_sap_collection_data_full data set contains data
about the two types of maple trees studied in the Harvard Forest
from 2012 to 2018 and the characteristics of those trees. There are 10
variables in total, but the most important ones for this
analysis are: tree species (ACSA =
sugar maple and ACRU = red maple), sugar concentration
(measured in Brix (weight percent) with a Misco
digital refractometer, from sap collected directly from the tap),
sap weight (the weight of the sap when collected),
tree thickness (the diameter of the tree, in
centimeters, at 1.4 m above ground), and
tree identification number (how the tree is identified;
the prefix is either HF (for sugar maples) or HFR (for red maples)). To see
all of the original data sets, including the ones I did not use, go to
the link above. Showing them all here would take up too much space.
The full data set I created can be viewed below:
Throughout, we will need the functionality of the tidyverse package, mainly to create visualizations and transformations. The modelr package helps with the regression modeling. The DT package creates nice-looking data tables. Finally, the readr package imports the two original csv files; it is already attached by tidyverse, so loading it separately is optional.
library(tidyverse)
library(modelr)
library(DT)
library(readr)
Before I try to answer the main question of this
report, I want to talk a little bit about the cleaning that I
did to get my data set,
maple_tap_w_sap_collection_data_full. The first thing I did
was change any columns that had the wrong data type. Then, I removed any
columns that I felt were not needed for this analysis. Next, I made
sure that the prefixes in the
tree identification number column were limited to the same two:
HF or HFR, one for each of the two tree species. After
that, I found what I believe to be a data entry error in the
sugar concentration column: one row had
a value that was much higher than the rest, so I changed it to what I
believed to be the correct entry. After that, I combined the two data
sets into one. I finished by giving better names to any columns I
thought needed them.
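As a rough sketch of the cleaning steps described above, the code below works on two tiny made-up tables. The short column names (tree, species, dbh, sugar, sap_wt) and the rule used to correct the suspicious sugar value are assumptions for illustration only, not the names or the correction used on the real files.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the two source tables; all column names are hypothetical.
tapping <- tibble(
  tree = c("HF1", "hf2", "HFR 1"),
  species = c("ACSA", "ACSA", "ACRU"),
  dbh = c("61.2", "70.4", "44.0")   # numbers stored as text by mistake
)
sap <- tibble(
  tree = c("HF1", "HF2", "HFR1", "HFR1"),
  sugar = c(2.6, 25, 1.8, 1.9),     # 25 looks like a misplaced decimal
  sap_wt = c(3.1, 4.0, 1.7, 2.0)
)

tapping_clean <- tapping %>%
  mutate(
    dbh = as.numeric(dbh),                             # fix the column type
    tree = str_to_upper(str_remove_all(tree, "\\s"))   # unify the ID prefixes
  )

# Assumed correction rule: values above 10 are off by a factor of ten.
sap_clean <- sap %>%
  mutate(sugar = if_else(sugar > 10, sugar / 10, sugar))

# Combine the two tables and give the columns more readable names.
combined <- sap_clean %>%
  left_join(tapping_clean, by = "tree") %>%
  rename(`tree identification number` = tree,
         `tree species` = species,
         `sugar concentration` = sugar,
         `sap weight` = sap_wt,
         `tree thickness` = dbh)
combined
```

Doing the ID cleanup before the join matters here: without it, "hf2" and "HFR 1" would not match any row in the sap table.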
For the visualizations, I thought the best way to tackle the main
problem of this report was to look at the variables that I think
best showcase differences in
tree species/tree identification number,
first individually and then together.
The question we are investigating deals with the relationship between
the species of a tree and the
sugar concentration of the tree's sap. We could suggest
that the species of a tree can affect how much
sugar concentration its sap has. What we want to do here is
make a simple visualization that can show whether there is in fact any
association between these two variables. I hypothesize that this
visualization will show a difference in sugar concentration between
the two species. We can test this with a box plot:
ggplot(maple_tap_w_sap_collection_data_full) +
geom_boxplot(mapping = aes(`tree species`, `sugar concentration`, color = `tree species`)) +
labs(x = "tree species (ACSA = sugar maple and ACRU = red maple)",
y = "concentration of sugar (Brix)",
color = "tree species",
title = "Sugar Concentration by Tree Species",
caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")
My hypothesis was correct. By looking at the box plot, we can see
that there is in fact a difference in the
sugar concentration of a tree's sap based on its
species. This graph shows us that ACSAs (sugar maples) have a higher
concentration of sugar in their sap than their ACRU (red maple)
counterparts. Specifically, ACSAs have a median
sugar concentration of about 2.5 Brix and ACRUs have a median of about
1.8 Brix.
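Medians read off a box plot are only approximate, so it can help to compute them directly. The sketch below does this on a small made-up table that borrows the report's column names; the values are invented, and the same pattern would apply to sap weight and tree thickness.

```r
library(dplyr)

# Toy data standing in for maple_tap_w_sap_collection_data_full.
toy <- tibble(
  `tree species` = c("ACSA", "ACSA", "ACSA", "ACRU", "ACRU", "ACRU"),
  `sugar concentration` = c(2.4, 2.6, NA, 1.7, 1.9, 2.0)
)

# One row per species: non-missing count, median, and interquartile range.
species_summary <- toy %>%
  group_by(`tree species`) %>%
  summarize(
    n = sum(!is.na(`sugar concentration`)),
    median = median(`sugar concentration`, na.rm = TRUE),
    iqr = IQR(`sugar concentration`, na.rm = TRUE)
  )
species_summary
```

Reporting the count next to the median also makes it clear how many measurements each box in the plot is based on.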
Next, we examine whether there is another relationship with
the species of a tree, this time involving the weight of the sap.
We can propose that the species of a tree
affects how much the collected sap weighs. To see if this is true, we can
create a quick visualization that shows the connection between the
species of a tree and its sap weight. I theorize that the
visualization will show a difference between the two species in how much their
sap weighs. To test my theory,
we can create a box plot:
ggplot(maple_tap_w_sap_collection_data_full) +
geom_boxplot(aes(`tree species`, `sap weight`, color = `tree species`)) +
labs(x = "tree species (ACSA = sugar maple and ACRU = red maple)",
y = "weight of sap (kilograms)",
color = "tree species",
title = "Sap Weight by Tree Species",
caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")
My theory was correct. If we look at the box plot above, we can see
that there is indeed a difference in how heavy a tree's sap is based on
its species. This graph shows us that ACRUs
(red maples) have a sap weight that is lighter than that of
the ACSAs (sugar maples). To be more specific, ACRUs have a median
sap weight of about 1.8 kg, while the ACSAs' median
sap weight is about 3.2 kg.
Finally, the last thing we want to analyze in this section is whether there is a relationship between the species of a tree and the thickness of the tree. We could propose that the species of a tree influences how thick it gets. To investigate this, we want to create a basic visualization that shows the link between a tree's species and its thickness. I speculate that it will display a difference in thickness between the two species.
We also want to ask whether the three visualizations I created are enough to answer the main question. Overall, I do not think they are. This is a very complex problem, and it will take more than some basic visualizations to answer it. We can make a box plot to see if I am correct:
ggplot(maple_tap_w_sap_collection_data_full) +
geom_boxplot(aes(`tree species`, `tree thickness`, color = `tree species`)) +
labs(x = "tree species (ACSA = sugar maple and ACRU = red maple)",
y = "thickness of tree (centimeters)",
color = "tree species",
title = "Tree Thickness by Tree Species",
caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")
My first idea, that the two tree species have
different tree thicknesses, was true. The box plot shows that there is a
large difference in how thick a tree is based on its
tree species. ACSAs (sugar
maples) are thicker than trees of the ACRU (red maple) variety.
In general, ACSAs have a diameter of about 66 centimeters compared to
about 43 centimeters for ACRUs. Taken together, these visualizations
suggest that ACSAs are healthier than ACRUs.
My second idea, that the three visualizations are not enough, is
also true. The visualizations do not give me enough information to
adequately answer the overall question. The
main reason is that box plots are too simple a
tool to answer it on their own. Yes, they do show that sugar maples
look like healthier trees because they have higher medians in all three
variables we looked at. But they do not take into account how
these variables affect each other, or how tree species
affects all of them together. This is where we need to look at how
they all work in tandem. This will help us get to the bottom of
whether non-masting red maple trees really do exhibit muted dynamics
compared to masting sugar maple trees.
The problem that we are trying to solve here is whether individual trees
from the sugar maple group, identified by their tree identification
numbers, have better averages for the three main variables we talked
about earlier than the trees from the red maple group. Those variables
are sugar concentration, sap weight, and
tree thickness. We can hypothesize that the averages will be
greater for the HF trees (sugar maples) than for the HFR trees (red maples), and I
believe that the differences will be fairly large. The reason I believe this is
that, in this data set, there are far more data points for HF than
there are for HFR. This means the HF averages are not affected by small
outliers as much as the HFR averages are, because they are safeguarded by the sheer
number of data points. We can test this hypothesis by creating a new
data set from maple_tap_w_sap_collection_data_full:
average_maple_tap_w_sap_collection_data_full <- maple_tap_w_sap_collection_data_full %>%
group_by(`tree identification number`) %>%
summarize(`average sugar concentration` = mean(`sugar concentration`, na.rm = TRUE),
`average sap weight` = mean(`sap weight`, na.rm = TRUE),
`average tree thickness` = mean(`tree thickness`, na.rm = TRUE)) %>%
# convert = TRUE makes the numeric part an integer, so arrange() sorts 2 before 10
separate(`tree identification number`, into = c("tree identification letter", "tree identification number"), sep = "(?<=[A-Za-z])(?=[0-9])", convert = TRUE) %>%
arrange(`tree identification number`)
datatable(`average_maple_tap_w_sap_collection_data_full`, options = list(scrollX = TRUE))
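The regular expression passed to separate() splits each ID at the point where a letter is followed by a digit, without consuming any characters. Here is a small sketch of that behavior on made-up IDs. The column names prefix and number are hypothetical, and convert = TRUE is used so that the numeric part sorts as a number rather than as text.

```r
library(dplyr)
library(tidyr)

# Made-up IDs in the same HF / HFR format as the report.
ids <- tibble(id = c("HF10", "HFR5", "HF2", "HFR10"))

split_ids <- ids %>%
  separate(id, into = c("prefix", "number"),
           sep = "(?<=[A-Za-z])(?=[0-9])",  # zero-width split: letter then digit
           convert = TRUE) %>%              # "number" becomes an integer column
  arrange(number, prefix)
split_ids
```

Without convert = TRUE the number column stays character, and sorting it would place "10" before "2".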
I was correct with my hypothesis. Almost every HF (sugar maple) has a
bigger average sugar concentration,
sap weight, and tree thickness than any HFR
(red maple). There were some HFRs that had bigger averages than a couple of
HFs, but overall the HFs had the highest averages. For
example, among the comparable trees, the ones that share the same
tree identification number, the HF tree had the higher
averages. There was one exception to this, and that was HF5 and
HFR5. HFR5 had a higher tree thickness than HF5, while the
other two variables were higher for HF5. This shows that HFs are, on
average, healthier trees than HFRs. In addition, Anna
Hess and Mark Hamilton discuss a New England study which found that
the range of sap sugar content in maple trees is 1.8% to 8.4%. This
is for a single tree. Another source, from Lynda Lancaster of
the National Park Service of Indiana, says that 2-3% is the sugar content
of an average maple tree. From my table we can see that all but 1
of the 20 sugar maple trees are in that 2-3% range, while only 2 of the
10 red maple trees have averages in that range. This
shows that sugar maple trees have better
sugar concentration than their red maple
counterparts.
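Counting how many trees fall inside a reference range can be done in code instead of by reading the table. The sketch below assumes a small made-up table of per-tree averages; the column names prefix and avg_sugar are hypothetical, and the bounds of 2 and 3 come from the 2-3% figure quoted above.

```r
library(dplyr)

# Hypothetical per-tree averages; the real values come from the table above.
tree_means <- tibble(
  prefix = c("HF", "HF", "HF", "HFR", "HFR"),
  avg_sugar = c(2.4, 2.7, 3.2, 1.8, 2.1)
)

# For each group, count the trees whose average lies between 2 and 3 (inclusive).
in_range <- tree_means %>%
  group_by(prefix) %>%
  summarize(
    trees = n(),
    within_2_to_3 = sum(between(avg_sugar, 2, 3)),
    share = within_2_to_3 / trees
  )
in_range
```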
For the regression modeling, I thought it would be most beneficial,
and appropriate, to fit three individual simple regression models for
the three main variables that I have been using so far:
sugar concentration, sap weight, and
tree thickness. These show how tree species
affects each of them individually. Then, after the simple regression models,
I make a multiple regression model using all three of the variables to
see how tree species affects them together.
maple_tap_and_sap_simple_sugar_model <- lm(`sugar concentration` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
summary(maple_tap_and_sap_simple_sugar_model)
##
## Call:
## lm(formula = `sugar concentration` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7656 -0.3656 -0.0656 0.3344 4.7344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.83537 0.02102 87.31 <2e-16 ***
## `tree species`ACSA 0.73019 0.02258 32.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5893 on 5890 degrees of freedom
## (953 observations deleted due to missingness)
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1506
## F-statistic: 1046 on 1 and 5890 DF, p-value: < 2.2e-16
To start things off, we want to see what the regression
coefficient and the p-value of tree species tell us about
the sugar concentration of each of the two species
of trees. We can find them by looking at the summary
statistics for the simple sugar concentration regression
model.
The regression coefficient of tree species tells me that
when the predictor variable changes from ACRU (red maple) to ACSA (sugar
maple), the sugar concentration, the response variable, is
higher by about 0.73019 Brix. The null
hypothesis is that tree species has no effect on
sugar concentration. The p-value for
tree species is reported as less than 2e-16 (0.0000000000000002), which is far below the
0.05 threshold. This means that we can reject the null hypothesis.
In other words, tree species does have a
significant effect on sugar concentration, and that
difference seems to be 0.73019 Brix.
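With a single two-level predictor, the intercept of lm() is the mean of the reference group (ACRU, the first level alphabetically) and the species coefficient is the difference between the two group means. A minimal sketch on invented numbers, with hypothetical column names species and sugar, shows this:

```r
# Toy data; values are made up for illustration.
toy <- data.frame(
  species = factor(c("ACRU", "ACRU", "ACRU", "ACSA", "ACSA", "ACSA")),
  sugar = c(1.7, 1.8, 2.0, 2.4, 2.6, 2.8)
)

fit <- lm(sugar ~ species, data = toy)
group_means <- tapply(toy$sugar, toy$species, mean)

coef(fit)       # intercept = ACRU mean; speciesACSA = ACSA mean - ACRU mean
group_means
confint(fit)    # 95% confidence interval for each coefficient
```

The confidence interval from confint() gives a range for the species difference, which is often more informative than the p-value alone.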
The question we want to answer here is whether the conclusion from the
previous question holds up based on the goodness-of-fit statistics.
We can test this by looking at the RSE and R^2 values in the
summary statistics of the simple sugar concentration
regression model.
No, I should not fully believe the conclusion from the previous part. The main reason is the two R^2 values. The multiple R^2 value is 0.1508 and the adjusted R^2 value is 0.1506. These values are far from 1: tree species alone explains only about 15% of the variation in sugar concentration, so this regression model is not very good to use for statistical analysis. The RSE value is on the low side, at only 0.5893 Brix, but because the R^2 values are so low, they weigh more heavily in judging whether the model is good or not.
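Both statistics discussed above come from the residuals: R^2 is one minus the ratio of the residual sum of squares to the total sum of squares, and the RSE is the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below recomputes them by hand on simulated data; the group means and noise level are assumptions chosen only to resemble the report loosely.

```r
set.seed(1)
# Simulated data: two species with different means plus noise.
toy <- data.frame(species = rep(c("ACRU", "ACSA"), each = 50))
toy$sugar <- ifelse(toy$species == "ACSA", 2.5, 1.8) + rnorm(100, sd = 0.6)

fit <- lm(sugar ~ species, data = toy)

rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((toy$sugar - mean(toy$sugar))^2)   # total sum of squares

r_squared <- 1 - rss / tss                    # share of variation explained
rse <- sqrt(rss / df.residual(fit))           # typical size of a residual

c(r_squared = r_squared, rse = rse)
```

Note that the RSE is in the units of the response, so RSE values from models with different response variables are not directly comparable.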
The problem we want to address here is how the residuals of the model are distributed and what this means for the model. To do this, we need to create a visualization that shows the spread of the residuals. We can solve this problem by creating a histogram of the model's residuals:
maple_tap_w_sap_collection_data_full_w_sugar_resid <- maple_tap_w_sap_collection_data_full %>%
add_residuals(maple_tap_and_sap_simple_sugar_model)
ggplot(maple_tap_w_sap_collection_data_full_w_sugar_resid) +
geom_histogram(aes(resid)) +
labs(x = "residual",
y = "total amount of data points",
title = "Residual Sugar Concentration Regression Model",
caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")
The residual distribution of this model is roughly
normal because it has a peak very close to 0, but it
is somewhat right skewed because of how far it trails off to the
right. What this distribution means is that the model
under-predicts some sugar concentration values, the ones in the long
right tail, by a wide margin. This suggests that there is somewhat of a
bias and that there could be a pattern in the data set that is not being
detected.
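Skewness can also be checked numerically rather than judged by eye. The sketch below uses simulated values as stand-ins for residuals (an assumption, since the real residuals are not reproduced here), computes a simple moment-based skewness, and draws a normal Q-Q plot.

```r
set.seed(42)
# Simulated right-skewed "residuals", shifted so that their mean is zero.
resid_toy <- rexp(500, rate = 2)
resid_toy <- resid_toy - mean(resid_toy)

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- mean(resid_toy^3) / (mean(resid_toy^2))^1.5
skewness

# With a long right tail, the median sits below the mean (zero here).
median(resid_toy)

# Points bending away from the line at the upper end also signal right skew.
qqnorm(resid_toy)
qqline(resid_toy)
```

Because least-squares residuals average to zero when the model has an intercept, a right skew means a few large under-predictions balanced by many small over-predictions, not under-prediction across the board.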
The final question we want to answer in this section is what we can
conclude from the simple sugar concentration model. To
answer this, we need to look at what the model says, at face value,
about whether non-masting red maples exhibit muted dynamics compared to
masting sugar maples. To do this, we can look at the regression
coefficient and p-value of the model.
What I can conclude from the simple sugar concentration
regression model is that there is in fact an association between the
sugar concentration in a tree's sap and the species
of that tree. The simple regression model tells me that the
tree species variable does have a significant effect on the
sugar concentration variable. This means that sugar maples
(ACSA) have higher sugar concentrations than red maples (ACRU) by about
0.73019 Brix.
maple_tap_and_sap_simple_weight_model <- lm(`sap weight` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
summary(maple_tap_and_sap_simple_weight_model)
##
## Call:
## lm(formula = `sap weight` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0855 -2.1355 -0.5755 1.5445 14.6445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.89603 0.09344 20.29 <2e-16 ***
## `tree species`ACSA 2.19949 0.10187 21.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.925 on 6169 degrees of freedom
## (674 observations deleted due to missingness)
## Multiple R-squared: 0.07025, Adjusted R-squared: 0.0701
## F-statistic: 466.1 on 1 and 6169 DF, p-value: < 2.2e-16
For this section, what do the p-value and the regression
coefficient of tree species say about how much the sap
weighs for the two species of trees we are comparing? To find
out, we can look at the summary statistics of the simple
sap weight regression model.
The tree species regression coefficient is 2.19949. This
means that when the predictor variable changes from ACRU (red
maple) to ACSA (sugar maple), the sap weight, the response
variable, is higher by about 2.19949 kilograms. The
null hypothesis is that tree species has no effect on
sap weight. The p-value for this simple model is reported as
less than 2e-16 (0.0000000000000002). With the p-value being far below the 0.05 threshold,
we can conclude that tree species does have a substantial effect
on sap weight, with that difference looking to be 2.19949
kilograms.
Next, we want to see if the conclusion from the previous problem
holds up according to the goodness-of-fit statistics. The best way to test
this is by looking at the summary statistics of
the simple sap weight regression model, specifically
the R^2 and RSE values.
The previous part's conclusion should not be fully believed. The first reason is that this simple model has a fairly high RSE value of 2.925 kilograms, where we would want a low one. The main reason we should not trust this model's conclusion is that it has very low R^2 values. The multiple R^2 is 0.07025 and the adjusted R^2 is 0.0701, so tree species explains only about 7% of the variation in sap weight. With the R^2 values being so far from 1, and the RSE being on the high side, this model is not great to use for statistical analysis.
The second-to-last thing we need to do is look at the model's residuals. Specifically, we want to see how they are distributed and what that means for the model. We can do this by making a visualization that displays the dispersion of the residuals. A histogram is the best tool for this kind of visualization:
maple_tap_w_sap_collection_data_full_w_weight_resid <- maple_tap_w_sap_collection_data_full %>%
add_residuals(maple_tap_and_sap_simple_weight_model)
ggplot(maple_tap_w_sap_collection_data_full_w_weight_resid) +
geom_histogram(aes(resid)) +
labs(x = "residual",
y = "total amount of data points",
title = "Residual Sap Weight Regression Model",
caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")
This model's residual distribution is only somewhat normal.
The left side of the peak stops abruptly just
above the 400 mark before ever getting to the bottom, while the right side
goes all the way down to the bottom. The peak is somewhat close to 0,
but the distribution is very much right skewed because it trails off far to the
right. This residual distribution shows us that the model
under-predicts some sap weight values, the ones in the long right tail,
by a wide margin. This shows that there is a bias in the
model and that there could very well be a pattern that is not being
detected.
Lastly, we need to see what we can infer from the simple
sap weight model. We want to see specifically what the
model says at face value about the main question we are trying to
answer. The regression coefficient and the p-value of the simple model
will help us answer this.
The simple sap weight regression model lets me conclude
that there is in fact a relationship between the species of a tree and
how heavy the tree's sap is. It shows me that the
tree species variable does have an important effect on the
sap weight variable. This means that sugar
maples (ACSA) have heavier sap weights than red maples (ACRU) by about
2.19949 kilograms.
maple_tap_and_sap_simple_thickness_model <- lm(`tree thickness` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
summary(maple_tap_and_sap_simple_thickness_model)
##
## Call:
## lm(formula = `tree thickness` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.403 -8.648 -2.103 10.697 21.017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.583 2.405 19.373 < 2e-16 ***
## `tree species`ACSA 20.120 2.673 7.528 9.19e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.78 on 124 degrees of freedom
## (6719 observations deleted due to missingness)
## Multiple R-squared: 0.3137, Adjusted R-squared: 0.3081
## F-statistic: 56.67 on 1 and 124 DF, p-value: 9.191e-12
We want to look at the regression coefficient and the p-value of
tree species to see what they show about how thick a tree is
according to its tree species. The summary statistics
of the simple tree thickness regression model can show us
the relationship between the two.
The regression coefficient shows that when tree species,
the predictor variable, goes from ACRU (red maple) to ACSA (sugar
maple), the tree thickness, the response variable, is
higher by about 20.120 centimeters. The null hypothesis
is that tree species has no effect on
tree thickness. The p-value is 9.19e-12 (0.00000000000919), which is
far below the 0.05 threshold. This means that the species of the tree does
have a major effect on tree thickness, with the difference
between the two species being 20.120 centimeters.
In order to move on, we need to see if the conclusion from the
previous question is satisfactory based on the goodness-of-fit
statistics. To see if this is true, we can look at the simple
tree thickness regression model's summary statistics to see
what the R^2 and RSE values are.
For the conclusion in the previous section, I think that we should probably not fully believe it. The first reason is that the simple model has a considerably high RSE value of 11.78 centimeters. With RSE values, you want them to be as low as possible, so in the decimals is best. The main reason is the R^2 values. They are better than those of the other two simple models, at 0.3137 for the multiple R^2 and 0.3081 for the adjusted R^2, but they are still too far from 1 to be as good as they could be. I still think that this simple model, based on its goodness-of-fit statistics, should probably not be used for statistical analysis.
Next, we want to create a visualization of the model so we can look at how the model's residuals are distributed and what that means for the model. To do this, we can create a histogram of the model's residuals:
maple_tap_w_sap_collection_data_full_w_thickness_resid <- maple_tap_w_sap_collection_data_full %>%
add_residuals(maple_tap_and_sap_simple_thickness_model)
ggplot(maple_tap_w_sap_collection_data_full_w_thickness_resid) +
geom_histogram(aes(resid)) +
labs(x = "residual",
y = "total amount of data points",
title = "Residual Tree Thickness Regression Model",
caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")
This residual distribution looks roughly normally
distributed. But because it does not have many data points, the shape is
not very clear. The peak of the distribution is somewhat close to
0, only slightly to the left. The graph at least
does not look like it is skewed to either side. This distribution is
telling me that the model is able to somewhat predict the
tree thickness values compared to the actual values in the
data set.
The last thing we need to do before going on to the next section is
to look at the simple tree thickness model and see what
we can deduce from it. The model will give us a simple answer to
the main problem that we are trying to solve. To get it, we
need to look at the p-value and the regression coefficient of the
model.
After looking at the simple tree thickness regression
model, I can conclude that there is a connection between how thick a
tree is and the species of that tree. The model also demonstrates
that the tree species variable does have a substantial
effect on the tree thickness variable. Overall, this shows
that sugar maple trees (ACSA) are thicker than red maple trees (ACRU) by
about 20.120 centimeters.
Three questions need answers in this section.
From the three variables I have been using, which one is
most likely to be the response variable, which two are most
likely to be the confounding variables, and why?
The three main variables I have been using are
sugar concentration, sap weight, and
tree thickness.
I think that sugar concentration is the response
variable for the multiple regression model. The first reason I
believe this is that it had the best of the simple regression models I made
earlier. The sugar concentration simple
model had the lowest RSE value and the best residual distribution. It
did not have the best R^2 values, but taking everything into
consideration it is the most well-rounded simple model of
the three. The second reason I think it should be the response
variable is that I believe it is the best indication of how
healthy a maple tree is. If a maple tree produces more sap, then it will
not have muted dynamics and everything else about the tree will be
healthier. The first variable that I believe to be a confounding
variable is sap weight. The reason is that you would
assume that if sap has a higher weight, then it would also have a
higher sugar concentration. Also, the healthier the tree is,
the more sap it would be able to produce, resulting in a higher
weight. The second confounding variable, I believe, is tree thickness.
The reason is that the healthier the tree is, the bigger it
would be. This would also allow the tree to produce more sap, which
would mean it can most likely have a higher
sugar concentration.
maple_tap_w_sap_collection_data_full2 <- maple_tap_w_sap_collection_data_full %>%
select(-date, -`tree identification number`, -`which tap`, -time, -`tap cardinal direction`, -`tap height`)
entryNA <- which(is.na(maple_tap_w_sap_collection_data_full2$`sap weight`))
maple_tap_w_sap_collection_data_full2$`sap weight`[entryNA] <- mean(maple_tap_w_sap_collection_data_full2$`sap weight`, na.rm = TRUE)
entryNA2 <- which(is.na(maple_tap_w_sap_collection_data_full2$`tree thickness`))
maple_tap_w_sap_collection_data_full2$`tree thickness`[entryNA2] <- mean(maple_tap_w_sap_collection_data_full2$`tree thickness`, na.rm = TRUE)
maple_tap_and_sap_multiple_model <- lm(`sugar concentration` ~ `tree species` + `tree thickness` + `sap weight`, data = maple_tap_w_sap_collection_data_full2)
summary(maple_tap_and_sap_multiple_model)
##
## Call:
## lm(formula = `sugar concentration` ~ `tree species` + `tree thickness` +
## `sap weight`, data = maple_tap_w_sap_collection_data_full2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8275 -0.3973 -0.0674 0.3418 4.6705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.514103 0.662534 2.285 0.0223 *
## `tree species`ACSA 0.762622 0.023103 33.010 < 2e-16 ***
## `tree thickness` 0.005750 0.010532 0.546 0.5851
## `sap weight` -0.016863 0.002702 -6.241 4.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5875 on 5888 degrees of freedom
## (953 observations deleted due to missingness)
## Multiple R-squared: 0.1564, Adjusted R-squared: 0.156
## F-statistic: 363.8 on 3 and 5888 DF, p-value: < 2.2e-16
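The two imputation steps in the code above replace each missing sap weight and tree thickness with that column's mean. The same idea can be sketched more compactly with dplyr and tidyr on a tiny made-up table; the values here are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy table with missing values in both columns.
toy <- tibble(
  `sap weight` = c(2.0, NA, 4.0),
  `tree thickness` = c(NA, 60, NA)
)

# Replace each NA with the mean of the non-missing values in its own column.
imputed <- toy %>%
  mutate(across(c(`sap weight`, `tree thickness`),
                ~ replace_na(.x, mean(.x, na.rm = TRUE))))
imputed
```

One caveat with mean imputation: when most of a column is missing, the imputed column is mostly one repeated value. The simple tree thickness model dropped 6719 observations for missingness, so this may help explain why tree thickness has a non-significant coefficient in the multiple model.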
What we first want to find, in this section, is
the regression coefficient and the p-value of tree species
in the multiple regression model. Then we want to look at
how they compare to those of the simple sugar concentration
regression model. To see and compare the regression coefficient and
the p-value, we need to look at the summary statistics of the multiple
regression model.
The regression coefficient of tree species is 0.762622
in the multiple regression model. This means that when the
predictor variable changes from ACRU (red maple) to ACSA (sugar maple),
the sugar concentration, the response variable, is higher
by about 0.762622 Brix, holding tree thickness and sap weight constant.
Compared to the coefficient in the simple
sugar concentration model, it is slightly higher. The multiple
one is bigger by about 0.032432 (0.762622 - 0.73019). The null
hypothesis is that tree species has no effect on
sugar concentration. We can reject the null hypothesis,
because the p-value for tree species is reported as less than 2e-16 (0.0000000000000002),
which is far below the 0.05 threshold. This means that
tree species does have a significant effect on
sugar concentration and that difference seems to be
0.762622 Brix.
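Comparing the species coefficient between a simple and a multiple model can be done side by side in code. The sketch below uses simulated data with hypothetical columns species, weight, and sugar; the effect sizes are assumptions picked only to resemble the report loosely.

```r
set.seed(7)
n <- 200
# Simulated data: species shifts both sap weight and sugar concentration.
toy <- data.frame(species = factor(rep(c("ACRU", "ACSA"), each = n / 2)))
toy$weight <- 2 + 2 * (toy$species == "ACSA") + rnorm(n)
toy$sugar <- 1.8 + 0.7 * (toy$species == "ACSA") -
  0.02 * toy$weight + rnorm(n, sd = 0.5)

simple_fit <- lm(sugar ~ species, data = toy)
multiple_fit <- lm(sugar ~ species + weight, data = toy)

# Species coefficient before and after adjusting for sap weight.
comparison <- c(
  simple = unname(coef(simple_fit)["speciesACSA"]),
  multiple = unname(coef(multiple_fit)["speciesACSA"])
)
comparison

# F-test of whether the added predictor improves the fit.
model_test <- anova(simple_fit, multiple_fit)
model_test
```

A small change in the coefficient between the two fits suggests that the added predictors do little to confound the species effect.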
The next thing is to look at the goodness-of-fit statistics for the
multiple regression model to see what they mean and how they compare to
those of the simple sugar concentration regression model. To see the
statistics and compare them, we need to look at the multiple model's
summary statistics.
This multiple regression model has fairly poor goodness-of-fit
statistics, much like the sugar concentration simple
model. The RSE is 0.5875, which is low. What this
tells me is that, if the residuals are roughly normal, about 68% of them fall
between -0.5875 and 0.5875. The main thing that hurts this
model's credibility for statistical modeling is the
multiple R^2 being 0.1564 and the adjusted R^2 being 0.156. These are
nowhere near 1. What these two R^2 values tell me is that
the model explains only about 15% of the
variation in sugar concentration, and the rest is unexplained.
Overall, this model's goodness-of-fit statistics are about the
same as the simple regression model's. The multiple regression model
only has a slightly lower RSE and negligibly higher multiple and
adjusted R^2.
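The idea that roughly 68% of residuals lie within one RSE of zero holds only when the residuals are approximately normal, and it can be checked directly. A minimal sketch on simulated data follows; the columns x and y are hypothetical, and the noise is normal by construction.

```r
set.seed(3)
# Simulated data with normally distributed noise.
toy <- data.frame(x = rep(c(0, 1), each = 500))
toy$y <- 1.8 + 0.7 * toy$x + rnorm(1000, sd = 0.6)

fit <- lm(y ~ x, data = toy)
rse <- summary(fit)$sigma

# Share of residuals within one RSE of zero.
share_within_one_rse <- mean(abs(residuals(fit)) <= rse)
share_within_one_rse   # close to 0.68 when residuals are roughly normal
```

For skewed residuals like the ones in this report, the share can differ from 68%, so the interval is best read as a rough guide.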
Up next, we want to see what the multiple regression model's residual
distribution looks like. We also want to see how it compares to the
simple sugar concentration model's. What we want to
see, when comparing them, is whether there were any improvements
when we moved from the simple to the multiple model. To test this,
we need to create a histogram of the multiple model's residuals and compare it
to the simple one:
# Add a resid column (observed minus predicted) from the multiple model
maple_tap_w_sap_collection_data_full_w_multiple_resid <- maple_tap_w_sap_collection_data_full2 %>%
  add_residuals(maple_tap_and_sap_multiple_model)
# Histogram of the multiple model's residuals
ggplot(maple_tap_w_sap_collection_data_full_w_multiple_resid) +
  geom_histogram(aes(x = resid)) +
  labs(x = "residual",
       y = "total amount of data points",
       title = "Multiple Regression Model's Residual Distribution",
       caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")
I only see tiny improvements between the multiple regression model's
residual distribution and the simple sugar concentration
regression model's. The first improvement I see is that the peak looks
a bit closer to 0, which is good. This distribution also looks
a tiny bit more normally distributed around 0, but it is still skewed
to the right by the same amount as the simple model's. Its right tail
also ends at the same x-value as the simple model's does.
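A visual comparison of two histograms can be backed up with a few numbers. The sketch below is only an illustration on simulated data (the column names and the right-skewed noise are assumptions chosen to mimic the shape described here); it summarizes each model's residuals by their median, largest value, and sample skewness, where a positive skewness indicates a longer right tail.

```r
# Simulated stand-in data; names and values are hypothetical.
set.seed(3)
n <- 400
toy <- data.frame(
  species = factor(rep(c("ACRU", "ACSA"), each = n / 2)),
  sap_wt  = runif(n, 1, 10),
  dbh     = runif(n, 20, 80)
)
# Shifted exponential noise has mean 0 and a long right tail.
toy$sugar <- 2 + 0.75 * (toy$species == "ACSA") + 0.01 * toy$sap_wt +
  (rexp(n, rate = 2) - 0.5)

simple_fit   <- lm(sugar ~ species, data = toy)
multiple_fit <- lm(sugar ~ species + sap_wt + dbh, data = toy)

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

resid_summary <- function(fit) {
  r <- resid(fit)
  c(median = median(r), max = max(r), skewness = skewness(r))
}

comparison <- rbind(simple   = resid_summary(simple_fit),
                    multiple = resid_summary(multiple_fit))
print(round(comparison, 3))
```

If the two rows are nearly identical, that supports the reading that adding predictors barely changed the shape of the residual distribution.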
The second to last question I need to answer here is how the
multiple regression model I created compares to the simple
sugar concentration regression model and the two other
simple models. More specifically, how does my multiple model match up
against the simple ones, and is it really better? I also need to address
how I created the multiple model in the first place. In order to fully
compare all of the models, I need to look at everything that I looked at
previously in this section and in the simple model sections. This will
help me to see how they all come together.
Before I talk about how the multiple regression model compares to the
simple sugar concentration model, I need to preface one
thing. In order to make the multiple regression model, I had
to find the mean of all the non-NA values in the sap weight
and tree thickness columns, respectively, and then
replace all of the NAs in each of those two columns with the average I
found. This was the only way I could get the multiple
regression model to work, because R would not let me compute it
otherwise. Now to how these two models compare.
Overall, they are basically the same. There is not much separating them.
They have very similar RSEs and R^2 values, and the regression
coefficients are all just decimals away from each other. Even the
p-values for the regression coefficient of tree species are exactly the
same. Finally, the residual distribution histograms I made for both
of them are also very similar to each other. This means that the
confounding variables I found do not make much of an impact on how
tree species affects sugar concentration. We could
just use the simple regression model and reach the same
conclusion as with the multiple one. I chose to use only
sap weight and tree thickness because I
believe that these two are the only logical confounding variables
that can affect the relationship between tree species and
sugar concentration. Even though I know that my R^2 values are
low, the good thing is that I was not trying every possible combination
of confounding variables to over-fit my model to this particular data
set. Instead of low R^2 values that show my model is not good, I would
have had artificially high R^2 values that are just as bad, maybe worse.
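The replacement of NAs with column means described above can be sketched as follows. This is an illustration on simulated data rather than the code used for this report, and the column names and the helper impute_mean() are hypothetical. For comparison, the sketch also fits the same formula to the data that still contain NAs: under R's default settings, lm() drops incomplete rows (na.action = na.omit), so the two fits use different numbers of rows.

```r
# Simulated stand-in data with missing values; names and values are hypothetical.
set.seed(4)
n <- 300
toy <- data.frame(
  species = factor(rep(c("ACRU", "ACSA"), each = n / 2)),
  sap_wt  = runif(n, 1, 10),
  dbh     = runif(n, 20, 80)
)
toy$sugar <- 2 + 0.75 * (toy$species == "ACSA") + rnorm(n, sd = 0.6)
toy$sap_wt[sample(n, 40)] <- NA  # knock out some values
toy$dbh[sample(n, 25)]    <- NA

# Replace each NA with the mean of the non-NA values in the same column.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
imputed <- toy
imputed$sap_wt <- impute_mean(toy$sap_wt)
imputed$dbh    <- impute_mean(toy$dbh)

fit_imputed  <- lm(sugar ~ species + sap_wt + dbh, data = imputed)
# With default options, lm() silently drops rows that contain an NA.
fit_complete <- lm(sugar ~ species + sap_wt + dbh, data = toy)

print(c(rows_imputed = nobs(fit_imputed), rows_complete = nobs(fit_complete)))
print(rbind(imputed = coef(fit_imputed), complete = coef(fit_complete)))
```

Mean imputation keeps every row and leaves each column's mean unchanged, but it shrinks that column's variance, so comparing the imputed fit with the complete-case fit is a quick way to see how much the choice matters.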
The final thing we want to do is to look back at the main question
that we have been trying to answer from the start: does the non-masting
red maple species exhibit muted dynamics compared to the masting sugar
maple species? We can answer this by looking at the multiple and the
three simple regression models I have made. We can look
at all of them because the multiple model is not that different from the
simple sugar concentration model, so looking at all of the
models is helpful.
What I can conclude from the one multiple and three simple
regression models, and from comparing the multiple and simple
sugar concentration models, is that yes, the non-masting
red maple species does in fact exhibit muted dynamics compared to the
masting sugar maple species. All of the simple models show that the
ACSA (sugar maple) has a higher sugar concentration,
sap weight, and tree thickness than the ACRU
(red maple). The multiple regression model also shows that sugar maple
trees have a higher sugar concentration than red maple trees.
This shows that ACRU trees do exhibit muted dynamics compared to ACSA
trees. The only caveat is that none of these models is very reliable.
They all have really low R^2 values, which are dead
giveaways that a model is not usable for statistical
analysis, bad RSE values, poor residual distribution histograms,
or a mixture of all of them. Overall, I think that these models, and
their conclusions, are not strong enough to claim anything definitive
about whether non-masting red maple species exhibit muted dynamics
compared to masting sugar maple species. At face value, they do suggest
that the answer to our main question is yes. There could be a different
statistical model out there that would be much better suited to this
question than regression models.
In conclusion, based on everything that I have
looked at, non-masting red maple species do exhibit muted
dynamics compared to the masting sugar maple species. The
variables that I think most likely
demonstrate whether a certain tree species is healthy
or not are sugar concentration, sap weight,
and tree thickness. All of this means that certain tree
characteristic variables do have an effect on whether a tree has
muted dynamics or not. The visualizations, data set, and all of the
simple and multiple regression models I have made help support my
answer to the main question of this report.
Hess, Anna, and Mark Hamilton. “Sugar Content of Maple Sap.” The Walden Effect, Branchable Wiki Hosting, 23 Feb. 2014, www.waldeneffect.org/blog/Sugar_content_of_maple_sap/ (Accessed 2023-12-11).
Lancaster, Lynda. “Sweet Signs of Spring.” National Park Service, U.S. Department of the Interior, 9 Apr. 2020, www.nps.gov/indu/learn/education/sweet-signs-of-spring.htm (Accessed 2023-12-11).
McLane, Eben. “Masting: Survival of the Seediest.” Finger Lakes Land Trust, fllt.org, 10 July 2011, www.fllt.org/cl-masting-survival-of-the-seediest/ (Accessed 2023-12-11).
Rapp, J., E. Crone, and K. Stinson. 2021. Maple Reproduction and Sap Flow at Harvard Forest since 2011 ver 4. Environmental Data Initiative. https://doi.org/10.6073/pasta/c74eba9dc8ddc41c19dc85e002a3f046 (Accessed 2023-12-11).