An Analysis of Three Tree Variables affecting Tree Species

Introduction

This report seeks to answer the following question:

Does the non-masting red maple species exhibit muted dynamics compared to the masting sugar maple species? (In botany, masting is when trees of the same species periodically produce an overabundance of seeds in some years, which helps ensure that the species survives. Mast can be the fruits, seeds, or nuts of trees. In this report, dynamics means how alive and healthy a tree is: how many leaves and flowers and how much sap it produces, and how big the tree and its nuts, branches, and so on are. Muted dynamics are the opposite: a tree that is less alive and less healthy.)

We will be using a data set called maple_tap_w_sap_collection_data_full, which I created from the original maple_tapping_data and sap_collection_data. The two original data sets were obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4. The combined data set describes the two species of maple trees that were studied in the Harvard forest from 2012 to 2018 and the characteristics of those trees. There are 10 variables in total, but the most important ones for this analysis are: tree species (ACSA = sugar maple and ACRU = red maple), sugar concentration (measured in Brix (weight percent) with a Misco digital refractometer on sap collected directly from the tap), sap weight (the weight of the sap when collected), tree thickness (the diameter of the tree, in centimeters, at 1.4 m above ground), and tree identification number (the prefix is HF for sugar maples and HFR for red maples). The link above has all of the original data sets, both the ones I used and the ones I did not; showing them all here would take too much space. The full data set I created can be viewed below:

Throughout, we will need the tidyverse package, mainly to create visualizations and transformations. The modelr package helps with the regression modeling. The DT package creates nice-looking data tables. Finally, the readr package imports the two original data set csv files (readr is also attached by tidyverse, so loading it separately is optional).

library(tidyverse)
library(modelr)
library(DT)
library(readr)

Cleaning Process:

Before I try to answer the main question of this report, I want to talk a little bit about the cleaning that I did to get my data set, maple_tap_w_sap_collection_data_full. The first thing I did was change any columns that had the wrong data type. Then, I removed any columns that were not needed for this analysis. Next, I made all of the prefixes in the tree identification number column consistent, so that each tree has either HF or HFR depending on its species. After that, I found what I believe to be a data entry error in the sugar concentration column: one row had a value that was much higher than the rest, so I changed it to what I believed to be the correct entry. Then I combined the two data sets into one. Finally, I gave better names to any columns that needed them.
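
The cleaning code itself is not shown in this report. A minimal sketch of the steps described above, run on two small made-up tables, might look like the following. The original column names (tree, species, dbh, notes, sugar, weight), the lower-case prefixes, the join key, and the outlier rule (divide by 10 when above 10) are all assumptions for illustration, not the actual data or the actual correction that was made.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the two original tables (hypothetical columns and values).
tapping <- tibble(
  tree    = c("HF1", "hf2", "HFR1", "hfr2"),
  species = c("ACSA", "ACSA", "ACRU", "ACRU"),
  dbh     = c("60.1", "58.4", "41.0", "39.5"),  # numbers stored as text
  notes   = c(NA, "ok", NA, NA)
)

sap <- tibble(
  tree   = c("HF1", "hf2", "HFR1", "hfr2"),
  sugar  = c(2.4, 26, 1.9, 1.7),  # 26 looks like a misplaced decimal point
  weight = c(3.1, 2.8, 1.6, 2.0)
)

tapping_clean <- tapping %>%
  mutate(dbh  = as.numeric(dbh),       # fix the wrong data type
         tree = str_to_upper(tree)) %>% # make the prefixes consistent
  select(-notes)                        # drop a column that is not needed

sap_clean <- sap %>%
  mutate(tree  = str_to_upper(tree),
         # assumed correction rule for the suspected data entry error
         sugar = if_else(sugar > 10, sugar / 10, sugar))

# Combine the two tables, then give the columns more readable names.
combined <- sap_clean %>%
  left_join(tapping_clean, by = "tree") %>%
  rename(`tree identification number` = tree,
         `tree species` = species,
         `tree thickness` = dbh,
         `sugar concentration` = sugar,
         `sap weight` = weight)

combined
```

Keeping each step as its own small pipeline makes it easier to check the result after every change.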

Visualizations:

For the visualizations, I thought the best way to tackle the main problem of this report was to see how the variables that I think best showcase differences between tree species (and tree identification numbers) look individually and together.

Visualization of Tree Species Based on Sugar Concentration

The question we are investigating deals with the relationship between a tree's species and the sugar concentration of its sap. We could suggest that the species of a tree affects how much sugar its sap contains. What we want to do here is make a simple visualization that can show whether there is in fact any relationship between these two variables. I hypothesize that this visualization will show a difference in sugar concentration between the two species. We can check this with a box plot:

ggplot(maple_tap_w_sap_collection_data_full) +
  geom_boxplot(mapping = aes(`tree species`, `sugar concentration`, color = `tree species`)) +
  labs(x = "tree species (ACSA = sugar maple and ACRU = red maple)",
       y = "concentration of sugar (Brix)",
       color = "tree species",
       title = "Sugar Concentration by Tree Species",
       caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")

My hypothesis was supported. By looking at the box plot, we can see that there is in fact a difference in sugar concentration based on the species of the tree. This graph shows us that ACSAs (sugar maples) have a higher concentration of sugar in their sap than their ACRU (red maple) counterparts. Specifically, ACSAs have a median sugar concentration of about 2.5 Brix, while ACRUs have a median of about 1.8 Brix.
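
Medians read off a box plot are only approximate. They can also be computed directly. The sketch below does this on a small made-up table (the toy values are assumptions, not the Harvard Forest data); the same group_by() and summarize() steps could be applied to maple_tap_w_sap_collection_data_full.

```r
library(dplyr)

# Made-up measurements, including one missing value.
toy <- tibble(
  `tree species` = c("ACSA", "ACSA", "ACSA", "ACRU", "ACRU", "ACRU"),
  `sugar concentration` = c(2.4, 2.5, 2.9, 1.7, 1.8, NA)
)

# One median per species, ignoring missing measurements.
species_medians <- toy %>%
  group_by(`tree species`) %>%
  summarize(`median sugar concentration` =
              median(`sugar concentration`, na.rm = TRUE))

species_medians
```

Swapping median() for another summary function, or the column for sap weight or tree thickness, gives the matching numbers for the other box plots.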

Visualization of Tree Species Based on Sap Weight

Next, we examine whether there is another relationship with the species of a tree, this time with the weight of the sap. We can propose that the species of a tree affects how much the collected sap weighs. To see if this is true, we can create a quick visualization that shows the connection between the species of a tree and its sap weight. I expect the visualization to show a difference in sap weight between the two species. To check this, we can create a box plot:

ggplot(maple_tap_w_sap_collection_data_full) +
  geom_boxplot(aes(`tree species`, `sap weight`, color = `tree species`)) +
  labs(x = "tree species (ACSA = sugar maple and ACRU = red maple)",
       y = "weight of sap (kilograms)",
       color = "tree species",
       title = "Sap Weight by Tree Species",
       caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")

My expectation was supported. If we look at the box plot above, we can see that there is indeed a difference in how heavy a tree's sap is based on the species of the tree. This graph shows us that ACRUs (red maples) have lighter sap weights than ACSAs (sugar maples). To be more specific, ACRUs have a median sap weight of about 1.8 kg, while the ACSAs' median sap weight is about 3.2 kg.

Visualization of Tree Species Based on Tree Thickness

Finally, the last thing we want to analyze in this section is whether there is a relationship between the species of a tree and the thickness of the tree. We also want to decide whether the three visualizations I created are enough to answer the main question. We could suggest that the species of a tree influences how thick the tree is. To investigate this, we want to create a basic visualization that shows the link between a tree's species and its thickness. I speculate that it will show a difference in thickness between the two species. Overall, I do not think that these three visualizations are good enough to answer the main question, because this is a very complex problem and it will take more than some basic visualizations to answer it. We can make a box plot to see if I am correct:

ggplot(maple_tap_w_sap_collection_data_full) +
  geom_boxplot(aes(`tree species`, `tree thickness`, color = `tree species`)) +
  labs(x = "tree species (ACSA = sugar maple and ACRU = red maple)",
       y = "thickness of tree (centimeters)",
       color = "tree species",
       title = "Tree Thickness by Tree Species",
       caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")

My first idea, that the two tree species have different tree thicknesses, was supported. The box plot shows a large difference in thickness between the two species: ACSAs (sugar maples) are thicker than ACRUs (red maples). In general, ACSAs are about 66 centimeters in diameter, compared to about 43 centimeters for ACRUs. Taken together, these three visualizations suggest that ACSAs are healthier than ACRUs.

My second idea, that the three visualizations are not enough, also holds. The visualizations do not give me enough information to adequately answer the overall question. The main reason is that box plots are too simple a tool for this: each one looks at a single variable at a time. Yes, they do show that sugar maples have higher medians in all three of the variables we looked at. But they do not take into account how these variables affect each other, or how tree species affects all of them together. This is where we need to look at how they work in tandem. That will help us get to the bottom of whether non-masting red maple trees really do exhibit muted dynamics compared to masting sugar maple trees.

Tree Identification vs Three Important Variables

The problem that we are trying to solve here is whether individual trees from the sugar maple group (identified by the HF prefix) have better averages for the three main variables we talked about earlier than the trees from the red maple group (identified by the HFR prefix). Those variables are sugar concentration, sap weight, and tree thickness. We can hypothesize that the averages will be greater for the HF trees (sugar maples) than for the HFR trees (red maples), and I believe that the differences will be fairly large. One reason is that, in this data set, there are far more data points for HF than for HFR, so the HF averages are less affected by a few outliers than the HFR averages are. We can test this hypothesis by creating a new data set from maple_tap_w_sap_collection_data_full:

average_maple_tap_w_sap_collection_data_full <- maple_tap_w_sap_collection_data_full %>%
  group_by(`tree identification number`) %>%
  # per-tree averages, ignoring missing measurements
  summarize(`average sugar concentration` = mean(`sugar concentration`, na.rm = TRUE),
            `average sap weight` = mean(`sap weight`, na.rm = TRUE),
            `average tree thickness` = mean(`tree thickness`, na.rm = TRUE)) %>%
  # split the prefix from the number; convert = TRUE stores the number as numeric so it sorts 1, 2, ..., 10 rather than 1, 10, 2
  separate(`tree identification number`, into = c("tree identification letter", "tree identification number"), sep = "(?<=[A-Za-z])(?=[0-9])", convert = TRUE) %>%
  arrange(`tree identification number`)

datatable(`average_maple_tap_w_sap_collection_data_full`, options = list(scrollX = TRUE))

My hypothesis was largely correct. Almost every HF (sugar maple) has a higher average sugar concentration, sap weight, and tree thickness than any HFR (red maple). There were some HFRs that had higher averages than a couple of HFs, but overall the HFs had the highest averages. For example, among the comparable trees, the ones that share the same tree identification number, the HF tree had the higher averages. There was one exception to this: HF5 and HFR5. HFR5 had a higher tree thickness than HF5, while the other two variables were higher for HF5. This suggests that HFs are, on average, healthier trees than HFRs.

In addition, Anna Hess and Mark Hamilton talk about a New England study which found that the sap sugar content of a single maple tree ranges from 1.8% to 8.4%. Another source, from Lynda Lancaster of the National Park Service of Indiana, says that 2-3% is the sugar content of an average maple tree. From my table we can see that all but 1 of the 20 sugar maple trees are in that 2-3% range, while only 2 of the 10 red maple trees have averages in that range. This shows that sugar maple trees have better sugar concentrations than their red maple counterparts.
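
Counting how many per-tree averages fall in the 2-3% range does not have to be done by eye. Here is a small sketch on made-up averages (the toy values are assumptions; the column names mirror the table above), which could be adapted to average_maple_tap_w_sap_collection_data_full.

```r
library(dplyr)

# Made-up per-tree averages for a handful of trees.
toy_averages <- tibble(
  `tree identification letter` = c("HF", "HF", "HF", "HFR", "HFR"),
  `average sugar concentration` = c(2.6, 2.4, 3.2, 1.8, 2.1)
)

# For each prefix, count the trees whose average lies between 2 and 3.
in_range <- toy_averages %>%
  group_by(`tree identification letter`) %>%
  summarize(`trees in range` = sum(between(`average sugar concentration`, 2, 3)),
            `trees total` = n())

in_range
```

Note that between() includes both end points, so a tree averaging exactly 2 or exactly 3 counts as being in range.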

Regression Modeling:

For the regression modeling, I thought it would be most beneficial, and appropriate, to fit three individual simple regression models for the three main variables that I have been using so far (sugar concentration, sap weight, and tree thickness) to see how tree species affects each of them individually. Then, after the simple regression models, I fit a multiple regression model using all three of the variables to see how tree species affects them together.

Simple Sugar Concentration Regression Model

maple_tap_and_sap_simple_sugar_model <- lm(`sugar concentration` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)

summary(maple_tap_and_sap_simple_sugar_model)

## 
## Call:
## lm(formula = `sugar concentration` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7656 -0.3656 -0.0656  0.3344  4.7344 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.83537    0.02102   87.31   <2e-16 ***
## `tree species`ACSA  0.73019    0.02258   32.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5893 on 5890 degrees of freedom
##   (953 observations deleted due to missingness)
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1506 
## F-statistic:  1046 on 1 and 5890 DF,  p-value: < 2.2e-16

The Regression Coefficient and P-Value

To start things off, we want to see what the regression coefficient and the p-value of tree species tell us about the sugar concentration of each of the two species of trees. We can find these by looking at the summary statistics for the simple sugar concentration regression model.

The regression coefficient of tree species tells me that when the predictor variable changes from ACRU (red maple) to ACSA (sugar maple), the sugar concentration, the response variable, goes up by about 0.73019 Brix. The null hypothesis is that tree species has no effect on sugar concentration. The p-value for tree species is reported as less than 2e-16 (that is, below 0.0000000000000002), which is far below the 0.05 threshold. This means that we can reject the null hypothesis. In other words, tree species has a statistically significant effect on sugar concentration, and the estimated difference is 0.73019 Brix.
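
With a two-level predictor like tree species, the intercept is the mean of the baseline group (ACRU) and the coefficient is the difference between the two group means. In the output above that gives a mean of 1.83537 for red maples and 1.83537 + 0.73019 = 2.56556 for sugar maples. The sketch below checks this relationship on made-up numbers (the toy data frame is an assumption, not the real data).

```r
# Made-up sugar concentrations for three trees of each species.
toy <- data.frame(
  species = factor(c("ACRU", "ACRU", "ACRU", "ACSA", "ACSA", "ACSA")),
  sugar   = c(1.7, 1.8, 2.0, 2.4, 2.6, 2.8)
)

fit <- lm(sugar ~ species, data = toy)

# Group means computed directly, without a model.
group_means <- tapply(toy$sugar, toy$species, mean)

coef(fit)     # intercept and the speciesACSA coefficient
group_means   # mean of ACRU and mean of ACSA

# Difference in means; matches the speciesACSA coefficient.
mean_difference <- unname(group_means["ACSA"] - group_means["ACRU"])
mean_difference
```

This is why the coefficient can be read as "sugar maples are higher by this much on average".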

How Good are the Goodness-of-Fit Statistics?

The question we want to answer here is if the conclusion from the previous question is sufficient based on the goodness-of-fit statistics. We can test this by seeing what values the RSE and the R^2 are in the summary statistics of the simple sugar concentration regression model.

The conclusion from the previous part should be treated with caution. The main reason is the two R^2 values. The multiple R^2 value is 0.1508 and the adjusted R^2 value is 0.1506. These values are far from 1, which means that tree species alone explains only about 15% of the variation in sugar concentration, so this regression model is not very good at predicting individual measurements. The RSE value, on the other hand, is on the low side at only 0.5893 Brix. Because the R^2 values are so low, they weigh more heavily in judging whether the model is good or not.
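
To make the two goodness-of-fit statistics less of a black box, here is a small sketch that computes R^2 and the RSE by hand from the residuals and compares them to what summary() reports. The toy data frame is made up for illustration.

```r
# Made-up data: four trees per species.
toy <- data.frame(
  species = factor(rep(c("ACRU", "ACSA"), each = 4)),
  sugar   = c(1.5, 1.8, 2.0, 2.3, 2.1, 2.4, 2.8, 3.1)
)

fit <- lm(sugar ~ species, data = toy)

rss <- sum(residuals(fit)^2)                 # variation left unexplained
tss <- sum((toy$sugar - mean(toy$sugar))^2)  # total variation in the response

r_squared <- 1 - rss / tss                   # share of variation explained
rse <- sqrt(rss / df.residual(fit))          # typical size of a residual

c(r_squared = r_squared, rse = rse)
c(summary(fit)$r.squared, summary(fit)$sigma)  # same two numbers from summary()
```

R^2 is a proportion, so it can be compared across models, while the RSE is in the units of the response and has to be judged against the size of the values being predicted.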

The Model's Residual Distribution Visualization

The problem we want to address here is how the residuals of the model are distributed and what this means for the model. In order to do this we need to create a visualization that shows the spread of the residuals. We can do this by creating a histogram of the model's residuals:

maple_tap_w_sap_collection_data_full_w_sugar_resid <- maple_tap_w_sap_collection_data_full %>%
  add_residuals(maple_tap_and_sap_simple_sugar_model)

ggplot(maple_tap_w_sap_collection_data_full_w_sugar_resid) +
  geom_histogram(aes(resid)) +
  labs(x = "residual",
       y = "total amount of data points",
       title = "Residual Sugar Concentration Regression Model",
       caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")

The residuals of this model are roughly normally distributed, with a peak very close to 0, but they are a little bit right skewed because of the long tail trailing to the right. What this distribution means is that the model under-predicts some sugar concentration values by a wide margin (the largest residual is 4.7344, while the smallest is only -1.7656). This suggests that there could be a pattern in the data set that the model is not detecting.
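
One way to put a number on the skew is to compare the mean and the median of the residuals. For a least-squares model with an intercept, the residuals always average to zero, so a long right tail shows up as a median below zero (the summary above reports a median residual of -0.0656). The sketch below illustrates this on made-up data with one unusually sweet reading per species; the values are assumptions.

```r
# Made-up data: each species has one unusually high sugar reading.
toy <- data.frame(
  species = factor(rep(c("ACRU", "ACSA"), each = 5)),
  sugar   = c(1.6, 1.7, 1.8, 1.9, 3.5, 2.3, 2.4, 2.5, 2.6, 4.7)
)

fit <- lm(sugar ~ species, data = toy)
res <- residuals(fit)

mean(res)    # zero, up to rounding error
median(res)  # below zero because of the long right tail
range(res)   # the largest residual is further from zero than the smallest
```

A median clearly below zero together with a large maximum is the numerical counterpart of the right-skewed histogram.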

Simple Regression Models Conclusion

The final question we want to answer, in this section, is what we can conclude from the simple sugar concentration model. In order to answer this we need to look at what the model says, at face value, about whether non-masting red maples exhibit muted dynamics compared to masting sugar maples. To do this we can look at the regression coefficient and p-value of the model.

What I can conclude from the simple sugar concentration regression model is that there is in fact a relationship between the sugar concentration of a tree's sap and the species of that tree. The simple regression model tells me that the tree species variable has a statistically significant effect on the sugar concentration variable. This means that sugar maples (ACSA) have higher sugar concentrations than red maples (ACRU) by about 0.73019 Brix.

Simple Sap Weight Regression Model

maple_tap_and_sap_simple_weight_model <- lm(`sap weight` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)

summary(maple_tap_and_sap_simple_weight_model)

## 
## Call:
## lm(formula = `sap weight` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0855 -2.1355 -0.5755  1.5445 14.6445 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.89603    0.09344   20.29   <2e-16 ***
## `tree species`ACSA  2.19949    0.10187   21.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.925 on 6169 degrees of freedom
##   (674 observations deleted due to missingness)
## Multiple R-squared:  0.07025,    Adjusted R-squared:  0.0701 
## F-statistic: 466.1 on 1 and 6169 DF,  p-value: < 2.2e-16

The Regression Coefficient and P-Value

For this section, we want to see what the p-value and the regression coefficient of tree species say about how much the sap weighs for the two species of trees we are comparing. To find out, we can look at the summary statistics of the simple sap weight regression model.

The tree species regression coefficient is 2.19949. What this means is that when the predictor variable changes from ACRU (red maple) to ACSA (sugar maple), the sap weight, the response variable, goes up by about 2.19949 kilograms. The null hypothesis is that tree species has no effect on sap weight. The p-value for tree species is reported as less than 2e-16 (that is, below 0.0000000000000002). With the p-value being far below the 0.05 threshold, we can reject the null hypothesis: tree species has a statistically significant effect on sap weight, with an estimated difference of 2.19949 kilograms.
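
The summary prints very small p-values only as being below 2e-16, which is a display threshold rather than the value itself. The stored p-value can be pulled out of the coefficient table. The sketch below shows how on made-up sap weights (the toy data frame is an assumption), using only base R functions.

```r
# Made-up sap weights for six trees of each species.
toy <- data.frame(
  species = factor(rep(c("ACRU", "ACSA"), each = 6)),
  weight  = c(1.2, 1.9, 2.4, 1.6, 2.1, 1.8, 3.9, 4.4, 3.6, 4.8, 4.1, 3.7)
)

fit <- lm(weight ~ species, data = toy)

# The coefficient table as a numeric matrix.
coef_table <- coef(summary(fit))

# The p-value for the species term, as a plain number.
p_value <- coef_table["speciesACSA", "Pr(>|t|)"]
p_value

# format.pval() applies the same kind of display cut-off that summary() uses.
format.pval(p_value, eps = 2e-16)
```

On the real models the same lookup would use the row name shown in the summary output for the tree species term.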

How Good are the Goodness-of-Fit Statistics?

Next, we want to see if the conclusion from the previous problem holds up according to the goodness-of-fit statistics. The best way to check this is by looking at the summary statistics of the simple sap weight regression model, specifically the R^2 and RSE values.

The previous part's conclusion should be treated with caution. The first reason is that this simple model has a fairly high RSE value of 2.925 kilograms, which is larger than the estimated difference between the species. What we want instead is a low RSE. The main reason we should be careful with this model's conclusion is that it has really low R^2 values. The multiple R^2 is 0.07025 and the adjusted R^2 is 0.0701, so tree species explains only about 7% of the variation in sap weight. With the R^2 values being so far from 1, and the RSE being on the high side, this model is not a good tool for predicting sap weight.

The Model's Residual Distribution Visualization

The second to last thing we need to do is look at the model's residuals. Specifically, we want to see how they are distributed and what that means for the model. We can do this by making a visualization that displays the dispersion of the residuals. A histogram is the best tool for this kind of visualization:

maple_tap_w_sap_collection_data_full_w_weight_resid <- maple_tap_w_sap_collection_data_full %>%
  add_residuals(maple_tap_and_sap_simple_weight_model)

ggplot(maple_tap_w_sap_collection_data_full_w_weight_resid) +
  geom_histogram(aes(resid)) +
  labs(x = "residual",
       y = "total amount of data points",
       title = "Residual Sap Weight Regression Model",
       caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")

This model's residual distribution is only somewhat normally distributed. The left side of the peak stops abruptly just above the 400 mark before ever getting to the bottom, while the right side goes all the way down to the bottom. The abrupt stop is probably because sap weight cannot go below zero, which puts a floor under the residuals. The peak is somewhat close to 0, but the distribution is very much right skewed because it trails off far to the right. This residual distribution shows us that the model under-predicts some sap weight values by a wide margin (the largest residual is 14.6445, while the smallest is -4.0855). This suggests that there could very well be a pattern in the data set that the model is not detecting.

Simple Regression Models Conclusion

Lastly, we need to see what we can infer from the simple sap weight model. We want to specifically see what the model says at face value about the main question we are trying to answer. The regression coefficient and the p-value of the simple model will help us answer this.

The simple sap weight regression model lets me conclude that there is in fact a relationship between the species of a tree and how heavy the tree's sap is. It shows me that the tree species variable has a statistically significant effect on the sap weight variable. What this all means is that sugar maples (ACSA) have heavier sap weights than red maples (ACRU) by about 2.19949 kilograms.

Simple Tree Thickness Regression Model

maple_tap_and_sap_simple_thickness_model <- lm(`tree thickness` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)

summary(maple_tap_and_sap_simple_thickness_model)

## 
## Call:
## lm(formula = `tree thickness` ~ `tree species`, data = maple_tap_w_sap_collection_data_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.403  -8.648  -2.103  10.697  21.017 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          46.583      2.405  19.373  < 2e-16 ***
## `tree species`ACSA   20.120      2.673   7.528 9.19e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.78 on 124 degrees of freedom
##   (6719 observations deleted due to missingness)
## Multiple R-squared:  0.3137, Adjusted R-squared:  0.3081 
## F-statistic: 56.67 on 1 and 124 DF,  p-value: 9.191e-12

The Regression Coefficient and P-Value

We want to look at the regression coefficient and the p-value of tree species to see what it shows about how thick a tree is according to the tree species it is. The summary statistics in the simple tree thickness regression model can show us the relationship between the two.

The regression coefficient shows that when tree species, the predictor variable, goes from ACRU (red maple) to ACSA (sugar maple), the tree thickness, the response variable, goes up by about 20.120 centimeters. The null hypothesis is that tree species has no effect on tree thickness. Because the p-value of 9.19e-12 (0.00000000000919) is less than the 0.05 threshold, we can reject the null hypothesis. This means that the species of the tree has a statistically significant effect on tree thickness, with an estimated difference between the two species of 20.120 centimeters.

How Good are the Goodness-of-Fit Statistics?

In order to move on, we need to see if the conclusion from the previous question is satisfactory according to the goodness-of-fit statistics. To check this, we can look at the simple tree thickness regression model's summary statistics to see what the R^2 and RSE values are.

For the conclusion in the previous section, I think that we should be cautious. The first reason is that the simple model has a considerably high RSE value of 11.78 centimeters. With RSE values, you want them to be as low as possible relative to the size of the response, and 11.78 is large compared to the estimated mean diameters of about 47 and 67 centimeters. The second reason is the R^2 values. Compared to the other two simple models, this one does have better ones, at 0.3137 for the multiple R^2 and 0.3081 for the adjusted R^2. But these are still too far away from 1 to be considered good. I still think that this simple model, based on its goodness-of-fit statistics, should probably not be relied on for prediction.

The Model's Residual Distribution Visualization

Next, we want to create a visualization of the model's residuals. We want to do this so we can look at how the residuals are distributed and what that means for the model. In order to do this we can create a histogram of the model's residuals:

maple_tap_w_sap_collection_data_full_w_thickness_resid <- maple_tap_w_sap_collection_data_full %>%
  add_residuals(maple_tap_and_sap_simple_thickness_model)

ggplot(maple_tap_w_sap_collection_data_full_w_thickness_resid) +
  geom_histogram(aes(resid)) +
  labs(x = "residual",
       y = "total amount of data points",
       title = "Residual Tree Thickness Regression Model",
       caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")

This residual distribution looks roughly normal. But, because there are not many data points (only 126 rows have a tree thickness value), the shape is not very smooth. The peak of the distribution is close to 0, only slightly to the left. The graph does not look skewed to either side. This distribution tells me that the model is able to predict the tree thickness values reasonably well compared to the actual values in the data set, without consistently over- or under-predicting.

Simple Regression Models Conclusion

The last thing we need to do before going on to the next section is to look at the simple tree thickness model and see what we can deduce from it. The model gives us a simple, face-value answer to the main question that we are trying to answer. To get it, we need to look at the p-value and the regression coefficient of the model.

After looking at the simple tree thickness regression model, I can conclude that there is a connection between how thick a tree is and the species that tree is from. The model also demonstrates that the tree species variable does have a substantial impact on the tree thickness variable. Overall, this shows that sugar maple trees (ACSA) are thicker than red maple trees (ACRU) by about 20.120 centimeters.

Choosing the Response and Confounding Variables

There are three questions we need to answer in this section. Of the three variables I have been using, which is the most likely to be the response variable, which two are most likely to be the confounding variables, and why? The three main variables I have been using are sugar concentration, sap weight, and tree thickness.

I think that sugar concentration should be the response variable for the multiple regression model. The first reason is that, of the simple regression models I made earlier, it had the best one. The sugar concentration simple model had the best RSE value and the best residual distribution. It did not have the best R^2 values, but taking everything into consideration it is the most well rounded simple model of the three. The second reason is that I believe sugar concentration is the best indication of how healthy a maple tree is. If a maple tree produces sap with a higher sugar concentration, then it likely does not have muted dynamics and everything else about the tree will be healthier.

The first variable that I believe to be a confounding variable is sap weight. The reason is that you would assume that if sap has a higher weight, then it would also have a higher sugar concentration. Also, the healthier the tree is, the more sap it would be able to produce, resulting in a higher weight. The second confounding variable, I believe, is tree thickness. The reason is that the healthier the tree is, the bigger it would be. A bigger tree would also be able to produce more sap, which would mean it most likely has a higher sugar concentration.

Multiple Model

maple_tap_w_sap_collection_data_full2 <- maple_tap_w_sap_collection_data_full %>%
  select(-date, -`tree identification number`, -`which tap`, -time, -`tap cardinal direction`, -`tap height`)

# Replace every missing sap weight with the mean of the observed sap weights
entryNA <- which(is.na(maple_tap_w_sap_collection_data_full2$`sap weight`))

maple_tap_w_sap_collection_data_full2$`sap weight`[entryNA] <- mean(maple_tap_w_sap_collection_data_full2$`sap weight`, na.rm = TRUE)

# Do the same for tree thickness (most rows have no thickness measurement)
entryNA2 <- which(is.na(maple_tap_w_sap_collection_data_full2$`tree thickness`))

maple_tap_w_sap_collection_data_full2$`tree thickness`[entryNA2] <- mean(maple_tap_w_sap_collection_data_full2$`tree thickness`, na.rm = TRUE)

maple_tap_and_sap_multiple_model <- lm(`sugar concentration` ~ `tree species` + `tree thickness` + `sap weight`, data = maple_tap_w_sap_collection_data_full2)

summary(maple_tap_and_sap_multiple_model)

## 
## Call:
## lm(formula = `sugar concentration` ~ `tree species` + `tree thickness` + 
##     `sap weight`, data = maple_tap_w_sap_collection_data_full2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8275 -0.3973 -0.0674  0.3418  4.6705 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.514103   0.662534   2.285   0.0223 *  
## `tree species`ACSA  0.762622   0.023103  33.010  < 2e-16 ***
## `tree thickness`    0.005750   0.010532   0.546   0.5851    
## `sap weight`       -0.016863   0.002702  -6.241 4.66e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5875 on 5888 degrees of freedom
##   (953 observations deleted due to missingness)
## Multiple R-squared:  0.1564, Adjusted R-squared:  0.156 
## F-statistic: 363.8 on 3 and 5888 DF,  p-value: < 2.2e-16
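
The two imputation steps before the model fit replace every missing sap weight and tree thickness with the column mean. The sketch below repeats that idea on a short made-up vector (the values are assumptions) to show a side effect worth keeping in mind: when most values are missing, as with tree thickness here (the simple thickness model dropped 6719 observations for missingness), the filled-in column becomes nearly constant, which makes it hard for the model to detect any effect of that variable.

```r
# Made-up thickness column where most entries are missing.
thickness <- c(60, NA, NA, NA, 45, NA, NA, NA, NA, 70)

# Mean imputation: fill every gap with the mean of the observed values.
filled <- thickness
filled[is.na(filled)] <- mean(thickness, na.rm = TRUE)

mean(filled)                 # unchanged by the imputation
sd(thickness, na.rm = TRUE)  # spread of the observed values
sd(filled)                   # noticeably smaller after filling the gaps
```

The mean stays the same, but the spread shrinks because every filled-in value sits exactly at the mean.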

Comparing the Regression Coefficient and P-Value

What we first want to find out, in this section, is the regression coefficient and the p-value of tree species in the multiple regression model. The next thing we want to look at is how they compare to those of the simple sugar concentration regression model. In order to see and compare the regression coefficient and the p-value, we need to look at the summary statistics of the multiple regression model.

The regression coefficient of tree species is 0.762622 in the multiple regression model. What this means is that when the predictor variable changes from ACRU (red maple) to ACSA (sugar maple), with tree thickness and sap weight held fixed, the sugar concentration, the response variable, goes up by about 0.762622 Brix. Compared to the simple sugar concentration model, the multiple regression model's coefficient is slightly higher, by about 0.032432 (0.762622 - 0.73019). The null hypothesis is that tree species has no effect on sugar concentration. We can reject the null hypothesis, because the p-value for tree species is reported as less than 2e-16 (that is, below 0.0000000000000002), which is far below the 0.05 threshold. This means that tree species has a statistically significant effect on sugar concentration, and the estimated difference is 0.762622 Brix.

Comparing the Goodness-of-Fit Statistics

The next thing is to look at the goodness-of-fit statistics for the multiple regression model to see what they mean and how they compare to those of the simple sugar concentration regression model. To see the statistics and compare them, we need to look at the multiple model's summary output.

This multiple regression model has poor goodness-of-fit statistics, much like the simple sugar concentration model. The first thing is that the RSE is 0.5875, which is low. What this tells me is that, if the residuals are roughly normally distributed, about 68% of them fall between -0.5875 and 0.5875. The main thing that hurts this model's credibility is the multiple R^2 of 0.1564 and the adjusted R^2 of 0.156. These are nowhere near 1. What these two R^2 values tell me is that the model explains only about 15.6% of the variation in sugar concentration, and the rest is left unexplained by these three predictors. Overall, this model's goodness-of-fit statistics are about the same as the simple regression model's. The multiple regression model only has a slightly lower RSE and negligibly higher multiple and adjusted R^2.
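The goodness-of-fit numbers discussed here can also be extracted directly from a fitted model. The sketch below does this on simulated data (the data frame `toy` and its values are assumptions, not the report's data) and checks what share of residuals actually falls within one RSE of zero.

```r
# Simulated stand-in data (hypothetical values, for illustration only).
set.seed(7)
n <- 300
toy <- data.frame(
  tree_species = factor(rep(c("ACRU", "ACSA"), each = n / 2)),
  sap_weight   = runif(n, 0, 20)
)
toy$sugar_concentration <- 1.8 + 0.75 * (toy$tree_species == "ACSA") -
  0.02 * toy$sap_weight + rnorm(n, sd = 0.6)

toy_model   <- lm(sugar_concentration ~ tree_species + sap_weight, data = toy)
fit_summary <- summary(toy_model)

rse           <- sigma(toy_model)          # residual standard error
r_squared     <- fit_summary$r.squared     # multiple R^2
adj_r_squared <- fit_summary$adj.r.squared # adjusted R^2

# Share of residuals within one RSE of zero; close to 0.68 when residuals are near normal.
share_within_one_rse <- mean(abs(resid(toy_model)) <= rse)
```

If the share is far from roughly two thirds on real data, that is a hint that the residuals are skewed or heavy-tailed and the plus-or-minus one RSE reading should be treated with caution.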

Comparing the Residual Distribution Visualizations

Up next, we want to see what the multiple regression model's residual distribution looks like and how it compares to that of the simple sugar concentration model. What we want to find out is whether there were any improvements when we moved from the simple to the multiple model. To test this, we need to create a histogram of the multiple model's residuals and compare it to the simple one:

# Add a `resid` column holding each observation's residual from the multiple model
maple_tap_w_sap_collection_data_full_w_multiple_resid <- maple_tap_w_sap_collection_data_full2 %>%
  add_residuals(maple_tap_and_sap_multiple_model)

# Histogram of the residuals; bins = 30 is the ggplot2 default, stated explicitly here
ggplot(maple_tap_w_sap_collection_data_full_w_multiple_resid) +
  geom_histogram(aes(x = resid), bins = 30) +
  labs(x = "residual",
       y = "number of observations",
       title = "Multiple Regression Model's Residual Distribution",
       caption = "Data obtained from https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-hfr.285.4")

I only see tiny improvements in the multiple regression model's residual distribution over the simple sugar concentration regression model's. The first improvement I see is that the peak looks a bit closer to 0, which is good. This distribution also looks a tiny bit more normally distributed around 0, but it is still skewed to the right by about the same amount as the simple model's. The right tail of this distribution also ends at about the same x-value as the simple model's.
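A visual comparison of two histograms can be backed up with a few numbers. The sketch below, on simulated right-skewed data, puts the median, maximum, and sample skewness of the residuals from a simple and a multiple model side by side. The data, the helper `skewness()`, and the model names are all hypothetical and only illustrate the idea.

```r
# Simulated stand-in data with right-skewed noise (hypothetical values).
set.seed(42)
n <- 200
toy <- data.frame(
  tree_species = factor(sample(c("ACRU", "ACSA"), n, replace = TRUE)),
  sap_weight   = runif(n, 0, 20)
)
toy$sugar_concentration <- 1.8 + 0.75 * (toy$tree_species == "ACSA") -
  0.02 * toy$sap_weight + (rexp(n, rate = 2) - 0.5)

simple_model   <- lm(sugar_concentration ~ tree_species, data = toy)
multiple_model <- lm(sugar_concentration ~ tree_species + sap_weight, data = toy)

# Sample skewness: third central moment divided by the cubed standard deviation.
# Positive values indicate a longer right tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

simple_resid   <- resid(simple_model)
multiple_resid <- resid(multiple_model)

residual_comparison <- data.frame(
  model    = c("simple", "multiple"),
  median   = c(median(simple_resid), median(multiple_resid)),
  max      = c(max(simple_resid), max(multiple_resid)),
  skewness = c(skewness(simple_resid), skewness(multiple_resid))
)
```

If the two rows of `residual_comparison` are nearly identical, that supports the reading that adding the extra predictors changed the shape of the residuals very little.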

How Does the Multiple Compare to the Simple Overall?

The second-to-last question I need to answer here is how the multiple regression model I created compares to the simple sugar concentration regression model and the two other simple models. More specifically, how does my multiple model match up against the simple ones, and is it really better? I also need to address how I created the multiple model in the first place. To fully compare all of the models, I need to look at everything that I covered previously in this section and in the simple model sections. This will help me see how they all come together.

Before I talk about how the multiple regression model compares to the simple sugar concentration model, I need to preface one thing. To make the multiple regression model, I found the mean of all the non-NA values in the sap weight and tree thickness columns, respectively, and then replaced all of the NAs in each of those two columns with that column's mean. I did this because I could not get the multiple regression model to compute otherwise. Keep in mind that mean imputation is a simplification that can understate the variability of those two predictors.

Now to how these two models compare. Overall, they are basically the same. There is not much separating them. They have very similar RSEs and R^2 values, and the regression coefficients for tree species differ only in the second decimal place. Even the p-values for the regression coefficient of tree species are both reported as less than 2e-16. Finally, the residual distribution histograms I made for both of them are also very similar to each other. This means that the confounding variables I chose do not have much of an impact on how tree species affects sugar concentration. We can just use the simple regression model and reach the same conclusion as with the multiple one.

I chose to use only sap weight and tree thickness because, in my opinion, these two are the only logical confounding variables that can affect sugar concentration based on tree species. Even though I know that my R^2 values are low, the good thing is that I was not trying every possible combination of confounding variables to over-fit my model to this particular data set. Instead of low R^2 values that show my model is not good, I would have had artificially high R^2 values that are just as bad, maybe worse.
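The mean-replacement step described in this section can be sketched as follows. This is a minimal illustration on a five-row made-up data frame, with a hypothetical helper `impute_mean()`, and is not the exact code used for the report.

```r
# Tiny made-up example with missing values in both columns (hypothetical numbers).
toy <- data.frame(
  sap_weight     = c(4.2, NA, 6.1, 5.0, NA),
  tree_thickness = c(55.0, 61.5, NA, 48.2, 70.3)
)

# Replace every NA in a numeric vector with the mean of its non-NA values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy_imputed <- toy
toy_imputed$sap_weight     <- impute_mean(toy$sap_weight)
toy_imputed$tree_thickness <- impute_mean(toy$tree_thickness)

# Mean imputation leaves the column mean unchanged but shrinks its variance,
# because the filled-in values sit exactly at the mean.
variance_before <- var(toy$sap_weight, na.rm = TRUE)
variance_after  <- var(toy_imputed$sap_weight)
```

The shrinking variance is the main design trade-off: the approach keeps every row available for the model, at the cost of making the imputed columns look less variable than they really are.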

How the Multiple Model Answers the Main Question

The final thing we want to do is look back at the main question that we have been trying to answer from the start. Does the non-masting red maple species exhibit muted dynamics compared to the masting sugar maple species? We can answer this by looking at the multiple regression model and the three simple regression models I have made. We can look at all of them because the multiple model is not that different from the simple sugar concentration model, so considering every model together is helpful.

What I can conclude from the multiple regression model and the three simple regression models, and from comparing the multiple and simple sugar concentration models, is that yes, the non-masting red maple species does exhibit muted dynamics compared to the masting sugar maple species. All of the simple models show that ACSA (sugar maple) has a higher sugar concentration, sap weight, and tree thickness than ACRU (red maple). The multiple regression model also shows that sugar maple trees have a higher sugar concentration than red maple trees. This shows that ACRU trees do exhibit muted dynamics compared to ACSA trees. The only caveat is that none of these models is very reliable. They all have either really low R^2 values, which are dead giveaways that these models explain little of the variation, poor RSE values, residual distribution histograms that are not close to normal, or a mixture of all of these. Overall, I think that these models, and their conclusions, are not strong enough on their own to claim anything definitive about whether the non-masting red maple species exhibits muted dynamics compared to the masting sugar maple species. At face value, they do suggest that the answer to the main question is yes. There could be a different kind of statistical model that is better suited to this question than these regression models.

Conclusion

In conclusion, based on everything that I have looked at, the non-masting red maple species does exhibit muted dynamics compared to the masting sugar maple species. The variables I found most likely to demonstrate whether a certain tree species is healthy or not are sugar concentration, sap weight, and tree thickness. All of this means that certain tree characteristics do have an effect on whether a tree has muted dynamics or not. The visualizations, the data set, and all of the simple and multiple regression models I have made support my answer to the main question of this report.

Works Cited

Hess, Anna, and Mark Hamilton. “Sugar Content of Maple Sap.” The Walden Effect, Branchable Wiki Hosting, 23 Feb. 2014, www.waldeneffect.org/blog/Sugar_content_of_maple_sap/#google_vignette (Accessed 2023-12-11).

Lancaster, Lynda. “Sweet Signs of Spring.” National Parks Service, U.S. Department of the Interior, 9 Apr. 2020, www.nps.gov/indu/learn/education/sweet-signs-of-spring.htm#:~:text=Maple%20sap%20%2D%20maple%20sap%20looks,sap%20weighs%208.34%20pounds%20per (Accessed 2023-12-11).

McLane, Eben. “Masting: Survival of the Seediest.” Finger Lakes Land Trust, fllt.org, 10 July 2011, www.fllt.org/cl-masting-survival-of-the-seediest/ (Accessed 2023-12-11).

Rapp, J., E. Crone, and K. Stinson. 2021. Maple Reproduction and Sap Flow at Harvard Forest since 2011 ver 4. Environmental Data Initiative. https://doi.org/10.6073/pasta/c74eba9dc8ddc41c19dc85e002a3f046 (Accessed 2023-12-11).

An Analysis of Three Tree Variables affecting Tree Species

Haris Sendijarevic

12/15/2023
