In this analysis we use data from Harvard Forest to ask whether our non-masting tree species (red maple) exhibits muted dynamics compared to our masting tree species (sugar maple). The data are publicly available as long as they are cited properly, which we do below. They are spread over fourteen files, each measuring its own aspect of the trees. File three, for example, records the sex of the tree and its flowering intensity. We will use only two of the fourteen data sets, because only two of them have both species labeled, and a comparison between two species is only possible with data on both. We discuss the two data sets at length below. Before that, let’s load the libraries we will use in this analysis.

Loading Libraries

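Before we can begin with our data sets, we first have to load the libraries that will help us in this analysis. Here is what each one is used for. tidyverse provides our data wrangling and plotting tools (including dplyr, tidyr, and ggplot2). modelr provides helper functions for working with models, while the linear models themselves are fit with base R’s lm function. readr lets us import the CSV data sets into R (it is also loaded as part of tidyverse). readxl reads Excel files, and it is loaded here although the two data sets below are both CSV files. And finally, DT gives us the datatable function to display our data sets.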

library(tidyverse)
library(modelr)
library(readr)
library(readxl)
library(DT)

Source

Rapp, J., E. Crone, and K. Stinson. 2021. Maple Reproduction and Sap Flow at Harvard Forest since 2011 ver 4. Environmental Data Initiative. https://doi.org/10.6073/pasta/c74eba9dc8ddc41c19dc85e002a3f046 (Accessed 2022-12-12).

What is in Our Data Sets?

Now that we have loaded our libraries and cited our source, let’s talk about our data sets and what they contain. Combined, our two data sets have 6936 entries (246 in Maple_Tap and 6690 in Maple_Sap) and 11 distinct variables. Here is what each variable means.

date: the date of data collection, formatted as YYYY-MM-DD.
tree: the tree ID, made up of a prefix (HF for sugar maples, HFR or AR for red maples) followed by an ID number.
tap: the tap identifier, A or B, used when a tree was tapped more than once. Identifiers are consistent within years but not between years.
time: the time of data collection, formatted as hh:mm:ss.
species: the tree species. ACSA is our sugar maple and ACRU is our red maple.
dbh: the diameter of the tree at breast height, measured 1.4 m above the ground, in cm.
tap.bearing: the compass direction the tap hole faces, looking into the tree.
tap.height: the height of the tap hole above the ground, in cm.
datetime: the date and time of the sap collection, formatted as YYYY-MM-DD hh:mm:ss.
sugar: the concentration of sugar in the sap, in percent by weight.
sap.wt: the weight of the sap that was collected.

Below are our two data sets.

Maple_Tap Data Table

Maple_Tap <- read_csv("C:/Users/Edin/Downloads/hf285-01-maple-tap.csv")
datatable(Maple_Tap)

And here is our second data set.

Maple_Sap Data Table

Maple_Sap <- read_csv("C:/Users/Edin/Downloads/hf285-02-maple-sap.csv")
datatable(Maple_Sap)

How Will we Answer Our Question?

Now that we have looked over our two data sets, we can discuss how we will determine whether our non-masting red maple exhibits muted dynamics compared to our masting sugar maple. We will take the following steps, in order. First, we tidy up each data set, merge the two, and do some further cleaning on the merged result. Second, we plot sugar against year, and then diameter (dbh) against year, to see whether either changes over time. Third, we plot sugar against diameter to see whether the two are correlated. Fourth, we compare the two species on sugar and on diameter with box plots, and then build two linear regression models with species as the predictor and sugar and diameter as the separate responses. Finally, we give a concluding statement on our findings and see whether we can answer our question. Let’s jump right into it!

Tidied up Data Sets

We’ll start by cleaning up our Maple_Tap data set. Let’s use the summary function to check for entry errors or a large number of NAs that we should remove.

Maple Tap Cleaning

Maple_Tap <- read_csv("C:/Users/Edin/Downloads/hf285-01-maple-tap.csv")
summary(Maple_Tap)
##       date                tree               tap              species         
##  Min.   :2012-02-14   Length:246         Length:246         Length:246        
##  1st Qu.:2013-02-22   Class :character   Class :character   Class :character  
##  Median :2016-02-01   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2015-06-27                                                           
##  3rd Qu.:2017-02-18                                                           
##  Max.   :2018-02-14                                                           
##                                                                               
##       dbh         tap.bearing      tap.height 
##  Min.   :31.00   Min.   :  0.0   Min.   : 50  
##  1st Qu.:54.70   1st Qu.: 76.0   1st Qu.: 97  
##  Median :63.85   Median :180.0   Median :123  
##  Mean   :62.87   Mean   :175.3   Mean   :120  
##  3rd Qu.:74.88   3rd Qu.:263.0   3rd Qu.:141  
##  Max.   :86.20   Max.   :359.0   Max.   :185  
##  NA's   :120     NA's   :7       NA's   :7

There aren’t any obvious data entry errors, but our dbh variable has a large number of NAs (120 of 246 rows). Let’s filter those out. Let us also rename some of our variables so they are easier to read.

# Drop rows with no diameter measurement and give the columns friendlier names.
`Maple Tap Clean` <- Maple_Tap %>%
  filter(!is.na(dbh)) %>%
  rename(
    "tap bearing" = tap.bearing,
    "tap height" = tap.height,
    "diameter" = dbh
  )

datatable(`Maple Tap Clean`)
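
The NA counts that summary reports can also be pulled out as a single row, which is handy when a data set has many columns. The following is a minimal sketch on a made-up stand-in for Maple_Tap (toy_tap and its values are assumptions for illustration, not the real data).

```r
library(dplyr)

# Toy stand-in for Maple_Tap; the values are invented for illustration.
toy_tap <- tibble(
  tree = c("HF1", "HF2", "HFR1", "HFR2"),
  dbh = c(60.2, NA, 45.1, NA),
  tap.height = c(120, 110, NA, 95)
)

# Count the missing values in every column at once.
na_counts <- toy_tap %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The same summarise call should work unchanged on the real Maple_Tap tibble.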

That is all the cleaning we need, for now, for our Maple_Tap data set. Let’s move on to cleaning Maple_Sap.

Maple Sap Cleaning

Maple_Sap <- read_csv("C:/Users/Edin/Downloads/hf285-02-maple-sap.csv")
summary(Maple_Sap)
##       date                tree               tap                time         
##  Min.   :2012-02-14   Length:6690        Length:6690        Length:6690      
##  1st Qu.:2014-02-21   Class :character   Class :character   Class1:hms       
##  Median :2016-02-04   Mode  :character   Mode  :character   Class2:difftime  
##  Mean   :2015-08-03                                         Mode  :numeric   
##  3rd Qu.:2017-03-21                                                          
##  Max.   :2018-03-31                                                          
##                                                                              
##     datetime                          sugar          species         
##  Min.   :2012-02-14 16:25:00.00   Min.   : 0.800   Length:6690       
##  1st Qu.:2012-03-07 10:19:30.00   1st Qu.: 2.000   Class :character  
##  Median :2013-02-26 16:24:30.00   Median : 2.400   Mode  :character  
##  Mean   :2013-07-05 00:50:14.97   Mean   : 2.471                     
##  3rd Qu.:2013-04-09 16:36:15.00   3rd Qu.: 2.900                     
##  Max.   :2018-03-31 16:02:00.00   Max.   :22.000                     
##  NA's   :5804                     NA's   :798                        
##      sap.wt      
##  Min.   : 0.010  
##  1st Qu.: 1.400  
##  Median : 2.980  
##  Mean   : 3.746  
##  3rd Qu.: 5.400  
##  Max.   :18.740  
##  NA's   :519

From our summary function, we see a large number of NAs in our sugar variable. We also see a very high maximum for our sugar variable (22), as well as a high maximum for our sap.wt variable (18.74). Let us remove the NAs and check whether the high sugar value is plausible or an entry error.

Maple_Sap_Check<- Maple_Sap%>%
  arrange(desc(sugar))

head(Maple_Sap_Check, 10)
## # A tibble: 10 × 8
##    date       tree  tap   time   datetime sugar species sap.wt
##    <date>     <chr> <chr> <time> <dttm>   <dbl> <chr>    <dbl>
##  1 2018-03-24 HF33  A        NA  NA        22   ACSA     NA   
##  2 2017-03-08 HF41  B        NA  NA         7.3 ACSA      0.52
##  3 2017-02-24 HF12  A        NA  NA         6   ACSA      1.78
##  4 2017-02-23 HF12  A        NA  NA         5.9 ACSA      1.92
##  5 2017-02-22 HF12  A        NA  NA         5.5 ACSA      2.96
##  6 2017-02-21 HF12  A        NA  NA         5.2 ACSA      0.96
##  7 2017-02-20 HF12  A        NA  NA         5.1 ACSA      1.12
##  8 2017-02-27 HF12  A        NA  NA         5.1 ACSA      3.44
##  9 2015-03-26 HF12  A        NA  NA         5   ACSA      0.74
## 10 2015-03-31 HF41  A        NA  NA         5   ACSA      5.26

Looking at the output of head, we can see a massive gap between the top two values of our sugar variable: it jumps from 7.3% to 22%. Because of this large gap, we treat the 22% reading as an entry error. Let us filter out the NAs and the error entry for our sugar variable, and rename our sap.wt variable. We leave the values of sap.wt themselves unchanged.

# Drop missing sugar readings and the implausible 22% entry, then rename sap.wt.
`Maple Sap Clean` <- Maple_Sap %>%
  filter(!is.na(sugar)) %>%
  filter(sugar < 15) %>%
  rename("sap weight" = sap.wt)

datatable(`Maple Sap Clean`)
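
One detail worth knowing about the cleaning step above: dplyr’s filter keeps only the rows where the condition is TRUE, so a row whose sugar is NA is dropped by the sugar < 15 condition as well. The explicit is.na filter is therefore optional, though keeping it makes the intent clearer. A small sketch on invented data (toy_sap is a hypothetical stand-in, not the real Maple_Sap):

```r
library(dplyr)

# Toy stand-in for Maple_Sap; the values are invented for illustration.
toy_sap <- tibble(
  tree = c("HF1", "HF2", "HF3", "HFR1"),
  sugar = c(2.4, NA, 22, 1.9)
)

# The NA row and the 22% row are both removed by this single condition,
# because NA < 15 evaluates to NA rather than TRUE.
toy_clean <- toy_sap %>%
  filter(sugar < 15)

toy_clean
```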

We still have the datetime variable, which is unnecessary, but we will remove it after we merge the two data sets.

Merging The Two Data Sets

`Full Maple Clean` <- full_join(`Maple Tap Clean`, `Maple Sap Clean`, by =c("date", "tap", "tree", "species"))

datatable(`Full Maple Clean`)
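
Because date is one of the join keys, a sap record only picks up a diameter when the tap table has a row for the same tree on the same date. Every other sap record gets NA for the tap columns. This may be why so many rows have no diameter in the models further down. The sketch below shows the effect on two invented tables (toy_tap, toy_sap, and their values are assumptions for illustration).

```r
library(dplyr)

# Toy tap table: one diameter per tree, recorded on the tapping date.
toy_tap <- tibble(
  date = as.Date(c("2012-02-14", "2012-02-14")),
  tree = c("HF1", "HFR1"),
  diameter = c(60.2, 45.1)
)

# Toy sap table: HF1 has a second reading on a later date.
toy_sap <- tibble(
  date = as.Date(c("2012-02-14", "2012-02-15", "2012-02-14")),
  tree = c("HF1", "HF1", "HFR1"),
  sugar = c(2.4, 2.6, 1.9)
)

toy_full <- full_join(toy_tap, toy_sap, by = c("date", "tree"))

# The 2012-02-15 reading has no matching tap row, so its diameter is NA.
toy_full %>% count(missing_diameter = is.na(diameter))
```

Running the same count on the real merged data would show how many sap readings ended up without a diameter.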

Now let’s do some more cleaning here: we remove the datetime variable and split our date variable so that year, month, and day each have their own column.

# Drop datetime (column 9) by name rather than by position.
`Full Maple Clean V2` <- `Full Maple Clean` %>%
  select(-datetime)

# Split YYYY-MM-DD on the dashes; convert = TRUE turns the pieces into integers.
`Full Maple Clean V3` <- `Full Maple Clean V2` %>%
  separate(date, into = c("year", "month", "day"), sep = "-", convert = TRUE) %>%
  arrange(year, month, day)

datatable(`Full Maple Clean V3`)
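
An alternative way to get year, month, and day columns is to extract them from the Date column with lubridate, which keeps the original date column intact. This is only a sketch of that alternative on a toy tibble (toy and toy_split are hypothetical names), not what the analysis above uses.

```r
library(dplyr)
library(lubridate)

# Toy data with a Date column like the one in our merged data set.
toy <- tibble(date = as.Date(c("2012-02-14", "2018-03-31")))

# year(), month(), and day() each return a numeric component of the date.
toy_split <- toy %>%
  mutate(
    year = year(date),
    month = month(date),
    day = day(date)
  )

toy_split
```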

Now that we have our fully cleaned up data set, let’s move on to answering our question.

How Has Our sugar Variable Changed Over the Years?

Below is our code. To get one box per year from geom_boxplot, year has to be a discrete variable, so we first convert it from an integer to a character type.

Maple_year_boxplot<- `Full Maple Clean V3`%>%
 mutate(year = as.character(year))

ggplot(Maple_year_boxplot) +
  geom_boxplot(aes(year,sugar))

Our plot shows us that 2015 had the highest median sugar percentage in our sap, while 2012 and 2016 appear to be the years with the lowest. We can also say that 2017 had the widest spread of data points. Let’s move on to our visualization of our year and diameter variables.
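
A reading of a box plot can be backed up with numbers by computing the median and spread per year directly. The sketch below does this on invented values (toy and yearly are hypothetical names). On the real data, the same pipeline would start from the merged, cleaned data set.

```r
library(dplyr)

# Toy data: a few sugar readings in two years, one of them missing.
toy <- tibble(
  year = c(2012, 2012, 2012, 2015, 2015, 2015),
  sugar = c(2.0, 2.2, 2.4, 2.8, 3.0, NA)
)

# Median (the line inside each box), IQR (the height of each box),
# and the number of non-missing readings per year.
yearly <- toy %>%
  group_by(year) %>%
  summarise(
    median_sugar = median(sugar, na.rm = TRUE),
    iqr_sugar = IQR(sugar, na.rm = TRUE),
    n = sum(!is.na(sugar))
  )

yearly
```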

How Has Our diameter Variable Changed Over the Years?

We have the same issue as in the previous part, and we use the same workaround: we plot from Maple_year_boxplot, in which year is already a character variable. Below is our work.

ggplot(Maple_year_boxplot) +
  geom_boxplot(aes(year,diameter))

Because some of the years have no diameter data, those columns are blank on our box plot. But, as we might have guessed, the diameter of the trees has increased slightly through the years. Now, let’s move on to seeing whether there is a correlation between our sugar and diameter variables.

Is There a Correlation Between Our sugar and diameter Variables?

Below is our code.

ggplot(data = `Full Maple Clean V3`) +
  geom_point(aes(diameter, sugar)) 

Looking at our scatter plot, we can see a slight upward trend: as our diameter variable increases, our sugar variable tends to increase slightly as well. This is plausible, since sugar is an essential building block for plants, and glucose is the fuel that allows them to survive and grow. Unlike humans, who can readily obtain energy from food, plants depend on the sugar they produce themselves. Keep in mind that the scatter plot only shows an association and does not tell us why the two move together. Now, let’s move on to how our species variable might influence our sugar and diameter variables.

How Does Our species Variable Affect Our sugar and diameter Variables?
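
A scatter plot shows the trend visually, and a correlation coefficient puts a number on it. The sketch below computes Pearson’s correlation with base R’s cor on invented values (toy and r are hypothetical names, and the numbers are not from our data). The use = "complete.obs" argument matters for data like ours, where many rows are missing one of the two variables.

```r
# Toy data: diameter and sugar for five trees, one diameter missing.
toy <- data.frame(
  diameter = c(40, 50, 60, 70, NA),
  sugar = c(1.8, 2.1, 2.3, 2.9, 2.5)
)

# use = "complete.obs" drops rows where either value is missing;
# without it, cor() would return NA here.
r <- cor(toy$diameter, toy$sugar, use = "complete.obs")

r
```

A value near 0 would indicate little linear association, while values near 1 or -1 indicate a strong one.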

sugar Section

Firstly, let’s start with a quick box plot of our two species and their sugar contents.

ggplot(`Full Maple Clean V3`) +
  geom_boxplot(aes(species,sugar))

As we can see from our box plot, ACSA (the masting species) typically has a higher sugar content than ACRU (the non-masting species). One possible explanation is that masting species produce large seed crops and need a lot of energy to do so, and sugar is one of the most efficient forms of energy for that purpose. On sugar content alone, the masting species looks more active than its non-masting counterpart. But let’s see what our linear model tells us about this.

sugar_model <- lm(sugar ~ species,`Full Maple Clean V3`)

summary(sugar_model)
## 
## Call:
## lm(formula = sugar ~ species, data = `Full Maple Clean V3`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7656 -0.3656 -0.0656  0.3344  4.7344 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.83537    0.02102   87.31   <2e-16 ***
## speciesACSA  0.73026    0.02258   32.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5894 on 5889 degrees of freedom
##   (107 observations deleted due to missingness)
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1507 
## F-statistic:  1046 on 1 and 5889 DF,  p-value: < 2.2e-16

Our linear model tells us the following. The intercept, 1.84, is the mean sugar percentage for the reference species, ACRU. The speciesACSA coefficient tells us that ACSA sap is, on average, 0.73 percentage points higher in sugar than ACRU sap. The very low p-values indicate that both coefficients are significantly different from zero. The R^2 value tells us that species explains 15.08% of the variance in sugar. With that explained, let’s move on to our diameter variable and how our species variable relates to it.
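
As a side note before moving on, it can help to see why the coefficients of a model with a single two-level predictor read the way they do. In such a model the intercept is the mean of the reference group and the other coefficient is the difference between the two group means. For the output above, that gives a fitted ACSA mean of 1.83537 + 0.73026 = 2.56563. The sketch below checks the same relationship on invented numbers (toy, toy_model, and group_means are hypothetical names).

```r
# Toy data: three sugar readings per species, values invented.
toy <- data.frame(
  species = c("ACRU", "ACRU", "ACRU", "ACSA", "ACSA", "ACSA"),
  sugar = c(1.7, 1.8, 1.9, 2.4, 2.6, 2.8)
)

toy_model <- lm(sugar ~ species, data = toy)

# Group means computed directly, for comparison with the coefficients.
group_means <- tapply(toy$sugar, toy$species, mean)

# Intercept equals the ACRU mean; speciesACSA equals the ACSA mean
# minus the ACRU mean.
coef(toy_model)
group_means
```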

diameter Section

Let us start with our box plot.

ggplot(`Full Maple Clean V3`) +
  geom_boxplot(aes(species,diameter))

Looking at our box plot, the two species differ in diameter: ACSA trees are generally larger than ACRU trees. This fits with what we saw for sugar. Since ACSA also has the higher sugar content, and we have discussed the importance of sugar for plant life, it is plausible that these trees can support more growth. Generally speaking, trees with a larger diameter are also more likely to carry more foliage, because a thicker trunk is more stable and less likely to fall over, so the tree can produce more leaves to convert sunlight into energy. In that sense a larger tree is more dynamic than a smaller one. Let’s move on to our linear model.

diameter_model <- lm(diameter ~ species,`Full Maple Clean V3`)

summary(diameter_model)
## 
## Call:
## lm(formula = diameter ~ species, data = `Full Maple Clean V3`)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.403  -8.648  -2.103  10.697  21.017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   46.583      2.405  19.373  < 2e-16 ***
## speciesACSA   20.120      2.673   7.528 9.19e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.78 on 124 degrees of freedom
##   (5872 observations deleted due to missingness)
## Multiple R-squared:  0.3137, Adjusted R-squared:  0.3081 
## F-statistic: 56.67 on 1 and 124 DF,  p-value: 9.191e-12

Looking at our linear model, both coefficient p-values are well below the 0.05 threshold, so the coefficients are significantly different from zero. The intercept tells us that the mean diameter of ACRU trees is 46.58 cm, and the speciesACSA coefficient tells us that ACSA trees are, on average, 20.12 cm larger. The R^2 value is 0.3137, meaning species explains 31.37% of the variance in diameter, about double what we got from our previous model. Note, however, that this model is fit on only 126 observations, because 5872 rows have no diameter value, so it rests on far less data than the sugar model. With the species coefficient significant in both models, we can say that species is associated with both diameter and sugar.
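
Since lm silently drops rows with missing values, it is worth checking how many observations a model actually used. In the output above this can be read off as the residual degrees of freedom plus the number of coefficients (124 + 2 = 126). The sketch below shows the same check with base R’s nobs on invented data (toy and toy_model are hypothetical names).

```r
# Toy data: six trees, two of them with no diameter recorded.
toy <- data.frame(
  species = c("ACRU", "ACRU", "ACSA", "ACSA", "ACSA", "ACRU"),
  diameter = c(45, NA, 66, NA, 70, 48)
)

toy_model <- lm(diameter ~ species, data = toy)

nobs(toy_model)  # rows actually used in the fit
nrow(toy)        # rows supplied to lm()
```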

Does The Non-masting Species Exhibit Muted Dynamics Compared to Our Masting Species?

The quick answer is yes. The longer answer is a little more complex. Sugar content and diameter were the only variables we could compare in this analysis, and on both of them the masting species appears more active than the non-masting species. There were 12 other data sets we could have looked into, but they only had data on the masting species, so no comparison between the two species was possible with them. What we worked with was enough to at least scratch the surface of this question. To reiterate our reasoning: sugar is energy, and energy is what allows a tree to be more dynamic, so a higher sugar content suggests a livelier tree. Diameter follows the same line of thinking. A larger, more stable trunk gives the tree more area to grow from and lets it invest more in the branches and leaves that collect sunlight.

Conclusion

A lot was discussed and done in this analysis. We started by cleaning the two data sets we were going to use and then merged them. We did some more cleaning on the merged data, since it contained a variable (datetime) that was a carbon copy of information in other variables. We then built visualizations of year against sugar and year against diameter to see whether either changed over time, plotted sugar against diameter, and compared the two species on both variables with box plots. Next we built two linear models with species as the predictor and sugar and diameter as the separate responses, and discussed what their values meant. Finally, we gave our concluding statement: on the evidence we have, the non-masting species seems to exhibit muted dynamics compared to our masting species.