In this analysis we use the Harvard Forest maple data to ask whether our non-masting tree species (red maple) exhibits muted dynamics compared to our masting tree species (sugar maple). The data are available for public use as long as they are cited properly, which we do below. They are spread over fourteen files, each measuring its own aspect of the trees; file three, for example, records the sex of the tree and the flowering intensity. We will use only two of the fourteen files, because they are the only two that label both species, and a comparison between two species is only possible with data on both. We discuss these two data sets at length below.
Before we begin with the data sets, we load the libraries used in this analysis. tidyverse provides the plotting and data-wrangling functions. modelr provides helper functions for working with models; the linear models themselves are fit with base R’s lm function. readr imports the data sets into R. readxl reads Excel files, although both files used below are CSVs. And finally, DT provides the datatable function, which we use to display our data sets.
library(tidyverse)
library(modelr)
library(readr)
library(readxl)
library(DT)
Rapp, J., E. Crone, and K. Stinson. 2021. Maple Reproduction and Sap Flow at Harvard Forest since 2011 ver 4. Environmental Data Initiative. https://doi.org/10.6073/pasta/c74eba9dc8ddc41c19dc85e002a3f046 (Accessed 2022-12-12).
Now that we have loaded our libraries and cited our source, let’s talk about our data sets and what they contain. Our two data sets have 6936 entries (246 in the tap file and 6690 in the sap file) and 11 variables combined. The variables are:
date: the date of data collection, formatted as YYYY-MM-DD.
tree: the tree identifier, a prefix (HF for sugar maples, HFR or AR for red maples) followed by an ID number.
tap: the tap identifier, A or B, used for trees with two or more taps; labels are consistent within a year but not between years.
time: the time of data collection, formatted as hh:mm:ss.
species: the tree species, ACSA for sugar maple and ACRU for red maple.
dbh: the diameter of the tree at breast height, measured 1.4 m above the ground, in cm.
tap.bearing: the compass direction of the tap hole, facing into the tree.
tap.height: the height of the tap hole above the ground, in cm.
datetime: the date and time of the sap collection, formatted as YYYY-MM-DD hh:mm:ss.
sugar: the concentration of sugar in the sap, in percent by weight.
sap.wt: the weight of the sap that was collected.
Below are our two data sets.
Maple_Tap <- read_csv("C:/Users/Edin/Downloads/hf285-01-maple-tap.csv")
datatable(Maple_Tap)
And here is our second data set.
Maple_Sap <- read_csv("C:/Users/Edin/Downloads/hf285-02-maple-sap.csv")
datatable(Maple_Sap)
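The combined row and variable counts quoted above can be checked directly from the loaded tables. The sketch below shows the idea on two small made-up data frames; `tap_toy` and `sap_toy` are hypothetical stand-ins with invented values, not the real files.

```r
# Hypothetical stand-ins for Maple_Tap and Maple_Sap (made-up values).
tap_toy <- data.frame(
  date = as.Date(c("2012-02-14", "2013-02-22")),
  tree = c("HF1", "AR2"),
  species = c("ACSA", "ACRU"),
  dbh = c(63.8, NA)
)
sap_toy <- data.frame(
  date = as.Date(c("2012-02-14", "2012-02-15", "2013-02-22")),
  tree = c("HF1", "HF1", "AR2"),
  species = c("ACSA", "ACSA", "ACRU"),
  sugar = c(2.4, 2.6, 1.8)
)

# Total rows across both tables.
total_rows <- nrow(tap_toy) + nrow(sap_toy)

# Distinct variable names across both tables (shared columns counted once).
all_vars <- union(names(tap_toy), names(sap_toy))

total_rows
length(all_vars)
```

Applied to the real tables, the same two expressions should reproduce the 6936 entries and 11 variables stated earlier.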
Now that we have looked over our two data sets, we can discuss how we will determine whether our non-masting red maple exhibits muted dynamics compared to our masting sugar maple. We will proceed in the following order. First, we tidy each data set, merge the two, and do any further cleaning that is needed. Second, we plot sugar against year, and then diameter (dbh) against year, to see whether either changes over time. Third, we plot sugar against diameter to see whether the two are correlated. Fourth, we compare the two species on sugar and on diameter with box plots. Fifth, we build two linear regression models with species as the predictor, one with sugar as the response and one with diameter as the response. Finally, we give a concluding statement on our findings and see whether we can answer our question. Let’s jump right into it!
We’ll start by cleaning our Maple_Tap data set. Let’s use the summary function to see whether there are any entry errors or a large number of NAs for us to remove.
Maple_Tap <- read_csv("C:/Users/Edin/Downloads/hf285-01-maple-tap.csv")
summary(Maple_Tap)
## date tree tap species
## Min. :2012-02-14 Length:246 Length:246 Length:246
## 1st Qu.:2013-02-22 Class :character Class :character Class :character
## Median :2016-02-01 Mode :character Mode :character Mode :character
## Mean :2015-06-27
## 3rd Qu.:2017-02-18
## Max. :2018-02-14
##
## dbh tap.bearing tap.height
## Min. :31.00 Min. : 0.0 Min. : 50
## 1st Qu.:54.70 1st Qu.: 76.0 1st Qu.: 97
## Median :63.85 Median :180.0 Median :123
## Mean :62.87 Mean :175.3 Mean :120
## 3rd Qu.:74.88 3rd Qu.:263.0 3rd Qu.:141
## Max. :86.20 Max. :359.0 Max. :185
## NA's :120 NA's :7 NA's :7
There are no obvious data entry errors, but there is a large number of NAs in our dbh variable (120 of 246 rows). Let’s filter those out. We also rename some of our variables so they are easier to read.
# Drop rows with no diameter, then give the columns friendlier names.
`Maple Tap Clean` <- Maple_Tap %>%
  filter(!is.na(dbh)) %>%
  rename(
    "tap bearing" = tap.bearing,
    "tap height" = tap.height,
    "diameter" = dbh
  )
datatable(`Maple Tap Clean`)
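As a rough cross-check on the cleaning step, missing values can be counted per column before filtering. The sketch below uses a made-up table (`tap_toy` is a hypothetical name with invented values) rather than the real file.

```r
# Hypothetical toy table with the same kind of gaps as Maple_Tap.
tap_toy <- data.frame(
  tree = c("HF1", "HF2", "AR1", "AR2"),
  dbh = c(63.8, NA, 46.5, NA),
  tap.height = c(120, 97, NA, 141)
)

# Count missing values in every column.
na_counts <- colSums(is.na(tap_toy))
na_counts

# Rows that survive a filter on non-missing dbh.
kept <- tap_toy[!is.na(tap_toy$dbh), ]
nrow(kept)
```

In the real table the summary reports 120 missing dbh values out of 246 rows, so the filter should leave 126 rows.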
That is all the cleaning we need to do, for now, on our Maple_Tap data set. Let’s move on to cleaning Maple_Sap.
Maple_Sap <- read_csv("C:/Users/Edin/Downloads/hf285-02-maple-sap.csv")
summary(Maple_Sap)
## date tree tap time
## Min. :2012-02-14 Length:6690 Length:6690 Length:6690
## 1st Qu.:2014-02-21 Class :character Class :character Class1:hms
## Median :2016-02-04 Mode :character Mode :character Class2:difftime
## Mean :2015-08-03 Mode :numeric
## 3rd Qu.:2017-03-21
## Max. :2018-03-31
##
## datetime sugar species
## Min. :2012-02-14 16:25:00.00 Min. : 0.800 Length:6690
## 1st Qu.:2012-03-07 10:19:30.00 1st Qu.: 2.000 Class :character
## Median :2013-02-26 16:24:30.00 Median : 2.400 Mode :character
## Mean :2013-07-05 00:50:14.97 Mean : 2.471
## 3rd Qu.:2013-04-09 16:36:15.00 3rd Qu.: 2.900
## Max. :2018-03-31 16:02:00.00 Max. :22.000
## NA's :5804 NA's :798
## sap.wt
## Min. : 0.010
## 1st Qu.: 1.400
## Median : 2.980
## Mean : 3.746
## 3rd Qu.: 5.400
## Max. :18.740
## NA's :519
From the summary, we see a large number of NAs in our sugar variable (798). We also see that the maximum values of sugar (22) and sap.wt (18.74) sit far above their third quartiles (2.9 and 5.4). Let’s remove the NAs and check whether those high values are plausible or entry errors.
Maple_Sap_Check<- Maple_Sap%>%
arrange(desc(sugar))
head(Maple_Sap_Check, 10)
## # A tibble: 10 × 8
## date tree tap time datetime sugar species sap.wt
## <date> <chr> <chr> <time> <dttm> <dbl> <chr> <dbl>
## 1 2018-03-24 HF33 A NA NA 22 ACSA NA
## 2 2017-03-08 HF41 B NA NA 7.3 ACSA 0.52
## 3 2017-02-24 HF12 A NA NA 6 ACSA 1.78
## 4 2017-02-23 HF12 A NA NA 5.9 ACSA 1.92
## 5 2017-02-22 HF12 A NA NA 5.5 ACSA 2.96
## 6 2017-02-21 HF12 A NA NA 5.2 ACSA 0.96
## 7 2017-02-20 HF12 A NA NA 5.1 ACSA 1.12
## 8 2017-02-27 HF12 A NA NA 5.1 ACSA 3.44
## 9 2015-03-26 HF12 A NA NA 5 ACSA 0.74
## 10 2015-03-31 HF41 A NA NA 5 ACSA 5.26
Looking at the ten highest sugar values, we see a large gap between the top two: the values jump from 7.3% to 22%. The rest of the top ten sit between 5% and 7.3%, so the 22% reading is almost certainly an entry error. Let’s filter out the NAs and the erroneous sugar entry, and rename our sap.wt variable.
`Maple Sap Clean` <- Maple_Sap%>%
filter(!is.na(sugar))%>%
filter(sugar<15)%>%
rename("sap weight" = sap.wt)
datatable(`Maple Sap Clean`)
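The gap-based reasoning used to spot the suspect sugar reading can also be expressed in code. The sketch below works on a made-up vector of readings and shows two screening approaches: the gaps between sorted values, and the usual boxplot fence (Q3 + 1.5 × IQR). The fence is a common rule of thumb brought in for illustration, not the rule used in this analysis.

```r
# Made-up sugar readings (percent by weight); the last one mimics a suspect entry.
sugar <- c(1.8, 2.0, 2.4, 2.9, 5.0, 7.3, 22)

# Sort descending and look at the gaps between neighbouring values.
sorted <- sort(sugar, decreasing = TRUE)
gaps <- -diff(sorted)
gaps

# Flag values beyond the upper boxplot fence, Q3 + 1.5 * IQR.
upper_fence <- quantile(sugar, 0.75) + 1.5 * IQR(sugar)
flagged <- sugar[sugar > upper_fence]
flagged
```

A fence like this is only a screening aid. With the real quartiles from the summary (2.0 and 2.9) the fence would sit at 4.25 and would also flag the 5% to 7.3% readings, which is why inspecting the gap between the top values, as done above, is the more cautious choice here.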
The datetime variable is still present and is not needed, but we will remove it after we merge the two data sets.
`Full Maple Clean` <- full_join(`Maple Tap Clean`, `Maple Sap Clean`, by =c("date", "tap", "tree", "species"))
datatable(`Full Maple Clean`)
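Because the merge uses a full join, rows that appear in only one table are kept and padded with NA, which matters for the models later on. The sketch below illustrates this on two tiny made-up tables (`tap_toy` and `sap_toy` are hypothetical, with invented values) and contrasts it with an inner join.

```r
library(dplyr)

# Hypothetical miniature versions of the cleaned tap and sap tables.
tap_toy <- data.frame(
  date = as.Date(c("2012-02-14", "2013-02-22")),
  tree = c("HF1", "AR2"),
  tap = c("A", "A"),
  species = c("ACSA", "ACRU"),
  diameter = c(63.8, 46.5)
)
sap_toy <- data.frame(
  date = as.Date(c("2012-02-14", "2012-02-15", "2012-02-16")),
  tree = c("HF1", "HF1", "AR2"),
  tap = c("A", "A", "A"),
  species = c("ACSA", "ACSA", "ACRU"),
  sugar = c(2.4, 2.6, 1.8)
)

keys <- c("date", "tap", "tree", "species")

# full_join keeps every row from both tables; unmatched cells become NA.
full_toy <- full_join(tap_toy, sap_toy, by = keys)

# inner_join keeps only rows whose keys appear in both tables.
inner_toy <- inner_join(tap_toy, sap_toy, by = keys)

nrow(full_toy)
nrow(inner_toy)
sum(is.na(full_toy$diameter))
```

In the toy example only one date matches, so the full join has four rows and two of them have no diameter. The same mechanism explains why most rows of the real merged table are missing either diameter or sugar.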
Now let’s do some more cleaning. We remove the datetime variable and split our date variable so that year, month, and day each have their own column.
# Drop datetime by name rather than by position, so a change in column order cannot remove the wrong column.
`Full Maple Clean V2` <- `Full Maple Clean` %>%
  select(-datetime)
# Split YYYY-MM-DD on the dashes; convert = TRUE turns the pieces into integers.
`Full Maple Clean V3` <- `Full Maple Clean V2` %>%
  separate(date, into = c("year", "month", "day"), sep = "-", convert = TRUE) %>%
  arrange(year, month, day)
datatable(`Full Maple Clean V3`)
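For comparison, the year, month, and day can also be pulled out of a Date column with base R’s format function. This is only an alternative sketch on made-up dates, not the approach used in this analysis.

```r
# Made-up collection dates.
dates <- as.Date(c("2012-02-14", "2015-03-26", "2018-03-31"))

# Extract each component as text, then convert to integer.
parts <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = as.integer(format(dates, "%d"))
)
parts
```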
Now that we have our fully cleaned data set, let’s move on to answering our question.
Below is our code. To get one box per year from geom_boxplot, year has to be treated as a discrete variable, so we change it from an integer type to a character type.
Maple_year_boxplot<- `Full Maple Clean V3`%>%
mutate(year = as.character(year))
ggplot(Maple_year_boxplot) +
geom_boxplot(aes(year,sugar))
Our plot shows that 2015 had the highest median sugar percentage in our sap, while 2012 and 2016 appear to have the lowest. We can also say that 2017 had the widest spread of data points. Let’s move on to our visualization of our year and diameter variables.
The same issue applies here as in the previous plot, so we use the same workaround and reuse the data set in which year is a character variable. Below is our work.
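A reading taken from a box plot can be cross-checked numerically by summarising sugar within each year. The sketch below does this on a made-up table (`toy` is hypothetical, with invented values); note that the centre line of a box is the median, which need not equal the mean.

```r
library(dplyr)

# Made-up yearly sugar readings (percent by weight).
toy <- data.frame(
  year = c(2012, 2012, 2012, 2015, 2015, 2015),
  sugar = c(1.9, 2.1, 2.3, 2.6, 3.0, 3.1)
)

# One row per year with the centre and spread that a box plot displays.
by_year <- toy %>%
  mutate(year = as.character(year)) %>%
  group_by(year) %>%
  summarise(
    median_sugar = median(sugar, na.rm = TRUE),
    mean_sugar = mean(sugar, na.rm = TRUE),
    iqr_sugar = IQR(sugar, na.rm = TRUE)
  )
by_year
```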
ggplot(Maple_year_boxplot) +
geom_boxplot(aes(year,diameter))
Because some years have no diameter data, those columns are blank on our box plot. But, as we might have guessed, the diameter of the trees has gone up slightly through the years. Now, let’s move on to seeing whether there is a correlation between our sugar and diameter variables.
Below is our code.
ggplot(data = `Full Maple Clean V3`) +
geom_point(aes(diameter, sugar))
Looking at our scatter plot, we can see a slight upward trend: as diameter increases, sugar tends to increase slightly as well. This is plausible, since sugar is an essential building block for many plants. Glucose is energy, the fuel that allows plants to survive, so a larger tree may be expected to carry more of it. Now, let’s move on to how our species variable might influence our sugar and diameter variables.
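The visual impression of a trend can be backed by a correlation coefficient computed on the rows that have both values. The sketch below uses made-up diameter and sugar vectors with gaps, mimicking a merged table; it is an illustration, not a result from the real data.

```r
# Made-up diameter (cm) and sugar (% by weight) pairs, with gaps as in a merged table.
diameter <- c(40, 48, 55, 63, 70, NA, 82)
sugar    <- c(1.9, 2.1, NA, 2.5, 2.4, 2.8, 3.0)

# Number of complete pairs actually available.
n_pairs <- sum(complete.cases(diameter, sugar))

# Pearson correlation on the complete pairs only.
r <- cor(diameter, sugar, use = "complete.obs")

n_pairs
r
```

Checking the number of complete pairs matters here. The model summaries in this report list 5872 rows missing diameter and 107 rows missing sugar out of 5998 merged rows, which implies that only 19 rows carry both values, so any correlation between the two rests on a small sample.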
Firstly, let’s start with a quick box plot of our two species and their sugar contents.
ggplot(`Full Maple Clean V3`) +
geom_boxplot(aes(species,sugar))
As we can see from our box plot, ACSA (the masting species) typically has a higher sugar content than ACRU (the non-masting species). This makes sense: masting species produce a large quantity of seeds and need a lot of energy to do so, and sugar is one of the most efficient forms of energy available to them. On its own, the sugar comparison suggests that the masting species is the more active of the two. But let’s see what our linear model tells us about this.
sugar_model <- lm(sugar ~ species,`Full Maple Clean V3`)
summary(sugar_model)
##
## Call:
## lm(formula = sugar ~ species, data = `Full Maple Clean V3`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7656 -0.3656 -0.0656 0.3344 4.7344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.83537 0.02102 87.31 <2e-16 ***
## speciesACSA 0.73026 0.02258 32.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5894 on 5889 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1507
## F-statistic: 1046 on 1 and 5889 DF, p-value: < 2.2e-16
Our linear model tells us the following. The intercept, 1.84, is the average sugar content for ACRU, and the speciesACSA coefficient says that ACSA sap averages 0.73 percentage points more sugar than that. The very low p-values for both coefficients indicate that this difference is unlikely to be due to chance. Our R^2 value tells us that species explains 15.08% of the variation in sugar. With that explained, let’s move on to our diameter variable and how our species variable relates to it.
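With a single two-level predictor such as species, the coefficients of a linear model are just group means in disguise. The sketch below demonstrates this on a made-up table (`toy` is hypothetical, with invented values): the intercept matches the mean of the reference level and the speciesACSA coefficient matches the difference between the two group means.

```r
# Made-up sugar readings for the two species codes.
toy <- data.frame(
  species = c("ACRU", "ACRU", "ACRU", "ACSA", "ACSA", "ACSA"),
  sugar = c(1.7, 1.8, 2.0, 2.4, 2.6, 2.8)
)

fit <- lm(sugar ~ species, data = toy)
group_means <- tapply(toy$sugar, toy$species, mean)

# Intercept equals the mean of the reference level (ACRU, first alphabetically).
coef(fit)[["(Intercept)"]]
group_means[["ACRU"]]

# The speciesACSA coefficient equals the difference between the group means.
coef(fit)[["speciesACSA"]]
group_means[["ACSA"]] - group_means[["ACRU"]]
```

Read the same way, the real model output implies an average sugar content of about 1.84% for ACRU and about 2.57% (1.84 + 0.73) for ACSA.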
Let us start with our box plot.
ggplot(`Full Maple Clean V3`) +
geom_boxplot(aes(species,diameter))
Looking at our box plot, ACSA trees tend to have larger diameters than ACRU trees. This once again makes sense. ACSA is also the species with the higher sugar content, and given the importance of sugar for plant life discussed above, more sugar could help a tree support a larger trunk. Generally speaking, trees with a larger diameter are also more likely to produce more foliage than trees with a smaller diameter. A larger diameter means the tree is more stable and less likely to fall over, so it can carry more leaves and convert more sunlight into energy, which makes it more dynamic than a tree with a smaller diameter. Let’s move on to our linear model.
diameter_model <- lm(diameter ~ species,`Full Maple Clean V3`)
summary(diameter_model)
##
## Call:
## lm(formula = diameter ~ species, data = `Full Maple Clean V3`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.403 -8.648 -2.103 10.697 21.017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.583 2.405 19.373 < 2e-16 ***
## speciesACSA 20.120 2.673 7.528 9.19e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.78 on 124 degrees of freedom
## (5872 observations deleted due to missingness)
## Multiple R-squared: 0.3137, Adjusted R-squared: 0.3081
## F-statistic: 56.67 on 1 and 124 DF, p-value: 9.191e-12
Looking at our linear model, we see that both coefficient p-values are well below the 0.05 threshold, so both estimates are statistically distinguishable from zero. The intercept tells us that ACRU trees average 46.58 cm in diameter, and the species coefficient tells us that ACSA trees average 20.12 cm more than that. Our R^2 value is .3137, meaning species explains 31.37% of the variation in diameter. That is about double the share explained in our previous model, although this model rests on only 126 observations. Since the p-value of our species coefficient is below the 0.05 threshold, we can claim that species is associated with diameter. The same goes for our previous linear model, where sugar is our response and species is our predictor.
The quick answer is yes. The longer answer is a little more complex. Sugar content and diameter were the only variables we could compare in this analysis, and both paint a picture in which the masting species is more active than the non-masting species. There were 12 other data sets we could have looked into, but they only had data on the masting species, and a comparison is impossible with only one side of the story. What we worked with was enough for us to at least scratch the surface of this question. To reiterate, sugar helps a tree survive and be more lively, since sugar is energy and energy is needed for a tree to be more dynamic. Diameter follows the same line of thinking: a larger trunk gives more area for growth, and with a more stable core the tree can put more into growing branches and leaves to collect sunlight.
A lot was discussed and done in this analysis. We started by cleaning the two data sets and merging them. We then did some more cleaning, since one variable was a carbon copy of another. Next we built visualizations of year against sugar and year against diameter to see whether either variable changed over time, plotted sugar against diameter, and then compared sugar and diameter across species. We then built two linear models with species as the predictor, one with sugar and one with diameter as the response, and discussed what their values meant. Finally, we gave our concluding statement: it seems that the non-masting species exhibits muted dynamics compared to our masting species.
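The R^2 value reported by summary is the share of the response’s variation that the model accounts for, and it can be reproduced by hand as one minus the ratio of residual to total sum of squares. The sketch below shows the calculation on a made-up table (`toy` is hypothetical, with invented values).

```r
# Made-up diameters (cm) for the two species codes.
toy <- data.frame(
  species = c("ACRU", "ACRU", "ACRU", "ACSA", "ACSA", "ACSA"),
  diameter = c(40, 46, 52, 58, 66, 74)
)
fit <- lm(diameter ~ species, data = toy)

# Residual sum of squares: variation left over after fitting species.
rss <- sum(residuals(fit)^2)

# Total sum of squares: variation of diameter around its overall mean.
tss <- sum((toy$diameter - mean(toy$diameter))^2)

r2_manual <- 1 - rss / tss

r2_manual
summary(fit)$r.squared
```

The two printed values agree, which is why R^2 is best read as "share of variation in the response explained by the predictor" rather than as a measure of how much of the model is explained.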