Assignment 1

The code in the chunk above sets the global behaviour for all R code chunks in this document. The “include=FALSE” part of the code will hide the section above when you knit this document.

The code in the chunk above loads the library and data that you need.

Question 1 – Forest composition

#1A
table(trees$forestType) #count trees in each forest type
## 
##   BF  DDF DDFP  DEF DMDF  LMF MMDF   PF 
##  116 3087  316  429  907  168  546  756
#1B
total_trees <- sum(table(trees$forestType)) #total number of trees in the dataset
DDFpercentage <- max(table(trees$forestType))/total_trees #share of the largest forest type (DDF)
DDFpercentage
## [1] 0.4880632

1A) Which forest type has the largest number of trees?
DDF (deciduous dipterocarp forest) has the largest number of trees, 3087.

1B) What percentage of the dataset do trees in this forest type represent?

DDF accounts for 48.81% of all trees in the dataset (3087 of 6325). The trees are therefore not evenly distributed across forest types: with 8 forest types, an even split would put about 12.5% of the trees in each, yet DDF alone makes up nearly half. The table in #1A shows the same imbalance, with counts ranging from 116 (BF) to 3087 (DDF).
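
The percentage above can be reproduced from the counts printed in #1A. The sketch below re-enters those counts by hand as a named vector; the names `counts`, `shares` and `largest` are only illustrative, and with the real data `table(trees$forestType)` would take the place of the hand-typed vector.

```r
# Counts copied from the table() output in #1A (typed in by hand for this sketch)
counts <- c(BF = 116, DDF = 3087, DDFP = 316, DEF = 429,
            DMDF = 907, LMF = 168, MMDF = 546, PF = 756)

# Percentage of all trees in each forest type, rounded to 2 decimal places
shares <- round(100 * counts / sum(counts), 2)

# Name of the forest type with the most trees
largest <- names(which.max(counts))

largest
shares[largest]
```

Looking the largest type up with `which.max()` avoids typing 3087 into the calculation, so the result stays correct if the data change.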

Question 2 – Tree size characteristics

#2A. mean dbh
mean(trees$dbh)
## [1] 40.18248
#2B. median dbh
median(trees$dbh)
## [1] 34.5
#2C. minimum tree height
min(trees$height)
## [1] 2
#2D. maximum tree height
max(trees$height)
## [1] 52.5

2A) dbh mean
40.18 cm

2B) dbh median
34.50 cm

2C) tree height minimum
2 m

2D) tree height maximum
52.50 m

Question 3 – Mapping vs setting aesthetics

p <- ggplot(data = trees, mapping = aes(x = dbh, y = height))
#3A, Version 1: setting colour to a fixed value (outside aes)
p + geom_point(colour = "green") +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships in Thai forests",
       caption = "Source: Thai ForestGEO")

#3A, Version 2: mapping colour inside aes()
p + geom_point(aes(colour = "green")) +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships in Thai forests",
       caption = "Source: Thai ForestGEO")

3A) How do the plots differ?
The first plot (Version 1) draws the points in green. The second plot (Version 2) draws the points in red and adds a colour legend with a single entry labelled “green”.

3B) Why does this happen?
In Version 1, colour = "green" is given outside aes(), so it is set as a fixed property of the point layer (geom_point) and every point is drawn in green. In Version 2, colour = "green" is given inside aes(), so ggplot treats “green” as a data value to be mapped to the colour scale rather than as the name of a colour. Because that value is the same for every row, all points fall into a single group labelled “green”, which is given the first colour of the default palette (red) and its own legend entry. In ggplot, aes() is for mapping variables in the data to visual properties, while fixed appearance settings belong outside aes(), as arguments to the geom.
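
The difference between setting and mapping can also be checked without looking at the plot. The sketch below is a minimal example that assumes ggplot2 is installed and uses a small made-up data frame (`toy`) rather than the trees data; `layer_data()` returns the data ggplot2 actually uses to draw a layer, including the colour of each point.

```r
library(ggplot2)

# Hypothetical toy data, standing in for the trees data frame
toy <- data.frame(dbh = c(10, 20, 30), height = c(5, 12, 18))
p_toy <- ggplot(toy, aes(x = dbh, y = height))

# Version 1: colour set as a fixed value outside aes()
set_colour <- p_toy + geom_point(colour = "green")

# Version 2: the string "green" mapped as if it were a data value
mapped_colour <- p_toy + geom_point(aes(colour = "green"))

# Colours ggplot2 will actually draw for each version
drawn_set <- unique(layer_data(set_colour)$colour)
drawn_mapped <- unique(layer_data(mapped_colour)$colour)

drawn_set     # the colour we asked for
drawn_mapped  # a colour taken from the default palette, not "green"
```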

Question 4 – DBH–height relationships by forest type

#grouping data showing the relationship between DBH and height for each forest type
ggplot(trees, aes(x = dbh, y = height, colour = forestType)) +
  geom_point() +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships in Thai forests",
       caption = "Source: Thai ForestGEO")

#4 faceted plot showing the relationship between DBH and height for each forest type
p <- ggplot(data = trees, mapping = aes(x = dbh, y = height))
p + geom_point() + facet_wrap(~forestType) +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships between Forest Types, in Thai forests",
       caption = "Source: Thai ForestGEO")

#4A faceted plot + smooth curve (for easier interpretation)
p <- ggplot(data = trees, mapping = aes(x = dbh, y = height))
p + geom_smooth() + facet_wrap(~forestType) +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships between Forest Types, in Thai forests",
       caption = "Source: Thai ForestGEO")

4A) Which forest type appears to have the tallest trees for a given DBH?
DDFP (deciduous dipterocarp and pine forest) appears to have the tallest trees for a given DBH: at a DBH of about 60 cm its trees reach roughly 45 m, and no other forest type exceeds that height at a similar DBH.
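
Reading height at a fixed DBH off the smooth curves can also be done numerically. The sketch below is only an illustration on simulated data: the two groups "A" and "B", the simulated heights and the choice of 60 cm are all assumptions, and loess is used here although geom_smooth() may pick a different smoother depending on the number of observations.

```r
set.seed(42)

# Simulated stand-in for the trees data: two hypothetical forest types
toy <- data.frame(
  forestType = rep(c("A", "B"), each = 50),
  dbh = runif(100, min = 5, max = 100)
)
# Type A is simulated to be taller than type B at any given DBH
toy$height <- ifelse(toy$forestType == "A", 3, 2) * sqrt(toy$dbh) +
  rnorm(100, sd = 1)

# Fit one loess curve per forest type and predict height at DBH = 60 cm
height_at_60 <- sapply(split(toy, toy$forestType), function(d) {
  fit <- loess(height ~ dbh, data = d)
  as.numeric(predict(fit, newdata = data.frame(dbh = 60)))
})

height_at_60                    # predicted height (m) per forest type
names(which.max(height_at_60))  # forest type with the tallest predicted trees
```

With the real data, `split(trees, trees$forestType)` would replace the simulated groups, giving one predicted height per forest type to compare.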

4B) Do the DBH–height relationships appear similar across forest types?
No, not across all forest types; the relationships vary considerably. A few forest types do appear to share a similar relationship between height and DBH, such as DEF, DMDF and MMDF.

Question 5 – Species comparison

#Q5 - Species Comparison
#grouping showing the relationship between DBH and height for each tree species
p <- ggplot(data = trees, mapping = aes(x = dbh, y = height,colour = sppCode))
p + geom_point() +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships between Tree Species, in Thai forests",
       caption = "Source: Thai ForestGEO")

#faceting showing the relationship between DBH and height for each tree species
p <- ggplot(data = trees, mapping = aes(x = dbh, y = height, colour=sppCode))
p + geom_point() + facet_wrap(~sppCode) +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships between Tree Species, in Thai forests",
       caption = "Source: Thai ForestGEO")

#faceting & smooth curve showing the relationship between DBH and height for each tree species
p <- ggplot(data = trees, mapping = aes(x = dbh, y = height))
p + geom_smooth() + facet_wrap(~sppCode) +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships between Tree Species, in Thai forests",
       caption = "Source: Thai ForestGEO")

Q5) Why might faceting by species be harder to interpret than faceting by forest type?
Faceting by species produces many more panels to compare: 20 tree species against only 8 forest types. Each panel is smaller and, on average, holds fewer points, so differences between panels are harder to judge. Which grouping is more useful depends on the question being asked, i.e. whether we want to distinguish patterns among forest types or among tree species. If the aim is a general picture of DBH-height relationships in Thai forests, grouping the data into broader categories such as forest type makes interpretation simpler.

Question 6 – Adding observed data

#Q6 faceting, smooth curve and observed point data showing the relationship between DBH and height for each tree species
p <- ggplot(data = trees, mapping = aes(x = dbh, y = height))
p + geom_point() + geom_smooth() + facet_wrap(~sppCode) +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships between Tree Species, in Thai forests",
       caption = "Source: Thai ForestGEO")

Q6) Why can including observed data improve interpretation of a figure?
The R help pages (viewed in RStudio) describe the geom_point scatterplot as “most useful for displaying the relationship between two continuous variables” and note that geom_smooth “aids the eye in seeing patterns in the presence of overplotting.”

Overlaying a smooth curve on a scatterplot of the observed data shows both the general pattern (particularly useful in dense scatterplots such as these) and the individual observations and how they relate to that pattern. Seeing how many outliers there are, and where they fall on the graph, supports a deeper interpretation of the data, such as understanding its overall variability.

Question 7 – Histogram bin width

#7A. Calculate the bin width used in the figure.
ggplot(trees, aes(x = dbh)) +
  geom_histogram() +
  labs(x = "DBH (cm)", y = "No. of Trees",
       title = "No. of Trees by DBH in Thai forests (30 bins)",
       caption = "Source: Thai ForestGEO")

#geom_histogram() defaults to 30 bins when neither bins nor binwidth is specified

#7B. Recreate the histogram using 10 bins and 50 bins.
#10 bins
ggplot(trees, aes(x = dbh)) +
  geom_histogram(bins = 10) +
  labs(x = "DBH (cm)", y = "No. of Trees",
       title = "No. of Trees by DBH in Thai forests (10 bins)",
       caption = "Source: Thai ForestGEO")

#50 bins
ggplot(trees, aes(x = dbh)) +
  geom_histogram(bins = 50) +
  labs(x = "DBH (cm)", y = "No. of Trees",
       title = "No. of Trees by DBH in Thai forests (50 bins)",
       caption = "Source: Thai ForestGEO")

#7A approximate bin width: range of dbh divided by the number of bins (30)
(max(trees$dbh)-min(trees$dbh))/30 
## [1] 5.85
#7B approximate bin widths for 10 bins and 50 bins
(max(trees$dbh)-min(trees$dbh))/10 
## [1] 17.55
(max(trees$dbh)-min(trees$dbh))/50 
## [1] 3.51

Q7A/B) How does changing the number of bins influence your interpretation of the distribution?
“You should always override [the default number of bins], exploring multiple widths to find the best to illustrate the stories in your data.” - R help page for geom_histogram.

The number of bins makes the histogram coarser or more detailed. In #7B the DBH data are grouped into bins roughly 17.55 cm wide (10 bins) or 3.51 cm wide (50 bins). A difference of nearly 20 cm in DBH is a large jump in tree size, so 10 bins is probably too coarse and hides detail in the distribution. Bins of about 3.51 cm seem more appropriate for DBH, because stem diameter increases slowly and a difference of a few centimetres can be meaningful. The finer bins therefore give a more detailed view of the distribution. The appropriate bin width (set through the number of bins) depends on the data being assessed and on whether they are better interpreted at a finer or a broader scale.
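
The three bin widths quoted above follow from one piece of arithmetic, sketched below. The DBH range of 175.5 cm is not printed anywhere in this document; it is inferred here from the #7A output (5.85 × 30), and the variable names are illustrative. ggplot2's own placement of bin edges may differ slightly from this simple division, so the values are best read as approximations.

```r
# DBH range implied by the #7A output: 5.85 cm per bin x 30 bins (an inference)
dbh_range <- 5.85 * 30

# Approximate bin width for each number of bins used in Question 7
bins <- c(10, 30, 50)
approx_width <- dbh_range / bins
names(approx_width) <- paste(bins, "bins")

approx_width
```

With the real data, `diff(range(trees$dbh))` would give the range directly instead of inferring it from the output.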

Question 8 – Density plots

#Q8 density plot
ggplot(trees, aes(x = dbh)) +
  geom_density() +
  labs(x = "DBH (cm)", y = "Density",
       title = "Density of DBH in Thai forests",
       caption = "Source: Thai ForestGEO")

8A) How does the density plot differ from the histogram?
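The density plot shows a smoothed estimate of the distribution as a continuous curve, with density on the y-axis, rather than the count of trees in each DBH bin that a histogram shows. The area under the curve adds up to 1, so the curve describes the relative frequency of DBH values rather than absolute numbers of trees.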

8B) What additional information does a density plot reveal about the distribution?
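The density plot gives a clearer picture of the overall shape of the distribution and of which DBH values are relatively most common, without depending on where bin boundaries fall. For example, the curve appears highest at a DBH of approximately 25 cm, so trees of about that size are the most common in our data. The height of the curve is a density, not a share of trees; the share of trees within a DBH range is the area under the curve over that range.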
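
How density relates to shares of trees can be illustrated with base R's density() function. The sketch below uses simulated DBH values (an assumption, not the trees data) and approximates areas under the curve by summing curve heights times the grid spacing; the 20 to 30 cm range is an arbitrary example.

```r
set.seed(1)

# Simulated, right-skewed stand-in for trees$dbh (hypothetical values)
dbh_toy <- rlnorm(500, meanlog = 3.5, sdlog = 0.5)

# Kernel density estimate on an evenly spaced grid of DBH values
d <- density(dbh_toy)
step <- diff(d$x)[1]

# Total area under the curve: approximately 1
total_area <- sum(d$y) * step

# Estimated share of trees with DBH between 20 and 30 cm: area over that range
in_range <- d$x >= 20 & d$x <= 30
share_20_30 <- sum(d$y[in_range]) * step

total_area
share_20_30
```

The share comes from an area over a range of DBH values, which is why a single height read off the curve cannot be interpreted as a percentage of trees.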

Question 9 – Investigating a subset

#Q9 - Investigating a Subset
#Histogram of DDF Forest Type, no. of trees x dbh
ggplot(subset(trees, forestType == "DDF"), aes(x = dbh)) +
  geom_histogram(bins = 50) +
  labs(x = "DBH (cm)", y = "No. of DDF Forest Type Trees",
       title = "No. of DDF Forest Type Trees by DBH in Thai forests (50 bins)",
       caption = "Source: Thai ForestGEO")

9A) Is it symmetric (balanced on both sides) or skewed (with a tail to the left or right)?
It is skewed to the right (positively skewed): most DDF trees have a lower DBH, with relatively few above 50 cm, so the tail extends to the right of the histogram.
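
The direction of skew can also be checked numerically. The sketch below uses a small made-up vector of DBH values with a long right tail (an assumption, not the DDF data) and a hand-written `skewness()` helper, since base R has no built-in skewness function. A tail to the right gives a positive skewness and usually a mean above the median.

```r
# Moment-based sample skewness (hypothetical helper, written for this sketch)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

# Made-up DBH values (cm): many small trees, a few very large ones
dbh_toy <- c(10, 12, 15, 18, 20, 22, 25, 30, 60, 110)

skew_value <- skewness(dbh_toy)

skew_value > 0                   # TRUE for a tail to the right
mean(dbh_toy) > median(dbh_toy)  # the long right tail pulls the mean upwards
```

The same pattern is visible in Question 2, where the mean DBH of all trees (40.18 cm) is above the median (34.50 cm).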

9B) What ecological process might explain this pattern?
Perhaps the DDF stand is relatively young, with recent recruitment, so fewer trees have yet reached a DBH of 50 cm or more.

Question 10 – Design your own figure

#Q10 Design your own figure
#ggplot of dbh and height relationship in DDF forest type, faceted into species with smooth curve overlay
p <- ggplot(data = subset(trees, forestType == "DDF"), mapping = aes(x = dbh, y = height))
p + geom_point(colour = "purple", size = 0.50, shape = 5) +
  geom_smooth(colour = "red") +
  facet_wrap(~sppCode) +
  labs(x = "DBH (cm)", y = "Height (m)",
       title = "DBH-Height relationships in Forest Type DDF (deciduous dipterocarp forest) by Species, in Thai forests",
       caption = "Source: Thai ForestGEO")

10A) What the figure reveals about the data set.
There is a lot of variability in each species’ DBH-height relationship within the DDF forest type. The figure also shows how unevenly the data points are spread across species, with many more trees recorded for some (DIPTOB, SHOROB, DIPTTU and SHORSI) than for others (LAGECA, PINUKE, TECTGR, etc.).

10B) What does it show that would be harder to see using a different type of visualisation?
The smooth curve helps depict the relationship and trend between the two continuous variables, height and DBH. This would be harder to see in a box plot, which is better suited to comparing species side by side on a single variable, outlining each species’ centre, spread, skewness and potential outliers.