I’m using R here to do some quick error checking on the data. You should always do some simple plotting and checking of your data before attempting anything more complicated with it.
Don’t worry about the R code - I’m using it because I can use it quickly and it means I can record what I’ve done. If you already know how to use R, great; if you don’t, you will soon!
The code below just loads each of the three forms into a data.frame, which is very similar to a table in Excel. First, you will need to download the data from Epicollect. You’ll get a ZIP file of three text files in CSV format, one for each of the three form levels.
# Make sure you are in the right working directory. Don't worry if this is unfamiliar right now.
# Import the three sets of form data from the Epicollect Download
transect_points <- read.csv('form-1__transect-point-data.csv', stringsAsFactors=FALSE)
tree_quarters <- read.csv('form-2__tree-quarters.csv', stringsAsFactors=FALSE)
stem_data <- read.csv('form-3__stem-data.csv', stringsAsFactors=FALSE)
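If `read.csv` complains that it cannot open a file, the working directory is the usual culprit. Here is a minimal sketch of a check you could run first, assuming the three file names used above. The names `expected_files` and `missing_files` are just for illustration.

```r
# File names as used in the import code above
expected_files <- c('form-1__transect-point-data.csv',
                    'form-2__tree-quarters.csv',
                    'form-3__stem-data.csv')

# Which of the expected files cannot be found in the working directory?
missing_files <- expected_files[!file.exists(expected_files)]

if (length(missing_files) > 0) {
  message('Not found in ', getwd(), ': ', paste(missing_files, collapse=', '))
} else {
  message('All three files found in ', getwd())
}
```

If any files are reported as not found, use `setwd()` to move to the folder where you unzipped the download.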
If we look at a table of the sample points by group, we immediately see the first issue:
with(transect_points, table(Sample_point, Group, Transect))
## , , Transect = T1
##
## Group
## Sample_point 10 11 13 14 2 3 4 5 6 7 8 9 G1 G13 G14 G5 G6 G7 G8 Group 5
## P1 1 1 1 1 1 1 1 1 1 2 1 1 1 0 0 0 1 1 1 1
## P2 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 0
## P3 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 1 0
## P4 1 1 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0
## P5 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 1 0
##
## , , Transect = T2
##
## Group
## Sample_point 10 11 13 14 2 3 4 5 6 7 8 9 G1 G13 G14 G5 G6 G7 G8 Group 5
## P1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0
## P2 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## P3 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0
## P4 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0
## P5 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0
So, we should have agreed on how to write group numbers, or should have made the form accept only numbers. We can remove the text (either G or Group) and look again. Note that one advantage of using R here is that the original raw data files are untouched and we can record the changes being made.
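# strip out the leading text ('Group ' or 'G'); the longer alternative
# comes first so it is matched whichever regex engine is used
transect_points$Group <- sub('^(Group |G)', '', transect_points$Group)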
# convert the text back to a number
transect_points$Group <- as.numeric(transect_points$Group)
# get the table again
with(transect_points, table(Sample_point, Group, Transect))
## , , Transect = T1
##
## Group
## Sample_point 1 2 3 4 5 6 7 8 9 10 11 13 14
## P1 1 1 1 1 2 2 3 2 1 1 1 1 1
## P2 1 1 1 1 1 2 1 2 1 1 1 1 1
## P3 1 1 1 1 1 2 1 2 1 1 1 1 1
## P4 1 1 1 1 1 1 1 2 1 1 1 1 1
## P5 1 1 1 1 1 1 1 2 1 1 1 1 1
##
## , , Transect = T2
##
## Group
## Sample_point 1 2 3 4 5 6 7 8 9 10 11 13 14
## P1 1 1 1 1 1 1 1 1 1 1 1 1 1
## P2 1 1 1 1 1 1 1 1 1 1 1 1 1
## P3 1 1 1 1 1 1 1 1 1 1 1 1 1
## P4 1 1 1 1 1 1 1 1 1 1 1 1 1
## P5 1 1 1 1 1 1 1 1 1 1 1 1 1
Issue 1: There is some duplication of points in Transect 1
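The table shows that duplicates exist, but not which rows they are. Here is a minimal sketch of one way to pull them out, using a small made-up data frame (`toy_points`) in place of `transect_points`. The column names match the real data, but the values are invented.

```r
# Toy data standing in for transect_points (values are made up)
toy_points <- data.frame(
  Group        = c(7, 7, 7, 8, 8),
  Transect     = c('T1', 'T1', 'T1', 'T1', 'T1'),
  Sample_point = c('P1', 'P1', 'P2', 'P1', 'P2'),
  stringsAsFactors = FALSE
)

# The columns that should identify a sample point uniquely
key_cols <- c('Group', 'Transect', 'Sample_point')

# Flag every row whose key combination occurs more than once.
# fromLast=TRUE catches the first copy as well as the later repeats.
is_dup <- duplicated(toy_points[, key_cols]) |
  duplicated(toy_points[, key_cols], fromLast=TRUE)

# Show all rows involved in a duplication
toy_points[is_dup, ]
```

Running the same steps on `transect_points` should list the repeated Transect 1 points, so that someone can decide which copy to keep.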
The next chunk of code carries out a merge - it uses the linking columns between the transect observations and the tree observations to join the rows together.
# Join transect point data to the rows in the tree quarter data
tree_level_data <- merge(tree_quarters, transect_points,
                         by.x='ec5_parent_uuid', by.y='ec5_uuid')
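By default, `merge` keeps only rows that match on both sides, so unmatched rows disappear silently. Here is a minimal sketch of how you might count them, using made-up tables (`toy_parent`, `toy_child`) with invented IDs rather than real Epicollect UUIDs.

```r
# Toy parent and child tables (made-up IDs, not real Epicollect UUIDs)
toy_parent <- data.frame(ec5_uuid = c('a', 'b', 'c'),
                         Group = c(1, 2, 3),
                         stringsAsFactors = FALSE)
toy_child <- data.frame(ec5_parent_uuid = c('a', 'a', 'b', 'z'),
                        Quarter = c('QI', 'QII', 'QI', 'QI'),
                        stringsAsFactors = FALSE)

# Default merge: only rows with a match in both tables survive
toy_merged <- merge(toy_child, toy_parent,
                    by.x='ec5_parent_uuid', by.y='ec5_uuid')

# Child rows whose parent ID is not in the parent table
orphans <- !(toy_child$ec5_parent_uuid %in% toy_parent$ec5_uuid)
# Parent rows that have no child rows at all
childless <- !(toy_parent$ec5_uuid %in% toy_child$ec5_parent_uuid)

nrow(toy_merged)
sum(orphans)
sum(childless)
```

If you would rather keep the unmatched rows and see them as `NA`, `merge` also accepts `all.x=TRUE` and `all.y=TRUE`.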
We already know there is some duplication of sample points, but we can also look at a table of the number of trees recorded in each quarter:
# Tabulate the quarter points by group and transect
with(tree_level_data, table(Quarter, Group, Transect))
## , , Transect = T1
##
## Group
## Quarter 1 2 3 4 5 6 7 8 9 10 11 13 14
## QI 5 5 5 5 6 6 6 5 5 5 6 0 5
## QII 5 5 5 5 6 7 5 5 5 5 5 0 5
## QIII 5 5 5 5 6 6 6 5 5 5 4 0 5
## QIV 5 5 5 5 5 6 5 5 5 5 5 0 5
##
## , , Transect = T2
##
## Group
## Quarter 1 2 3 4 5 6 7 8 9 10 11 13 14
## QI 5 5 5 5 5 5 5 5 5 5 5 5 5
## QII 5 5 5 5 5 5 5 5 5 5 5 5 5
## QIII 5 5 5 5 5 5 5 5 5 5 5 5 5
## QIV 5 5 5 5 5 5 5 5 5 5 5 5 5
That immediately reveals another issue:
Issue 2: Missing tree data from Group 13 in Transect 1.
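With more groups or transects, zeros are easy to miss by eye. Here is a minimal sketch of how the empty cells could be listed automatically, using a made-up data frame (`toy_trees`) rather than the real `tree_level_data`.

```r
# Toy data: Group 3 has no rows for Transect T1 (values are made up)
toy_trees <- data.frame(
  Group    = c(1, 1, 2, 2, 3),
  Transect = c('T1', 'T2', 'T1', 'T2', 'T2'),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert the table to one row per cell
counts <- with(toy_trees, table(Group, Transect))
counts_long <- as.data.frame(counts)

# Keep only the combinations with no records
empty_cells <- counts_long[counts_long$Freq == 0, ]
empty_cells
```

Note that this only finds combinations where the group and the transect each appear somewhere in the data. A group that is missing entirely will not show up.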
The main data from this table are the distances from the sampling point to the stem, so we check that those look sensible:
# Check the sampling distances by group.
plot(Distance_SamplePoint ~ as.factor(Group), data=tree_level_data)
Issue 3: Group 7 used centimetres to record stem distance from sampling point.
Before we get too comfortable, let’s check what happens if we convert Group 7’s distances back to metres:
tree_level_data$Distance_SamplePoint <- with(tree_level_data, ifelse(Group == 7, Distance_SamplePoint/100, Distance_SamplePoint))
plot(Distance_SamplePoint ~ as.factor(Group), data=tree_level_data)
Those look a lot more reasonable.
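A unit mix-up like this can also be flagged numerically rather than by eye. Here is a minimal sketch, using made-up distances (`toy_dist`) and an arbitrary threshold of 20 times the typical group median. Both the data and the threshold are assumptions for illustration, not part of the original analysis.

```r
# Toy data: one group has recorded distances in centimetres (made-up values)
toy_dist <- data.frame(
  Group = rep(c(6, 7, 8), each=4),
  Distance_SamplePoint = c(2.1, 3.4, 1.8, 4.0,   # metres
                           250, 310, 180, 420,   # looks like centimetres
                           2.9, 3.1, 2.2, 3.8)   # metres
)

# Median distance for each group, and the typical value across groups
group_medians <- tapply(toy_dist$Distance_SamplePoint, toy_dist$Group, median)
typical_median <- median(group_medians)

# Flag groups whose median is far above the typical value
# (the factor of 20 is an arbitrary choice for this sketch)
suspect_groups <- names(group_medians)[group_medians > 20 * typical_median]
suspect_groups
```

Medians are used rather than means so that a few genuine long distances do not trigger the flag.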
Moving on to the stem data, the first thing to check is that we have stem measurements for all of the quarter trees:
# check to see if each tree ID appears in the set of stem data
tree_level_data$stem_found <- tree_level_data$ec5_uuid %in% stem_data$ec5_parent_uuid
table(tree_level_data$stem_found)
##
## FALSE TRUE
## 13 497
OK, so we are missing some stem data, and we can look up which ones:
subset(tree_level_data, subset=stem_found == FALSE, select=c(Group, Transect, Sample_point, Quarter))
## Group Transect Sample_point Quarter
## 16 6 T1 P1 QI
## 113 3 T1 P2 QIII
## 114 3 T1 P2 QII
## 115 3 T1 P2 QI
## 116 3 T1 P2 QIV
## 237 6 T1 P3 QII
## 243 6 T1 P1 QIII
## 244 6 T1 P1 QIV
## 245 6 T1 P1 QII
## 298 5 T1 P1 QIII
## 299 5 T1 P1 QII
## 300 5 T1 P1 QI
## 346 7 T1 P1 QIII
Issue 4: A handful of missing stems across various groups.
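To see which groups to follow up with, the missing stems can be counted per group. Here is a minimal sketch using a made-up data frame (`toy_found`) with the same `Group` and `stem_found` columns as above.

```r
# Toy data: TRUE means stem data were found for that tree (made-up values)
toy_found <- data.frame(
  Group      = c(3, 3, 5, 6, 6, 6),
  stem_found = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Number of trees without stem data in each group
missing_per_group <- tapply(!toy_found$stem_found, toy_found$Group, sum)
missing_per_group
```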
We’ll join up the stem data to the tree data, so we can look for structure in any issues:
# Join the tree level data to the rows in the stem data
stem_level_data <- merge(tree_level_data, stem_data,
                         by.x='ec5_uuid', by.y='ec5_parent_uuid')
First, we can look at the subset of data collected on fallen stems:
# Extract the quantitative data for fallen stems
fallen_stem_data <- subset(stem_level_data, Stem_status == 'fallen', select=c(Stem_distance, Angle_to_base, Angle_to_canopy, Fallen_stem_length))
summary(fallen_stem_data)
## Stem_distance Angle_to_base Angle_to_canopy Fallen_stem_length
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.:10.000 1st Qu.: 5.000 1st Qu.:40.00 1st Qu.: 4.750
## Median :10.000 Median :10.000 Median :55.00 Median : 9.000
## Mean : 8.985 Mean : 9.754 Mean :49.63 Mean : 7.833
## 3rd Qu.:10.000 3rd Qu.:15.000 3rd Qu.:70.00 3rd Qu.:10.000
## Max. :15.000 Max. :25.000 Max. :81.00 Max. :15.000
## NA's :53
Issue 5: We weren’t sufficiently clear about how to handle fallen live stems. With a fallen stem, we can simply measure the length and girth directly, but most people are still measuring angles and distances. If the fallen stem is at right angles to the distance line, we can still work out length, but we need to decide how to handle this.
First, we extract the data for standing stems:
# Extract the data for standing stems
standing_stem_data <- subset(stem_level_data, Stem_status == 'standing')
Now we can plot the quantitative data by group:
par(mfrow=c(1,3), mar=c(3,3,1,1), mgp=c(2,1,0))
plot(Stem_distance ~ as.factor(Group), data=standing_stem_data)
plot(Angle_to_base ~ as.factor(Group), data=standing_stem_data)
plot(Angle_to_canopy ~ as.factor(Group), data=standing_stem_data)
“Issue” 6: Some of the distances are shorter than 10 metres. If they are paired with steep angles, they may be correct, and we may just get a bit more error on those height estimates. Some of the distances are very large, and these may be typos.
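To see why short distances with steep angles give noisier heights, here is a minimal sketch of the usual clinometer calculation. It assumes that `Stem_distance` is a horizontal distance in metres, that both angles are in degrees, and that the base angle is measured downwards from horizontal. None of that is confirmed above, and `stem_height` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: height from horizontal distance and two clinometer angles.
# Assumes angles in degrees, base angle measured downwards from horizontal.
stem_height <- function(distance_m, angle_base_deg, angle_canopy_deg) {
  to_rad <- pi / 180
  distance_m * (tan(angle_canopy_deg * to_rad) + tan(angle_base_deg * to_rad))
}

# Roughly the same tree (a little over 21 m tall) viewed from 10 m and from 4 m
h_far  <- stem_height(10, 5, 64)
h_near <- stem_height(4, 12, 79)

# Effect of over-reading the canopy angle by 1 degree in each case
err_far  <- stem_height(10, 5, 65) - h_far
err_near <- stem_height(4, 12, 80) - h_near

c(h_far=h_far, h_near=h_near, err_far=err_far, err_near=err_near)
```

In this made-up case the same one degree slip costs about twice as much height error from 4 m as from 10 m, because the tangent rises steeply as the angle approaches 90 degrees.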