Mesic, native-dominated forest in Hawaiian Volcanoes National Park (S. Yelenik).
Welcome to a journey through Hawai’i Volcanoes National Park (HAVO) where you’ll vicariously experience HAVO via data collected by researchers Jonathan J. Henn, Stephanie G. Yelenik, and Ellen I. Damschen. This document provides an example for processing ecological data through various techniques using RStudio and RMarkdown.
I selected this dataset from HAVO because my graduate research at University of Hawai’i at Hilo is also based within the park’s boundaries. My graduate research involves detecting invasive plants found within HAVO, so this vegetation dataset is not directly related, but both are vegetation focused. My goal is to present various methods to approach, visualize, and analyze ecological data using RStudio and RMarkdown. I hope to discover endless options to maximize data and find joy in the process.
havo <- read.csv("HAVO_plant_cover_trait_and_environmental_data.csv") # Loading the .csv file into R
Behold, the Data! This table is simply showing you the datasheet in its entirety through a series of pages thanks to the R Markdown + paged_table() function.
library(rmarkdown)
paged_table(havo)
The str() function is a simple way to view data’s “basic” structure. It communicates the classes of your variables and the number of observations. In this case, we see this dataset contains continuous and categorical (or factor) data across 30 different variables.
str(havo)
## 'data.frame': 247 obs. of 30 variables:
## $ UTM_X : int 266654 270444 267456 273367 274006 267290 271015 270637 270480 271101 ...
## $ UTM_Y : int 2138394 2138782 2140371 2140273 2139901 2140451 2137926 2137440 2138110 2137237 ...
## $ Species : Factor w/ 24 levels "ANDVIR","BULCAP",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Plot : int 2012 2026 2030 2053 2054 2076 2077 2078 2191 2194 ...
## $ Cover : num 0.02 0.1 0.01 0.01 0.05 0.01 0.24 0.08 0.05 0.05 ...
## $ METPOLcov : num 0 0.32 0.13 0.07 0.15 0.09 0 0.08 0.135 0.1 ...
## $ LDMC.g.g. : num 0.43 0.408 0.413 0.347 0.342 0.445 0.402 0.407 0.42 0.407 ...
## $ LDMCstd : num 0.037 0.014 0.025 0.015 0.023 0.058 0.023 0.023 0.034 0.021 ...
## $ AREA.cm2. : num 2.54 3.29 3.74 3.14 3.55 ...
## $ AREAstd : num 0.302 0.162 0.731 0.771 0.731 0.438 0.152 0.337 0.192 0.199 ...
## $ SLA.cm2.g. : num 242 237 270 130 173 ...
## $ SLAstd : num 27.4 12.9 15.6 37.2 29.3 ...
## $ Ldens.g.cm3.: num 0.618 0.455 0.532 1.037 0.904 ...
## $ Ldensstd : num 0.078 0.078 0.199 0.303 1.523 ...
## $ Wdensity : num NA NA NA NA NA NA NA NA NA NA ...
## $ height.m. : num 1 1 1 1 1 1 1 1 1 1 ...
## $ seedmass.g. : num 1.3 1.3 1.3 1.3 1.3 ...
## $ X.N : num 0.932 0.822 1.078 0.864 0.78 ...
## $ X.C : num 43.5 42.4 41.9 43 42.9 ...
## $ C.N : num 46.7 51.6 38.9 49.8 55 ...
## $ growth : Factor w/ 5 levels "F","G","H","S",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ status : Factor w/ 3 levels "E","I","N": 3 3 3 3 3 3 3 3 3 3 ...
## $ Numsamples : int 5 5 10 12 10 10 5 5 5 5 ...
## $ ET.mm. : num 751 855 931 885 897 ...
## $ Rad.w.m2. : num 225 212 221 204 203 ...
## $ T.C. : num 18.6 18.5 17.7 18.3 18.4 ...
## $ ele.m. : num 763 753 876 764 750 ...
## $ precip.mm. : num 2009 2254 2136 2631 2603 ...
## $ rain : Factor w/ 3 levels "DRY","INT","WET": 2 2 3 3 3 3 2 2 2 2 ...
## $ temp : Factor w/ 2 levels "COOL","HOT": 2 2 2 2 2 2 2 2 2 2 ...
To get a better sense of the distribution of your variables in the dataset, the summary() function shows descriptive statistics per variable (depending on the type of variable).
summary(havo)
## UTM_X UTM_Y Species Plot
## Min. :258616 Min. :2136979 METPOL : 32 Min. :2006
## 1st Qu.:263262 1st Qu.:2137779 DODVIS : 25 1st Qu.:2054
## Median :265326 Median :2139901 LEPTAM : 25 Median :2078
## Mean :266562 Mean :2141562 MORFAY : 19 Mean :2109
## 3rd Qu.:271015 3rd Qu.:2147403 SCHCON : 16 3rd Qu.:2194
## Max. :274420 Max. :2149009 ANDVIR : 14 Max. :2212
## (Other):116
## Cover METPOLcov LDMC.g.g. LDMCstd
## Min. :0.0000 Min. :0.0000 Min. :0.1490 Min. :0.00400
## 1st Qu.:0.0100 1st Qu.:0.0150 1st Qu.:0.3285 1st Qu.:0.02000
## Median :0.0200 Median :0.1000 Median :0.4080 Median :0.03200
## Mean :0.1014 Mean :0.2294 Mean :0.3944 Mean :0.03589
## 3rd Qu.:0.0950 3rd Qu.:0.4000 3rd Qu.:0.4610 3rd Qu.:0.04100
## Max. :0.8900 Max. :0.7650 Max. :0.6710 Max. :0.21100
##
## AREA.cm2. AREAstd SLA.cm2.g. SLAstd
## Min. : 0.144 Min. : 0.012 Min. : 0.0 Min. : 0.000
## 1st Qu.: 3.299 1st Qu.: 0.582 1st Qu.: 67.0 1st Qu.: 7.599
## Median : 7.314 Median : 1.567 Median :104.9 Median : 12.613
## Mean :12.879 Mean : 3.448 Mean :116.6 Mean : 16.180
## 3rd Qu.:14.904 3rd Qu.: 3.693 3rd Qu.:150.7 3rd Qu.: 19.827
## Max. :86.492 Max. :23.593 Max. :452.6 Max. :163.617
## NA's :1 NA's :1
## Ldens.g.cm3. Ldensstd Wdensity height.m.
## Min. :0.1450 Min. :0.0070 Min. :0.4241 Min. : 0.300
## 1st Qu.:0.4298 1st Qu.:0.1890 1st Qu.:0.6648 1st Qu.: 1.500
## Median :0.5425 Median :0.4390 Median :0.7971 Median : 6.000
## Mean :0.5545 Mean :0.6517 Mean :0.8043 Mean : 6.571
## 3rd Qu.:0.6647 3rd Qu.:0.8450 3rd Qu.:0.9446 3rd Qu.:10.000
## Max. :1.3510 Max. :4.4130 Max. :1.1760 Max. :16.000
## NA's :1 NA's :1 NA's :114
## seedmass.g. X.N X.C C.N growth
## Min. : 0.00001 Min. :0.3902 Min. :26.31 Min. :18.99 F:27
## 1st Qu.: 0.00006 1st Qu.:0.7274 1st Qu.:42.76 1st Qu.:36.77 G:64
## Median : 0.04470 Median :0.9106 Median :45.32 Median :48.42 H: 2
## Mean : 0.83702 Mean :1.0091 Mean :44.85 Mean :50.57 S:63
## 3rd Qu.: 0.47000 3rd Qu.:1.1786 3rd Qu.:47.07 3rd Qu.:62.69 T:91
## Max. :11.06031 Max. :2.3247 Max. :51.52 Max. :99.36
## NA's :30 NA's :15 NA's :15 NA's :15
## status Numsamples ET.mm. Rad.w.m2. T.C.
## E: 77 Min. : 3.000 Min. :730.8 Min. :197.4 Min. :15.42
## I: 68 1st Qu.: 5.000 1st Qu.:799.1 1st Qu.:209.0 1st Qu.:15.90
## N:102 Median : 6.000 Median :863.2 Median :213.0 Median :17.90
## Mean : 7.239 Mean :853.6 Mean :213.6 Mean :17.60
## 3rd Qu.: 8.000 3rd Qu.:895.5 3rd Qu.:221.5 3rd Qu.:18.56
## Max. :25.000 Max. :989.8 Max. :225.8 Max. :19.44
##
## ele.m. precip.mm. rain temp
## Min. : 620.0 Min. :1904 DRY: 46 COOL: 69
## 1st Qu.: 751.3 1st Qu.:2009 INT: 69 HOT :178
## Median : 851.9 Median :2189 WET:132
## Mean : 890.2 Mean :2346
## 3rd Qu.:1135.0 3rd Qu.:2631
## Max. :1190.4 Max. :3316
##
Luckily, this dataset is accompanied by substantial metadata, additional information that describes and provides details about the dataset. It is nearly thorough, but it lacks explanations for missing data values for a handful of variables. Check out the Data Issues tab for more details on how many NA values are missing from which columns.
You can find this metadata here.
Below is a table that compiles some details of who collected this data and when:
| Originators | Beginning of data collection | End of data collection | |
|---|---|---|---|
| Jonathan J. Henn | 06/01/2014 | 09/01/2014 | |
| Stephanie G. Yelenik | 06/01/2014 | 09/01/2014 | |
| Ellen I. Damschen | 06/01/2014 | 09/01/2014 |
Determining the characteristics of non-native plants that can successfully establish and spread is central to pressing questions in invasion ecology. Evidence suggests that some non-native species establish and spread in new environments because they possess characteristics (functional traits) that allow them to either successfully compete with native residents or fill previously unfilled niches. However, the relative importance of out-competing native species vs. filling empty niche space as potential mechanisms of invasion may depend on environmental characteristics. Here, we measured plant functional traits, proxies indicative of competitive and establishment strategies, to determine if these traits vary among native and invasive species and if their prevalence is dependent on environmental conditions. Using a natural environmental gradient in Hawai’i Volcanoes National Park, we evaluated how functional traits differ between native and non-native plant communities and if these differences change along an environmental gradient from hot, dry to cool, wet conditions.
You may have noticed from flipping through that there are missing data represented by NA. Below is a count and sum function. The count function below counts how many NA values are found in each variable. The sum function provides the total NA count across all variables. We find out there are a total of 193 missing values in this dataset. Don’t panic. We’ll attempt to bootstrap or strategically replace these values later.
NA_count <-sapply(havo, function(y) sum(length(which(is.na(y))))) # locating all NA entries
NA_count <- data.frame(NA_count) # creating a data table out of NA entries
print(NA_count) # displays the NA count table
## NA_count
## UTM_X 0
## UTM_Y 0
## Species 0
## Plot 0
## Cover 0
## METPOLcov 0
## LDMC.g.g. 0
## LDMCstd 0
## AREA.cm2. 1
## AREAstd 1
## SLA.cm2.g. 0
## SLAstd 0
## Ldens.g.cm3. 1
## Ldensstd 1
## Wdensity 114
## height.m. 0
## seedmass.g. 30
## X.N 15
## X.C 15
## C.N 15
## growth 0
## status 0
## Numsamples 0
## ET.mm. 0
## Rad.w.m2. 0
## T.C. 0
## ele.m. 0
## precip.mm. 0
## rain 0
## temp 0
NA_total <- sum(NA_count) # sum of all NA entries/variable
print(NA_total) # displays the total NA entries
## [1] 193
Data analysis is often preceded with cleaning and preparing data. Data tidying is the process of structuring datasets to facilitate analysis. Tidy data is a standardized way to map the meaning of a dataset to its structure or how its rows, columns, and tables are matched up with observations, variables, and types. You have tidy data when:
library(dplyr)
library(kableExtra)
dt <- havo[1:5, 1:6] # creating a snippet of the data table for viewing purposes
kable(dt) %>% # displaying the snippet table
kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")
| UTM_X | UTM_Y | Species | Plot | Cover | METPOLcov |
|---|---|---|---|---|---|
| 266654 | 2138394 | ANDVIR | 2012 | 0.02 | 0.00 |
| 270444 | 2138782 | ANDVIR | 2026 | 0.10 | 0.32 |
| 267456 | 2140371 | ANDVIR | 2030 | 0.01 | 0.13 |
| 273367 | 2140273 | ANDVIR | 2053 | 0.01 | 0.07 |
| 274006 | 2139901 | ANDVIR | 2054 | 0.05 | 0.15 |
Our data is indeed tidy! Each variable has its own column, each observation has its own row, and each value has its own cell. However, we are missing one component to get a more organized dataset: we need unique identifiers for each observation. We’ll create a unique ID based on the UTM_X column.
We’ll run one command to make a unique key for each observation.
library(dplyr)
havo_uid <- mutate(havo,id = dplyr::row_number(UTM_X)) # command to create a unique ID column
new_dt <- havo_uid %>% select(id, everything()) # moving unique ID column to first column
show_new_dt <- new_dt[1:5, 1:7] # creating a snippet of the data table for viewing purposes
kable(show_new_dt) %>% # displaying the snippet table
kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")
| id | UTM_X | UTM_Y | Species | Plot | Cover | METPOLcov |
|---|---|---|---|---|---|---|
| 131 | 266654 | 2138394 | ANDVIR | 2012 | 0.02 | 0.00 |
| 164 | 270444 | 2138782 | ANDVIR | 2026 | 0.10 | 0.32 |
| 152 | 267456 | 2140371 | ANDVIR | 2030 | 0.01 | 0.13 |
| 226 | 273367 | 2140273 | ANDVIR | 2053 | 0.01 | 0.07 |
| 233 | 274006 | 2139901 | ANDVIR | 2054 | 0.05 | 0.15 |
Success!
Here’s a snippet of our new data table complete with unique IDs for each observation.
Remember when we found out there are 193 missing values in this dataset? Let’s address those now.
A commonly used package for missing values imputation is MICE (Multivariate Imputation via Chained Equations). This package creates multiple imputations as compared to a single imputation (such as mean) to take care of uncertainty in missing values. It imputes data on a variable by variable basis by specifying an imputation model per variable.
The MICE plot is indicating the Wdensity (wood density) variable has the most NA values (about 41% of observations are missing Wdensity data). We’ll perform imputations on the Wdensity to compute values for the 115 missing observations. We’ll use the cart (imputation by classification and regression trees) method.
library(mice)
library(VIM)
# Plot of the distribution of NA values per variable
mice_plot <- aggr(new_dt, col=c('pink', 'yellow'), numbers=TRUE, sortVars=TRUE, labels=names(new_dt), cex.axis=.7, gap=3, ylab=c("Missing data", "Pattern"))
new_dt <- havo_uid %>% select(id, everything())
# Imputing missing values of 3 different iterations
new_dt_i <- mice(new_dt, m = 3, maxit = 50, method = 'cart', seed = 500)
# Create a plot of each iteration of imputed data
plot(new_dt_i) # Create a plot of each iteration of imputed data
Time to compare the 3 new imputations data vs. the original values from the Wdensity column. First we see the 3 imputed datasets of Wdensity, the original data, then visual comparisons of both the imputed and observed data.
new_dt_i$imp$Wdensity # Viewing 3 iterations of imputation values for Wdensity column
## 1 2 3
## 1 0.8775906 0.7970461 0.8198617
## 2 0.9139951 0.8597114 1.0059142
## 3 0.9508514 0.7970461 0.9894424
## 4 0.8086734 0.9054732 1.0089507
## 5 0.7939860 0.7970461 1.0875365
## 6 0.8301417 0.7172554 0.8914270
## 7 0.9014814 0.9054732 1.0732105
## 8 0.7898590 0.9508514 1.0732105
## 9 0.9054732 0.7970461 1.0296478
## 10 0.7939860 0.8086734 0.8198617
## 11 0.8775906 0.8597114 0.9358146
## 12 0.9054732 0.9139951 1.1720072
## 13 0.9014814 0.9508514 1.1720072
## 14 0.9014814 0.7939860 1.0526980
## 15 0.9508514 0.8350667 0.8914270
## 16 0.9109866 0.8597114 1.0087957
## 17 0.8757702 0.7946580 0.9767145
## 18 0.9054732 0.8301417 0.8736248
## 19 0.8757702 0.7172554 1.0732105
## 20 0.9054732 0.7600926 0.9915416
## 21 0.9139951 0.7946580 0.8321862
## 22 0.8086734 0.7939860 0.8736248
## 23 0.8086734 0.7970461 0.9581061
## 24 0.9508514 0.7939860 1.1338318
## 25 0.6450106 0.6739045 0.5585264
## 26 0.6362979 0.6739045 0.6982613
## 27 0.6717506 0.6739045 0.6140330
## 28 0.5832948 0.7638803 0.6450106
## 29 0.7131416 0.6362979 0.6450106
## 30 0.7131416 0.6347277 0.6140330
## 31 0.5882865 0.6371286 1.1759775
## 32 0.6347277 0.6982613 0.5706291
## 33 0.7131416 0.5667743 0.6086379
## 34 0.6347277 0.6717506 1.1759775
## 35 0.8256663 0.6371286 1.1759775
## 36 0.5882865 0.5950523 0.5706291
## 37 0.6717506 0.5950523 0.6422176
## 41 0.5834137 0.6200396 0.6898228
## 42 1.1603136 0.9574198 0.6867500
## 44 0.9781657 0.5950523 0.9814638
## 45 0.7008337 0.7424247 0.6128773
## 46 0.9154403 0.6917519 0.6235502
## 47 0.8123461 0.6917519 0.8914270
## 48 0.5992829 0.6371286 0.6235502
## 49 0.8729342 0.6739045 0.6586052
## 50 0.6200396 0.6917519 0.8729342
## 51 0.8729342 0.6371286 0.7268824
## 64 1.0526980 0.8649921 0.9894424
## 69 0.9888211 0.8123461 0.9983258
## 70 0.9267537 1.0128508 0.9476086
## 77 0.6235502 0.6422176 0.8729342
## 78 0.8757702 0.9123416 0.8914270
## 79 0.7623218 0.5992829 0.6586052
## 80 0.6128773 0.5882865 0.6586052
## 81 0.6953691 0.6235502 0.7970461
## 82 0.5992829 0.6086379 0.5992829
## 83 0.7623218 0.6235502 0.5992829
## 84 0.5992829 0.6086379 0.7623218
## 103 0.6895958 1.0087957 0.9576641
## 106 0.8350667 0.8321862 0.9440869
## 110 0.8585566 0.7800303 0.9109866
## 115 0.9888211 0.7799323 0.9306841
## 116 0.8928684 0.5170285 0.9054732
## 117 0.9983258 0.5667743 0.7581210
## 118 0.8928684 0.7246467 1.0310442
## 119 0.9888211 0.5706291 0.8928684
## 120 0.9781657 0.5521894 0.9054732
## 121 0.8928684 0.5838757 0.9446464
## 122 1.0528247 0.7377041 0.9571675
## 123 0.9305814 0.6140330 0.9915416
## 124 0.9983258 0.5706291 0.8123461
## 125 0.9305814 0.6717506 0.8649921
## 126 1.0128508 0.5667743 0.8650877
## 146 0.6347277 0.7638803 0.6647740
## 172 0.7898590 0.8597114 0.7172554
## 180 0.8775906 0.8350667 0.8301417
## 189 0.9983258 0.6703886 0.9983258
## 190 0.6441729 0.6450106 0.7268824
## 191 0.9425009 0.5950523 1.0059142
## 192 0.6128773 0.6788970 0.7008337
## 193 0.8123461 0.6450106 1.0667986
## 194 0.8928684 0.6982613 0.8585566
## 200 0.9508514 0.6086379 0.6953691
## 201 0.9054732 0.5882865 0.7970461
## 202 0.9508514 0.6422176 0.7970461
## 203 0.5992829 1.0528247 0.7623218
## 204 0.7800303 0.9139951 0.9825617
## 205 0.7600926 0.6898228 0.9571675
## 208 0.8886295 0.5834137 0.8585566
## 210 0.9425009 0.9305814 1.0128508
## 215 0.9571675 0.7172554 0.9894424
## 216 0.8650877 0.6200396 0.7581210
## 221 0.7970461 0.9508514 1.0089507
## 222 0.7600926 0.8086734 1.0296478
## 223 0.8757702 0.7946580 1.0089507
## 224 0.9508514 0.7946580 1.0528247
## 225 0.8301417 0.7172554 0.9014814
## 226 0.8086734 0.7946580 0.8585566
## 227 0.7600926 0.8597114 0.9919641
## 228 0.8350667 0.7939860 0.8585566
## 229 0.9508514 0.8301417 1.0526980
## 230 0.9014814 0.7172554 0.9814638
## 231 0.7521280 0.7600926 0.6895958
## 232 0.7255751 0.9054732 1.0089507
## 233 0.8350667 0.7946580 0.9014814
## 234 0.8086734 0.7600926 1.0875365
## 235 0.7898590 0.9054732 0.9814638
## 236 0.9508514 0.7946580 1.0732105
## 240 0.9781657 0.8123461 0.8585566
## 241 0.8736248 0.8928684 0.9358146
## 242 0.8928684 0.9267537 0.9888211
## 243 0.9814638 0.7909873 0.9571675
## 244 0.9267537 0.7909873 0.9425009
## 245 0.9571675 0.8650877 0.9425009
new_dt$Wdensity # Viewing Wdensity column from original dataset
## [1] NA NA NA NA NA NA NA
## [8] NA NA NA NA NA NA NA
## [15] NA NA NA NA NA NA NA
## [22] NA NA NA NA NA NA NA
## [29] NA NA NA NA NA NA NA
## [36] NA NA 0.5834137 0.6867500 0.6422176 NA NA
## [43] 0.6086379 NA NA NA NA NA NA
## [50] NA NA 0.9894424 0.9825617 0.8649921 0.8650877 1.0310442
## [57] 0.9781657 0.8928684 0.7581210 0.9306841 0.9476086 0.9814638 0.9888211
## [64] NA 0.9571675 0.8736248 0.8123461 1.0128508 NA NA
## [71] 0.9446464 0.9919641 0.8198617 1.0918601 1.0526980 1.0059142 NA
## [78] NA NA NA NA NA NA NA
## [85] 0.4631419 0.4984586 0.5562989 0.4240881 0.6118715 0.9440869 0.7869858
## [92] 1.0732105 0.9109866 0.6895958 1.1338318 0.9767145 1.0860136 0.9559145
## [99] 0.8585566 1.0875365 0.8081711 0.7800303 NA 0.8321862 0.8471155
## [106] NA 1.0296478 0.7255751 0.9915416 NA 0.7898590 0.8886295
## [113] 0.8914270 1.0087957 NA NA NA NA NA
## [120] NA NA NA NA NA NA NA
## [127] 0.6200396 0.6441729 0.5992829 0.7377041 0.6613461 0.5585264 0.8256663
## [134] 0.5667743 0.5950523 0.6450106 0.6788970 0.5832948 0.7424247 0.4675798
## [141] 0.6140330 0.5521894 0.6513954 0.6371286 0.6362979 NA 0.6917519
## [148] 0.6647740 0.6347277 0.7799323 0.7246467 0.6703886 0.5170285 0.7131416
## [155] 0.4887833 0.5838757 0.6717506 0.5706291 0.6982613 0.6739045 0.7638803
## [162] 0.7600926 0.9508514 0.6953691 0.8775906 0.8757702 0.7939860 0.9054732
## [169] 0.8301417 0.9139951 0.7172554 NA 0.7521280 0.7268824 0.8086734
## [176] 0.8597114 0.7946580 0.7970461 0.8350667 NA 0.6235502 0.9123416
## [183] 0.7008337 0.6128773 0.6586052 0.7623218 0.9154403 0.8729342 NA
## [190] NA NA NA NA NA 0.9014814 0.9358146
## [197] 1.1720072 1.0089507 1.0667986 NA NA NA NA
## [204] NA NA 0.9305814 0.9581061 NA 0.9576641 NA
## [211] 1.0528247 0.9425009 0.7494724 0.9983258 NA NA 0.5453998
## [218] 0.6993783 0.6898228 0.5882865 NA NA NA NA
## [225] NA NA NA NA NA NA NA
## [232] NA NA NA NA NA 1.1759775 1.1603136
## [239] 0.9574198 NA NA NA NA NA NA
## [246] 0.7909873 0.9267537
The density of the imputed data for each imputed dataset is showed in magenta while the density of the observed data is showed in blue. The distributions between the observed dataset and the imputed dataset are fairly similar.
densityplot(new_dt_i)
Another useful visual take on the distributions can be obtained using the stripplot() function that shows the distributions of the variables as individual points. Below we see the imputed points (in magenta) within reasonable range with the observed dataset.
stripplot(new_dt_i, pch = 20, cex = 1.2)
Before we begin our relational database journey, we need to make sure we have variables to base the join and connect the datasets. Our goal is to combine the plant functional trait dataset with another vegetation dataset from HAVO. ### Join another dataset
havoplots <- read.csv("havoplots.csv")
Remember our unique ID creation for the original dataset? The original dataset lacks a primary key or a key that uniquely identify an observation in a table, so that’s why we created one. The new, unique “id” field is also known as a surrogate key. This will act as our primary key for the original dataset now.
Here’s a snippet of the havoplots dataset, fully loaded with a unique ID (a surrogate ID that was created in the same way as above) AND the original havo functional trait dataset:
library(dplyr)
havoplots_uid <- mutate(havoplots,id = dplyr::row_number(Plot_Code)) # Creating a surrogate key for havoplots dataset based on column Plot_Code
havoplots_uid_2 <- havoplots_uid %>% select(id, everything()) # Moving the surrogate "id" key to the first column
havoplots_snip <- havoplots_uid_2[1:5, 1:5] # Creating a snippet of the data table for viewing purposes
kable(havoplots_snip) %>% # Displaying a snippet of the HAVO plot
kable_styling(bootstrap_options = "striped", full_width = T, position = "center")
| id | Plot_Code | Project_Name | Provisional.Community.Name | NVC_Elcode |
|---|---|---|---|---|
| 1 | HAVO.0001L | Hawaii Volcanoes National Park | Leptam-Dodvis/Melmin_Shrubland | CEGL008164 |
| 2 | HAVO.0002L | Hawaii Volcanoes National Park | Schcon-Melmin_Grassland | CEGL008169 |
| 3 | HAVO.0003L | Hawaii Volcanoes National Park | Schcon-Melmin_Grassland | CEGL008169 |
| 4 | HAVO.0004L | Hawaii Volcanoes National Park | Metpol/Leptam_Woodland | CEGL008132 |
| 5 | HAVO.0005L | Hawaii Volcanoes National Park | Sparse | CEGL008132 |
kable(show_new_dt) %>% # Displaying a snippet of the HAVO functional trait table
kable_styling(bootstrap_options = "striped", full_width = T, position = "center")
| id | UTM_X | UTM_Y | Species | Plot | Cover | METPOLcov |
|---|---|---|---|---|---|---|
| 131 | 266654 | 2138394 | ANDVIR | 2012 | 0.02 | 0.00 |
| 164 | 270444 | 2138782 | ANDVIR | 2026 | 0.10 | 0.32 |
| 152 | 267456 | 2140371 | ANDVIR | 2030 | 0.01 | 0.13 |
| 226 | 273367 | 2140273 | ANDVIR | 2053 | 0.01 | 0.07 |
| 233 | 274006 | 2139901 | ANDVIR | 2054 | 0.05 | 0.15 |
It’s 2 table party time! Remember: the original dataset’s “id” column will act as the primary key. The HAVO plots dataset’s “id” column will act as the foreign key.
# Joining HAVO plots data with the original dataset
havo_extra <- havoplots_uid %>% left_join(new_dt, by = "id")
# "left join" refers to how the join will be assembled. Performing a "Left join" preserves the original observations even when
# there isn’t a match.
# HAVO plots data is being left joined to the original dataset by the "id" column (surrogate key for both datasets)
Check out the new filtered table that pulls data from 2 different datasets:
# Selecting columns to be in the new data frame
havo_new <- havo_extra %>% select (Plot_Code:Provisional.Community.Name, Species,
seedmass.g., Surveyors)
# Choosing to display the 5 rows of the 5 selected columns
havo_new_snip <- havo_new[1:5, 1:5]
# Displaying a snippet of the HAVO functional trait table
kable(havo_new_snip) %>%
kable_styling(bootstrap_options = "striped", full_width = T, position = "center")
| Plot_Code | Project_Name | Provisional.Community.Name | Species | seedmass.g. |
|---|---|---|---|---|
| HAVO.0001L | Hawaii Volcanoes National Park | Leptam-Dodvis/Melmin_Shrubland | DODVIS | 0.04470 |
| HAVO.0002L | Hawaii Volcanoes National Park | Schcon-Melmin_Grassland | LEPTAM | NA |
| HAVO.0003L | Hawaii Volcanoes National Park | Schcon-Melmin_Grassland | MELMIN | 0.00011 |
| HAVO.0004L | Hawaii Volcanoes National Park | Metpol/Leptam_Woodland | METPOL | 0.00006 |
| HAVO.0005L | Hawaii Volcanoes National Park | Sparse | MORFAY | 0.08400 |
To get to know HAVO a little bit better, we have questions to ask of it. Personally, I am curious if seed mass is influenced by elevation and % Nitrogen content in the leaves. I’m looking into whether the position along an elevation gradient and leaf nutrient content has an effect on seed mass.
Question: Is seed mass influenced by elevation and % leaf Nitrogen content?
Picture this [dataset]! First, let’s do some exploratory work:
# Exploratory plot
# This plot shows each variable in relationship to one another as well as a histogram of each variable's distribution.
havo_quest <- havo_uid %>% select (Cover:C.N, ET.mm.:precip.mm.) # Selecting our data frame to only include continuous variables
havo_quest_short <- havo_uid %>% select (LDMC.g.g.:Wdensity, seedmass.g., X.N, ele.m.) # Decreasing columns in data frame
library(psych)
pairs.panels(havo_quest_short[,-5], # exploratory plot
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
#Spatial location of observations by seed mass
#We can see by this visualization that there is variability in seed mass. We'll further investigate what may be causing that using multiple regression (next tab!).
spa <- havo_uid %>% select (UTM_X,UTM_Y) # Data frame of only UTM coordinates
plot(spa, asp = 1, main = "Seed mass weights per location", pch = 21,col = "white", bg = "orange", # Seed mass weights shown by geographic location of observation
cex = havo_uid$seedmass.g., xlab = "x coordinate (UTM)", ylab = "y coordinate (UTM)")
We want to build a model for determining whether seed mass is influenced by elevation, and % leaf Nitrogen content.
What’s the seed mass variation like per observation location?
Our equation for this analysis is: seed mass = b0 + b1elevation + b2LNitrogen
Our statistical model looks like this:
model <- lm(seedmass.g. ~ele.m. + X.N, data = havo_uid) # multiple regression model
Based on this model, we see there are no statistically significant variables in this model.
summary(model) # provides result overview of the multiple regression model
##
## Call:
## lm(formula = seedmass.g. ~ ele.m. + X.N, data = havo_uid)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9374 -0.9447 -0.5586 -0.0018 10.2538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3051353 0.7945153 -1.643 0.1020
## ele.m. 0.0015743 0.0009026 1.744 0.0827 .
## X.N 0.6929380 0.4029255 1.720 0.0870 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.171 on 199 degrees of freedom
## (45 observations deleted due to missingness)
## Multiple R-squared: 0.04183, Adjusted R-squared: 0.0322
## F-statistic: 4.344 on 2 and 199 DF, p-value: 0.01424
This regression output suggests seed mass is not highly influenced by elevation and % leaf nitrogen content. 4% of the variance in the measure of seed mass can be predicted by elevation and % leaf nitrogen content.
#Visualizing the regression model
library(splines)
library(ggplot2)
library(visreg)
fit <- lm(seedmass.g. ~ ele.m. +ns(X.N, df=2), data=havo_uid) # Visualizing the regression model against the % leaf Nitrogen variable
visreg(fit, "X.N", gg=TRUE) + geom_smooth(method="loess", col='#FF4E37', fill='#FF4E37')
If you’ve come this far, we’ve reached the end of our vegetation data tour in Hawai’i Volcanoes National Park. We got to experience many data stops like reviewing data, imputing missing values, database creation, as well as data visualization and analysis using multiple regression. I hope you enjoyed your time here. Mahalo for checking this out!
Henn, J.J. and Yelenik, S.G. 2020, Hawaii Volcanoes National Park Plant Trait, Percent Cover, and Environmental Data 2014: U.S. Geological Survey data release, https://doi.org/10.5066/P9WCNWEJ.