Mesic, native-dominated forest in Hawaiian Volcanoes National Park (S. Yelenik).

Mesic, native-dominated forest in Hawaiian Volcanoes National Park (S. Yelenik).

E komo mai!

Welcome to a journey through Hawai’i Volcanoes National Park (HAVO) where you’ll vicariously experience HAVO via data collected by researchers Jonathan J. Henn, Stephanie G. Yelenik, and Ellen I. Damschen. This document provides an example for processing ecological data through various techniques using RStudio and RMarkdown.

Introduction

I selected this dataset from HAVO because my graduate research at University of Hawai’i at Hilo is also based within the park’s boundaries. My graduate research involves detecting invasive plants found within HAVO, so this vegetation dataset is not directly related, but both are vegetation focused. My goal is to present various methods to approach, visualize, and analyze ecological data using RStudio and RMarkdown. I hope to discover endless options to maximize data and find joy in the process.

havo <- read.csv("HAVO_plant_cover_trait_and_environmental_data.csv") # Loading the .csv file into R
Data was downloaded from USGS ScienceBase-Catalog

Let’s take a closer look at this vegetation dataset:

Before running any analyses, the procedures and code found in each tab provide a general overview of our data by describing it. Check out each tab to learn more.

View Data

Behold, the Data! This table is simply showing you the datasheet in its entirety through a series of pages thanks to the R Markdown + paged_table() function.

library(rmarkdown)
paged_table(havo)

Structure

The str() function is a simple way to view data’s “basic” structure. It communicates the classes of your variables and the number of observations. In this case, we see this dataset contains continuous and categorical (or factor) data across 30 different variables.

str(havo)
## 'data.frame':    247 obs. of  30 variables:
##  $ UTM_X       : int  266654 270444 267456 273367 274006 267290 271015 270637 270480 271101 ...
##  $ UTM_Y       : int  2138394 2138782 2140371 2140273 2139901 2140451 2137926 2137440 2138110 2137237 ...
##  $ Species     : Factor w/ 24 levels "ANDVIR","BULCAP",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Plot        : int  2012 2026 2030 2053 2054 2076 2077 2078 2191 2194 ...
##  $ Cover       : num  0.02 0.1 0.01 0.01 0.05 0.01 0.24 0.08 0.05 0.05 ...
##  $ METPOLcov   : num  0 0.32 0.13 0.07 0.15 0.09 0 0.08 0.135 0.1 ...
##  $ LDMC.g.g.   : num  0.43 0.408 0.413 0.347 0.342 0.445 0.402 0.407 0.42 0.407 ...
##  $ LDMCstd     : num  0.037 0.014 0.025 0.015 0.023 0.058 0.023 0.023 0.034 0.021 ...
##  $ AREA.cm2.   : num  2.54 3.29 3.74 3.14 3.55 ...
##  $ AREAstd     : num  0.302 0.162 0.731 0.771 0.731 0.438 0.152 0.337 0.192 0.199 ...
##  $ SLA.cm2.g.  : num  242 237 270 130 173 ...
##  $ SLAstd      : num  27.4 12.9 15.6 37.2 29.3 ...
##  $ Ldens.g.cm3.: num  0.618 0.455 0.532 1.037 0.904 ...
##  $ Ldensstd    : num  0.078 0.078 0.199 0.303 1.523 ...
##  $ Wdensity    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ height.m.   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ seedmass.g. : num  1.3 1.3 1.3 1.3 1.3 ...
##  $ X.N         : num  0.932 0.822 1.078 0.864 0.78 ...
##  $ X.C         : num  43.5 42.4 41.9 43 42.9 ...
##  $ C.N         : num  46.7 51.6 38.9 49.8 55 ...
##  $ growth      : Factor w/ 5 levels "F","G","H","S",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ status      : Factor w/ 3 levels "E","I","N": 3 3 3 3 3 3 3 3 3 3 ...
##  $ Numsamples  : int  5 5 10 12 10 10 5 5 5 5 ...
##  $ ET.mm.      : num  751 855 931 885 897 ...
##  $ Rad.w.m2.   : num  225 212 221 204 203 ...
##  $ T.C.        : num  18.6 18.5 17.7 18.3 18.4 ...
##  $ ele.m.      : num  763 753 876 764 750 ...
##  $ precip.mm.  : num  2009 2254 2136 2631 2603 ...
##  $ rain        : Factor w/ 3 levels "DRY","INT","WET": 2 2 3 3 3 3 2 2 2 2 ...
##  $ temp        : Factor w/ 2 levels "COOL","HOT": 2 2 2 2 2 2 2 2 2 2 ...

Summary

To get a better sense of the distribution of your variables in the dataset, the summary() function shows descriptive statistics per variable (depending on the type of variable).

summary(havo)
##      UTM_X            UTM_Y            Species         Plot     
##  Min.   :258616   Min.   :2136979   METPOL : 32   Min.   :2006  
##  1st Qu.:263262   1st Qu.:2137779   DODVIS : 25   1st Qu.:2054  
##  Median :265326   Median :2139901   LEPTAM : 25   Median :2078  
##  Mean   :266562   Mean   :2141562   MORFAY : 19   Mean   :2109  
##  3rd Qu.:271015   3rd Qu.:2147403   SCHCON : 16   3rd Qu.:2194  
##  Max.   :274420   Max.   :2149009   ANDVIR : 14   Max.   :2212  
##                                     (Other):116                 
##      Cover          METPOLcov        LDMC.g.g.         LDMCstd       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.1490   Min.   :0.00400  
##  1st Qu.:0.0100   1st Qu.:0.0150   1st Qu.:0.3285   1st Qu.:0.02000  
##  Median :0.0200   Median :0.1000   Median :0.4080   Median :0.03200  
##  Mean   :0.1014   Mean   :0.2294   Mean   :0.3944   Mean   :0.03589  
##  3rd Qu.:0.0950   3rd Qu.:0.4000   3rd Qu.:0.4610   3rd Qu.:0.04100  
##  Max.   :0.8900   Max.   :0.7650   Max.   :0.6710   Max.   :0.21100  
##                                                                      
##    AREA.cm2.         AREAstd         SLA.cm2.g.        SLAstd       
##  Min.   : 0.144   Min.   : 0.012   Min.   :  0.0   Min.   :  0.000  
##  1st Qu.: 3.299   1st Qu.: 0.582   1st Qu.: 67.0   1st Qu.:  7.599  
##  Median : 7.314   Median : 1.567   Median :104.9   Median : 12.613  
##  Mean   :12.879   Mean   : 3.448   Mean   :116.6   Mean   : 16.180  
##  3rd Qu.:14.904   3rd Qu.: 3.693   3rd Qu.:150.7   3rd Qu.: 19.827  
##  Max.   :86.492   Max.   :23.593   Max.   :452.6   Max.   :163.617  
##  NA's   :1        NA's   :1                                         
##   Ldens.g.cm3.       Ldensstd         Wdensity        height.m.     
##  Min.   :0.1450   Min.   :0.0070   Min.   :0.4241   Min.   : 0.300  
##  1st Qu.:0.4298   1st Qu.:0.1890   1st Qu.:0.6648   1st Qu.: 1.500  
##  Median :0.5425   Median :0.4390   Median :0.7971   Median : 6.000  
##  Mean   :0.5545   Mean   :0.6517   Mean   :0.8043   Mean   : 6.571  
##  3rd Qu.:0.6647   3rd Qu.:0.8450   3rd Qu.:0.9446   3rd Qu.:10.000  
##  Max.   :1.3510   Max.   :4.4130   Max.   :1.1760   Max.   :16.000  
##  NA's   :1        NA's   :1        NA's   :114                      
##   seedmass.g.            X.N              X.C             C.N        growth
##  Min.   : 0.00001   Min.   :0.3902   Min.   :26.31   Min.   :18.99   F:27  
##  1st Qu.: 0.00006   1st Qu.:0.7274   1st Qu.:42.76   1st Qu.:36.77   G:64  
##  Median : 0.04470   Median :0.9106   Median :45.32   Median :48.42   H: 2  
##  Mean   : 0.83702   Mean   :1.0091   Mean   :44.85   Mean   :50.57   S:63  
##  3rd Qu.: 0.47000   3rd Qu.:1.1786   3rd Qu.:47.07   3rd Qu.:62.69   T:91  
##  Max.   :11.06031   Max.   :2.3247   Max.   :51.52   Max.   :99.36         
##  NA's   :30         NA's   :15       NA's   :15      NA's   :15            
##  status    Numsamples         ET.mm.        Rad.w.m2.          T.C.      
##  E: 77   Min.   : 3.000   Min.   :730.8   Min.   :197.4   Min.   :15.42  
##  I: 68   1st Qu.: 5.000   1st Qu.:799.1   1st Qu.:209.0   1st Qu.:15.90  
##  N:102   Median : 6.000   Median :863.2   Median :213.0   Median :17.90  
##          Mean   : 7.239   Mean   :853.6   Mean   :213.6   Mean   :17.60  
##          3rd Qu.: 8.000   3rd Qu.:895.5   3rd Qu.:221.5   3rd Qu.:18.56  
##          Max.   :25.000   Max.   :989.8   Max.   :225.8   Max.   :19.44  
##                                                                          
##      ele.m.         precip.mm.    rain       temp    
##  Min.   : 620.0   Min.   :1904   DRY: 46   COOL: 69  
##  1st Qu.: 751.3   1st Qu.:2009   INT: 69   HOT :178  
##  Median : 851.9   Median :2189   WET:132             
##  Mean   : 890.2   Mean   :2346                       
##  3rd Qu.:1135.0   3rd Qu.:2631                       
##  Max.   :1190.4   Max.   :3316                       
## 

Metadata

Luckily, this dataset is accompanied by substantial metadata, additional information that describes and provides details about the dataset. It is nearly thorough, but it lacks explanations for missing data values for a handful of variables. Check out the Data Issues tab for more details on how many NA values are missing from which columns.

You can find this metadata here.

Below is a table that compiles some details of who collected this data and when:

Originators Beginning of data collection End of data collection
Jonathan J. Henn 06/01/2014 09/01/2014
Stephanie G. Yelenik 06/01/2014 09/01/2014
Ellen I. Damschen 06/01/2014 09/01/2014

Purpose of Study

Determining the characteristics of non-native plants that can successfully establish and spread is central to pressing questions in invasion ecology. Evidence suggests that some non-native species establish and spread in new environments because they possess characteristics (functional traits) that allow them to either successfully compete with native residents or fill previously unfilled niches. However, the relative importance of out-competing native species vs. filling empty niche space as potential mechanisms of invasion may depend on environmental characteristics. Here, we measured plant functional traits, proxies indicative of competitive and establishment strategies, to determine if these traits vary among native and invasive species and if their prevalence is dependent on environmental conditions. Using a natural environmental gradient in Hawai’i Volcanoes National Park, we evaluated how functional traits differ between native and non-native plant communities and if these differences change along an environmental gradient from hot, dry to cool, wet conditions.

Data Issues

You may have noticed from flipping through that there are missing data represented by NA. Below is a count and sum function. The count function below counts how many NA values are found in each variable. The sum function provides the total NA count across all variables. We find out there are a total of 193 missing values in this dataset. Don’t panic. We’ll attempt to bootstrap or strategically replace these values later.

NA_count <-sapply(havo, function(y) sum(length(which(is.na(y))))) # locating all NA entries

NA_count <- data.frame(NA_count) # creating a data table out of NA entries

print(NA_count) # displays the NA count table
##              NA_count
## UTM_X               0
## UTM_Y               0
## Species             0
## Plot                0
## Cover               0
## METPOLcov           0
## LDMC.g.g.           0
## LDMCstd             0
## AREA.cm2.           1
## AREAstd             1
## SLA.cm2.g.          0
## SLAstd              0
## Ldens.g.cm3.        1
## Ldensstd            1
## Wdensity          114
## height.m.           0
## seedmass.g.        30
## X.N                15
## X.C                15
## C.N                15
## growth              0
## status              0
## Numsamples          0
## ET.mm.              0
## Rad.w.m2.           0
## T.C.                0
## ele.m.              0
## precip.mm.          0
## rain                0
## temp                0
NA_total <- sum(NA_count) # sum of all NA entries/variable

print(NA_total) # displays the total NA entries
## [1] 193

Tidying Up

Data analysis is often preceded with cleaning and preparing data. Data tidying is the process of structuring datasets to facilitate analysis. Tidy data is a standardized way to map the meaning of a dataset to its structure or how its rows, columns, and tables are matched up with observations, variables, and types. You have tidy data when:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each value must have its own cell.

Is this dataset tidy?

library(dplyr)
library(kableExtra)
dt <- havo[1:5, 1:6] # creating a snippet of the data table for viewing purposes
kable(dt) %>% # displaying the snippet table
  kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")
UTM_X UTM_Y Species Plot Cover METPOLcov
266654 2138394 ANDVIR 2012 0.02 0.00
270444 2138782 ANDVIR 2026 0.10 0.32
267456 2140371 ANDVIR 2030 0.01 0.13
273367 2140273 ANDVIR 2053 0.01 0.07
274006 2139901 ANDVIR 2054 0.05 0.15

Our data is indeed tidy! Each variable has its own column, each observation has its own row, and each value has its own cell. However, we are missing one component to get a more organized dataset: we need unique identifiers for each observation. We’ll create a unique ID based on the UTM_X column.

Adding Unique Identifiers

We’ll run one command to make a unique key for each observation.

library(dplyr)

havo_uid <- mutate(havo,id = dplyr::row_number(UTM_X)) # command to create a unique ID column
new_dt <- havo_uid %>% select(id, everything()) # moving unique ID column to first column

show_new_dt <- new_dt[1:5, 1:7] # creating a snippet of the data table for viewing purposes

kable(show_new_dt) %>% # displaying the snippet table
  kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")
id UTM_X UTM_Y Species Plot Cover METPOLcov
131 266654 2138394 ANDVIR 2012 0.02 0.00
164 270444 2138782 ANDVIR 2026 0.10 0.32
152 267456 2140371 ANDVIR 2030 0.01 0.13
226 273367 2140273 ANDVIR 2053 0.01 0.07
233 274006 2139901 ANDVIR 2054 0.05 0.15

Success!

Here’s a snippet of our new data table complete with unique IDs for each observation.

Imputation

Remember when we found out there are 193 missing values in this dataset? Let’s address those now.

MICE

A commonly used package for missing values imputation is MICE (Multivariate Imputation via Chained Equations). This package creates multiple imputations as compared to a single imputation (such as mean) to take care of uncertainty in missing values. It imputes data on a variable by variable basis by specifying an imputation model per variable.

The MICE plot is indicating the Wdensity (wood density) variable has the most NA values (about 41% of observations are missing Wdensity data). We’ll perform imputations on the Wdensity to compute values for the 115 missing observations. We’ll use the cart (imputation by classification and regression trees) method.

library(mice)
library(VIM)
# Plot of the distribution of NA values per variable
mice_plot <- aggr(new_dt, col=c('pink', 'yellow'), numbers=TRUE, sortVars=TRUE, labels=names(new_dt), cex.axis=.7, gap=3, ylab=c("Missing data", "Pattern"))

Results

new_dt <- havo_uid %>% select(id, everything())
# Imputing missing values of 3 different iterations
new_dt_i <- mice(new_dt, m = 3, maxit = 50, method = 'cart', seed = 500)

# Create a plot of each iteration of imputed data
plot(new_dt_i) # Create a plot of each iteration of imputed data

Comparison

Time to compare the 3 new imputations data vs. the original values from the Wdensity column. First we see the 3 imputed datasets of Wdensity, the original data, then visual comparisons of both the imputed and observed data.

new_dt_i$imp$Wdensity # Viewing 3 iterations of imputation values for Wdensity column
##             1         2         3
## 1   0.8775906 0.7970461 0.8198617
## 2   0.9139951 0.8597114 1.0059142
## 3   0.9508514 0.7970461 0.9894424
## 4   0.8086734 0.9054732 1.0089507
## 5   0.7939860 0.7970461 1.0875365
## 6   0.8301417 0.7172554 0.8914270
## 7   0.9014814 0.9054732 1.0732105
## 8   0.7898590 0.9508514 1.0732105
## 9   0.9054732 0.7970461 1.0296478
## 10  0.7939860 0.8086734 0.8198617
## 11  0.8775906 0.8597114 0.9358146
## 12  0.9054732 0.9139951 1.1720072
## 13  0.9014814 0.9508514 1.1720072
## 14  0.9014814 0.7939860 1.0526980
## 15  0.9508514 0.8350667 0.8914270
## 16  0.9109866 0.8597114 1.0087957
## 17  0.8757702 0.7946580 0.9767145
## 18  0.9054732 0.8301417 0.8736248
## 19  0.8757702 0.7172554 1.0732105
## 20  0.9054732 0.7600926 0.9915416
## 21  0.9139951 0.7946580 0.8321862
## 22  0.8086734 0.7939860 0.8736248
## 23  0.8086734 0.7970461 0.9581061
## 24  0.9508514 0.7939860 1.1338318
## 25  0.6450106 0.6739045 0.5585264
## 26  0.6362979 0.6739045 0.6982613
## 27  0.6717506 0.6739045 0.6140330
## 28  0.5832948 0.7638803 0.6450106
## 29  0.7131416 0.6362979 0.6450106
## 30  0.7131416 0.6347277 0.6140330
## 31  0.5882865 0.6371286 1.1759775
## 32  0.6347277 0.6982613 0.5706291
## 33  0.7131416 0.5667743 0.6086379
## 34  0.6347277 0.6717506 1.1759775
## 35  0.8256663 0.6371286 1.1759775
## 36  0.5882865 0.5950523 0.5706291
## 37  0.6717506 0.5950523 0.6422176
## 41  0.5834137 0.6200396 0.6898228
## 42  1.1603136 0.9574198 0.6867500
## 44  0.9781657 0.5950523 0.9814638
## 45  0.7008337 0.7424247 0.6128773
## 46  0.9154403 0.6917519 0.6235502
## 47  0.8123461 0.6917519 0.8914270
## 48  0.5992829 0.6371286 0.6235502
## 49  0.8729342 0.6739045 0.6586052
## 50  0.6200396 0.6917519 0.8729342
## 51  0.8729342 0.6371286 0.7268824
## 64  1.0526980 0.8649921 0.9894424
## 69  0.9888211 0.8123461 0.9983258
## 70  0.9267537 1.0128508 0.9476086
## 77  0.6235502 0.6422176 0.8729342
## 78  0.8757702 0.9123416 0.8914270
## 79  0.7623218 0.5992829 0.6586052
## 80  0.6128773 0.5882865 0.6586052
## 81  0.6953691 0.6235502 0.7970461
## 82  0.5992829 0.6086379 0.5992829
## 83  0.7623218 0.6235502 0.5992829
## 84  0.5992829 0.6086379 0.7623218
## 103 0.6895958 1.0087957 0.9576641
## 106 0.8350667 0.8321862 0.9440869
## 110 0.8585566 0.7800303 0.9109866
## 115 0.9888211 0.7799323 0.9306841
## 116 0.8928684 0.5170285 0.9054732
## 117 0.9983258 0.5667743 0.7581210
## 118 0.8928684 0.7246467 1.0310442
## 119 0.9888211 0.5706291 0.8928684
## 120 0.9781657 0.5521894 0.9054732
## 121 0.8928684 0.5838757 0.9446464
## 122 1.0528247 0.7377041 0.9571675
## 123 0.9305814 0.6140330 0.9915416
## 124 0.9983258 0.5706291 0.8123461
## 125 0.9305814 0.6717506 0.8649921
## 126 1.0128508 0.5667743 0.8650877
## 146 0.6347277 0.7638803 0.6647740
## 172 0.7898590 0.8597114 0.7172554
## 180 0.8775906 0.8350667 0.8301417
## 189 0.9983258 0.6703886 0.9983258
## 190 0.6441729 0.6450106 0.7268824
## 191 0.9425009 0.5950523 1.0059142
## 192 0.6128773 0.6788970 0.7008337
## 193 0.8123461 0.6450106 1.0667986
## 194 0.8928684 0.6982613 0.8585566
## 200 0.9508514 0.6086379 0.6953691
## 201 0.9054732 0.5882865 0.7970461
## 202 0.9508514 0.6422176 0.7970461
## 203 0.5992829 1.0528247 0.7623218
## 204 0.7800303 0.9139951 0.9825617
## 205 0.7600926 0.6898228 0.9571675
## 208 0.8886295 0.5834137 0.8585566
## 210 0.9425009 0.9305814 1.0128508
## 215 0.9571675 0.7172554 0.9894424
## 216 0.8650877 0.6200396 0.7581210
## 221 0.7970461 0.9508514 1.0089507
## 222 0.7600926 0.8086734 1.0296478
## 223 0.8757702 0.7946580 1.0089507
## 224 0.9508514 0.7946580 1.0528247
## 225 0.8301417 0.7172554 0.9014814
## 226 0.8086734 0.7946580 0.8585566
## 227 0.7600926 0.8597114 0.9919641
## 228 0.8350667 0.7939860 0.8585566
## 229 0.9508514 0.8301417 1.0526980
## 230 0.9014814 0.7172554 0.9814638
## 231 0.7521280 0.7600926 0.6895958
## 232 0.7255751 0.9054732 1.0089507
## 233 0.8350667 0.7946580 0.9014814
## 234 0.8086734 0.7600926 1.0875365
## 235 0.7898590 0.9054732 0.9814638
## 236 0.9508514 0.7946580 1.0732105
## 240 0.9781657 0.8123461 0.8585566
## 241 0.8736248 0.8928684 0.9358146
## 242 0.8928684 0.9267537 0.9888211
## 243 0.9814638 0.7909873 0.9571675
## 244 0.9267537 0.7909873 0.9425009
## 245 0.9571675 0.8650877 0.9425009
new_dt$Wdensity # Viewing Wdensity column from original dataset
##   [1]        NA        NA        NA        NA        NA        NA        NA
##   [8]        NA        NA        NA        NA        NA        NA        NA
##  [15]        NA        NA        NA        NA        NA        NA        NA
##  [22]        NA        NA        NA        NA        NA        NA        NA
##  [29]        NA        NA        NA        NA        NA        NA        NA
##  [36]        NA        NA 0.5834137 0.6867500 0.6422176        NA        NA
##  [43] 0.6086379        NA        NA        NA        NA        NA        NA
##  [50]        NA        NA 0.9894424 0.9825617 0.8649921 0.8650877 1.0310442
##  [57] 0.9781657 0.8928684 0.7581210 0.9306841 0.9476086 0.9814638 0.9888211
##  [64]        NA 0.9571675 0.8736248 0.8123461 1.0128508        NA        NA
##  [71] 0.9446464 0.9919641 0.8198617 1.0918601 1.0526980 1.0059142        NA
##  [78]        NA        NA        NA        NA        NA        NA        NA
##  [85] 0.4631419 0.4984586 0.5562989 0.4240881 0.6118715 0.9440869 0.7869858
##  [92] 1.0732105 0.9109866 0.6895958 1.1338318 0.9767145 1.0860136 0.9559145
##  [99] 0.8585566 1.0875365 0.8081711 0.7800303        NA 0.8321862 0.8471155
## [106]        NA 1.0296478 0.7255751 0.9915416        NA 0.7898590 0.8886295
## [113] 0.8914270 1.0087957        NA        NA        NA        NA        NA
## [120]        NA        NA        NA        NA        NA        NA        NA
## [127] 0.6200396 0.6441729 0.5992829 0.7377041 0.6613461 0.5585264 0.8256663
## [134] 0.5667743 0.5950523 0.6450106 0.6788970 0.5832948 0.7424247 0.4675798
## [141] 0.6140330 0.5521894 0.6513954 0.6371286 0.6362979        NA 0.6917519
## [148] 0.6647740 0.6347277 0.7799323 0.7246467 0.6703886 0.5170285 0.7131416
## [155] 0.4887833 0.5838757 0.6717506 0.5706291 0.6982613 0.6739045 0.7638803
## [162] 0.7600926 0.9508514 0.6953691 0.8775906 0.8757702 0.7939860 0.9054732
## [169] 0.8301417 0.9139951 0.7172554        NA 0.7521280 0.7268824 0.8086734
## [176] 0.8597114 0.7946580 0.7970461 0.8350667        NA 0.6235502 0.9123416
## [183] 0.7008337 0.6128773 0.6586052 0.7623218 0.9154403 0.8729342        NA
## [190]        NA        NA        NA        NA        NA 0.9014814 0.9358146
## [197] 1.1720072 1.0089507 1.0667986        NA        NA        NA        NA
## [204]        NA        NA 0.9305814 0.9581061        NA 0.9576641        NA
## [211] 1.0528247 0.9425009 0.7494724 0.9983258        NA        NA 0.5453998
## [218] 0.6993783 0.6898228 0.5882865        NA        NA        NA        NA
## [225]        NA        NA        NA        NA        NA        NA        NA
## [232]        NA        NA        NA        NA        NA 1.1759775 1.1603136
## [239] 0.9574198        NA        NA        NA        NA        NA        NA
## [246] 0.7909873 0.9267537

The density of the imputed data for each imputed dataset is showed in magenta while the density of the observed data is showed in blue. The distributions between the observed dataset and the imputed dataset are fairly similar.

densityplot(new_dt_i) 

Another useful visual take on the distributions can be obtained using the stripplot() function that shows the distributions of the variables as individual points. Below we see the imputed points (in magenta) within reasonable range with the observed dataset.

stripplot(new_dt_i, pch = 20, cex = 1.2)

Relational Database

Before we begin our relational database journey, we need to make sure we have variables to base the join and connect the datasets. Our goal is to combine the plant functional trait dataset with another vegetation dataset from HAVO. ### Join another dataset

havoplots <- read.csv("havoplots.csv")

Keys, keys, keys

Remember our unique ID creation for the original dataset? The original dataset lacks a primary key or a key that uniquely identify an observation in a table, so that’s why we created one. The new, unique “id” field is also known as a surrogate key. This will act as our primary key for the original dataset now.

Here’s a snippet of the havoplots dataset, fully loaded with a unique ID (a surrogate ID that was created in the same way as above) AND the original havo functional trait dataset:

library(dplyr)

havoplots_uid <- mutate(havoplots,id = dplyr::row_number(Plot_Code)) # Creating a surrogate key for havoplots dataset based on column Plot_Code

havoplots_uid_2 <- havoplots_uid %>% select(id, everything()) # Moving the surrogate "id" key to the first column

havoplots_snip <- havoplots_uid_2[1:5, 1:5] # Creating a snippet of the data table for viewing purposes

kable(havoplots_snip) %>% # Displaying a snippet of the HAVO plot
  kable_styling(bootstrap_options = "striped", full_width = T, position = "center")
id Plot_Code Project_Name Provisional.Community.Name NVC_Elcode
1 HAVO.0001L Hawaii Volcanoes National Park Leptam-Dodvis/Melmin_Shrubland CEGL008164
2 HAVO.0002L Hawaii Volcanoes National Park Schcon-Melmin_Grassland CEGL008169
3 HAVO.0003L Hawaii Volcanoes National Park Schcon-Melmin_Grassland CEGL008169
4 HAVO.0004L Hawaii Volcanoes National Park Metpol/Leptam_Woodland CEGL008132
5 HAVO.0005L Hawaii Volcanoes National Park Sparse CEGL008132
kable(show_new_dt) %>% # Displaying a snippet of the HAVO functional trait table
  kable_styling(bootstrap_options = "striped", full_width = T, position = "center")
id UTM_X UTM_Y Species Plot Cover METPOLcov
131 266654 2138394 ANDVIR 2012 0.02 0.00
164 270444 2138782 ANDVIR 2026 0.10 0.32
152 267456 2140371 ANDVIR 2030 0.01 0.13
226 273367 2140273 ANDVIR 2053 0.01 0.07
233 274006 2139901 ANDVIR 2054 0.05 0.15

Tables x2

It’s 2 table party time! Remember: the original dataset’s “id” column will act as the primary key. The HAVO plots dataset’s “id” column will act as the foreign key.

# Joining HAVO plots data with the original dataset
havo_extra <- havoplots_uid %>% left_join(new_dt, by = "id")

# "left join" refers to how the join will be assembled. Performing a "Left join" preserves the original observations even when
# there isn’t a match. 
# HAVO plots data is being left joined to the original dataset by the "id" column (surrogate key for both datasets)

Final Product

Check out the new filtered table that pulls data from 2 different datasets:

# Selecting columns to be in the new data frame
havo_new <- havo_extra %>% select (Plot_Code:Provisional.Community.Name, Species,
seedmass.g., Surveyors)
# Choosing to display the 5 rows of the 5 selected columns
havo_new_snip <- havo_new[1:5, 1:5]
# Displaying a snippet of the HAVO functional trait table
kable(havo_new_snip) %>% 
  kable_styling(bootstrap_options = "striped", full_width = T, position = "center")
Plot_Code Project_Name Provisional.Community.Name Species seedmass.g.
HAVO.0001L Hawaii Volcanoes National Park Leptam-Dodvis/Melmin_Shrubland DODVIS 0.04470
HAVO.0002L Hawaii Volcanoes National Park Schcon-Melmin_Grassland LEPTAM NA
HAVO.0003L Hawaii Volcanoes National Park Schcon-Melmin_Grassland MELMIN 0.00011
HAVO.0004L Hawaii Volcanoes National Park Metpol/Leptam_Woodland METPOL 0.00006
HAVO.0005L Hawaii Volcanoes National Park Sparse MORFAY 0.08400

Q&A

To get to know HAVO a little bit better, we have questions to ask of it. Personally, I am curious if seed mass is influenced by elevation and % Nitrogen content in the leaves. I’m looking into whether the position along an elevation gradient and leaf nutrient content has an effect on seed mass.

Question: Is seed mass influenced by elevation and % leaf Nitrogen content?

Visuals

Picture this [dataset]! First, let’s do some exploratory work:

# Exploratory plot
# This plot shows each variable in relationship to one another as well as a histogram of each variable's distribution.

havo_quest <- havo_uid %>% select (Cover:C.N, ET.mm.:precip.mm.) # Selecting our data frame to only include continuous variables
havo_quest_short <- havo_uid %>% select (LDMC.g.g.:Wdensity, seedmass.g., X.N, ele.m.) # Decreasing columns in data frame
library(psych)
pairs.panels(havo_quest_short[,-5], # exploratory plot
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

#Spatial location of observations by seed mass
#We can see by this visualization that there is variability in seed mass. We'll further investigate what may be causing that using multiple regression (next tab!).
spa <- havo_uid %>% select (UTM_X,UTM_Y) # Data frame of only UTM coordinates

plot(spa, asp = 1, main = "Seed mass weights per location", pch = 21,col = "white", bg = "orange",  # Seed mass weights shown by geographic location of observation
cex = havo_uid$seedmass.g., xlab = "x coordinate (UTM)", ylab = "y coordinate (UTM)")

Multiple Regression

We want to build a model for determining whether seed mass is influenced by elevation, and % leaf Nitrogen content.

What’s the seed mass variation like per observation location?

Our equation for this analysis is: seed mass = b0 + b1elevation + b2LNitrogen

Our statistical model looks like this:

model <- lm(seedmass.g. ~ele.m. + X.N, data = havo_uid) # multiple regression model

Based on this model, we see there are no statistically significant variables in this model.

summary(model) # provides result overview of the multiple regression model
## 
## Call:
## lm(formula = seedmass.g. ~ ele.m. + X.N, data = havo_uid)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9374 -0.9447 -0.5586 -0.0018 10.2538 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.3051353  0.7945153  -1.643   0.1020  
## ele.m.       0.0015743  0.0009026   1.744   0.0827 .
## X.N          0.6929380  0.4029255   1.720   0.0870 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.171 on 199 degrees of freedom
##   (45 observations deleted due to missingness)
## Multiple R-squared:  0.04183,    Adjusted R-squared:  0.0322 
## F-statistic: 4.344 on 2 and 199 DF,  p-value: 0.01424

This regression output suggests seed mass is not highly influenced by elevation and % leaf nitrogen content. 4% of the variance in the measure of seed mass can be predicted by elevation and % leaf nitrogen content.

#Visualizing the regression model
library(splines)
library(ggplot2)
library(visreg)

fit <- lm(seedmass.g. ~ ele.m. +ns(X.N, df=2), data=havo_uid) # Visualizing the regression model against the % leaf Nitrogen variable
visreg(fit, "X.N", gg=TRUE) + geom_smooth(method="loess", col='#FF4E37', fill='#FF4E37')

Final thoughts

If you’ve come this far, we’ve reached the end of our vegetation data tour in Hawai’i Volcanoes National Park. We got to experience many data stops like reviewing data, imputing missing values, database creation, as well as data visualization and analysis using multiple regression. I hope you enjoyed your time here. Mahalo for checking this out!

References

Henn, J.J. and Yelenik, S.G. 2020, Hawaii Volcanoes National Park Plant Trait, Percent Cover, and Environmental Data 2014: U.S. Geological Survey data release, https://doi.org/10.5066/P9WCNWEJ.