Well Measurement Data 2001-2015: QA/QC

Introduction

This document provides a basic interrogation of the manual well measurement data archived on the “Digital Collections of Colorado” (DCC) library site: Well Measurement Data 2001 - 2015

These analyses are aimed at characterizing basic data structure and identifying potential issues such as:

Inconsistently named factors such as site name or wel ID
Missing values
Data values outside of expected range or showing unusual patterns

My initital assumption is that the DCC data set represents the canonical version of the water table data, to which updates from 2015-2017 should be added. However, if there are errors in the DCC repository, these should be addressed first before attempting to update with more recent data.

Some basic exploratory analysis

wt.nsf <- read_tsv("data/NSF_data_archive_20180208/data/Well_measurement_data2001-2015/Yell_Well_Data_Manual2001_2015.txt")

wt.nsf.meta <- read_csv("data/NSF_data_archive_20180208/data/Well_measurement_data2001-2015/Yell_Well_Data_Manual2001_2015_METADATA.csv")

Field names from metadata file on DCC repository site

wt.nsf.meta %>% 
  knitr::kable()

Site	Site_1
Site2	Site/Treatment
Plot	Plot Treatment
Date	Date
Day	Day of Month
Month	Month
Year	Year
JULIAN_DAY	Day of Year
Site_Typ	Experimental or Observation
Wat_ID	Unique Well ID (EXP and OBS lumped)
Wat_ID2	Unique Well ID
Wat_Type	1 = Well, 2 = Staff Gauge
rellevel	Distance to water relative to ground surface
X	Easting (UTM Zone 12 - NAD83)
Y	Northing (UTM Zone 12 - NAD 83)
Z	Ground Surface Elevation (meters)
abslevel	Water Table Elevation (meters)

Questions/Issues:
1. How are staff gauge data recorded? Is this a corrected elevation or a distance from the gauge (t-post?) top?
2. What are the sources of x, y and z coordinates? Interogating the many folders of data, I found reference to total station survey work (from DK and earlier), but also GPS and (maybe) GIS-derived elevations floating about (e.g., pulled off of DEM). Do we have any info on how z-coordinates were obtained?

Comparison of missing values among fields (missing)

# first, make "Wat_ID" a character...
wt.nsf <- wt.nsf %>% 
  mutate(Wat_ID = as.character(Wat_ID))

# use some functions in skimr pkg to evaluate missing values.
s1 <- skimr::skim(wt.nsf)
s1 %>% 
  filter(stat=="missing") %>% 
  select(-c(level,formatted)) %>%
  arrange(-value) %>% 
  datatable()

Questions/Issues:
1. There are different levels of “missingness” between fields. Fields such as “abslevel”, “JULIAN_DAY” have the most missing values, but other fields also have issues…
2. The well id fields

Comparison of the number of unique values among fields (n unique)

The site and site2 fields differ in how they encode treatments. “Site” explicitly encodes treatments (e.g., DC, DX, etc.); “Site2” leaves these out. BUT, not consistent…

s1 %>% 
  filter(stat=="n_unique") %>% 
  select(-c(level,formatted)) %>% 
  datatable()

Questions/Issues:
1. Which concept of site (“Site” vs. “Site2”) should be used going forward? From disucssions on 2018-02-22, the “Site2” approach (once error-corrected) seems like the way?
2. How do the different Site ID approaches relate to well ID’s? are these unique to “Site”, “Site2”, or globally distinct?

Count of observations: Site

## distinct "Site"
wt.nsf %>%
  tabyl(Site) %>% 
  datatable()

Count of observations: Site2

## distinct "Site2"
wt.nsf %>%
  tabyl(Site2) %>% 
  datatable()

Count of observations: Plot

## distinct "plot"
wt.nsf %>%
  tabyl(Plot) %>% 
  select(-valid_percent) %>% 
  knitr::kable()

Plot	n	percent
CC	1983	0.1864773
CX	1679	0.1578898
DC	1711	0.1608990
DX	1918	0.1803649
null	54	0.0050781
Null	3259	0.3064698
NA	30	0.0028211

Some cross-tabulations…

The following cross-tabulations are meant to provide some insights into data structure and diagnose potential problems.

cross-tabulation: Site vs. Plot

wt.nsf %>% 
  crosstab(Site,Plot) %>%
  datatable(caption = "Count of observations")

Questions/Issues:
1. Note presence of two flavors of null: “null” and “Null”
2. Also note presence of NA. How to bring these properly into allignment?

cross-tabulation: Site2 vs. Plot

wt.nsf %>% 
  crosstab(Site2,Plot) %>% 
  datatable(caption = "Count of observations")

Questions/Issues:
1. Note miscoded Site2 entries. For example, there is “eb1” and “EB1”
2. “Site” looks like it is supposed to explicitly encode treatment and “Site2” is not; however, see the “Site2” entry for “EB2CC”, etc. Some errors appear to be present…
3. Similar issue as noted above regarding the null and NA values.

cross-tabulation: Site2 vs. Year

wt.nsf %>% 
  # names()
  crosstab(Site2,Year) %>%
  datatable(caption = "Count of observations")

Questions/Issues:
1. Were data collected in in 2007? Perhaps these are the “1900” data…

cross-tabulation: Wat_ID2 vs. Year

wt.nsf %>% 
  # names()
  crosstab(Wat_ID2,Year) %>%
  datatable(caption = "Count of observations")

Questions/Issues:
1. “1900” data…
2. The use of site and well IDs seem wonky in places. Can these be tied to field books? What should be considered “canonical”?

cross-tabulation: Wat_ID2 vs. Site2

wt.nsf %>% 
  crosstab(Wat_ID2,Site2) %>% 
  datatable()

Questions/Issues:
1. Are these supposed to 2007 data?
2. Where would the original data be housed? Field books, data files,?…

Graphical crap detection

Heat map: count of observations by “Wat_ID2” and “Site2”

wt.nsf %>% 
  # names()
  crosstab(Wat_ID2,Site2) %>%
  gather(key = Site2,value = cnt,-Wat_ID2) %>% 
  ggplot(aes(Site2,Wat_ID2)) +
  geom_tile(aes(fill=cnt), color="grey") +
  # scale_fill_gradient(low = "white", high = "darkred") +
  scale_fill_gradient2(low = "white", mid = "blue", high = "red", midpoint = 45, space = "Lab", na.value = "grey50", guide = "colourbar") +
  # scale_fill_viridis() +
  theme_bw() +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Questions/Issues:
1. Note NA values. Figure out which sites and decide whether to keep or drop…
2. Check what’s up with “Eunlabeled”. Can this be recovered as valid data or does it need to be dropped?
3. What’s the significance of the hyphenated ID’s (e.g., O47-1 and O47-2)?
4. Does the pattern of “missingness” match expectations?

Heat map: count of observations by Wat_ID2 and year EXCLUDING the data labelled with Year<2001

# s1
wt.nsf %>% 
  group_by(Wat_ID2, Year) %>%
  summarise(cnt = n()) %>%
  filter(Year>2000) %>% 
  ggplot(aes(Year, Wat_ID2)) +
  geom_tile(aes(fill=cnt), color='white') +
  scale_fill_viridis() +
  theme_bw() +
  labs(y="") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Questions/Issues:
1. Note the missing 2007 data.
2. How does the pattern of “missingness” match expectations?

“E” wells/staff gauges – jitter plots

# exp plots - box plot
# wt.nsf %>% 
#   group_by(Wat_ID2) %>%
#   mutate(cnt.WatID = n()) %>%
#   filter(Year>2000) %>%
#   mutate(Year = as.character(Year)) %>%
#   filter(str_detect(Wat_ID2, "E")) %>% 
#   # filter(cnt.WatID > 3 & rellevel < 999) %>% 
#   # ggplot(aes(Date,rellevel)) +
#   ggplot(aes(Year,rellevel)) + 
#   geom_boxplot(aes(group=Year)) +
#   facet_wrap(~Wat_ID2, ncol=5, scales = "free") +
#   theme(axis.text.x=element_text(angle=45, hjust=1))

## exp wells - jitter
wt.nsf %>% 
  group_by(Wat_ID2) %>%
  mutate(cnt.WatID = n()) %>%
  filter(Year>2000) %>%
  mutate(Year = as.character(Year)) %>%
  filter(str_detect(Wat_ID2, "E")) %>% 
  # filter(cnt.WatID > 3 & rellevel < 999) %>% 
  # ggplot(aes(Date,rellevel)) +
  ggplot(aes(Year,rellevel)) + 
  geom_jitter(width = 0.15, height = 0.15, aes(color = Year)) +
  # geom_point(color="darkblue", alpha=0.1,position = position_jitterdodge(dodge.width = 0.9, jitter.width = 0.2)) +
  # geom_boxplot(aes(group=Year)) +
  theme_minimal() +
  theme(legend.position = "none") +
  facet_wrap(~Wat_ID2, ncol=4, scales = "free") +
  theme(axis.text.x = element_text(face="bold", color="black", 
                           size=7, angle=75, hjust = 1))

## Warning: Removed 14 rows containing missing values (geom_point).

Questions/Issues:
1. Seem to be a fairly large number of questionable values. Individual outliers need to be examined more closely (e.g., E199 in 2013, E2 in 2011, etc.). Are these data entry errors, artifacts from the rescaling process, ??
2. Issues with some specific wells/gauges more clearly seem to be related to the rescaling (e.g., E25)
3. What’s with the wells with more limited records (e.g., E300 has only two years of data. Why? Was this installed for a specific and limited purpose or is this an issue with well IDs?)

“O” wells/staff gauges – jitter plots

## obs wells - box plots
# wt.nsf %>% 
#   group_by(Wat_ID2) %>%
#   mutate(cnt.WatID = n()) %>%
#   filter(Year>2000) %>%
#   mutate(Year = as.character(Year)) %>%
#   filter(str_detect(Wat_ID2, "O")) %>% 
#   # filter(cnt.WatID > 3 & rellevel < 999) %>% 
#   # ggplot(aes(Date,rellevel)) +
#   ggplot(aes(Year,rellevel)) + 
#   geom_boxplot(aes(group=Year)) +
#   facet_wrap(~Wat_ID2, ncol=5, scales = "free") +
#   theme(axis.text.x=element_text(angle=45, hjust=1))

## obs wells - jitter
wt.nsf %>% 
  group_by(Wat_ID2) %>%
  mutate(cnt.WatID = n()) %>%
  filter(Year>2000) %>%
  mutate(Year = as.character(Year)) %>%
  filter(str_detect(Wat_ID2, "O")) %>% 
  # filter(cnt.WatID > 3 & rellevel < 999) %>% 
  # ggplot(aes(Date,rellevel)) +
  ggplot(aes(Year,rellevel)) + 
  geom_jitter(width = 0.15, height = 0.15, aes(color = Year)) +
  # geom_point(color="darkblue", alpha=0.1,position = position_jitterdodge(dodge.width = 0.9, jitter.width = 0.2)) +
  # geom_boxplot(aes(group=Year)) +
  theme_minimal() +
  theme(legend.position = "none") +
  facet_wrap(~Wat_ID2, ncol=5, scales = "free") +
  theme(axis.text.x = element_text(face="bold", color="black", 
                           size=7, angle=75, hjust = 1))

## Warning: Removed 1 rows containing missing values (geom_point).

Questions/Issues:
1. Seem to be a fairly large number of questionable values. Individual outliers need to be examined more closely (e.g., O1 in 2013, O10 in 2014, O11 in 2014). Are these data entry errors, artifacts from the relativization process, ??
2. Issues with some specific wells/gauges more clearly seem to be related to the rescaling (e.g., O200, O21_2)
3. Why do some wells have such short records? I’m assuming some might be new (e.g., O122, which has observations only in 2014), but then there are the issues of wells/gauges that blink in and out. For example:ODP1 (2013 only), O42 (2008-2011), etc. Have well numbers been changed over the duration of the study?

Spatial plotting of the 2001-2015 DCC well data set

Create a sf representation of the wells/staff gauges from coordinates and plot.

Note: Recieved the following error: “Error in st_as_sf.data.frame(x = wt.nsf, coords = c(”Y“,”X“), crs =”+proj=utm +zone=12 +ellps=GRS80 +datum=NAD83 +units=m +no_defs“) : missing values in coordinates not allowed”
Inspection of the full data set found 71 records with missing x,y,z values.
Some records also have bogus coordinates (e.g., “999”)

## find missing
wt.nsf %>% 
  filter(is.na(X) | is.na(Y) | X == "999" | Y == "999") %>%
  datatable(caption = "Records with missing/bogus x or y coordinates.")

# create sf bject from coordinates. This time, filtering out records with missing x,y
wt.nsf.sf <- wt.nsf %>% 
  filter(!is.na(X)) %>% 
  filter(!is.na(Y)) %>% 
  filter(X != "999") %>% 
  st_as_sf(.,
           coords = c("X","Y"),
           crs = "+proj=utm +zone=12 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")

Map of wells/gauges from DCC dataset

Note: Records with missing/bogus coordinates removed.

## opentopo
# mapview(wt.nsf.sf,map.types = "OpenTopoMap", zcol = "Wat_ID2", alpha = 0.12, legend = FALSE)

## opentopo + imagery options
mapview(wt.nsf.sf, map.types = c("OpenTopoMap","Esri.WorldImagery"), zcol = "Wat_ID2", alpha = 0.12, legend = FALSE)

Locations of Sites/wells with records with a date of “1/1/1900”

Seem to be scattered widely…

## Locations of the "1900" values
wt.nsf.sf %>%
  filter(Year == "1900") %>%
  mapview(., map.types = c("OpenTopoMap","Esri.WorldImagery"), zcol = "Wat_ID2", alpha = 0.12, legend = FALSE)

Sites/wells with records with a date of “1/1/1900”

These need to be fixed…

wt.nsf %>%
  filter(Year <= 2000) %>% 
  select(1:4,Site_Typ,Wat_ID2,Wat_Type) %>% 
  datatable()