This document provides a basic interrogation of the manual well measurement data archived on the “Digital Collections of Colorado” (DCC) library site: Well Measurement Data 2001 - 2015
These analyses are aimed at characterizing basic data structure and identifying potential issues such as:
My initital assumption is that the DCC data set represents the canonical version of the water table data, to which updates from 2015-2017 should be added. However, if there are errors in the DCC repository, these should be addressed first before attempting to update with more recent data.
wt.nsf <- read_tsv("data/NSF_data_archive_20180208/data/Well_measurement_data2001-2015/Yell_Well_Data_Manual2001_2015.txt")
wt.nsf.meta <- read_csv("data/NSF_data_archive_20180208/data/Well_measurement_data2001-2015/Yell_Well_Data_Manual2001_2015_METADATA.csv")
wt.nsf.meta %>%
knitr::kable()
| Site | Site_1 |
|---|---|
| Site2 | Site/Treatment |
| Plot | Plot Treatment |
| Date | Date |
| Day | Day of Month |
| Month | Month |
| Year | Year |
| JULIAN_DAY | Day of Year |
| Site_Typ | Experimental or Observation |
| Wat_ID | Unique Well ID (EXP and OBS lumped) |
| Wat_ID2 | Unique Well ID |
| Wat_Type | 1 = Well, 2 = Staff Gauge |
| rellevel | Distance to water relative to ground surface |
| X | Easting (UTM Zone 12 - NAD83) |
| Y | Northing (UTM Zone 12 - NAD 83) |
| Z | Ground Surface Elevation (meters) |
| abslevel | Water Table Elevation (meters) |
Questions/Issues:
1. How are staff gauge data recorded? Is this a corrected elevation or a distance from the gauge (t-post?) top?
2. What are the sources of x, y and z coordinates? Interogating the many folders of data, I found reference to total station survey work (from DK and earlier), but also GPS and (maybe) GIS-derived elevations floating about (e.g., pulled off of DEM). Do we have any info on how z-coordinates were obtained?
# first, make "Wat_ID" a character...
wt.nsf <- wt.nsf %>%
mutate(Wat_ID = as.character(Wat_ID))
# use some functions in skimr pkg to evaluate missing values.
s1 <- skimr::skim(wt.nsf)
s1 %>%
filter(stat=="missing") %>%
select(-c(level,formatted)) %>%
arrange(-value) %>%
datatable()
Questions/Issues:
1. There are different levels of “missingness” between fields. Fields such as “abslevel”, “JULIAN_DAY” have the most missing values, but other fields also have issues…
2. The well id fields
The site and site2 fields differ in how they encode treatments. “Site” explicitly encodes treatments (e.g., DC, DX, etc.); “Site2” leaves these out. BUT, not consistent…
s1 %>%
filter(stat=="n_unique") %>%
select(-c(level,formatted)) %>%
datatable()
Questions/Issues:
1. Which concept of site (“Site” vs. “Site2”) should be used going forward? From disucssions on 2018-02-22, the “Site2” approach (once error-corrected) seems like the way?
2. How do the different Site ID approaches relate to well ID’s? are these unique to “Site”, “Site2”, or globally distinct?
## distinct "Site"
wt.nsf %>%
tabyl(Site) %>%
datatable()
## distinct "Site2"
wt.nsf %>%
tabyl(Site2) %>%
datatable()
## distinct "plot"
wt.nsf %>%
tabyl(Plot) %>%
select(-valid_percent) %>%
knitr::kable()
| Plot | n | percent |
|---|---|---|
| CC | 1983 | 0.1864773 |
| CX | 1679 | 0.1578898 |
| DC | 1711 | 0.1608990 |
| DX | 1918 | 0.1803649 |
| null | 54 | 0.0050781 |
| Null | 3259 | 0.3064698 |
| NA | 30 | 0.0028211 |
The following cross-tabulations are meant to provide some insights into data structure and diagnose potential problems.
wt.nsf %>%
crosstab(Site,Plot) %>%
datatable(caption = "Count of observations")
Questions/Issues:
1. Note presence of two flavors of null: “null” and “Null”
2. Also note presence of NA. How to bring these properly into allignment?
wt.nsf %>%
crosstab(Site2,Plot) %>%
datatable(caption = "Count of observations")
Questions/Issues:
1. Note miscoded Site2 entries. For example, there is “eb1” and “EB1”
2. “Site” looks like it is supposed to explicitly encode treatment and “Site2” is not; however, see the “Site2” entry for “EB2CC”, etc. Some errors appear to be present…
3. Similar issue as noted above regarding the null and NA values.
wt.nsf %>%
# names()
crosstab(Site2,Year) %>%
datatable(caption = "Count of observations")
Questions/Issues:
1. Were data collected in in 2007? Perhaps these are the “1900” data…
wt.nsf %>%
# names()
crosstab(Wat_ID2,Year) %>%
datatable(caption = "Count of observations")
Questions/Issues:
1. “1900” data…
2. The use of site and well IDs seem wonky in places. Can these be tied to field books? What should be considered “canonical”?
wt.nsf %>%
crosstab(Wat_ID2,Site2) %>%
datatable()
Questions/Issues:
1. Are these supposed to 2007 data?
2. Where would the original data be housed? Field books, data files,?…
wt.nsf %>%
# names()
crosstab(Wat_ID2,Site2) %>%
gather(key = Site2,value = cnt,-Wat_ID2) %>%
ggplot(aes(Site2,Wat_ID2)) +
geom_tile(aes(fill=cnt), color="grey") +
# scale_fill_gradient(low = "white", high = "darkred") +
scale_fill_gradient2(low = "white", mid = "blue", high = "red", midpoint = 45, space = "Lab", na.value = "grey50", guide = "colourbar") +
# scale_fill_viridis() +
theme_bw() +
theme(axis.text.x=element_text(angle=45, hjust=1))
Questions/Issues:
1. Note NA values. Figure out which sites and decide whether to keep or drop…
2. Check what’s up with “Eunlabeled”. Can this be recovered as valid data or does it need to be dropped?
3. What’s the significance of the hyphenated ID’s (e.g., O47-1 and O47-2)?
4. Does the pattern of “missingness” match expectations?
# s1
wt.nsf %>%
group_by(Wat_ID2, Year) %>%
summarise(cnt = n()) %>%
filter(Year>2000) %>%
ggplot(aes(Year, Wat_ID2)) +
geom_tile(aes(fill=cnt), color='white') +
scale_fill_viridis() +
theme_bw() +
labs(y="") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Questions/Issues:
1. Note the missing 2007 data.
2. How does the pattern of “missingness” match expectations?
# exp plots - box plot
# wt.nsf %>%
# group_by(Wat_ID2) %>%
# mutate(cnt.WatID = n()) %>%
# filter(Year>2000) %>%
# mutate(Year = as.character(Year)) %>%
# filter(str_detect(Wat_ID2, "E")) %>%
# # filter(cnt.WatID > 3 & rellevel < 999) %>%
# # ggplot(aes(Date,rellevel)) +
# ggplot(aes(Year,rellevel)) +
# geom_boxplot(aes(group=Year)) +
# facet_wrap(~Wat_ID2, ncol=5, scales = "free") +
# theme(axis.text.x=element_text(angle=45, hjust=1))
## exp wells - jitter
wt.nsf %>%
group_by(Wat_ID2) %>%
mutate(cnt.WatID = n()) %>%
filter(Year>2000) %>%
mutate(Year = as.character(Year)) %>%
filter(str_detect(Wat_ID2, "E")) %>%
# filter(cnt.WatID > 3 & rellevel < 999) %>%
# ggplot(aes(Date,rellevel)) +
ggplot(aes(Year,rellevel)) +
geom_jitter(width = 0.15, height = 0.15, aes(color = Year)) +
# geom_point(color="darkblue", alpha=0.1,position = position_jitterdodge(dodge.width = 0.9, jitter.width = 0.2)) +
# geom_boxplot(aes(group=Year)) +
theme_minimal() +
theme(legend.position = "none") +
facet_wrap(~Wat_ID2, ncol=4, scales = "free") +
theme(axis.text.x = element_text(face="bold", color="black",
size=7, angle=75, hjust = 1))
## Warning: Removed 14 rows containing missing values (geom_point).
Questions/Issues:
1. Seem to be a fairly large number of questionable values. Individual outliers need to be examined more closely (e.g., E199 in 2013, E2 in 2011, etc.). Are these data entry errors, artifacts from the rescaling process, ??
2. Issues with some specific wells/gauges more clearly seem to be related to the rescaling (e.g., E25)
3. What’s with the wells with more limited records (e.g., E300 has only two years of data. Why? Was this installed for a specific and limited purpose or is this an issue with well IDs?)
## obs wells - box plots
# wt.nsf %>%
# group_by(Wat_ID2) %>%
# mutate(cnt.WatID = n()) %>%
# filter(Year>2000) %>%
# mutate(Year = as.character(Year)) %>%
# filter(str_detect(Wat_ID2, "O")) %>%
# # filter(cnt.WatID > 3 & rellevel < 999) %>%
# # ggplot(aes(Date,rellevel)) +
# ggplot(aes(Year,rellevel)) +
# geom_boxplot(aes(group=Year)) +
# facet_wrap(~Wat_ID2, ncol=5, scales = "free") +
# theme(axis.text.x=element_text(angle=45, hjust=1))
## obs wells - jitter
wt.nsf %>%
group_by(Wat_ID2) %>%
mutate(cnt.WatID = n()) %>%
filter(Year>2000) %>%
mutate(Year = as.character(Year)) %>%
filter(str_detect(Wat_ID2, "O")) %>%
# filter(cnt.WatID > 3 & rellevel < 999) %>%
# ggplot(aes(Date,rellevel)) +
ggplot(aes(Year,rellevel)) +
geom_jitter(width = 0.15, height = 0.15, aes(color = Year)) +
# geom_point(color="darkblue", alpha=0.1,position = position_jitterdodge(dodge.width = 0.9, jitter.width = 0.2)) +
# geom_boxplot(aes(group=Year)) +
theme_minimal() +
theme(legend.position = "none") +
facet_wrap(~Wat_ID2, ncol=5, scales = "free") +
theme(axis.text.x = element_text(face="bold", color="black",
size=7, angle=75, hjust = 1))
## Warning: Removed 1 rows containing missing values (geom_point).
Questions/Issues:
1. Seem to be a fairly large number of questionable values. Individual outliers need to be examined more closely (e.g., O1 in 2013, O10 in 2014, O11 in 2014). Are these data entry errors, artifacts from the relativization process, ??
2. Issues with some specific wells/gauges more clearly seem to be related to the rescaling (e.g., O200, O21_2)
3. Why do some wells have such short records? I’m assuming some might be new (e.g., O122, which has observations only in 2014), but then there are the issues of wells/gauges that blink in and out. For example:ODP1 (2013 only), O42 (2008-2011), etc. Have well numbers been changed over the duration of the study?
Create a sf representation of the wells/staff gauges from coordinates and plot.
## find missing
wt.nsf %>%
filter(is.na(X) | is.na(Y) | X == "999" | Y == "999") %>%
datatable(caption = "Records with missing/bogus x or y coordinates.")
# create sf bject from coordinates. This time, filtering out records with missing x,y
wt.nsf.sf <- wt.nsf %>%
filter(!is.na(X)) %>%
filter(!is.na(Y)) %>%
filter(X != "999") %>%
st_as_sf(.,
coords = c("X","Y"),
crs = "+proj=utm +zone=12 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
Note: Records with missing/bogus coordinates removed.
## opentopo
# mapview(wt.nsf.sf,map.types = "OpenTopoMap", zcol = "Wat_ID2", alpha = 0.12, legend = FALSE)
## opentopo + imagery options
mapview(wt.nsf.sf, map.types = c("OpenTopoMap","Esri.WorldImagery"), zcol = "Wat_ID2", alpha = 0.12, legend = FALSE)