Welcome!
Welcome to the Ellerbe Creek Cleanup Tutorial! To download this file and follow along, head over to our GitHub page.
SECTION 01: WORKFLOW
Set up your environment
Some basics: To run a block of code, click the green arrow in the top right of the gray box.
Before we start, you need to set up your “environment.” This involves
installing and loading any libraries that you need to run the code for
this module. Libraries basically store a bunch of really useful
functions that other people have developed to make your life easier. You
can both download and load packages with the p_load
function!
# Load libraries (which have all the important functions).
# The next line only needs to be run once to install the package that
# has the p_load function! Run it by removing the "#" before clicking the play button.
#install.packages("pacman")
pacman::p_load(tidyverse, lubridate, dataRetrieval)
# Ensure the computer knows where to look for your files
setwd(dirname(rstudioapi::documentPath()))
Load your data
Now that you’ve loaded all the libraries you need, the next step is to load the data for this project.
Notice that we are defining various variables. For example, a <- 3 tells R that we want the variable a to equal the number 3 (we call this “assigning a variable”). You can do this with numbers, text, datasets, you name it!
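Here is a minimal sketch of a few assignments (the names below are made up purely for illustration):
# Assign a number, a piece of text, and a tiny dataset to variables
a <- 3                         # a number
creek_name <- "Ellerbe Creek"  # text (a character string)
tiny_table <- data.frame(site = c("EL1.9EC", "EL5.0EC"), value = c(94, 88))
a          # typing a variable's name prints its value
creek_name
tiny_table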
# This tells R to look for the folders and gives them names
data.dir <- "../01_data" # This folder has all the data
meta.dir <- "../00_meta" # This folder has all the meta data
file.name <- file.path(data.dir, "durham_data_tutorial_version.csv") # Builds the full path to the file
# Read dataset
durham_data <- read.csv(file.name)
If you are new to R, maybe you have never seen the following before: %>%. This is called piping and is a powerful way to simplify your code and make it easier to read. Here’s a wonderful tutorial.
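As a quick illustration (not part of the cleanup itself), the pipe takes whatever is on its left and hands it to the function on its right, so these two lines do the same thing:
nrow(durham_data)       # "classic" way: wrap the data in the function
durham_data %>% nrow()  # "piped" way: send the data into the function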
SECTION 02: Why R?
Reproducibility
By working in R, we have a record of every change or calculation we did on a dataset. This is very important so your work can be repeated by others. Also so you can repeat it if you find you accidentally made a mistake!
Open Source = Accessible
R is open source. That means it is free and likely always will be! Anyone can use your code and run it without paying for software. This makes your work accessible to many.
SECTION 03: Cleaning Up
Open durham_data and take a look
Now, back to the dataset at hand. Did you know that you can view data
in R just like you do in Excel? Using the View() command,
you can view the dataset like a spreadsheet. The head()
function will show you just the top of the table (so it isn’t too
overwhelming). Try it out:
head(durham_data)
| Disclaimer..Although.Stormwater.Services.makes.every.effort.to.ensure.data.quality.in.the.monitoring.program | errors.do.arise.and.may.not.be.identified.at.the.time.the.data.was.uploaded.to.this.database…The.City.of.Durham.does.not.assume.any.liability.associated.with.the.use.of.these.data.and.is.not.obligated.to.notify.parties.of.modifications.or.changes.to.the.database. | X | X.1 | X.2 | X.3 | X.4 | X.5 | X.6 | X.7 | X.8 | X.9 | X.10 | X.11 | X.12 | X.13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Query Results: | |||||||||||||||
| Name:EL10.7EC-EL1.9EC-EL5.0EC-EL5.5GC-EL5.6EC-EL7.1EC-EL7.1SEC-EL7.6SECT-EL7.9EC-EL8.1GC-EL8.2EC-EL8.5SEC-EL8.6SECUT-EL9.9EC-EL-Englewood-EL-Knox | |||||||||||||||
| Medium:Water | |||||||||||||||
| Parameters: all | |||||||||||||||
| Project: Ambient |
Wait a minute, something is not right. What’s going on? Welcome to the fun of coding!
Sometimes, metadata is stored inside a .csv file where it doesn’t belong. A CSV, or comma-separated values file, should contain only data. If you open this spreadsheet in Excel, you will see a block of text sitting above the original data that gives valuable metadata about the collection. We want to preserve this information, since it is important, but in a more appropriate way. An easy way to do this is to read the file in two passes: first, grab the metadata; then, skip past it and read the column headers and the actual data:
# load the metadata (extra data at the top of the file) separately using "readLines"
meta <- readLines( file.name, n=10) # the metadata stops on row 10
meta <- paste( gsub(',', '', meta), collapse="\n") #remove errant commas
print(meta)
## [1] "Disclaimer: Although Stormwater Services makes every effort to ensure data quality in the monitoring program errors do arise and may not be identified at the time the data was uploaded to this database. The City of Durham does not assume any liability associated with the use of these data and is not obligated to notify parties of modifications or changes to the database.\n\nQuery Results: \nName:EL10.7EC-EL1.9EC-EL5.0EC-EL5.5GC-EL5.6EC-EL7.1EC-EL7.1SEC-EL7.6SECT-EL7.9EC-EL8.1GC-EL8.2EC-EL8.5SEC-EL8.6SECUT-EL9.9EC-EL-Englewood-EL-Knox \nMedium:Water\nParameters: all \nProject: Ambient\nTime Period:2016-05-02 ~ 2022-05-02\nData was downloaded on May 23 2022 11:41 am\nThe data contains the QA code. "
# now load the durham_data
durham_data <- read.csv(file.name, skip=12, row.names=NULL) # skip the metadata rows; the data table starts right after them
head(durham_data)
| Station.Name | Filtered | Parameter | Date.Time | Value | QA.Code | Unit | QAQC.Sample | Comments | Rain.in.Last.24.Hours | Sky.Condition | Flow.Severity | Comp..Code | Thalweg | Lab | Project |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EL8.5SEC | Alkalinity | 5/11/16 8:40 | 94 | mgCaCO3/L | Staff Gage: 0.68 | No | Partly Cloudy | 2 | G | Yes | SDWRF | Ambient | |||
| EL8.5SEC | Alkalinity | 6/15/16 8:35 | 88 | mgCaCO3/L | Staff Gage: 0.64; Barely discernable discharge | No | Partly Cloudy | 1 | G | Yes | SDWRF | Ambient | |||
| EL8.5SEC | Alkalinity | 7/13/16 8:35 | 60 | mgCaCO3/L | Staff Gage: 0.73 | Yes | Overcast | 2 | G | Yes | SDWRF | Ambient | |||
| EL8.5SEC | Alkalinity | 8/10/16 8:30 | 90 | mgCaCO3/L | Staff Gage: 0.56 | Yes | Partly Cloudy | 3 | G | Yes | SDWRF | Ambient | |||
| EL8.5SEC | Alkalinity | 9/14/16 8:40 | 110 | mgCaCO3/L | Duplicate Site; Staff Gage: 0.66 Duplicate | No | Partly Cloudy | 2 | G | Yes | SDWRF | Ambient | |||
| EL8.5SEC | Alkalinity | 9/14/16 8:40 | 110 | mgCaCO3/L | Duplicate Site; Staff Gage: 0.66 Duplicate | No | Partly Cloudy | 2 | G | Yes | SDWRF | Ambient |
Now we have the data we need saved in ‘durham_data’, and the metadata stored in ‘meta’. Let’s save the metadata into a new file (NOT overwriting the raw data).
meta.file <- file.path(meta.dir, 'metadata.txt')
write.table(meta, file=meta.file, row.names=FALSE, col.names=FALSE)
For data cleaning, we’re going to cover three main issues: (1) Inconsistent Data Entry, (2) Duplicate Rows, and (3) Missing Data.
INCONSISTENT DATA ENTRY
Now we are ready to look at the real data! It can be helpful at first to look at the “unique” data values for each column, to get a sense of what you’re working with. Let’s start with “Sky.Condition”:
sky_condition<-durham_data %>%
# group the data so all like things in "Sky.Condition" are put together
group_by(Sky.Condition) %>%
  # count the number of occurrences for a given type of "Sky Condition"
count() %>%
# arrange the rows by highest to lowest value of "n"
arrange(desc(n))
print(sky_condition) # shows the result!
## # A tibble: 9 × 2
## # Groups: Sky.Condition [9]
## Sky.Condition n
## <chr> <int>
## 1 "Sunny" 3699
## 2 "Overcast" 2916
## 3 "Partly Cloudy" 2065
## 4 "" 83
## 5 "overcast" 2
## 6 "outcast" 1
## 7 "party cloudy" 1
## 8 "sunny" 1
## 9 "suny" 1
Ah, it looks like somebody had some inconsistencies with spellings. Let’s deal with some of those by making everything lowercase, then substituting some spelling fixes:
durham_data <- durham_data %>%
# First, make everything in the column lowercase with 'tolower()'
mutate(Sky.Condition = tolower( Sky.Condition )) %>%
# Next, fix the spelling of "sunny". We use 'gsub(pattern, replacement, searchlist)'
# to replace words or phrases. Modify this code to fix the other misspelling!
mutate(Sky.Condition = gsub("suny", "sunny", Sky.Condition)) %>%
mutate(Sky.Condition = gsub("pattern", "replacement", Sky.Condition))
sort( unique( durham_data$Sky.Condition ) )
## [1] ""              "outcast"       "overcast"      "partly cloudy"
## [5] "party cloudy" "sunny"
DUPLICATE DATA
As you may have noticed from your initial exploration of the data,
some rows are duplicated. Dealing with these will be a little tricky, so
let’s break it into steps. First, we pull out all the duplicates and
save them in a variable called dupes.
# find rows with "duplicate" in the comments, then create a new data frame just with those.
inx.dup <- which( grepl( 'duplicate', tolower(durham_data$Comments) ) )
dupes <- durham_data[inx.dup,]
head( dupes )
| | Station.Name | Filtered | Parameter | Date.Time | Value | QA.Code | Unit | QAQC.Sample | Comments | Rain.in.Last.24.Hours | Sky.Condition | Flow.Severity | Comp..Code | Thalweg | Lab | Project |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | EL8.5SEC | Alkalinity | 9/14/16 8:40 | 110.0 | mgCaCO3/L | Duplicate Site; Staff Gage: 0.66 Duplicate | No | partly cloudy | 2 | G | Yes | SDWRF | Ambient | |||
| 6 | EL8.5SEC | Alkalinity | 9/14/16 8:40 | 110.0 | mgCaCO3/L | Duplicate Site; Staff Gage: 0.66 Duplicate | No | partly cloudy | 2 | G | Yes | SDWRF | Ambient | |||
| 14 | EL8.5SEC | Field | Aluminum | 9/14/16 8:40 | 6.0 | J | ug/L | Duplicate Site; Staff Gage: 0.66 Duplicate | No | partly cloudy | 2 | G | Yes | SDWRF | Ambient | |
| 15 | EL8.5SEC | Field | Aluminum | 9/14/16 8:40 | 5.6 | J | ug/L | Duplicate Site; Staff Gage: 0.66 Duplicate | No | partly cloudy | 2 | G | Yes | SDWRF | Ambient | |
| 23 | EL8.5SEC | Aluminum | 9/14/16 8:40 | 288.0 | ug/L | Duplicate Site; Staff Gage: 0.66 Duplicate | No | partly cloudy | 2 | G | Yes | SDWRF | Ambient | |||
| 24 | EL8.5SEC | Aluminum | 9/14/16 8:40 | 198.0 | ug/L | Duplicate Site; Staff Gage: 0.66 Duplicate | No | partly cloudy | 2 | G | Yes | SDWRF | Ambient |
Normally, you could just get rid of duplicates using the function distinct(). However, distinct() simply keeps the first of the duplicated rows and drops the rest. Take a look at some of these values in dupes – they’re not the same! When looking at your own data, you need to make decisions on how to deal with duplicate data. For now, we’ve decided to take the average of the two values. If you want to use just the first value, distinct() is fine; otherwise, here’s how to replace duplicate values with the averages:
# Create an ID row to sort on station name, filtered, parameter, and date
dupes <- dupes %>%
mutate(ID = paste0(Station.Name, Filtered, Parameter, Date.Time)) %>%
arrange(ID)
# grab the mean values for each combination
means <- dupes %>%
group_by(ID) %>% #group dataframe by ID
summarize(MeanValue = mean(Value, na.rm=TRUE)) #take the mean and ignore NAs
# collapse the duplicate data frame based on ID and sort it the same way
dupes <- dupes %>%
distinct(Station.Name, Filtered, Parameter, Date.Time, .keep_all=TRUE)
# make sure both datasets are the same length and in the same order
nrow(dupes) == nrow(means) && all(dupes$ID == means$ID)
## [1] TRUE
# Now that we're sure, replace Value column of distinct dupes with mean values
# and remove the ID column
dupes$Value <- means$MeanValue
dupes <- dupes %>% dplyr::select(-ID)
# Put it all together! First, remove duplicated rows from the main data frame
durham_data <- durham_data[-inx.dup,]
# Add the averaged, previously-duplicated rows to the end of the data frame
durham_data <- bind_rows(durham_data, dupes) %>%
arrange(Station.Name, Filtered, Parameter, Date.Time) #sort data again
head(durham_data)
| Station.Name | Filtered | Parameter | Date.Time | Value | QA.Code | Unit | QAQC.Sample | Comments | Rain.in.Last.24.Hours | Sky.Condition | Flow.Severity | Comp..Code | Thalweg | Lab | Project |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EL1.9EC | Ammonia Nitrogen | 1/19/17 10:30 | 0.08 | mg/L | Yes | sunny | 2 | G | Yes | SDWRF | Ambient | ||||
| EL1.9EC | Ammonia Nitrogen | 1/2/18 9:45 | 0.04 | U | mg/L | No | sunny | 2 | G | Yes | SDWRF | Ambient | |||
| EL1.9EC | Ammonia Nitrogen | 1/4/22 9:00 | 0.23 | mg/L | Winter storm over the weekend | No | sunny | 4 | G | Yes | SDWRF | Ambient | |||
| EL1.9EC | Ammonia Nitrogen | 1/5/21 9:20 | 1.50 | J7 | mg/L | No | overcast | 3 | G | No | SDWRF | Ambient | |||
| EL1.9EC | Ammonia Nitrogen | 1/7/20 9:15 | 0.05 | mg/L | No | overcast | 3 | G | Yes | SDWRF | Ambient | ||||
| EL1.9EC | Ammonia Nitrogen | 1/8/19 9:50 | 0.11 | mg/L | No | partly cloudy | 3 | G | Yes | SDWRF | Ambient |
MISSING DATA
Sometimes, data is just plain missing. Missing values aren’t always an issue, but it’s good to know where they are. You can check using summary() or is.na(); be mindful of values stored as NA (missing) if you’re counting rows or doing other calculations that require removing them. The function which() can tell you which rows have missing data, and removing them is a cinch. Just be sure to save the result as something other than your original data frame, since once you remove rows you’ll have to run your code again to get them back!
summary( durham_data$Value )
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 2.5 7.6 1587.1 78.0 76800.0 3
any( is.na(durham_data$Value) )
## [1] TRUE
which( is.na(durham_data$Value) )
## [1] 2743 4009 6771
inx.rm <- which( is.na(durham_data$Value) )
durham_data_no_na <- durham_data[-inx.rm,]
DATA FORMAT
Now it’s time to get the data ready for visualizations.
# Let's select just a few columns of data using the "select" function.
D1 <- durham_data %>%
dplyr::select(Station.Name, Date.Time, Parameter, Value)
# Now let's filter for distinct rows. We can do this first by grouping by Name and Date.
# Put simply, R looks at rows that share the same Station.Name and Date.Time, then keeps only the first row for each Parameter (removing the duplicates).
D2<-D1 %>%
group_by(Station.Name, Date.Time) %>%
distinct(Station.Name, Date.Time, Parameter, .keep_all=TRUE)
# Next we are going to make the dataset "wider", giving each Parameter its own column.
D3<-D2%>%
pivot_wider(names_from=Parameter, values_from=Value)
# And we will finish by cleaning it up
durham_data_wide<-D3 %>%
mutate(Date = as.Date(Date.Time, format="%m/%d/%y"),
Year = year(Date),
Date = as.POSIXct.Date(Date)) %>%
relocate(Date, Year, .after=Date.Time) %>%
ungroup()
# Remove D1, D2, and D3 since durham_data_wide is all we really want
rm(D1, D2, D3)
SECTION 04: Visualization and Analysis
Now let’s visualize some of this data! Through visualizations, we can take complex or large datasets and simplify them to look for trends. That in turn can help guide our analysis or direct us towards interesting and new questions.
Time Series Plots and Analysis
Starting with the basics, we can think about this dataset as a time series and plot a given analyte against time.
What analytes exist? Well, they are the names of the columns. Let’s look at them:
colnames(durham_data_wide)
## [1] "Station.Name" "Date.Time"
## [3] "Date" "Year"
## [5] "Ammonia Nitrogen" "Biochemical Oxygen Demand"
## [7] "Calcium" "Conductivity"
## [9] "Copper" "Dissolved Oxygen"
## [11] "Dissolved Oxygen Saturation" "Fecal Coliform"
## [13] "Hardness" "Magnesium"
## [15] "Nitrate + Nitrite as N" "Organic Carbon"
## [17] "pH" "Temperature"
## [19] "Total Kjeldahl Nitrogen" "Total Phosphorus"
## [21] "Total Suspended Solids" "Turbidity"
## [23] "Zinc" "Gage height"
## [25] "Alkalinity" "Aluminum"
## [27] "Cadmium" "Chloride"
## [29] "Chromium" "Fluoride"
## [31] "Iron" "Lead"
## [33] "Manganese" "Nickel"
## [35] "Potassium" "Sodium"
## [37] "Sulfate"
Let’s see how the concentration of calcium changes over time. We can color the points based on the site where the sample was taken.
ggplot() +
geom_point(data=durham_data_wide, mapping=aes(x=Date, y=Calcium, color=Station.Name)) +
  theme_classic()
## Warning: Removed 23 rows containing missing values (geom_point).
Using similar code to above, choose two different variables and see how they compare to one another!
Hint: You can simply change the x-axis from “Date” to your variable of choosing.
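If you get stuck, here is one possible sketch to adapt; it compares Calcium against Conductivity, both columns from the list above:
ggplot() +
  geom_point(data=durham_data_wide, mapping=aes(x=Conductivity, y=Calcium, color=Station.Name)) +
  theme_classic()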
# INSERT YOUR CODE HERE
Cool, we can see how concentration varies over time, but it’s hard to tell how variation throughout the year differs between sites. So let’s look at a boxplot of this data!
ggplot(
data=durham_data_wide, aes(x=Station.Name, y=Calcium)) +
  geom_boxplot()
## Warning: Removed 23 rows containing non-finite values (stat_boxplot).
Let’s get fancy and separate this out by year. Maybe there’s a difference between years?
ggplot(
data=durham_data_wide, aes(x=Station.Name, y=Calcium)) +
geom_boxplot() +
  facet_wrap(~Year)
## Warning: Removed 23 rows containing non-finite values (stat_boxplot).
Ah shoot, the graph is a little messy. Let’s clean it up:
ggplot(
data=durham_data_wide, aes(x=Station.Name, y=Calcium)) +
geom_boxplot() +
facet_wrap(~Year) +
theme_classic()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
## Warning: Removed 23 rows containing non-finite values (stat_boxplot).
Try doing the same visualizations now, but with a different variable. Make sure you use the exact name of the variable as it is found in the dataset!
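As one possible sketch, here is the same plot using Turbidity, another column from the list above (a name with spaces, like Dissolved Oxygen, would need backticks inside aes()):
ggplot(
  data=durham_data_wide, aes(x=Station.Name, y=Turbidity)) +
  geom_boxplot() +
  facet_wrap(~Year) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))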
# INSERT YOUR CODE HERE!
SECTION 05: Exploring a Federal Dataset
It just so happens that the City of Durham isn’t the only one sampling Ellerbe Creek! We also have the US Geological Survey (USGS), which has a bunch of sensors deployed in streams and rivers nationally, and two in Ellerbe Creek.
Let’s explore what they have, using USGS’ R package (dataRetrieval) to download and view the data.
First, let’s introduce you to the wonderful world of CRAN. Most packages come with helpful documentation: a list of every function in the package, plus a help page for each (usually with examples). The dataRetrieval documentation can be found here.
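You can also pull up documentation without leaving R; for example:
?readNWISdv                       # open the help page for a single function
help(package = "dataRetrieval")   # list everything the package exports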
Site Search in NC
Let’s start by looking for all sites in NC that have measures for mean daily Discharge, cubic feet per second (code = 00060).
siteListDischarge <- whatNWISsites(stateCd="NC",parameterCd="00060") %>%
filter(str_detect(station_nm, "ELLERBE")) # Filter for sites that are for Ellerbe Creek
head(siteListDischarge)
| agency_cd | site_no | station_nm | site_tp_cd | dec_lat_va | dec_long_va | colocated | queryTime |
|---|---|---|---|---|---|---|---|
| USGS | 02128576 | LITTLE MOUNTAIN CREEK NEAR ELLERBE, NC | ST | 35.08904 | -79.83311 | FALSE | 2022-10-27 09:47:32 |
| USGS | 02128500 | BIG MOUNTAIN CREEK NEAR ELLERBE, NC | ST | 35.12015 | -79.81616 | FALSE | 2022-10-27 09:47:32 |
| USGS | 0208675010 | ELLERBE CREEK AT CLUB BOULEVARD AT DURHAM, NC | ST | 36.01939 | -78.89478 | FALSE | 2022-10-27 09:47:32 |
| USGS | 02086849 | ELLERBE CREEK NEAR GORMAN, NC | ST | 36.05931 | -78.83251 | FALSE | 2022-10-27 09:47:32 |
Download and view data from Ellerbe Creek, in Durham, NC
Sweet! Okay, now let’s look at two sites on Ellerbe Creek. These are sites with stream gauges, and thus lots of data: one near Club Boulevard (Durham, NC) and one near Gorman, NC.
Let’s start by exploring what data is available.
Club_Code<-"0208675010"
Gorman_Code<-"02086849"
siteNos<-c(Club_Code, Gorman_Code)
availableData <- whatNWISdata(siteNumber = siteNos)
View other parameters
Turns out that it isn’t just flow data that is available through dataRetrieval. Let’s see the full list of possible parameters (this will take a minute to load).
# This will take a minute to load
pcode <- readNWISpCode("all")
head(pcode)
| parameter_cd | parameter_group_nm | parameter_nm | casrn | srsname | parameter_units |
|---|---|---|---|---|---|
| 00001 | Information | Location in cross section, distance from right bank looking upstream, feet | ft | ||
| 00002 | Information | Location in cross section, distance from right bank looking upstream, percent | % | ||
| 00003 | Information | Sampling depth, feet | ft | ||
| 00004 | Physical | Stream width, feet | Instream features, est. stream width | ft | |
| 00005 | Information | Location in cross section, fraction of total depth, percent | % | ||
| 00008 | Information | Sample accounting number | nu |
# See how many parameters there are
pcode %>% count()
| n |
|---|
| 24738 |
Let’s figure out what each of those parameters are for our sites!
parameters_at_sites<-availableData %>%
select(site_no, station_nm, parm_cd, begin_date, end_date, count_nu) %>%
na.omit() %>%
rename(parameter_cd = parm_cd) %>%
  left_join(pcode)
## Joining, by = "parameter_cd"
parameters_at_sites %>%
arrange(desc(count_nu)) %>%
mutate(count_rank=row_number()) %>%
ggplot(aes(x= count_rank, y=count_nu, shape=station_nm, color=station_nm)) +
geom_point()+
facet_wrap(~station_nm) +
scale_y_log10() +
theme_classic() +
  theme(legend.position = "bottom")
Turns out the vast majority of the parameters were only measured on a few sampling days. We want to look at temporal trends, so let’s see what is regularly monitored here.
parameters_at_sites %>%
filter(count_nu >100) %>%
group_by(station_nm, parameter_cd) %>%
  count()
| station_nm | parameter_cd | n |
|---|---|---|
| ELLERBE CREEK AT CLUB BOULEVARD AT DURHAM, NC | 00060 | 4 |
| ELLERBE CREEK AT CLUB BOULEVARD AT DURHAM, NC | 00065 | 4 |
| ELLERBE CREEK NEAR GORMAN, NC | 00010 | 1 |
| ELLERBE CREEK NEAR GORMAN, NC | 00060 | 2 |
| ELLERBE CREEK NEAR GORMAN, NC | 00061 | 1 |
| ELLERBE CREEK NEAR GORMAN, NC | 00065 | 4 |
| ELLERBE CREEK NEAR GORMAN, NC | 00095 | 2 |
| ELLERBE CREEK NEAR GORMAN, NC | 00191 | 1 |
| ELLERBE CREEK NEAR GORMAN, NC | 00400 | 1 |
| ELLERBE CREEK NEAR GORMAN, NC | 30209 | 1 |
| ELLERBE CREEK NEAR GORMAN, NC | 63160 | 1 |
Looks like it is parameter code 00060. What is that? Ah ha! It’s discharge. Makes sense, since it is a gage :D
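If you want to check a code like this yourself, one way (a small sketch) is to look it up in the parameter table we downloaded earlier, or to ask NWIS for just that code:
pcode %>% filter(parameter_cd == "00060")  # look it up in the table from readNWISpCode("all")
readNWISpCode("00060")                     # or query just this one code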
Explore Discharge data
The first site (Club) is upstream of the second. Based on this, which would you expect to have a higher flow? Let’s verify by loading in and viewing some data.
pCode <- "00060"
Discharge_Data <- readNWISdv(siteNos, pCode, "2005-01-01","2021-12-31")
View data
ggplot() +
geom_point(data=Discharge_Data, mapping=aes(x=Date, y=X_00060_00003, color=site_no)) +
theme_classic() +
labs(y="Average Daily Discharge (ft^3/sec)") +
facet_wrap(~site_no)+
  theme(legend.position = "bottom")
Yikes, that is messy. Maybe we should zoom in on one year, replace the points with lines, and overlay the two sites.
# Filter data
Discharge_Data_2020<-Discharge_Data %>%
filter(Date > "2020-01-01" & Date < "2020-12-31")
# Plot data
ggplot() +
geom_line(data=Discharge_Data_2020, mapping=aes(x=Date, y=X_00060_00003, color=site_no)) +
theme_classic() +
  labs(y="Average Daily Discharge (ft^3/sec)")
You try now!
Now you give it a try. Maybe select a different site, or a different measure.
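For instance, one possible variation (just a sketch) pulls daily discharge for another site that turned up in our earlier search; whether daily values exist for the whole window isn’t guaranteed, so adjust the dates if the result comes back empty:
# Big Mountain Creek near Ellerbe, NC, from the siteListDischarge search above
Other_Site <- "02128500"
Other_Discharge <- readNWISdv(Other_Site, "00060", "2005-01-01", "2021-12-31")
ggplot() +
  geom_line(data=Other_Discharge, mapping=aes(x=Date, y=X_00060_00003)) +
  theme_classic() +
  labs(y="Average Daily Discharge (ft^3/sec)")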
If you want to get even deeper, check out this tutorial from USGS that outlines even more things that you can do with this package.
SECTION 06: Conclusion
This is only the start of all the wonderful things you can do in R! Please reach out to Jonny if you’d like more tutorials to dive deeper into the R world of fun.