Transformation One

New Hampshire Department of Environmental Services Beach Data:

Weirs Endicott Beach - Laconia, NH

For this tidying exercise I chose to extract an HTML table from an older State of New Hampshire Department of Environmental Services web page that posts daily laboratory culture results for fresh- and saltwater beaches. The goal is to gather, structure, sort, and tidy the data, extract only the reports for August 2017, calculate the proportion of test results for the month that exceed the state's advisory level of 158 colony-forming units, and make a simple bar graph of the resulting proportions.

Because this assignment was not about web scraping, I used the URL for a single beach, Endicott at the Weirs in Laconia, NH, within the desired time period, making sure that the page also included some samples from July so there would be something to filter out.

The first step was to use the XML package to read the existing table into R; the garbled text below is the result of this acquisition.

url <- "http://www4.des.state.nh.us/DESOnestop/BCHDetail.aspx?ID=162"
web_table <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)
head(web_table, 2)


Tidy Table Data

The developers of this site used HTML tables to structure the layout of the page, with cells serving as containers for content in lieu of a systematic use of CSS styles. As a result, the first 68 rows contain a variety of text chunks that are not meaningful for this particular task.

Thus the next step was to extract only the portion of the table containing actual tabular data. Simple subsetting, checked for consistency against the website, was adequate for the task. From here on, the tidying and transformation are accomplished with a combination of tidyr, dplyr, lubridate, and traditional aggregation methods.
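The boundary rows were found by inspection; a quick check along these lines (using the same row and column indices as the subsetting step below) confirms where the real records begin:

dim(web_table)          # overall size of the scraped table
web_table[66:70, 2:8]   # rows straddling the boundary at row 69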

Because the location of the beach (LAKE WINNIPESAUKEE ENDICOTT PK) and the sampling location on the beach (LEFT, RIGHT, and CENTER, although for this month all samples were taken at the center) are combined in a single field, I used separate to create an individual variable for each. In future aggregations other beaches and sampling locations may be appended to this set, so having them separated makes sorting in later analyses easier. The names Beach and Sample_Site were applied at this point.

Because '/' is not an allowable symbol in a variable name, mutate was used to replace the '/' with a '_' in the V8 column before spreading columns V7 and V8, so that the values in V8 become column names and the values in V7 become the cell values.

Although there is only one test type in this case, multiple types of tests can be performed on a sample, so in future cases this step might create multiple columns; a toy illustration of the pivot is sketched below.
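As a minimal sketch of that key/value pivot (the dates and counts below are invented, and the label "MPN/100ML" is only assumed to resemble what the source table holds), gsub cleans the name and spread turns it into a column:

library(tidyr)
toy <- data.frame(Date = c("08/01/2017", "08/02/2017"),
                  V7   = c("10", "240"),
                  V8   = c("MPN/100ML", "MPN/100ML"),   # hypothetical test label
                  stringsAsFactors = FALSE)
toy$V8 <- gsub("/", "_", toy$V8)   # "/" is not a legal name character
spread(toy, V8, V7)                # V8 supplies the column name, V7 the values
##         Date MPN_100ML
## 1 08/01/2017        10
## 2 08/02/2017       240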

The data was then pared down to the core variables needed for the analyses going forward using select, applying useful names to the previously generic fields that had not already been renamed.

The next two steps were both mutate transformations: bringing Date into POSIXct format, converting MPN_100ML to numeric for later summarization, and converting both Beach and Sample_Site to factors so that grouping is more rational once other tables are ingested. The last step was to filter for August using lubridate and dplyr so the proportions of acceptable and advisory days can be summarized.

The data table below reflects the changes made in the pipe.

library(dplyr)
library(tidyr)
library(lubridate)
data_tab <- web_table[69:88, 2:8]   # rows and columns holding the sample records
tail(data_tab)
data_tab <- data_tab %>%
    separate(V3, c("Beach", "Sample_Site"), sep = "\\-", remove = TRUE) %>%  # split beach from sampling site
    mutate(V8 = gsub("/", "_", V8)) %>%   # "/" is not legal in a column name
    spread(V8, V7) %>%                    # test names become columns, results become values
    select(Beach, Sample_Site, Date = V5, MPN_100ML) %>%
    mutate(Date = as.POSIXct(Date, format = "%m/%d/%Y")) %>%
    mutate(MPN_100ML = as.numeric(MPN_100ML),
           Beach = factor(Beach),
           Sample_Site = factor(Sample_Site)) %>%
    filter(month(Date) == 8)              # keep August samples only
write.csv(data_tab, "data_tab.csv")
data_tab


Calculating and Graphing Exceedance & Non-Exceedance Probabilities

To get the proportion of exceedances and non-exceedances, I used dplyr and pipes again to apply a Boolean test to the MPN_100ML variable and create a new column of 0 and 1 values, converting it to a factor for table and prop.table to act upon.

To continue in the spirit of dplyr, I created a tibble from the proportion table of data_tab$Exceedance and renamed the variables to E.coli_Level and Proportion using pipes and rename.

From there a bar plot was made of the Proportion column, and a table of the associated values is shown below using DT, which I feel is more attractive and adaptable for the final data than pretty tables in base R Markdown.

# Flag samples above the 158 CFU advisory level (1 = exceedance)
data_tab <- data_tab %>%
    mutate(Exceedance = factor(ifelse(MPN_100ML > 158, 1, 0)))

# Convert the counts to proportions and give the levels readable labels
prop_exceed <- as_tibble(round(prop.table(table(data_tab$Exceedance)), 3)) %>%
    mutate(Var1 = factor(Var1, labels = c("Non-Exceedance", "Exceedance"))) %>%
    rename(E.coli_Level = Var1, Proportion = n)

barplot(prop_exceed$Proportion,
        main = "Proportion of Test Samples for August 2017",
        xlab = "Results", ylab = "Proportions",
        col = c("maroon", "orchid"), ylim = c(0, 1))
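The associated table of values was rendered with DT; a call along these lines (mirroring the datatable() calls used in Transformation Three) produces it:

DT::datatable(prop_exceed, rownames = FALSE)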

Transformation Two

The goal of this tidying project is to download USGS water quality data from an API, clean and structure it, and then make some time series graphs from the table to see how different physical attributes of the water change together. This is a preliminary step toward possible predictive models using this data in conjunction with the New Hampshire Department of Environmental Services data from the prior transformation.

For the second transformation I decided to pull data from a USGS water quality API, which you can read about here: USGS Water REST Service. The first step is pulling in the URL for the water quality metadata, reading it with read.delim, and removing the row of fixed-width field specifications that we do not need for this analysis. Because this is simply metadata we might want when interpreting the data, I convert it to a single column, with the column names as row names, and print it as is.

meta_url <- "https://waterservices.usgs.gov/nwis/site/?format=rdb&sites=01080000&siteStatus=all"

usgs_data <- read.delim(meta_url, header = TRUE, sep = "\t", comment.char = "#", 
    stringsAsFactors = FALSE)
usgs_meta <- data.frame(usgs_data[2, ])
usgs_meta <- t(usgs_meta)
nam <- dimnames(usgs_meta)
usgs_meta <- data.frame(usgs_meta[1:12])
rownames(usgs_meta) <- unlist(nam[1])
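The print mentioned above is then just a matter of echoing the object (DT or kable would work equally well for a prettier rendering):

usgs_meta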

Tidying the Data

To make the data useful for the time series plots, the USGS water quality data API was accessed using the water services URL with the site code and the sensor parameters. A data frame was created using read.delim.

This frame was piped into the usgs_final data frame, removing the first row (again to drop the fixed-width spacing designations). select is used to extract the data I intend to use in the graphs and simultaneously rename the columns to more human-friendly variable names.

After this initial streamlining, each variable is converted to a useful type (factor, date, or numeric) to support sorting and aggregation down the line. Dates and times were structured as POSIXct to allow easy time series graphing, and the tibble was filtered for complete cases to remove the many observations with NA values recorded while the sensors were not functioning properly.

The head of this data can be seen in the table below. It was a conscious decision to leave this data set wide instead of pulling it into a long format, because the output will be ggplot2 graphs, which do not play nicely with pipes and would therefore make the filter, select, and group_by tidy functions awkward to implement; a sketch of the long-format alternative I decided against appears after the code below.

usgs_url <- "https://waterservices.usgs.gov/nwis/iv/?format=rdb&sites=01646500&startDT=2017-06-15&endDT=2017-09-30&parameterCd=00065,00010,00095,00300,00400&siteStatus=all"

usgs_raw <- read.delim(usgs_url, header = TRUE, sep = "\t", comment.char = "#", 
    stringsAsFactors = FALSE)


usgs_final <- usgs_raw[-1, ] %>% select(c(Agency = agency_cd, Date_Time = datetime, 
    Time_Zone = tz_cd, Water_Temp_C = X69942_00010, Lake_Stage = X69929_00065, 
    Specific_Conductance = X69943_00095, Dissolved_Oxygen = X69937_00300, Ph = X69941_00400)) %>% 
    mutate(Agency = factor(Agency), Time_Zone = factor(Time_Zone), Water_Temp_C = as.numeric(Water_Temp_C), 
        Lake_Stage = as.numeric(Lake_Stage), Specific_Conductance = as.numeric(Specific_Conductance), 
        Dissolved_Oxygen = as.numeric(Dissolved_Oxygen), Ph = as.numeric(Ph)) %>% 
    mutate(Date_Time = as.POSIXct(Date_Time, format = "%Y-%m-%d %H:%M")) %>% 
    filter(complete.cases(.))
write.csv(usgs_final, "usgs_final.csv")
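For reference, the long-format alternative I decided against could be built with tidyr's gather; this is only a sketch, assuming usgs_final carries the columns created above:

library(dplyr)
library(tidyr)
# Sketch only: one Reading column keyed by a Sensor name, instead of one column per sensor
usgs_long <- usgs_final %>%
    gather(key = "Sensor", value = "Reading",
           Water_Temp_C, Lake_Stage, Specific_Conductance, Dissolved_Oxygen, Ph)
head(usgs_long)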

I selected a subset of this data, August 15 to August 16, 2017, to plot with the ggplot2 package. The goal was to evaluate how water temperature changes relative to dissolved oxygen. To see the two trends together more clearly, I divided the temperatures by three so they sit nearer the dissolved oxygen values in scale; the secondary axis on the plot multiplies those values back by three so its labels still read in degrees. The values for the period follow in the data table below.

usgs_august <- usgs_final %>%
    filter(Date_Time > "2017-08-15" & Date_Time < "2017-08-17")
write.csv(usgs_august, "august_USGS.csv")

ggplot(usgs_august) +
    geom_line(aes(x = Date_Time, y = Dissolved_Oxygen, color = "Dissolved Oxygen (ppm)")) +
    # Temperature is divided by 3 to share the scale; the secondary axis multiplies it back
    geom_line(aes(x = Date_Time, y = Water_Temp_C/3, color = "Temperature (C)")) +
    scale_y_continuous(name = expression("Dissolved Oxygen (ppm)"),
        sec.axis = sec_axis(~. * 3, name = "Water Temperature"),
        limits = c(7.5, 10)) +
    # scale_x_datetime(breaks = date_breaks('2 hour'), labels = date_format('%H:%M')) +  # needs the scales package
    scale_color_manual(values = c("purple", "navy")) +
    theme_light() +
    labs(y = "Dissolved Oxygen (ppm)", x = "Date and Time", color = "Sensor") +
    ggtitle("Changes in Water Temperature between August 15 and August 17")

Transformation Three

Twitter & Word Cloud

For this transformation I decided to use the Twitter API to pull some tweets carrying the hashtag #putin and build a word cloud. The data comes in as a data frame, and in this case the frame will be used unconventionally, storing multiple values in some fields, since each tweet has its own number of '@' and '#' tags. Once it is no longer necessary to align them with their owner or the tweet text, they will be separated out into a more suitable data structure.

The code used to create the connection will not be echoed to protect my Twitter API keys and OAuth codes.

The Pull

With the setup completed, it is time to pull the data from Twitter. Because Twitter is particular about how its data is accessed and used, we limit the pull to 1,500 tweets at this time, with the search term '#putin', and push the resulting list into a data frame using twListToDF.

library(twitteR)   # setup_twitter_oauth(), searchTwitteR(), twListToDF()
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
## [1] "Using direct authentication"
search.string <- "#putin"
no.of.tweets <- 1500
# Pull up to 1500 English-language tweets matching the search term
tweets <- searchTwitteR(search.string, n = no.of.tweets, lang = "en")
tweetsDF <- twListToDF(tweets)   # flatten the list of status objects into a data frame
write.csv(tweetsDF, "tweetsDF.csv")

Tidy the Necessary Parts of the Data Frame and Select

To build a word cloud of the most common mentions, it is necessary to extract the mentions into a column of their own using regular expressions and str_extract_all. While at it, I did the same for the hashtags, replacing `character(0)` entries with NA.
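As a small illustration of the pattern (the tweet text here is invented), "@\\w+" grabs every word that starts with @, and str_extract_all returns one character vector per tweet:

library(stringr)
str_extract_all("RT @user1: #putin meets @user2 today", "@\\w+")
## [[1]]
## [1] "@user1" "@user2"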

tweetsDF <- read.csv("tweetsDF.csv", header = TRUE, stringsAsFactors = FALSE)
tweetsDF_2 <- tweetsDF[!is.na(tweetsDF$text), ] %>% mutate(mentions = str_extract_all(text, 
    "@\\w+")) %>% mutate(hashes = str_extract_all(text, "#\\w+")) %>% mutate(mentions = replace(mentions, 
    mentions == "character(0)", NA)) %>% mutate(hashes = replace(hashes, hashes == 
    "character(0)", NA)) %>% select(hashes, mentions)

Create a Wordcloud of Mentions Returned by #putin Query

In order to create a word cloud, it is necessary to aggregate all mentions, without regard for which cells they are in, and count them. This demands unlisting. For simplicity's sake, I then filtered out the NA values and removed the @ and # symbols.

require(viridis)     # colour palettes for the cloud
library(wordcloud)   # wordcloud()
# Flatten the list column into one vector, drop NAs, and strip the "@"
tstring <- data.frame(mentions = unlist(tweetsDF_2$mentions)) %>%
    mutate(mentions = as.character(mentions)) %>%
    filter(!is.na(mentions)) %>%
    mutate(mentions = gsub("@", "", mentions))

cloud_mention <- as.data.frame(table(tstring))   # frequency of each mention

wordcloud(words = cloud_mention$tstring, freq = cloud_mention$Freq, min.freq = 1, 
    max.words = 50, rot.per = 0.26, random.order = FALSE, colors = viridis(4))

colnames(cloud_mention) <- c("Mentions", "Frequency")
write.csv(cloud_mention, "cloud_mention.csv")
DT::datatable(cloud_mention, rownames = FALSE)

Word Cloud of Hashes Returned by #putin Query

# Same flattening process for hashtags: unlist, drop NAs, strip the "#"
tstring <- data.frame(hashes = unlist(tweetsDF_2$hashes)) %>%
    mutate(hashes = as.character(hashes)) %>%
    filter(!is.na(hashes)) %>%
    mutate(hashes = gsub("#", "", hashes))

cloud_hashes <- as.data.frame(table(tstring))

wordcloud(words = cloud_hashes$tstring, freq = cloud_hashes$Freq, min.freq = 1, 
    max.words = 100, rot.per = 0.26, random.order = FALSE, colors = viridis(7))

colnames(cloud_hashes) <- c("Hashes", "Frequency")
DT::datatable(cloud_hashes, rownames = FALSE)
write.csv(cloud_hashes, "cloud_hashes.csv")