In Project 1, we focus on tests of private wells in Maine between 1999 and 2013. The testing data for this project are courtesy of the Maine Tracking Network.
Our source data are summarized by location, with the same variables in separate files for arsenic and fluoride.
Due to the nature of our source data:
Given these considerations, this analyst-student focused on two questions:
The appearance of such a relationship seems plausible in both cases, as much of Maine’s groundwater travels in aquifers constrained by bedrock. This contact with bedrock over long periods of time is largely responsible for the introduction of these minerals.
Let’s begin our investigation by preparing our environment:
#Set Working Directory and Load the Requisite Libraries
setwd("~/Documents/MBA 676/Unit 4 Stuff")
getwd()
## [1] "/Users/Josh/Documents/MBA 676/Unit 4 Stuff"
library(dplyr)
library(ggvis)
library(knitr)
library(shiny)
library(zipcode)
library(ggplot2)
library(choroplethrZip)
library(rmarkdown)
Next, we read our base files into data frames and remove locations with fewer than 20 total wells tested.
#Read the two base files into data frames
arsenic <- read.csv("arsenic.csv", header = TRUE, stringsAsFactors = FALSE)
fluoride <- read.csv("fluoride.csv", header = TRUE, stringsAsFactors = FALSE)
#Remove locations with fewer than 20 wells tested (rows with a missing count are dropped as well). Store the results as new data frames.
arsenic_no_smalls <- arsenic %>% filter(n_wells_tested >= 20)
fluoride_no_smalls <- fluoride %>% filter(n_wells_tested >= 20)
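One detail of the step above is easy to miss: `filter()` keeps only rows where the condition is `TRUE`, so a location whose `n_wells_tested` is missing is dropped along with the small locations. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the Maine files):

```r
library(dplyr)

# Toy data: hypothetical locations, not taken from the Maine files
toy <- data.frame(
  location = c("A", "B", "C", "D"),
  n_wells_tested = c(35, 12, NA, 20),
  stringsAsFactors = FALSE
)

# filter() keeps rows where the condition is TRUE;
# B (12 tests, FALSE) and C (missing count, NA) are both dropped
toy_no_smalls <- toy %>% filter(n_wells_tested >= 20)

print(toy_no_smalls$location)
```

Comparing `nrow()` before and after the filter on the real frames would show how many locations this step removes.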
Now we prepare our two frames for an inner join by location. Recognizing that the joined data frame will share variable names for both arsenic and fluoride, we apply new column names to keep the variables unique.
arsenic_no_smalls <- arsenic_no_smalls %>% rename(a_n_wells_tested = n_wells_tested, a_percent_wells_above_guideline = percent_wells_above_guideline, a_median = median, a_percentile_95 = percentile_95, a_maximum = maximum)
fluoride_no_smalls <- fluoride_no_smalls %>% rename(f_n_wells_tested = n_wells_tested, f_percent_wells_above_guideline = percent_wells_above_guideline, f_median = median, f_percentile_95 = percentile_95, f_maximum = maximum)
Next, we complete the inner join to create our new master data frame including both the arsenic and fluoride variables.
#Create a new master data frame, inner joining on location. Only locations with 20 or more tests for both fluoride and arsenic will be included in the new master data frame
master_frame <- arsenic_no_smalls %>% inner_join(fluoride_no_smalls, by = "location")
#Write out a .csv of the new master data frame to serve as a restore point if needed
write.csv(master_frame, "Joined_Well_Data.csv", row.names = FALSE)
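Because an inner join silently discards unmatched rows, it can be worth checking which locations fall out. The sketch below uses two toy frames with hypothetical values (`a_toy`, `f_toy`) and assumes, as in the text, that `location` is the only join key:

```r
library(dplyr)

# Toy frames with hypothetical values; only the column naming follows the text
a_toy <- data.frame(location = c("A", "B", "C"), a_median = c(1.0, 4.2, 0.5),
                    stringsAsFactors = FALSE)
f_toy <- data.frame(location = c("B", "C", "D"), f_median = c(0.3, 0.1, 0.9),
                    stringsAsFactors = FALSE)

# Naming the key makes the join independent of any other shared column names
joined <- a_toy %>% inner_join(f_toy, by = "location")

# anti_join() returns the rows of the first frame with no match in the second
a_only <- a_toy %>% anti_join(f_toy, by = "location")
f_only <- f_toy %>% anti_join(a_toy, by = "location")

print(joined$location)
print(a_only$location)
print(f_only$location)
```

Running the same two `anti_join()` calls on the real frames would list the locations that have enough tests for one mineral but not the other.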
We can test the relationship between arsenic and fluoride using ggvis.
master_frame %>% ggvis(~f_percent_wells_above_guideline, ~a_percent_wells_above_guideline) %>% layer_model_predictions(model = "lm", se = TRUE) %>% layer_points()
Each point in the scatterplot above is a location, positioned by the percentage of its tested wells over the recommended threshold for fluoride (x-axis) and arsenic (y-axis). The plot also includes a linear trendline along with shading for the 95% confidence interval. It suggests a few findings:
With our scatterplot above, we established that there may be a significant, though not terribly strong, relationship between elevated fluoride and arsenic concentrations in Maine.
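Significance and strength are separate questions, and both can be put in numbers rather than read off the plot. The following is a minimal sketch, not the analysis actually run for this report: it uses base R's `cor.test()` and `lm()` on a toy frame of hypothetical percentages, borrowing only the column names from `master_frame`.

```r
# Toy data: hypothetical percentages, not the Maine results
toy <- data.frame(
  f_percent_wells_above_guideline = c(0, 1.5, 2.0, 4.1, 5.0, 7.3, 8.8, 10.2),
  a_percent_wells_above_guideline = c(3, 9.0, 2.5, 12.0, 6.0, 20.0, 11.0, 18.5)
)

# Pearson correlation with a test of the null hypothesis of zero correlation
ct <- cor.test(toy$f_percent_wells_above_guideline,
               toy$a_percent_wells_above_guideline)

# Linear model of the same form as the trendline (arsenic ~ fluoride);
# R-squared is the share of variance in arsenic explained by fluoride
fit <- lm(a_percent_wells_above_guideline ~ f_percent_wells_above_guideline,
          data = toy)

print(ct$estimate)             # strength of the relationship
print(ct$p.value)              # evidence against zero correlation
print(summary(fit)$r.squared)  # equals the squared correlation for one predictor
```

A small p-value alongside a low R-squared would match the reading above: a relationship that is unlikely to be chance but explains little of the variation.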
The other question that comes to mind is whether we might see a relationship in terms of where we find these elevated readings. To investigate this line of questioning, we endeavor to plot the elevated readings by location.
This approach required the introduction of some foreign data, namely the zip codes for the locations in our data set.
To simplify the work, we drop records that do not match a zip code. Where one zip code matches more than one location, we keep only the location with the higher maximum arsenic result, using the higher median arsenic result as a tiebreaker.
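The rule just described can be sketched compactly on toy data. This is an alternative formulation under stated assumptions, not the code used in this report: the `toy` frame and its values are hypothetical, and it uses `arrange()` followed by `distinct()` so that exactly one row per zip code survives even when both the maximum and the median are tied.

```r
library(dplyr)

# Toy data: hypothetical locations and values
toy <- data.frame(
  location  = c("Town A", "Town B", "Town C", "Town D", "Town E"),
  zip       = c("04001", "04001", "04002", "04002", NA),
  a_maximum = c(50, 80, 30, 30, 10),
  a_median  = c(2, 1, 5, 7, 1),
  stringsAsFactors = FALSE
)

deduped <- toy %>%
  # drop locations that did not match a zip code
  filter(!is.na(zip)) %>%
  # within each zip: highest maximum first, highest median breaks ties
  arrange(zip, desc(a_maximum), desc(a_median)) %>%
  # keep the first (best-ranked) row for each zip
  distinct(zip, .keep_all = TRUE)

print(deduped)
```

Here "Town B" wins zip 04001 on its maximum, and "Town D" wins zip 04002 on the median tiebreaker.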
#Read in a new helper frame called zipcodes
zipcodes <- read.csv("zipcodes.csv", header = TRUE, stringsAsFactors = FALSE)
#Prepare raw zips lacking leading zeroes for the cleaner function
zipcodes <- zipcodes %>% rename(postal = zipcode)
#Clean zip codes to restore leading zeroes and store as chr vector
zipcodes$zip <- clean.zipcodes(zipcodes$postal)
#Drop original int "postal" column
zipcodes <- zipcodes %>% select(location, zip)
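#Left outer join on location to append zip codes to the master frame
master_frame_zip <- master_frame %>% left_join(zipcodes, by = "location")
#Drop locations with no matching zip code, then keep one zip code per location (the highest-sorting one)
master_frame_zip <- master_frame_zip %>% filter(!is.na(zip)) %>% group_by(location) %>% arrange(desc(location)) %>% top_n(1, zip)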
#Preparing master_frame_zip for zip_choropleth
master_frame_zip <- master_frame_zip %>% rename(region = zip)
#Remove duplicate zip codes: keep the location with the highest arsenic maximum, then break ties on the highest arsenic median
master_frame_zip <- master_frame_zip %>% group_by(region) %>% filter(a_maximum == max(a_maximum)) %>% top_n(1)
master_frame_zip <- master_frame_zip %>% group_by(region) %>% filter(a_median == max(a_median)) %>% top_n(1)
#Final prep - renaming the value to plot
master_frame_zip <- master_frame_zip %>% rename(value = a_percent_wells_above_guideline)
zip_choropleth(master_frame_zip,
title = "Arsenic Concentrations in Wells Across Maine",
legend = "Percentage of Wells Above Threshold",
num_colors = 9,
state_zoom = "maine",
reference_map = TRUE)
#Prepare for the fluoride visualization: restore the arsenic column name, then rename the fluoride column to value
master_frame_zip <- master_frame_zip %>% rename(a_percent_wells_above_guideline = value)
master_frame_zip <- master_frame_zip %>% rename(value = f_percent_wells_above_guideline)
zip_choropleth(master_frame_zip,
title = "Fluoride Concentrations in Wells Across Maine",
legend = "Percentage of Wells Above Threshold",
num_colors = 9,
state_zoom = "maine",
reference_map = TRUE)
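The rename-and-rename-back step between the two maps could be avoided with a small helper that builds a fresh two-column frame for each mineral. The sketch below is one possible approach: `as_choropleth_input` is a hypothetical name, the `toy` frame holds made-up values, and it assumes the mapping function only needs columns named `region` and `value`.

```r
# Hypothetical helper: build a region/value frame from any column of df,
# leaving the original frame and its column names untouched
as_choropleth_input <- function(df, value_col) {
  data.frame(
    region = df$region,
    value = df[[value_col]],
    stringsAsFactors = FALSE
  )
}

# Toy data: hypothetical zip codes and percentages
toy <- data.frame(
  region = c("04001", "04002"),
  a_percent_wells_above_guideline = c(12.5, 3.0),
  f_percent_wells_above_guideline = c(1.0, 4.5),
  stringsAsFactors = FALSE
)

arsenic_map_data  <- as_choropleth_input(toy, "a_percent_wells_above_guideline")
fluoride_map_data <- as_choropleth_input(toy, "f_percent_wells_above_guideline")

print(arsenic_map_data)
print(fluoride_map_data)
```

Each result could then be passed to `zip_choropleth()` in place of the renamed master frame, so neither map depends on the state the other left behind.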
Conclusions from the visualizations for arsenic and fluoride:
This exploratory analysis suggests that elevated levels of dissolved minerals are a significant issue in many areas of Maine. Overall, arsenic has the greater prevalence of wells over guideline and represents the greater health risk compared with fluoride. Areas testing higher in one mineral tend to test higher in the other, though this relationship may be weak.
While every private well should be tested regularly, well owners near the darker-shaded regions in the final two visualizations would be well-advised to test annually.