MBA 676 Unit 4, Fall 2016

In Project 1, we focus on tests of private wells in Maine between 1999 and 2013. The testing data for this project are courtesy of the Maine Tracking Network.
Our source data are summarized by location, with the same variables in separate files for arsenic and fluoride.

Due to the nature of our source data:

Given these considerations, this analyst-student focused on two questions:

  1. Do we find a relationship between elevated concentrations of fluoride and arsenic?
  2. Do the locations with higher concentrations of fluoride and arsenic enjoy a geographic relationship?

The appearance of such a relationship seems plausible in both cases, as much of Maine’s groundwater travels in aquifers constrained by bedrock. This contact with bedrock over long periods of time is largely responsible for the introduction of these minerals.

Let’s begin our investigation by preparing our environment:

#Set Working Directory and Load the Requisite Libraries
setwd("~/Documents/MBA 676/Unit 4 Stuff")
getwd()
## [1] "/Users/Josh/Documents/MBA 676/Unit 4 Stuff"
library(dplyr)
library(ggvis)
library(knitr)
library(shiny)
library(zipcode)
library(ggplot2)
library(choroplethrZip)
library(rmarkdown)

Next, we read our base files into data frames and remove locations with fewer than 20 total wells tested.

#Read the two base files into data frames
arsenic <- read.csv("arsenic.csv", header = TRUE, stringsAsFactors = FALSE)
fluoride <- read.csv("fluoride.csv",header = TRUE, stringsAsFactors = FALSE)
#Remove locations with fewer than 20 wells tested from each data frame.  Store the results as new data frames.
arsenic_no_smalls <- arsenic %>% filter(n_wells_tested >=20)
fluoride_no_smalls <- fluoride %>% filter(n_wells_tested >=20)

Now we prepare our two frames for an inner join by location. Recognizing that the joined data frame will share variable names for both arsenic and fluoride, we apply new column names to keep the variables unique.

arsenic_no_smalls <- arsenic_no_smalls %>%
  rename(a_n_wells_tested = n_wells_tested,
         a_percent_wells_above_guideline = percent_wells_above_guideline,
         a_median = median,
         a_percentile_95 = percentile_95,
         a_maximum = maximum)

fluoride_no_smalls <- fluoride_no_smalls %>%
  rename(f_n_wells_tested = n_wells_tested,
         f_percent_wells_above_guideline = percent_wells_above_guideline,
         f_median = median,
         f_percentile_95 = percentile_95,
         f_maximum = maximum)

Next, we complete the inner join to create our new master data frame including both the arsenic and fluoride variables.

#Create a new master data frame, inner joining on location.  Only locations with 20 or more tests for both fluoride and arsenic will be included in the new master data frame
master_frame <- arsenic_no_smalls %>% inner_join(fluoride_no_smalls, by = "location")
#Write out a .csv of the new master data frame to serve as a restore point if needed
write.csv(master_frame, "Joined_Well_Data.csv", row.names = FALSE)
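
Because an inner join silently drops unmatched rows, it can be worth listing the locations that fall out of the master data frame. The following is a minimal sketch on made-up data: the town names and well counts are hypothetical stand-ins, not values from the source files. It assumes dplyr's `anti_join()` and the column names used above.

```r
library(dplyr)

# Hypothetical toy frames standing in for arsenic_no_smalls and fluoride_no_smalls
toy_arsenic <- data.frame(location = c("Town A", "Town B", "Town C"),
                          a_n_wells_tested = c(25, 40, 31),
                          stringsAsFactors = FALSE)
toy_fluoride <- data.frame(location = c("Town B", "Town C", "Town D"),
                           f_n_wells_tested = c(22, 58, 20),
                           stringsAsFactors = FALSE)

# Only locations present in both frames survive the inner join
toy_master <- toy_arsenic %>% inner_join(toy_fluoride, by = "location")

# Locations that the inner join drops from each side
dropped_from_arsenic <- toy_arsenic %>% anti_join(toy_fluoride, by = "location")
dropped_from_fluoride <- toy_fluoride %>% anti_join(toy_arsenic, by = "location")

print(toy_master$location)
print(dropped_from_arsenic$location)
print(dropped_from_fluoride$location)
```

With the real data, the two `anti_join()` results would show which locations met the 20-test cutoff for only one of the two minerals.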

Now, the fun part

We can explore the relationship between arsenic and fluoride visually using ggvis.

master_frame %>% ggvis(~f_percent_wells_above_guideline, ~a_percent_wells_above_guideline) %>% layer_model_predictions(model = "lm", se = TRUE) %>% layer_points()

The scatterplot above shows each location by its percentage of tested wells over the recommended threshold for fluoride (x-axis) and arsenic (y-axis). The plot also includes a linear trendline along with shading for the 95% confidence interval. It suggests a few findings:

  1. The relationship isn’t terribly strong between elevated levels of the two minerals.
  2. Though not terribly strong, there’s a good chance the relationship is statistically significant, which could be tested with additional analysis.
  3. What relationship is present is positive. Locations with greater percentages of wells over the recommended threshold in one mineral are more likely (rather than less likely) to also have a higher percentage of wells over the recommended threshold for the other mineral.
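
The second finding could be checked with a formal test. Below is a minimal sketch using base R's `cor.test()` and `lm()`. The values in `toy_frame` are made up purely to illustrate the calls and are not results from the well data; with the real data, `master_frame` would take its place.

```r
# Hypothetical toy data: percent of wells above guideline per location
toy_frame <- data.frame(
  f_percent_wells_above_guideline = c(0, 1.5, 2, 3.5, 5, 6, 8, 10),
  a_percent_wells_above_guideline = c(4, 10, 6, 15, 12, 22, 18, 30)
)

# Pearson correlation, with a test of the null hypothesis of zero correlation
cor_result <- cor.test(toy_frame$f_percent_wells_above_guideline,
                       toy_frame$a_percent_wells_above_guideline)

# The same kind of linear model that the "lm" trendline is based on
fit <- lm(a_percent_wells_above_guideline ~ f_percent_wells_above_guideline,
          data = toy_frame)

print(cor_result$estimate)     # strength and direction of the relationship
print(cor_result$p.value)      # evidence against zero correlation
print(summary(fit)$r.squared)  # share of variance explained by the line
```

The correlation estimate speaks to strength and direction (findings 1 and 3), while the p-value speaks to significance (finding 2).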

The other part of the question

With our scatterplot above, we established that there may be a significant, though not terribly strong, relationship between elevated fluoride and arsenic concentrations in Maine.

The other question that comes to mind is whether we might see a relationship in terms of where we find these elevated readings. To investigate this line of questioning, we endeavor to plot the elevated readings by location.

This approach requires the introduction of some outside data, namely the zip codes for the locations in our data set.
To simplify the work, we drop records that don’t match a zip code. We also drop records where one zip code appears more than once. In this second case, we keep the location with the higher maximum arsenic result (the higher median arsenic result is used as a tiebreaker).
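
The de-duplication rule described above can be sketched on toy data. This is a minimal illustration rather than the code used for the analysis: the locations, zip codes, and arsenic values are hypothetical, and it uses dplyr's `arrange()` and `slice()` to keep exactly one row per zip code.

```r
library(dplyr)

# Hypothetical toy data: two zip codes, each matched by more than one location
toy_zip <- data.frame(
  location  = c("Town A", "Town B", "Town C", "Town D", "Town E"),
  zip       = c("04001", "04001", "04002", "04002", "04002"),
  a_maximum = c(120, 300, 80, 80, 15),
  a_median  = c(2, 5, 1, 3, 1),
  stringsAsFactors = FALSE
)

# Within each zip code, sort by maximum (then median as the tiebreaker)
# and keep only the first row
toy_unique <- toy_zip %>%
  group_by(zip) %>%
  arrange(desc(a_maximum), desc(a_median), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

print(toy_unique)
```

Here Town B wins its zip code on the maximum alone, while Town D wins over Town C only through the median tiebreaker.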

#Read in a new helper frame called zip codes
zipcodes <- read.csv("zipcodes.csv", header = TRUE, stringsAsFactors = FALSE)
#Prepare raw zips lacking leading zeroes for the cleaner function
zipcodes <- zipcodes %>% rename(postal = zipcode)
#Clean zip codes to append leading zeroes and store as chr vector
zipcodes$zip <- clean.zipcodes(zipcodes$postal)
#Drop original int "postal" column
zipcodes <- zipcodes %>% select(location, zip)
#left outer join on location to master frame to append zip codes to master frame
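master_frame_zip <- master_frame %>% left_join(zipcodes, by = "location")
#Drop locations with no matching zip code, then keep one zip code per location (the highest)
master_frame_zip <- master_frame_zip %>% filter(!is.na(zip)) %>% group_by(location) %>% top_n(1, zip) %>% ungroup()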
#Preparing master_frame_zip for zip_choropleth
master_frame_zip <- master_frame_zip %>% rename(region = zip)
#Remove duplicate zip codes from data set: keep the location with the highest maximum arsenic result, using the highest median arsenic result as the tiebreaker
master_frame_zip <- master_frame_zip %>% group_by(region) %>% filter(a_maximum == max(a_maximum)) %>% filter(a_median == max(a_median)) %>% ungroup()
#Final prep - renaming the value to plot
master_frame_zip <- master_frame_zip %>% rename(value = a_percent_wells_above_guideline)
zip_choropleth(master_frame_zip, 
                title = "Arsenic Concentrations in Wells Across Maine", 
                legend = "Percentage of Wells Above Threshold",
                num_colors = 9, 
                state_zoom = "maine", 
                reference_map = TRUE)

#Prepare for Fluoride Vis
master_frame_zip <- master_frame_zip %>% rename(a_percent_wells_above_guideline = value)
master_frame_zip <- master_frame_zip %>% rename(value = f_percent_wells_above_guideline)
zip_choropleth(master_frame_zip, 
               title = "Fluoride Concentrations in Wells Across Maine", 
               legend = "Percentage of Wells Above Threshold",
               num_colors = 9, 
               state_zoom = "maine", 
               reference_map = TRUE)

Conclusions from the visualizations for arsenic and fluoride:

  1. Based on the bins generated by zip_choropleth, the top bin for arsenic covers a much higher percentage range than the top bin for fluoride, indicating a higher maximum percentage of wells over guideline for arsenic.
  2. Overall, the arsenic visualization is more shaded than the fluoride one; here we find additional evidence that more Maine locations have a greater percentage of wells over guideline for arsenic versus fluoride.
  3. We also find evidence of the correlation between elevated levels of arsenic and fluoride. Many of the same locations are shaded in both maps, like the Katahdin Region, the western border, and Bar Harbor-Downeast.
  4. Finally, the mapping format helps us to confirm that for both minerals, the impacted areas are not evenly distributed throughout the state. Instead, we find clustered areas over guideline.
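
The first two conclusions are read off the map legends and shading, and they could also be checked numerically. A minimal sketch in base R follows; the values in `toy_frame` are hypothetical and only illustrate the comparison, and with the real data `master_frame` would take its place.

```r
# Hypothetical toy data: percent of wells above guideline per location
toy_frame <- data.frame(
  a_percent_wells_above_guideline = c(0, 5, 12, 30, 45),
  f_percent_wells_above_guideline = c(0, 0, 2, 4, 9)
)

pct_cols <- c("a_percent_wells_above_guideline",
              "f_percent_wells_above_guideline")

# One row per mineral: highest percentage, average percentage, and the
# share of locations with at least some wells above guideline
summary_table <- data.frame(
  mineral       = c("arsenic", "fluoride"),
  max_pct       = unname(sapply(toy_frame[pct_cols], max)),
  mean_pct      = unname(sapply(toy_frame[pct_cols], mean)),
  share_nonzero = unname(sapply(toy_frame[pct_cols], function(x) mean(x > 0)))
)

print(summary_table)
```

The `max_pct` column corresponds to conclusion 1 (the top of the scale), while `mean_pct` and `share_nonzero` correspond to conclusion 2 (how widely the map is shaded).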

The final word

This exploratory analysis suggests that elevated levels of dissolved minerals are a significant issue in many areas of Maine. Overall, arsenic has the greater prevalence of wells over guideline and, on that measure, represents the greater health risk compared with fluoride. Areas testing higher in one mineral are likely to test higher in the other, though this relationship may be weak.

While every private well should be tested regularly, well owners near the darker-shaded regions in the final two visualizations would be well-advised to test annually.