Introduction

The State of Maine Health and Environmental Testing Laboratory (HETL) conducted private well water sample tests between 1999 and 2013 and saved them within two data sets titled arsenic and flouride. After looking through the available data between the two sets, the information regarding the percents of wells which were above the Maine Maximum Exposure Guidelines, one for flouride and another for arsenic, seemed most interesting. My goal is to see which locations rank poorly (bottom ten percent) in percent of wells above the guideline for both arsenic and flouride levels.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
flouride <- read.csv("flouride.csv", header = TRUE, 
                stringsAsFactors = FALSE) 

arsenic <- read.csv("arsenic.csv", header = TRUE, 
                stringsAsFactors = FALSE)

Anaylsis

The first step in the analysis is understanding the data sets themselves. Both flouride and arsenic share the same columns, and for both any locations with fewer than 20 wells tested only the n_wells_tested and maximum columns are populated. Since I want to compare the percent of wells above the State’s guidelines, I need to filter out locations where that data is missing. Due to the size of the data sets I display the top 5 rows of data to give a truncated view of the outputs until the final result.

flouride <- flouride %>% filter(n_wells_tested > 20)
arsenic <- arsenic %>% filter(n_wells_tested > 20)
kable(flouride %>% top_n(5))
## Selecting by maximum
location n_wells_tested percent_wells_above_guideline median percentile_95 maximum
Kennebunk 110 12.7 0.1 3.200 9.6
Anson 40 5.0 0.1 1.400 14.0
Raymond 181 5.0 0.3 1.900 9.1
Ashland 27 3.7 0.1 0.551 10.0
Peru 58 1.7 0.1 1.255 9.9
kable(arsenic %>% top_n(5))
## Selecting by maximum
location n_wells_tested percent_wells_above_guideline median percentile_95 maximum
Buxton 334 43.4 6.0 110.0 670
Blue Hill 241 42.7 7.0 229.0 930
Danforth 35 40.0 5.0 372.5 3100
Sedgwick 142 37.3 4.2 236.6 840
Northport 157 26.1 2.1 113.7 1700

Now focusing on the locations where we know the percent of wells above the State guidelines, I rearrange both data sets by the Percent_wells_above_guideline column from highest value to lowest. Since we will be focusing on location and percent_wells_above_guideline, I select only those columns to use going forward.

flouride <- flouride %>%  arrange(desc(percent_wells_above_guideline)) %>% select(location, percent_wells_above_guideline)
arsenic <- arsenic %>%  arrange(desc(percent_wells_above_guideline)) %>% select(location, percent_wells_above_guideline)
kable(flouride %>% top_n(5))
## Selecting by percent_wells_above_guideline
location percent_wells_above_guideline
Otis 30.0
Dedham 22.5
Denmark 19.6
Surry 18.3
Prospect 17.5
kable(arsenic %>% top_n(5))
## Selecting by percent_wells_above_guideline
location percent_wells_above_guideline
Manchester 58.9
Gorham 50.1
Columbia 50.0
Monmouth 49.5
Eliot 49.3

Next I want to understand where each location ranks based on the percent of wells above guideline, so I add a percent rank column to both data sets based on percent of wells above guideline. Because having a high percentage of wells above the guideline is negative, I have the percent rank inverted to give locations with a high percentage of wells above the guidelines a low percent rank. I add the chemcial names to he new columns for when we combine the two data sets later.

flouride <- flouride %>% mutate(Above_Flouride_Guideline_Percent_Rank = percent_rank(1-percent_wells_above_guideline)*100)
arsenic <- arsenic %>% mutate(Above_Arsenic_Guideline_Percent_Rank = percent_rank(1-percent_wells_above_guideline)*100)
kable(flouride %>% slice(1:5), digits = 2)
location percent_wells_above_guideline Above_Flouride_Guideline_Percent_Rank
Otis 30.0 0.00
Dedham 22.5 0.28
Denmark 19.6 0.56
Surry 18.3 0.84
Prospect 17.5 1.12
kable(arsenic %>% slice(1:5), digits = 2)
location percent_wells_above_guideline Above_Arsenic_Guideline_Percent_Rank
Manchester 58.9 0.0
Gorham 50.1 0.3
Columbia 50.0 0.6
Monmouth 49.5 0.9
Eliot 49.3 1.2

Now that I have the percent rankings for both arsenic and flouride locations, I then look to join the two data sets together. Because I am looking to evaluate locations that rank poorly for both arsenic and flouride, I use and inner join to filter out locations where we do not have pecent rankings for both tests. This could be due to the locaton not being tested for either or both chemicals, or if fewer than 20 wells were tested. I call this new combined data set wells.

wells <- inner_join(arsenic, flouride, by = "location") %>% select(location, Above_Flouride_Guideline_Percent_Rank, Above_Arsenic_Guideline_Percent_Rank)
kable(wells %>% slice(1:5), digits = 2)
location Above_Flouride_Guideline_Percent_Rank Above_Arsenic_Guideline_Percent_Rank
Manchester 22.75 0.0
Gorham 52.25 0.3
Columbia 36.24 0.6
Monmouth 25.28 0.9
Eliot 52.25 1.2

To determine which locations rank poorly for both arsenic and flouride, I decide to see which fall into the bottom ten percent for both tests. Here I output all results as the bottom_locations.

bottom_locations <- wells %>% filter(Above_Flouride_Guideline_Percent_Rank < 10, Above_Arsenic_Guideline_Percent_Rank < 10)
kable(bottom_locations, digits = 2)
location Above_Flouride_Guideline_Percent_Rank Above_Arsenic_Guideline_Percent_Rank
Blue Hill 7.02 2.70
Orland 8.43 3.60
Surry 0.84 3.90
Otis 0.00 5.11
Sedgwick 5.06 5.71
Mercer 1.69 6.01

Conclusion

The final output contains six locations which ranked in the bottom ten percent in terms of percent of wells above state guidelines for both arsenic and flouride. Ranking these six locations would be difficult since we are looking at percent rankings for two different tests. Considering that these locations have high percentages of wells above the guidelines for both chemcials, it can be assumed that water treatment should be expected by anyone looking to install a private well.