Introduction
The State of Maine Health and Environmental Testing Laboratory (HETL) conducted private well water sample tests between 1999 and 2013 and saved them within two data sets titled arsenic and flouride. After looking through the available data between the two sets, the information regarding the percents of wells which were above the Maine Maximum Exposure Guidelines, one for flouride and another for arsenic, seemed most interesting. My goal is to see which locations rank poorly (bottom ten percent) in percent of wells above the guideline for both arsenic and flouride levels.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
flouride <- read.csv("flouride.csv", header = TRUE,
stringsAsFactors = FALSE)
arsenic <- read.csv("arsenic.csv", header = TRUE,
stringsAsFactors = FALSE)
Anaylsis
The first step in the analysis is understanding the data sets themselves. Both flouride and arsenic share the same columns, and for both any locations with fewer than 20 wells tested only the n_wells_tested and maximum columns are populated. Since I want to compare the percent of wells above the State’s guidelines, I need to filter out locations where that data is missing. Due to the size of the data sets I display the top 5 rows of data to give a truncated view of the outputs until the final result.
flouride <- flouride %>% filter(n_wells_tested > 20)
arsenic <- arsenic %>% filter(n_wells_tested > 20)
kable(flouride %>% top_n(5))
## Selecting by maximum
Kennebunk |
110 |
12.7 |
0.1 |
3.200 |
9.6 |
Anson |
40 |
5.0 |
0.1 |
1.400 |
14.0 |
Raymond |
181 |
5.0 |
0.3 |
1.900 |
9.1 |
Ashland |
27 |
3.7 |
0.1 |
0.551 |
10.0 |
Peru |
58 |
1.7 |
0.1 |
1.255 |
9.9 |
kable(arsenic %>% top_n(5))
## Selecting by maximum
Buxton |
334 |
43.4 |
6.0 |
110.0 |
670 |
Blue Hill |
241 |
42.7 |
7.0 |
229.0 |
930 |
Danforth |
35 |
40.0 |
5.0 |
372.5 |
3100 |
Sedgwick |
142 |
37.3 |
4.2 |
236.6 |
840 |
Northport |
157 |
26.1 |
2.1 |
113.7 |
1700 |
Now focusing on the locations where we know the percent of wells above the State guidelines, I rearrange both data sets by the Percent_wells_above_guideline column from highest value to lowest. Since we will be focusing on location and percent_wells_above_guideline, I select only those columns to use going forward.
flouride <- flouride %>% arrange(desc(percent_wells_above_guideline)) %>% select(location, percent_wells_above_guideline)
arsenic <- arsenic %>% arrange(desc(percent_wells_above_guideline)) %>% select(location, percent_wells_above_guideline)
kable(flouride %>% top_n(5))
## Selecting by percent_wells_above_guideline
Otis |
30.0 |
Dedham |
22.5 |
Denmark |
19.6 |
Surry |
18.3 |
Prospect |
17.5 |
kable(arsenic %>% top_n(5))
## Selecting by percent_wells_above_guideline
Manchester |
58.9 |
Gorham |
50.1 |
Columbia |
50.0 |
Monmouth |
49.5 |
Eliot |
49.3 |
Next I want to understand where each location ranks based on the percent of wells above guideline, so I add a percent rank column to both data sets based on percent of wells above guideline. Because having a high percentage of wells above the guideline is negative, I have the percent rank inverted to give locations with a high percentage of wells above the guidelines a low percent rank. I add the chemcial names to he new columns for when we combine the two data sets later.
flouride <- flouride %>% mutate(Above_Flouride_Guideline_Percent_Rank = percent_rank(1-percent_wells_above_guideline)*100)
arsenic <- arsenic %>% mutate(Above_Arsenic_Guideline_Percent_Rank = percent_rank(1-percent_wells_above_guideline)*100)
kable(flouride %>% slice(1:5), digits = 2)
Otis |
30.0 |
0.00 |
Dedham |
22.5 |
0.28 |
Denmark |
19.6 |
0.56 |
Surry |
18.3 |
0.84 |
Prospect |
17.5 |
1.12 |
kable(arsenic %>% slice(1:5), digits = 2)
Manchester |
58.9 |
0.0 |
Gorham |
50.1 |
0.3 |
Columbia |
50.0 |
0.6 |
Monmouth |
49.5 |
0.9 |
Eliot |
49.3 |
1.2 |
Now that I have the percent rankings for both arsenic and flouride locations, I then look to join the two data sets together. Because I am looking to evaluate locations that rank poorly for both arsenic and flouride, I use and inner join to filter out locations where we do not have pecent rankings for both tests. This could be due to the locaton not being tested for either or both chemicals, or if fewer than 20 wells were tested. I call this new combined data set wells.
wells <- inner_join(arsenic, flouride, by = "location") %>% select(location, Above_Flouride_Guideline_Percent_Rank, Above_Arsenic_Guideline_Percent_Rank)
kable(wells %>% slice(1:5), digits = 2)
Manchester |
22.75 |
0.0 |
Gorham |
52.25 |
0.3 |
Columbia |
36.24 |
0.6 |
Monmouth |
25.28 |
0.9 |
Eliot |
52.25 |
1.2 |
To determine which locations rank poorly for both arsenic and flouride, I decide to see which fall into the bottom ten percent for both tests. Here I output all results as the bottom_locations.
bottom_locations <- wells %>% filter(Above_Flouride_Guideline_Percent_Rank < 10, Above_Arsenic_Guideline_Percent_Rank < 10)
kable(bottom_locations, digits = 2)
Blue Hill |
7.02 |
2.70 |
Orland |
8.43 |
3.60 |
Surry |
0.84 |
3.90 |
Otis |
0.00 |
5.11 |
Sedgwick |
5.06 |
5.71 |
Mercer |
1.69 |
6.01 |
Conclusion
The final output contains six locations which ranked in the bottom ten percent in terms of percent of wells above state guidelines for both arsenic and flouride. Ranking these six locations would be difficult since we are looking at percent rankings for two different tests. Considering that these locations have high percentages of wells above the guidelines for both chemcials, it can be assumed that water treatment should be expected by anyone looking to install a private well.