For this report, I decided to focus on a couple of questions:
Despite the fact that town size and population aren’t included in the data set, this was an interesting analysis to work through on the towns in Maine. When I started working through the data, I couldn’t help but question the various factors that come into play when testing town wells, such as testing personnel, weather conditions, test funding, time of test, decisions made upon conclusion of the tests, etc.
When I started the assignment, I loaded the following packages so that I could clean up and review the data easily:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(knitr)
Next, I read the two csv files that we’re working with: arsenic.csv and flouride.csv. See below:
Arsenic_data<-read.csv("arsenic.csv",header=TRUE,stringsAsFactors=FALSE)
Flouride_data<-read.csv("flouride.csv",header=TRUE,stringsAsFactors=FALSE)
I checked the variable names via the head() function for the two data sets that I just read:
head(Arsenic_data)
## location n_wells_tested percent_wells_above_guideline median
## 1 Manchester 275 58.9 14.0
## 2 Gorham 467 50.1 10.5
## 3 Columbia 42 50.0 9.8
## 4 Monmouth 277 49.5 10.0
## 5 Eliot 73 49.3 9.7
## 6 Columbia Falls 25 48.0 8.1
## percentile_95 maximum
## 1 93.00 200
## 2 130.00 460
## 3 65.90 200
## 4 110.00 368
## 5 41.35 45
## 6 53.75 71
head(Flouride_data)
## location n_wells_tested percent_wells_above_guideline median
## 1 Otis 60 30.0 1.130
## 2 Dedham 102 22.5 0.940
## 3 Denmark 46 19.6 0.450
## 4 Surry 175 18.3 0.800
## 5 Prospect 57 17.5 0.785
## 6 Eastbrook 31 16.1 1.290
## percentile_95 maximum
## 1 3.200 3.6
## 2 3.270 7.0
## 3 3.150 3.9
## 4 3.525 6.9
## 5 2.500 2.7
## 6 2.445 3.3
When I looked at the variable names, I realized I could join the two data sets by location. However, I would need to rename the columns so that I knew which piece of data was which. I renamed all of the columns starting with ‘f’ for flouride and ‘a’ for arsenic.
colnames(Flouride_data)<-c("location", "f_n_wells_tested", "f_percent_wells_above_guideline", "f_median", "f_percentile_95", "f_maximum")
head(Flouride_data)
## location f_n_wells_tested f_percent_wells_above_guideline f_median
## 1 Otis 60 30.0 1.130
## 2 Dedham 102 22.5 0.940
## 3 Denmark 46 19.6 0.450
## 4 Surry 175 18.3 0.800
## 5 Prospect 57 17.5 0.785
## 6 Eastbrook 31 16.1 1.290
## f_percentile_95 f_maximum
## 1 3.200 3.6
## 2 3.270 7.0
## 3 3.150 3.9
## 4 3.525 6.9
## 5 2.500 2.7
## 6 2.445 3.3
colnames(Arsenic_data)<-c("location","a_n_wells_tested", "a_percent_wells_above_guideline", "a_median", "a_percentile_95", "a_maximum")
head(Arsenic_data)
## location a_n_wells_tested a_percent_wells_above_guideline a_median
## 1 Manchester 275 58.9 14.0
## 2 Gorham 467 50.1 10.5
## 3 Columbia 42 50.0 9.8
## 4 Monmouth 277 49.5 10.0
## 5 Eliot 73 49.3 9.7
## 6 Columbia Falls 25 48.0 8.1
## a_percentile_95 a_maximum
## 1 93.00 200
## 2 130.00 460
## 3 65.90 200
## 4 110.00 368
## 5 41.35 45
## 6 53.75 71
After the columns were renamed, I needed to join the two data sets by location. Since they had identical town names, I knew I could use the innerjoin command to link the two pieces. I checked the variable names with the head() command to be sure that everything was setup how I wanted.
Arsenic_Flouride_data<-Arsenic_data%>%inner_join(Flouride_data)
## Joining, by = "location"
head(Arsenic_Flouride_data)
## location a_n_wells_tested a_percent_wells_above_guideline a_median
## 1 Manchester 275 58.9 14.0
## 2 Gorham 467 50.1 10.5
## 3 Columbia 42 50.0 9.8
## 4 Monmouth 277 49.5 10.0
## 5 Eliot 73 49.3 9.7
## 6 Columbia Falls 25 48.0 8.1
## a_percentile_95 a_maximum f_n_wells_tested
## 1 93.00 200 276
## 2 130.00 460 452
## 3 65.90 200 54
## 4 110.00 368 288
## 5 41.35 45 84
## 6 53.75 71 38
## f_percent_wells_above_guideline f_median f_percentile_95 f_maximum
## 1 3.3 0.30 1.700 3.60
## 2 0.0 0.10 0.682 2.00
## 3 1.9 0.31 1.329 4.30
## 4 3.1 0.30 1.676 3.40
## 5 0.0 0.20 0.658 1.54
## 6 0.0 0.21 0.641 0.90
At first, I started looking at the data to see which towns were in the top 25% for the largest number of wells tested.
Top25FlourideWellsTested<-Arsenic_Flouride_data%>% arrange(desc(f_n_wells_tested))%>%filter(cume_dist(f_n_wells_tested)>=.75)%>%select(location,f_n_wells_tested)
Top25ArsenicWellsTested<-Arsenic_Flouride_data%>% arrange(desc(a_n_wells_tested))%>%filter(cume_dist(a_n_wells_tested)>=.75)%>%select(location,a_n_wells_tested)
When I looked at the massive amount of data that pulled, I decided to cut it down to the top 15 towns. After I reviewed the results, I found it interesting that between the two contaminants there were thirteen towns who shared the largest number of wells tested.
To see this in a table, I inner joined the two data sets.
Top15FLwellstested<-Top25FlourideWellsTested%>%top_n(15)
## Selecting by f_n_wells_tested
Top15ARwellstested<-Top25ArsenicWellsTested%>%top_n(15)
## Selecting by a_n_wells_tested
Top15FLwellstested
## location f_n_wells_tested
## 1 Ellsworth 503
## 2 Augusta 479
## 3 Winthrop 453
## 4 Gorham 452
## 5 Belgrade 417
## 6 Buxton 383
## 7 Readfield 351
## 8 Harpswell 318
## 9 Sidney 312
## 10 Gardiner 299
## 11 Brunswick 299
## 12 Standish 290
## 13 Monmouth 288
## 14 Windham 282
## 15 Freeport 278
Top15ARwellstested
## location a_n_wells_tested
## 1 Standish 632
## 2 Gorham 467
## 3 Augusta 454
## 4 Ellsworth 428
## 5 Winthrop 424
## 6 Belgrade 401
## 7 Readfield 344
## 8 Buxton 334
## 9 Harpswell 300
## 10 Sidney 287
## 11 Gardiner 279
## 12 Monmouth 277
## 13 Manchester 275
## 14 Litchfield 262
## 15 Brunswick 255
WellsTested<-Top15FLwellstested%>%inner_join(Top15ARwellstested)
## Joining, by = "location"
#This table shows 13 towns with the greatest number of wells tested for flouride and arsenic.
kable(WellsTested)
| location | f_n_wells_tested | a_n_wells_tested |
|---|---|---|
| Ellsworth | 503 | 428 |
| Augusta | 479 | 454 |
| Winthrop | 453 | 424 |
| Gorham | 452 | 467 |
| Belgrade | 417 | 401 |
| Buxton | 383 | 334 |
| Readfield | 351 | 344 |
| Harpswell | 318 | 300 |
| Sidney | 312 | 287 |
| Gardiner | 299 | 279 |
| Brunswick | 299 | 255 |
| Standish | 290 | 632 |
| Monmouth | 288 | 277 |
After I reviewed this data, I wanted to see which towns carried the greatest percentage of wells that tested above the maximum exposure guidelines to see if there was any correlation between the number of tests and test levels. I would expect to see a larger number of tests for areas that are testing above the maximum levels. In some aspects, it could indicate that the wells are being monitored and adjusted to lower the levels of contaminants.
I went about the the next piece of code the same route as above. See below:
Top15Flourideaboveguide<-Arsenic_Flouride_data%>%arrange(desc(f_percent_wells_above_guideline))%>%filter(cume_dist(f_percent_wells_above_guideline)>=.75)%>%select(location,f_percent_wells_above_guideline)%>%top_n(15)
## Selecting by f_percent_wells_above_guideline
kable(Top15Flourideaboveguide)
| location | f_percent_wells_above_guideline |
|---|---|
| Otis | 30.0 |
| Dedham | 22.5 |
| Denmark | 19.6 |
| Surry | 18.3 |
| Prospect | 17.5 |
| Eastbrook | 16.1 |
| Mercer | 15.6 |
| Fryeburg | 15.4 |
| Brownfield | 15.2 |
| Stockton Springs | 14.3 |
| Clifton | 14.0 |
| Starks | 13.6 |
| Marshfield | 12.9 |
| Kennebunk | 12.7 |
| Charlotte | 12.5 |
Top15Arsenicboveguide<-Arsenic_Flouride_data%>%arrange(desc(a_percent_wells_above_guideline))%>%filter(cume_dist(a_percent_wells_above_guideline)>=.75)%>%select(location,a_percent_wells_above_guideline)%>%top_n(15)
## Selecting by a_percent_wells_above_guideline
kable(Top15Arsenicboveguide)
| location | a_percent_wells_above_guideline |
|---|---|
| Manchester | 58.9 |
| Gorham | 50.1 |
| Columbia | 50.0 |
| Monmouth | 49.5 |
| Eliot | 49.3 |
| Columbia Falls | 48.0 |
| Winthrop | 44.8 |
| Hallowell | 44.6 |
| Buxton | 43.4 |
| Blue Hill | 42.7 |
| Litchfield | 42.0 |
| Hollis | 41.4 |
| Orland | 40.7 |
| Surry | 40.3 |
| Danforth | 40.0 |
| Mariaville | 40.0 |
When I looked at the results, I noticed that Gorham was among the highest number of wells tested with one of the greatest levels of percentages of wells that tested above the maximum exposure guideline for arsenic, which led me to my next question. How often are Gorham’s wells testing at this level? Is Gorham part of the top 15 for the 95th percentile?
I used the same method to view the 95th percentile for arsenic and flouride.
Top15FL95percentile<-Arsenic_Flouride_data%>% arrange(desc(f_percentile_95))%>% select(location,f_percentile_95)%>%top_n(15)
## Selecting by f_percentile_95
Top15FL95percentile
## location f_percentile_95
## 1 Charlotte 4.440
## 2 Mercer 4.180
## 3 Starks 3.800
## 4 Marshfield 3.570
## 5 Surry 3.525
## 6 York 3.419
## 7 Otisfield 3.300
## 8 Dedham 3.270
## 9 Otis 3.200
## 10 Kennebunk 3.200
## 11 Denmark 3.150
## 12 Fryeburg 3.120
## 13 Sumner 3.025
## 14 Biddeford 2.940
## 15 Sedgwick 2.870
Top15AR95percentile<-Arsenic_Flouride_data%>% arrange(desc(a_percentile_95))%>% select(location,a_percentile_95)%>%top_n(15)
## Selecting by a_percentile_95
Top15AR95percentile
## location a_percentile_95
## 1 Danforth 372.5
## 2 Matinicus Isle Plt 256.5
## 3 Sedgwick 236.6
## 4 Blue Hill 229.0
## 5 Standish 154.0
## 6 Surry 145.7
## 7 Gorham 130.0
## 8 Wales 129.0
## 9 Camden 125.0
## 10 Otis 121.4
## 11 Ellsworth 119.6
## 12 Northport 113.7
## 13 Monmouth 110.0
## 14 Buxton 110.0
## 15 Hollis 110.0
To look at all of the data that I had selected in a more interactive way, I decided to use data viewer. First, I merged the data and uploaded the DT package.
percentile<-merge(Top15AR95percentile,Top15FL95percentile, all=TRUE)
guide<-merge(Top15Arsenicboveguide,Top15Flourideaboveguide, all=TRUE)
wells<-merge(Top15ARwellstested,Top15FLwellstested, all=TRUE)
Selected1<-merge(percentile,guide,all=TRUE)
AllselectedData<-merge(Selected1,wells,all=TRUE)
DT::datatable(AllselectedData)
When I reviewed this table, it was a little large for the page. However, it’s very helpful to see the numbers for the locations that were in the top 15 of the specific category. As I noted earlier, Gorham was in the top for number of wells tested and percentage above guideline for arsenic. From the table, you can see that it also has one of the highest levels in the 95th percentile for arsenic. I am curious to understand the process behind well testing and actions once results like this are discovered. Is the town made aware? Is the community alerted?
In addition, it was interesting to find that there were multiple towns with the highest level of wells tested but did not land in the top 15 for percentile or wells above the maximum guidelines. They were: Augusta, Belgrade, Readfield, Harpswell, Sidney, Gardiner, and Brunswick. What do these towns have in place that Gorham seems to be missing?