For this report, I decided to focus on a couple of questions:

Despite the fact that town size and population aren’t included in the data set, this was an interesting analysis to work through on the towns in Maine. When I started working through the data, I couldn’t help but question the various factors that come into play when testing town wells, such as testing personnel, weather conditions, test funding, time of test, decisions made upon conclusion of the tests, etc.

When I started the assignment, I loaded the following packages so that I could clean up and review the data easily:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(knitr)

Next, I read the two csv files that we’re working with: arsenic.csv and flouride.csv. See below:

Arsenic_data<-read.csv("arsenic.csv",header=TRUE,stringsAsFactors=FALSE)
Flouride_data<-read.csv("flouride.csv",header=TRUE,stringsAsFactors=FALSE)

I checked the variable names via the head() function for the two data sets that I just read:

head(Arsenic_data)
##         location n_wells_tested percent_wells_above_guideline median
## 1     Manchester            275                          58.9   14.0
## 2         Gorham            467                          50.1   10.5
## 3       Columbia             42                          50.0    9.8
## 4       Monmouth            277                          49.5   10.0
## 5          Eliot             73                          49.3    9.7
## 6 Columbia Falls             25                          48.0    8.1
##   percentile_95 maximum
## 1         93.00     200
## 2        130.00     460
## 3         65.90     200
## 4        110.00     368
## 5         41.35      45
## 6         53.75      71
head(Flouride_data)
##    location n_wells_tested percent_wells_above_guideline median
## 1      Otis             60                          30.0  1.130
## 2    Dedham            102                          22.5  0.940
## 3   Denmark             46                          19.6  0.450
## 4     Surry            175                          18.3  0.800
## 5  Prospect             57                          17.5  0.785
## 6 Eastbrook             31                          16.1  1.290
##   percentile_95 maximum
## 1         3.200     3.6
## 2         3.270     7.0
## 3         3.150     3.9
## 4         3.525     6.9
## 5         2.500     2.7
## 6         2.445     3.3

When I looked at the variable names, I realized I could join the two data sets by location. However, I would need to rename the columns so that I knew which piece of data was which. I renamed all of the columns starting with ‘f’ for flouride and ‘a’ for arsenic.

colnames(Flouride_data)<-c("location", "f_n_wells_tested", "f_percent_wells_above_guideline", "f_median", "f_percentile_95", "f_maximum")
head(Flouride_data)
##    location f_n_wells_tested f_percent_wells_above_guideline f_median
## 1      Otis               60                            30.0    1.130
## 2    Dedham              102                            22.5    0.940
## 3   Denmark               46                            19.6    0.450
## 4     Surry              175                            18.3    0.800
## 5  Prospect               57                            17.5    0.785
## 6 Eastbrook               31                            16.1    1.290
##   f_percentile_95 f_maximum
## 1           3.200       3.6
## 2           3.270       7.0
## 3           3.150       3.9
## 4           3.525       6.9
## 5           2.500       2.7
## 6           2.445       3.3
colnames(Arsenic_data)<-c("location","a_n_wells_tested", "a_percent_wells_above_guideline", "a_median", "a_percentile_95", "a_maximum")
head(Arsenic_data)
##         location a_n_wells_tested a_percent_wells_above_guideline a_median
## 1     Manchester              275                            58.9     14.0
## 2         Gorham              467                            50.1     10.5
## 3       Columbia               42                            50.0      9.8
## 4       Monmouth              277                            49.5     10.0
## 5          Eliot               73                            49.3      9.7
## 6 Columbia Falls               25                            48.0      8.1
##   a_percentile_95 a_maximum
## 1           93.00       200
## 2          130.00       460
## 3           65.90       200
## 4          110.00       368
## 5           41.35        45
## 6           53.75        71

After the columns were renamed, I needed to join the two data sets by location. Since they had identical town names, I knew I could use the innerjoin command to link the two pieces. I checked the variable names with the head() command to be sure that everything was setup how I wanted.

Arsenic_Flouride_data<-Arsenic_data%>%inner_join(Flouride_data)
## Joining, by = "location"
head(Arsenic_Flouride_data)
##         location a_n_wells_tested a_percent_wells_above_guideline a_median
## 1     Manchester              275                            58.9     14.0
## 2         Gorham              467                            50.1     10.5
## 3       Columbia               42                            50.0      9.8
## 4       Monmouth              277                            49.5     10.0
## 5          Eliot               73                            49.3      9.7
## 6 Columbia Falls               25                            48.0      8.1
##   a_percentile_95 a_maximum f_n_wells_tested
## 1           93.00       200              276
## 2          130.00       460              452
## 3           65.90       200               54
## 4          110.00       368              288
## 5           41.35        45               84
## 6           53.75        71               38
##   f_percent_wells_above_guideline f_median f_percentile_95 f_maximum
## 1                             3.3     0.30           1.700      3.60
## 2                             0.0     0.10           0.682      2.00
## 3                             1.9     0.31           1.329      4.30
## 4                             3.1     0.30           1.676      3.40
## 5                             0.0     0.20           0.658      1.54
## 6                             0.0     0.21           0.641      0.90

At first, I started looking at the data to see which towns were in the top 25% for the largest number of wells tested.

Top25FlourideWellsTested<-Arsenic_Flouride_data%>% arrange(desc(f_n_wells_tested))%>%filter(cume_dist(f_n_wells_tested)>=.75)%>%select(location,f_n_wells_tested)
Top25ArsenicWellsTested<-Arsenic_Flouride_data%>% arrange(desc(a_n_wells_tested))%>%filter(cume_dist(a_n_wells_tested)>=.75)%>%select(location,a_n_wells_tested)

When I looked at the massive amount of data that pulled, I decided to cut it down to the top 15 towns. After I reviewed the results, I found it interesting that between the two contaminants there were thirteen towns who shared the largest number of wells tested.

To see this in a table, I inner joined the two data sets.

Top15FLwellstested<-Top25FlourideWellsTested%>%top_n(15)
## Selecting by f_n_wells_tested
Top15ARwellstested<-Top25ArsenicWellsTested%>%top_n(15)
## Selecting by a_n_wells_tested
Top15FLwellstested
##     location f_n_wells_tested
## 1  Ellsworth              503
## 2    Augusta              479
## 3   Winthrop              453
## 4     Gorham              452
## 5   Belgrade              417
## 6     Buxton              383
## 7  Readfield              351
## 8  Harpswell              318
## 9     Sidney              312
## 10  Gardiner              299
## 11 Brunswick              299
## 12  Standish              290
## 13  Monmouth              288
## 14   Windham              282
## 15  Freeport              278
Top15ARwellstested
##      location a_n_wells_tested
## 1    Standish              632
## 2      Gorham              467
## 3     Augusta              454
## 4   Ellsworth              428
## 5    Winthrop              424
## 6    Belgrade              401
## 7   Readfield              344
## 8      Buxton              334
## 9   Harpswell              300
## 10     Sidney              287
## 11   Gardiner              279
## 12   Monmouth              277
## 13 Manchester              275
## 14 Litchfield              262
## 15  Brunswick              255
WellsTested<-Top15FLwellstested%>%inner_join(Top15ARwellstested)
## Joining, by = "location"
#This table shows 13 towns with the greatest number of wells tested for flouride and arsenic.
kable(WellsTested)
location f_n_wells_tested a_n_wells_tested
Ellsworth 503 428
Augusta 479 454
Winthrop 453 424
Gorham 452 467
Belgrade 417 401
Buxton 383 334
Readfield 351 344
Harpswell 318 300
Sidney 312 287
Gardiner 299 279
Brunswick 299 255
Standish 290 632
Monmouth 288 277

After I reviewed this data, I wanted to see which towns carried the greatest percentage of wells that tested above the maximum exposure guidelines to see if there was any correlation between the number of tests and test levels. I would expect to see a larger number of tests for areas that are testing above the maximum levels. In some aspects, it could indicate that the wells are being monitored and adjusted to lower the levels of contaminants.

I went about the the next piece of code the same route as above. See below:

Top15Flourideaboveguide<-Arsenic_Flouride_data%>%arrange(desc(f_percent_wells_above_guideline))%>%filter(cume_dist(f_percent_wells_above_guideline)>=.75)%>%select(location,f_percent_wells_above_guideline)%>%top_n(15)
## Selecting by f_percent_wells_above_guideline
kable(Top15Flourideaboveguide)
location f_percent_wells_above_guideline
Otis 30.0
Dedham 22.5
Denmark 19.6
Surry 18.3
Prospect 17.5
Eastbrook 16.1
Mercer 15.6
Fryeburg 15.4
Brownfield 15.2
Stockton Springs 14.3
Clifton 14.0
Starks 13.6
Marshfield 12.9
Kennebunk 12.7
Charlotte 12.5
Top15Arsenicboveguide<-Arsenic_Flouride_data%>%arrange(desc(a_percent_wells_above_guideline))%>%filter(cume_dist(a_percent_wells_above_guideline)>=.75)%>%select(location,a_percent_wells_above_guideline)%>%top_n(15)
## Selecting by a_percent_wells_above_guideline
kable(Top15Arsenicboveguide)
location a_percent_wells_above_guideline
Manchester 58.9
Gorham 50.1
Columbia 50.0
Monmouth 49.5
Eliot 49.3
Columbia Falls 48.0
Winthrop 44.8
Hallowell 44.6
Buxton 43.4
Blue Hill 42.7
Litchfield 42.0
Hollis 41.4
Orland 40.7
Surry 40.3
Danforth 40.0
Mariaville 40.0

When I looked at the results, I noticed that Gorham was among the highest number of wells tested with one of the greatest levels of percentages of wells that tested above the maximum exposure guideline for arsenic, which led me to my next question. How often are Gorham’s wells testing at this level? Is Gorham part of the top 15 for the 95th percentile?

I used the same method to view the 95th percentile for arsenic and flouride.

Top15FL95percentile<-Arsenic_Flouride_data%>% arrange(desc(f_percentile_95))%>% select(location,f_percentile_95)%>%top_n(15)
## Selecting by f_percentile_95
Top15FL95percentile
##      location f_percentile_95
## 1   Charlotte           4.440
## 2      Mercer           4.180
## 3      Starks           3.800
## 4  Marshfield           3.570
## 5       Surry           3.525
## 6        York           3.419
## 7   Otisfield           3.300
## 8      Dedham           3.270
## 9        Otis           3.200
## 10  Kennebunk           3.200
## 11    Denmark           3.150
## 12   Fryeburg           3.120
## 13     Sumner           3.025
## 14  Biddeford           2.940
## 15   Sedgwick           2.870
Top15AR95percentile<-Arsenic_Flouride_data%>% arrange(desc(a_percentile_95))%>% select(location,a_percentile_95)%>%top_n(15)
## Selecting by a_percentile_95
Top15AR95percentile
##              location a_percentile_95
## 1            Danforth           372.5
## 2  Matinicus Isle Plt           256.5
## 3            Sedgwick           236.6
## 4           Blue Hill           229.0
## 5            Standish           154.0
## 6               Surry           145.7
## 7              Gorham           130.0
## 8               Wales           129.0
## 9              Camden           125.0
## 10               Otis           121.4
## 11          Ellsworth           119.6
## 12          Northport           113.7
## 13           Monmouth           110.0
## 14             Buxton           110.0
## 15             Hollis           110.0

To look at all of the data that I had selected in a more interactive way, I decided to use data viewer. First, I merged the data and uploaded the DT package.

percentile<-merge(Top15AR95percentile,Top15FL95percentile, all=TRUE)
guide<-merge(Top15Arsenicboveguide,Top15Flourideaboveguide, all=TRUE)
wells<-merge(Top15ARwellstested,Top15FLwellstested, all=TRUE)
Selected1<-merge(percentile,guide,all=TRUE)
AllselectedData<-merge(Selected1,wells,all=TRUE)

DT::datatable(AllselectedData)

When I reviewed this table, it was a little large for the page. However, it’s very helpful to see the numbers for the locations that were in the top 15 of the specific category. As I noted earlier, Gorham was in the top for number of wells tested and percentage above guideline for arsenic. From the table, you can see that it also has one of the highest levels in the 95th percentile for arsenic. I am curious to understand the process behind well testing and actions once results like this are discovered. Is the town made aware? Is the community alerted?

In addition, it was interesting to find that there were multiple towns with the highest level of wells tested but did not land in the top 15 for percentile or wells above the maximum guidelines. They were: Augusta, Belgrade, Readfield, Harpswell, Sidney, Gardiner, and Brunswick. What do these towns have in place that Gorham seems to be missing?