Assignment 1

For this report, I decided to focus on a couple of questions:

What are the top 15 towns for the number of wells tested?
Which towns carried the greatest % of wells that tested above the maximum exposure guidelines?

Despite the fact that town size and population aren’t included in the data set, this was an interesting analysis to work through on the towns in Maine. When I started working through the data, I couldn’t help but question the various factors that come into play when testing town wells, such as testing personnel, weather conditions, test funding, time of test, decisions made upon conclusion of the tests, etc.

When I started the assignment, I loaded the following packages so that I could clean up and review the data easily:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(knitr)

Next, I read the two csv files that we’re working with: arsenic.csv and flouride.csv. See below:

Arsenic_data<-read.csv("arsenic.csv",header=TRUE,stringsAsFactors=FALSE)
Flouride_data<-read.csv("flouride.csv",header=TRUE,stringsAsFactors=FALSE)

I checked the variable names via the head() function for the two data sets that I just read:

head(Arsenic_data)

##         location n_wells_tested percent_wells_above_guideline median
## 1     Manchester            275                          58.9   14.0
## 2         Gorham            467                          50.1   10.5
## 3       Columbia             42                          50.0    9.8
## 4       Monmouth            277                          49.5   10.0
## 5          Eliot             73                          49.3    9.7
## 6 Columbia Falls             25                          48.0    8.1
##   percentile_95 maximum
## 1         93.00     200
## 2        130.00     460
## 3         65.90     200
## 4        110.00     368
## 5         41.35      45
## 6         53.75      71

head(Flouride_data)

##    location n_wells_tested percent_wells_above_guideline median
## 1      Otis             60                          30.0  1.130
## 2    Dedham            102                          22.5  0.940
## 3   Denmark             46                          19.6  0.450
## 4     Surry            175                          18.3  0.800
## 5  Prospect             57                          17.5  0.785
## 6 Eastbrook             31                          16.1  1.290
##   percentile_95 maximum
## 1         3.200     3.6
## 2         3.270     7.0
## 3         3.150     3.9
## 4         3.525     6.9
## 5         2.500     2.7
## 6         2.445     3.3

When I looked at the variable names, I realized I could join the two data sets by location. However, I would need to rename the columns so that I knew which piece of data was which. I renamed all of the columns starting with ‘f’ for flouride and ‘a’ for arsenic.

colnames(Flouride_data)<-c("location", "f_n_wells_tested", "f_percent_wells_above_guideline", "f_median", "f_percentile_95", "f_maximum")
head(Flouride_data)

##    location f_n_wells_tested f_percent_wells_above_guideline f_median
## 1      Otis               60                            30.0    1.130
## 2    Dedham              102                            22.5    0.940
## 3   Denmark               46                            19.6    0.450
## 4     Surry              175                            18.3    0.800
## 5  Prospect               57                            17.5    0.785
## 6 Eastbrook               31                            16.1    1.290
##   f_percentile_95 f_maximum
## 1           3.200       3.6
## 2           3.270       7.0
## 3           3.150       3.9
## 4           3.525       6.9
## 5           2.500       2.7
## 6           2.445       3.3

colnames(Arsenic_data)<-c("location","a_n_wells_tested", "a_percent_wells_above_guideline", "a_median", "a_percentile_95", "a_maximum")
head(Arsenic_data)

##         location a_n_wells_tested a_percent_wells_above_guideline a_median
## 1     Manchester              275                            58.9     14.0
## 2         Gorham              467                            50.1     10.5
## 3       Columbia               42                            50.0      9.8
## 4       Monmouth              277                            49.5     10.0
## 5          Eliot               73                            49.3      9.7
## 6 Columbia Falls               25                            48.0      8.1
##   a_percentile_95 a_maximum
## 1           93.00       200
## 2          130.00       460
## 3           65.90       200
## 4          110.00       368
## 5           41.35        45
## 6           53.75        71

After the columns were renamed, I needed to join the two data sets by location. Since they had identical town names, I knew I could use the innerjoin command to link the two pieces. I checked the variable names with the head() command to be sure that everything was setup how I wanted.

Arsenic_Flouride_data<-Arsenic_data%>%inner_join(Flouride_data)

## Joining, by = "location"

head(Arsenic_Flouride_data)

##         location a_n_wells_tested a_percent_wells_above_guideline a_median
## 1     Manchester              275                            58.9     14.0
## 2         Gorham              467                            50.1     10.5
## 3       Columbia               42                            50.0      9.8
## 4       Monmouth              277                            49.5     10.0
## 5          Eliot               73                            49.3      9.7
## 6 Columbia Falls               25                            48.0      8.1
##   a_percentile_95 a_maximum f_n_wells_tested
## 1           93.00       200              276
## 2          130.00       460              452
## 3           65.90       200               54
## 4          110.00       368              288
## 5           41.35        45               84
## 6           53.75        71               38
##   f_percent_wells_above_guideline f_median f_percentile_95 f_maximum
## 1                             3.3     0.30           1.700      3.60
## 2                             0.0     0.10           0.682      2.00
## 3                             1.9     0.31           1.329      4.30
## 4                             3.1     0.30           1.676      3.40
## 5                             0.0     0.20           0.658      1.54
## 6                             0.0     0.21           0.641      0.90

At first, I started looking at the data to see which towns were in the top 25% for the largest number of wells tested.

Top25FlourideWellsTested<-Arsenic_Flouride_data%>% arrange(desc(f_n_wells_tested))%>%filter(cume_dist(f_n_wells_tested)>=.75)%>%select(location,f_n_wells_tested)
Top25ArsenicWellsTested<-Arsenic_Flouride_data%>% arrange(desc(a_n_wells_tested))%>%filter(cume_dist(a_n_wells_tested)>=.75)%>%select(location,a_n_wells_tested)

When I looked at the massive amount of data that pulled, I decided to cut it down to the top 15 towns. After I reviewed the results, I found it interesting that between the two contaminants there were thirteen towns who shared the largest number of wells tested.

To see this in a table, I inner joined the two data sets.

Top15FLwellstested<-Top25FlourideWellsTested%>%top_n(15)

## Selecting by f_n_wells_tested

Top15ARwellstested<-Top25ArsenicWellsTested%>%top_n(15)

## Selecting by a_n_wells_tested

Top15FLwellstested

##     location f_n_wells_tested
## 1  Ellsworth              503
## 2    Augusta              479
## 3   Winthrop              453
## 4     Gorham              452
## 5   Belgrade              417
## 6     Buxton              383
## 7  Readfield              351
## 8  Harpswell              318
## 9     Sidney              312
## 10  Gardiner              299
## 11 Brunswick              299
## 12  Standish              290
## 13  Monmouth              288
## 14   Windham              282
## 15  Freeport              278

Top15ARwellstested

##      location a_n_wells_tested
## 1    Standish              632
## 2      Gorham              467
## 3     Augusta              454
## 4   Ellsworth              428
## 5    Winthrop              424
## 6    Belgrade              401
## 7   Readfield              344
## 8      Buxton              334
## 9   Harpswell              300
## 10     Sidney              287
## 11   Gardiner              279
## 12   Monmouth              277
## 13 Manchester              275
## 14 Litchfield              262
## 15  Brunswick              255

WellsTested<-Top15FLwellstested%>%inner_join(Top15ARwellstested)

## Joining, by = "location"

#This table shows 13 towns with the greatest number of wells tested for flouride and arsenic.
kable(WellsTested)

location	f_n_wells_tested	a_n_wells_tested
Ellsworth	503	428
Augusta	479	454
Winthrop	453	424
Gorham	452	467
Belgrade	417	401
Buxton	383	334
Readfield	351	344
Harpswell	318	300
Sidney	312	287
Gardiner	299	279
Brunswick	299	255
Standish	290	632
Monmouth	288	277

After I reviewed this data, I wanted to see which towns carried the greatest percentage of wells that tested above the maximum exposure guidelines to see if there was any correlation between the number of tests and test levels. I would expect to see a larger number of tests for areas that are testing above the maximum levels. In some aspects, it could indicate that the wells are being monitored and adjusted to lower the levels of contaminants.

I went about the the next piece of code the same route as above. See below:

Top15Flourideaboveguide<-Arsenic_Flouride_data%>%arrange(desc(f_percent_wells_above_guideline))%>%filter(cume_dist(f_percent_wells_above_guideline)>=.75)%>%select(location,f_percent_wells_above_guideline)%>%top_n(15)

## Selecting by f_percent_wells_above_guideline

kable(Top15Flourideaboveguide)

location	f_percent_wells_above_guideline
Otis	30.0
Dedham	22.5
Denmark	19.6
Surry	18.3
Prospect	17.5
Eastbrook	16.1
Mercer	15.6
Fryeburg	15.4
Brownfield	15.2
Stockton Springs	14.3
Clifton	14.0
Starks	13.6
Marshfield	12.9
Kennebunk	12.7
Charlotte	12.5

Top15Arsenicboveguide<-Arsenic_Flouride_data%>%arrange(desc(a_percent_wells_above_guideline))%>%filter(cume_dist(a_percent_wells_above_guideline)>=.75)%>%select(location,a_percent_wells_above_guideline)%>%top_n(15)

## Selecting by a_percent_wells_above_guideline

kable(Top15Arsenicboveguide)

location	a_percent_wells_above_guideline
Manchester	58.9
Gorham	50.1
Columbia	50.0
Monmouth	49.5
Eliot	49.3
Columbia Falls	48.0
Winthrop	44.8
Hallowell	44.6
Buxton	43.4
Blue Hill	42.7
Litchfield	42.0
Hollis	41.4
Orland	40.7
Surry	40.3
Danforth	40.0
Mariaville	40.0

When I looked at the results, I noticed that Gorham was among the highest number of wells tested with one of the greatest levels of percentages of wells that tested above the maximum exposure guideline for arsenic, which led me to my next question. How often are Gorham’s wells testing at this level? Is Gorham part of the top 15 for the 95th percentile?

I used the same method to view the 95th percentile for arsenic and flouride.

Top15FL95percentile<-Arsenic_Flouride_data%>% arrange(desc(f_percentile_95))%>% select(location,f_percentile_95)%>%top_n(15)

## Selecting by f_percentile_95

Top15FL95percentile

##      location f_percentile_95
## 1   Charlotte           4.440
## 2      Mercer           4.180
## 3      Starks           3.800
## 4  Marshfield           3.570
## 5       Surry           3.525
## 6        York           3.419
## 7   Otisfield           3.300
## 8      Dedham           3.270
## 9        Otis           3.200
## 10  Kennebunk           3.200
## 11    Denmark           3.150
## 12   Fryeburg           3.120
## 13     Sumner           3.025
## 14  Biddeford           2.940
## 15   Sedgwick           2.870

Top15AR95percentile<-Arsenic_Flouride_data%>% arrange(desc(a_percentile_95))%>% select(location,a_percentile_95)%>%top_n(15)

## Selecting by a_percentile_95

Top15AR95percentile

##              location a_percentile_95
## 1            Danforth           372.5
## 2  Matinicus Isle Plt           256.5
## 3            Sedgwick           236.6
## 4           Blue Hill           229.0
## 5            Standish           154.0
## 6               Surry           145.7
## 7              Gorham           130.0
## 8               Wales           129.0
## 9              Camden           125.0
## 10               Otis           121.4
## 11          Ellsworth           119.6
## 12          Northport           113.7
## 13           Monmouth           110.0
## 14             Buxton           110.0
## 15             Hollis           110.0

To look at all of the data that I had selected in a more interactive way, I decided to use data viewer. First, I merged the data and uploaded the DT package.

percentile<-merge(Top15AR95percentile,Top15FL95percentile, all=TRUE)
guide<-merge(Top15Arsenicboveguide,Top15Flourideaboveguide, all=TRUE)
wells<-merge(Top15ARwellstested,Top15FLwellstested, all=TRUE)
Selected1<-merge(percentile,guide,all=TRUE)
AllselectedData<-merge(Selected1,wells,all=TRUE)

DT::datatable(AllselectedData)

When I reviewed this table, it was a little large for the page. However, it’s very helpful to see the numbers for the locations that were in the top 15 of the specific category. As I noted earlier, Gorham was in the top for number of wells tested and percentage above guideline for arsenic. From the table, you can see that it also has one of the highest levels in the 95th percentile for arsenic. I am curious to understand the process behind well testing and actions once results like this are discovered. Is the town made aware? Is the community alerted?

In addition, it was interesting to find that there were multiple towns with the highest level of wells tested but did not land in the top 15 for percentile or wells above the maximum guidelines. They were: Augusta, Belgrade, Readfield, Harpswell, Sidney, Gardiner, and Brunswick. What do these towns have in place that Gorham seems to be missing?

Assignment 1

Lexi Armstrong

September 24, 2016