In this report I will be analyzing a subset of two data sets that contain flouride and arsenic levels, by town, for private well water samples tested by the State of Maine Health and Environmental Testing Laboratory (HETL) between the years 1999 and 2013. The files flouride.csv and arsenic.csv were downloaded from the Maine Tracking Network.
To begin, I read in the two csv files, arsenic.csv and flouride.csv, into the respective data frames, arsenic and flouride.
arsenic <- read.csv("arsenic.csv", header = TRUE, stringsAsFactors = FALSE)
flouride <- read.csv("flouride.csv", header = TRUE, stringsAsFactors = FALSE)
Since I needed to tidy and sort the data before it could be analyzed, I used two additional packages tidyr and dplyr.
library(tidyr)
library(dplyr)
Next, I used the head() command to look at the variable names of the two dataframes I just created.
head(arsenic)
## location n_wells_tested percent_wells_above_guideline median
## 1 Manchester 275 58.9 14.0
## 2 Gorham 467 50.1 10.5
## 3 Columbia 42 50.0 9.8
## 4 Monmouth 277 49.5 10.0
## 5 Eliot 73 49.3 9.7
## 6 Columbia Falls 25 48.0 8.1
## percentile_95 maximum
## 1 93.00 200
## 2 130.00 460
## 3 65.90 200
## 4 110.00 368
## 5 41.35 45
## 6 53.75 71
head(flouride)
## location n_wells_tested percent_wells_above_guideline median
## 1 Otis 60 30.0 1.130
## 2 Dedham 102 22.5 0.940
## 3 Denmark 46 19.6 0.450
## 4 Surry 175 18.3 0.800
## 5 Prospect 57 17.5 0.785
## 6 Eastbrook 31 16.1 1.290
## percentile_95 maximum
## 1 3.200 3.6
## 2 3.270 7.0
## 3 3.150 3.9
## 4 3.525 6.9
## 5 2.500 2.7
## 6 2.445 3.3
For this analysis, I have decided to narrow my focus to the towns and cities with the 50 highest percentages of wells above Maine’s Maximum Exposure Guideline for arsenic (10 micrograms per liter) and flouride (2 milligrams per liter), respectively. Within that subset, I would like to know which towns and cities place in both categories. These towns and cities could then be recommended as the first target areas for arsenic and flouride rememediation.
For this reason, I renamed some of the variables in each dataset to make for easier analysis. The variables I am looking at are “location” and “percent_wells_above_quideline” but I need to be able to distinguish “percent_wells_above_guideline” for arsenic and flouride. I also wanted to rename “location” “city_town” to make the nature of the variable more specific. I do this using the names() command.
names(arsenic) <- c("town_city", "n_wells_tested", "arsenic_wells_above", "median", "percentile_95", "maximum")
names(arsenic)
## [1] "town_city" "n_wells_tested" "arsenic_wells_above"
## [4] "median" "percentile_95" "maximum"
names(flouride) <- c("town_city", "n_wells_tested", "flouride_wells_above", "median", "percentile_95", "maximum")
names(flouride)
## [1] "town_city" "n_wells_tested" "flouride_wells_above"
## [4] "median" "percentile_95" "maximum"
Next, I created two new dataframes that just contain the data I want to analyze and excluding “n_wells_test”, “median”, “percentile_95”, and “maximum”.
arsenic_wells <- arsenic %>% select(town_city, arsenic_wells_above) %>% top_n(50)
## Selecting by arsenic_wells_above
arsenic_wells
## town_city arsenic_wells_above
## 1 Manchester 58.9
## 2 Gorham 50.1
## 3 Columbia 50.0
## 4 Monmouth 49.5
## 5 Eliot 49.3
## 6 Columbia Falls 48.0
## 7 Winthrop 44.8
## 8 Hallowell 44.6
## 9 Buxton 43.4
## 10 Blue Hill 42.7
## 11 Litchfield 42.0
## 12 Hollis 41.4
## 13 Orland 40.7
## 14 Surry 40.3
## 15 Danforth 40.0
## 16 Mariaville 40.0
## 17 Readfield 39.8
## 18 Otis 39.6
## 19 Dayton 37.7
## 20 Sedgwick 37.3
## 21 Mercer 36.4
## 22 Scarborough 35.2
## 23 Saco 34.4
## 24 Camden 34.0
## 25 Trenton 33.7
## 26 Anson 33.3
## 27 Wales 33.3
## 28 Rangeley 33.1
## 29 Oakland 33.0
## 30 Carrabassett Valley 32.5
## 31 Minot 32.5
## 32 Kingfield 32.1
## 33 Belgrade 31.2
## 34 Millinocket 31.0
## 35 Rockport 31.0
## 36 Greene 30.8
## 37 Ellsworth 29.7
## 38 Fairfield 29.7
## 39 Owls Head 29.7
## 40 Mount Vernon 29.5
## 41 Starks 28.6
## 42 Biddeford 28.3
## 43 Island Falls 27.6
## 44 Auburn 27.2
## 45 Arundel 27.1
## 46 Penobscot 27.1
## 47 Standish 26.9
## 48 Sidney 26.8
## 49 Rome 26.6
## 50 Augusta 26.4
flouride_wells <- flouride %>% select(town_city, flouride_wells_above) %>% top_n(50)
## Selecting by flouride_wells_above
flouride_wells
## town_city flouride_wells_above
## 1 Otis 30.0
## 2 Dedham 22.5
## 3 Denmark 19.6
## 4 Surry 18.3
## 5 Prospect 17.5
## 6 Eastbrook 16.1
## 7 Mercer 15.6
## 8 Fryeburg 15.4
## 9 Brownfield 15.2
## 10 Stockton Springs 14.3
## 11 Clifton 14.0
## 12 Starks 13.6
## 13 Marshfield 12.9
## 14 Kennebunk 12.7
## 15 Charlotte 12.5
## 16 York 12.4
## 17 Chesterville 12.3
## 18 Stoneham 12.0
## 19 Sedgwick 11.2
## 20 Mechanic Falls 11.1
## 21 Swans Island 10.5
## 22 Franklin 10.3
## 23 Smithfield 10.1
## 24 Biddeford 9.7
## 25 Otisfield 9.7
## 26 Blue Hill 9.6
## 27 Arundel 9.5
## 28 Ellsworth 9.3
## 29 Hiram 8.9
## 30 Norridgewock 8.9
## 31 Orland 8.6
## 32 Jackman 8.3
## 33 Southwest Harbor 8.1
## 34 Greenwood 8.0
## 35 Hartland 8.0
## 36 New Sweden 8.0
## 37 Farmington 7.9
## 38 Casco 7.6
## 39 Mariaville 7.5
## 40 Gouldsboro 7.3
## 41 Lovell 7.0
## 42 Cornish 6.9
## 43 Millinocket 6.9
## 44 Bridgton 6.6
## 45 Penobscot 5.9
## 46 Sumner 5.9
## 47 Machias 5.8
## 48 Belgrade 5.5
## 49 Mount Desert 5.5
## 50 Jefferson 5.4
Finally, I joined the two subset dataframes to generate the list of towns and cities that have the top 50 highest percentage of wells with levels above guideline for arsenic AND flouride. I use the inner_join() command to do this.
arsenic_flouride_wells <- arsenic_wells %>% inner_join(flouride_wells)
## Joining, by = "town_city"
arsenic_flouride_wells
## town_city arsenic_wells_above flouride_wells_above
## 1 Blue Hill 42.7 9.6
## 2 Orland 40.7 8.6
## 3 Surry 40.3 18.3
## 4 Mariaville 40.0 7.5
## 5 Otis 39.6 30.0
## 6 Sedgwick 37.3 11.2
## 7 Mercer 36.4 15.6
## 8 Belgrade 31.2 5.5
## 9 Millinocket 31.0 6.9
## 10 Ellsworth 29.7 9.3
## 11 Starks 28.6 13.6
## 12 Biddeford 28.3 9.7
## 13 Arundel 27.1 9.5
## 14 Penobscot 27.1 5.9