First I imported the two datasets, arsenic.csv and flouride.csv, and saved them as the data frames arsenic and flouride, respectively. The file and object names keep the spelling "flouride" used in the source data.
arsenic <- read.csv(file = "arsenic.csv", header = TRUE, stringsAsFactors = FALSE)
flouride <- read.csv(file = "flouride.csv", header = TRUE, stringsAsFactors = FALSE)
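Before going further, it can help to confirm that an imported file has the columns the later steps rely on. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary file (the rows are hypothetical, not taken from the real data) and checks for the four column names used in this analysis.

```r
# Hypothetical stand-in for arsenic.csv; these values are invented.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "location,n_wells_tested,percent_wells_above_guideline,maximum",
  "Town A,25,10.5,120",
  "Town B,8,0,2.1"
), tmp)

example <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE)

# Columns that the later steps of this analysis refer to
needed <- c("location", "n_wells_tested",
            "percent_wells_above_guideline", "maximum")
missing_cols <- setdiff(needed, names(example))

# Stop early if any expected column is absent
stopifnot(length(missing_cols) == 0)

# Show the column types that read.csv() inferred
str(example)
```

The same check could be pointed at the real data frames by replacing `example` with `arsenic` or `flouride`.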
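I decided to look at three different things with these data: the locations with the highest and lowest maximum levels, the locations with the highest percent of wells above the guideline, and the locations that rank in the top 30 for both arsenic and fluoride.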
In this section I wanted to see which 5 locations had the highest maximum levels of arsenic and of fluoride, as well as which 5 locations had the lowest maximum levels. I included only locations where at least 20 wells had been tested.
These are the 5 locations with the highest maximum arsenic levels in wells.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
arsenic_max1 <- arsenic %>% arrange(n_wells_tested) %>% filter(n_wells_tested >=20) %>% select(location, maximum) %>% arrange(desc(maximum)) %>% top_n(5)
## Selecting by maximum
kable(arsenic_max1)
| location | maximum |
|---|---|
| Danforth | 3100 |
| Northport | 1700 |
| Blue Hill | 930 |
| Sedgwick | 840 |
| Buxton | 670 |
These are the 5 locations with the lowest maximum arsenic levels in wells.
arsenic_min1 <- arsenic %>% arrange(n_wells_tested) %>% filter(n_wells_tested >=20) %>% select(location, maximum) %>% arrange(maximum) %>% top_n(-5)
## Selecting by maximum
kable(arsenic_min1)
| location | maximum |
|---|---|
| Waterford | 1.0 |
| Mexico | 1.0 |
| Presque Isle | 1.0 |
| Andover | 1.1 |
| Carthage | 1.3 |
These are the 5 locations with the highest maximum fluoride levels in wells.
flouride_max1 <- flouride %>% arrange(n_wells_tested) %>% filter(n_wells_tested >=20) %>% select(location, maximum) %>% arrange(desc(maximum)) %>% top_n(5)
## Selecting by maximum
kable(flouride_max1)
| location | maximum |
|---|---|
| Anson | 14.0 |
| Ashland | 10.0 |
| Peru | 9.9 |
| Kennebunk | 9.6 |
| Raymond | 9.1 |
These are the locations with the lowest maximum fluoride levels in wells. Because top_n() keeps ties, the table lists all 8 locations tied at 0.1 rather than exactly 5.
flouride_min1 <- flouride %>% arrange(n_wells_tested) %>% filter(n_wells_tested >=20) %>% select(location, maximum) %>% arrange(maximum) %>% top_n(-5)
## Selecting by maximum
kable(flouride_min1)
| location | maximum |
|---|---|
| Hodgdon | 0.1 |
| Boothbay Harbor | 0.1 |
| Wallagrass | 0.1 |
| Sangerville | 0.1 |
| Garland | 0.1 |
| Sherman | 0.1 |
| Etna | 0.1 |
| Newburgh | 0.1 |
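The fluoride table above has more than 5 rows because `top_n()` keeps every row tied with the cutoff value. As a rough sketch of the difference, assuming dplyr 1.0.0 or later (where `slice_min()` is available), the toy example below compares the two behaviours. The data frame `toy` and its values are invented for illustration.

```r
library(dplyr)

# Toy data (hypothetical values) with a three-way tie at the lowest maximum
toy <- data.frame(
  location = c("A", "B", "C", "D"),
  maximum  = c(0.1, 0.1, 0.1, 0.5),
  stringsAsFactors = FALSE
)

# top_n() keeps every row tied with the cutoff, so asking for the
# bottom 2 returns all three rows that share the value 0.1
with_ties <- toy %>% top_n(-2, maximum)

# slice_min() with with_ties = FALSE returns exactly n rows;
# tied rows are kept in their existing row order
exactly_two <- toy %>% slice_min(maximum, n = 2, with_ties = FALSE)

nrow(with_ties)    # 3
nrow(exactly_two)  # 2
```

Which behaviour is preferable depends on the question: keeping ties avoids an arbitrary cut among equal values, while dropping them gives a table of fixed length.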
In this section I looked at only two columns: the location and the percent of wells above the guideline. I created one table with the top 30 locations for arsenic and a separate table with the top 30 for fluoride. Again, I included only locations where at least 20 wells had been tested.
These are the top 30 locations for arsenic. The table has 31 rows because top_n() keeps ties and the last two locations are tied.
arsenic2 <- arsenic %>% arrange(n_wells_tested) %>% filter(n_wells_tested >=20) %>% select(location, ar_percent_above = percent_wells_above_guideline) %>% arrange(desc(ar_percent_above)) %>% top_n(30)
## Selecting by ar_percent_above
kable(arsenic2, digits = 1)
| location | ar_percent_above |
|---|---|
| Manchester | 58.9 |
| Gorham | 50.1 |
| Columbia | 50.0 |
| Monmouth | 49.5 |
| Eliot | 49.3 |
| Columbia Falls | 48.0 |
| Winthrop | 44.8 |
| Hallowell | 44.6 |
| Buxton | 43.4 |
| Blue Hill | 42.7 |
| Litchfield | 42.0 |
| Hollis | 41.4 |
| Orland | 40.7 |
| Surry | 40.3 |
| Mariaville | 40.0 |
| Danforth | 40.0 |
| Readfield | 39.8 |
| Otis | 39.6 |
| Dayton | 37.7 |
| Sedgwick | 37.3 |
| Mercer | 36.4 |
| Scarborough | 35.2 |
| Saco | 34.4 |
| Camden | 34.0 |
| Trenton | 33.7 |
| Anson | 33.3 |
| Wales | 33.3 |
| Rangeley | 33.1 |
| Oakland | 33.0 |
| Carrabassett Valley | 32.5 |
| Minot | 32.5 |
These are the top 30 locations for fluoride.
flouride2 <- flouride %>% arrange(n_wells_tested) %>% filter(n_wells_tested >=20) %>% select(location, fl_percent_above = percent_wells_above_guideline) %>% arrange(desc(fl_percent_above)) %>% top_n(30)
## Selecting by fl_percent_above
kable(flouride2, digits = 1)
| location | fl_percent_above |
|---|---|
| Otis | 30.0 |
| Dedham | 22.5 |
| Denmark | 19.6 |
| Surry | 18.3 |
| Prospect | 17.5 |
| Eastbrook | 16.1 |
| Mercer | 15.6 |
| Fryeburg | 15.4 |
| Brownfield | 15.2 |
| Stockton Springs | 14.3 |
| Clifton | 14.0 |
| Starks | 13.6 |
| Marshfield | 12.9 |
| Kennebunk | 12.7 |
| Charlotte | 12.5 |
| York | 12.4 |
| Chesterville | 12.3 |
| Stoneham | 12.0 |
| Sedgwick | 11.2 |
| Mechanic Falls | 11.1 |
| Swans Island | 10.5 |
| Franklin | 10.3 |
| Smithfield | 10.1 |
| Otisfield | 9.7 |
| Biddeford | 9.7 |
| Blue Hill | 9.6 |
| Arundel | 9.5 |
| Ellsworth | 9.3 |
| Hiram | 8.9 |
| Norridgewock | 8.9 |
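The tables so far all follow the same pattern: keep locations with at least 20 wells tested, select one column, sort it, and take the top rows. One way to avoid repeating that pipeline is a small helper function. The sketch below is my own illustration, not part of the original analysis: `top_locations()` is a hypothetical name, and the toy data frame contains invented values.

```r
library(dplyr)

# Hypothetical helper: keep locations with at least `min_wells` wells tested,
# then return the rows with the `n` highest values of `column`.
# Ties at the cutoff are kept, as with top_n() elsewhere in this report.
top_locations <- function(data, column, n = 30, min_wells = 20) {
  data %>%
    filter(n_wells_tested >= min_wells) %>%
    select(location, {{ column }}) %>%
    arrange(desc({{ column }})) %>%
    top_n(n, {{ column }})
}

# Toy data with invented values; "B" has too few wells and is dropped
toy <- data.frame(
  location = c("A", "B", "C", "D"),
  n_wells_tested = c(25, 10, 40, 30),
  percent_wells_above_guideline = c(12.0, 90.0, 30.5, 5.0),
  stringsAsFactors = FALSE
)

result <- top_locations(toy, percent_wells_above_guideline, n = 2)
result
```

With the real data, a call such as `top_locations(arsenic, percent_wells_above_guideline)` would be expected to reproduce the arsenic table, apart from the renamed column.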
In this section I joined the data frames from the previous section to see which locations appear in the top 30 for both arsenic and fluoride by percent of wells above the guideline. I then plotted the results. Only 5 locations appear in both top 30 lists, so the scatterplot is not very informative.
library(ggvis)
arsenic_and_flouride <- arsenic2 %>% inner_join(flouride2)
## Joining, by = "location"
kable(arsenic_and_flouride, digits=1)
| location | ar_percent_above | fl_percent_above |
|---|---|---|
| Blue Hill | 42.7 | 9.6 |
| Surry | 40.3 | 18.3 |
| Otis | 39.6 | 30.0 |
| Sedgwick | 37.3 | 11.2 |
| Mercer | 36.4 | 15.6 |
arsenic_and_flouride %>% ggvis(~ar_percent_above, ~fl_percent_above) %>% layer_points()
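The join in this section keeps only the locations that appear in both tables, which is why so few rows remain. As a minimal sketch of how that works, the toy example below uses two invented tables (`toy_ar` and `toy_fl` are hypothetical) and names the join key explicitly. It also uses `anti_join()` to list the rows that the inner join drops.

```r
library(dplyr)

# Toy tables with invented values; only "B" and "C" appear in both
toy_ar <- data.frame(
  location = c("A", "B", "C"),
  ar_percent_above = c(50, 40, 30),
  stringsAsFactors = FALSE
)
toy_fl <- data.frame(
  location = c("B", "C", "D"),
  fl_percent_above = c(20, 10, 5),
  stringsAsFactors = FALSE
)

# Naming the key explicitly avoids the "Joining, by = ..." message
both <- toy_ar %>% inner_join(toy_fl, by = "location")

# Rows of the first table that have no match in the second
only_ar <- toy_ar %>% anti_join(toy_fl, by = "location")

both
only_ar
```

Checking the unmatched rows this way makes it easier to see how much of each list is lost when only the overlap is plotted.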