MBA 676- Assignment 1

Introduction

The State of Maine Health and Environmental Testing Laboratory (HETL) conducted private well water sample tests between 1999 and 2013 and saved them within two data sets titled arsenic and flouride. After looking through the available data between the two sets, the information regarding the percents of wells which were above the Maine Maximum Exposure Guidelines, one for flouride and another for arsenic, seemed most interesting. My goal is to see which locations rank poorly (bottom ten percent) in percent of wells above the guideline for both arsenic and flouride levels.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)
flouride <- read.csv("flouride.csv", header = TRUE, 
                stringsAsFactors = FALSE) 

arsenic <- read.csv("arsenic.csv", header = TRUE, 
                stringsAsFactors = FALSE)

Anaylsis

The first step in the analysis is understanding the data sets themselves. Both flouride and arsenic share the same columns, and for both any locations with fewer than 20 wells tested only the n_wells_tested and maximum columns are populated. Since I want to compare the percent of wells above the State’s guidelines, I need to filter out locations where that data is missing. Due to the size of the data sets I display the top 5 rows of data to give a truncated view of the outputs until the final result.

flouride <- flouride %>% filter(n_wells_tested > 20)
arsenic <- arsenic %>% filter(n_wells_tested > 20)
kable(flouride %>% top_n(5))

## Selecting by maximum

location	n_wells_tested	percent_wells_above_guideline	median	percentile_95	maximum
Kennebunk	110	12.7	0.1	3.200	9.6
Anson	40	5.0	0.1	1.400	14.0
Raymond	181	5.0	0.3	1.900	9.1
Ashland	27	3.7	0.1	0.551	10.0
Peru	58	1.7	0.1	1.255	9.9

kable(arsenic %>% top_n(5))

## Selecting by maximum

location	n_wells_tested	percent_wells_above_guideline	median	percentile_95	maximum
Buxton	334	43.4	6.0	110.0	670
Blue Hill	241	42.7	7.0	229.0	930
Danforth	35	40.0	5.0	372.5	3100
Sedgwick	142	37.3	4.2	236.6	840
Northport	157	26.1	2.1	113.7	1700

Now focusing on the locations where we know the percent of wells above the State guidelines, I rearrange both data sets by the Percent_wells_above_guideline column from highest value to lowest. Since we will be focusing on location and percent_wells_above_guideline, I select only those columns to use going forward.

flouride <- flouride %>%  arrange(desc(percent_wells_above_guideline)) %>% select(location, percent_wells_above_guideline)
arsenic <- arsenic %>%  arrange(desc(percent_wells_above_guideline)) %>% select(location, percent_wells_above_guideline)
kable(flouride %>% top_n(5))

## Selecting by percent_wells_above_guideline

location	percent_wells_above_guideline
Otis	30.0
Dedham	22.5
Denmark	19.6
Surry	18.3
Prospect	17.5

kable(arsenic %>% top_n(5))

## Selecting by percent_wells_above_guideline

location	percent_wells_above_guideline
Manchester	58.9
Gorham	50.1
Columbia	50.0
Monmouth	49.5
Eliot	49.3

Next I want to understand where each location ranks based on the percent of wells above guideline, so I add a percent rank column to both data sets based on percent of wells above guideline. Because having a high percentage of wells above the guideline is negative, I have the percent rank inverted to give locations with a high percentage of wells above the guidelines a low percent rank. I add the chemcial names to he new columns for when we combine the two data sets later.

flouride <- flouride %>% mutate(Above_Flouride_Guideline_Percent_Rank = percent_rank(1-percent_wells_above_guideline)*100)
arsenic <- arsenic %>% mutate(Above_Arsenic_Guideline_Percent_Rank = percent_rank(1-percent_wells_above_guideline)*100)
kable(flouride %>% slice(1:5), digits = 2)

location	percent_wells_above_guideline	Above_Flouride_Guideline_Percent_Rank
Otis	30.0	0.00
Dedham	22.5	0.28
Denmark	19.6	0.56
Surry	18.3	0.84
Prospect	17.5	1.12

kable(arsenic %>% slice(1:5), digits = 2)

location	percent_wells_above_guideline	Above_Arsenic_Guideline_Percent_Rank
Manchester	58.9	0.0
Gorham	50.1	0.3
Columbia	50.0	0.6
Monmouth	49.5	0.9
Eliot	49.3	1.2

Now that I have the percent rankings for both arsenic and flouride locations, I then look to join the two data sets together. Because I am looking to evaluate locations that rank poorly for both arsenic and flouride, I use and inner join to filter out locations where we do not have pecent rankings for both tests. This could be due to the locaton not being tested for either or both chemicals, or if fewer than 20 wells were tested. I call this new combined data set wells.

wells <- inner_join(arsenic, flouride, by = "location") %>% select(location, Above_Flouride_Guideline_Percent_Rank, Above_Arsenic_Guideline_Percent_Rank)
kable(wells %>% slice(1:5), digits = 2)

location	Above_Flouride_Guideline_Percent_Rank	Above_Arsenic_Guideline_Percent_Rank
Manchester	22.75	0.0
Gorham	52.25	0.3
Columbia	36.24	0.6
Monmouth	25.28	0.9
Eliot	52.25	1.2

To determine which locations rank poorly for both arsenic and flouride, I decide to see which fall into the bottom ten percent for both tests. Here I output all results as the bottom_locations.

bottom_locations <- wells %>% filter(Above_Flouride_Guideline_Percent_Rank < 10, Above_Arsenic_Guideline_Percent_Rank < 10)
kable(bottom_locations, digits = 2)

location	Above_Flouride_Guideline_Percent_Rank	Above_Arsenic_Guideline_Percent_Rank
Blue Hill	7.02	2.70
Orland	8.43	3.60
Surry	0.84	3.90
Otis	0.00	5.11
Sedgwick	5.06	5.71
Mercer	1.69	6.01

MBA 676- Assignment 1

Sean McGowan

September 24, 2017

Introduction

Anaylsis

To determine which locations rank poorly for both arsenic and flouride, I decide to see which fall into the bottom ten percent for both tests. Here I output all results as the bottom_locations.

Conclusion