Refer to the detailed instructions for this assignment in Brightspace.
Don’t alter the three code chunks in this section. First we read in the two data sets and deleting missing values.
library(tidyverse)
fluoride <- read_csv("http://jamessuleiman.com/teaching/datasets/fluoride.csv")
fluoride <- fluoride %>% drop_na()
arsenic <- read_csv("http://jamessuleiman.com/teaching/datasets/arsenic.csv")
arsenic <- arsenic %>% drop_na()
Next we display the first few rows of fluoride.
head(fluoride)
## # A tibble: 6 x 6
## location n_wells_tested percent_wells_above_gui… median percentile_95 maximum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbrook 31 16.1 1.29 2.44 3.3
Then we display the first few rows of arsenic.
head(arsenic)
## # A tibble: 6 x 6
## location n_wells_tested percent_wells_above_g… median percentile_95 maximum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Manchester 275 58.9 14 93 200
## 2 Gorham 467 50.1 10.5 130 460
## 3 Columbia 42 50 9.8 65.9 200
## 4 Monmouth 277 49.5 10 110 368
## 5 Eliot 73 49.3 9.7 41.4 45
## 6 Columbia F… 25 48 8.1 53.8 71
In the code chunk below, create a new tibble called chemicals that joins fluoride and arsenic. You probably want to do an inner join but the join type is up to you.
"chemicals" <- (fluoride %>%
inner_join(arsenic,by = "location"))
The next code chunk displays the head of your newly created chemicals tibble. Take a look to verify that your join looks ok.
head(chemicals)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbro… 31 16.1 1.29 2.44 3.3
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
In the code chunk below create an interesting subset of the data. You’ll likely find an interesting subset by filtering for locations that have high or low levels of arsenic, flouride, or both.
chemicals$n_wells_above_guideline <- (((chemicals$percent_wells_above_guideline.x / 100) * chemicals$n_wells_tested.x) + ((chemicals$percent_wells_above_guideline.y / 100) * chemicals$n_wells_tested.y))
chemicals$total_wells_tested <- chemicals$n_wells_tested.x + chemicals$n_wells_tested.y
chemicals$percent_wells_above_chemical_guidelines <- chemicals$n_wells_above_guideline / chemicals$total_wells_tested
chemicals %>%
select(location, total_wells_tested, n_wells_above_guideline, percent_wells_above_chemical_guidelines) %>% mutate(n_wells_above_guideline = round(n_wells_above_guideline, digits = 0)) %>% mutate(percent_wells_above_chemical_guidelines = round((percent_wells_above_chemical_guidelines * 100), digits = 2)) %>% arrange(desc(percent_wells_above_chemical_guidelines))
## # A tibble: 341 x 4
## location total_wells_test… n_wells_above_guid… percent_wells_above_chemica…
## <chr> <dbl> <dbl> <dbl>
## 1 Otis 113 39 34.5
## 2 Manchester 551 171 31.0
## 3 Surry 356 105 29.5
## 4 Blue Hill 450 123 27.3
## 5 Mercer 65 17 26.2
## 6 Monmouth 565 146 25.8
## 7 Gorham 919 234 25.5
## 8 Sedgwick 285 69 24.2
## 9 Hallowell 124 29 23.4
## 10 Winthrop 877 204 23.3
## # … with 331 more rows
Edit this part to discuss how you selected your interesting subset.
I was interested to see what locations in Maine had the highest percentage of chemicals (either fluoride or arsenic) in the wells. Since different numbers of wells were tested for each chemical, I could not simply take the average percentage of wells with contaminates.
After joining the fluoride and arsenic data sets into the chemicals tibble, I created three new columns to capture the needed data. The first was the total number of wells above guidelines per location, calculated by summing the products of number of wells and percentage positives for both fluoride and arsenic. This provided the number of positive wells per location, but did not take into account varying sample sizes.
To fix this, I summed the number of wells tested for fluoride and arsenic for each location. Finally, by dividing total number of wells above chemical guidelines by total number of wells tested per location, I obtained the overall percentage of positive wells by location. The inital output above did not list locations in order, so I arranged them in descending order based on percentage of wells above guidelines.
One struggle I had was figuring out how to manipulate the data to show the info I wanted. I ended up coding the equations by hand, which leaves room for error, but eventually got me to the results I needed.
Display the first few rows of your interesting subset in the code chunk below.
chemicals %>% select(location, total_wells_tested, n_wells_above_guideline, percent_wells_above_chemical_guidelines) %>% mutate(n_wells_above_guideline = round(n_wells_above_guideline, digits = 0)) %>% mutate(percent_wells_above_chemical_guidelines = round((percent_wells_above_chemical_guidelines * 100), digits = 2)) %>% arrange(desc(percent_wells_above_chemical_guidelines)) %>% top_n(5)
## Selecting by percent_wells_above_chemical_guidelines
## # A tibble: 5 x 4
## location total_wells_tested n_wells_above_guid… percent_wells_above_chemica…
## <chr> <dbl> <dbl> <dbl>
## 1 Otis 113 39 34.5
## 2 Manchester 551 171 31.0
## 3 Surry 356 105 29.5
## 4 Blue Hill 450 123 27.3
## 5 Mercer 65 17 26.2
In the code chunk below, create a ggplot visualization of your subset that is fairly simple for a viewer to comprehend.
## Selecting by percent_wells_above_chemical_guidelines
Once you are done, knit, publish, and then submit your link to your published RPubs document in Brightspace.