Assignment 2

Instructions

Refer to the detailed instructions for this assignment in Brightspace.

Data Import

Don’t alter the three code chunks in this section. First we read in the two data sets and deleting missing values.

library(tidyverse)
fluoride <- read_csv("http://jamessuleiman.com/teaching/datasets/fluoride.csv")
fluoride <- fluoride %>% drop_na()
arsenic <- read_csv("http://jamessuleiman.com/teaching/datasets/arsenic.csv")
arsenic <- arsenic %>% drop_na()

Next we display the first few rows of fluoride.

head(fluoride)

## # A tibble: 6 x 6
##   location  n_wells_tested percent_wells_above_gui… median percentile_95 maximum
##   <chr>              <dbl>                    <dbl>  <dbl>         <dbl>   <dbl>
## 1 Otis                  60                     30    1.13           3.2      3.6
## 2 Dedham               102                     22.5  0.94           3.27     7  
## 3 Denmark               46                     19.6  0.45           3.15     3.9
## 4 Surry                175                     18.3  0.8            3.52     6.9
## 5 Prospect              57                     17.5  0.785          2.5      2.7
## 6 Eastbrook             31                     16.1  1.29           2.44     3.3

Then we display the first few rows of arsenic.

head(arsenic)

## # A tibble: 6 x 6
##   location    n_wells_tested percent_wells_above_g… median percentile_95 maximum
##   <chr>                <dbl>                  <dbl>  <dbl>         <dbl>   <dbl>
## 1 Manchester             275                   58.9   14            93       200
## 2 Gorham                 467                   50.1   10.5         130       460
## 3 Columbia                42                   50      9.8          65.9     200
## 4 Monmouth               277                   49.5   10           110       368
## 5 Eliot                   73                   49.3    9.7          41.4      45
## 6 Columbia F…             25                   48      8.1          53.8      71

Join data

In the code chunk below, create a new tibble called chemicals that joins fluoride and arsenic. You probably want to do an inner join but the join type is up to you.

"chemicals" <- (fluoride %>%
inner_join(arsenic,by = "location"))

The next code chunk displays the head of your newly created chemicals tibble. Take a look to verify that your join looks ok.

head(chemicals)

## # A tibble: 6 x 11
##   location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
##   <chr>               <dbl>            <dbl>    <dbl>           <dbl>     <dbl>
## 1 Otis                   60             30      1.13             3.2        3.6
## 2 Dedham                102             22.5    0.94             3.27       7  
## 3 Denmark                46             19.6    0.45             3.15       3.9
## 4 Surry                 175             18.3    0.8              3.52       6.9
## 5 Prospect               57             17.5    0.785            2.5        2.7
## 6 Eastbro…               31             16.1    1.29             2.44       3.3
## # … with 5 more variables: n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>

Interesting subset

In the code chunk below create an interesting subset of the data. You’ll likely find an interesting subset by filtering for locations that have high or low levels of arsenic, flouride, or both.

 chemicals$n_wells_above_guideline <- (((chemicals$percent_wells_above_guideline.x / 100) * chemicals$n_wells_tested.x) + ((chemicals$percent_wells_above_guideline.y / 100) * chemicals$n_wells_tested.y))
chemicals$total_wells_tested <- chemicals$n_wells_tested.x + chemicals$n_wells_tested.y
chemicals$percent_wells_above_chemical_guidelines <-   chemicals$n_wells_above_guideline / chemicals$total_wells_tested
chemicals %>%
select(location, total_wells_tested, n_wells_above_guideline, percent_wells_above_chemical_guidelines) %>% mutate(n_wells_above_guideline = round(n_wells_above_guideline, digits = 0)) %>% mutate(percent_wells_above_chemical_guidelines = round((percent_wells_above_chemical_guidelines  * 100), digits = 2)) %>% arrange(desc(percent_wells_above_chemical_guidelines))

## # A tibble: 341 x 4
##    location   total_wells_test… n_wells_above_guid… percent_wells_above_chemica…
##    <chr>                  <dbl>               <dbl>                        <dbl>
##  1 Otis                     113                  39                         34.5
##  2 Manchester               551                 171                         31.0
##  3 Surry                    356                 105                         29.5
##  4 Blue Hill                450                 123                         27.3
##  5 Mercer                    65                  17                         26.2
##  6 Monmouth                 565                 146                         25.8
##  7 Gorham                   919                 234                         25.5
##  8 Sedgwick                 285                  69                         24.2
##  9 Hallowell                124                  29                         23.4
## 10 Winthrop                 877                 204                         23.3
## # … with 331 more rows

Edit this part to discuss how you selected your interesting subset.

I was interested to see what locations in Maine had the highest percentage of chemicals (either fluoride or arsenic) in the wells. Since different numbers of wells were tested for each chemical, I could not simply take the average percentage of wells with contaminates.

After joining the fluoride and arsenic data sets into the chemicals tibble, I created three new columns to capture the needed data. The first was the total number of wells above guidelines per location, calculated by summing the products of number of wells and percentage positives for both fluoride and arsenic. This provided the number of positive wells per location, but did not take into account varying sample sizes.

To fix this, I summed the number of wells tested for fluoride and arsenic for each location. Finally, by dividing total number of wells above chemical guidelines by total number of wells tested per location, I obtained the overall percentage of positive wells by location. The inital output above did not list locations in order, so I arranged them in descending order based on percentage of wells above guidelines.

One struggle I had was figuring out how to manipulate the data to show the info I wanted. I ended up coding the equations by hand, which leaves room for error, but eventually got me to the results I needed.

Display the first few rows of your interesting subset in the code chunk below.

chemicals %>% select(location, total_wells_tested, n_wells_above_guideline, percent_wells_above_chemical_guidelines) %>% mutate(n_wells_above_guideline = round(n_wells_above_guideline, digits = 0)) %>% mutate(percent_wells_above_chemical_guidelines = round((percent_wells_above_chemical_guidelines  * 100), digits = 2)) %>% arrange(desc(percent_wells_above_chemical_guidelines)) %>% top_n(5)

## Selecting by percent_wells_above_chemical_guidelines

## # A tibble: 5 x 4
##   location   total_wells_tested n_wells_above_guid… percent_wells_above_chemica…
##   <chr>                   <dbl>               <dbl>                        <dbl>
## 1 Otis                      113                  39                         34.5
## 2 Manchester                551                 171                         31.0
## 3 Surry                     356                 105                         29.5
## 4 Blue Hill                 450                 123                         27.3
## 5 Mercer                     65                  17                         26.2

Visualize your subset

In the code chunk below, create a ggplot visualization of your subset that is fairly simple for a viewer to comprehend.

## Selecting by percent_wells_above_chemical_guidelines