Instructions

Refer to the detailed instructions for this assignment in Brightspace.

Data Import

Don’t alter the three code chunks in this section. First we read in the two data sets and delete any rows with missing values.

library(tidyverse)
fluoride <- read_csv("http://jamessuleiman.com/teaching/datasets/fluoride.csv")
fluoride <- fluoride %>% drop_na()
arsenic <- read_csv("http://jamessuleiman.com/teaching/datasets/arsenic.csv")
arsenic <- arsenic %>% drop_na()

Next we display the first few rows of fluoride.

head(fluoride)
## # A tibble: 6 x 6
##   location  n_wells_tested percent_wells_above_gui… median percentile_95 maximum
##   <chr>              <dbl>                    <dbl>  <dbl>         <dbl>   <dbl>
## 1 Otis                  60                     30    1.13           3.2      3.6
## 2 Dedham               102                     22.5  0.94           3.27     7  
## 3 Denmark               46                     19.6  0.45           3.15     3.9
## 4 Surry                175                     18.3  0.8            3.52     6.9
## 5 Prospect              57                     17.5  0.785          2.5      2.7
## 6 Eastbrook             31                     16.1  1.29           2.44     3.3

Then we display the first few rows of arsenic.

head(arsenic)
## # A tibble: 6 x 6
##   location    n_wells_tested percent_wells_above_g… median percentile_95 maximum
##   <chr>                <dbl>                  <dbl>  <dbl>         <dbl>   <dbl>
## 1 Manchester             275                   58.9   14            93       200
## 2 Gorham                 467                   50.1   10.5         130       460
## 3 Columbia                42                   50      9.8          65.9     200
## 4 Monmouth               277                   49.5   10           110       368
## 5 Eliot                   73                   49.3    9.7          41.4      45
## 6 Columbia F…             25                   48      8.1          53.8      71

Join data

In the code chunk below, create a new tibble called chemicals that joins fluoride and arsenic. You probably want to do an inner join, but the join type is up to you.

chemicals <- fluoride %>% inner_join(arsenic, by = "location")

The next code chunk displays the head of your newly created chemicals tibble. Take a look to verify that your join looks ok.

head(chemicals)
## # A tibble: 6 x 11
##   location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
##   <chr>               <dbl>            <dbl>    <dbl>           <dbl>     <dbl>
## 1 Otis                   60             30      1.13             3.2        3.6
## 2 Dedham                102             22.5    0.94             3.27       7  
## 3 Denmark                46             19.6    0.45             3.15       3.9
## 4 Surry                 175             18.3    0.8              3.52       6.9
## 5 Prospect               57             17.5    0.785            2.5        2.7
## 6 Eastbro…               31             16.1    1.29             2.44       3.3
## # … with 5 more variables: n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>
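The .x and .y suffixes in the output above are what inner_join() adds by default when both tables share column names; here .x comes from fluoride and .y from arsenic. As a minimal sketch on made-up toy tibbles (fluoride_toy, arsenic_toy, and their values are hypothetical, not the course data), the suffix argument can give more readable names, and anti_join() shows which locations an inner join drops:

```r
library(dplyr)

# Toy stand-ins for the real data sets (values are invented for illustration).
fluoride_toy <- tibble(location = c("A", "B", "C"), maximum = c(3.6, 7.0, 3.9))
arsenic_toy  <- tibble(location = c("B", "C", "D"), maximum = c(200, 460, 45))

# suffix replaces the default ".x"/".y" on column names shared by both tables.
chemicals_toy <- fluoride_toy %>%
  inner_join(arsenic_toy, by = "location", suffix = c("_fluoride", "_arsenic"))

# Rows of fluoride_toy with no arsenic match, i.e. what the inner join drops.
unmatched <- fluoride_toy %>% anti_join(arsenic_toy, by = "location")

names(chemicals_toy)
```

The rest of this document keeps the default .x and .y names, so this is only an optional alternative.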

Interesting subset

In the code chunk below create an interesting subset of the data. You’ll likely find an interesting subset by filtering for locations that have high or low levels of arsenic, fluoride, or both.

# Keep locations whose maximum arsenic reading (maximum.y) exceeds 10, highest first
high_arsenic <- chemicals %>% filter(maximum.y > 10) %>% arrange(desc(maximum.y))
summary(high_arsenic)
##    location         n_wells_tested.x percent_wells_above_guideline.x
##  Length:291         Min.   : 21.0    Min.   : 0.000                 
##  Class :character   1st Qu.: 45.5    1st Qu.: 0.000                 
##  Mode  :character   Median : 74.0    Median : 0.800                 
##                     Mean   :105.2    Mean   : 2.396                 
##                     3rd Qu.:139.0    3rd Qu.: 2.950                 
##                     Max.   :503.0    Max.   :30.000                 
##     median.x     percentile_95.x   maximum.x      n_wells_tested.y
##  Min.   :0.100   Min.   :0.100   Min.   : 0.100   Min.   : 20.00  
##  1st Qu.:0.100   1st Qu.:0.587   1st Qu.: 1.200   1st Qu.: 39.00  
##  Median :0.100   Median :1.034   Median : 2.200   Median : 62.00  
##  Mean   :0.175   Mean   :1.160   Mean   : 2.613   Mean   : 94.36  
##  3rd Qu.:0.200   3rd Qu.:1.562   3rd Qu.: 3.600   3rd Qu.:118.50  
##  Max.   :1.290   Max.   :4.180   Max.   :14.000   Max.   :632.00  
##  percent_wells_above_guideline.y    median.y      percentile_95.y  
##  Min.   : 0.80                   Min.   : 0.250   Min.   :  1.000  
##  1st Qu.: 5.40                   1st Qu.: 0.500   1st Qu.:  9.832  
##  Median :10.60                   Median : 1.000   Median : 16.900  
##  Mean   :14.50                   Mean   : 1.814   Mean   : 28.629  
##  3rd Qu.:19.55                   3rd Qu.: 2.000   3rd Qu.: 31.500  
##  Max.   :58.90                   Max.   :14.000   Max.   :372.500  
##    maximum.y     
##  Min.   :  11.0  
##  1st Qu.:  26.5  
##  Median :  46.0  
##  Mean   : 110.4  
##  3rd Qu.: 110.0  
##  Max.   :3100.0
# Keep locations above the mean of maximum.y reported by summary() above (110.4)
highest_arsenic <- chemicals %>% filter(maximum.y > 110.4) %>% arrange(desc(maximum.y))

Narrative

For the interesting subset, I first filtered the joined data to keep the locations where the maximum arsenic level was greater than 10 ug/L, and stored them in a tibble called high_arsenic. This left 291 of the original 360 observations.

To narrow down the number of observations, I ran summary() on the high_arsenic tibble to find the mean of the maximum arsenic levels among these locations, which was 110.4 ug/L. I then applied a new filter that kept only the observations whose maximum arsenic level was above this mean (110.4 ug/L) and stored the results in a second tibble called highest_arsenic. This second tibble has 65 observations, which I considered my interesting subset.
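The two-step filter described above can also be written so that the cutoff is computed from the data rather than typed in; summary() prints rounded values, so a hard-coded number may differ slightly from the exact mean. The following is a minimal sketch on a made-up toy tibble (chemicals_toy and its values are hypothetical), where maximum.y stands in for the maximum arsenic column:

```r
library(dplyr)

# Toy stand-in for the joined tibble; maximum.y plays the role of maximum arsenic.
chemicals_toy <- tibble(
  location  = c("A", "B", "C", "D", "E"),
  maximum.y = c(5, 20, 40, 140, 300)
)

# Step 1: keep locations above the first threshold.
high_toy <- chemicals_toy %>% filter(maximum.y > 10)

# Step 2: compute the cutoff from the data instead of typing a rounded value.
cutoff <- mean(high_toy$maximum.y)

# Step 3: keep locations above that mean, highest first.
highest_toy <- chemicals_toy %>%
  filter(maximum.y > cutoff) %>%
  arrange(desc(maximum.y))
```

With the real data, the same pattern would use chemicals in place of chemicals_toy.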

I was curious to see whether these highest arsenic levels were correlated with fluoride levels, so I created a plot comparing the two variables. I labeled the x-axis ‘Max Arsenic ug/L’ and the y-axis ‘Max Fluoride mg/L’. I also set the code chunk option echo to FALSE to prevent the code from being displayed in the knitted document. From the plot, there does not appear to be any correlation between the maximum arsenic and maximum fluoride levels in this subset.
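A visual impression of "no correlation" can be backed up with a number. As a rough sketch on made-up values (highest_toy is a hypothetical stand-in for the subset, with maximum.x as fluoride and maximum.y as arsenic), cor() returns a correlation coefficient; the Spearman variant is included on the assumption that a few very large arsenic readings could dominate the default Pearson coefficient:

```r
library(dplyr)

# Toy stand-in for the subset (values are invented for illustration).
highest_toy <- tibble(
  maximum.x = c(1.9, 4.1, 4.5, 4.2, 3.2),   # fluoride
  maximum.y = c(3100, 900, 700, 500, 400)   # arsenic
)

# Pearson correlation (linear association), the default method.
r_pearson <- cor(highest_toy$maximum.y, highest_toy$maximum.x)

# Spearman rank correlation, less sensitive to extreme values.
r_spearman <- cor(highest_toy$maximum.y, highest_toy$maximum.x, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Values near 0 would be consistent with the reading of the plot; values near -1 or 1 would not.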

Display the first few rows of your interesting subset in the code chunk below.

head(highest_arsenic)
## # A tibble: 6 x 11
##   location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
##   <chr>               <dbl>            <dbl>    <dbl>           <dbl>     <dbl>
## 1 Danforth               35              0      0.2             0.96        1.9
## 2 Northpo…               87              1.1    0.275           1.25        4.1
## 3 Blue Hi…              209              9.6    0.43            2.86        4.5
## 4 Sedgwick              143             11.2    0.425           2.87        4.2
## 5 Buxton                383              1      0.1             0.899       3.2
## 6 Standish              290              1.7    0.1             1.19        4.7
## # … with 5 more variables: n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>

Visualize your subset

In the code chunk below, create a ggplot visualization of your subset that is fairly simple for a viewer to comprehend.
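Because the plotting chunk is hidden with echo = FALSE, the following is only a sketch of what a simple scatter plot along these lines might look like, not the code actually used. It runs on made-up values (highest_toy is hypothetical); the axis labels are the ones given in the narrative, and the log scale on the x-axis is an added assumption to spread out a skewed arsenic range:

```r
library(ggplot2)

# Toy stand-in for the subset (values are invented for illustration).
highest_toy <- data.frame(
  maximum.x = c(1.9, 4.1, 4.5, 4.2, 3.2),   # fluoride, mg/L
  maximum.y = c(3100, 900, 700, 500, 400)   # arsenic, ug/L
)

# One point per location: maximum arsenic on x, maximum fluoride on y.
p <- ggplot(highest_toy, aes(x = maximum.y, y = maximum.x)) +
  geom_point() +
  scale_x_log10() +  # optional: compresses the few very large arsenic values
  labs(x = "Max Arsenic ug/L", y = "Max Fluoride mg/L")

p
```

With the real data, highest_arsenic would replace highest_toy, and scale_x_log10() can be dropped if a linear axis reads more clearly.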

Once you are done, knit, publish, and then submit the link to your published RPubs document in Brightspace.