Instructions

Refer to the detailed instructions for this assignment in Brightspace.

Data Import

Don’t alter the three code chunks in this section. First we read in the two data sets and deleting missing values.

#install.packages('readr')
library(tidyverse)
fluoride <- read_csv("http://jamessuleiman.com/teaching/datasets/fluoride.csv")
fluoride <- fluoride %>% drop_na()
arsenic <- read_csv("http://jamessuleiman.com/teaching/datasets/arsenic.csv")
arsenic <- arsenic %>% drop_na()

Next we display the first few rows of fluoride.

head(fluoride)
## # A tibble: 6 x 6
##   location  n_wells_tested percent_wells_above_gui… median percentile_95 maximum
##   <chr>              <dbl>                    <dbl>  <dbl>         <dbl>   <dbl>
## 1 Otis                  60                     30    1.13           3.2      3.6
## 2 Dedham               102                     22.5  0.94           3.27     7  
## 3 Denmark               46                     19.6  0.45           3.15     3.9
## 4 Surry                175                     18.3  0.8            3.52     6.9
## 5 Prospect              57                     17.5  0.785          2.5      2.7
## 6 Eastbrook             31                     16.1  1.29           2.44     3.3

Then we display the first few rows of arsenic.

head(arsenic)
## # A tibble: 6 x 6
##   location    n_wells_tested percent_wells_above_g… median percentile_95 maximum
##   <chr>                <dbl>                  <dbl>  <dbl>         <dbl>   <dbl>
## 1 Manchester             275                   58.9   14            93       200
## 2 Gorham                 467                   50.1   10.5         130       460
## 3 Columbia                42                   50      9.8          65.9     200
## 4 Monmouth               277                   49.5   10           110       368
## 5 Eliot                   73                   49.3    9.7          41.4      45
## 6 Columbia F…             25                   48      8.1          53.8      71

Join data

In the code chunk below, create a new tibble called chemicals that joins fluoride and arsenic. You probably want to do an inner join but the join type is up to you.

chemicals <- arsenic %>% inner_join(fluoride, by= "location")

The next code chunk displays the head of your newly created chemicals tibble. Take a look to verify that your join looks ok.

head(chemicals)
## # A tibble: 6 x 11
##   location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
##   <chr>               <dbl>            <dbl>    <dbl>           <dbl>     <dbl>
## 1 Manches…              275             58.9     14              93         200
## 2 Gorham                467             50.1     10.5           130         460
## 3 Columbia               42             50        9.8            65.9       200
## 4 Monmouth              277             49.5     10             110         368
## 5 Eliot                  73             49.3      9.7            41.4        45
## 6 Columbi…               25             48        8.1            53.8        71
## # … with 5 more variables: n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>
# x=arsenic, y=fluoride 

Intersting subset

In the code chunk below create an interesting subset of the data. You’ll likely find an interesting subset by filtering for locations that have high or low levels of arsenic, fluoride, or both.

median_arsenic <- chemicals %>% filter(chemicals$median.x >=10 & chemicals$median.y <2) %>% arrange(desc(median.x))
median_fluoride <- chemicals %>% filter(chemicals$median.x <10 & chemicals$median.y >=2) %>% arrange(desc(median.y))
median_chemicals <- chemicals %>% filter(chemicals$median.x >=10 & chemicals$median.y >=2) 
dim(median_arsenic)[1]
## [1] 3
dim(median_fluoride)[1]
## [1] 0
dim(median_chemicals)[1]
## [1] 0
percentile_95_arsenic <- chemicals %>% filter(chemicals$percentile_95.x >=10 & chemicals$percentile_95.y <2) %>% arrange(desc(percentile_95.x))
percentile_95_fluoride <- chemicals %>% filter(chemicals$percentile_95.x <10 & chemicals$percentile_95.y >=2) %>% arrange(desc(percentile_95.x))
percentile_95_chemicals <- chemicals %>% filter(chemicals$percentile_95.x >=10 & chemicals$percentile_95.y >=2)
dim(percentile_95_arsenic)[1]
## [1] 190
dim(percentile_95_fluoride)[1]
## [1] 23
dim(percentile_95_chemicals)[1]
## [1] 27

Above, we observe the number of towns where at least 50% of the wells were out of compliance regarding the amount of arsenic and fluoride found in their wells. By this measure, 3 towns failed outright regarding arsenic. One the other hand, there were no towns where the majority of tested wells were outside of acceptable parameters for fluoride.

I have also pulled out the 95th percentile measure, which shows towns where at least 5% of the wells were out of compliance. This indicates towns that seem to have problem areas but not necessarily an overwhelming amount of noncompliance. 190 towns had wells that tested high for only arsenic, 23 towns had wells that tested high on only fluoride, and 27 towns had wells out of compliance with both (though, not necessarily high levels of both fluoride and arsenic in the same wells).

I think it’s clear that arsenic was a larger (and potentially much more dangerous) concern in Maine wells than fluoride was.

head(median_arsenic)
## # A tibble: 3 x 11
##   location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
##   <chr>               <dbl>            <dbl>    <dbl>           <dbl>     <dbl>
## 1 Manches…              275             58.9     14                93       200
## 2 Gorham                467             50.1     10.5             130       460
## 3 Monmouth              277             49.5     10               110       368
## # … with 5 more variables: n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>
head(median_fluoride)
## # A tibble: 0 x 11
## # … with 11 variables: location <chr>, n_wells_tested.x <dbl>,
## #   percent_wells_above_guideline.x <dbl>, median.x <dbl>,
## #   percentile_95.x <dbl>, maximum.x <dbl>, n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>
head(median_chemicals)
## # A tibble: 0 x 11
## # … with 11 variables: location <chr>, n_wells_tested.x <dbl>,
## #   percent_wells_above_guideline.x <dbl>, median.x <dbl>,
## #   percentile_95.x <dbl>, maximum.x <dbl>, n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>
head(percentile_95_arsenic)
## # A tibble: 6 x 11
##   location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
##   <chr>               <dbl>            <dbl>    <dbl>           <dbl>     <dbl>
## 1 Danforth               35             40        5              372.      3100
## 2 Standish              632             26.9      2              154        550
## 3 Gorham                467             50.1     10.5            130        460
## 4 Wales                  42             33.3      2              129        180
## 5 Camden                 50             34        5              125        404
## 6 Northpo…              157             26.1      2.1            114.      1700
## # … with 5 more variables: n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>
head(percentile_95_fluoride)
## # A tibble: 6 x 11
##   location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
##   <chr>               <dbl>            <dbl>    <dbl>           <dbl>     <dbl>
## 1 Cornish                23              4.3    0.785            9.86        68
## 2 York                  119              2.5    0.73             9.72        35
## 3 Southwe…               34              5.9    0.5              8.97        23
## 4 Prospect               50              4      1                8           16
## 5 Lebanon                48              4.2    0.5              7.96        31
## 6 Greenwo…               23              4.3    0.25             7.12        15
## # … with 5 more variables: n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>
head(percentile_95_chemicals)
## # A tibble: 6 x 11
##   location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
##   <chr>               <dbl>            <dbl>    <dbl>           <dbl>     <dbl>
## 1 Blue Hi…              241             42.7      7             229         930
## 2 Orland                123             40.7      5.4            44.6       160
## 3 Surry                 181             40.3      6             146.        470
## 4 Mariavi…               30             40        7.2            55.5        57
## 5 Otis                   53             39.6      4.8           121.        200
## 6 Sedgwick              142             37.3      4.2           237.        840
## # … with 5 more variables: n_wells_tested.y <dbl>,
## #   percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## #   percentile_95.y <dbl>, maximum.y <dbl>

Visualize your subset

In the code chunk below, create a ggplot visualization of your subset that is fairly simple for a viewer to comprehend.

ggplot(data=chemicals, aes(x=percentile_95.x, y=percentile_95.y)) +
  geom_point(size=1, alpha= 0.3) +
  geom_hline(yintercept=2, col="blue") +
  geom_vline(xintercept=10, col = "red") +
  ggtitle("Minimum Levels of Flouride vs Arsentic in the Top 5% of Wells") +
  xlab("Arsenic in Top 5% of Wells") +
  ylab("Fluoride in Top 5% of Wells")

chemicals<- chemicals %>% mutate(noncompliance_95 = if_else(percentile_95.x < 10 & percentile_95.y < 2, "   in compliance/ none", if_else(percentile_95.y >= 2 & percentile_95.x < 10, " only fluoride", if_else(percentile_95.x >= 10 & percentile_95.y < 2, "  only arsenic", if_else (percentile_95.x >= 10 & percentile_95.y >= 2, "both", "missing")))))

counts_95 <- chemicals %>% group_by (noncompliance_95) %>%
  summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data= counts_95, aes(x= noncompliance_95, y= count)) +
  geom_bar(stat= "identity") +
  ggtitle("Number of Towns 5% or More Out of Compliance") +
  xlab("Chemicals") +
  ylab("Number of Towns")

ggplot(data=chemicals, aes(x=median.x, y=median.y)) +
  geom_point(size=1, alpha= 0.3) +
  geom_hline(yintercept=2, col="blue") +
  geom_vline(xintercept=10, col = "red") +
  ggtitle("Median Levels of Flouride vs Arsentic in Maine Towns") +
  xlab(" Median Arsenic") +
  ylab("Median Fluoride")

chemicals<- chemicals %>% mutate(noncompliance_median = if_else(median.x < 10 & median.y < 2, "   in compliance/ none", if_else(median.y >= 2 & median.x < 10, " only fluoride", if_else(median.x >= 10 & median.y < 2, "  only arsenic", if_else (median.x >= 10 & median.y >= 2, "both", "missing")))))

counts_median <- chemicals %>% group_by (noncompliance_median) %>%
  summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data= counts_median, aes(x= noncompliance_median, y= count)) +
  geom_bar(stat= "identity") +
  ggtitle("Number of Towns 50% or More Out of Compliance") +
  xlab("Chemicals") +
  ylab("Number of Towns")