Assignment 2

Instructions

Refer to the detailed instructions for this assignment in Brightspace.

Data Import

Don’t alter the three code chunks in this section. First we read in the two data sets and delete rows with missing values.

library(tidyverse)
library(dplyr)
fluoride <- read_csv("http://jamessuleiman.com/teaching/datasets/fluoride.csv")
fluoride <- fluoride %>% drop_na()
arsenic <- read_csv("http://jamessuleiman.com/teaching/datasets/arsenic.csv")
arsenic <- arsenic %>% drop_na()

Next we display the first few rows of fluoride.

head(fluoride)

## # A tibble: 6 × 6
##   location  n_wells_tested percent_wells_above_guideline median percen…¹ maximum
##   <chr>              <dbl>                         <dbl>  <dbl>    <dbl>   <dbl>
## 1 Otis                  60                          30    1.13      3.2      3.6
## 2 Dedham               102                          22.5  0.94      3.27     7  
## 3 Denmark               46                          19.6  0.45      3.15     3.9
## 4 Surry                175                          18.3  0.8       3.52     6.9
## 5 Prospect              57                          17.5  0.785     2.5      2.7
## 6 Eastbrook             31                          16.1  1.29      2.44     3.3
## # … with abbreviated variable name ¹percentile_95

Then we display the first few rows of arsenic.

head(arsenic)

## # A tibble: 6 × 6
##   location       n_wells_tested percent_wells_above_gui…¹ median perce…² maximum
##   <chr>                   <dbl>                     <dbl>  <dbl>   <dbl>   <dbl>
## 1 Manchester                275                      58.9   14      93       200
## 2 Gorham                    467                      50.1   10.5   130       460
## 3 Columbia                   42                      50      9.8    65.9     200
## 4 Monmouth                  277                      49.5   10     110       368
## 5 Eliot                      73                      49.3    9.7    41.4      45
## 6 Columbia Falls             25                      48      8.1    53.8      71
## # … with abbreviated variable names ¹percent_wells_above_guideline,
## #   ²percentile_95

Join data

In the code chunk below, create a new tibble called chemicals that joins fluoride and arsenic. You probably want to do an inner join but the join type is up to you.

chemicals <- inner_join(fluoride, arsenic, by = "location")

The next code chunk displays the head of your newly created chemicals tibble. Take a look to verify that your join looks ok.

head(chemicals)

## # A tibble: 6 × 11
##   location  n_wells_te…¹ perce…² media…³ perce…⁴ maxim…⁵ n_wel…⁶ perce…⁷ media…⁸
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Otis                60    30     1.13     3.2      3.6      53    39.6    4.8 
## 2 Dedham             102    22.5   0.94     3.27     7        97    17.5    1   
## 3 Denmark             46    19.6   0.45     3.15     3.9      42     0      0.25
## 4 Surry              175    18.3   0.8      3.52     6.9     181    40.3    6   
## 5 Prospect            57    17.5   0.785    2.5      2.7      50     4      1   
## 6 Eastbrook           31    16.1   1.29     2.44     3.3      28    10.7    1.5 
## # … with 2 more variables: percentile_95.y <dbl>, maximum.y <dbl>, and
## #   abbreviated variable names ¹n_wells_tested.x,
## #   ²percent_wells_above_guideline.x, ³median.x, ⁴percentile_95.x, ⁵maximum.x,
## #   ⁶n_wells_tested.y, ⁷percent_wells_above_guideline.y, ⁸median.y
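
Beyond eyeballing the head, one way to audit a join is to count the rows that found no partner on either side. The sketch below is only an illustration under stated assumptions: fluoride_small and arsenic_small are hypothetical toy tibbles built from a few values shown in the output above, and the extra Manchester row is there purely so that anti_join has something to report. The suffix argument is optional; it replaces the default .x and .y endings with readable ones.

```r
library(dplyr)

# Hypothetical toy excerpts (not the full data sets), using values from the output above.
fluoride_small <- tibble(
  location = c("Otis", "Dedham", "Denmark"),
  n_wells_tested = c(60, 102, 46)
)
arsenic_small <- tibble(
  location = c("Otis", "Dedham", "Denmark", "Manchester"),
  n_wells_tested = c(53, 97, 42, 275)
)

# Same join type as above, with readable suffixes instead of .x and .y.
joined_small <- inner_join(
  fluoride_small, arsenic_small,
  by = "location",
  suffix = c("_fluoride", "_arsenic")
)

# Rows on each side that have no match on the other side.
unmatched_arsenic <- anti_join(arsenic_small, fluoride_small, by = "location")
unmatched_fluoride <- anti_join(fluoride_small, arsenic_small, by = "location")

joined_small
unmatched_arsenic
```

Applied to the real tibbles, the same two anti_join calls would list the locations dropped by the inner join.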

Interesting subset

In the code chunk below, create an interesting subset of the data. You’ll likely find an interesting subset by filtering for locations that have high or low levels of arsenic, fluoride, or both.

chemicals %>% select(location, n_wells_tested.x, n_wells_tested.y)

## # A tibble: 341 × 3
##    location         n_wells_tested.x n_wells_tested.y
##    <chr>                       <dbl>            <dbl>
##  1 Otis                           60               53
##  2 Dedham                        102               97
##  3 Denmark                        46               42
##  4 Surry                         175              181
##  5 Prospect                       57               50
##  6 Eastbrook                      31               28
##  7 Mercer                         32               33
##  8 Fryeburg                       52               37
##  9 Brownfield                     33               24
## 10 Stockton Springs               56               63
## # … with 331 more rows

This table is interesting because it shows how many wells were tested for fluoride and for arsenic in each location; the display below narrows it to the five locations with the most wells tested for arsenic. The variable n_wells_tested.x is the number of wells tested for fluoride, and n_wells_tested.y is the number of wells tested for arsenic. The most difficult task was joining the two data sets while verifying that the values in the joined columns matched the original data.
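
The prompt for this section suggests filtering on contaminant levels. A minimal sketch of that kind of subset follows, assuming a hypothetical cutoff of 15 percent of wells above the guideline for both chemicals; the cutoff is arbitrary, and chemicals_head is a toy tibble copied from the first six joined rows shown earlier rather than the full data.

```r
library(dplyr)

# Toy copy of the first six joined rows shown earlier (percent columns only).
chemicals_head <- tribble(
  ~location,   ~percent_wells_above_guideline.x, ~percent_wells_above_guideline.y,
  "Otis",      30,   39.6,
  "Dedham",    22.5, 17.5,
  "Denmark",   19.6, 0,
  "Surry",     18.3, 40.3,
  "Prospect",  17.5, 4,
  "Eastbrook", 16.1, 10.7
)

# Hypothetical threshold: keep locations where both chemicals exceed
# the guideline in more than 15 percent of tested wells.
high_both <- chemicals_head %>%
  filter(
    percent_wells_above_guideline.x > 15,
    percent_wells_above_guideline.y > 15
  )

high_both
```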

Display the first few rows of your interesting subset in the code chunk below.

chemicals %>% select(location, n_wells_tested.x, n_wells_tested.y) %>% top_n(5)

## Selecting by n_wells_tested.y

## # A tibble: 5 × 3
##   location  n_wells_tested.x n_wells_tested.y
##   <chr>                <dbl>            <dbl>
## 1 Ellsworth              503              428
## 2 Winthrop               453              424
## 3 Augusta                479              454
## 4 Standish               290              632
## 5 Gorham                 452              467
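
The "Selecting by" message appears because top_n() was not told which column to rank on, so it used the last one. In current dplyr, top_n() is superseded by slice_max(), which names the ranking column explicitly and returns rows sorted from largest to smallest. A small sketch, assuming a hypothetical toy tibble wells_small built from the first ten rows of the subset shown earlier:

```r
library(dplyr)

# Toy copy of the first ten rows of the subset shown earlier.
wells_small <- tribble(
  ~location,          ~n_wells_tested.x, ~n_wells_tested.y,
  "Otis",             60,  53,
  "Dedham",           102, 97,
  "Denmark",          46,  42,
  "Surry",            175, 181,
  "Prospect",         57,  50,
  "Eastbrook",        31,  28,
  "Mercer",           32,  33,
  "Fryeburg",         52,  37,
  "Brownfield",       33,  24,
  "Stockton Springs", 56,  63
)

# Rank explicitly by the arsenic test count and keep the three largest.
top_arsenic <- wells_small %>%
  slice_max(n_wells_tested.y, n = 3)

top_arsenic
```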

Visualize your subset

In the code chunk below, create a ggplot visualization of your subset that is fairly simple for a viewer to comprehend.
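
One possible visualization, offered as a sketch rather than the required answer: a dodged bar chart comparing fluoride and arsenic test counts for the five locations displayed above. The tibble top_wells is typed in from that display so the chunk runs on its own; the names top_wells, wells_long, and wells_plot are hypothetical, and the labels are a suggestion only.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Values copied from the five-row display above.
top_wells <- tribble(
  ~location,   ~n_wells_tested.x, ~n_wells_tested.y,
  "Ellsworth", 503, 428,
  "Winthrop",  453, 424,
  "Augusta",   479, 454,
  "Standish",  290, 632,
  "Gorham",    452, 467
)

# Reshape to one row per location and chemical so ggplot can map fill to chemical.
wells_long <- top_wells %>%
  pivot_longer(
    cols = c(n_wells_tested.x, n_wells_tested.y),
    names_to = "chemical",
    values_to = "n_wells_tested"
  ) %>%
  mutate(chemical = if_else(chemical == "n_wells_tested.x", "fluoride", "arsenic"))

# Side-by-side bars: one pair per location.
wells_plot <- ggplot(wells_long, aes(x = location, y = n_wells_tested, fill = chemical)) +
  geom_col(position = "dodge") +
  labs(
    x = "Location",
    y = "Wells tested",
    fill = "Chemical",
    title = "Wells tested for fluoride and arsenic"
  )

wells_plot
```

Reshaping to long format first keeps the plot code short, since a single fill aesthetic then separates the two chemicals.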