library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(readxl)

Introduction

The goal of this assignment is to take datasets that are either messy or simply not tidy and to make them tidy datasets. The objective is to gain some familiarity with the functions for reading in data into R and calculating basic summary statistics on the data. In particular, we will make use of the following packages: dplyr, tidyr, readr, and readxl.

Before staring the quiz you will need to download the data for the quiz, which can be found in the file quiz_data.zip. The zip archive file contains two files:

daily_SPEC_2014.csv.bz2: a compressed CSV file containing daily measurements of particulate matter chemical constituents in the United States for the year 2014. Note that you should NOT have to decompress this file. The data are measured at a network of federal, state, and local monitors and assembled by the EPA. In this dataset, the “Arithmetic Mean” column provides the level of the indicated chemical constituent and the “Parameter.Name” column provides the name of the chemical constituent. The combination of a “State Code”, a “County Code”, and a “Site Num”, uniquely identifies a monitoring site (the location of which is provided by the “Latitude” and “Longitude” columns).

if(!file.exists("./data/pollution.csv")){
pollution <- read_csv("data/daily_SPEC_2014.csv.bz2")
write_csv(pollution, "./data/pollution.csv")
}
pollution <- read_csv("./data/pollution.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Parameter Code` = col_double(),
##   POC = col_double(),
##   Latitude = col_double(),
##   Longitude = col_double(),
##   `Pollutant Standard` = col_logical(),
##   `Date Local` = col_date(format = ""),
##   `Observation Count` = col_double(),
##   `Observation Percent` = col_double(),
##   `Arithmetic Mean` = col_double(),
##   `1st Max Value` = col_double(),
##   `1st Max Hour` = col_double(),
##   AQI = col_logical(),
##   `Method Code` = col_double(),
##   `Date of Last Change` = col_date(format = "")
## )

## See spec(...) for full column specifications.

aqs_sites.xlsx: An excel spreadsheet containing metadata about each of the monitoring sites in the United States where pollution measurements are made. In particular, the “Land Use” and “Location Setting” variables contain information about what kinds of areas the monitors are located in (i.e. “residential” vs. “forest”). Once the data have been downloaded to your working directory, you can begin the quiz assignment. For this assignment, you may want to review Sections 1.2 through 1.5 of Mastering Software Development in R.

head(pollution)

1.Question

Use the readr package to read the daily_SPEC_2014.csv.bz2 data file in to R. This file contains daily levels of fine particulate matter (PM2.5) chemical constituents across the United States. The data are measured at a network of federal, state, and local monitors and assembled by the EPA.

In this dataset, the “Sample.Value” (see bolded words a couple of paragraphs before) column provides the level of the indicated chemical constituent and the “Parameter.Name” column provides the name of the chemical constituent. The combination of a “State.Code”, a “County.Code”, and a “Site.Num”, uniquely identifies a monitoring site (the location of which is provided by the “Latitude” and “Longitude” columns).

For all of the questions below, you can ignore the missing values in the dataset, so when taking averages, just remove the missing values before taking the average (i.e. you can use na.rm = TRUE in the mean() function)

What is average Sample.Value for “Bromine PM2.5 LC” in the state of Wisconsin in this dataset?

Q1 <- pollution %>% filter(`Parameter Name`== "Bromine PM2.5 LC", `State Name` == "Wisconsin") %>% select(`Parameter Name`, `State Name`, `Arithmetic Mean`) 
Q1

unique(Q1[["Parameter Name"]])

## [1] "Bromine PM2.5 LC"

unique(Q1[["State Name"]])

## [1] "Wisconsin"

Q11 <-  mean(Q1[["Arithmetic Mean"]], na.rm=TRUE)
Q11

## [1] 0.003960482

2.Question

Calculate the average of each chemical constituent across all states, monitoring sites and all time points.

Which constituent Parameter.Name has the highest average level?

Q2 <- pollution %>% group_by(`Parameter Name`,`State Name`, `Site Num`, `Date Local`) %>% summarise(avg = mean(`Arithmetic Mean`, na.rm = TRUE))%>% arrange(desc(avg))
Q2

length(unique(Q2[["Parameter Name"]]))

## [1] 92

Question 3

Which monitoring site has the highest average level of “Sulfate PM2.5 LC” across all time?

Indicate the state code, county code, and site number.

Q3 <- pollution %>% filter(`Parameter Name` == "Sulfate PM2.5 LC") %>%
        group_by(`Site Num`,`County Code`,`State Code`) %>%
        summarise(avg = mean(`Arithmetic Mean`)) %>%
        arrange(desc(avg))
Q3

Question 4

What is the absolute difference in the average levels of “EC PM2.5 LC TOR” between the states California and Arizona, across all time and all monitoring sites?

Q4 <- pollution %>% filter(`State Name` %in% c("California", "Arizona"), `Parameter Name`=="EC PM2.5 LC TOR") %>% 
  group_by(`State Name`) %>% 
  summarise(avg_concentration = mean(`Arithmetic Mean`, na.rm = TRUE))
Q4

0.1977374    - 0.1791704

## [1] 0.018567

Question 5

What is the median level of “OC PM2.5 LC TOR” in the western United States, across all time? Define western as any monitoring location that has a Longitude LESS THAN -100.

Q5 <- pollution %>% filter(Longitude < -100, `Parameter Name` == "OC PM2.5 LC TOR") %>% select(Longitude, `Parameter Name`, `Arithmetic Mean`) %>% arrange(`Arithmetic Mean`)
Q5

median(Q5[["Arithmetic Mean"]])

## [1] 0.43

Question 6

Use the readxl package to read the file aqs_sites.xlsx into R (you may need to install the package first). This file contains metadata about each of the monitoring sites in the EPA’s monitoring system. In particular, the “Land Use” and “Location Setting” variables contain information about what kinds of areas the monitors are located in (i.e. “residential” vs. “forest”).

sites <- read_excel("./data/aqs_sites.xlsx")

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Expecting numeric in A20237 / R20237C1: got 'CC'

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Expecting numeric in A20238 / R20238C1: got 'CC'

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Expecting numeric in A20239 / R20239C1: got 'CC'

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
## sheet, : Expecting numeric in A20240 / R20240C1: got 'CC'

Q6 <- sites %>% filter(`Land Use` == "RESIDENTIAL", `Location Setting` =="SUBURBAN") %>% group_by(`Land Use`, `Location Setting`) %>% summarise(N = n())
Q6

Question 7

What is the median level of “EC PM2.5 LC TOR” amongst monitoring sites that are labelled as both “RESIDENTIAL” and “SUBURBAN” in the eastern U.S., where eastern is defined as Longitude greater than or equal to -100?

In order to join the tables we have to set the codes the same

pollution$`State Code` <- as.numeric(pollution$`State Code`)
pollution$`County Code` <- as.numeric(pollution$`County Code`)
pollution$`Site Num` <- as.numeric(pollution$`Site Num`)

Now you can join the two DFs on (State, County, Site Id)

joined_tbl <- inner_join(sites, pollution)

## Joining, by = c("State Code", "County Code", "Latitude", "Longitude", "Datum", "Local Site Name", "Address", "State Name", "County Name", "City Name", "CBSA Name")

Q7 <- joined_tbl %>% filter(`Location Setting`== "SUBURBAN", `Land Use`=="RESIDENTIAL", `Parameter Name`== "EC PM2.5 LC TOR", Longitude >= -100) %>% select(`State Code`, `County Code`, `Site Number`, `Parameter Name`, `Arithmetic Mean`) 
Q7

mean(Q7[["Arithmetic Mean"]])

## [1] 0.6309167

Question 8

Amongst monitoring sites that are labeled as COMMERCIAL for “Land Use”, which month of the year has the highest average levels of “Sulfate PM2.5 LC”?

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

Q8 <- joined_tbl %>% filter(`Land Use`=="COMMERCIAL", `Parameter Name`=="Sulfate PM2.5 LC") %>%
     group_by(month(`Date Local`, label = TRUE)) %>% 
     summarise(avg_concentration = mean(`Arithmetic Mean`)) %>%
     arrange(desc(avg_concentration))
Q8

Question 9:

Take a look at the data for the monitoring site identified by State Code 6, County Code 65, and Site Number 8001 (this monitor is in California). At this monitor, for how many days is the sum of “Sulfate PM2.5 LC” and “Total Nitrate PM2.5 LC” greater than 10?

For each of the chemical constituents, there will be some dates that have multiple Sample.Value’s at this monitoring site. When there are multiple values on a given date, take the average of the constituent values for that date.

Q9 <- joined_tbl %>% 
  filter(`State Code`==6, `County Code`==65, `Site Number`==8001, `Parameter Name` %in% c("Total Nitrate PM2.5 LC", "Sulfate PM2.5 LC")) %>% 
  select(`Site Number`, `Date Local`,`Parameter Name`, )
Q9

Q91 <- Q9 %>% mutate(month = month(`Date Local`), day = day(`Date Local`))
Q91