Writing Functions, Combining Files, Summarizing Data
QUESTION 1
Introduction
For this first programming assignment you will write three functions
that are meant to interact with the dataset that accompanies this
assignment. The dataset is contained in a zip file, specdata.zip, that
you can download from the Coursera web site.
Although this is a programming assignment, you will be assessed using
a separate quiz.
Data
The zip file containing the data can be downloaded here: Coursera
The zip file contains 332 comma-separated-value (CSV) files
containing pollution monitoring data for fine particulate matter (PM)
air pollution at 332 locations in the United States. Each file contains
data from a single monitor and the ID number for each monitor is
contained in the file name. For example, data for monitor 200 is
contained in the file “200.csv”. Each file contains the following variables:
Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

For this programming assignment you will need to unzip this file and
create the directory ‘specdata’. Once you have unzipped the zip file, do
not make any modifications to the files in the ‘specdata’ directory. In
each file you’ll notice that there are many days where either sulfate or
nitrate (or both) are missing (coded as NA). This is common with air
pollution monitoring data in the United States.
Part 1
Write a function named ‘pollutantmean’ that calculates the mean of a
pollutant (sulfate or nitrate) across a specified list of monitors. The
function ‘pollutantmean’ takes three arguments: ‘directory’,
‘pollutant’, and ‘id’. Given a vector of monitor ID numbers,
‘pollutantmean’ reads those monitors’ particulate matter data from the
directory specified in the ‘directory’ argument and returns the mean of
the pollutant across all of the monitors, ignoring any missing values
coded as NA. A prototype of the function is as follows:

You can see some example output from this function below. The
function that you write should be able to match this output.
Helper function for pollutant_mean()
Combine the files with the given ID numbers in the specdata folder.
library(dplyr) # %>%, bind_rows(), filter(), group_by(), summarize()
library(tidyr) # drop_na()
# 'ID' is the fourth column in each file
combine_csv <- function(directory = 'specdata', id = 1:332){
  df <- list.files(path = directory, full.names = TRUE) %>%
    lapply(read.csv) %>%
    bind_rows()
  # id = NULL keeps every monitor; otherwise keep only the requested IDs
  if(is.null(id)){
    df
  }else{
    df %>%
      filter(ID %in% id)
  }
}
pollutant_mean <- function(directory = 'specdata', pollutant = 'sulfate', id = 1:332){
  df <- combine_csv(directory = directory, id = id) # combine files in directory
  pollutantVec <- df[[pollutant]] # mean() takes a vector
  mean(pollutantVec, na.rm = TRUE) # average, ignoring NA values
}
pollutant_mean("specdata", "sulfate")
[1] 3.189369
pollutant_mean(id = 1:10)
[1] 4.064128
pollutant_mean("specdata", "nitrate", 70:72)
[1] 1.706047
pollutant_mean('specdata', 'nitrate', 23)
[1] 1.280833
QUESTION 2
Part 2
Write a function that reads a directory full of files and reports the
number of completely observed cases in each data file. The function
should return a data frame where the first column is the name of the
file and the second column is the number of complete cases. A prototype
of this function follows:

You can see some example output from this function below. The
function that you write should be able to match this output.
complete <- function(directory = 'specdata', id = NULL){
  df <- combine_csv(directory, id)
  df %>%
    drop_na() %>%
    group_by(id = ID) %>%
    summarize(nobs = n())
}
complete(id = 1)
complete(id = c(2, 4, 6, 8, 10, 12))
complete(id = 30:25)
complete(id = 3)
cc <- complete("specdata", c(6, 10, 20, 34, 100, 200, 310))
print(cc$nobs)
[1] 228 148 124 165 104 460 232
cc <- complete("specdata", 54)
print(cc$nobs)
[1] 219
QUESTION 3
Write a function that takes a directory of data files and a threshold
for complete cases and calculates the correlation between sulfate and
nitrate for monitor locations where the number of completely observed
cases (on all variables) is greater than the threshold. The function
should return a vector of correlations for the monitors that meet the
threshold requirement. If no monitors meet the threshold requirement,
then the function should return a numeric vector of length 0. A
prototype of this function follows:
knitr::include_graphics("corr_prototype.jpg")

For this function you will need to use the ‘cor’ function in R which
calculates the correlation between two vectors. Please read the help
page for this function via ‘?cor’ and make sure that you know how to use
it.
You can see some example output from this function below. The
function that you write should be able to approximately match this
output. Note that because of how R rounds and presents floating point
numbers, the output you generate may differ slightly from the example
output. Please save your code to a file named corr.R. To run the submit
script for this part, make sure your working directory has the file
corr.R in it.
corr <- function(directory = 'specdata', threshold = 0){
  df <- combine_csv(directory = directory, id = NULL)
  df2 <- df %>%
    drop_na() %>%
    group_by(id = ID) %>%
    summarize(n = n(), correl = cor(sulfate, nitrate)) %>%
    subset(subset = n > threshold, select = correl)
  df2[[1]]
}
corr() %>%
head()
[1] -0.22255256 -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667
cr <- corr('specdata', 150)
head(cr)
[1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814
summary(cr)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.21057 -0.04999  0.09463  0.12525  0.26844  0.76313
cr <- corr('specdata', 5000)
cr
numeric(0)
length(cr)
[1] 0
t <- corr("specdata")
length(t)
[1] 323
summary(t)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-1.00000 -0.05282  0.10718  0.13684  0.27831  1.00000
---
title: "Week 2: Programming Assignment"
output: html_notebook
---

### **Writing Functions, Combining Files, Summarizing Data**


##### **QUESTION 1**


Introduction

For this first programming assignment you will write three functions that are meant to interact with the dataset that accompanies this assignment. The dataset is contained in a zip file, specdata.zip, that you can download from the Coursera web site.

Although this is a programming assignment, you will be assessed using a separate quiz.

Data

The zip file containing the data can be downloaded here: Coursera

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains the following variables:

- Date: the date of the observation in YYYY-MM-DD format (year-month-day)
- sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
- nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

```{r file, echo=FALSE, warning=FALSE}
knitr::include_graphics("file_prototype.jpg")
```
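
As a minimal sketch of how a monitor ID can be mapped to its file, the snippet below builds a path with `sprintf()` and reads it with `read.csv()`. The zero-padding to three digits (for example "007.csv"), the helper name `monitor_path`, and the toy file written to a temporary directory are assumptions for illustration; the text above only confirms the name "200.csv", and the values are made up.

```r
# Hypothetical helper: build the path to one monitor's file.
# Assumes file names are zero-padded to three digits (e.g. "007.csv").
monitor_path <- function(directory, id) {
  file.path(directory, sprintf("%03d.csv", id))
}

# Write a toy file so the example runs without the real data.
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
toy <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03"),  # made-up dates
  sulfate = c(NA, 2.5, 3.5),
  nitrate = c(0.8, NA, 1.2),
  ID      = 7L
)
write.csv(toy, monitor_path(demo_dir, 7), row.names = FALSE)

# Read the single monitor back and inspect its columns.
one_monitor <- read.csv(monitor_path(demo_dir, 7))
str(one_monitor)
```

Reading only the files that were asked for avoids loading all 332 files when a handful of monitors is needed.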



For this programming assignment you will need to unzip this file and create the directory 'specdata'. Once you have unzipped the zip file, do not make any modifications to the files in the 'specdata' directory. In each file you'll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.
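
To illustrate how missing values affect a summary, here is a small sketch on made-up readings. The vector names and numbers are hypothetical and are not taken from the data set.

```r
# Made-up readings for one monitor; NA marks a missing measurement.
sulfate <- c(NA, 2.5, 3.5, NA, 4.0)
nitrate <- c(0.8, NA, 1.2, NA, 1.0)

# Count missing values in each series.
sum(is.na(sulfate))   # 2
sum(is.na(nitrate))   # 2

# A mean over data containing NA is NA unless missing values are dropped.
mean(sulfate)                 # NA
mean(sulfate, na.rm = TRUE)   # (2.5 + 3.5 + 4.0) / 3

# Days on which both pollutants were observed.
both_observed <- !is.na(sulfate) & !is.na(nitrate)
sum(both_observed)    # 2
```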

Part 1

Write a function named 'pollutantmean' that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function 'pollutantmean' takes three arguments: 'directory', 'pollutant', and 'id'. Given a vector of monitor ID numbers, 'pollutantmean' reads those monitors' particulate matter data from the directory specified in the 'directory' argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows:
 
```{r pollutant_mean, echo=FALSE, warning=FALSE}
knitr::include_graphics("pollutant_mean_prototype.jpg")

```
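
One reading of "the mean of the pollutant across all of the monitors" is a single pooled mean over every valid observation, rather than an average of per-monitor means. The sketch below assumes that pooled interpretation and uses two made-up monitors to show that the two quantities can differ when monitors contribute different numbers of readings.

```r
# Two made-up monitors with different numbers of valid readings.
monitor_a <- c(1, 2, 3, NA)   # three valid values
monitor_b <- c(10, NA)        # one valid value

# Pooled mean: stack all readings, then average once.
pooled <- mean(c(monitor_a, monitor_b), na.rm = TRUE)

# Mean of per-monitor means: weights every monitor equally instead.
mean_of_means <- mean(c(mean(monitor_a, na.rm = TRUE),
                        mean(monitor_b, na.rm = TRUE)))

pooled          # (1 + 2 + 3 + 10) / 4 = 4
mean_of_means   # (2 + 10) / 2 = 6
```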


You can see some example output from this function below. The function that you write should be able to match this output. 


##### **Helper function for pollutant_mean()**
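Combine the files with the given ID numbers in the specdata folder.

```{r}
library(dplyr) # %>%, bind_rows(), filter(), group_by(), summarize()
library(tidyr) # drop_na()

# Read every file in `directory` and stack the rows into one data frame.
# 'ID' is the fourth column in each file.
combine_csv <- function(directory = 'specdata', id = 1:332){

  df <- list.files(path = directory, full.names = TRUE) %>%
    lapply(read.csv) %>%
    bind_rows()

  # id = NULL keeps every monitor; otherwise keep only the requested IDs
  if(is.null(id)){
    df
  }else{
    df %>%
      filter(ID %in% id)
  }

}

```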


```{r}
pollutant_mean <- function(directory = 'specdata', pollutant = 'sulfate', id = 1:332){

  df <- combine_csv(directory = directory, id = id) # combine files in directory

  pollutantVec <- df[[pollutant]] # mean() takes a vector
  mean(pollutantVec, na.rm = TRUE) # average, ignoring NA values
}

```

```{r}
pollutant_mean("specdata", "sulfate")
```

```{r}
pollutant_mean(id = 1:10)
```

```{r}
pollutant_mean("specdata", "nitrate", 70:72)
```

```{r}
pollutant_mean('specdata', 'nitrate', 23)
```



##### **QUESTION 2**

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases. A prototype of this function follows:
 
```{r complete, echo=FALSE, warning=FALSE}
knitr::include_graphics("complete_prototype.jpg")
```
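
The counting step described above can be sketched in base R with `complete.cases()`. The toy data frame, the helper name `count_complete`, and the `id`/`nobs` column names are assumptions for illustration. This sketch reports monitor IDs rather than file names, keeps the rows in the order the IDs were requested, and reports 0 for a monitor with no complete rows.

```r
# Made-up data for two monitors stacked in one data frame.
toy <- data.frame(
  sulfate = c(1.0, NA, 2.0, 3.0, NA),
  nitrate = c(0.5, 0.7, 0.4, 0.9, NA),
  ID      = c(1, 1, 1, 2, 2)
)

# Hypothetical helper: count complete rows for each requested ID.
count_complete <- function(data, id) {
  ok <- complete.cases(data)   # TRUE for rows with no NA in any column
  nobs <- vapply(id, function(i) sum(ok & data$ID == i), integer(1))
  data.frame(id = id, nobs = nobs)
}

# IDs come back in the requested order; ID 3 is absent, so it gets 0.
count_complete(toy, c(2, 1, 3))
```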
 

You can see some example output from this function below. The function that you write should be able to match this output.



```{r}
complete <- function(directory = 'specdata', id = NULL){

  df <- combine_csv(directory, id)

  # Drop rows with any NA, then count the remaining rows per monitor
  df %>%
    drop_na() %>%
    group_by(id = ID) %>%
    summarize(nobs = n())
}

```

```{r}
complete(id = 1)
```

```{r}
complete(id = c(2, 4, 6, 8, 10, 12))
```

```{r}
complete(id = 30:25)
```

```{r}
complete(id = 3)
```

```{r}
cc <- complete("specdata", c(6, 10, 20, 34, 100, 200, 310))
print(cc$nobs)

```

```{r}
cc <- complete("specdata", 54)
print(cc$nobs)

```
 
##### **QUESTION 3**

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows:

```{r corr, echo=FALSE, warning=FALSE}
knitr::include_graphics("corr_prototype.jpg")
```
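
The threshold rule, including the empty result, can be sketched on made-up per-monitor summaries. The helper name `above_threshold` and all of the numbers are hypothetical. Logical subsetting returns `numeric(0)` on its own when no element qualifies, so no special case is needed.

```r
# Made-up per-monitor summaries: complete-case counts and correlations.
nobs   <- c(120, 40, 300)
correl <- c(-0.10, 0.25, 0.05)

# Hypothetical helper: keep correlations whose count exceeds the threshold.
above_threshold <- function(correl, nobs, threshold = 0) {
  correl[nobs > threshold]
}

above_threshold(correl, nobs, 100)            # -0.10  0.05
above_threshold(correl, nobs, 5000)           # numeric(0)
length(above_threshold(correl, nobs, 5000))   # 0
```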


For this function you will need to use the 'cor' function in R which calculates the correlation between two vectors. Please read the help page for this function via '?cor' and make sure that you know how to use it.
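
As a brief illustration of `cor()` in the presence of missing values, the sketch below uses made-up paired readings. With the default `use = "everything"`, any NA makes the result NA. Passing `use = "complete.obs"`, or dropping incomplete pairs first, restricts the calculation to pairs where both values are present.

```r
# Made-up paired readings; NA marks a missing measurement.
sulfate <- c(1, 2, NA, 4, 5)
nitrate <- c(2, 4, 6, NA, 10)

# Default behaviour: any NA propagates to the result.
r_default <- cor(sulfate, nitrate)

# Keep only pairs where both values are present.
r_complete <- cor(sulfate, nitrate, use = "complete.obs")

# Equivalent: drop incomplete pairs first, then correlate.
ok <- complete.cases(sulfate, nitrate)
r_manual <- cor(sulfate[ok], nitrate[ok])

r_default    # NA
r_complete   # the remaining pairs lie on a straight line
r_manual     # same value as r_complete
```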

You can see some example output from this function below. The function that you write should be able to approximately match this output. Note that because of how R rounds and presents floating point numbers, the output you generate may differ slightly from the example output. Please save your code to a file named corr.R. To run the submit script for this part, make sure your working directory has the file corr.R in it.
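
The remark about floating point output can be illustrated with a short sketch (the variable name `x` is arbitrary). Comparing results with `all.equal()` tolerates tiny differences that `==` does not, and printed values are rounded to a limited number of significant digits.

```r
x <- 0.1 + 0.2

x == 0.3                    # FALSE: binary floating point is inexact
isTRUE(all.equal(x, 0.3))   # TRUE: equal within a small tolerance

# print() shows 7 significant digits by default, so stored values
# can differ even when the printed output looks identical.
print(x)                # 0.3
print(x, digits = 17)   # 0.30000000000000004
```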


```{r}
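corr <- function(directory = 'specdata', threshold = 0){

  df <- combine_csv(directory = directory, id = NULL)

  # Per-monitor count of complete cases and sulfate/nitrate correlation,
  # keeping only monitors whose count exceeds the threshold
  df2 <- df %>%
    drop_na() %>%
    group_by(id = ID) %>%
    summarize(n = n(), correl = cor(sulfate, nitrate)) %>%
    subset(subset = n > threshold, select = correl)

  # Return the correlations as a plain numeric vector
  df2[[1]]
}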

```

```{r}
corr() %>% 
  head()
```

```{r}
cr <- corr('specdata', 150)
head(cr)
```
```{r}
summary(cr)
```

```{r}
cr <- corr('specdata', 5000)
cr
```
```{r}
length(cr)
```


```{r}
t <- corr("specdata")
length(t)

```
```{r}
summary(t)
```

