Air Pollution Project

Introduction

For this first programming assignment you will write three functions that are meant to interact with the dataset that accompanies this assignment. The dataset is contained in a zip file, specdata.zip, that you can download from the Coursera web site.

Data

Downloading data

The zip file containing the data can be downloaded here: specdata.zip

Set up directory for downloading data

downloadURL <- "https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip"
downloadedFile <- "./specdata.zip"

Download and unzip data

if(!file.exists(downloadedFile)) {
    download.file(downloadURL, downloadedFile, method = "curl")
    unzip(downloadedFile)
}

Data Summary

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables:

Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)
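
As a rough illustration of this layout, the sketch below writes a tiny made-up file in the same three-column format to a temporary directory and reads it back. The values, the temporary directory name, and the variable names are assumptions for the example only; they are not taken from specdata.

```r
## Create a throwaway directory so the example does not touch real data.
demo.dir <- file.path(tempdir(), "specdata-demo")
dir.create(demo.dir, showWarnings = FALSE)

## Write a made-up monitor file with the three documented columns.
demo.path <- file.path(demo.dir, "200.csv")
writeLines(c('"Date","sulfate","nitrate"',
             '"2003-01-01",NA,NA',
             '"2003-01-02",3.5,0.8'),
           demo.path)

## read.csv turns the literal NA entries into missing values.
monitor <- read.csv(demo.path)

## Dates arrive as text; convert them if date arithmetic is needed.
monitor$Date <- as.Date(monitor$Date)
str(monitor)
```

Missing measurements appear as NA, which is why the functions in the following parts have to handle NA values explicitly.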

Part 1

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

Function summary

pollutantmean(directory, pollutant, id = 1:332)

Input:

directory is a character vector of length 1 indicating the location of the CSV files
pollutant is a character vector of length 1 indicating the name of the pollutant for which we will calculate the mean; either “sulfate” or “nitrate”
id is an integer vector indicating the monitor ID numbers to be used

Output:

Return the mean of the pollutant across all monitors listed in the id vector (ignoring NA values)

Function definitions

Util functions

construct.path(id, directory): construct the full path from the main directory to a specific given csv file based on the id of the file. Example: if the directory and id arguments are data and 10, respectively, the result is data/010.csv.

construct.path <- function(id, directory) {
    return(paste(directory, 
                 "/", 
                 formatC(id, width = 3, flag = "0"), 
                 ".csv", 
                 sep = ""))     
}
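
For comparison, the same zero-padded path can be built with sprintf and file.path. This is only a sketch of an alternative; the name construct.path.alt is hypothetical and is not used elsewhere in this project.

```r
## Hypothetical alternative to construct.path, for illustration only.
construct.path.alt <- function(id, directory) {
    ## sprintf("%03d.csv", id) left-pads the ID with zeros to three digits;
    ## file.path joins the directory and file name with "/".
    file.path(directory, sprintf("%03d.csv", id))
}

## Both functions are vectorised over id.
paths <- construct.path.alt(c(1, 10, 200), "data")
print(paths)
```

The call returns data/001.csv, data/010.csv and data/200.csv, matching the naming scheme of the monitor files.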

read.monitor(id, directory): read the data of a particular monitor from its CSV file in the given directory.

read.monitor <- function(id, directory) {
    path <- construct.path(id, directory)
    return(read.csv(path))
}

Main function

library(plyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
    ## read all the data frames
    dfs <- lapply(id, read.monitor, directory)  
    
    ## concatenate data frames in dfs
    df <- ldply(dfs, rbind)
    
    return(mean(df[, pollutant], na.rm = TRUE))
}
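
The function pools the rows of all requested monitors and takes one mean, which is not the same as averaging the per-monitor means. The sketch below shows the difference on two made-up data frames (the values are not from specdata) and uses base R's do.call(rbind, ...) as one possible alternative to ldply for stacking the data frames.

```r
## Two made-up monitors with different numbers of valid sulfate readings.
monitor.a <- data.frame(sulfate = c(1, 2, 3, NA))
monitor.b <- data.frame(sulfate = c(10, NA, NA, NA))

## Pooled mean: stack the rows first, then average once, skipping NA.
pooled <- do.call(rbind, list(monitor.a, monitor.b))
pooled.mean <- mean(pooled$sulfate, na.rm = TRUE)

## Mean of per-monitor means: each monitor counts equally,
## regardless of how many readings it has.
mean.of.means <- mean(c(mean(monitor.a$sulfate, na.rm = TRUE),
                        mean(monitor.b$sulfate, na.rm = TRUE)))

print(c(pooled = pooled.mean, mean.of.means = mean.of.means))
```

Here the pooled mean is 4 while the mean of means is 6, so the choice matters whenever monitors contribute different numbers of non-missing readings.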

Some illustrations

pollutantmean("specdata", "sulfate", 1:10)

## [1] 4.064128

pollutantmean("specdata", "nitrate", 70:72)

## [1] 1.706047

pollutantmean("specdata", "nitrate", 23)

## [1] 1.280833

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

Function summary

complete(directory, id = 1:332)

Input

directory is a character vector of length 1 indicating the location of the CSV files
id is an integer vector indicating the monitor ID numbers to be used

Output

Return a data frame of the form:

##    id      nobs
##    1       117
##    2       1041
##    ...

where id is the monitor ID number and nobs is the number of complete cases.

Function definition

complete <- function(directory, id = 1:332) {
    ## read all the data frames
    dfs <- lapply(id, read.monitor, directory)  
    
    ## check whether an observation in dfs is complete
    cpte.cases.rough <- lapply(dfs, complete.cases)
    ## sum of completely observed cases
    cpte.cases.rough <- sapply(cpte.cases.rough, sum)
    
    ## return the result as a proper data frame
    return(data.frame(
        id = id,
        nobs = cpte.cases.rough
    ))
}
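
To make the counting step concrete, the sketch below applies complete.cases to a small made-up data frame (the values are not from specdata). A row counts as complete only when none of its columns is NA.

```r
## Made-up readings: row 1 lacks sulfate, row 2 lacks nitrate.
df <- data.frame(
    Date    = c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04"),
    sulfate = c(NA, 2.1, 3.4, 1.0),
    nitrate = c(0.5, NA, 0.7, 0.2)
)

## complete.cases returns one logical per row: TRUE if the row has no NA.
is.complete <- complete.cases(df)

## Summing the logical vector counts the TRUE entries.
nobs <- sum(is.complete)
print(is.complete)
print(nobs)
```

Only the last two rows are fully observed, so nobs is 2.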

Some illustrations

complete("specdata", 1)

##   id nobs
## 1  1  117

complete("specdata", c(2, 4, 8, 10, 12))

##   id nobs
## 1  2 1041
## 2  4  474
## 3  8  192
## 4 10  148
## 5 12   96

complete("specdata", 30:25)

##   id nobs
## 1 30  932
## 2 29  711
## 3 28  475
## 4 27  338
## 5 26  586
## 6 25  463

complete("specdata", 3)

##   id nobs
## 1  3  243

Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0.

Function summary

corr(directory, threshold = 0)

Input

directory is a character vector of length 1 indicating the location of the CSV files
threshold is a numeric vector of length 1 indicating the number of completely observed observations (on all variables) required to compute the correlation between nitrate and sulfate; the default is 0

Output

Return a numeric vector of correlations

Function definitions

Util function

get.cor(df): get the correlation between sulfate and nitrate in df.

get.cor <- function(df) cor(df$sulfate, df$nitrate, use = "complete.obs")
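
The use = "complete.obs" argument makes cor drop every pair in which either value is NA before computing the correlation. A small sketch on made-up values (not from specdata):

```r
get.cor <- function(df) cor(df$sulfate, df$nitrate, use = "complete.obs")

## Rows 4 and 5 each have one missing value and are ignored by cor.
demo <- data.frame(sulfate = c(1, 2, 3, NA, 5),
                   nitrate = c(2, 4, 6, 8, NA))

## The three remaining pairs lie on a straight line with positive slope.
r <- get.cor(demo)
print(r)
```

The remaining pairs (1, 2), (2, 4) and (3, 6) are perfectly linear, so the result is 1.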

Main function

corr <- function(directory, threshold = 0) {
    ## find the number of completely observed cases on each monitor
    cpte.cases <- complete(directory)
    
    ## filter the monitors that meet the threshold
    proper.cases <- cpte.cases[cpte.cases$nobs > threshold, ]
    
    ## read data of the monitors that meet the threshold
    dfs <- lapply(proper.cases$id, read.monitor, directory)  
    
    return(sapply(dfs, get.cor))
}
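
One caveat: when no monitor meets the threshold, sapply is applied to an empty list and returns an empty list rather than a numeric vector, which is what the summary for a threshold of 5000 in the illustrations reports. A possible adjustment, sketched here under the assumption that a strict numeric return type is wanted, is to use vapply with a numeric(1) template in the final step.

```r
get.cor <- function(df) cor(df$sulfate, df$nitrate, use = "complete.obs")

## Stand-in for the case where no monitor qualifies.
no.monitors <- list()

## sapply cannot infer a result type from zero inputs and returns list().
res.sapply <- sapply(no.monitors, get.cor)

## vapply is told each result is a single number, so it returns numeric(0).
res.vapply <- vapply(no.monitors, get.cor, numeric(1))

print(class(res.sapply))
print(class(res.vapply))
print(length(res.vapply))
```

With vapply the result has length 0 and class numeric, which matches the requirement stated at the start of Part 3; for non-empty input it returns the same correlations as sapply.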

Some illustrations

cr <- corr("specdata", 150)
head(cr)

## [1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814

summary(cr)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.21057 -0.04999  0.09463  0.12525  0.26844  0.76313

cr <- corr("specdata", 400)
head(cr)

## [1] -0.01895754 -0.04389737 -0.06815956 -0.07588814  0.76312884 -0.15782860

summary(cr)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.17623 -0.03109  0.10021  0.13969  0.26849  0.76313

cr <- corr("specdata", 5000)
summary(cr)

## Length  Class   Mode 
##      0   list   list

length(cr)

## [1] 0

cr <- corr("specdata")
summary(cr)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000 -0.05282  0.10718  0.13684  0.27831  1.00000

length(cr)

## [1] 323

nthehai01

2/7/2022
