Introduction

This analysis examines two major pollutants, sulfate and nitrate, in the United States by exploring the US Air Pollution dataset for fine particulate matter (PM) recorded at 332 locations.

Dataset

The dataset used for this analysis can be downloaded here: Pollution Dataset

The zip file contains 332 comma-separated-value (CSV) files with monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file holds the data from a single monitor, and the monitor's ID number is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables: Date, sulfate, and nitrate.

library(ggplot2)
library(data.table)
knitr::opts_chunk$set(echo = TRUE)
file <- "https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip"  # URL of the zipped dataset
filename <- "Pollution DataSet.zip"                                          # local name for the downloaded archive
if (!file.exists(filename)) {                                                # only fetch the archive if it is not already present
  download.file(file, filename, method = "curl")                             # downloads the archive from the URL
  unzip(filename)                                                            # extracts the CSV files
}
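Because the monitor ID is encoded in the file name, a file path can be derived directly from an ID. The sketch below assumes the names are zero-padded to three digits (as in "001.csv", which is read later in this analysis); `monitor_file` is a hypothetical helper introduced here for illustration, not part of the dataset or of the functions that follow.

```r
# Hypothetical helper: build the CSV path for one or more monitor IDs.
# Assumes file names are zero-padded to three digits, e.g. 1 -> "001.csv".
monitor_file <- function(directory, id) {
  file.path(directory, sprintf("%03d.csv", id))
}

paths <- monitor_file("specdata", c(1, 200))
```

Looking files up by name in this way avoids relying on the order in which a directory listing happens to return them.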

In each file you’ll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States. The following graph shows the day-by-day sulfate PM level, colored by the nitrate PM level, at one monitor location in the US.

setwd("./rprog_data_specdata/specdata")          # sets the working directory to the unzipped data folder
a <- read.csv("001.csv")                         # reads the data for one monitor location (file 001.csv)
a <- a[complete.cases(a), ]                      # drops every row that contains an NA value
ggplot(a, aes(Date, sulfate, col = nitrate)) +   # plots sulfate against date, colored by nitrate
  geom_point(size = 3) +                         # draws the points at size 3
  theme(axis.text.x = element_text(angle = 90),  # rotates the x-axis labels by 90 degrees
        axis.text = element_text(size = 12),     # enlarges the tick labels on both axes
        axis.title.y.left = element_text(size = 18),
        axis.title.x.bottom = element_text(size = 18))  # enlarges both axis titles
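To see how much data the NA filter removes before plotting, the number of missing values per column can be compared with the number of fully observed rows. The data frame below is invented for illustration and only borrows the column names of the monitor files.

```r
# Toy data frame with the same column names as a monitor file (values are invented).
toy <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04"),
  sulfate = c(NA, 3.2, 4.1, NA),
  nitrate = c(NA, NA, 0.7, 1.5)
)

na_per_column <- colSums(is.na(toy))       # number of NA values in each column
n_complete    <- sum(complete.cases(toy))  # rows with no NA in any column
```

A row counts as complete only when every column is observed, so the number of complete rows can be smaller than the number of non-missing values in any single column.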

Complete or Real cases

This function reads a directory full of files and reports the number of completely observed cases in each data file. The function returns a data frame where the first column is the monitor ID (which is also the name of the file) and the second column is the number of complete cases.

complete <- function(directory, id = 1:332) {
  nobs <- numeric()                                   # creates an empty numeric vector
  my_files <- list.files(path = directory, pattern = "\\.csv$",
                         full.names = TRUE)           # lists all CSV files with their full paths
  for (i in id) {
    data <- read.csv(my_files[i])                     # reads the file for monitor i
    s <- sum(complete.cases(data))                    # counts the rows with no NA values
    nobs <- c(nobs, s)                                # appends the count to the existing vector
  }
  data.frame(id, nobs)                                # creates the required data frame
}
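The counting step can be tried on a throwaway directory without downloading the dataset. The following is a minimal sketch under stated assumptions: the two CSV files and their values are invented, and `count_complete` is a hypothetical variant written for this example (it uses `vapply` instead of growing a vector), not the function above.

```r
# Build a temporary directory with two tiny monitor files (invented values).
demo_dir <- file.path(tempdir(), "complete_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date    = c("2003-01-01", "2003-01-02", "2003-01-03"),
                     sulfate = c(1.0, NA, 3.0),
                     nitrate = c(0.5, 0.6, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date    = c("2003-01-01", "2003-01-02"),
                     sulfate = c(2.0, 2.5),
                     nitrate = c(0.4, 0.3)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Hypothetical variant: one count of complete rows per requested file.
count_complete <- function(directory, id) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  nobs <- vapply(files[id],
                 function(f) sum(complete.cases(read.csv(f))),
                 integer(1), USE.NAMES = FALSE)
  data.frame(id = id, nobs = nobs)
}

result <- count_complete(demo_dir, 1:2)
```

In this toy data only the first row of "001.csv" is fully observed, while both rows of "002.csv" are.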

Pollutant Mean

The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  my_list <- list.files(path = directory, pattern = "\\.csv$",
                        full.names = TRUE)  # lists all CSV files with their full paths
  x <- numeric()                            # creates an empty numeric vector
  for (i in id) {
    data <- read.csv(my_list[i])            # reads the file for monitor i
    x <- c(x, data[[pollutant]])            # appends that monitor's readings to the pooled vector
  }
  mean(x, na.rm = TRUE)                     # computes the mean of the pooled readings, ignoring NA values
}
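Note that the function pools every reading into one vector before averaging, so each monitor contributes in proportion to its number of non-missing readings. This is not the same as averaging the per-monitor means, as the small sketch below illustrates with invented numbers.

```r
# Invented sulfate readings for two monitors with different amounts of data.
monitor_a <- c(2, 4, NA)        # two usable values
monitor_b <- c(10, NA, NA, NA)  # one usable value

# Pooled mean: all readings in one vector, as pollutantmean does.
pooled_mean <- mean(c(monitor_a, monitor_b), na.rm = TRUE)

# Mean of per-monitor means: each monitor weighted equally.
mean_of_means <- mean(c(mean(monitor_a, na.rm = TRUE),
                        mean(monitor_b, na.rm = TRUE)))
```

Here the pooled mean is 16/3 (about 5.33), while the mean of the two monitor means is 6.5.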

Correlation

This function takes a directory of data files and a threshold for complete cases, and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function returns a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, the function returns a numeric vector of length 0.

corr <- function(directory, threshold = 0) {
  my_files <- list.files(path = directory, pattern = "\\.csv$",
                         full.names = TRUE)             # lists all CSV files with their full paths
  a <- complete(directory)                              # gets the number of complete cases per monitor
  ids <- a$id[a$nobs > threshold]                       # keeps the IDs whose complete cases exceed the threshold
  correlations <- numeric()                             # creates an empty numeric vector
  for (i in ids) {
    data <- read.csv(my_files[i])                       # reads the file for monitor i
    ac <- data[complete.cases(data), ]                  # keeps only the rows with no NA values
    correlations <- c(correlations,
                      cor(ac$sulfate, ac$nitrate))      # appends the correlation to the existing vector
  }
  return(correlations)
}
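The threshold behaviour, including the empty result, can be illustrated without any files. The sketch below is an in-memory analogue under stated assumptions: the two data frames are invented, and `corr_in_memory` is a hypothetical function written for this example that takes a list of data frames rather than a directory.

```r
# Invented data for two monitors, held in a list instead of CSV files.
monitors <- list(
  data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 1)),
  data.frame(sulfate = c(5, NA),       nitrate = c(1, 2))
)

# Hypothetical analogue of corr(): keep monitors whose number of complete
# rows exceeds the threshold, then correlate sulfate with nitrate.
corr_in_memory <- function(monitors, threshold = 0) {
  out <- numeric(0)
  for (df in monitors) {
    ok <- df[complete.cases(df), ]
    if (nrow(ok) > threshold) {
      out <- c(out, cor(ok$sulfate, ok$nitrate))
    }
  }
  out
}

r_low  <- corr_in_memory(monitors, threshold = 2)   # only the first monitor qualifies
r_high <- corr_in_memory(monitors, threshold = 10)  # no monitor qualifies
```

With a threshold of 2, only the first monitor (three complete rows) is kept; with a threshold of 10, the result is a numeric vector of length 0, matching the behaviour described above.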