An analysis of two major pollutants, sulfates and nitrates, in the United States, based on the US Air Pollution dataset of fine particulate matter (PM) measurements at 332 locations.
The dataset used for this analysis can be downloaded here: Pollution Dataset
The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor, and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables: Date, sulfate, and nitrate.
library(ggplot2)
library(data.table)
knitr::opts_chunk$set(echo = TRUE)
file<-"https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip" #assigns the URL to the object
filename<-"Pollution DataSet.zip" #assigns the filename to the object
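if(!file.exists(filename)){ #checks whether the zip file has already been downloaded
  download.file(file,filename,mode="wb") #if not, downloads it from the URL in binary mode
  unzip(filename,exdir="rprog_data_specdata") #unzips into rprog_data_specdata (the archive is assumed to hold a top-level specdata folder)
}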
In each file you’ll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States. The following graph shows the day-by-day changes in the levels of sulfate and nitrate PM at one US location, using only the days on which both values were recorded.
setwd("./rprog_data_specdata/specdata") #sets the working directory to the folder holding the monitor files
a<-read.csv("001.csv") #reads the CSV file for one particular location (monitor 1)
a<-a[complete.cases(a),] #removes all rows that contain an NA value
ggplot(a,aes(Date,sulfate,col=nitrate))+ #maps date to x, sulfate to y and nitrate to colour
  geom_point(size=3)+ #draws every point at a fixed size
  theme(axis.text.x=element_text(angle=90),axis.text = element_text(size=12), axis.title.y.left = element_text(size = 18), axis.title.x.bottom = element_text(size=18)) #rotates the x-axis labels by 90 degrees and enlarges the text and titles of both axes
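Before dropping incomplete rows, it can help to check how much data is actually missing. The following is a minimal sketch on a small invented data frame (`toy` is hypothetical and not taken from the dataset) that uses the same column names as the monitor files. It counts the missing values per column and the number of fully observed rows.

```r
# Hypothetical toy data with the same column names as a monitor file;
# the values are invented for illustration only.
toy <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04"),
  sulfate = c(NA, 3.2, 4.1, NA),
  nitrate = c(NA, 0.5, NA, 1.1),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

# Keep only the fully observed rows, as is done before plotting
toy_complete <- toy[complete.cases(toy), ]
```

In this toy case only one of the four rows survives, which shows how quickly filtering on complete cases can shrink a monitor's record.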
This function reads a directory full of files and reports the number of completely observed cases in each data file. The function returns a data frame where the first column is the monitor ID (taken from the file name) and the second column is the number of complete cases.
complete<-function(directory,id=1:332){
  nobs<-numeric() #creates an empty numeric vector
  my_files<-list.files(path=directory,pattern="\\.csv$",full.names=TRUE) #lists all CSV files with their full paths
  for(i in id){
    data<-read.csv(my_files[i]) #reads the file for monitor i (assumes the sorted file list matches the IDs)
    s<-sum(complete.cases(data)) #counts the rows that have no missing values
    nobs<-c(nobs,s) #appends the count to the existing vector
  }
  data.frame(id,nobs) #creates the required data frame
}
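The idea behind the function above can be tried without the full dataset. The sketch below is not the author's function: `complete_sketch`, `demo_dir`, and the two tiny CSV files are hypothetical, written to a temporary directory purely for illustration. It also builds file names such as "001.csv" directly from the monitor IDs, which is one way to avoid relying on the order of the file listing.

```r
# Hypothetical helper mirroring the idea of complete(); the name
# complete_sketch and the toy files below are assumptions for illustration.
complete_sketch <- function(directory, id) {
  # Build paths such as "<directory>/001.csv" directly from the monitor IDs
  paths <- file.path(directory, sprintf("%03d.csv", id))
  # Count the fully observed rows in each file
  nobs <- vapply(paths, function(p) sum(complete.cases(read.csv(p))),
                 integer(1), USE.NAMES = FALSE)
  data.frame(id = id, nobs = nobs)
}

# Write two tiny made-up monitor files into a temporary directory
demo_dir <- tempfile("specdata_demo")
dir.create(demo_dir)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02", "2003-01-03"),
                     sulfate = c(1.0, NA, 3.0),
                     nitrate = c(0.2, 0.4, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(2.0, 2.5),
                     nitrate = c(0.3, 0.6)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# One row per monitor: its ID and its number of complete cases
result <- complete_sketch(demo_dir, id = 1:2)
print(result)
```

Here monitor 1 has a single complete row and monitor 2 has two, so the returned data frame has `nobs` equal to 1 and 2.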
The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.
pollutantmean<-function(directory,pollutant,id=1:332){
  my_list<-list.files(path=directory,pattern="\\.csv$",full.names=TRUE) #lists all CSV files with their full paths
  x<-numeric() #creates an empty numeric vector
  for(i in id){
    data<-read.csv(my_list[i]) #reads the file for monitor i and assigns it to the data object
    x<-c(x,data[[pollutant]]) #appends that monitor's pollutant values to the existing vector
  }
  mean(x,na.rm=TRUE) #computes the mean of the pooled values after removing NA values
}
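Because the values from all selected monitors are pooled before averaging, monitors with more observations carry more weight. The sketch below uses two invented vectors (`monitor_a` and `monitor_b` are hypothetical, not real readings) to contrast the pooled mean with a mean of per-monitor means, which is a different quantity.

```r
# Made-up sulfate readings for two hypothetical monitors
monitor_a <- c(1, NA, 3)    # two observed values
monitor_b <- c(5, 5, 5, 5)  # four observed values

# Pooled mean: concatenate first, then average while ignoring NA
pooled_mean <- mean(c(monitor_a, monitor_b), na.rm = TRUE)

# Mean of per-monitor means: weights each monitor equally instead
mean_of_means <- mean(c(mean(monitor_a, na.rm = TRUE),
                        mean(monitor_b, na.rm = TRUE)))
```

With these numbers the pooled mean is 4, while the mean of means is 3.5, so the two approaches should not be used interchangeably.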
This function takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function returns a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, the function returns a numeric vector of length 0.
corr<-function(directory,threshold=0){
  my_files<-list.files(path=directory,pattern="\\.csv$",full.names=TRUE) #lists all CSV files with their full paths
  a<-complete(directory) #gets the number of complete cases for every monitor
  ids<-a$id[a$nobs>threshold] #creates a vector of the IDs whose complete cases exceed the threshold
  corr<-numeric() #creates an empty numeric vector
  for(i in ids){
    data<-read.csv(my_files[i]) #reads the file for monitor i
    ac<-data[complete.cases(data), ] #keeps only the rows without NA values
    corr<-c(corr,cor(ac$sulfate,ac$nitrate)) #appends the correlation to the existing vector
  }
  return(corr)
}
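The threshold logic can be checked on in-memory data. The sketch below is a simplified stand-in, not the function above: `corr_sketch` takes a list of data frames instead of a directory, and `m1` and `m2` are invented monitors used only for illustration.

```r
# Hypothetical helper; corr_sketch and the toy monitors are illustrative assumptions
corr_sketch <- function(monitors, threshold = 0) {
  out <- numeric()  # stays length 0 if no monitor passes the threshold
  for (df in monitors) {
    ok <- df[complete.cases(df), ]  # keep fully observed rows only
    if (nrow(ok) > threshold) {
      out <- c(out, cor(ok$sulfate, ok$nitrate))
    }
  }
  out
}

# m1 has three complete rows in which nitrate is exactly twice sulfate
m1 <- data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 1))
# m2 has no complete rows at all
m2 <- data.frame(sulfate = c(1, NA), nitrate = c(NA, 2))

r_some <- corr_sketch(list(m1, m2), threshold = 2)   # only m1 qualifies
r_none <- corr_sketch(list(m1, m2), threshold = 10)  # no monitor qualifies
```

With a threshold of 2 only `m1` contributes, and its correlation is 1 because the two columns are perfectly linearly related. With a threshold of 10 the result is a numeric vector of length 0, matching the behaviour described in the text.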