OBJETIVE
Calculate the average sulfate or nitrate particle contamination, with data obtained at 332 locations in the United States, each site has generated a monitoring file.
DESCRIPTION
Each file contains several records, each record contains 4 variables:
• Date: date of observation in YYYY-MM-DD format (year-month-day)
• sulphate: the level of PM sulphate in the air at that date (measured in micrograms per cubic metre)
• nitrate: the level of PM nitrate in the air at that date (measured in micrograms per cubic meter)
• Id: file cabinet.
Note, that many registers contain variable with the empty value (NA), which is common in monitoring.
TASKS
Homework 1
Develop a function to calculate the average pollutant by sulphate or nitrate, passing three parameters:
Parameter 1 = the directory where the files are located.
Parameter 2 = the pollutant to be calculated; sulphate or nitrate.
Parameter3 = the list of files to be processed.
Download the data:
# we create a work folder
# mkdir workTmp
# cd workTmp
# folder where the files with the data will be, WE DO NOT CREATE IT
# THE SAME SYSTEM WILL CREATE IT THE FIRST TIME, ONLY ONCE
folderData <- "data"
# we downloaded the file only once
if( !file.exists(folderData) ){
url = "https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip"
download.file(url, "urlFileData.zip")
unzip("urlFileData.zip", exdir = "data")
}
# after downloading the data - files.
# we inspect a file, visualizing the content.
file.temp = read.csv(file="data/specdata/001.csv", header = TRUE, sep = ",")
head(file.temp)
## Date sulfate nitrate ID
## 1 2003-01-01 NA NA 1
## 2 2003-01-02 NA NA 1
## 3 2003-01-03 NA NA 1
## 4 2003-01-04 NA NA 1
## 5 2003-01-05 NA NA 1
## 6 2003-01-06 NA NA 1
We created the function to calculate the average pollutant by sulfate or nitrate.
# function to be created
pollutantmean <- function(path.files, pollutant, id = 0) {
# We store all the names of the csv files
list.filenames <- list.files(path.files, pattern="*.csv")
# We create a numerical vector
list.values <- numeric()
# we create a validation if the value 0 is entered
if(length(id) == 1){
if(id == 0){
id = 332
}
}
# we created a loop to go through the values of 1:X
for(i in c(id) ) {
# We process the files
# concatenate path.files/list.filenames[i]
filename <- paste(path.files, list.filenames[i], sep = "/")
# We load the file
df.File <- read.csv(filename, head=TRUE, sep=",")
# We get the values of sulfate or nitrate
if(pollutant == "sulfate"){
valores <- df.File$sulfate[!is.na(df.File$sulfate)]
}
else{
valores <- df.File$nitrate[!is.na(df.File$nitrate)]
}
# The values of each register accumulate
list.values <- append(list.values, valores)
}
# The average is calculated
return(mean(list.values))
}
# Example, calculate the average sulfate contamination for the archives
# from 001.csv to 010.csv
pollutantmean ("data/specdata", "sulfate", 1:10)
## [1] 4.064128
Conclusion
For a group of files from 1 to 10 the mean sulphate contamination is calculated to be equal to 4.064128.
Homework 2
Write a function that reads a directory full of files and reports the number of completely observed cases in each data file for sulfates and nitrates.
Make a sum for each day observed of the variables.
complete <- function(path.files, id) {
#We store all file names
list.filenames <- list.files(path.files, pattern="*.csv")
id.file <- integer()
num.filas1 <- integer()
num.filas2 <- integer()
for(i in c(id) ) {
#file name processing
filename <- paste(path.files, list.filenames[i], sep = "/")
#load the file contents
df.File <- read.csv(filename, head=TRUE, sep=",")
filas1 <- length(df.File$sulfate[!is.na(df.File$sulfate)])
filas2 <- length(df.File$nitrate[!is.na(df.File$nitrate)])
id.file <- append(id.file, i)
num.filas1 <- append(num.filas1, filas1)
num.filas2 <- append(num.filas2, filas2)
}
rspta <- data.frame(id.file , num.filas1, num.filas2)
#rspta.names <- c("id", "nobs")
colnames(rspta) <- c("id.file", "nobs.sulfate", "nobs.nitrate")
rspta
}
# Example
tmp <- complete ("data/specdata", 1:10)
tmp
## id.file nobs.sulfate nobs.nitrate
## 1 1 117 122
## 2 2 1041 1051
## 3 3 243 249
## 4 4 474 479
## 5 5 402 405
## 6 6 228 229
## 7 7 442 453
## 8 8 192 194
## 9 9 275 279
## 10 10 148 183
Conclusion
We obtain the accumulated values of the files analyzed by each record.
Homework 3
Write a function that takes a directory of data files and a threshold for full cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of cases completely.
The correlation coefficient provides a measure of how two random variables are associated in a “sample”.
tmp.cor <- cor(tmp)
tmp.cor
## id.file nobs.sulfate nobs.nitrate
## id.file 1.0000000 -0.3812529 -0.3665732
## nobs.sulfate -0.3812529 1.0000000 0.9993217
## nobs.nitrate -0.3665732 0.9993217 1.0000000
plot(tmp$nobs.sulfate, tmp$nobs.nitrate)
Conclusion
The correlation between the two variables is strong, if one variable grows the other also grows.
source code: https://github.com/magzupao/DataSciences-Course2-programming-assignment-1-instructions-air-pollution