DATA SCIENCE - R programming, Air pollution

Data Science Specialization - course 2 - 1

OBJECTIVE
Calculate the average sulfate or nitrate particle pollution from data collected at 332 monitoring locations in the United States. Each location has generated one monitoring file.

DESCRIPTION
Each file contains several records, and each record contains 4 variables:
• Date: date of the observation in YYYY-MM-DD format (year-month-day)
• sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
• nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)
• ID: identifier of the monitor that produced the file.

Note that many records contain missing values (NA), which is common in monitoring data.
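Missing values propagate through arithmetic in R, so the mean of a column that contains NA is itself NA unless the missing values are dropped first. A minimal sketch with made-up numbers (the vector below is illustrative, not taken from the monitoring files):

```r
# Illustrative values only; not taken from the monitoring files
sulfate <- c(3.2, NA, 4.1, NA, 5.0)

mean.with.na    <- mean(sulfate)                # NA, because NA propagates
mean.without.na <- mean(sulfate, na.rm = TRUE)  # mean of the 3 observed values

mean.with.na
mean.without.na
```

This is why the functions in the tasks below filter out NA values before computing anything.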

TASKS

Homework 1

Develop a function that calculates the average level of a pollutant (sulfate or nitrate). It takes three parameters:
Parameter 1 = the directory where the files are located.
Parameter 2 = the pollutant to be averaged: sulfate or nitrate.
Parameter 3 = the ids of the monitor files to be processed.

Download the data:

# we create a work folder
# mkdir workTmp
# cd workTmp
# folder that will hold the data files; do not create it by hand,
# the script below creates it on the first run only

folderData <- "data"

# we download the file only once
if( !file.exists(folderData) ){
  url <- "https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip"
  # mode = "wb" keeps the zip file intact on Windows
  download.file(url, "urlFileData.zip", mode = "wb")
  unzip("urlFileData.zip", exdir = folderData)
}

# after downloading the data files,
# we inspect one of them by viewing its first rows.

file.temp <- read.csv(file = "data/specdata/001.csv", header = TRUE, sep = ",")
head(file.temp)

##         Date sulfate nitrate ID
## 1 2003-01-01      NA      NA  1
## 2 2003-01-02      NA      NA  1
## 3 2003-01-03      NA      NA  1
## 4 2003-01-04      NA      NA  1
## 5 2003-01-05      NA      NA  1
## 6 2003-01-06      NA      NA  1
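The NA entries in these first rows suggest checking how many values are missing in each column. A small sketch on a made-up data frame (the values and the name na.per.column are illustrative assumptions, not taken from 001.csv):

```r
# Hypothetical miniature of one monitor file (values are made up)
df <- data.frame(
  Date    = as.Date("2003-01-01") + 0:3,
  sulfate = c(NA, 2.5, NA, 3.1),
  nitrate = c(NA, 0.8, 1.2, NA),
  ID      = 1L
)

# Number of missing values in each column
na.per.column <- colSums(is.na(df))
na.per.column
```

The same call could be applied to file.temp to see how sparse a real monitor file is.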

We create a function that calculates the average level of a pollutant (sulfate or nitrate).

# function to be created
pollutantmean <- function(path.files, pollutant, id = 0) {
  
  # We store the names of all the csv files (pattern is a regular expression)
  list.filenames <- list.files(path.files, pattern = "\\.csv$")
  
  # We create a numerical vector
  list.values <- numeric()
  
  # an id of 0 (the default) means "process every file found"
  if(length(id) == 1 && id == 0){
    id <- seq_along(list.filenames)
  }
  
  # we loop over the requested file ids
  for(i in c(id) ) {
    
    # We process the files
    # concatenate path.files/list.filenames[i]
    filename <- paste(path.files, list.filenames[i], sep = "/")
    
    # We load the file
    df.File <- read.csv(filename, header = TRUE, sep = ",")
    
    # We get the non-missing values of sulfate or nitrate
    # (any pollutant other than "sulfate" is treated as nitrate)
    if(pollutant == "sulfate"){
      valores <- df.File$sulfate[!is.na(df.File$sulfate)]
    }
    else{
      valores <- df.File$nitrate[!is.na(df.File$nitrate)]
    }
    
    # The values from each file are accumulated
    list.values <- append(list.values, valores)
  }
  
  # The average is calculated
  return(mean(list.values))
}

# Example: calculate the average sulfate level for the files
# from 001.csv to 010.csv
pollutantmean("data/specdata", "sulfate", 1:10)

## [1] 4.064128

Conclusion
For files 001 to 010, the mean sulfate level is 4.064128 micrograms per cubic meter.
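The function above picks files by their position in the directory listing. As a hedged alternative, the file name can be built directly from the monitor id, and the missing values dropped with na.rm. The helper name pollutantmean.sketch and the two demo files are assumptions made for illustration; they are not part of the original assignment code.

```r
# Hypothetical helper, not the author's function: builds file names from
# monitor ids (1 -> "001.csv") instead of indexing into list.files()
pollutantmean.sketch <- function(path.files, pollutant, id) {
  filenames <- file.path(path.files, sprintf("%03d.csv", id))
  # read every requested file and stack the rows into one data frame
  df.all <- do.call(rbind, lapply(filenames, read.csv))
  mean(df.all[[pollutant]], na.rm = TRUE)
}

# Self-contained demo on two made-up files in a temporary directory
demo.dir <- tempfile("specdata_demo")
dir.create(demo.dir)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(1, NA), nitrate = c(NA, 2), ID = 1),
          file.path(demo.dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(3, 5), nitrate = c(4, NA), ID = 2),
          file.path(demo.dir, "002.csv"), row.names = FALSE)

demo.mean <- pollutantmean.sketch(demo.dir, "sulfate", 1:2)
demo.mean  # mean of the observed sulfate values 1, 3 and 5
```

Pooling all observed values before averaging matches the approach of the function above, where each file's non-missing values are appended to one vector.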

Homework 2

Write a function that reads a directory full of files and reports, for each data file, the number of observed (non-missing) cases for sulfate and for nitrate.

For each file, count the days on which each variable was observed.
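Counting observed values per variable is not the same as counting rows in which both variables are observed. The sketch below contrasts the two on made-up records; complete.cases() is the base R function for the second kind of count, and the variable names are illustrative.

```r
# Made-up records; row 2 is the only one with both pollutants observed
df <- data.frame(
  sulfate = c(NA, 2.5, NA, 3.1),
  nitrate = c(NA, 0.8, 1.2, NA)
)

n.sulfate  <- sum(!is.na(df$sulfate))  # observed sulfate values
n.nitrate  <- sum(!is.na(df$nitrate))  # observed nitrate values
n.complete <- sum(complete.cases(df))  # rows where both are observed

c(sulfate = n.sulfate, nitrate = n.nitrate, complete = n.complete)
```

The function that follows reports the first two counts, one column per pollutant.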

complete  <- function(path.files, id) {
  
  # We store the names of all the csv files (pattern is a regular expression)
  list.filenames <- list.files(path.files, pattern = "\\.csv$")
  
  id.file <- integer()
  num.filas1 <- integer()
  num.filas2 <- integer()
  
  for(i in c(id) ) {
    
    #file name processing
    filename <- paste(path.files, list.filenames[i], sep = "/")
    
    # load the file contents
    df.File <- read.csv(filename, header = TRUE, sep = ",")
    
    # count the non-missing values of each pollutant in this file
    filas1 <- length(df.File$sulfate[!is.na(df.File$sulfate)])
    filas2 <- length(df.File$nitrate[!is.na(df.File$nitrate)])
    
    id.file <- append(id.file, i)
    num.filas1 <- append(num.filas1, filas1)
    num.filas2 <- append(num.filas2, filas2)
    
  }
  
  # one row per file: its id and the observation counts for each pollutant
  rspta <- data.frame(id.file = id.file,
                      nobs.sulfate = num.filas1,
                      nobs.nitrate = num.filas2)
  
  rspta
  
}

# Example: count the observations in the files from 001.csv to 010.csv
tmp <- complete("data/specdata", 1:10)
tmp

##    id.file nobs.sulfate nobs.nitrate
## 1        1          117          122
## 2        2         1041         1051
## 3        3          243          249
## 4        4          474          479
## 5        5          402          405
## 6        6          228          229
## 7        7          442          453
## 8        8          192          194
## 9        9          275          279
## 10      10          148          183

Conclusion
For each file analyzed, we obtain the number of observed sulfate records and the number of observed nitrate records.

Homework 3

Write a function that takes a directory of data files and a threshold for complete cases, and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases exceeds the threshold.

The correlation coefficient measures how strongly two random variables are linearly associated in a sample. Below, it is computed on the observation counts that complete() returned for files 1 to 10.
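For comparison, the task statement itself (one sulfate/nitrate correlation per monitor, only for monitors above the threshold) could be sketched roughly as follows. The function name corr.sketch, the default threshold, the strict "greater than" comparison and the two demo files are all assumptions made for illustration.

```r
# Hypothetical sketch of the task statement; names and defaults are illustrative
corr.sketch <- function(path.files, threshold = 0) {
  filenames <- list.files(path.files, pattern = "\\.csv$", full.names = TRUE)
  result <- numeric()
  for (f in filenames) {
    df <- read.csv(f)
    # rows where both pollutants were observed
    ok <- complete.cases(df$sulfate, df$nitrate)
    # keep only monitors with more complete cases than the threshold
    if (sum(ok) > threshold) {
      result <- c(result, cor(df$sulfate[ok], df$nitrate[ok]))
    }
  }
  result
}

# Self-contained demo on two made-up files in a temporary directory
corr.dir <- tempfile("corr_demo")
dir.create(corr.dir)
# 3 complete cases, nitrate is exactly twice sulfate
write.csv(data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 1), ID = 1),
          file.path(corr.dir, "001.csv"), row.names = FALSE)
# only 1 complete case, so this monitor is skipped when threshold = 2
write.csv(data.frame(sulfate = c(1, NA, NA), nitrate = c(5, 6, NA), ID = 2),
          file.path(corr.dir, "002.csv"), row.names = FALSE)

demo.cor <- corr.sketch(corr.dir, threshold = 2)
demo.cor  # a single correlation, from the first file only
```

Unlike the code that follows, this sketch correlates the pollutant levels inside each file rather than the observation counts across files.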

tmp.cor <- cor(tmp)
tmp.cor

##                 id.file nobs.sulfate nobs.nitrate
## id.file       1.0000000   -0.3812529   -0.3665732
## nobs.sulfate -0.3812529    1.0000000    0.9993217
## nobs.nitrate -0.3665732    0.9993217    1.0000000

plot(tmp$nobs.sulfate, tmp$nobs.nitrate)

Conclusion
The correlation between the number of sulfate observations and the number of nitrate observations is very strong (about 0.999): files with more sulfate records also have more nitrate records.

source code: https://github.com/magzupao/DataSciences-Course2-programming-assignment-1-instructions-air-pollution

Marco Guado

October 2016