This is my answer to the assignment in the R Programming course by Johns Hopkins University.

Part 1

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector of monitor ID numbers, ‘pollutantmean’ reads those monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

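pollutantmean <- function(directory, pollutant, id = 1:332) {
  # List the monitor files; this relies on the sorted file order matching the monitor IDs
  files_list <- list.files(directory, full.names = TRUE)
  # Read each requested monitor's file and stack the rows into one data frame
  dat <- data.frame()
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  # Mean of the chosen pollutant column, ignoring missing values
  mean(dat[, pollutant], na.rm = TRUE)
}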
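
Note that the function pools every reading from the selected monitors before averaging, which is not the same as averaging the per-monitor means. A minimal sketch with made-up numbers (the two data frames below are hypothetical and not taken from specdata) illustrates the difference:

```r
# Two hypothetical monitors with different numbers of valid readings
m1 <- data.frame(sulfate = c(1, 2, 3, NA), ID = 1)
m2 <- data.frame(sulfate = c(10, NA, NA, NA), ID = 2)

# Pooled mean: stack the rows first, then average (the approach pollutantmean takes)
pooled <- mean(rbind(m1, m2)$sulfate, na.rm = TRUE)

# Mean of per-monitor means: weights each monitor equally instead
per_monitor <- mean(c(mean(m1$sulfate, na.rm = TRUE),
                      mean(m2$sulfate, na.rm = TRUE)))

print(pooled)       # (1 + 2 + 3 + 10) / 4 = 4
print(per_monitor)  # (2 + 10) / 2 = 6
```

The pooled version gives more weight to monitors with more valid readings, which matches the wording "across all of the monitors" in the task.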

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

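complete <- function(directory, id = 1:332) {
  # List the monitor files; this relies on the sorted file order matching the monitor IDs
  files_list <- list.files(directory, full.names = TRUE)
  # Start with an empty result and add one row per requested monitor
  result <- data.frame(id = numeric(0), nobs = numeric(0))
  for (i in id) {
    dat <- read.csv(files_list[i])
    # Keep only rows with no missing value in any column
    dat <- dat[complete.cases(dat), ]
    nobs <- nrow(dat)
    result <- rbind(result, data.frame(id = i, nobs = nobs))
  }
  result
}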
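
As a small illustration of what "completely observed" means here, the sketch below applies complete.cases to a made-up data frame. The values are hypothetical; only the column names sulfate, nitrate and ID follow the ones used in the code in this document.

```r
# Hypothetical monitor data, not taken from specdata
toy <- data.frame(
  sulfate = c(1.2, NA, 3.4, 2.0),
  nitrate = c(0.5, 0.7, NA, 0.9),
  ID      = 1
)

# TRUE only for rows that have no NA in any column
keep <- complete.cases(toy)
print(keep)       # TRUE FALSE FALSE TRUE
print(sum(keep))  # 2, the same count as nrow(toy[keep, ])
```

A row counts only if every column is present, so a reading with sulfate but no nitrate is excluded.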

Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0.

corr <- function(directory, threshold = 0) {
  require(data.table)
  # Read every CSV file in the directory and stack them into one large data.table
  lst <- lapply(list.files(path = directory, pattern = "\\.csv$", full.names = TRUE), data.table::fread)
  dt <- rbindlist(lst)

  # Only keep completely observed cases
  dt <- dt[complete.cases(dt), ]

  # Count complete cases and compute the correlation per monitor, then apply the threshold
  dt <- dt[, .(nobs = .N, corr = cor(x = sulfate, y = nitrate)), by = ID][nobs > threshold]
  return(dt[, corr])
}
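
For comparison, the same threshold logic can be sketched in base R without data.table. This is an alternative sketch rather than the solution above: corr_base is a hypothetical name, and the two CSV files are synthetic data written to a temporary directory, not part of specdata.

```r
# Base R sketch of the threshold logic; corr_base is a hypothetical name
corr_base <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  out <- numeric(0)
  for (f in files) {
    dat <- read.csv(f)
    dat <- dat[complete.cases(dat), ]
    # Keep the monitor only if it has more complete rows than the threshold
    if (nrow(dat) > threshold) {
      out <- c(out, cor(dat$sulfate, dat$nitrate))
    }
  }
  out
}

# Synthetic data in a temporary directory (made-up values)
tmp <- file.path(tempdir(), "toy_specdata")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 1), ID = 1),
          file.path(tmp, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(3, 2), ID = 2),
          file.path(tmp, "002.csv"), row.names = FALSE)

# Monitor 1 has 3 complete rows with nitrate = 2 * sulfate, so its correlation is 1
print(corr_base(tmp, threshold = 2))
# No monitor has more than 5 complete rows, so the result is numeric(0)
print(corr_base(tmp, threshold = 5))
```

Reading one file at a time keeps memory use low, while the data.table version above trades that for a single grouped computation over all monitors.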

Test

pollutantmean("specdata","sulfate",1:10)
## [1] 4.064128
cc <- complete("specdata", c(6, 10, 20, 34, 100, 200, 310))
print(cc$nobs)
## [1] 228 148 124 165 104 460 232
cr <- corr("specdata", 129)
## Loading required package: data.table
## Warning: package 'data.table' was built under R version 4.3.3
head(cr)
## [1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814