Desire’s Final Exam Answers

Question

Take the two datasets provided in Moodle under the last session named “Finals”. The datasets are related to fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agencies freely available national monitoring data. The two datasets are used to show fine particle (PM2.5) outdoor air pollution in the United States in 1999 and 2012. Perform the following steps in sequence then copy and paste your R codes for each section in a word file (you are also required to submit your R file in addition to your answers in a word file). Please also report the results in a word file (if it is a long result, then only report the dimension of the data frame). For the figures, copy and paste them directly from R to your answers in word file.

1) First read in the 1999 data from the raw text file. The data is a delimited file were fields are delimited with the | character and missing values are coded as blank fields. Skip some commented lines in the beginning of the file and initially do not read the header data. Then attach the column headers to the dataset and make sure that they are properly formatted for R data frames.

2) Do all above for 2012 data.

Response:

Load file 1 & 2; skip 2 first lines that are commented out, and finaly attaching column names.

setwd("/Users/daheza/Documents/Analytics/DataEXP/FinalExam/PM 2.5 Data-20180421")
data_1999 <-
  read.table(
    "RD_501_88101_1999-0.txt",
    sep = "|",
    comment.char = "#"
    ,
    col.names = c(
      "RD",
      "Action Code",
      "State Code",
      "County Code",
      "Site ID",
      "Parameter",
      "POC",
      "Sample Duration",
      "Unit",
      "Method",
      "Date",
      "Start Time",
      "Sample Value",
      "Null Data Code",
      "Sampling Frequency",
      "Monitor Protocol (MP) ID",
      "Qualifier - 1",
      "Qualifier - 2",
      "Qualifier - 3",
      "Qualifier - 4",
      "Qualifier - 5",
      "Qualifier - 6",
      "Qualifier - 7",
      "Qualifier - 8",
      "Qualifier - 9",
      "Qualifier - 10",
      "Alternate Method Detectable Limit",
      "Uncertainty"
    )
  )

data_2012 <-
  read.table(
    "RD_501_88101_2012-0.txt",
    sep = "|",
    comment.char = "#",
    col.names = c(
      "RD",
      "Action Code",
      "State Code",
      "County Code",
      "Site ID",
      "Parameter",
      "POC",
      "Sample Duration",
      "Unit",
      "Method",
      "Date",
      "Start Time",
      "Sample Value",
      "Null Data Code",
      "Sampling Frequency",
      "Monitor Protocol (MP) ID",
      "Qualifier - 1",
      "Qualifier - 2",
      "Qualifier - 3",
      "Qualifier - 4",
      "Qualifier - 5",
      "Qualifier - 6",
      "Qualifier - 7",
      "Qualifier - 8",
      "Qualifier - 9",
      "Qualifier - 10",
      "Alternate Method Detectable Limit",
      "Uncertainty"
    )
  )

3)The column we are interested in is the Sample.Value column which contains the PM2.5 measurements. Here, extract that column and print a summary.

Response:

sample.Value_1999 <- data_1999$Sample.Value
summary(sample.Value_1999)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.20   11.50   13.74   17.90  157.10   13217

sample.Value_2012 <- data_2012$Sample.Value
summary(sample.Value_2012)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -10.00    4.00    7.63    9.14   12.00  908.97   73133

4) Check to see what proportion of the observations are missing (i.e. coded as NA) in Sample.Value column.

Response:

# percentage of NAs in column Sample.Value 
sum(is.na(sample.Value_1999))/length(sample.Value_1999)# 0.1125608 -> 11%

## [1] 0.1125608

sum(is.na(sample.Value_2012))/length(sample.Value_2012)# 0.05607125 -> 5%

## [1] 0.05607125

5) Combine the two sets of data (1999 and 2012) into a single data frame and call it “pm”.

Response:

#merge two dataframe
pm <- rbind(data_1999,data_2012)# this yield rows =1421708, col= 28
dim(pm)

## [1] 1421708      28

head(pm,3)

6) Create a factor variable indicating which year the data comes from. We also rename the Sample.Value variable to a more sensible “PM”.

Response:

# extracting year and adding a new column to the merged dataset
pm$DateYear <- as.numeric(substring(pm$Date,0,4))
summary(pm$DateYear)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1999    2012    2012    2011    2012    2012

# rename column Sample.Value <-> PM
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

pm <- rename(pm, PM=Sample.Value)
dim(pm)

## [1] 1421708      29

7) Extract the unique county codes from the merged dataset.

Response:

# unique country name
unique_countyCode <-unique(pm$County.Code)# 133 unique country code
unique_countyCode

##   [1]  27  33  49  69  73  89  97 101 103 113 117 119 121 125 127  20  90
##  [18] 110 130 170   3   5   7  13  19  21  23   1  31  35  51  91 107 115
##  [35] 131 139 143   9  11  17  25  29  37  45  47  53  57  59  61  63  65
##  [52]  67  71  75  77  79  81  83  85  87  95  99 111  39  41 123  86 105
##  [69] 215 223 245 303 319  55  43 157 161 163 167 197 201 141 153 155 169
##  [86] 193 173 177 191 209  93 145 151 195 227  15 510 147 137 171 149 183
## [103] 186 189 109 129 133 135 165 315 339 439 453 479  36 520 550 650 680
## [120] 700 710 760 770 775 810  10 122 185 295 199 159 203 355

completeFunct <- function(data, desiredCols) {
  completeVec <- complete.cases(data[, desiredCols])
  return(data[completeVec, ])
}
pm <- completeFunct(pm, c("PM","County.Code"))
dim(pm)

## [1] 1335358      29

#which(is.nan(log(pm$PM)))
pm$LogPm<-log2(pm$PM+1)

## Warning: NaNs produced

8) Develop box plots of log transform of “PM” variable across all county codes.

Response:

boxplot(unique_countyCode,log(pm$PM))

## Warning in log(pm$PM): NaNs produced

## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
## $group == : Outlier (-Inf) in boxplot 2 is not drawn

9) Calculate the average of “PM” variable for unique County.Code. (make sure you remove missing values).

Response:

pm_val_for_uniqueCC <- pm[!duplicated(pm[, c("County.Code")]), ]
mean(pm_val_for_uniqueCC$PM,na.rm = TRUE) #14.11

## [1] 13.77462

10) Then develop a scatter plot with the mean values of “PM” level across the unique county codes.

Response:

grouped_By_CountyCode <- group_by(pm, County.Code)
meanValuesOfPMPerCC <- summarize(grouped_By_CountyCode,pm_Mean=mean(PM,na.rm = TRUE))
plot(meanValuesOfPMPerCC$pm_Mean, meanValuesOfPMPerCC$County.Code, main="Mean Values of PM level across the unique county codes",ylab="County Code ", xlab="Mean Values PM level", pch=19)

11) Then develop a bar plot with the mean values of “PM” level across the unique county codes.

Response:

library(ggplot2)
p<-ggplot(data=meanValuesOfPMPerCC, aes(y=County.Code, x=pm_Mean)) +
  geom_bar(stat="identity",width = 0.7,  position = "stack")
p

## Warning: position_stack requires non-overlapping x intervals

12) Calculate the mean of PM for each county in 1999 and 2012.

Response:

grouped_By_Year <- group_by(pm, DateYear)
grouped_By_Year_1999_2012 <- filter(grouped_By_Year, DateYear==1999 || DateYear==2012)

## Warning: package 'bindrcpp' was built under R version 3.4.4

meanValuesOfPMPerYear <- summarize(grouped_By_Year_1999_2012,pm_Mean=mean(PM,na.rm = TRUE))
meanValuesOfPMPerYear # Year:1999.  PM Mean:13.7, Year:2012   PM Mean:9.14

13) Now make a plot that shows the 1999 county means in one “column” and the 2012 county means in another column.

Response:

ggplot(data=meanValuesOfPMPerYear, aes(x=DateYear, y=pm_Mean)) +
  geom_bar(stat="identity",width = 1,  position = "stack")