Take the two datasets provided in Moodle under the last session named “Finals”. The datasets are related to fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agencies freely available national monitoring data. The two datasets are used to show fine particle (PM2.5) outdoor air pollution in the United States in 1999 and 2012. Perform the following steps in sequence then copy and paste your R codes for each section in a word file (you are also required to submit your R file in addition to your answers in a word file). Please also report the results in a word file (if it is a long result, then only report the dimension of the data frame). For the figures, copy and paste them directly from R to your answers in word file.
1) First read in the 1999 data from the raw text file. The data is a delimited file were fields are delimited with the | character and missing values are coded as blank fields. Skip some commented lines in the beginning of the file and initially do not read the header data. Then attach the column headers to the dataset and make sure that they are properly formatted for R data frames.
2) Do all above for 2012 data.
Load file 1 & 2; skip 2 first lines that are commented out, and finaly attaching column names.
setwd("/Users/daheza/Documents/Analytics/DataEXP/FinalExam/PM 2.5 Data-20180421")
data_1999 <-
read.table(
"RD_501_88101_1999-0.txt",
sep = "|",
comment.char = "#"
,
col.names = c(
"RD",
"Action Code",
"State Code",
"County Code",
"Site ID",
"Parameter",
"POC",
"Sample Duration",
"Unit",
"Method",
"Date",
"Start Time",
"Sample Value",
"Null Data Code",
"Sampling Frequency",
"Monitor Protocol (MP) ID",
"Qualifier - 1",
"Qualifier - 2",
"Qualifier - 3",
"Qualifier - 4",
"Qualifier - 5",
"Qualifier - 6",
"Qualifier - 7",
"Qualifier - 8",
"Qualifier - 9",
"Qualifier - 10",
"Alternate Method Detectable Limit",
"Uncertainty"
)
)
data_2012 <-
read.table(
"RD_501_88101_2012-0.txt",
sep = "|",
comment.char = "#",
col.names = c(
"RD",
"Action Code",
"State Code",
"County Code",
"Site ID",
"Parameter",
"POC",
"Sample Duration",
"Unit",
"Method",
"Date",
"Start Time",
"Sample Value",
"Null Data Code",
"Sampling Frequency",
"Monitor Protocol (MP) ID",
"Qualifier - 1",
"Qualifier - 2",
"Qualifier - 3",
"Qualifier - 4",
"Qualifier - 5",
"Qualifier - 6",
"Qualifier - 7",
"Qualifier - 8",
"Qualifier - 9",
"Qualifier - 10",
"Alternate Method Detectable Limit",
"Uncertainty"
)
)
3)The column we are interested in is the Sample.Value column which contains the PM2.5 measurements. Here, extract that column and print a summary.
sample.Value_1999 <- data_1999$Sample.Value
summary(sample.Value_1999)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.20 11.50 13.74 17.90 157.10 13217
sample.Value_2012 <- data_2012$Sample.Value
summary(sample.Value_2012)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -10.00 4.00 7.63 9.14 12.00 908.97 73133
4) Check to see what proportion of the observations are missing (i.e. coded as NA) in Sample.Value column.
# percentage of NAs in column Sample.Value
sum(is.na(sample.Value_1999))/length(sample.Value_1999)# 0.1125608 -> 11%
## [1] 0.1125608
sum(is.na(sample.Value_2012))/length(sample.Value_2012)# 0.05607125 -> 5%
## [1] 0.05607125
5) Combine the two sets of data (1999 and 2012) into a single data frame and call it “pm”.
#merge two dataframe
pm <- rbind(data_1999,data_2012)# this yield rows =1421708, col= 28
dim(pm)
## [1] 1421708 28
head(pm,3)
6) Create a factor variable indicating which year the data comes from. We also rename the Sample.Value variable to a more sensible “PM”.
# extracting year and adding a new column to the merged dataset
pm$DateYear <- as.numeric(substring(pm$Date,0,4))
summary(pm$DateYear)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1999 2012 2012 2011 2012 2012
# rename column Sample.Value <-> PM
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pm <- rename(pm, PM=Sample.Value)
dim(pm)
## [1] 1421708 29
7) Extract the unique county codes from the merged dataset.
# unique country name
unique_countyCode <-unique(pm$County.Code)# 133 unique country code
unique_countyCode
## [1] 27 33 49 69 73 89 97 101 103 113 117 119 121 125 127 20 90
## [18] 110 130 170 3 5 7 13 19 21 23 1 31 35 51 91 107 115
## [35] 131 139 143 9 11 17 25 29 37 45 47 53 57 59 61 63 65
## [52] 67 71 75 77 79 81 83 85 87 95 99 111 39 41 123 86 105
## [69] 215 223 245 303 319 55 43 157 161 163 167 197 201 141 153 155 169
## [86] 193 173 177 191 209 93 145 151 195 227 15 510 147 137 171 149 183
## [103] 186 189 109 129 133 135 165 315 339 439 453 479 36 520 550 650 680
## [120] 700 710 760 770 775 810 10 122 185 295 199 159 203 355
completeFunct <- function(data, desiredCols) {
completeVec <- complete.cases(data[, desiredCols])
return(data[completeVec, ])
}
pm <- completeFunct(pm, c("PM","County.Code"))
dim(pm)
## [1] 1335358 29
#which(is.nan(log(pm$PM)))
pm$LogPm<-log2(pm$PM+1)
## Warning: NaNs produced
8) Develop box plots of log transform of “PM” variable across all county codes.
boxplot(unique_countyCode,log(pm$PM))
## Warning in log(pm$PM): NaNs produced
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
## $group == : Outlier (-Inf) in boxplot 2 is not drawn
9) Calculate the average of “PM” variable for unique County.Code. (make sure you remove missing values).
pm_val_for_uniqueCC <- pm[!duplicated(pm[, c("County.Code")]), ]
mean(pm_val_for_uniqueCC$PM,na.rm = TRUE) #14.11
## [1] 13.77462
10) Then develop a scatter plot with the mean values of “PM” level across the unique county codes.
grouped_By_CountyCode <- group_by(pm, County.Code)
meanValuesOfPMPerCC <- summarize(grouped_By_CountyCode,pm_Mean=mean(PM,na.rm = TRUE))
plot(meanValuesOfPMPerCC$pm_Mean, meanValuesOfPMPerCC$County.Code, main="Mean Values of PM level across the unique county codes",ylab="County Code ", xlab="Mean Values PM level", pch=19)
11) Then develop a bar plot with the mean values of “PM” level across the unique county codes.
library(ggplot2)
p<-ggplot(data=meanValuesOfPMPerCC, aes(y=County.Code, x=pm_Mean)) +
geom_bar(stat="identity",width = 0.7, position = "stack")
p
## Warning: position_stack requires non-overlapping x intervals
12) Calculate the mean of PM for each county in 1999 and 2012.
grouped_By_Year <- group_by(pm, DateYear)
grouped_By_Year_1999_2012 <- filter(grouped_By_Year, DateYear==1999 || DateYear==2012)
## Warning: package 'bindrcpp' was built under R version 3.4.4
meanValuesOfPMPerYear <- summarize(grouped_By_Year_1999_2012,pm_Mean=mean(PM,na.rm = TRUE))
meanValuesOfPMPerYear # Year:1999. PM Mean:13.7, Year:2012 PM Mean:9.14
13) Now make a plot that shows the 1999 county means in one “column” and the 2012 county means in another column.
ggplot(data=meanValuesOfPMPerYear, aes(x=DateYear, y=pm_Mean)) +
geom_bar(stat="identity",width = 1, position = "stack")