We are going to apply some of the techniques we learned in Exploratory Data Analysis to study air pollution data, specifically particulate matter (we’ll sometimes call it pm25), collected by the U.S. Environmental Protection Agency. This website https://www.health.ny.gov/environmental/indoors/air/pmq_a.htm from New York State offers some basic information on the topic if you’re interested.
Particulate matter (less than 2.5 microns in diameter) is a fancy name for dust, and breathing in dust might pose health hazards to the population. We’ll study data from two years, 1999 (when monitoring of particulate matter started) and 2012. Our goal is to see if there’s been a noticeable decline in this type of air pollution between these two years.
There are 2 large files to read in using the R command read_delim. We’ll store the 1999 data in the tibble pm0 and let R print the tibble so we can see its dimensions.
library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.2.1     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
pm0 <- read_delim("RD_501_88101_1999-0.txt", comment = "#", col_names = FALSE, delim = "|", na = "")
Parsed with column specification:
cols(
  .default = col_logical(),
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_character(),
  X4 = col_character(),
  X5 = col_character(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double(),
  X10 = col_double(),
  X11 = col_double(),
  X12 = col_time(format = ""),
  X13 = col_double(),
  X14 = col_character(),
  X15 = col_double(),
  X17 = col_character()
)
See spec(...) for full column specifications.
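readr’s parsing message shows how each column was read; columns not listed fell back to the .default of col_logical(). If you want to see the full specification for yourself, readr provides the spec() function:
spec(pm0)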
pm0
We see that pm0 has over 117,000 rows, each containing 28 columns, just as in the original file at the EPA website. We’re only going to use a few of these columns, so in a moment we’ll pare the tibble down.
We can also see there’s some missing data, but we won’t worry about that now. The column names, X1, X2, etc., are not informative. However, we know that the first line of the original file (a comment) explained what information the columns contained.
We’ll use read_lines to read in the first line of the original file and use the functions in stringr to parse out the 28 column names of the original file.
nm <- read_lines("RD_501_88101_1999-0.txt", n_max = 1)
cnm <- str_split(nm, "\\|", simplify = FALSE)[[1]]
cnm <- str_replace_all(cnm, " - ", "")
cnm <- str_replace_all(cnm, "# ", "")
cnm <- str_replace_all(cnm, " ", "_")
cnm <- str_replace_all(cnm, "\\(", "")
cnm <- str_replace_all(cnm, "\\)", "")
cnm
[1] "RD" "Action_Code"
[3] "State_Code" "County_Code"
[5] "Site_ID" "Parameter"
[7] "POC" "Sample_Duration"
[9] "Unit" "Method"
[11] "Date" "Start_Time"
[13] "Sample_Value" "Null_Data_Code"
[15] "Sampling_Frequency" "Monitor_Protocol_MP_ID"
[17] "Qualifier1" "Qualifier2"
[19] "Qualifier3" "Qualifier4"
[21] "Qualifier5" "Qualifier6"
[23] "Qualifier7" "Qualifier8"
[25] "Qualifier9" "Qualifier10"
[27] "Alternate_Method_Detectable_Limit" "Uncertainty"
The 28 column names were all jumbled together in that first line, separated by | characters. We pulled them apart with a call to str_split (string split) with 3 arguments. The first is nm, the line we just read in; the second is the pattern "\\|" (the pipe has to be escaped because it’s a special character in regular expressions); and the third is the argument simplify set to FALSE. Since str_split returns a list, we took its first element with [[1]], and then used a series of str_replace_all calls to strip out the characters we don’t want in the names.
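As an aside, the whole cleanup can also be written as one pipeline. This is just a sketch, equivalent to the chain of str_replace_all() calls above:
cnm <- read_lines("RD_501_88101_1999-0.txt", n_max = 1) %>%
  str_split("\\|") %>%              # simplify = FALSE is the default
  .[[1]] %>%                        # str_split returns a list; take its first element
  str_replace_all(" - |# ", "") %>% # drop the " - " and "# " fragments
  str_replace_all(" ", "_") %>%     # turn remaining spaces into underscores
  str_remove_all("[()]")            # strip the parentheses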
The variable cnm now holds the column headings, which is nice, but we don’t need all of them. We really only need State_Code, County_Code, Site_ID, Date, and Sample_Value, which are X3, X4, X5, X11, and X13, respectively. We’ll set the names of the data set to the values of our names vector, cnm, and then select those variables for our pm0 data set.
names(pm0) <- cnm
pm0 <- select(pm0, State_Code, County_Code, Site_ID, Date, Sample_Value)
pm0
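As an aside, we didn’t have to rename all 28 columns just to keep five of them: select() can rename on the fly. Starting from the freshly read tibble with its default X1, X2, ... names (called pm0_raw here, a hypothetical copy saved before the renaming), this sketch produces the same result:
pm0_alt <- select(pm0_raw,
                  State_Code   = X3,
                  County_Code  = X4,
                  Site_ID      = X5,
                  Date         = X11,
                  Sample_Value = X13)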
Now it’s much clearer what information each column of pm0 holds. The measurements of particulate matter (pm25) are in the column named Sample_Value.
Sample_Value itself can be treated as a numeric vector (of length 117,000+) with at least the first 3 values missing. Exactly what percentage of values in this vector are missing? We’ll use the R function mean with is.na() as its argument to see what fraction of the values are missing (NA).
summarize(pm0,missing=mean(is.na(Sample_Value)))
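Equivalently, outside of a summarize() call, the same proportion can be computed in base-R style:
mean(is.na(pm0$Sample_Value))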
So a little over 11% of the 117,000+ measurements are missing. We’ll keep that in mind. Now let’s start processing the 2012 data, which we’ll store in the tibble pm1.
We’ll repeat what we did for pm0, except a little more efficiently. First we’ll read in the file and assign column names. Then we’ll print the tibble to check its dimensions.
pm1 <- read_delim("RD_501_88101_2012-0.txt", comment = "#", col_names = FALSE, delim = "|", na = "", n_max = 1000000)
Parsed with column specification:
cols(
  .default = col_logical(),
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_character(),
  X4 = col_character(),
  X5 = col_character(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double(),
  X10 = col_double(),
  X11 = col_double(),
  X12 = col_time(format = ""),
  X13 = col_double(),
  X14 = col_character(),
  X15 = col_double(),
  X17 = col_character()
)
See spec(...) for full column specifications.
20268 parsing failures.
row col expected actual file
1082 X18 1/0/T/F/TRUE/FALSE 5 'RD_501_88101_2012-0.txt'
1091 X18 1/0/T/F/TRUE/FALSE 5 'RD_501_88101_2012-0.txt'
353841 X27 1/0/T/F/TRUE/FALSE 2 'RD_501_88101_2012-0.txt'
353842 X27 1/0/T/F/TRUE/FALSE 2 'RD_501_88101_2012-0.txt'
353843 X27 1/0/T/F/TRUE/FALSE 2 'RD_501_88101_2012-0.txt'
...... ... .................. ...... .........................
See problems(...) for more details.
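Those parsing failures are all in columns such as X18 and X27 that we won’t be keeping, so we can safely ignore them here. If you do want to inspect them, readr stores them alongside the tibble:
problems(pm1)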
names(pm1) <- cnm
pm1 <- select(pm1, State_Code, County_Code, Site_ID, Date, Sample_Value)
pm1
Wow! pm1 has a full one million rows, and only because we capped the read with n_max = 1000000! Particulate matter monitoring began in 1999, so perhaps there weren’t as many sensors collecting data then as in 2012, when the program was more mature.
Now let’s see what percentage of the particulate matter (pm25) measurements are missing for 2012. As before, we’ll use the R function mean with is.na() as its argument to find out.
summarize(pm1,missing=mean(is.na(Sample_Value)))
So only about 5.3% of the particulate matter measurements are missing. That’s about half the percentage that was missing in 1999.
Now let’s look at summaries (using the summary command) for both datasets.
summary(pm0$Sample_Value)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 7.20 11.50 13.74 17.90 157.10 13217
summary(pm1$Sample_Value)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-10.00 4.20 7.90 9.25 12.00 893.70 52608
The numbers in the vectors pm0$Sample_Value and pm1$Sample_Value represent measurements taken in micrograms per cubic meter.
We see that both the median and the mean of measured particulate matter have declined from 1999 to 2012. In fact, every statistic in the summaries except the maximum and the count of missing values (Max and NA’s) has decreased. Even the Min has gone down, from 0 in 1999 to -10 in 2012! We’ll address what a negative measurement might mean a little later. Note that the Max has increased from 157.1 in 1999 to 893.7 in 2012. This is quite high and might reflect an error in the table or malfunctions in some monitors.
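For now, we can at least count the negative 2012 measurements directly (a quick check with the dplyr tools we already have loaded):
summarize(pm1, n_negative = sum(Sample_Value < 0, na.rm = TRUE))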
To make this data a bit easier to work with, I’m going to put the two datasets together. I’ll create a factor variable called Year to distinguish the 1999 measurements from the 2012 ones. Then I’ll use the bind_rows function to stack the two data frames on top of each other.
pm25 <- bind_rows(pm0, pm1)
pm25 <- mutate(pm25, Year = as_factor(str_sub(Date, 1, 4)))
pm25
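As a quick sanity check that the Year factor was built correctly (this assumes, as in the EPA RD format, that Date begins with the four-digit year), we can count the rows per year:
count(pm25, Year)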
Let’s make a boxplot of Sample_Value, with one box for each Year.
library(plotly)
Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
p <- ggplot(pm25, aes(x = Year, y = Sample_Value)) +
  geom_boxplot(notch = TRUE) +
  stat_summary(fun.y = mean, geom = "point", shape = 23, size = 4) # add mean points
ggplotly(p)
Removed 65825 rows containing non-finite values (stat_boxplot).
Removed 65825 rows containing non-finite values (stat_summary).
Huh? Did somebody step on the boxes? It’s hard to see what’s going on here. There are so many values outside the boxes, and the range of particulate matter (pm25) for 2012 is so big, that the boxes are flattened. It might be more informative to redraw the boxplot using the logs (base 2) of the values.
p <- ggplot(pm25, aes(x = Year, y = log2(Sample_Value))) +
  geom_boxplot(notch = TRUE) +
  stat_summary(fun.y = mean, geom = "point", shape = 23, size = 4) # add mean points
ggplotly(p)
NaNs produced
NaNs produced
NaNs produced
Removed 109290 rows containing non-finite values (stat_boxplot).
Removed 109290 rows containing non-finite values (stat_summary).
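Those warnings are expected: log2 of a negative measurement is NaN and log2(0) is -Inf, so ggplot2 drops those points as non-finite, along with the NAs. To see how many zero or negative values each year contributes, here’s a quick check:
pm25 %>%
  filter(Sample_Value <= 0) %>%
  count(Year)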