An air quality data analysis

This analysis seeks to investigate the results of efforts by the EPA to reduce emissions of fine particulates. Numerous studies have linked exposure to these particulates resulting in adverse health consequences. This analysis narrows it’s examination to emissions in the State of Texas.

“EPA issued the fine particle standards in 1997 after evaluating hundreds of health studies and conducting an extensive peer review process. The 1997 annual standard was established as a level of 15 micrograms per cubic meter (µg/m3), based on the 3-year average of annual mean PM2.5 concentrations. The 1997 24-hour standard was established as a level of 65 µg/m3, determined by the 3-year average of the annual 98th percentile concentrations. On September 21, 2006, EPA strengthened the 24-hour fine particle standard from the 1997 level of 65 µg/m3 to 35µg/m3, and retained the annual fine particle standard at 15µg/m3.”1

“Health studies have shown a significant association between exposure to fine particles and premature death from heart or lung disease. Fine particles can aggravate heart and lung diseases and have been linked to effects such as: cardiovascular symptoms; cardiac arrhythmias; heart attacks; respiratory symptoms; asthma attacks; and bronchitis. These effects can result in increased hospital admissions, emergency room visits, absences from school or work, and restricted activity days. Individuals that may be particularly sensitive to fine particle exposure include people with heart or lung disease, older adults, and children.”2

Data parocessing

Data processing steps, including:

# library('tidyr')
library('dplyr')
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('ggplot2')
# library(sqldf)
dfile <- vector()
durl  <- vector()
# range of years to analyze
airYears <- seq(from = 1999, to =  2014, by = 1)
for (i in seq_along(airYears) ) {
  dfile[i] <- paste('daily_88101_', airYears[i], '.zip', sep ='')
  durl[i]  <- paste('http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/', dfile[i], sep = '')
  destf <- paste('data/', dfile[i], sep = '')
  # check to see if file exists prior to download
  if (!file.exists(destf) ) {
    download.file(durl[i], destf)
    cat('downloaded:  ') ; cat(dfile[i]) ; cat('  from:  ') ; print(durl[i])
  }
}
# list of all daily_88101 files
daily <- list.files('data/', pattern = '^daily_88101.*zip$', full.names = TRUE)
pm <- data.frame()
# load data
for (i in seq_along(daily)) {
  zipf <- unzip(daily[i], list=TRUE)[1]
  # read from the compressed data file
  pm <- read.table(unz(daily[i], zipf), header = TRUE, sep = ',')
  # State.Code 48 is Texas
  if (i == 1) {
    # pm-all <- pm[i]
    pm0 <- pm[pm$State.Code == 48, ]
  } else {
    # pm-all <- rbind(pm-all, pm[i])
    pm0 <- rbind(pm0, pm[pm$State.Code == 48, ])
  }
}
dim(pm0)
## [1] 79626    28
str(pm0)
## 'data.frame':    79626 obs. of  28 variables:
##  $ State.Code         : chr  "48" "48" "48" "48" ...
##  $ County.Code        : int  29 29 29 29 29 29 29 29 29 29 ...
##  $ Site.Num           : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ Parameter.Code     : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
##  $ POC                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Latitude           : num  29.4 29.4 29.4 29.4 29.4 ...
##  $ Longitude          : num  -98.5 -98.5 -98.5 -98.5 -98.5 ...
##  $ Datum              : Factor w/ 4 levels "NAD27","NAD83",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Parameter.Name     : Factor w/ 1 level "PM2.5 - Local Conditions": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sample.Duration    : Factor w/ 3 levels "24 HOUR","1 HOUR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Pollutant.Standard : Factor w/ 2 levels "PM25 24-hour 2006",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date.Local         : Factor w/ 5630 levels "1999-01-01","1999-01-02",..: 91 92 94 95 98 99 110 111 112 113 ...
##  $ Units.of.Measure   : Factor w/ 1 level "Micrograms/cubic meter (LC)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Event.Type         : Factor w/ 3 levels "Included","None",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Observation.Count  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Observation.Percent: num  100 100 100 100 100 100 100 100 100 100 ...
##  $ Arithmetic.Mean    : num  17 17.7 15.5 8.5 17.3 16.9 9.4 11 18.3 22.8 ...
##  $ X1st.Max.Value     : num  17 17.7 15.5 8.5 17.3 16.9 9.4 11 18.3 22.8 ...
##  $ X1st.Max.Hour      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AQI                : int  61 63 58 35 62 61 39 46 64 74 ...
##  $ Method.Name        : Factor w/ 22 levels "Andersen RAAS2.5-100 PM2.5 SAM w/WINS - GRAVIMETRIC",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Local.Site.Name    : Factor w/ 1524 levels ""," Davidsonville",..: 489 489 489 489 489 489 489 489 489 489 ...
##  $ Address            : Factor w/ 1826 levels "'G' STREET AND KANSAS AVE.",..: 526 526 526 526 526 526 526 526 526 526 ...
##  $ State.Name         : Factor w/ 54 levels "Alabama","Alaska",..: 45 45 45 45 45 45 45 45 45 45 ...
##  $ County.Name        : Factor w/ 684 levels "Ada","Adams",..: 34 34 34 34 34 34 34 34 34 34 ...
##  $ City.Name          : Factor w/ 969 levels "Akron","Albany",..: 520 520 520 520 520 520 520 520 520 520 ...
##  $ CBSA.Name          : Factor w/ 492 levels "","Aguadilla-Isabela-San Sebastian, PR",..: 310 310 310 310 310 310 310 310 310 310 ...
##  $ Date.of.Last.Change: Factor w/ 51 levels "2013-06-29","2013-06-30",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(pm0)
##   State.Code         County.Code     Site.Num    Parameter.Code 
##  Length:79626       Min.   : 29   Min.   :   1   Min.   :88101  
##  Class :character   1st Qu.:113   1st Qu.:  32   1st Qu.:88101  
##  Mode  :character   Median :201   Median :  57   Median :88101  
##                     Mean   :224   Mean   : 453   Mean   :88101  
##                     3rd Qu.:355   3rd Qu.:1002   3rd Qu.:88101  
##                     Max.   :479   Max.   :3010   Max.   :88101  
##                                                                 
##       POC          Latitude      Longitude          Datum      
##  Min.   :1.00   Min.   :26.1   Min.   :-106.6   NAD27  : 2116  
##  1st Qu.:1.00   1st Qu.:29.7   1st Qu.: -97.9   NAD83  :66033  
##  Median :1.00   Median :31.8   Median : -96.9   UNKNOWN: 1218  
##  Mean   :1.92   Mean   :31.1   Mean   : -97.9   WGS84  :10259  
##  3rd Qu.:2.00   3rd Qu.:32.8   3rd Qu.: -95.3                  
##  Max.   :8.00   Max.   :35.2   Max.   : -93.8                  
##                                                                
##                   Parameter.Name       Sample.Duration 
##  PM2.5 - Local Conditions:79626   24 HOUR      :73915  
##                                   1 HOUR       : 3055  
##                                   24-HR BLK AVG: 2656  
##                                                        
##                                                        
##                                                        
##                                                        
##          Pollutant.Standard      Date.Local   
##  PM25 24-hour 2006:76571    2004-07-14:   74  
##                   : 3055    2004-06-20:   72  
##                             2002-06-25:   70  
##                             2002-05-14:   69  
##                             2002-05-02:   68  
##                             2002-10-05:   68  
##                             (Other)   :79205  
##                     Units.of.Measure    Event.Type    Observation.Count
##  Micrograms/cubic meter (LC):79626   Included:  867   Min.   : 1.00    
##                                      None    :78759   1st Qu.: 1.00    
##                                      Excluded:    0   Median : 1.00    
##                                                       Mean   : 1.78    
##                                                       3rd Qu.: 1.00    
##                                                       Max.   :24.00    
##                                                                        
##  Observation.Percent Arithmetic.Mean X1st.Max.Value  X1st.Max.Hour   
##  Min.   :  4.0       Min.   : -0.4   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:100.0       1st Qu.:  7.1   1st Qu.:  7.2   1st Qu.: 0.000  
##  Median :100.0       Median : 10.0   Median : 10.2   Median : 0.000  
##  Mean   : 99.6       Mean   : 11.2   Mean   : 11.8   Mean   : 0.433  
##  3rd Qu.:100.0       3rd Qu.: 14.0   3rd Qu.: 14.5   3rd Qu.: 0.000  
##  Max.   :100.0       Max.   :131.0   Max.   :716.0   Max.   :23.000  
##                                                                      
##       AQI       
##  Min.   :  0.0  
##  1st Qu.: 30.0  
##  Median : 42.0  
##  Mean   : 43.1  
##  3rd Qu.: 55.0  
##  Max.   :190.0  
##  NA's   :3055   
##                                                               Method.Name   
##  R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC             :42307  
##  R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric:31608  
##  Met One BAM-1020 Mass Monitor w/VSCC - Beta Attenuation            : 3055  
##   -                                                                 : 2656  
##  Andersen RAAS2.5-100 PM2.5 SAM w/WINS - GRAVIMETRIC                :    0  
##  Andersen RAAS2.5-300 PM2.5 SEQ w/WINS - GRAVIMETRIC                :    0  
##  (Other)                                                            :    0  
##                 Local.Site.Name                      Address     
##  DALLAS HINTON          : 7882   1415 HINTON STREET      : 7882  
##  Clinton                : 5826   9525 1/2 Clinton Dr     : 5826  
##  Fort Worth Northwest   : 4219   3317 Ross Ave           : 4219  
##  El Paso Chamizal       : 4073   800 S San Marcial Street: 4073  
##  Haws Athletic Center   : 3139   600 1/2 Congress St     : 3139  
##  Corpus Christi Huisache: 2947   3810 Huisache Street    : 2947  
##  (Other)                :51540   (Other)                 :51540  
##       State.Name     County.Name             City.Name    
##  Texas     :79626   Dallas :14844   Dallas        :14844  
##  Alabama   :    0   Harris :13235   Fort Worth    :10664  
##  Alaska    :    0   Tarrant:10814   El Paso       : 9187  
##  Arizona   :    0   El Paso: 9620   Not in a city : 8198  
##  Arkansas  :    0   Nueces : 3998   Houston       : 7811  
##  California:    0   Travis : 3379   Corpus Christi: 3998  
##  (Other)   :    0   (Other):23736   (Other)       :24924  
##                            CBSA.Name     Date.of.Last.Change
##  Dallas-Fort Worth-Arlington, TX:26953   2013-06-29:64032   
##  Houston-Sugar Land-Baytown, TX :15705   2014-06-07:15594   
##  El Paso, TX                    : 9620   2013-06-30:    0   
##  Beaumont-Port Arthur, TX       : 5374   2013-07-10:    0   
##  Corpus Christi, TX             : 3998   2013-07-11:    0   
##  Austin-Round Rock, TX          : 3782   2013-07-15:    0   
##  (Other)                        :14194   (Other)   :    0
names(pm0)
##  [1] "State.Code"          "County.Code"         "Site.Num"           
##  [4] "Parameter.Code"      "POC"                 "Latitude"           
##  [7] "Longitude"           "Datum"               "Parameter.Name"     
## [10] "Sample.Duration"     "Pollutant.Standard"  "Date.Local"         
## [13] "Units.of.Measure"    "Event.Type"          "Observation.Count"  
## [16] "Observation.Percent" "Arithmetic.Mean"     "X1st.Max.Value"     
## [19] "X1st.Max.Hour"       "AQI"                 "Method.Name"        
## [22] "Local.Site.Name"     "Address"             "State.Name"         
## [25] "County.Name"         "City.Name"           "CBSA.Name"          
## [28] "Date.of.Last.Change"
save(pm0, file = 'pm0.Rdata')
grep('Date', names(pm0), value = T)
## [1] "Date.Local"          "Date.of.Last.Change"
pm0$Date.Local <- as.Date(as.character(pm0$Date.Local), '%Y-%m-%d')
summary(pm0$Date.Local)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1999-01-01" "2001-11-18" "2004-02-02" "2005-04-24" "2008-05-01" 
##         Max. 
## "2014-03-31"
pm0$Date.of.Last.Change <- as.Date(as.character(pm0$Date.of.Last.Change), '%Y-%m-%d')
summary(pm0$Date.of.Last.Change)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2013-06-29" "2013-06-29" "2013-06-29" "2013-09-04" "2013-06-29" 
##         Max. 
## "2014-06-07"
save(pm0, file = 'pm0.Rdata')
td0 <- pm0 %>%
  mutate(
    Station = paste(County.Code, Site.Num),
    Date    = as.Date(as.character(Date.Local), '%Y-%m-%d'),
    Year    = factor(format(Date, '%Y')),
    PM25    = Arithmetic.Mean,
    AQI     = AQI,
    ) %>%
  select(
    Year, Date, Station, PM25, AQI
    )

summary(td0)
##       Year            Date              Station               PM25      
##  2002   :10896   Min.   :1999-01-01   Length:79626       Min.   : -0.4  
##  2001   :10481   1st Qu.:2001-11-18   Class :character   1st Qu.:  7.1  
##  2000   : 8541   Median :2004-02-02   Mode  :character   Median : 10.0  
##  2003   : 7145   Mean   :2005-04-24                      Mean   : 11.2  
##  2004   : 6175   3rd Qu.:2008-05-01                      3rd Qu.: 14.0  
##  2005   : 5226   Max.   :2014-03-31                      Max.   :131.0  
##  (Other):31162                                                          
##       AQI       
##  Min.   :  0.0  
##  1st Qu.: 30.0  
##  Median : 42.0  
##  Mean   : 43.1  
##  3rd Qu.: 55.0  
##  Max.   :190.0  
##  NA's   :3055

Analysis

Values of interest:

aiq <- td0$AQI  # Air Quality index
mean(is.na(aiq), na.rm = TRUE)  # percentage of missing values?
## [1] 0.03837
pm25 <- td0$PM25
summary(aiq)  # summary air quality index
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    30.0    42.0    43.1    55.0   190.0    3055
summary(pm25)  # summary PM2.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -0.4     7.1    10.0    11.2    14.0   131.0
mean(pm25, na.rm = TRUE)
## [1] 11.22

Both the Mean 11.2224and the Median 10 of the PM2.5 measures are below the recommended value of 15µg/m3 for the period from 1999 to 2013.

# compare years 1999 to 2013 for changees in levels
td1999 <- td0[td0$Year == '1999', ]  
td2013 <- td0[td0$Year == '2013', ]  
summary(td1999)
##       Year           Date              Station               PM25     
##  1999   :2203   Min.   :1999-01-01   Length:2203        Min.   : 2.5  
##  2000   :   0   1st Qu.:1999-05-29   Class :character   1st Qu.: 7.4  
##  2001   :   0   Median :1999-10-06   Mode  :character   Median :10.8  
##  2002   :   0   Mean   :1999-08-31                      Mean   :12.2  
##  2003   :   0   3rd Qu.:1999-11-18                      3rd Qu.:15.8  
##  2004   :   0   Max.   :1999-12-31                      Max.   :39.8  
##  (Other):   0                                                         
##       AQI       
##  Min.   : 10.0  
##  1st Qu.: 31.0  
##  Median : 45.0  
##  Mean   : 45.8  
##  3rd Qu.: 59.0  
##  Max.   :112.0  
## 
summary(td2013)
##       Year           Date              Station               PM25     
##  2013   :4496   Min.   :2013-01-01   Length:4496        Min.   :-0.4  
##  1999   :   0   1st Qu.:2013-04-03   Class :character   1st Qu.: 6.3  
##  2000   :   0   Median :2013-07-01   Mode  :character   Median : 9.1  
##  2001   :   0   Mean   :2013-07-01                      Mean   :10.4  
##  2002   :   0   3rd Qu.:2013-09-30                      3rd Qu.:12.8  
##  2003   :   0   Max.   :2013-12-31                      Max.   :78.7  
##  (Other):   0                                                         
##       AQI       
##  Min.   :  2.0  
##  1st Qu.: 26.0  
##  Median : 37.0  
##  Mean   : 39.8  
##  3rd Qu.: 52.0  
##  Max.   :163.0  
##  NA's   :1081
# WIP

Plot PM2.5 for Years 1999 thru 2014

# set labels
title <- expression(PM[2.5] * ' Readings for Texas from 1999 thru 2014')
ylabel <- expression(PM[2.5] * ' in Micrograms per ' * m^3)
xlabel <- 'Years'
# build plot
g <- ggplot(td0, aes(x = Date, y = PM25, fill = Year, color = Year) )
g <- g + ggtitle(title) + ylab(ylabel) + xlab(xlabel) + theme_bw()
g <- g + geom_boxplot(stat = 'boxplot', outlier.colour = NULL, position = 'dodge',
                      notch = TRUE, notchwidth = 0.5)
g <- g + geom_smooth(method = 'lm', se = FALSE, color = 'darkgreen', aes(group = 1) )
g <- g + geom_hline(aes(yintercept = 15), show_guide = T, color = 'red')
g <- g + geom_text(aes(x = td0$Date[1], y = 15, label = 'EPA Max 15µg/m3'), color = 'blue',
                                 vjust = -0.5, hjust = 0.25, size = 3)
g
## Warning: position_dodge requires constant width: output may be incorrect

plot of chunk plot1_pm25

Plot AQI for Years 1999 thru 2014

# set labels
title <- expression('AQI Readings for Texas from 1999 thru 2014')
ylabel <- expression('Air Quality Index')
xlabel <- 'Years'
# build plot
g <- ggplot(td0, aes(x = Date, y = AQI, fill = Year, color = Year) )
g <- g + ggtitle(title) + ylab(ylabel) + xlab(xlabel) + theme_bw()
g <- g + geom_boxplot(stat = 'boxplot', outlier.colour = NULL, position = 'dodge',
                      notch = TRUE, notchwidth = 0.5)
g <- g + geom_smooth(method = 'lm', se = FALSE, color = 'darkgreen', aes(group = 1) )
g
## Warning: Removed 3055 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3055 rows containing missing values (stat_smooth).
## Warning: position_dodge requires constant width: output may be incorrect

plot of chunk plot2_AQI

It appears that the State of Texas, on average, is safely inside the EPA guidelines for PM 2.5 particulates. The data suggests that air quality in general is improving across the state.