Data import and Introduction

library("tidyverse")
library("lubridate")
library("stringr")
#import data
US_pollution_sample_2018 <- read_csv("US_pollution_sample_2018.csv")

#rename to remove whitespace
colnames(US_pollution_sample_2018) <- str_replace_all(colnames(US_pollution_sample_2018), " ", "_")

#preview data
glimpse(US_pollution_sample_2018)
Observations: 51,034
Variables: 28
$ State_Code        <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
$ County_Code       <dbl> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, ...
$ Site_Num          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
$ Address           <chr> "7823 BLACKFOOT WAY, NORTH HIGHLANDS", "7823...
$ State             <chr> "California", "California", "California", "C...
$ County            <chr> "Sacramento", "Sacramento", "Sacramento", "S...
$ City              <chr> "Not in a city", "Not in a city", "Not in a ...
$ Date_Local        <date> 2010-01-01, 2010-01-01, 2010-01-01, 2010-01...
$ NO2_Units         <chr> "Parts per billion", "Parts per billion", "P...
$ NO2_Mean          <dbl> 10.304348, 10.304348, 10.304348, 10.304348, ...
$ NO2_1st_Max_Value <dbl> 24, 24, 24, 24, 18, 18, 18, 18, 20, 20, 20, ...
$ NO2_1st_Max_Hour  <dbl> 19, 19, 19, 19, 21, 21, 21, 21, 1, 1, 1, 1, ...
$ NO2_AQI           <dbl> 23, 23, 23, 23, 17, 17, 17, 17, 19, 19, 19, ...
$ O3_Units          <chr> "Parts per million", "Parts per million", "P...
$ O3_Mean           <dbl> 0.013375, 0.013375, 0.013375, 0.013375, 0.00...
$ O3_1st_Max_Value  <dbl> 0.021, 0.021, 0.021, 0.021, 0.018, 0.018, 0....
$ O3_1st_Max_Hour   <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, ...
$ O3_AQI            <dbl> 18, 18, 18, 18, 15, 15, 15, 15, 13, 13, 13, ...
$ SO2_Units         <chr> "Parts per billion", "Parts per billion", "P...
$ SO2_Mean          <dbl> 0.695652, 0.695652, 0.685714, 0.685714, 0.34...
$ SO2_1st_Max_Value <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ SO2_1st_Max_Hour  <dbl> 0, 0, 14, 14, 0, 0, 2, 2, 0, 0, 2, 2, 4, 4, ...
$ SO2_AQI           <dbl> 1, 1, NA, NA, 1, 1, NA, NA, 1, 1, NA, NA, 1,...
$ CO_Units          <chr> "Parts per million", "Parts per million", "P...
$ CO_Mean           <dbl> 0.391304, 0.355556, 0.391304, 0.355556, 0.40...
$ CO_1st_Max_Value  <dbl> 0.7, 0.5, 0.7, 0.5, 1.0, 0.6, 1.0, 0.6, 1.0,...
$ CO_1st_Max_Hour   <dbl> 20, 21, 20, 21, 22, 0, 22, 0, 1, 2, 1, 2, 21...
$ CO_AQI            <dbl> NA, 6, NA, 6, NA, 7, NA, 7, NA, 9, NA, 9, NA...

The dataset contains 51,034 observations and 28 variables. The dataset provides 4 measurements (mean, max value, max hour and Air Quality Index) of four criteria gases (NO2, O3, SO2, CO) in different sites in the US.

At first glance, it can be seen that there are missing values in SO2 AQI and CO AQI. In addition, data is also duplicated.

Data cleaning

US_pollution_sample_2018 %>% 
  group_by(State_Code, County_Code, Site_Num, Date_Local) %>% 
  summarize(dup_group = n()) %>% 
  group_by(dup_group) %>% 
  summarize(no_of_record = n())
# A tibble: 5 x 2
  dup_group no_of_record
      <int>        <int>
1         2           18
2         4         8896
3         6            1
4         8         1923
5        12            2

When grouping data by each site and date, most of the group has 4 records or 8 records. There might be a systematic problem or reason for this duplication and missing values.

When compared to original data from https://aqs.epa.gov/aqsweb/airdata/download_files.html, I realize that in daily data, Nitrogen dioxide and Ozone have only one Pollutant Standard for each, which results in one record per day. Meanwhile, Carbon monoxide and Sulfur dioxide have two Pollutant Standards for each, which also results in two records per day.

Therefore, I suspect that when people combined data from these files, each pair of CO and SO2 matched and auto-created duplicated values for NO2 and O3. Thus, most of the records in the dataset has 4 observations per day.

In addition, CO and SO2 have two Standards, yet only one standard has AQI data available. Hence, when matching, it includes missing data of AQI from a specific Pollutant standard.

For these reasons, I can detect the rule to remove duplicated and missing values and eliminate previous data combination problem.

#clean duplication
data_cleaned <- US_pollution_sample_2018 %>% 
  filter(!is.na(SO2_AQI) & !is.na(CO_AQI))

#check duplication
data_cleaned %>% 
  group_by(State_Code, County_Code, Site_Num, Date_Local) %>% 
  summarize(dup_group = n()) %>% 
  group_by(dup_group) %>% 
  summarize(no_of_record = n())
# A tibble: 3 x 2
  dup_group no_of_record
      <int>        <int>
1         1         8914
2         2         1924
3         3            2

After cleaning, there are still 1926 groups of duplication. It is due to changes in the Measurement method for Ozone in one site in Ohio. After careful comparison, I detect one method the consistently used and try to remove the values of other inconsistent methods. The data from the original source is too big to include in this submission. Therefore, I only import the result after removal of these duplications. Further explanation will be provided if required.

Data summary

#import cleaned dataset
data_cleaned <- read_csv("data_cleaned.csv")

#check duplication
data_cleaned %>% 
  group_by(State_Code, County_Code, Site_Num, Date_Local) %>% 
  summarize(dup_group = n()) %>% 
  group_by(dup_group) %>% 
  summarize(no_of_record = n())
# A tibble: 1 x 2
  dup_group no_of_record
      <int>        <int>
1         1        10016
#check missing values
sapply(data_cleaned, function(x) sum(is.na(x)))
       State_Code       County_Code          Site_Num           Address 
                0                 0                 0                 0 
            State            County              City        Date_Local 
                0                 0                 0                 0 
         NO2_Mean NO2_1st_Max_Value  NO2_1st_Max_Hour           NO2_AQI 
                0                 0                 0                 0 
          O3_Mean  O3_1st_Max_Value   O3_1st_Max_Hour            O3_AQI 
                0                 0                 0                 0 
         SO2_Mean SO2_1st_Max_Value  SO2_1st_Max_Hour           SO2_AQI 
                0                 0                 0                 0 
          CO_Mean  CO_1st_Max_Value   CO_1st_Max_Hour            CO_AQI 
                0                 0                 0                 0 
#convert character to factor variables, date to date format
data_cleaned %>% 
  mutate_at(vars(State_Code,County_Code,Site_Num,Address,State,County,City), funs(factor(.))) %>% 
  mutate(Date_Local = as.Date(Date_Local))-> data_cleaned

#data summary
summary(data_cleaned)
 State_Code County_Code Site_Num   
 6 : 299    5  :2148    2   :1628  
 24:2148    7  :1329    8   :1817  
 34:1329    17 :3323    12  :1272  
 38:2051    35 :1100    60  :1100  
 39:1100    67 : 299    1004:2051  
 42:3089    133:1817    3001:2148  
                                   
                                Address              State     
 266 Spruce Street                  :1329   California  : 299  
 4266 40TH AVE NORTH                :2051   Maryland    :2148  
 600 Dorsey Avenue                  :2148   New Jersey  :1329  
 7823 BLACKFOOT WAY, NORTH HIGHLANDS: 299   North Dakota:2051  
 E. 14TH & ORANGE                   :1100   Ohio        :1100  
 HILL ST.                           :1817   Pennsylvania:3089  
 ROCKVIEW LANE                      :1272                      
        County                City        Date_Local        
 Baltimore :2148   Bristol      :1272   Min.   :2010-01-01  
 Bucks     :1272   Camden       :1329   1st Qu.:2011-09-15  
 Camden    :1329   Cleveland    :1100   Median :2013-01-29  
 Cass      :2051   Essex        :2148   Mean   :2013-02-12  
 Cuyahoga  :1100   Not in a city:2350   3rd Qu.:2014-07-14  
 Sacramento: 299   York         :1817   Max.   :2016-04-30  
 York      :1817                                            
    NO2_Mean      NO2_1st_Max_Value NO2_1st_Max_Hour    NO2_AQI    
 Min.   : 0.000   Min.   : 0.00     Min.   : 0.00    Min.   : 0.0  
 1st Qu.: 5.250   1st Qu.:13.00     1st Qu.: 5.00    1st Qu.:12.0  
 Median : 8.953   Median :21.00     Median : 8.00    Median :20.0  
 Mean   :10.151   Mean   :21.81     Mean   :11.28    Mean   :20.5  
 3rd Qu.:13.674   3rd Qu.:30.00     3rd Qu.:20.00    3rd Qu.:28.0  
 Max.   :53.000   Max.   :95.00     Max.   :23.00    Max.   :95.0  
                                                                   
    O3_Mean        O3_1st_Max_Value  O3_1st_Max_Hour     O3_AQI     
 Min.   :0.00000   Min.   :0.00000   Min.   : 0.00   Min.   :  0.0  
 1st Qu.:0.01850   1st Qu.:0.02800   1st Qu.: 9.00   1st Qu.: 25.0  
 Median :0.02600   Median :0.03600   Median :10.00   Median : 32.0  
 Mean   :0.02597   Mean   :0.03767   Mean   :10.37   Mean   : 34.8  
 3rd Qu.:0.03342   3rd Qu.:0.04700   3rd Qu.:11.00   3rd Qu.: 42.0  
 Max.   :0.06821   Max.   :0.11500   Max.   :23.00   Max.   :200.0  
                                                                    
    SO2_Mean       SO2_1st_Max_Value SO2_1st_Max_Hour    SO2_AQI       
 Min.   :-0.1333   Min.   : -0.100   Min.   : 0.00    Min.   :  0.000  
 1st Qu.: 0.1304   1st Qu.:  0.600   1st Qu.: 1.00    1st Qu.:  0.000  
 Median : 0.7500   Median :  2.000   Median : 8.00    Median :  3.000  
 Mean   : 1.3465   Mean   :  4.324   Mean   : 8.55    Mean   :  5.775  
 3rd Qu.: 1.8667   3rd Qu.:  5.000   3rd Qu.:13.00    3rd Qu.:  7.000  
 Max.   :23.1174   Max.   :191.000   Max.   :23.00    Max.   :153.000  
                                                                       
    CO_Mean       CO_1st_Max_Value CO_1st_Max_Hour      CO_AQI      
 Min.   :0.0000   Min.   :0.000    Min.   : 0.000   Min.   : 0.000  
 1st Qu.:0.1164   1st Qu.:0.200    1st Qu.: 0.000   1st Qu.: 2.000  
 Median :0.2000   Median :0.200    Median : 0.000   Median : 2.000  
 Mean   :0.2076   Mean   :0.302    Mean   : 5.358   Mean   : 3.326  
 3rd Qu.:0.2667   3rd Qu.:0.400    3rd Qu.: 8.000   3rd Qu.: 5.000  
 Max.   :1.3500   Max.   :2.200    Max.   :23.000   Max.   :25.000  
                                                                    

Now data is cleaned with no duplication and known missing values. The dataset contains 10,016 observations and 24 variables. (I dropped 4 label variables to simplify the dataset as they contain only one type of units for each type of gases. It can be easily added later on if needed).

It can be seen from the summary that:

Before further analysis, I transform the cleaned dataset from wide to long format and create three new columns of year, month, day for data manipulation and visualisation.

data_cleaned %>% 
  gather("KPIs", "Value", 9:24) %>% 
  separate(KPIs, c("Gases","Measure"), extra = "merge")%>% 
  mutate(year = factor(year(Date_Local)),
         month = factor(month(Date_Local)),
         day = factor(day(Date_Local))) -> data_cleaned

glimpse(data_cleaned)
Observations: 160,256
Variables: 14
$ State_Code  <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
$ County_Code <fct> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67...
$ Site_Num    <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
$ Address     <fct> "7823 BLACKFOOT WAY, NORTH HIGHLANDS", "7823 BLACK...
$ State       <fct> California, California, California, California, Ca...
$ County      <fct> Sacramento, Sacramento, Sacramento, Sacramento, Sa...
$ City        <fct> Not in a city, Not in a city, Not in a city, Not i...
$ Date_Local  <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-01-04, 2...
$ Gases       <chr> "NO2", "NO2", "NO2", "NO2", "NO2", "NO2", "NO2", "...
$ Measure     <chr> "Mean", "Mean", "Mean", "Mean", "Mean", "Mean", "M...
$ Value       <dbl> 10.304348, 9.913043, 10.260870, 9.521739, 16.31818...
$ year        <fct> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 20...
$ month       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day         <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16,...
#Distribution of Air Quality Index
data_cleaned %>%  
  filter(Measure =="AQI") -> chart
ggplot(chart, aes(Value))+
  geom_histogram(binwidth = 5)

There is a huge amount AQI value equal to zero.

#Histogram breaks down by gases
ggplot(chart, aes(Value))+
  geom_histogram(binwidth = 5)+
  facet_wrap(~Gases)+
  xlab("Ozone AQI")

Most of them come from CO and SO2, while O3 and NO2 seem to be normally distributed (approximately).

It seems that there are still missing values coded “0” in the dataset. For the purpose of finding a meaningful pattern, I decided to postpone this additional missing value problem and let concentrate with more usable data first.

Focusing on Ozone data exploration

Ozone has a very right-skewed distribution. It reaches up to 200, which is an alarming “Unhealthy” point. Let have a look at Ozone distribution:

data_cleaned %>%  
  filter(Measure =="AQI", Gases=="O3") -> chart

ggplot(chart, aes(Value))+
  geom_histogram(binwidth = 5)+
  xlab("Ozone AQI")

There is a sudden drop around the value of 50. The standard cut off ranges between “Good” and “Moderate” are [0,50] to [51,100]. Does this coincidence happen by chance or having any underline factor impact on the AQI calculation?

ggplot(chart, aes(Value))+
  geom_histogram(binwidth = 1)+
  xlab("Ozone AQI")

Zooming into smaller binwidth, the cut-off point even clearer. There are also some spikes with a roughly equal interval.

#Try to understand AQI measurement

#Ozone AQI vs. Mean
data_cleaned %>%
  filter( Gases=="O3", Measure %in% c("AQI", "Mean")) %>%
  spread(Measure, Value) -> chart

ggplot(chart, aes(AQI, Mean))+
  geom_jitter(alpha =0.1)

The connection between mean and AQI is not that close. They are correlated but not exactly linear and any model form.

#Ozone AQI vs. 1st_Max_Value
data_cleaned %>%
  filter( Gases=="O3", Measure %in% c("AQI", "1st_Max_Value")) %>%
  spread(Measure, Value) -> chart

ggplot(chart, aes(AQI, `1st_Max_Value`))+
  geom_jitter(alpha =0.1)

It seems that AQI is calculated based on Maximum value in a day rather than its mean. It forms a very clear linear relationship, broken down into two range [0,50] & [>50]. This is consistent with what has found in the previous section and make sense with the AQI formula described on https://en.wikipedia.org/wiki/Air_quality_index#United_States

However, AQI seems to be calculated by two different formulas, as it forms two separate lines in each segment. It is probably that the defined range is slightly different between locations or methods.

#Relationship between Ozone, other gases and time variant

#relationshiop between 4 gases base on mean
data_cleaned %>%
  filter(Measure =="Mean") %>%
  spread(Gases, Value) -> chart
pairs(chart %>% select(CO,SO2,O3,NO2), pch =21, cex=0.01)

There seems to be no strong correlation between Ozone and any other gases. Other measurements are tried, with similar results.

There is a positive relationship between NO2 and CO, yet we will not discuss further this as we are focusing on Ozone.

###4 gases AQI trend

data_cleaned %>% 
  spread(Measure, Value) %>% 
  filter(AQI != 0) %>% 
  gather("Measure","Value", 13:16) -> data_cleaned

data_cleaned %>% 
  group_by(Measure, Gases, year) %>% 
  summarize(mean = round(mean(Value),2),
            min = min(Value),
            max = max(Value)) %>% 
  gather("Sub_measure", "Value", 4:6)-> sum_1

sum_1
# A tibble: 336 x 5
# Groups:   Measure, Gases [16]
   Measure      Gases year  Sub_measure Value
   <chr>        <chr> <fct> <chr>       <dbl>
 1 1st_Max_Hour CO    2010  mean         5.76
 2 1st_Max_Hour CO    2011  mean         5.98
 3 1st_Max_Hour CO    2012  mean         6.04
 4 1st_Max_Hour CO    2013  mean         5.56
 5 1st_Max_Hour CO    2014  mean         5.8 
 6 1st_Max_Hour CO    2015  mean         5.71
 7 1st_Max_Hour CO    2016  mean         5.81
 8 1st_Max_Hour NO2   2010  mean        12.1 
 9 1st_Max_Hour NO2   2011  mean        11.2 
10 1st_Max_Hour NO2   2012  mean        11.4 
# ... with 326 more rows
ggplot(sum_1 %>%  filter(Measure == "AQI", Sub_measure=="mean"), 
       aes(year,Value, group=Sub_measure, col = Sub_measure))+
  geom_line()+
  geom_point()+
  facet_grid(~Gases)+
  theme(axis.text.x = element_text(angle = 75, hjust = 1, size=11))+
  ylab("AQI")

Even though the AQI index of O3 remains highest among 4 gases, it might be a positive signal of a downward trend on this index. However, data in 2016 is not fully included (only updated until April 2016). Therefore, it might be not a proper indication.

ggplot(data_cleaned %>%  filter(Measure == "AQI"), 
       aes(year,Value))+
  geom_boxplot()+
  facet_grid(~Gases)+
  theme(axis.text.x = element_text(angle = 75, hjust = 1, size=11))+
  ylab("AQI")

Looking at AQI distribution within each year, it is noticeable that O3 (and SO2) has a very huge variation. There are many days that AQI stays out of “good” range (>50). In addition, we can see the boxplot shape of O3 in 2016 is totally different from the previous year. This may imply that 4-month data in 2016 is not comparable with the rest of the data. In addition, O3 might have seasonality feature, which makes the shape of the first 4-month in 2016 is completely different from full-year data.

For this reason, let have a closer look about the trend of O3 over the year.

data_cleaned %>% 
  filter(Gases == "O3", Measure =="AQI") %>% 
  mutate(State_Site = paste0(State, " - ", Site_Num)) %>% 
  group_by(year, month, State_Site) %>% 
  summarize(mean_AQI = mean(Value))-> chart
ggplot(chart, aes(month, mean_AQI, group = State_Site, col = State_Site))+
  geom_line()+
  geom_point(alpha=0.5)+
  facet_grid(State_Site ~ year) + 
  geom_vline(xintercept = c(5,9), alpha =0.3, linetype ="longdash")

Interestingly, O3 AQI has quite a consistent trend over the years and regions. AQI increases significantly around April and decreases around September every year. In addition, the seasonality feature is shown very obviously in the East Coast locations such as Pennsylvania, Maryland and New Jersey, but not so clear in Ohio and not seen in North Dakota.

Locations of selected sites in the provided data

Locations of selected sites in the provided data

When comparing the seasonal trend with their locations, it seems to have a meaningful relationship.

data_cleaned %>% 
  filter(Gases == "O3", Measure =="AQI") %>% 
  mutate(State_Site = paste0(State, " - ", Site_Num)) %>% 
  filter(State_Site %in% c("Maryland - 3001","New Jersey - 2","Pennsylvania - 12","Pennsylvania - 8"))-> chart

ggplot(chart, aes(month, Value, col = State_Site))+
  geom_boxplot(alpha=0.5)+
  # geom_jitter(alpha=0.5)+
  facet_grid(State_Site ~ year) + 
  geom_vline(xintercept = c(5,9), alpha =0.3, linetype ="longdash")+
  ylab("Ozone AQI")

There was a larger variation within the month during this peak period before 2013. It seems to more stabilize in 2013 and 2014, yet have a signal to fluctuate again in 2015.

Conclusion