library("tidyverse")
library("lubridate")
library("stringr")
#import data
US_pollution_sample_2018 <- read_csv("US_pollution_sample_2018.csv")
#rename to remove whitespace
colnames(US_pollution_sample_2018) <- str_replace_all(colnames(US_pollution_sample_2018), " ", "_")
#preview data
glimpse(US_pollution_sample_2018)
Observations: 51,034
Variables: 28
$ State_Code <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
$ County_Code <dbl> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, ...
$ Site_Num <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
$ Address <chr> "7823 BLACKFOOT WAY, NORTH HIGHLANDS", "7823...
$ State <chr> "California", "California", "California", "C...
$ County <chr> "Sacramento", "Sacramento", "Sacramento", "S...
$ City <chr> "Not in a city", "Not in a city", "Not in a ...
$ Date_Local <date> 2010-01-01, 2010-01-01, 2010-01-01, 2010-01...
$ NO2_Units <chr> "Parts per billion", "Parts per billion", "P...
$ NO2_Mean <dbl> 10.304348, 10.304348, 10.304348, 10.304348, ...
$ NO2_1st_Max_Value <dbl> 24, 24, 24, 24, 18, 18, 18, 18, 20, 20, 20, ...
$ NO2_1st_Max_Hour <dbl> 19, 19, 19, 19, 21, 21, 21, 21, 1, 1, 1, 1, ...
$ NO2_AQI <dbl> 23, 23, 23, 23, 17, 17, 17, 17, 19, 19, 19, ...
$ O3_Units <chr> "Parts per million", "Parts per million", "P...
$ O3_Mean <dbl> 0.013375, 0.013375, 0.013375, 0.013375, 0.00...
$ O3_1st_Max_Value <dbl> 0.021, 0.021, 0.021, 0.021, 0.018, 0.018, 0....
$ O3_1st_Max_Hour <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, ...
$ O3_AQI <dbl> 18, 18, 18, 18, 15, 15, 15, 15, 13, 13, 13, ...
$ SO2_Units <chr> "Parts per billion", "Parts per billion", "P...
$ SO2_Mean <dbl> 0.695652, 0.695652, 0.685714, 0.685714, 0.34...
$ SO2_1st_Max_Value <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ SO2_1st_Max_Hour <dbl> 0, 0, 14, 14, 0, 0, 2, 2, 0, 0, 2, 2, 4, 4, ...
$ SO2_AQI <dbl> 1, 1, NA, NA, 1, 1, NA, NA, 1, 1, NA, NA, 1,...
$ CO_Units <chr> "Parts per million", "Parts per million", "P...
$ CO_Mean <dbl> 0.391304, 0.355556, 0.391304, 0.355556, 0.40...
$ CO_1st_Max_Value <dbl> 0.7, 0.5, 0.7, 0.5, 1.0, 0.6, 1.0, 0.6, 1.0,...
$ CO_1st_Max_Hour <dbl> 20, 21, 20, 21, 22, 0, 22, 0, 1, 2, 1, 2, 21...
$ CO_AQI <dbl> NA, 6, NA, 6, NA, 7, NA, 7, NA, 9, NA, 9, NA...
The dataset contains 51,034 observations and 28 variables. The dataset provides 4 measurements (mean, max value, max hour and Air Quality Index) of four criteria gases (NO2, O3, SO2, CO) in different sites in the US.
At first glance, it can be seen that there are missing values in SO2 AQI and CO AQI. In addition, data is also duplicated.
US_pollution_sample_2018 %>%
group_by(State_Code, County_Code, Site_Num, Date_Local) %>%
summarize(dup_group = n()) %>%
group_by(dup_group) %>%
summarize(no_of_record = n())
# A tibble: 5 x 2
dup_group no_of_record
<int> <int>
1 2 18
2 4 8896
3 6 1
4 8 1923
5 12 2
When grouping data by each site and date, most of the group has 4 records or 8 records. There might be a systematic problem or reason for this duplication and missing values.
When compared to original data from https://aqs.epa.gov/aqsweb/airdata/download_files.html, I realize that in daily data, Nitrogen dioxide and Ozone have only one Pollutant Standard for each, which results in one record per day. Meanwhile, Carbon monoxide and Sulfur dioxide have two Pollutant Standards for each, which also results in two records per day.
Therefore, I suspect that when people combined data from these files, each pair of CO and SO2 matched and auto-created duplicated values for NO2 and O3. Thus, most of the records in the dataset has 4 observations per day.
In addition, CO and SO2 have two Standards, yet only one standard has AQI data available. Hence, when matching, it includes missing data of AQI from a specific Pollutant standard.
For these reasons, I can detect the rule to remove duplicated and missing values and eliminate previous data combination problem.
#clean duplication
data_cleaned <- US_pollution_sample_2018 %>%
filter(!is.na(SO2_AQI) & !is.na(CO_AQI))
#check duplication
data_cleaned %>%
group_by(State_Code, County_Code, Site_Num, Date_Local) %>%
summarize(dup_group = n()) %>%
group_by(dup_group) %>%
summarize(no_of_record = n())
# A tibble: 3 x 2
dup_group no_of_record
<int> <int>
1 1 8914
2 2 1924
3 3 2
After cleaning, there are still 1926 groups of duplication. It is due to changes in the Measurement method for Ozone in one site in Ohio. After careful comparison, I detect one method the consistently used and try to remove the values of other inconsistent methods. The data from the original source is too big to include in this submission. Therefore, I only import the result after removal of these duplications. Further explanation will be provided if required.
#import cleaned dataset
data_cleaned <- read_csv("data_cleaned.csv")
#check duplication
data_cleaned %>%
group_by(State_Code, County_Code, Site_Num, Date_Local) %>%
summarize(dup_group = n()) %>%
group_by(dup_group) %>%
summarize(no_of_record = n())
# A tibble: 1 x 2
dup_group no_of_record
<int> <int>
1 1 10016
#check missing values
sapply(data_cleaned, function(x) sum(is.na(x)))
State_Code County_Code Site_Num Address
0 0 0 0
State County City Date_Local
0 0 0 0
NO2_Mean NO2_1st_Max_Value NO2_1st_Max_Hour NO2_AQI
0 0 0 0
O3_Mean O3_1st_Max_Value O3_1st_Max_Hour O3_AQI
0 0 0 0
SO2_Mean SO2_1st_Max_Value SO2_1st_Max_Hour SO2_AQI
0 0 0 0
CO_Mean CO_1st_Max_Value CO_1st_Max_Hour CO_AQI
0 0 0 0
#convert character to factor variables, date to date format
data_cleaned %>%
mutate_at(vars(State_Code,County_Code,Site_Num,Address,State,County,City), funs(factor(.))) %>%
mutate(Date_Local = as.Date(Date_Local))-> data_cleaned
#data summary
summary(data_cleaned)
State_Code County_Code Site_Num
6 : 299 5 :2148 2 :1628
24:2148 7 :1329 8 :1817
34:1329 17 :3323 12 :1272
38:2051 35 :1100 60 :1100
39:1100 67 : 299 1004:2051
42:3089 133:1817 3001:2148
Address State
266 Spruce Street :1329 California : 299
4266 40TH AVE NORTH :2051 Maryland :2148
600 Dorsey Avenue :2148 New Jersey :1329
7823 BLACKFOOT WAY, NORTH HIGHLANDS: 299 North Dakota:2051
E. 14TH & ORANGE :1100 Ohio :1100
HILL ST. :1817 Pennsylvania:3089
ROCKVIEW LANE :1272
County City Date_Local
Baltimore :2148 Bristol :1272 Min. :2010-01-01
Bucks :1272 Camden :1329 1st Qu.:2011-09-15
Camden :1329 Cleveland :1100 Median :2013-01-29
Cass :2051 Essex :2148 Mean :2013-02-12
Cuyahoga :1100 Not in a city:2350 3rd Qu.:2014-07-14
Sacramento: 299 York :1817 Max. :2016-04-30
York :1817
NO2_Mean NO2_1st_Max_Value NO2_1st_Max_Hour NO2_AQI
Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 5.250 1st Qu.:13.00 1st Qu.: 5.00 1st Qu.:12.0
Median : 8.953 Median :21.00 Median : 8.00 Median :20.0
Mean :10.151 Mean :21.81 Mean :11.28 Mean :20.5
3rd Qu.:13.674 3rd Qu.:30.00 3rd Qu.:20.00 3rd Qu.:28.0
Max. :53.000 Max. :95.00 Max. :23.00 Max. :95.0
O3_Mean O3_1st_Max_Value O3_1st_Max_Hour O3_AQI
Min. :0.00000 Min. :0.00000 Min. : 0.00 Min. : 0.0
1st Qu.:0.01850 1st Qu.:0.02800 1st Qu.: 9.00 1st Qu.: 25.0
Median :0.02600 Median :0.03600 Median :10.00 Median : 32.0
Mean :0.02597 Mean :0.03767 Mean :10.37 Mean : 34.8
3rd Qu.:0.03342 3rd Qu.:0.04700 3rd Qu.:11.00 3rd Qu.: 42.0
Max. :0.06821 Max. :0.11500 Max. :23.00 Max. :200.0
SO2_Mean SO2_1st_Max_Value SO2_1st_Max_Hour SO2_AQI
Min. :-0.1333 Min. : -0.100 Min. : 0.00 Min. : 0.000
1st Qu.: 0.1304 1st Qu.: 0.600 1st Qu.: 1.00 1st Qu.: 0.000
Median : 0.7500 Median : 2.000 Median : 8.00 Median : 3.000
Mean : 1.3465 Mean : 4.324 Mean : 8.55 Mean : 5.775
3rd Qu.: 1.8667 3rd Qu.: 5.000 3rd Qu.:13.00 3rd Qu.: 7.000
Max. :23.1174 Max. :191.000 Max. :23.00 Max. :153.000
CO_Mean CO_1st_Max_Value CO_1st_Max_Hour CO_AQI
Min. :0.0000 Min. :0.000 Min. : 0.000 Min. : 0.000
1st Qu.:0.1164 1st Qu.:0.200 1st Qu.: 0.000 1st Qu.: 2.000
Median :0.2000 Median :0.200 Median : 0.000 Median : 2.000
Mean :0.2076 Mean :0.302 Mean : 5.358 Mean : 3.326
3rd Qu.:0.2667 3rd Qu.:0.400 3rd Qu.: 8.000 3rd Qu.: 5.000
Max. :1.3500 Max. :2.200 Max. :23.000 Max. :25.000
Now data is cleaned with no duplication and known missing values. The dataset contains 10,016 observations and 24 variables. (I dropped 4 label variables to simplify the dataset as they contain only one type of units for each type of gases. It can be easily added later on if needed).
It can be seen from the summary that:
Before further analysis, I transform the cleaned dataset from wide to long format and create three new columns of year, month, day for data manipulation and visualisation.
data_cleaned %>%
gather("KPIs", "Value", 9:24) %>%
separate(KPIs, c("Gases","Measure"), extra = "merge")%>%
mutate(year = factor(year(Date_Local)),
month = factor(month(Date_Local)),
day = factor(day(Date_Local))) -> data_cleaned
glimpse(data_cleaned)
Observations: 160,256
Variables: 14
$ State_Code <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
$ County_Code <fct> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67...
$ Site_Num <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
$ Address <fct> "7823 BLACKFOOT WAY, NORTH HIGHLANDS", "7823 BLACK...
$ State <fct> California, California, California, California, Ca...
$ County <fct> Sacramento, Sacramento, Sacramento, Sacramento, Sa...
$ City <fct> Not in a city, Not in a city, Not in a city, Not i...
$ Date_Local <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-01-04, 2...
$ Gases <chr> "NO2", "NO2", "NO2", "NO2", "NO2", "NO2", "NO2", "...
$ Measure <chr> "Mean", "Mean", "Mean", "Mean", "Mean", "Mean", "M...
$ Value <dbl> 10.304348, 9.913043, 10.260870, 9.521739, 16.31818...
$ year <fct> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 20...
$ month <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ day <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16,...
#Distribution of Air Quality Index
data_cleaned %>%
filter(Measure =="AQI") -> chart
ggplot(chart, aes(Value))+
geom_histogram(binwidth = 5)
There is a huge amount AQI value equal to zero.
#Histogram breaks down by gases
ggplot(chart, aes(Value))+
geom_histogram(binwidth = 5)+
facet_wrap(~Gases)+
xlab("Ozone AQI")
Most of them come from CO and SO2, while O3 and NO2 seem to be normally distributed (approximately).
It seems that there are still missing values coded “0” in the dataset. For the purpose of finding a meaningful pattern, I decided to postpone this additional missing value problem and let concentrate with more usable data first.
Ozone has a very right-skewed distribution. It reaches up to 200, which is an alarming “Unhealthy” point. Let have a look at Ozone distribution:
data_cleaned %>%
filter(Measure =="AQI", Gases=="O3") -> chart
ggplot(chart, aes(Value))+
geom_histogram(binwidth = 5)+
xlab("Ozone AQI")
There is a sudden drop around the value of 50. The standard cut off ranges between “Good” and “Moderate” are [0,50] to [51,100]. Does this coincidence happen by chance or having any underline factor impact on the AQI calculation?
ggplot(chart, aes(Value))+
geom_histogram(binwidth = 1)+
xlab("Ozone AQI")
Zooming into smaller binwidth, the cut-off point even clearer. There are also some spikes with a roughly equal interval.
#Try to understand AQI measurement
#Ozone AQI vs. Mean
data_cleaned %>%
filter( Gases=="O3", Measure %in% c("AQI", "Mean")) %>%
spread(Measure, Value) -> chart
ggplot(chart, aes(AQI, Mean))+
geom_jitter(alpha =0.1)
The connection between mean and AQI is not that close. They are correlated but not exactly linear and any model form.
#Ozone AQI vs. 1st_Max_Value
data_cleaned %>%
filter( Gases=="O3", Measure %in% c("AQI", "1st_Max_Value")) %>%
spread(Measure, Value) -> chart
ggplot(chart, aes(AQI, `1st_Max_Value`))+
geom_jitter(alpha =0.1)
It seems that AQI is calculated based on Maximum value in a day rather than its mean. It forms a very clear linear relationship, broken down into two range [0,50] & [>50]. This is consistent with what has found in the previous section and make sense with the AQI formula described on https://en.wikipedia.org/wiki/Air_quality_index#United_States
However, AQI seems to be calculated by two different formulas, as it forms two separate lines in each segment. It is probably that the defined range is slightly different between locations or methods.
#Relationship between Ozone, other gases and time variant
#relationshiop between 4 gases base on mean
data_cleaned %>%
filter(Measure =="Mean") %>%
spread(Gases, Value) -> chart
pairs(chart %>% select(CO,SO2,O3,NO2), pch =21, cex=0.01)
There seems to be no strong correlation between Ozone and any other gases. Other measurements are tried, with similar results.
There is a positive relationship between NO2 and CO, yet we will not discuss further this as we are focusing on Ozone.
###4 gases AQI trend
data_cleaned %>%
spread(Measure, Value) %>%
filter(AQI != 0) %>%
gather("Measure","Value", 13:16) -> data_cleaned
data_cleaned %>%
group_by(Measure, Gases, year) %>%
summarize(mean = round(mean(Value),2),
min = min(Value),
max = max(Value)) %>%
gather("Sub_measure", "Value", 4:6)-> sum_1
sum_1
# A tibble: 336 x 5
# Groups: Measure, Gases [16]
Measure Gases year Sub_measure Value
<chr> <chr> <fct> <chr> <dbl>
1 1st_Max_Hour CO 2010 mean 5.76
2 1st_Max_Hour CO 2011 mean 5.98
3 1st_Max_Hour CO 2012 mean 6.04
4 1st_Max_Hour CO 2013 mean 5.56
5 1st_Max_Hour CO 2014 mean 5.8
6 1st_Max_Hour CO 2015 mean 5.71
7 1st_Max_Hour CO 2016 mean 5.81
8 1st_Max_Hour NO2 2010 mean 12.1
9 1st_Max_Hour NO2 2011 mean 11.2
10 1st_Max_Hour NO2 2012 mean 11.4
# ... with 326 more rows
ggplot(sum_1 %>% filter(Measure == "AQI", Sub_measure=="mean"),
aes(year,Value, group=Sub_measure, col = Sub_measure))+
geom_line()+
geom_point()+
facet_grid(~Gases)+
theme(axis.text.x = element_text(angle = 75, hjust = 1, size=11))+
ylab("AQI")
Even though the AQI index of O3 remains highest among 4 gases, it might be a positive signal of a downward trend on this index. However, data in 2016 is not fully included (only updated until April 2016). Therefore, it might be not a proper indication.
ggplot(data_cleaned %>% filter(Measure == "AQI"),
aes(year,Value))+
geom_boxplot()+
facet_grid(~Gases)+
theme(axis.text.x = element_text(angle = 75, hjust = 1, size=11))+
ylab("AQI")
Looking at AQI distribution within each year, it is noticeable that O3 (and SO2) has a very huge variation. There are many days that AQI stays out of “good” range (>50). In addition, we can see the boxplot shape of O3 in 2016 is totally different from the previous year. This may imply that 4-month data in 2016 is not comparable with the rest of the data. In addition, O3 might have seasonality feature, which makes the shape of the first 4-month in 2016 is completely different from full-year data.
For this reason, let have a closer look about the trend of O3 over the year.
data_cleaned %>%
filter(Gases == "O3", Measure =="AQI") %>%
mutate(State_Site = paste0(State, " - ", Site_Num)) %>%
group_by(year, month, State_Site) %>%
summarize(mean_AQI = mean(Value))-> chart
ggplot(chart, aes(month, mean_AQI, group = State_Site, col = State_Site))+
geom_line()+
geom_point(alpha=0.5)+
facet_grid(State_Site ~ year) +
geom_vline(xintercept = c(5,9), alpha =0.3, linetype ="longdash")
Interestingly, O3 AQI has quite a consistent trend over the years and regions. AQI increases significantly around April and decreases around September every year. In addition, the seasonality feature is shown very obviously in the East Coast locations such as Pennsylvania, Maryland and New Jersey, but not so clear in Ohio and not seen in North Dakota.
Locations of selected sites in the provided data
When comparing the seasonal trend with their locations, it seems to have a meaningful relationship.
data_cleaned %>%
filter(Gases == "O3", Measure =="AQI") %>%
mutate(State_Site = paste0(State, " - ", Site_Num)) %>%
filter(State_Site %in% c("Maryland - 3001","New Jersey - 2","Pennsylvania - 12","Pennsylvania - 8"))-> chart
ggplot(chart, aes(month, Value, col = State_Site))+
geom_boxplot(alpha=0.5)+
# geom_jitter(alpha=0.5)+
facet_grid(State_Site ~ year) +
geom_vline(xintercept = c(5,9), alpha =0.3, linetype ="longdash")+
ylab("Ozone AQI")
There was a larger variation within the month during this peak period before 2013. It seems to more stabilize in 2013 and 2014, yet have a signal to fluctuate again in 2015.
Data Quality: The dataset contains duplicated and missing values. It has been removed before conducting the analysis. However, there are still some inconsistencies and abnormal values that need to investigate further.
Ozone analysis:
There is strong evidence of seasonality of Ozone AQI. The index increases significantly starting around April and drop back around September. Seasonality characteristic of Ozone seems to link to the location. The areas which are close to the East Coast shows a stronger seasonality (e.g ), while it is less obvious in Ohio and not seen in North Dakota. I might link to the climate and weather feature of these areas. In addition, the peak period also suggests its relation to the heat from the sun and temperature. As the peak time is from April to September.
It also shows a minor improvement in controlling Ozone AQI level in 2013 and 2014, yet the trend seems to reverse in 2015. The analysis should be continued with newly updated data to confirm the trend.