Assignment 3 - Visualization and Analysis of Air Pollution Data

Introduction

Air pollution has been always of great concern due to its adverse impact on public health. Environmental departments of the countries around the world are responsible for monitoring and assessing the air pollution problems through various approaches such as ground observation network. A number of air pollutants such ozone, carbon monoxide, particulate matter is timely measured by the ground stations across the globe. EPA (Environmental Protection Agency) in USA provides open and free data for air pollution data. In this work, air pollution data from ground monitoring stations and aggregated pollution variables at the county levels were accessed and download from EPA in USA (https://aqs.epa.gov/aqsweb/airdata/download_files.html). Two key variables of particulate matter 2.5 (PM2.5) and air quality index (AQI) for 2018 selected from the downloaded data were used to visualize and analyze the trends and patterns of air pollution in USA. PM2.5 represents the concentration of particulate matters with diameter less than 2.5 micrometers and AQI is an overall metric that represents air pollution condition by considering all major pollutant. Specifically, the data sources for this work are listed as follows:

Daily level PM 2.5 concentration and AQI index data by monitors in 2018;
Annually AQI summaries at the county level in 2018.

Data preprocessing

Daily site data contains various attributes related to PM2.5 for each ground station. Key variables regarding station information, pollutants are filtered from the original data by the code below. The column names of filtered data frame are renamed for the convenience of the following analysis.

library(readr)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(ggplot2)

# import data
aqi <- read_csv("C:/Users/Tom/Desktop/data/ass3/data/annual_aqi_by_county_2018.csv")

## Parsed with column specification:
## cols(
##   State = col_character(),
##   County = col_character(),
##   Year = col_double(),
##   `Days with AQI` = col_double(),
##   `Good Days` = col_double(),
##   `Moderate Days` = col_double(),
##   `Unhealthy for Sensitive Groups Days` = col_double(),
##   `Unhealthy Days` = col_double(),
##   `Very Unhealthy Days` = col_double(),
##   `Hazardous Days` = col_double(),
##   `Max AQI` = col_double(),
##   `90th Percentile AQI` = col_double(),
##   `Median AQI` = col_double(),
##   `Days CO` = col_double(),
##   `Days NO2` = col_double(),
##   `Days Ozone` = col_double(),
##   `Days SO2` = col_double(),
##   `Days PM2.5` = col_double(),
##   `Days PM10` = col_double()
## )

dly <- read_csv("C:/Users/Tom/Desktop/data/ass3/data/daily.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Parameter Code` = col_double(),
##   POC = col_double(),
##   Latitude = col_double(),
##   Longitude = col_double(),
##   `Date Local` = col_date(format = ""),
##   `Observation Count` = col_double(),
##   `Observation Percent` = col_double(),
##   `Arithmetic Mean` = col_double(),
##   `1st Max Value` = col_double(),
##   `1st Max Hour` = col_double(),
##   AQI = col_double(),
##   `Method Code` = col_double(),
##   `Date of Last Change` = col_date(format = "")
## )

## See spec(...) for full column specifications.

pmaqi <- 
  dly %>%
  select(`State Name`, `County Name`, `Site Num`, `Date Local`,
         `Arithmetic Mean`, AQI, Longitude, Latitude) %>%
  filter(`Arithmetic Mean` >= 0) %>%
  drop_na()
names(pmaqi) <- c("statename", "countyname", "sitenum", "date", 
                     "pm25", "aqi",  "long", "lat")

Statistical summary

A basic statistical description of point-level PM2.5 concentration and county-level AQI index can help to better understand the two variables. Here, histogram and boxplots are used to visualize the characteristics of the distribution of PM2.5 concentration and AQI. Daily AQI index measures the severity of air pollution and can be used to categorize days into different groups based on the value of AQI. In terms of the level of air pollution, days can be typically group into four categories including Good Days, Moderate Days, Unhealthy Days and Hazardous Days. County-level AQI data used in this report include four category variables that represents the number of days of each category in 2018. The code below visualizes the distribution of the four categorical variables. According to the histograms, most counties generally experienced good air quality throughout 2018 which is indicated by the right skewness of the distribution for Good Days a very narrow distribution intervals around 0 day for Unhealthy Days and Hazardous Days. Additionally, the distributions of Good Days and Moderate Days show opposite skewness, which suggests that the days with moderate air quality were not typical throughout the year 2018 as the distribution of Moderate Days is clustered below 50 days compared to above 250 days for Good Days.

## histogram for aqi at county
vis <-
  aqi %>% 
  select(`Good Days`, `Moderate Days`, `Unhealthy Days`, `Hazardous Days`) %>%
  gather(key = class, value = days)
vis$class <- factor(vis$class, levels = c("Good Days", "Moderate Days", 
                                                  "Unhealthy Days", "Hazardous Days"))
ggplot(data = vis) +
  geom_histogram(aes(x = days), bins = 35) +
  facet_wrap(~ class, scales = "free_y") +
  theme_bw()

The county-level data represent a average level of air pollution. Point-level measurements of PM2.5 and AQI derived from ground stations depict a detail and local condition of air pollution. The distributions of PM2.5 and AQI at the station level are displayed below. Both distribution shows a positive skewness with AQI clustering around 50 and PM2.5 below 25. Stations with high values for both variables seldom occurred during the year.

## histogram for aqi and pm2.5 at sites
vis <- 
  pmaqi %>%
  select(pm25, aqi) %>%
  gather(key = class, value = value)
ggplot(data = vis) +
  geom_histogram(aes(x = value), bins = 35) +
  facet_wrap(~ class, scales = "free") +
  theme_bw()

ggplot(data = vis) +
  geom_boxplot(aes(x =class,  y = value)) +
  theme_bw()

Temporal variation

Daily data record time series of observation at the monitoring sites for each day in 2018 and can be utilized to explore the trends of PM2.5 and AQI. First, measurements of the stations in each state were averaged for each data. Time series of seven states were selected to show their changes over the year as displayed below. It can be seen from the plot that states in the south and western region of USA like California and Mexico experienced severely air pollution problem in the beginning of the year. Both Washington and California suffered bad air quality in the second half of the year, which is indicated by the peaks of the trend curves.

## aggreagte by state and day
vis <- 
  pmaqi %>%
  select(statename, date, pm25, aqi) %>%
  group_by(statename, date) %>%
  summarise(pm25 = mean(pm25), aqi = mean(aqi)) %>%
  gather(key = class, value = value, -date, -statename) %>%
  filter(statename %in% c("Alabama", "California", "Arizona", "Country Of Mexico", "New York", "Maryland", "Washington"))
ggplot(data = vis) +
  geom_path(aes(x = date, y = value, color = statename)) +
  facet_wrap(~ class) +
  theme_bw()

To show the overall changes, the measurements from all the monitors were aggregated at each day. The overall variation over the year is plotted as follows. We can see that the changing trends of AQI and PM2.5 show a similar pattern and summer and winter seasons experience more severe air pollution.

## aggreagte by day
vis <- 
  pmaqi %>%
  select(date, pm25, aqi) %>%
  group_by(date) %>%
  summarise(pm25 = mean(pm25), aqi = mean(aqi)) %>%
  gather(key = class, value = value, -date)
ggplot(data = vis) +
  geom_path(aes(x = date, y = value, color = class)) +
  theme_bw()

Spatial distribution and patterns

County-level AQI index was aggregated against different states to explore spatial variation of air quality for each state. Different averages of the variable Good Days are rendered by gradient colors in the plot below. This map shows how many days with good air quality for each state in 2018. From the map, states in the northern and eastern regions had better air condition than other states.

## aggregate aqi by state (areal level)
states <- map_data("state")
aqi$State <- stringr::str_to_lower(aqi$State)
vis <- 
  aqi %>%
  select(State, `Good Days`, `Moderate Days`, `Unhealthy Days`) %>%
  group_by(State) %>%
  summarise(good = mean(`Good Days`),
            mod = mean(`Moderate Days`),
            unhel = mean(`Unhealthy Days`))
vis <-
  states %>%
  left_join(vis,  by = c("region" = "State"))

ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = good),
               data = vis) +
  scale_fill_distiller("Days", palette = "Spectral") +
  xlab("Longitude") + ylab("Latitude") +
  ggtitle("Distribution of good days") +
  coord_map() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.key.width = unit(4, "mm"))

Averages of the variable Unhealthy days are also mapped below, which confirms the observation from the previous that states in east had better air condition. We can also see that California is the state that experienced most days of unhealthy condition.

ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = unhel),
               data = vis) +
  scale_fill_distiller("Days", palette = "Spectral") +
  xlab("Longitude") + ylab("Latitude") +
  ggtitle("Distribution of unhealth days") +
  coord_map() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), 
        legend.key.width = unit(4, "mm"))

Point-level measurements from stations are aggregated for each station to from annually average values. The map below shows the spatial distribution of annually averaged PM2.5 concentration over the USA. The points with green colors indicate lower PM2.5 concentration. The stations in the west had measured higher concentration level of PM2.5.

## aggregate for point data
### for pm2.5
vis <- 
  pmaqi %>%
  select(sitenum, long, lat, pm25, aqi) %>%
  group_by(sitenum) %>%
  summarise(pm25 = mean(pm25), aqi = mean(aqi), long = mean(long), lat = mean(lat)) %>%
  gather(key = class, value = value, -long, -lat, -sitenum) %>%
  filter(long > -125)
vis1 <- vis[vis$class == "pm25", ]
ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group),
               fill = "whitesmoke", colour = "gray50", size = 0.1, data = states) +
  geom_point(aes(x = long, y = lat, color = value), data = vis1) +
  scale_color_distiller("PM 2.5", palette = "RdYlGn") +
  xlab("Longitude") + ylab("Latitude") +
  ggtitle("Distribution of PM 2.5 concentration") +
  coord_map() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), 
        legend.key.width = unit(4, "mm"))

AQI can provide a more comprehensive measure of air condition. Point-level AQI at the stations is mapped as follows. Since AQI considers a number of typical pollutants, the distribution of AQI shows a moderately different pattern compared to the pattern of PM2.5.

### for aqi index
vis2 <- vis[vis$class == "aqi", ]
ggplot() +
  geom_polygon(aes(x = long, y = lat, group = group),
               fill = "whitesmoke", colour = "gray50", size = 0.1, 
               data = states) +
  geom_point(aes(x = long, y = lat, color = value), data = vis2) +
  scale_color_distiller("AQI", palette = "PuBuGn", direction = 0) +
  xlab("Longitude") + ylab("Latitude") +
  ggtitle("Distribution of AQI") +
  coord_map() +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), 
        legend.key.width = unit(4, "mm"))

Conclusion

This report utilizes the air pollution data from EPA and visualization facilities from R to explore the distributions and patterns of air pollution from both spatial and temporal dimensions. High spatial and temporal variations in PM2.5 and AQI have been identified from the data. Summer and winter seasons were experienced server air pollution problem. States in the west and south tend to be suffered more air pollution problem than the state in the northern and eastern regions.