MSE InfVis Module Task

Summary

Fine particulate matter (PM2.5) is an ambient air pollutant with diameter of 2.5 micrometers or less, for which there is a strong evidence that is harmful to human health. This work explores the change in PM2.5 emissions through ten years in the USA. As a result, those counties are highlighted which experienced surprisingly high increase of emissions between 1999–2008.

Emissions data retrieved from: http://www.epa.gov/ttn/chief/net/2008inventory.html

Literature:

Chang, W., 2013. R graphics cookbook: practical recipes for visualizing data, 1st ed. ed. O’Reilly, Beijing.
Dorman, M., 2014. Learning R for geospatial analysis: leverage the power of R to elegantly manage crucial geospatial analysis tasks. Packt Publishing, Birmingham.
Particulates, 2015. . Wikipedia Free Encycl. Retrieved 03.05.2015.
Wickham, H., 2009. ggplot2: Elegant Graphics for Data Analysis. Springer New York.

Module supervisor: Prof. Dr. Susanne Bleisch

Load the necessary packages and custom functions.

library(ggplot2)
library(plyr)
library(rgdal)

## Loading required package: sp

## rgdal: version: 1.1-7, (SVN revision 612)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 1.11.0, released 2014/04/16
##  Path to GDAL shared files: /usr/local/share/gdal
##  Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
##  Path to PROJ.4 shared files: (autodetected)
##  Linking to sp version: 1.2-2

library(raster)
library(grid)
#library(rgeos)
library(maptools)

## Checking rgeos availability: TRUE

source("R/sum_year.R")
source("R/percent_change.R")

Read in the emissions data, it will take a few seconds!

NEI <- readRDS("data/summarySCC_PM25.rds")
attach(NEI)
head(NEI)

##     fips      SCC Pollutant Emissions  type year
## 4  09001 10100401  PM25-PRI    15.714 POINT 1999
## 8  09001 10100404  PM25-PRI   234.178 POINT 1999
## 12 09001 10100501  PM25-PRI     0.128 POINT 1999
## 16 09001 10200401  PM25-PRI     2.036 POINT 1999
## 20 09001 10200504  PM25-PRI     0.388 POINT 1999
## 24 09001 10200602  PM25-PRI     1.490 POINT 1999

Data preparation and overview

We need to see how the data is distributed and whats the case with the outliers. For that, the best tools are the histogram, density plot, box plot. So lets check a histogram first, using year as the faceting variable

Density plot.

Let’s see a box plot. A box plot can help us to clearly visualize outliers in the data, although its worth to know how actually the outliers are calculated. In the default settings of the ggplot2 those values are considered outliers that fall below the 2QR and above the 4QR +/- 1.5*the interquartile range (IQR).

Well, it looks like that there are quite many of them. This can be a bit tricky and raises a couple of questions: 1. How are these extreme values compared to the “core” values? 2. Are those values really outliers, or this is the nature of the data?

Unfortunately, we in the scope of this project we cannot really go so deep into this topic as it would require, therefore we choose the easy path and just exclude them from the further analysis because they would strongly skew the results. Excluding the outliers:

upper.limit <- quantile(NEI$Emissions)[4] + 1.5*IQR(NEI$Emissions)
lower.limit <- quantile(NEI$Emissions)[2] - 1.5*IQR(NEI$Emissions)
lower.limit

##        25% 
## -0.1336473

But notice that the lower limit is negative. Because as far as we know a negative emission doesn’t make sense, we are going to manually set the lower limit to 0.

lower.limit <- 0

And now we can subset the data and create two groups. The first is going tho be the group without the previously identified outliers, the second group is the group of outliers where only the positive outliers are considered. The group for the outliers is necessary in case we want to analyse them more in detail later on, but in this case we continue without the outliers.

NEI_sub <- subset(NEI, Emissions<upper.limit & Emissions>lower.limit)
NEI_outl <- subset(NEI, Emissions>upper.limit)

Data analysis

First, let’s see how the emissions changed through ten years in the whole country. For that, we need to summarize the emissions per year.

NEI_sum  <- data.frame(year = unique(NEI_sub$year), S = sum_year(NEI_sub))

And create a line graph.

We can observe an overall increase between 2005 and 2008. What happens if we zoom in and subdivide by emissions source? Again, sum up the data by source type and year.

NEI_type_sum  <- ddply(NEI_sub, c("type", "year"), summarise, Emissions_total=sum(Emissions))

And create a similar plot than before.

We can observe, that the increase in emissions was mainly caused by POINT type of sources. Additionally, on this plot the interesting range is highlighted with red. This is what we are going to focus on next.

We are going to calculate the percentage change in emissions in POINT sources between 2005 and 2008 to see if there are extreme values that could have caused that obvious increase. First calculate the percentage change in emissions. subset NEI_sub for type==POINT

NEI_POINT <- NEI_sub[NEI_sub$type=="POINT", ] # subset NEI_sub for type==POINT

Calculate the total emissions per year for every fips.

fips_year_all <- ddply(NEI_POINT, c("fips", "year"), summarise, Emissions_total=sum(Emissions))

Select those fips that have a value for 2005 OR 2008.

fips_year <- fips_year_all[fips_year_all$year==2005 | fips_year_all$year==2008, ]

Delete the fips that don’t have a value for both 2005 and 2008.

fips_05 <- fips_year[fips_year$year==2005, "fips"]
fips_08 <- fips_year[fips_year$year==2008, "fips"]
fips.include <- intersect(fips_05, fips_08)
fips_year <- fips_year[fips_year$fips %in% fips.include, ]

Then calculate the percentage change by every fips.

fips_CHG  <- data.frame(fips=as.numeric(), Em.change=as.numeric())
for (i in unique(fips_year$fips)) {
    e_05 <- fips_year[fips_year$fips==i & fips_year$year==2005, "Emissions_total"]
    e_08 <- fips_year[fips_year$fips==i & fips_year$year==2008, "Emissions_total"]
    fips_CHG <- rbind(fips_CHG, data.frame(fips=i, Em.change=pc_change(e_05, e_08)))
}
head(fips_CHG)

##    fips Em.change
## 1 01001 226.78530
## 2 01007 -63.27132
## 3 01009 -80.58632
## 4 01013 211.59802
## 5 01015 160.13376
## 6 01017  10.01750

Now we plot a histogram.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Unfortunately there is not much we can see, but if we take a look at the scale of the x-axis, we can observe some very high values. Common sense says that change of this magnitude is unlikely to be realistic. Therefore we zoom in to the data to see if we can identify the distribution.

## Warning: Removed 74 rows containing non-finite values (stat_bin).

The histogram shows the percentage change in emissions from POINT sources between 2005 and 2008. Because the histogram drops at 600% and common sense dictates that a change in the thousand percent range is unlikely, the values above 600% were excluded from the further analysis. Hence, we subset our data again.

fips_CHG_lim <- fips_CHG[fips_CHG$Em.change<=600, ]

Now we can plot the values on a map, to see if we can identify any pattern. But first we need to re-code the continuous variables to categorical.

fips_CHG_lim$category <- cut(fips_CHG_lim$Em.change,
                     breaks=c(-100, 0, 100, 200, 600), # -100 because of min(fips_CHG_lim$Em.change)
                     labels=c("-100–0%","0–100%","100–200%","200–600%"))

Let’s prepare the data for the map. Where the geographical data for the map is directly downloaded from the database of global administrative boundaries. First, get and prepare the data for the county boundaries.

county  <-  readOGR("/home/bdukai/Documents/Studies/MSE_GIT/InfVis/ModulTask/data", "USA_2_GADM_fips", stringsAsFactors = FALSE) # need to provide a full path

## OGR data source with driver: ESRI Shapefile 
## Source: "/home/bdukai/Documents/Studies/MSE_GIT/InfVis/ModulTask/data", layer: "USA_2_GADM_fips"
## with 3145 features
## It has 4 fields

county = county[
    county$NAME_1 != "Alaska" &
        county$NAME_1 != "Hawaii", ]
county = county[
    county$TYPE_2 != "Water body", ]
newProj = CRS(
"+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs")
county = spTransform(county, newProj)
county_f  <-  fortify(county, region = "FIPS")
head(county_f)

##      long      lat order  hole piece    id   group
## 1 1225972 -1274991     1 FALSE     1 01001 01001.1
## 2 1234371 -1274114     2 FALSE     1 01001 01001.1
## 3 1244907 -1272280     3 FALSE     1 01001 01001.1
## 4 1244132 -1267496     4 FALSE     1 01001 01001.1
## 5 1265116 -1263940     5 FALSE     1 01001 01001.1
## 6 1265318 -1263907     6 FALSE     1 01001 01001.1

Join the data with the calculated changes for every county.

colnames(county_f)[which(colnames(county_f) == "id")] = "fips"
county_f  <-  join(county_f, fips_CHG_lim, "fips")
head(county_f)

##      long      lat order  hole piece  fips   group Em.change category
## 1 1225972 -1274991     1 FALSE     1 01001 01001.1  226.7853 200–600%
## 2 1234371 -1274114     2 FALSE     1 01001 01001.1  226.7853 200–600%
## 3 1244907 -1272280     3 FALSE     1 01001 01001.1  226.7853 200–600%
## 4 1244132 -1267496     4 FALSE     1 01001 01001.1  226.7853 200–600%
## 5 1265116 -1263940     5 FALSE     1 01001 01001.1  226.7853 200–600%
## 6 1265318 -1263907     6 FALSE     1 01001 01001.1  226.7853 200–600%

Get the data for the state boundaries.

states  <-  getData("GADM", country = "USA", level = 1) 
states  <-  states[!(states$NAME_1 %in% c("Alaska", "Hawaii")), ]
states  <-  spTransform(states, CRS(proj4string(county)))
states_f  <-  fortify(states, region = "NAME_1")
head(states_f)

##      long      lat order  hole piece      id     group
## 1 1076104 -1034268     1 FALSE     1 Alabama Alabama.1
## 2 1085410 -1033146     2 FALSE     1 Alabama Alabama.1
## 3 1093749 -1031892     3 FALSE     1 Alabama Alabama.1
## 4 1107308 -1030032     4 FALSE     1 Alabama Alabama.1
## 5 1108666 -1029851     5 FALSE     1 Alabama Alabama.1
## 6 1112841 -1029288     6 FALSE     1 Alabama Alabama.1

And set the theme for the plot.

sp_minimal  <- theme_minimal(base_size = 12, base_family = "Verdana") +
    theme(axis.text = element_blank(),
          axis.title = element_blank(),
          axis.ticks = element_blank())

Finally, plot the map.

Map of the USA identifying those counties that experienced a surprisingly high change in emissions from POINT sources between 2005 and 2008. These cases would require closer analysis to discover the cause of the high amount of change. Furthermore, the amount of change was categorized using common sense, but with more information on this topic, the categories could look very different, also with regard of what is considered an outlier.

MSE InfVis Module Task

Balázs Dukai

06.05.2015

Summary

Literature:

Data preparation and overview

Data analysis