Intro

This is a project for the Exploratory Data Analysis course, which is part of Coursera’s Data Science and Data Science: Foundations Using R Specializations.

The project explores the National Emissions Inventory database to see what it says about fine particulate matter (PM2.5) pollution in the United States over the 10-year period 1999–2008.



Setup

Load packages & data

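# dplyr is attached explicitly for %>%, filter(), group_by(), summarise(), transmute()
library(data.table); library(dplyr); library(ggplot2); library(plotly); library(lattice)
# Make sure the target directory exists before downloading into it
dir.create("./data/0903_DS-EDA-w4_Emissions", recursive = TRUE, showWarnings = FALSE)
# Download the archive only if it is not already present
if(!file.exists("./data/0903_DS-EDA-w4_Emissions/data.zip"))
        {download.file(
                "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip",
                "./data/0903_DS-EDA-w4_Emissions/data.zip", method = "curl")}
unzip("./data/0903_DS-EDA-w4_Emissions/data.zip",
              exdir = "./data/0903_DS-EDA-w4_Emissions")
# Read both tables and convert them to data.table
PM25 <- data.table(readRDS("./data/0903_DS-EDA-w4_Emissions/summarySCC_PM25.rds"))
SCC <- data.table(readRDS(
        "./data/0903_DS-EDA-w4_Emissions/Source_Classification_Code.rds"))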

Data

The PM2.5 emissions data file summarySCC_PM25.rds contains all of the PM2.5 emissions data for 1999, 2002, 2005, and 2008. For each year, the table contains the number of tons of PM2.5 emitted from a specific type of source over the entire year:

PM25
          fips        SCC Pollutant     Emissions     type year
      1: 09001   10100401  PM25-PRI  15.714000000    POINT 1999
      2: 09001   10100404  PM25-PRI 234.178000000    POINT 1999
      3: 09001   10100501  PM25-PRI   0.128000000    POINT 1999
      4: 09001   10200401  PM25-PRI   2.036000000    POINT 1999
      5: 09001   10200504  PM25-PRI   0.388000000    POINT 1999
     ---                                                       
6497647: 53009 2265003020  PM25-PRI   0.003152410 NON-ROAD 2008
6497648: 41057 2260002006  PM25-PRI   0.046869500 NON-ROAD 2008
6497649: 38015 2270006005  PM25-PRI   1.012890000 NON-ROAD 2008
6497650: 46105 2265004075  PM25-PRI   0.000486488 NON-ROAD 2008
6497651: 53005 2270004076  PM25-PRI   0.001622670 NON-ROAD 2008

There are 6,497,651 observations of 6 variables:

  • fips: a five-digit number (represented as a string) indicating the U.S. county
  • SCC: the name of the source as indicated by a digit string (see the Source Classification Code table)
  • Pollutant: a string indicating the pollutant
  • Emissions: amount of PM2.5 emitted, in tons
  • type: the type of source (point, non-point, on-road, or non-road)
  • year: the year of emissions recorded
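Because fips is stored as a string, leading zeros matter when filtering by county. The sketch below uses a small hypothetical vector (not the real data) to show how base R coerces a number to a string before comparing, so a numeric literal only matches codes without a leading zero:

```r
# Hypothetical fips codes, stored as character like the fips column
fips <- c("06037", "24510", "09001")

# A numeric right-hand side is converted to a string before comparison:
# 24510 becomes "24510" (matches), 6037 becomes "6037" (does not match "06037")
n_baltimore_numeric <- sum(fips == 24510)
n_la_numeric        <- sum(fips == 6037)

# Quoting the code keeps the leading zero and matches as intended
n_la_string <- sum(fips == "06037")

print(c(n_baltimore_numeric, n_la_numeric, n_la_string))
```

Quoting the code in every filter is the safer habit, since it behaves the same for all counties.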

The Source Classification Code file Source_Classification_Code.rds provides a mapping from the SCC digit strings in the emissions table to the actual name of the PM2.5 source. It contains the following variables:

names(SCC)
 [1] "SCC"                 "Data.Category"       "Short.Name"         
 [4] "EI.Sector"           "Option.Group"        "Option.Set"         
 [7] "SCC.Level.One"       "SCC.Level.Two"       "SCC.Level.Three"    
[10] "SCC.Level.Four"      "Map.To"              "Last.Inventory.Year"
[13] "Created_Date"        "Revised_Date"        "Usage.Notes"        

The sources are categorized in several ways, from more general to more specific, so one may explore whichever categories are most useful. For example, source 10100101 is known as Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal:

head(SCC)[,1:3]
        SCC Data.Category
1: 10100101         Point
2: 10100102         Point
3: 10100201         Point
4: 10100202         Point
5: 10100203         Point
6: 10100204         Point
                                                                   Short.Name
1:                   Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal
2: Ext Comb /Electric Gen /Anthracite Coal /Traveling Grate (Overfeed) Stoker
3:       Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Wet Bottom
4:       Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Dry Bottom
5:                   Ext Comb /Electric Gen /Bituminous Coal /Cyclone Furnace
6:                   Ext Comb /Electric Gen /Bituminous Coal /Spreader Stoker
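The mapping can be applied by joining the two tables on the SCC column. This is a minimal sketch on toy tables: the emission values are hypothetical, and SCC is assumed to be a character column in both tables (it may be a factor in the real SCC table):

```r
library(data.table)

# Toy emissions rows (hypothetical values) and a two-row excerpt of the mapping
pm <- data.table(SCC = c("10100101", "10100101", "10100201"),
                 Emissions = c(1, 2, 4),
                 year = c(1999L, 2008L, 1999L))
scc <- data.table(
  SCC = c("10100101", "10100201"),
  Short.Name = c("Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal",
                 "Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Wet Bottom"))

# Left join: keep every emissions row and attach the readable source name
named <- merge(pm, scc, by = "SCC", all.x = TRUE)

# Total emissions per named source
by_source <- named[, .(total = sum(Emissions)), by = Short.Name]
print(by_source)
```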

1. Total annual emissions in the US

  • Group data by year, then calculate total emissions for each year
  • Make a plot showing the total PM2.5 emission in the US from all sources by year
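
# Sum emissions per year and convert tons to millions of tons
totemUSpm <- PM25 %>% group_by(year) %>%
        summarise(emission = sum(Emissions)/10^6)
# Line plot of the yearly totals with the data points overlaid
plot(totemUSpm, type= "l", lwd=3, col="cornflowerblue",
        main = "The US annual emissions from PM2.5, mln tons")
points(totemUSpm, pch=20, lwd=5, col="brown")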


  • Total PM2.5 emissions in the United States DECREASED from 1999 to 2008
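The same group-and-sum step can also be written in data.table syntax. The sketch below runs on a toy table with hypothetical values, not on the real emissions data:

```r
library(data.table)

# Toy emissions table (hypothetical values, in tons)
toy <- data.table(year = c(1999L, 1999L, 2002L, 2008L, 2008L),
                  Emissions = c(2e6, 1e6, 2.5e6, 1e6, 5e5))

# Sum emissions within each year, convert to millions of tons, sort by year
tot <- toy[, .(emission = sum(Emissions) / 1e6), by = year][order(year)]
print(tot)
```

On a table with millions of rows, grouping inside data.table avoids converting the data to another class first, though either approach gives the same totals.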

2. Total annual emissions in Baltimore

  • Filter data, then group by year, and calculate the total annual emissions for Baltimore City, Maryland (fips == 24510)
  • Make a plot showing the total PM2.5 emission in Baltimore by year
    • use the base plotting system

totemBalpm <- PM25 %>% filter(fips == "24510") %>%
        group_by(year) %>%
        summarise(Baltimore.total.emission = sum(Emissions))
bp <- barplot(Baltimore.total.emission ~ year, totemBalpm,
              col = "cornflowerblue",
              main ="Baltimore total annual emissions from PM2.5, tons")
points(bp, totemBalpm$Baltimore.total.emission, pch=20,
       col = "brown", lwd=4)
lines(bp, totemBalpm$Baltimore.total.emission, col = "brown", lwd=3)

  • Total PM2.5 emissions in Baltimore City, Maryland were LOWER in 2008 than in 1999, 2002, and 2005
  • Emissions in 2002 and 2005 were also LOWER than in 1999
  • BUT emissions INCREASED from 2002 to 2005
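Statements like these can be checked numerically instead of being read off the plot. A minimal sketch, using hypothetical yearly totals that only mimic the shape described above:

```r
# Hypothetical yearly totals (tons); the names are the inventory years
totals <- c("1999" = 100, "2002" = 60, "2005" = 80, "2008" = 40)

# Change between consecutive inventory years (negative means a decrease)
step_change <- diff(totals)

# Net change over the whole period
overall_change <- totals[["2008"]] - totals[["1999"]]

print(step_change)
print(overall_change)
```

With the real summary table, the same two lines could be applied to totemBalpm$Baltimore.total.emission.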

3. Changes in total emissions for Baltimore by types of sources

  • Filter, group by type & year, and calculate total Baltimore emissions per group
    • types: point, nonpoint, onroad, nonroad
  • Make a plot showing the total PM2.5 emission in Baltimore by type & year
    • use the ggplot2 and plotly packages

totemBalType <- PM25 %>% filter(fips == "24510") %>%
        group_by(type, year) %>%
        summarise(Baltimore.total.emission = sum(Emissions))
gg<- ggplot(totemBalType, aes(x=year, y=Baltimore.total.emission,
                              color=type))+
        geom_point(size=1.2, alpha=0.7)+
        geom_line()+
        facet_grid(~type)+
        theme(axis.text.x = element_text(
                angle = 25, vjust = 1, hjust = 0),
              axis.text.y = element_text(
                angle = 35, vjust = 1, hjust = 0),
              legend.position = "none",
              plot.title = element_text(hjust = 0.5))+
        labs(x="", y="emissions",
             title="Baltimore total emissions by type and year, tons (interactive)")
ggplotly(gg)

  • All four source types (point, nonpoint, onroad, nonroad) saw emissions decrease at some point during 1999–2008 in Baltimore City
  • The nonroad and point source types also saw emissions increase at some point during the same period
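To tell a net change over the whole period apart from ups and downs along the way, both can be computed per source type. The sketch below uses a toy table with hypothetical totals; the column name total is an illustrative stand-in for Baltimore.total.emission:

```r
library(data.table)

# Toy per-type yearly totals (hypothetical values, tons)
toy <- data.table(type  = rep(c("POINT", "ON-ROAD"), each = 4),
                  year  = rep(c(1999L, 2002L, 2005L, 2008L), times = 2),
                  total = c(10, 20, 40, 12, 30, 15, 14, 9))
setorder(toy, type, year)

# Per type: last year minus first year, plus flags for any rise or fall
# between consecutive inventory years
trend <- toy[, .(net_change   = total[.N] - total[1L],
                 any_increase = any(diff(total) > 0),
                 any_decrease = any(diff(total) < 0)),
             by = type]
print(trend)
```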

5. Emissions from motor vehicle sources in Baltimore

  • Find SCC-codes of motor vehicle sources
  • Filter by vehicle sources, group by year, and calculate average emissions per group
  • Make a plot showing emissions from motor vehicle sources in Baltimore by year

# %like% matches a regular expression, so glob-style * wildcards are not needed
vehicle <- SCC %>% filter(Short.Name %like% "[Vv]ehicle") %>% transmute(SCC)
avgBalVeh <- PM25 %>% filter((SCC %in% vehicle$SCC) & (fips == "24510")) %>%
        group_by(year) %>%
        summarise(avg.emission.vehicle = mean(Emissions))
bp <- barplot(avg.emission.vehicle ~ year, avgBalVeh, col = "cornflowerblue",
              main = "Baltimore avg annual emissions from motor vehicle sources, tons")
points(bp, avgBalVeh$avg.emission.vehicle, pch=20, col = "brown", lwd=4)
lines(bp, avgBalVeh$avg.emission.vehicle, col = "brown", lwd=3)

  • Emissions from motor vehicle sources in Baltimore City have DECREASED from 1999 to 2008
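Selecting vehicle sources relies on a regular-expression match against Short.Name. The sketch below shows what the pattern does and does not match; apart from the coal entry quoted earlier, the names are hypothetical examples rather than rows from the SCC table:

```r
# Mostly hypothetical source names used only to exercise the pattern
short_names <- c("Highway Vehicles - Gasoline",
                 "Motor vehicle fires",
                 "Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal",
                 "Off-highway VEHICLE")

# Matches "Vehicle" or "vehicle" anywhere in the string
is_vehicle <- grepl("[Vv]ehicle", short_names)

# A case-insensitive match also catches all-caps spellings
is_vehicle_ci <- grepl("vehicle", short_names, ignore.case = TRUE)

print(data.frame(short_names, is_vehicle, is_vehicle_ci))
```

Which variant is appropriate depends on how the names are spelled in the table, so it is worth inspecting the matched rows before aggregating.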

6. Greater changes in emissions from vehicle sources: Baltimore or Los Angeles

  • Find SCC-codes of motor vehicle sources
  • Filter by vehicle sources and locations: Los Angeles County, California (fips == 06037) and Baltimore City, Maryland (fips == 24510)
    • then group by City & year, and calculate average emissions per group
  • Rename city codes to city names
  • Make a plot comparing emissions from motor vehicle sources in Baltimore and Los Angeles, by year
    • use the lattice package

vehicle <- SCC %>% filter(Short.Name %like% "[Vv]ehicle") %>% transmute(SCC)
avgVeh <- PM25 %>%
        filter((SCC %in% vehicle$SCC) &((fips == "24510")|(fips=="06037"))) %>%
        transmute(City = as.factor(fips), Emissions, year = as.factor(year)) %>%
        group_by(City, year) %>%
        summarise(avg.emission.vehicle = mean(Emissions))
levels(avgVeh$City) <- c("LosAngeles", "Baltimore")
barchart(avg.emission.vehicle ~ year| City, data = avgVeh, groups= City,
         auto.key = TRUE, horizontal = FALSE,
         main = "Emissions from motor vehicle sources in Los Angeles and Baltimore")

  • Los Angeles has seen GREATER changes over time in motor vehicle emissions
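The size of a change can be measured in absolute terms (tons) or relative to the starting level. A minimal sketch on hypothetical averages, with avg standing in for avg.emission.vehicle, computes both per city:

```r
library(data.table)

# Hypothetical average emissions for the first and last inventory year
toy <- data.table(City = rep(c("LosAngeles", "Baltimore"), each = 2),
                  year = rep(c(1999L, 2008L), times = 2),
                  avg  = c(50, 40, 4, 1))
setorder(toy, City, year)

# Absolute change (last minus first) and change relative to the first year
change <- toy[, .(abs_change = avg[.N] - avg[1L],
                  rel_change = (avg[.N] - avg[1L]) / avg[1L]),
              by = City]
print(change)
```

In this toy example the two measures rank the cities differently, so it helps to state which measure a comparison is based on.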