This is a project for the Exploratory Data Analysis course, which is part of Coursera’s Data Science and Data Science: Foundations Using R Specializations.
The project explores the National Emissions Inventory database to see what it says about fine particulate matter (PM2.5) pollution in the United States over the 10-year period 1999–2008.
Load packages & data
library(data.table); library(ggplot2); library(plotly); library(lattice)
library(dplyr)  # supplies %>%, filter(), group_by(), summarise() and transmute() used below
# Create the target folder first; download.file() fails if it does not exist
dir.create("./data/0903_DS-EDA-w4_Emissions", recursive = TRUE, showWarnings = FALSE)
# Download the archive only if it is not already on disk
if(!file.exists("./data/0903_DS-EDA-w4_Emissions/data.zip"))
{download.file(
"https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip",
"./data/0903_DS-EDA-w4_Emissions/data.zip", method = "curl")}
unzip("./data/0903_DS-EDA-w4_Emissions/data.zip",
exdir = "./data/0903_DS-EDA-w4_Emissions")
# Read both tables and convert them to data.table
PM25 <- data.table(readRDS("./data/0903_DS-EDA-w4_Emissions/summarySCC_PM25.rds"))
SCC <- data.table(readRDS(
"./data/0903_DS-EDA-w4_Emissions/Source_Classification_Code.rds"))
The PM2.5 Emissions Data file, summarySCC_PM25.rds, contains all of the PM2.5 emissions data for 1999, 2002, 2005, and 2008. For each year, the table contains the number of tons of PM2.5 emitted from a specific type of source for the entire year:
PM25
fips SCC Pollutant Emissions type year
1: 09001 10100401 PM25-PRI 15.714000000 POINT 1999
2: 09001 10100404 PM25-PRI 234.178000000 POINT 1999
3: 09001 10100501 PM25-PRI 0.128000000 POINT 1999
4: 09001 10200401 PM25-PRI 2.036000000 POINT 1999
5: 09001 10200504 PM25-PRI 0.388000000 POINT 1999
---
6497647: 53009 2265003020 PM25-PRI 0.003152410 NON-ROAD 2008
6497648: 41057 2260002006 PM25-PRI 0.046869500 NON-ROAD 2008
6497649: 38015 2270006005 PM25-PRI 1.012890000 NON-ROAD 2008
6497650: 46105 2265004075 PM25-PRI 0.000486488 NON-ROAD 2008
6497651: 53005 2270004076 PM25-PRI 0.001622670 NON-ROAD 2008
There are 6,497,651 observations of 6 variables:
fips: a five-digit number (represented as a string) indicating the U.S. county
SCC: the name of the source as indicated by a digit string (see source code classification table)
Pollutant: a string indicating the pollutant
Emissions: amount of PM2.5 emitted, in tons
type: the type of source (point, non-point, on-road, or non-road)
year: the year of emissions recorded
The Source Classification Code file, Source_Classification_Code.rds, provides a mapping from the SCC digit strings in the Emissions table to the actual name of the PM2.5 source. It has the following variables:
names(SCC)
[1] "SCC" "Data.Category" "Short.Name"
[4] "EI.Sector" "Option.Group" "Option.Set"
[7] "SCC.Level.One" "SCC.Level.Two" "SCC.Level.Three"
[10] "SCC.Level.Four" "Map.To" "Last.Inventory.Year"
[13] "Created_Date" "Revised_Date" "Usage.Notes"
The sources are categorized in a few different ways, from more general to more specific, and one may choose to explore whichever categories are most useful. For example, source 10100101 is known as Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal:
head(SCC)[,1:3]
SCC Data.Category
1: 10100101 Point
2: 10100102 Point
3: 10100201 Point
4: 10100202 Point
5: 10100203 Point
6: 10100204 Point
Short.Name
1: Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal
2: Ext Comb /Electric Gen /Anthracite Coal /Traveling Grate (Overfeed) Stoker
3: Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Wet Bottom
4: Ext Comb /Electric Gen /Bituminous Coal /Pulverized Coal: Dry Bottom
5: Ext Comb /Electric Gen /Bituminous Coal /Cyclone Furnace
6: Ext Comb /Electric Gen /Bituminous Coal /Spreader Stoker
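To illustrate the mapping described above, here is a minimal sketch in base R of attaching source names to emission records by SCC code. The two small tables are made-up stand-ins for PM25 and SCC (the emission values are illustrative, not taken from the data), and the names emis, codes, and named are hypothetical.

```r
# Toy stand-ins for the PM25 and SCC tables (emission values are illustrative only)
emis <- data.frame(
  SCC       = c("10100101", "10100102", "10100101"),
  Emissions = c(1.5, 2.0, 0.5),
  year      = c(1999, 1999, 2008),
  stringsAsFactors = FALSE
)
codes <- data.frame(
  SCC        = c("10100101", "10100102"),
  Short.Name = c(
    "Ext Comb /Electric Gen /Anthracite Coal /Pulverized Coal",
    "Ext Comb /Electric Gen /Anthracite Coal /Traveling Grate (Overfeed) Stoker"
  ),
  stringsAsFactors = FALSE
)

# Left join: keep every emission record and add the matching source name
named <- merge(emis, codes, by = "SCC", all.x = TRUE)

# Look up the name of a single code directly
one_name <- codes$Short.Name[codes$SCC == "10100101"]
```

The same pattern should carry over to the full tables, though the SCC columns may need to be brought to the same type first.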
# Total US emissions per year, converted to millions of tons
totemUSpm <- PM25 %>% group_by(year) %>%
summarise(emission = sum(Emissions)/10^6)
plot(totemUSpm, type= "l", lwd=3, col="cornflowerblue",
main = "The US annual emissions from PM2.5, mln tons")
points(totemUSpm, pch=20, lwd=5, col="brown")
Baltimore City, Maryland (fips == 24510) total year emissions
totemBalpm <- PM25 %>% filter(fips == "24510") %>%
group_by(year) %>%
summarise(Baltimore.total.emission = sum(Emissions))
bp <- barplot(Baltimore.total.emission ~ year, totemBalpm,
col = "cornflowerblue",
main ="Baltimore total annual emissions from PM2.5, tons")
points(bp, totemBalpm$Baltimore.total.emission, pch=20,
col = "brown", lwd=4)
lines(bp, totemBalpm$Baltimore.total.emission, col = "brown", lwd=3)
point, nonpoint, onroad, nonroad
ggplot2/plotly packages
totemBalType <- PM25 %>% filter(fips == "24510") %>%
group_by(type, year) %>%
summarise(Baltimore.total.emission = sum(Emissions))
gg<- ggplot(totemBalType, aes(x=year, y=Baltimore.total.emission,
color=type))+
geom_point(size=1.2, alpha=0.7)+
geom_line()+
facet_grid(~type)+
theme(axis.text.x = element_text(
angle = 25, vjust = 1, hjust = 0),
axis.text.y = element_text(
angle = 35, vjust = 1, hjust = 0),
legend.position = "none",
plot.title = element_text(hjust = 0.5))+
labs(x="", y="emissions",
title="Baltimore total emissions by type and year, tons (interactive)")
ggplotly(gg)
(point, nonpoint, onroad, nonroad) have seen decreases in emissions from 1999–2008 for Baltimore City
nonroad and point have seen increases in emissions from 1999–2008 for Baltimore City
# %like% takes a regular expression, so no glob-style asterisks are needed
vehicle <- SCC %>% filter(Short.Name %like% "[Vv]ehicle") %>% transmute(SCC)
avgBalVeh <- PM25 %>% filter((SCC %in% vehicle$SCC) &(fips == "24510")) %>%
group_by(year) %>%
summarise(avg.emission.vehicle = mean(Emissions))
bp <- barplot(avg.emission.vehicle ~ year, avgBalVeh, col = "cornflowerblue",
main = "Baltimore avg annual emissions from motor vehicle sources, tons")
points(bp, avgBalVeh$avg.emission.vehicle, pch=20, col = "brown", lwd=4)
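lines(bp, avgBalVeh$avg.emission.vehicle, col = "brown", lwd=3)
Los Angeles County, California (fips == 06037) / Baltimore City, Maryland (fips == 24510)
lattice package
# %like% takes a regular expression, so no glob-style asterisks are needed
vehicle <- SCC %>% filter(Short.Name %like% "[Vv]ehicle") %>% transmute(SCC)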
avgVeh <- PM25 %>%
filter((SCC %in% vehicle$SCC) &((fips == "24510")|(fips=="06037"))) %>%
transmute(City = as.factor(fips), Emissions, year = as.factor(year)) %>%
group_by(City, year) %>%
summarise(avg.emission.vehicle = mean(Emissions))
# Factor levels sort as "06037", "24510", so the labels are given in that order
levels(avgVeh$City) <- c("LosAngeles", "Baltimore")
barchart(avg.emission.vehicle ~ year| City, data = avgVeh, groups= City,
auto.key = TRUE, horizontal = FALSE,
main = "Emissions from motor vehicle sources in Los Angeles and Baltimore")
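The two-city filter above quotes the county codes because fips is stored as a string. A minimal sketch of why the quoting matters, using a hypothetical vector fips_codes: an unquoted code with a leading zero is parsed as a number, loses the zero, and no longer matches.

```r
# Hypothetical vector of county codes stored as strings, as in PM25$fips
fips_codes <- c("06037", "24510", "09001")

# Comparing with a number coerces the number to a string first
baltimore_numeric <- fips_codes == 24510   # 24510 becomes "24510": second element matches
la_numeric        <- fips_codes == 06037   # 06037 is the number 6037, i.e. "6037": no match

# Quoting the codes avoids the problem
la_or_baltimore <- fips_codes %in% c("06037", "24510")

# If a code is only available as a number, pad it back to five digits
padded <- sprintf("%05d", 6037L)
```

Quoting every code, including those without a leading zero, keeps the filters consistent and avoids relying on implicit coercion.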