This is my final project for Coursera’s Exploratory Data Analysis course (using R). The goal of this assignment is to analyse PM2.5 emissions in different locations of the United States. PM2.5 (fine particulate matter) is an ambient air pollutant for which there is strong evidence that it is harmful to human health.
In order to study this, we are provided with two main datasets from the EPA National Emissions Inventory website, with PM2.5 emissions data for 1999, 2002, 2005 and 2008.
The datasets we are going to use are contained in a single zip file.
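The zip file contains two files: summarySCC_PM25.rds, with the PM2.5 emissions records, and Source_Classification_Code.rds, which maps each source classification code (SCC) to a description of the source.
To load and manipulate the data in R, we load the following packages: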
library(tidyverse)
library(readr)
After downloading the data we proceed to load it into R.
NEI <- readRDS("C:/Users/marct/OneDrive - Tecnocampus Mataro-Maresme/Documentos/CURSOS/R PATH COURSERA/Cleaning & Tidying Data in R/Week 4/summarySCC_PM25.rds")
SCC <- readRDS("C:/Users/marct/OneDrive - Tecnocampus Mataro-Maresme/Documentos/CURSOS/R PATH COURSERA/Cleaning & Tidying Data in R/Week 4/Source_Classification_Code.rds")
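The absolute paths above only work on the author's machine. A minimal sketch of a more portable pattern follows, assuming the two .rds files sit in a single folder whose location is stored in a variable. The name data_dir is hypothetical, and here it points at a temporary folder holding a tiny made-up stand-in file so that the example runs anywhere.

```r
# Hypothetical folder that holds the unzipped .rds files; adjust to your setup.
data_dir <- tempdir()

# For illustration only: write a one-row stand-in file with made-up values.
toy <- data.frame(fips = "24510", SCC = "10100101",
                  Emissions = 1.5, year = 1999L,
                  stringsAsFactors = FALSE)
saveRDS(toy, file.path(data_dir, "summarySCC_PM25.rds"))

# Build the path from its parts and fail early if the file is missing.
nei_path <- file.path(data_dir, "summarySCC_PM25.rds")
stopifnot(file.exists(nei_path))

NEI_toy <- readRDS(nei_path)
str(NEI_toy)
```

With the real data, only data_dir would need to change from one machine to another.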
1. Have total emissions from PM2.5 decreased in the United States from 1999 to 2008? Using the base plotting system, make a plot showing the average PM2.5 emission from all sources for each of the years 1999, 2002, 2005, and 2008.
Firstly we group the data by year and then obtain the average of PM2.5 levels for each year.
plot1data <- NEI %>% group_by(year) %>% summarise(PM2.5 = mean(Emissions, na.rm=T))
After that we plot the data using the basic plotting system.
with(plot1data, barplot(PM2.5, names.arg=year, col="red"))
title(main="Evolution of PM2.5 levels (1999-2008)")
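As we can see in the chart, the average PM2.5 emission per record decreased markedly between 1999 and 2008, possibly because of new regulations. The 2008 average was approximately one fifth of the 1999 average. Note that the chart shows the yearly mean rather than the yearly sum, so it answers the question about total emissions only indirectly.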
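Since the question is phrased in terms of total emissions, it may be worth comparing the yearly sum with the yearly mean. The sketch below uses a small made-up data frame (toy_nei, a hypothetical stand-in with the same column names as NEI) to show that the two summaries can tell different stories when the number of records changes between years.

```r
library(dplyr)

# Made-up values for illustration; only the column names mirror NEI.
toy_nei <- data.frame(
  year      = c(1999, 1999, 1999, 2008, 2008),
  Emissions = c(10, 20, 30, 25, 35)
)

totals <- toy_nei %>%
  group_by(year) %>%
  summarise(
    total     = sum(Emissions, na.rm = TRUE),   # yearly total
    average   = mean(Emissions, na.rm = TRUE),  # yearly mean per record
    n_records = n()                             # number of records behind each value
  )

print(totals)
```

In this toy case the total is the same in both years while the average rises, simply because 2008 has fewer records. Applying the same summarise call to NEI would show whether the two measures agree for the real data.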
2. Have total emissions from PM2.5 decreased in the Baltimore City from 1999 to 2008? Use the base plotting system to make a plot answering this question.
To answer this question, note that Baltimore City is identified in this dataset by fips == "24510". First we filter the data to keep only the rows where fips == "24510". Then we group by year and calculate the average PM2.5 emissions using the summarise function.
plot2data <- NEI %>% filter(fips=="24510") %>% group_by(year) %>% summarise(PM2.5 = mean(Emissions, na.rm=T))
Finally we plot the results.
with(plot2data, barplot(PM2.5, names.arg=year, col="lightblue"))
title(main="Evolution of PM2.5 levels (1999-2008) in Baltimore")
In the case of Baltimore City, average PM2.5 emissions decreased between 1999 and 2002, increased between 2002 and 2005, and decreased again between 2005 and 2008.
3. Of the four types of sources indicated by the type (point, nonpoint, onroad, nonroad) variable, which of these four sources have seen decreases in emissions from 1999-2008 for Baltimore City? Which have seen increases in emissions from 1999-2008? Use the ggplot2 plotting system to make a plot to answer this question.
Again we filter the data to obtain only the data related to Baltimore City, but this time we group by the year and the type of source. Then we obtain the average levels of PM2.5.
plot3data <- NEI %>% filter(fips=="24510") %>% group_by(year, type) %>% summarise(PM2.5 = mean(Emissions, na.rm=T))
Then we plot the result as a bar chart with the ggplot2 package. We use the parameter stat="identity" so that the height of each bar is the PM2.5 value already computed in plot3data (the default, stat="count", would instead count the rows in each group). The year goes on the x axis and the PM2.5 value on the y axis. The third variable, the type of source, is shown with facet_grid, which draws one panel per type.
ggplot(data=plot3data, aes(x=factor(year), y=PM2.5)) + geom_bar(stat="identity", aes(fill=type)) + facet_grid(.~type) + xlab("Year") + ylab("Tons of PM2.5") + ggtitle("Evolution of PM2.5 levels in Baltimore by type of source")
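As a side note, ggplot2 also provides geom_col(), which is a shorthand for geom_bar(stat = "identity"). The sketch below builds the same faceted chart both ways on a made-up data frame (toy_plot3, hypothetical values) to illustrate that the two calls are interchangeable.

```r
library(ggplot2)

# Made-up summary table for illustration: one row per year and source type.
toy_plot3 <- data.frame(
  year  = rep(c(1999, 2008), times = 2),
  type  = rep(c("POINT", "ON-ROAD"), each = 2),
  PM2.5 = c(8, 3, 5, 2)
)

# Version used in this report: bar heights taken directly from the y column.
p_identity <- ggplot(toy_plot3, aes(x = factor(year), y = PM2.5, fill = type)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ type)

# Equivalent shorthand: geom_col() always uses the y values as bar heights.
p_col <- ggplot(toy_plot3, aes(x = factor(year), y = PM2.5, fill = type)) +
  geom_col() +
  facet_grid(. ~ type)

print(p_col)
```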
When splitting by type of source, we observe a similar trend in all of them. Generally speaking, PM2.5 emissions seem to have decreased for all types of sources in Baltimore City.
4. Across the United States, how have emissions from coal combustion-related sources changed from 1999-2008?
To answer this question, we use the SCC dataframe to find the rows that refer to coal combustion sources. Each row is identified by an ID column, also called SCC. Our goal is to obtain the SCC codes that correspond to coal combustion-related sources.
combustioncoalSCC <- filter(SCC, grepl("comb", SCC.Level.One, ignore.case=TRUE), grepl("coal", SCC.Level.Four, ignore.case=TRUE) )
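To make the two-condition filter concrete, here is a minimal sketch on a made-up classification table (toy_scc, with hypothetical codes and descriptions). A row is kept only when "comb" appears in SCC.Level.One and "coal" appears in SCC.Level.Four, ignoring case in both tests.

```r
# Made-up classification table for illustration.
toy_scc <- data.frame(
  SCC            = c("001", "002", "003"),
  SCC.Level.One  = c("External Combustion Boilers",
                     "Mobile Sources",
                     "Internal Combustion Engines"),
  SCC.Level.Four = c("Pulverized Coal", "Total", "Diesel"),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row; ignore.case makes "Coal" match "coal".
is_comb <- grepl("comb", toy_scc$SCC.Level.One, ignore.case = TRUE)
is_coal <- grepl("coal", toy_scc$SCC.Level.Four, ignore.case = TRUE)

# Both conditions must hold, which is what passing two conditions to filter() does.
coal_codes <- toy_scc$SCC[is_comb & is_coal]
print(coal_codes)
```

Only the first row satisfies both conditions. The third row mentions combustion but not coal, so it is dropped.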
After that we filter the NEI dataframe to keep the rows whose SCC code appears in the SCC column of the combustioncoalSCC dataframe (the dataframe we created to hold the coal-combustion sources). This leaves only the emissions related to coal-combustion sources.
prova <- filter(NEI, SCC %in% combustioncoalSCC$SCC)
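Matching a column against a vector of codes calls for %in% rather than ==, because == compares element by element and recycles the shorter vector, which can silently drop matching rows. A minimal sketch with made-up codes (record_codes and wanted_codes are hypothetical names):

```r
# Made-up codes for illustration.
record_codes <- c("B", "A", "C", "A")   # stands in for NEI$SCC
wanted_codes <- c("A", "B")             # stands in for combustioncoalSCC$SCC

# ==  recycles wanted_codes to A, B, A, B and compares position by position.
by_equals <- record_codes == wanted_codes

# %in% asks, for each record, whether its code appears anywhere in wanted_codes.
by_in <- record_codes %in% wanted_codes

print(by_equals)
print(by_in)
```

Here == finds no matches at all, while %in% correctly flags the three records whose code is "A" or "B".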
Now we create another dataframe where we group the results by year and obtain the average emissions (from coal-combustion related sources) per year.
prova2 <- prova %>% group_by(year) %>% summarise(PM2.5=mean(Emissions, na.rm=TRUE))
Finally we plot the result using the ggplot2 package.
ggplot(data=prova2, aes(x=factor(year), y=PM2.5))+ geom_bar(stat="identity", fill="#FF9999") + xlab("Year") + ylab("Tons of PM2.5") + labs(title = expression("PM"[2.5]*" Coal Combustion Source Emissions Across US from 1999-2008"))
5. How have emissions from motor vehicle sources changed from 1999-2008 in Baltimore City?
To answer this question we use a similar procedure to the last exercise. To find the motor vehicle sources, we filter the SCC dataframe to obtain those records where the word “vehicle” appears, ignoring capital letters (through the ignore.case parameter).
motorvehicles <- filter(SCC,grepl("vehicle",SCC.Level.Two, ignore.case = T))
Then we create a new dataframe as a result of filtering the NEI dataset to obtain the rows where the SCC code is the same as in the SCC code column from the motorvehicles dataset. This operation allows us to obtain the data related to motor vehicle sources.
prova5 <- filter(NEI, SCC %in% motorvehicles$SCC)
data5 <- prova5 %>% filter(fips=="24510") %>% group_by(year) %>% summarise(PM2.5 = mean(Emissions, na.rm=T))
Finally we plot the data.
ggplot(data=data5, aes(x=factor(year), y=PM2.5)) + geom_bar(stat="identity", fill="#FF9889") + xlab("Year") + ylab("Tons of PM2.5") + labs(title = expression("PM"[2.5]*" Motor Vehicle Source Emissions in Baltimore from 1999-2008"))
As we can see, average levels of PM2.5 from motor vehicle sources have decreased substantially since 1999.
6. Compare emissions from motor vehicle sources in Baltimore City with emissions from motor vehicle sources in Los Angeles County, California. Which city has seen greater changes over time in motor vehicle emissions?
To answer this question we use the same dataframe created before (motorvehicles), which contains the ID codes (SCC) for motor vehicle sources. We proceed in the same way and create the dataframe (prova6) where we store the PM2.5 emissions from motor vehicle sources. Los Angeles County is identified by fips == "06037".
motorvehicles <- filter(SCC,grepl("vehicle",SCC.Level.Two, ignore.case = T))
prova6 <- filter(NEI, SCC %in% motorvehicles$SCC)
The second step is to filter the dataframe that we have just created (prova6) and keep only the values for Baltimore City and Los Angeles (we use the fips ID column with the corresponding ID of each location).
We then create a new dataframe (citynames) that stores the name of the city for each fips code, which we will use later to label the plot.
We join this dataframe to the existing one.
data6 <- prova6 %>% filter(fips =="24510" | fips== "06037") %>% group_by(year, fips) %>% summarise(PM2.5 = mean(Emissions, na.rm=T))
citynames <- data.frame(fips = c("06037", "24510"), City = c("Los Angeles", "Baltimore City"), stringsAsFactors = FALSE)
data6 <- left_join(data6, citynames, by = "fips")
Finally we plot the result. We use the facet grid and the colors to represent the data for each city nicely.
ggplot(data=data6, aes(x=factor(year), y=PM2.5)) + geom_bar(stat="identity", aes(fill=City)) + facet_grid(.~City, scales="free", space="free")+xlab("Year") + ylab("Tons of PM2.5") + labs(title = expression("PM"[2.5]*" Motor Vehicle Source Emissions in Baltimore/Los Angeles from 1999-2008"))
In general, emissions are much lower in Baltimore City than in Los Angeles. In Baltimore City PM2.5 emissions have decreased, while in Los Angeles they have increased.
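To say which city changed more, the change between the first and last year can be computed explicitly, both in absolute terms and as a percentage of the 1999 value. The sketch below is one possible way to do it. It uses a made-up table (toy_cities, hypothetical numbers that are not the results of this analysis) with the same columns as data6.

```r
library(dplyr)

# Made-up values for illustration only; not taken from the NEI data.
toy_cities <- data.frame(
  City  = rep(c("Baltimore City", "Los Angeles"), each = 2),
  year  = rep(c(1999, 2008), times = 2),
  PM2.5 = c(4, 1, 50, 60)
)

change <- toy_cities %>%
  group_by(City) %>%
  summarise(
    abs_change = PM2.5[year == 2008] - PM2.5[year == 1999],  # difference in tons
    pct_change = 100 * abs_change / PM2.5[year == 1999]      # relative to 1999
  )

print(change)
```

With these toy numbers, Los Angeles shows the larger absolute change and Baltimore City the larger relative change, so the answer depends on which measure is chosen. Running the same summary on data6 would give the figures for the real data.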