The data for this assignment comes from the U.S. Environmental Protection Agency (EPA) which sets national air quality standards for outdoor air pollution in the U.S. One standard set by the EPA states that the average mean, averaged over 3 years, of fine particle pollution (referred to as PM2.5) cannot exceed 12 micrograms per cubic meter.
Key Question:
The dataset contains the annual mean for PM2.5 averaged over the period 2008 throught 2010.
Read in the data with read.csv():
class <- c("numeric", "character", "factor", "numeric", "numeric")
pollution <- read.csv("data/avgpm25.csv", colClasses = class)
Show the first few rows of the data frame:
head(pollution)
## pm25 fips region longitude latitude
## 1 9.771185 01003 east -87.74826 30.59278
## 2 9.993817 01027 east -85.84286 33.26581
## 3 10.688618 01033 east -87.72596 34.73148
## 4 11.337424 01049 east -85.79892 34.45913
## 5 12.119764 01055 east -86.03212 34.01860
## 6 10.827805 01069 east -85.35039 31.18973
Show more info on the dataset using str() function:
str(pollution)
## 'data.frame': 576 obs. of 5 variables:
## $ pm25 : num 9.77 9.99 10.69 11.34 12.12 ...
## $ fips : chr "01003" "01027" "01033" "01049" ...
## $ region : Factor w/ 2 levels "east","west": 1 1 1 1 1 1 1 1 1 1 ...
## $ longitude: num -87.7 -85.8 -87.7 -85.8 -86 ...
## $ latitude : num 30.6 33.3 34.7 34.5 34 ...
Compute the five-number summary of the PM2.5 data in the pollution dataset:
fivenum(pollution$pm25)
## [1] 3.382626 8.547590 10.046697 11.356829 18.440731
Use summary() function to make it a bit easier/ nicer to read
summary(pollution$pm25)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.383 8.549 10.047 9.836 11.356 18.441
In a boxplot, the “whiskers” sticking out above and below the box have a length of 1.5 times the IQR (the distance from the bottom of the box to the top). Anything beyond the whiskers is marked as an “outlier”
boxplot(pollution$pm25, col = "blue")
Note: The current national ambient air quality standad is 12 mg per cubic meter but it used to be 15.
Filter to show data points pm25 > 15:
filter(pollution, pm25 > 15)
## pm25 fips region longitude latitude
## 1 16.19452 06019 west -119.9035 36.63837
## 2 15.80378 06029 west -118.6833 35.29602
## 3 18.44073 06031 west -119.8113 36.15514
## 4 16.66180 06037 west -118.2342 34.08851
## 5 15.01573 06047 west -120.6741 37.24578
## 6 17.42905 06065 west -116.8036 33.78331
## 7 16.25190 06099 west -120.9588 37.61380
## 8 16.18358 06107 west -119.1661 36.23465
Use a Map of the counties to see where the counties lie in California:
map("county", "california")
with(filter(pollution, pm25 > 15), points(longitude, latitude))
A histogram is useful when we need more detail on the full distribution of the data.
Histogram of the PM2.5 annual average data:
hist(pollution$pm25, col = "green")
Use rug() function to get more detail (shows actual data points)
hist(pollution$pm25, col = "green")
rug(pollution$pm25)
Override default number of bars by setting breaks argument (more bars = more details):
hist(pollution$pm25, col = "green", breaks=100)
rug(pollution$pm25)
Note: default number of bars in hist() is set based on the density of the data
Once we start seeing interesting features, it can be useful to lay down annotations on plots as reference points.
Ex: On the boxplot, add a horizontal line at 12, where the national standard is:
boxplot(pollution$pm25, col = "blue")
abline(h=12)
Add a median line and a line at 12 (US standards) to the histogram:
Use color (col) and line width (lwd) to indicate different components
hist(pollution$pm25, col = "green")
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)
Useful for summarizing categorical data. Here, we have 1 categorical variable (region) and we use table() to do the actual tabulation of how many counties are in each region.
table(pollution$region) %>% barplot(col = "wheat")
One of the simplest ways to show relationships between 2 variables is to show side-by-side boxplots. The boxplot() function can take a fomula with the left side indicating the variable for which we want to create the boxplot (continuous) and the right side indicatiing the variable that stratifies the left side into categories.
Side-by-side boxplots give a rich sense of any trends or changes in a variable, and many can be fit on 1 page. Show the difference in PM2.5 levels between east and west regions with boxplots:
boxplot(pm25 ~ region, data = pollution, col = "red")
Distribution of PM2.5 in the eastern and western regions of the US:
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")
For continuous variables, the most common visualization technique is the scatterplot, which simply maps each variable to an x- or y-axis coordinate.
Scatterplot of latitude and PM2.5:
with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)
Add a 3rd dimension to scatterplot by using color to highlight that dimension.
Use color to highlight region in a scatterplot
with(pollution, plot(latitude, pm25, col=region))
abline(h = 12, lwd = 2, lty = 2)
levels(pollution$region)
## [1] "east" "west"
Using multiple scatterplots can be necessary when overlaying points with different colors or shapes is confusing (sometimes because of the volume of data). Separating the plots out can sometimes make visualization easier.
Make side-by-side scatterplots of of PM25 by Region
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))
These plots, sometimes called panel plots, are easier with lattice or ggplot2
## Lattice
library(lattice)
xyplot(pm25 ~ latitude | region, data = pollution)
## ggplot2
library(ggplot2)
qplot(latitude, pm25, data = pollution, facets = . ~ region)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.