Week 3 Coding Practice: Data Visualization

6.2 The Data: Air Pollution in the U.S.

The data for this assignment comes from the U.S. Environmental Protection Agency (EPA) which sets national air quality standards for outdoor air pollution in the U.S. One standard set by the EPA states that the average mean, averaged over 3 years, of fine particle pollution (referred to as PM2.5) cannot exceed 12 micrograms per cubic meter.

Key Question:

  • Are there any counties in the U.S. that exceeds the national standard for fine particle pollution?

6.3 Getting the Data

The dataset contains the annual mean for PM2.5 averaged over the period 2008 throught 2010.

Read in the data with read.csv():

class <- c("numeric", "character", "factor", "numeric", "numeric")
pollution <- read.csv("data/avgpm25.csv", colClasses = class)

Show the first few rows of the data frame:

head(pollution)
##        pm25  fips region longitude latitude
## 1  9.771185 01003   east -87.74826 30.59278
## 2  9.993817 01027   east -85.84286 33.26581
## 3 10.688618 01033   east -87.72596 34.73148
## 4 11.337424 01049   east -85.79892 34.45913
## 5 12.119764 01055   east -86.03212 34.01860
## 6 10.827805 01069   east -85.35039 31.18973
  • pm25: average PM2.5 level
  • ’fips: the 5-digit code indicating county
  • ‘region’: the region of the country where the county resides
  • ’longitude and latitude: represent the centroid for the county

Show more info on the dataset using str() function:

str(pollution)
## 'data.frame':    576 obs. of  5 variables:
##  $ pm25     : num  9.77 9.99 10.69 11.34 12.12 ...
##  $ fips     : chr  "01003" "01027" "01033" "01049" ...
##  $ region   : Factor w/ 2 levels "east","west": 1 1 1 1 1 1 1 1 1 1 ...
##  $ longitude: num  -87.7 -85.8 -87.7 -85.8 -86 ...
##  $ latitude : num  30.6 33.3 34.7 34.5 34 ...

6.4 Simple Summaries: One Dimension

  • Five-number Summary:
    • Gives the minimum, 25th percentile, median, 75th percentile, and maximum of the data.
    • A quick check on the distribution of the data.
    • Function: fivenum()
  • Boxplots:
    • A visual representation of the five-number summary + some extra info including outliers beyond the bulk of the data.
    • Function: boxplot()
  • Barplot:
    • Useful for visualizing categorical data, with the number of entries for each category being proportional to the height of the bar.
    • Function: barplot()
  • Histograms:
    • Show the complete empirical distribution of the data, beyond the five data points shown in the boxplots.
    • Use to check skewness, symmetry, multi-modality, and other features of the data.
    • Function: hist()
    • Handy function with hist() is rug().
  • Density Plot:
    • Computes the non-parametric estimate of the distribution of variables. Function: density()

6.5 Five Number Summary

Compute the five-number summary of the PM2.5 data in the pollution dataset:

fivenum(pollution$pm25)
## [1]  3.382626  8.547590 10.046697 11.356829 18.440731

Use summary() function to make it a bit easier/ nicer to read

summary(pollution$pm25)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.383   8.549  10.047   9.836  11.356  18.441
  • Note: summmary() adds the mean, which can be compared to the median to check skewness

6.6 Boxplot

In a boxplot, the “whiskers” sticking out above and below the box have a length of 1.5 times the IQR (the distance from the bottom of the box to the top). Anything beyond the whiskers is marked as an “outlier”

boxplot(pollution$pm25, col = "blue")

  • There are few points on both the high and low end that appera to be outliers according to the boxplot() algorithm. These may be worth investigating individually.
  • It appears that the high points are all above the level of 15, so we can look at these data points directly.

Note: The current national ambient air quality standad is 12 mg per cubic meter but it used to be 15.

Filter to show data points pm25 > 15:

filter(pollution, pm25 > 15)
##       pm25  fips region longitude latitude
## 1 16.19452 06019   west -119.9035 36.63837
## 2 15.80378 06029   west -118.6833 35.29602
## 3 18.44073 06031   west -119.8113 36.15514
## 4 16.66180 06037   west -118.2342 34.08851
## 5 15.01573 06047   west -120.6741 37.24578
## 6 17.42905 06065   west -116.8036 33.78331
## 7 16.25190 06099   west -120.9588 37.61380
## 8 16.18358 06107   west -119.1661 36.23465
  • All counties with pm25 levels over 15 are in western U.S. (region = west)
  • All counties are actually all in California (first 2 digits of fips = 06)

Use a Map of the counties to see where the counties lie in California:

map("county", "california")
with(filter(pollution, pm25 > 15), points(longitude, latitude))

6.7 Histogram

A histogram is useful when we need more detail on the full distribution of the data.

Histogram of the PM2.5 annual average data:

hist(pollution$pm25, col = "green")

  • There appears to be a higher concentration of counties in the 9-12 micrograms per cubic meter.

Use rug() function to get more detail (shows actual data points)

hist(pollution$pm25, col = "green")
rug(pollution$pm25)

Override default number of bars by setting breaks argument (more bars = more details):

hist(pollution$pm25, col = "green", breaks=100)
rug(pollution$pm25)

Note: default number of bars in hist() is set based on the density of the data

  • There is a rather large spike at 8 micrograms per cubic meter which could be worth investigating.

6.8 Overlaying Features

Once we start seeing interesting features, it can be useful to lay down annotations on plots as reference points.

Ex: On the boxplot, add a horizontal line at 12, where the national standard is:

boxplot(pollution$pm25, col = "blue")
abline(h=12)

  • A reasonable portion of the distribution is above line (i.e. potentially in volation of the standard).

Add a median line and a line at 12 (US standards) to the histogram:

Use color (col) and line width (lwd) to indicate different components

hist(pollution$pm25, col = "green")
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)

6.9 Barplot

Useful for summarizing categorical data. Here, we have 1 categorical variable (region) and we use table() to do the actual tabulation of how many counties are in each region.

table(pollution$region) %>% barplot(col = "wheat")

  • There are many more counties in the eastern U.S. than in the west in this dataset.

Simple Summaries: Two Dimensions and Beyond

Some tools for investigating data in 2 dimensions:

  • Multiple or Overlayed 1-D plots:
    • Using multiple boxplots or histograms to see relationships between 2 variables, especially when one is naturally categorical.
  • Scatterplots:
    • Natural tool for visualizing 2 continuous variables.
    • Transformations of the variables (e.g. log or square-root transformation) may be necessary for effective visualizaion)
  • Smooth Scatterplots: *Similar to scatterplots but plots a 2-D histograms of the data.
    • Can be useful for scatterplots that contain many many data points.

Visualizing data in more than 2 dimesnions, without resorting to 3-D animations:

  • Overlayed or multiple 2-D plots; conditioning plots (coplots):
    • A coplot shows the relationship between 2 variables as a 3rd (or more) variable changes.
    • Ex: See the relationship between air pollution levels and mortality changes with the seasons of the year. The air pollution and mortality = the 2 primary variables and season is the 3rd varying, in the background
  • Use color, size, shape to add dimensions
  • Spinning/ Interactive Plots
  • Actual 3-D plots

6.11 Multiple Boxplots

One of the simplest ways to show relationships between 2 variables is to show side-by-side boxplots. The boxplot() function can take a fomula with the left side indicating the variable for which we want to create the boxplot (continuous) and the right side indicatiing the variable that stratifies the left side into categories.

Side-by-side boxplots give a rich sense of any trends or changes in a variable, and many can be fit on 1 page. Show the difference in PM2.5 levels between east and west regions with boxplots:

boxplot(pm25 ~ region, data = pollution, col = "red")

  • Clearly, the levels in the east are on average higher than averages in the west.

6.12 Multiple Histograms

Distribution of PM2.5 in the eastern and western regions of the US:

par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")

  • PM2.5 in the eastern U.S. tends to be left-skewed with some counties having very low levels.
  • PM2.5 in the western U.S. tends to be right-skewed with some outlying counties with very high levels.

6.13 Scatterplots using plot() function

For continuous variables, the most common visualization technique is the scatterplot, which simply maps each variable to an x- or y-axis coordinate.

Scatterplot of latitude and PM2.5:

with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)

  • As you go from south to north in the U.S., we can see the highest levels of PM2.5 tend to be in the middle of the country.

6.14 Scatterplots with Color

Add a 3rd dimension to scatterplot by using color to highlight that dimension.

Use color to highlight region in a scatterplot

with(pollution, plot(latitude, pm25, col=region))
abline(h = 12, lwd = 2, lty = 2)

  • Here, east is black and west is red but this is not easily identifiable.
  • Figure out which color gets mapped to which region by looking at the levels of the region variable.
levels(pollution$region)
## [1] "east" "west"
  • First level = “east” –> Color for east gets mapped to 1
  • 2nd level = “west” –> Color for west gets mapped to 2
  • For plotting functions:
    • col=1 is black (default)
    • col=2 is red

6.15 Multiple Scatterplots

Using multiple scatterplots can be necessary when overlaying points with different colors or shapes is confusing (sometimes because of the volume of data). Separating the plots out can sometimes make visualization easier.

Make side-by-side scatterplots of of PM25 by Region

par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))

These plots, sometimes called panel plots, are easier with lattice or ggplot2

## Lattice
library(lattice)
xyplot(pm25 ~ latitude | region, data = pollution)

## ggplot2
library(ggplot2)
qplot(latitude, pm25, data = pollution, facets = . ~ region)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

6.16 Summary

  • Exploratory plots are “quick and dirty”
  • Summarize data and highlight broad features to answer basic questions about the data and judge the evidence for or against hypotheses.
  • Useful to suggest modeling strategies to employ in the “next step” of data analysis.