# Load dplyr
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Getting the Data

# Define the class of each column (named col_classes to avoid masking base::class)
col_classes <- c("numeric", "character", "factor", "numeric", "numeric")

# Read in the dataset
pollution <- read.csv("avgpm25.csv", colClasses = col_classes)

# Check the top
head(pollution)
##        pm25  fips region longitude latitude
## 1  9.771185 01003   east -87.74826 30.59278
## 2  9.993817 01027   east -85.84286 33.26581
## 3 10.688618 01033   east -87.72596 34.73148
## 4 11.337424 01049   east -85.79892 34.45913
## 5 12.119764 01055   east -86.03212 34.01860
## 6 10.827805 01069   east -85.35039 31.18973
# Look at the structure of the dataset
str(pollution)
## 'data.frame':    576 obs. of  5 variables:
##  $ pm25     : num  9.77 9.99 10.69 11.34 12.12 ...
##  $ fips     : chr  "01003" "01027" "01033" "01049" ...
##  $ region   : Factor w/ 2 levels "east","west": 1 1 1 1 1 1 1 1 1 1 ...
##  $ longitude: num  -87.7 -85.8 -87.7 -85.8 -86 ...
##  $ latitude : num  30.6 33.3 34.7 34.5 34 ...

Simple Summaries: One Dimension

For one-dimensional summaries, there are a number of options in R.

Five Number Summary

fivenum(pollution$pm25)
## [1]  3.382626  8.547590 10.046697 11.356829 18.440731
summary(pollution$pm25)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.383   8.549  10.050   9.836  11.360  18.440
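
The quartiles in the two outputs above differ slightly (8.547590 versus 8.549) because fivenum() returns Tukey's hinges, while summary() relies on quantile() with its default interpolation rule. Here is a minimal sketch of that difference, using a small made-up vector x rather than the pollution data:

```r
# Hypothetical toy data, not taken from avgpm25.csv
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
five <- fivenum(x)

# Default quantile() interpolation (type 7), which summary() relies on
quart <- quantile(x)

print(five)
print(quart)
```

For this vector the hinges are 2.5 and 6.5 while the quartiles are 2.75 and 6.25; the minimum, median, and maximum agree.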

Boxplots

boxplot(pollution$pm25, col = "red")

filter(pollution, pm25 > 15)
##       pm25  fips region longitude latitude
## 1 16.19452 06019   west -119.9035 36.63837
## 2 15.80378 06029   west -118.6833 35.29602
## 3 18.44073 06031   west -119.8113 36.15514
## 4 16.66180 06037   west -118.2342 34.08851
## 5 15.01573 06047   west -120.6741 37.24578
## 6 17.42905 06065   west -116.8036 33.78331
## 7 16.25190 06099   west -120.9588 37.61380
## 8 16.18358 06107   west -119.1661 36.23465
# Load the map library
library(maps)
## 
##  # maps v3.1: updated 'world': all lakes moved to separate new #
##  # 'lakes' database. Type '?world' or 'news(package="maps")'.  #
# Map it
map("county", "california")

# Filter the counties with pm25 > 15 and overlay them on the map
with(filter(pollution, pm25 > 15), points(longitude, latitude))

Histogram

# create histogram
hist(pollution$pm25, col = "green")

# Add a rug to mark each individual observation under the histogram
rug(pollution$pm25)

# Use more bins for finer detail by setting the breaks argument
hist(pollution$pm25, col = "green", breaks = 100)
rug(pollution$pm25)
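
When breaks is given as a single number, hist() treats it as a suggestion, so the resulting bin count may not be exactly 100. The sketch below uses simulated values (the vector x is made up for illustration) to inspect the bins without drawing anything:

```r
# Hypothetical simulated data standing in for pollution$pm25
set.seed(1)
x <- rnorm(500, mean = 10, sd = 2)

# plot = FALSE returns the histogram object instead of drawing it
h_default <- hist(x, plot = FALSE)
h_fine <- hist(x, breaks = 100, plot = FALSE)

# Number of bins actually used in each case
n_default <- length(h_default$counts)
n_fine <- length(h_fine$counts)
print(c(default = n_default, fine = n_fine))
```

Both objects account for every observation; only the bin width changes.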

Overlaying Features

boxplot(pollution$pm25, col = "red")

abline(h = 12)

hist(pollution$pm25, col = "green")
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)

Barplot

table(pollution$region) %>% barplot(col = "wheat")

Simple Summaries: Two Dimensions and Beyond

So far we’ve covered some of the main tools used to summarize one-dimensional data. For investigating data in two dimensions and beyond, there is an array of additional tools. The key approaches covered below are multiple boxplots, multiple histograms, and scatterplots.

For visualizing data in more than 2 dimensions, without resorting to 3-D animations (or glasses!), we can often combine the tools that we’ve already learned, for example by adding color to a scatterplot or by splitting the data across multiple scatterplots.

Multiple Boxplots

  • One of the simplest ways to show the relationship between two variables (in this case, one categorical and one continuous) is to show side-by-side boxplots. Using the pollution data described above, we can show the difference in PM2.5 levels between the eastern and western parts of the U.S. with the boxplot() function.
boxplot(pm25 ~ region, data = pollution, col = "red")

  • The boxplot() function can take a formula, with the left-hand side indicating the variable for which we want to create the boxplot (continuous) and the right-hand side indicating the variable that stratifies the left-hand side into categories. Since the region variable only has two categories, we end up with two boxplots. Side-by-side boxplots are useful because you can often fit many on a page to get a rich sense of any trends or changes in a variable. Their compact format allows you to visualize a lot of data in a small space.

  • From the plot above, we can see rather clearly that the levels in eastern counties are on average higher than the levels in western counties.
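
To read the numbers behind side-by-side boxplots, the same formula can be passed to boxplot() with plot = FALSE, which returns the statistics instead of drawing them. The sketch below uses a small made-up data frame named toy, not the pollution data:

```r
# Hypothetical toy data, not taken from avgpm25.csv
toy <- data.frame(
  pm25 = c(9.8, 10.7, 12.1, 11.3, 6.2, 8.1, 16.2, 7.4),
  region = factor(rep(c("east", "west"), each = 4))
)

# Same formula interface as the plot, but only the statistics are returned
bp <- boxplot(pm25 ~ region, data = toy, plot = FALSE)

# Row 3 of the stats matrix holds the median of each group
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```

Rows 1 and 5 of bp$stats hold the whisker ends and rows 2 and 4 the hinges, with one column per level of the grouping variable.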

Multiple Histograms

  • It can sometimes be useful to plot multiple histograms, much like with side-by-side boxplots, to see changes in the shape of the distribution of a variable across different categories. However, the number of histograms that you can effectively put on a page is limited.

  • Here is the distribution of PM2.5 in the eastern and western regions of the U.S.

par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")
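
When stacking histograms, the panels are easier to compare if they share the same bins. One possible approach, sketched here on simulated data (the toy data frame and the choice of roughly 20 bins are assumptions for illustration), is to compute the breaks once from the pooled values and reuse them for each group:

```r
# Hypothetical simulated data standing in for the pollution data frame
set.seed(42)
toy <- data.frame(
  pm25 = c(rnorm(100, mean = 11, sd = 2), rnorm(100, mean = 9, sd = 3)),
  region = factor(rep(c("east", "west"), each = 100))
)

# Compute one set of breaks covering the full range of the pooled data
common_breaks <- pretty(range(toy$pm25), n = 20)

# Build one histogram object per region with identical bins
h_list <- lapply(split(toy$pm25, toy$region), hist,
                 breaks = common_breaks, plot = FALSE)

print(sapply(h_list, function(h) sum(h$counts)))
```

Each element of h_list can then be drawn with plot(), for example plot(h_list$east, col = "green"), after setting par(mfrow = c(2, 1)) as above.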

Scatterplots

  • For continuous variables, the most common visualization technique is the scatterplot, which simply maps each variable to an x- or y-axis coordinate. Here is a scatterplot of latitude and PM2.5, which can be made with the plot() function.
with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)

  • Going from south to north in the U.S., we can see that the highest levels of PM2.5 tend to be in the middle region of the country.

Scatterplot - Using Color

  • If we wanted to add a third dimension to the scatterplot above, say the region variable indicating east and west, we could use color to highlight that dimension. Here we color the circles in the plot to indicate east (black) or west (red).
with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)

  • It may be confusing at first to figure out which color gets mapped to which region. We can find out by looking directly at the levels of the region variable.
levels(pollution$region)
## [1] "east" "west"
  • Here we see that the first level is “east” and the second level is “west”. So the color for “east” will get mapped to 1 and the color for “west” will get mapped to 2. For plotting functions, col = 1 is black (the default color) and col = 2 is red.
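
The mapping from factor levels to colors can be checked directly. In this minimal sketch the region vector is made up for illustration; it converts the factor to its integer codes and looks those up in the current palette. Note that in R 4.0.0 and later the second palette entry is, to our knowledge, a softer red reported as a hex code rather than the name "red".

```r
# Hypothetical factor standing in for pollution$region
region <- factor(c("east", "west", "east", "west"))

# A factor passed to col is used through its integer codes
codes <- as.integer(region)

# Look the codes up in the current palette to see the actual colors
point_cols <- palette()[codes]

print(data.frame(region, codes, point_cols))
```

Passing col = region to plot() amounts to the same lookup, so reordering the factor levels changes which region gets which color.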

Multiple Scatterplots

  • Using multiple scatterplots can be necessary when overlaying points with different colors or shapes is confusing (sometimes because of the volume of data). Separating the plots out can sometimes make visualization easier.
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))

  • These kinds of plots, sometimes called panel plots, are generally easier to make with either the lattice or ggplot2 system, which we will learn about in greater detail in later chapters.
## Lattice
library(lattice)
xyplot(pm25 ~ latitude | region, data = pollution)

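## ggplot2
## qplot() is deprecated as of ggplot2 3.4.0, so the equivalent ggplot() call is used
library(ggplot2)
ggplot(pollution, aes(latitude, pm25)) + geom_point() + facet_grid(. ~ region)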

Summary

  • Exploratory plots are “quick and dirty” and their purpose is to let you summarize the data and highlight any broad features. They are also useful for exploring basic questions about the data and for judging the evidence for or against certain hypotheses. Ultimately, they may be useful for suggesting modeling strategies that can be employed in the “next step” of the data analysis process.