Exploratory Graphs: Part TWO VIDEO - https://www.youtube.com/watch?v=UyopqXQ8TTM&feature=youtu.be
What is the difference between categorical, ordinal and interval variables? http://www.ats.ucla.edu/stat/mult_pkg/whatstat/nominal_ordinal_interval.htm
What is a continuous variable? http://www.statisticshowto.com/continuous-variable/
Each row contains the 5-digit code indicating the county (fips), the region of the country in which the county resides, the longitude and latitude of the centroid for that county, and the average PM2.5 level.
# Load deplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Getting the Data
# Create a class variable for the columns
class <- c("numeric", "character", "factor", "numeric", "numeric")
# Read in the dataset
pollution <- read.csv("avgpm25.csv", colClasses = class)
# Check the top
head(pollution)
## pm25 fips region longitude latitude
## 1 9.771185 01003 east -87.74826 30.59278
## 2 9.993817 01027 east -85.84286 33.26581
## 3 10.688618 01033 east -87.72596 34.73148
## 4 11.337424 01049 east -85.79892 34.45913
## 5 12.119764 01055 east -86.03212 34.01860
## 6 10.827805 01069 east -85.35039 31.18973
# Look at the structure of the dataset
str(pollution)
## 'data.frame': 576 obs. of 5 variables:
## $ pm25 : num 9.77 9.99 10.69 11.34 12.12 ...
## $ fips : chr "01003" "01027" "01033" "01049" ...
## $ region : Factor w/ 2 levels "east","west": 1 1 1 1 1 1 1 1 1 1 ...
## $ longitude: num -87.7 -85.8 -87.7 -85.8 -86 ...
## $ latitude : num 30.6 33.3 34.7 34.5 34 ...
For one dimensional summarize, there are number of options in R.
Five-number summary: This gives the minimum, 25th percentile, median, 75th percentile, maximum of the data and is quick check on the distribution of the data (see the fivenum())
Boxplots: Boxplots are a visual representation of the five-number summary plus a bit more information. In particular, boxplots commonly plot outliers that go beyond the bulk of the data. This is implemented via the boxplot() function
Barplot: Barplots are useful for visualizing categorical data, with the number of entries for each category being proportional to the height of the bar. Think “pie chart” but actually useful. The barplot can be made with the barplot() function.
Histograms: Histograms show the complete empirical distribution of the data, beyond the five data points shown by the boxplots. Here, you can easily check skewwness of the data, symmetry, multi-modality, and other features. The hist() function makes a histogram, and a handy function to go with it sometimes is the rug() function.
Density plot: The density() function computes a non-parametric estimate of the distribution of a variables
fivenum(pollution$pm25)
## [1] 3.382626 8.547590 10.046697 11.356829 18.440731
summary(pollution$pm25)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.383 8.549 10.050 9.836 11.360 18.440
boxplot(pollution$pm25, col = "red")
Note that in a boxplot, the “whiskers” that stick out above and below the box have a length of 1.5 times the inter-quartile range, or IQR, which is simply the distance from the bottom of the box to the top of the box. Anything beyond the whiskers is marked as an “outlier” and is plotted separately as an individual point.
From the boxplot, we can see that there are a few points on both the high and the low end that appear to be outliers according to the boxplot() algorithm. These points migth be worth looking at individually.
From the plot, it appears that the high points are all above the level of 15, so we can take a look at those data points directly. Note that although the current national ambient air quality standard is 12 micrograms per cubic meter, it used to be 15.
filter(pollution, pm25 > 15)
## pm25 fips region longitude latitude
## 1 16.19452 06019 west -119.9035 36.63837
## 2 15.80378 06029 west -118.6833 35.29602
## 3 18.44073 06031 west -119.8113 36.15514
## 4 16.66180 06037 west -118.2342 34.08851
## 5 15.01573 06047 west -120.6741 37.24578
## 6 17.42905 06065 west -116.8036 33.78331
## 7 16.25190 06099 west -120.9588 37.61380
## 8 16.18358 06107 west -119.1661 36.23465
These counties are all in the western U.S. (region == west) and in fact are all in California because the first two digits of the fips code are 06.
We can make a quick map of these counties to get a sense of where they are in California
# Load the map library
library(maps)
##
## # maps v3.1: updated 'world': all lakes moved to separate new #
## # 'lakes' database. Type '?world' or 'news(package="maps")'. #
# Map it
map("county", "california")
# Perform a select and overlay onto the map
with(filter(pollution, pm25 > 15), points(longitude, latitude))
# create histogram
hist(pollution$pm25, col = "green")
# Get more information w/the rug() function
rug(pollution$pm25)
This distribution is interesting because there appears to be a high concentration of counties in the neighborhood of 9 to 12 micrograms per cubic meter. We can get a little more detail of we use the rug() function to show us the actual data points.
The large cluster of data points in the 9 to 12 range is perhaps not surprising in this context. It’s not uncommon to observe this behavior in situations where you have a strict limit imposed at a certain level. Note that there are still quite a few counties above the level of 12, which may be worth investigating.
The hist() function has a default algorithm for determining the number of bars to use in the histogram based on the density of the data (see ?nclass.Sturges). However, you can override the default option by setting the breaks argument to something else. Here, we use more bars to try to get more detail.
# Better with parameters
#
hist(pollution$pm25, col = "green", breaks = 100)
rug(pollution$pm25)
boxplot(pollution$pm25, col = "red")
abline(h = 12)
We can see that a reasonable portion of the distribution as displayed by the boxplot is above the line (i.e. potentially in violation of the standard).
While the boxplot gives a sense, the histogram might be better suited to visualizing the data here. In the plot below, we see the histogram and draw two lines, one at the median of the data and one at 12, the level of the standard.
hist(pollution$pm25, col = "green")
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = "magenta", lwd = 4)
table(pollution$region) %>% barplot(col = "wheat")
So far we’ve covered some of the main tools used to summarize one dimensional data. For investigating data in two dimensions and beyond, there is an array of additional tools. Some of the key approaches are:
Multiple or overlayed 1-D plots (Lattice/ggplot2): Using multiple boxplots or multiple histograms can be useful for seeing the relationship between two variables, especially when on is naturally categorical.
Scatterplots: Scatterplots are the natural tool for visualizing two continuous variables. Transformations of the variables (e.g. log or square-root transformation) may be necessary for effective visualization.
Smooth scatterplots: Similar in concept to scatterplots but rather plots a 2-D histogram of the data. Can be useful for scatterplots that may contain many many data points.
For visualizing data in more than 2 dimensions, without resorting to 3-D animations (or glasses!), we can often combine the tools that we’ve already learned:
Overlayed or multiple 2-D plots; conditioning plots (coplots): A conditioning plot, or coplot, shows the relationship between two variables as a third (or more) variable changes. For example, you might want to see how the relationship between air pollution levels and mortality changes with the season of the year. Here, air pollution and mortality are the two primary variables and season is the third variable varying in the background.
Use color, size, shape to add dimensions: Plotting points with different colors or shapes is useful for indicating a third dimension, where different colors can indicate different categories or ranges of something. Plotting symbols with different sizes can also achieve the same effect when the third dimension is continuous.
Spinning/interactive plots: Spinning plots can be used to simulate 3-D plots by allowing the user to essentially quickly cycle through many different 2-D projections so that the plot feels 3-D. These are sometimes helpful to capture unusual structure in the data, but I rarely use them.
Actual 3-D plots (not that useful): Actual 3-D plots (for example, requiring 3- D glasses) are relatively few and far between and are generally impractical for communicating to a large audience. Of course, this may change in the future with improvements in technology.
boxplot(pm25 ~ region, data = pollution, col = "red")
The boxplot() function can take a formula, with the left hand side indicating the variable for which we want to create the boxplot (continuous) and the right hand side indicating the variable that stratifies the left hand side into categories. Since the region variable only has two categories, we end up with two boxplots. Side-by-side boxplots are useful because you can often fit many on a page to get a rich sense of any trends or changes in a variable. Their compact format allow you to visualize a lot of data in a small space.
From the plot above, we can see rather clearly that the levels in eastern counties are on average higher than the levels in western counties.
It can sometimes be useful to plot multiple histograms, much like with side-by-side boxplots, to see changes in the shape of the distribution of a variable across different categories. However, the number of histograms that you can effectively put on a page is limited.
Here is the distribution of PM2.5 in the eastern and western regions of the U.S.
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green")
hist(subset(pollution, region == "west")$pm25, col = "green")
You can see here that the PM2.5 in the western U.S. tends to be right-skewed with some outlying counties with very high levels. The PM2.5 in the east tends to be left skewed some counties having very low levels.
Skewness in Histograms: http://www.dummies.com/how-to/content/how-to-identify-skew-and-symmetry-in-a-statistical.html
with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)
with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)
# It may be confusing at first to figure out which color gets mapped to which region. We
# can find out by looking directly at the levels of the region variable.
#
levels(pollution$region)
## [1] "east" "west"
levels(pollution$region)
## [1] "east" "west"
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))
## Lattice
library(lattice)
xyplot(pm25 ~ latitude | region, data = pollution)
## ggplot2
library(ggplot2)
qplot(latitude, pm25, data = pollution, facets = . ~ region)