Statistical Visualization

One of the hardest parts of analysis is producing quality supporting graphics. Conversely, a good graph is one of the best ways to present findings. Fortunately, R provides excellent graphing capabilities, both in the base installation and through add-on packages such as lattice and ggplot2. In this exercise, we will briefly introduce some simple graphs using base graphics and then show their counterparts in ggplot2.

Graphics are used in statistics primarily for two reasons: EDA (Exploratory Data Analysis) and presenting results. Both are incredibly important but must be targeted to different audiences.

Base Graphics

When graphing for the first time with R, most people use base graphics and then move on to ggplot2 when their needs become more complex. This section is here for completeness and because base graphics are sometimes simply needed, especially for modifying the plots generated by other functions.

Before we go any further we need some data. Most of the datasets built into R are tiny, even by the standards of ten years ago. A good dataset for example graphs is, ironically, included with ggplot2. In order to access it, ggplot2 must first be installed and loaded. The goal of this exercise is to introduce you to some basic statistical plots using base graphics and ggplot2, so we will use a simple dataset and focus on charting concepts.

Reading in the diamonds dataset from ggplot2

require(ggplot2)

## Loading required package: ggplot2

data(diamonds)
head(diamonds)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Histogram using base graphics

The most common graph of data in a single variable is a histogram. It shows the distribution of values for that variable. Creating a histogram is very simple, as shown below for the carat column in diamonds.

hist(diamonds$carat, main="Carat Histogram", xlab="Carat")

This shows the distribution of carat size. Notice that the title was set using the main argument and the x-axis label using the xlab argument. Histograms break the data into buckets, and the heights of the bars represent the number of observations that fall into each bucket.
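To see the buckets behind the picture, the histogram can be computed without being drawn. The following is a minimal sketch; the value breaks = 20 is an arbitrary choice for illustration, and R treats it only as a suggestion when picking bucket boundaries.

```r
library(ggplot2)  # only needed here for the diamonds data
data(diamonds)

# Compute the histogram without drawing it.
# breaks = 20 is an illustrative suggestion; R may adjust it to round boundaries.
h <- hist(diamonds$carat, breaks = 20, plot = FALSE)

h$breaks  # bucket boundaries
h$counts  # number of diamonds that fall into each bucket

# Every observation lands in exactly one bucket, so the counts add up to the row count.
sum(h$counts) == nrow(diamonds)
```

The bar heights in the plotted histogram are exactly the values in h$counts.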

Scatterplot with base graphics

It is frequently useful to see two variables in comparison with each other; this is where a scatterplot is used. We will plot the price of diamonds against the carat using formula notation.

plot(price ~ carat, data=diamonds)

The ~ separating price and carat indicates that we are viewing price against carat, where price is the y-value and carat is the x-value. It is also possible to build a scatterplot by simply specifying the x and y variables without the formula interface.

Scatterplot without using formula

plot(diamonds$carat, diamonds$price)

Boxplots with base graphics

Boxplots are often among the first graphs taught to statistics students. They are often used as a statistical mechanism to find outliers in data. Given their ubiquity, it is important to learn them, and thankfully R has the boxplot function to help us construct one.

boxplot(diamonds$carat)

The idea behind the boxplot is that the thick middle line represents the median and the box is bounded by the first and third quartiles. That is, the middle 50% of the data (the Interquartile Range, or IQR) is held in the box. The lines extend to the most extreme observations within 1.5*IQR of the box in each direction. Any points beyond that are plotted individually as outliers.
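The numbers behind the box, the lines and the outlier points can be inspected with boxplot.stats from base R. This is a sketch for illustration; the variable names hinges, box_len and fences are our own, and note that boxplot.stats uses hinges, which are close to but not always identical to the quartiles returned by quantile.

```r
library(ggplot2)  # only needed here for the diamonds data
data(diamonds)

s <- boxplot.stats(diamonds$carat)  # default coef = 1.5, as in boxplot()

hinges  <- s$stats[c(2, 4)]         # lower and upper edge of the box
box_len <- diff(hinges)             # length of the box, approximately the IQR

# Limits beyond which a point is treated as an outlier
fences <- c(hinges[1] - 1.5 * box_len, hinges[2] + 1.5 * box_len)

s$stats[3]        # the median (thick middle line)
s$stats[c(1, 5)]  # ends of the lines: most extreme points inside the fences
length(s$out)     # number of points drawn individually as outliers
```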

Histogram using ggplot2

While R’s base graphics are extremely powerful and flexible and can be customized to a great extent, using them can be labor-intensive. Two packages, ggplot2 and lattice, were built to make graphing easier. Now we will recreate all the previous graphs with ggplot2 and expand the examples with more advanced features.

Initially the ggplot2 syntax is harder to grasp, but the effort is more than worthwhile. The basic structure of ggplot2 starts with the ggplot function, which at its most basic takes the data as its first argument. After initializing the object, we add layers using the + symbol. To start, we will discuss only geometric layers such as points, lines and histograms. Each layer can have its own aesthetic mappings and even its own data.

As we did above using base graphics, let’s plot the distribution of diamond carats using ggplot2. This is built using ggplot and geom_histogram as shown below.

ggplot(data=diamonds) + geom_histogram(aes(x=carat))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
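The message above suggests choosing a bin width explicitly. A minimal sketch follows, assuming a width of 0.1 carat purely for illustration; the right value depends on the data and on what the plot should emphasize.

```r
library(ggplot2)
data(diamonds)

# binwidth = 0.1 is an illustrative choice (0.1 carat per bucket), not a recommendation
p <- ggplot(data = diamonds) + geom_histogram(aes(x = carat), binwidth = 0.1)
p
```

With binwidth supplied, ggplot2 no longer falls back to its default of 30 bins.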

A similar display is the density plot, which is made by changing geom_histogram to geom_density. We also specify the color used to fill the graph with the fill argument.

Density plot using ggplot2

ggplot(data=diamonds) + geom_density(aes(x=carat) , fill = "grey50")

Whereas the histogram displays the count of data in buckets, the density plot shows the probability of an observation falling within a sliding window along the variable of interest. The difference between the two is subtle but important: histograms are more of a discrete measurement, while density plots are more of a continuous measurement.
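One way to see the probability interpretation is to check that the area under a density curve is roughly 1. The sketch below uses density from base R with its default settings, which may not match geom_density's curve exactly; the names step and area are our own.

```r
library(ggplot2)  # only needed here for the diamonds data
data(diamonds)

d <- density(diamonds$carat)  # kernel density estimate on an evenly spaced grid

# Approximate the area under the curve as height times grid spacing
step <- diff(d$x[1:2])
area <- sum(d$y) * step
area  # should be close to 1
```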

Scatterplot using ggplot2

Here we not only see the ggplot2 way of making a scatterplot but also show some of the power of ggplot2. In the next few examples, we will be using ggplot(diamonds, aes(x=carat, y=price)) repeatedly, which ordinarily would require a lot of redundant typing. Fortunately, we can save ggplot objects to variables and add layers later.

Here we add a third dimension to the scatterplot using the color column.

g <- ggplot(diamonds, aes(x=carat,y=price))
g + geom_point(aes(color=color))

Notice that we set color=color inside aes. This is because the designated color will be determined by the data. Also, see that a legend was automatically generated.
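For contrast, a color given outside aes is a fixed setting rather than a mapping. The sketch below builds both versions; the names mapped and fixed, and the choice of "blue", are ours for illustration.

```r
library(ggplot2)
data(diamonds)

g <- ggplot(diamonds, aes(x = carat, y = price))

# Inside aes: the color varies with the data and a legend is drawn
mapped <- g + geom_point(aes(color = color))

# Outside aes: every point gets the same fixed color and no legend is drawn
fixed <- g + geom_point(color = "blue")

mapped
fixed
```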

ggplot2 has the ability to make faceted graphs, or small multiples, and this is done using facet_wrap or facet_grid. facet_wrap takes the levels of one variable, cuts up the underlying data according to them, makes a separate pane for each set and arranges them to fit in the plot. Here row and column placement have no real meaning.

facet_wrap and facet_grid in ggplot2

g + geom_point(aes(color=color)) + facet_wrap(~color)

On the other hand, facet_grid acts similarly but assigns all levels of a variable to either a row or a column, as shown below.

g + geom_point(aes(color=color)) + facet_grid(cut~clarity)

After understanding how to read one pane in this plot we can easily understand all the panes and make quick comparisons.

Boxplot using ggplot2

Being a complete graphics package, ggplot2 offers geom_boxplot. Even though it is one-dimensional, using only a y aesthetic, there needs to be some x aesthetic (at least in older versions of ggplot2), so we will use 1.

ggplot(diamonds,aes(y=carat, x=1)) + geom_boxplot()

This can be neatly extended to drawing multiple boxplots, one for each level of a variable as shown below.

ggplot(diamonds,aes(y=carat, x=cut)) + geom_boxplot()

Getting fancy, we can swap out the boxplots for violin plots using geom_violin. Violin plots are similar to boxplots except that the boxes are curved, giving a sense of the density of the data.

We can add multiple layers (geoms) to the same plot, as seen below. Notice that the order of the layers matters. In the graph on the left, the points are underneath the violins, while in the graph on the right, the points are on top of the violins. The gridExtra package helps arrange multiple graphs in rows and columns.

Violin Plots using ggplot2

require(gridExtra)

## Loading required package: gridExtra

p1 <- ggplot(diamonds,aes(y=carat,x=cut)) + geom_point() + geom_violin()
p2 <- ggplot(diamonds,aes(y=carat,x=cut)) + geom_violin() + geom_point()
grid.arrange(p1, p2, ncol=2)

Line plots using ggplot2

Line charts are often used when one variable has a certain continuity, but that is not always necessary; there is often a good reason to use a line with categorical data.

Let’s create a simple line plot using the economics data from the ggplot2 package.

data(economics)
head(economics)

## # A tibble: 6 x 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018

ggplot(economics, aes(x=date, y=pop)) + geom_line()

A common task for line plots is displaying a metric over the course of a year for many years. To prepare the economics data we will use the lubridate package, which has convenient functions for manipulating dates.

We need to create two new variables: year and month. To simplify things we will subset the data to include only the years from 2000 onward.

Preparing data for multiple line charts

require(lubridate)

## Loading required package: lubridate

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

## create year and month columns
economics$year <- year(economics$date)
economics$month <- month(economics$date)

## subset the data
econ2000 <- economics[which(economics$year>=2000),]

head(econ2000)

## # A tibble: 6 x 8
##   date         pce    pop psavert uempmed unemploy  year month
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl> <dbl> <dbl>
## 1 2000-01-01 6535. 280976     5.4     5.8     5708  2000     1
## 2 2000-02-01 6620. 281190     4.8     6.1     5858  2000     2
## 3 2000-03-01 6686. 281409     4.5     6       5733  2000     3
## 4 2000-04-01 6671. 281653     5       6.1     5481  2000     4
## 5 2000-05-01 6708. 281877     4.9     5.8     5758  2000     5
## 6 2000-06-01 6744. 282126     4.9     5.7     5651  2000     6

Now let’s create line plots depicting multiple years as follows. The first two lines of the code block create the line graph with a separate line and color for each year.

Notice that we converted year to a factor so that it would get a discrete color scale, and then the scale was named using scale_color_discrete(name="Year"). Lastly, the title, x-label and y-label were set with labs.

All these pieces put together build a professional-looking, publication-quality graph as below.

g <- ggplot(econ2000,aes(x=month, y=pop))
g <- g + geom_line(aes(color=factor(year), group=year))
g <- g + scale_color_discrete(name="Year")
g <- g + labs(title="Population Growth", x="Month",y="Population")
g

Themes in ggplot2

A great feature of ggplot2 is the ability to use themes to easily change the way plots look.

Building a theme from scratch can be a daunting task, but the ggthemes package has put together themes that recreate commonly used styles of graphs.

The following are a few styles: The Economist, Excel, Edward Tufte and The Wall Street Journal.

require(ggthemes)

## Loading required package: ggthemes

g2 <- ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=color))

## Let's apply a few themes
p1 <- g2 + theme_economist() + scale_color_economist()
p2 <- g2 + theme_excel() + scale_color_excel()
p3 <- g2 + theme_tufte()
p4 <- g2 + theme_wsj()
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
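A full theme from scratch is not always needed. As a rough sketch, one of the themes that ships with ggplot2 itself can be combined with theme to override individual elements; the particular tweaks, the title text and the name p5 below are only illustrative choices.

```r
library(ggplot2)
data(diamonds)

g2 <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point(aes(color = color))

# theme_minimal() comes with ggplot2; theme() then overrides single elements
p5 <- g2 +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold")) +
  labs(title = "Diamond Price by Carat")
p5
```

Because theme is added after theme_minimal, its settings take precedence over the ones the complete theme provides.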

In this exercise we have seen both base graphics and ggplot2 graphics, which are both nicer and easier to create.

We have covered histograms, scatterplots, boxplots, violin plots, line plots and density graphs.

We have also looked at using colors and small multiples for distinguishing data. This is just a humble introduction to ggplot2 and base plotting in R. There are many other features in ggplot2 such as jittering, stacking, dodging and alpha.
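As a brief taste of those features, the sketch below shows jittering with alpha on a point layer and dodging versus stacking on a bar layer. The variable names and the specific values (width = 0.2, alpha = 0.1) are our own illustrative choices.

```r
library(ggplot2)
data(diamonds)

# Jittering plus alpha: spread overlapping points sideways and make them translucent
p_jitter <- ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.1)

# Dodging: bars for each clarity level sit side by side within each cut
p_dodge <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")

# Stacking: the same bars piled on top of each other within each cut
p_stack <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "stack")

p_jitter
p_dodge
p_stack
```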