One of the hardest parts of analysis is producing quality supporting graphics. Conversely, a good graph is one of the best ways to present findings. Fortunately, R provides excellent graphing capabilities, both in the base installation and through add-on packages such as lattice and ggplot2. In this exercise, we will briefly introduce some simple graphs using base graphics and then show their counterparts in ggplot2.
Graphics are used in statistics primarily for two reasons: EDA (Exploratory Data Analysis) and presenting results. Both are incredibly important but must be targeted to different audiences.
When graphing for the first time with R, most people use base graphics and then move on to ggplot2 when their needs become more complex. This section is here for completeness and because base graphics are still needed at times, especially for modifying the plots generated by other functions.
Before we go any further we need some data. Most of the datasets built into R are tiny, even by the standards of ten years ago. A good dataset for example graphs is, ironically, included with ggplot2. To access it, ggplot2 must first be installed and loaded. Because the goal of this exercise is to introduce some basic statistical plots using base graphics and ggplot2, we will use a simple dataset and focus on charting concepts.
require(ggplot2)
## Loading required package: ggplot2
data(diamonds)
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
The most common graph of data in a single variable is a histogram, which shows the distribution of values for that variable. Creating a histogram is very simple, as shown below for the carat column in diamonds.
hist(diamonds$carat, main="Carat Histogram", xlab="Carat")
This shows the distribution of carat size. Notice that the title was set using the main argument and the x-axis label using the xlab argument. Histograms break the data into buckets, and the heights of the bars represent the number of observations that fall into each bucket.
It is frequently useful to see two variables in comparison with each other; this is where a scatterplot is used. We will plot the price of diamonds against the carat using formula notation.
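To make the bucketing concrete, here is a minimal sketch that inspects the buckets instead of drawing them. It assumes ggplot2 is installed so that diamonds is available; the names h and h50 are illustrative only. Note that when breaks is a single number, hist treats it as a suggestion and may choose a different number of buckets.

```r
library(ggplot2)  # only needed here for the diamonds data

# plot = FALSE returns the histogram object without drawing it
h <- hist(diamonds$carat, plot = FALSE)
h$breaks  # bucket boundaries
h$counts  # number of observations in each bucket

# Ask for roughly 50 buckets instead of the default
h50 <- hist(diamonds$carat, breaks = 50, plot = FALSE)
length(h50$counts)
```

Every observation lands in exactly one bucket, so the counts always add up to the number of rows in the data.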
plot(price ~ carat, data=diamonds)
The ~ separating price and carat indicates that we are viewing price against carat, where price is the y-value and carat is the x-value. It is also possible to build a scatterplot by simply specifying the x and y variables without the formula interface.
plot(diamonds$carat, diamonds$price)
Boxplots are often among the first graphs taught to statistics students. They are often used as a quick way to find outliers in data. Given their ubiquity, it is important to learn them, and thankfully R has the boxplot function to help us construct one.
boxplot(diamonds$carat)
The idea behind the boxplot is that the thick middle line represents the median and the box is bounded by the first and third quartiles. That is, the middle 50% of the data (the interquartile range, or IQR) is held in the box. The whiskers extend to the most extreme data points that lie within 1.5*IQR of the box in each direction. Points beyond that are plotted individually as outliers.
While R’s base graphics are powerful, flexible and highly customizable, using them can be labor-intensive. Two packages, ggplot2 and lattice, were built to make graphing easier. Now we will recreate all the previous graphs with ggplot2 and expand the examples with more advanced features.
Initially the ggplot2 syntax is harder to grasp, but the effort is more than worthwhile. The basic structure of ggplot2 starts with the ggplot function, which at its most basic takes the data as its first argument. After initializing the object, we add layers using the + symbol. To start, we will discuss only geometric layers such as points, lines and histograms. Each layer can have its own aesthetic mappings and even its own data.
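The numbers behind a boxplot can be inspected directly. The sketch below uses boxplot.stats from base R, which computes the values that boxplot draws; it assumes ggplot2 is installed for the diamonds data, and the names s and box_height are illustrative. One caveat: the box edges it reports are "hinges", which are very close to the quartiles but computed slightly differently.

```r
library(ggplot2)  # only needed here for the diamonds data

# boxplot.stats() computes the numbers that boxplot() draws
s <- boxplot.stats(diamonds$carat)

s$stats        # lower whisker, lower hinge, median, upper hinge, upper whisker
length(s$out)  # how many points are drawn individually as outliers

# The upper whisker never reaches past 1.5 * (box height) above the box
box_height <- s$stats[4] - s$stats[2]
s$stats[5] <= s$stats[4] + 1.5 * box_height
```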
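As a rough sketch of this layered structure, the example below initializes a plot and adds two point layers, the second with its own data. The subset big and its 3-carat cutoff are hypothetical choices made only for illustration.

```r
library(ggplot2)

# Hypothetical subset used only for illustration: diamonds above 3 carats
big <- diamonds[diamonds$carat > 3, ]

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data and shared mappings
  geom_point(color = "grey60") +                    # layer 1: all diamonds
  geom_point(data = big, color = "red")             # layer 2: its own data
p
```

The second layer inherits the x and y mappings from ggplot but draws only the rows in big.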
As we did above using base graphics, let’s plot the distribution of diamond carats using ggplot2. This is built using ggplot and geom_histogram as shown below.
ggplot(data=diamonds) + geom_histogram(aes(x=carat))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
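The message above suggests choosing a bin width explicitly. A minimal sketch follows; the value 0.1 is an arbitrary choice for illustration, not a recommendation.

```r
library(ggplot2)

# Setting binwidth explicitly avoids the stat_bin() message
p <- ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.1)
p
```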
A similar display is the density plot, which is made by changing geom_histogram to geom_density. We also specify the fill color of the graph using the fill argument.
ggplot(data=diamonds) + geom_density(aes(x=carat) , fill = "grey50")
Whereas the histogram displays the count of data in buckets, the density plot shows the probability of an observation falling within a sliding window along the variable of interest. The difference between the two is subtle but important: histograms are more of a discrete measurement, while density plots are more of a continuous measurement.
Here we not only see the ggplot2 way of making a scatterplot but also some of the power of ggplot2. In the next few examples, we will be using ggplot(diamonds, aes(x=carat, y=price)) repeatedly, which ordinarily would require a lot of redundant typing. Fortunately, we can save ggplot objects to variables and add layers later.
Here we add a third dimension to the scatterplot using the color column.
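One way to see the "continuous" nature of a density curve is that the total area under it is 1, regardless of how many observations there are. The sketch below checks this with density from base R, which computes a comparable kernel density estimate; the names d and step are illustrative.

```r
library(ggplot2)  # only needed here for the diamonds data

# Kernel density estimate, evaluated on an evenly spaced grid
d <- density(diamonds$carat)

# Approximate the area under the curve with a Riemann sum
step <- d$x[2] - d$x[1]
sum(d$y) * step   # close to 1
```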
g <- ggplot(diamonds, aes(x=carat,y=price))
g + geom_point(aes(color=color))
Notice that we set color=color inside aes. This is because the designated color will be determined by the data. Also, see that a legend was automatically generated.
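For contrast, here is a small sketch of the same scatterplot with the color set outside aes. The color "blue" is an arbitrary illustrative choice, and p_mapped and p_fixed are illustrative names.

```r
library(ggplot2)

g <- ggplot(diamonds, aes(x = carat, y = price))

# Inside aes(): color is mapped from the color column, legend is generated
p_mapped <- g + geom_point(aes(color = color))

# Outside aes(): every point gets the same fixed color, no legend
p_fixed <- g + geom_point(color = "blue")
p_fixed
```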
ggplot2 can make faceted graphs, or small multiples, using facet_wrap or facet_grid. facet_wrap takes the levels of one variable, cuts up the underlying data according to them, makes a separate pane for each subset and arranges the panes to fit in the plot. Here row and column placement have no real meaning.
g + geom_point(aes(color=color)) + facet_wrap(~color)
On the other hand, facet_grid acts similarly but assigns all levels of a variable to either a row or a column, as shown below.
g + geom_point(aes(color=color)) + facet_grid(cut~clarity)
After understanding how to read one pane in this plot we can easily understand all the panes and make quick comparisons.
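The arrangement chosen by facet_wrap can also be controlled. As a small sketch, the nrow argument below forces all panes into one row; faceting by cut here is just an illustrative choice.

```r
library(ggplot2)

g <- ggplot(diamonds, aes(x = carat, y = price))

# One pane per cut, forced into a single row
p <- g + geom_point(aes(color = color)) + facet_wrap(~cut, nrow = 1)
p
```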
Being a complete graphics package, ggplot2 offers geom_boxplot. Even though a boxplot is one-dimensional and uses only a y aesthetic, this example also supplies an x aesthetic (older versions of ggplot2 require one), so we will use 1.
ggplot(diamonds,aes(y=carat, x=1)) + geom_boxplot()
This can be neatly extended to drawing multiple boxplots, one for each level of a variable as shown below.
ggplot(diamonds,aes(y=carat, x=cut)) + geom_boxplot()
Getting fancy, we can swap out the boxplots for violin plots using geom_violin. Violin plots are similar to boxplots except that the boxes are curved, giving a sense of the density of the data.
We can add multiple layers (geoms) to the same plot, as seen below. Notice that the order of the layers matters. In the graph on the left, the points are underneath the violins, while in the graph on the right, the points are on top of the violins. The gridExtra package helps arrange multiple graphs in rows and columns.
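A minimal violin plot, sketched by reusing the aesthetics from the boxplot by cut and changing only the geom:

```r
library(ggplot2)

# Same aesthetics as the boxplot by cut, with geom_violin as the layer
p <- ggplot(diamonds, aes(y = carat, x = cut)) + geom_violin()
p
```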
require(gridExtra)
## Loading required package: gridExtra
p1 <- ggplot(diamonds,aes(y=carat,x=cut)) + geom_point() + geom_violin()
p2 <- ggplot(diamonds,aes(y=carat,x=cut)) + geom_violin() + geom_point()
grid.arrange(p1, p2, ncol=2)
Line charts are often used when one variable has a certain continuity, but that is not always necessary because there is often a good reason to use a line with categorical data.
Let’s create a simple line plot using the economics data from the ggplot2 package.
data(economics)
head(economics)
## # A tibble: 6 x 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018
ggplot(economics, aes(x=date, y=pop)) + geom_line()
A common task for line plots is displaying a metric over the course of a year for many years. To prepare the economics data we will use the lubridate package, which has convenient functions for manipulating dates.
We need to create two new variables: year and month. To simplify things we will subset the data to include only years from 2000 onward.
require(lubridate)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## create year and month columns
economics$year <- year(economics$date)
economics$month <- month(economics$date)
## subset the data
econ2000 <- economics[which(economics$year>=2000),]
head(econ2000)
## # A tibble: 6 x 8
## date pce pop psavert uempmed unemploy year month
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-01-01 6535. 280976 5.4 5.8 5708 2000 1
## 2 2000-02-01 6620. 281190 4.8 6.1 5858 2000 2
## 3 2000-03-01 6686. 281409 4.5 6 5733 2000 3
## 4 2000-04-01 6671. 281653 5 6.1 5481 2000 4
## 5 2000-05-01 6708. 281877 4.9 5.8 5758 2000 5
## 6 2000-06-01 6744. 282126 4.9 5.7 5651 2000 6
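If lubridate is not available, the same preparation can be sketched in base R with format. This is an alternative to the approach above, not a replacement for it; the names econ and econ2000_base are illustrative, and the resulting columns are integers where the lubridate version shows doubles.

```r
library(ggplot2)  # only needed here for the economics data

econ <- economics  # work on a copy; econ is an illustrative name

# format() extracts date parts as text, as.integer() makes them numeric
econ$year  <- as.integer(format(econ$date, "%Y"))
econ$month <- as.integer(format(econ$date, "%m"))

# Logical indexing keeps rows from 2000 onward
econ2000_base <- econ[econ$year >= 2000, ]
head(econ2000_base)
```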
Now let’s create line plots depicting multiple years as follows. The first two lines of the code block create the line graph with a separate line and color for each year.
Notice that we converted year to a factor so that it would get a discrete color scale, and then the scale was named using scale_color_discrete(name="Year"). Lastly, the title, x-label and y-label were set with labs.
All these pieces put together build a professional-looking, publication-quality graph, as below.
g <- ggplot(econ2000,aes(x=month, y=pop))
g <- g + geom_line(aes(color=factor(year), group=year))
g <- g + scale_color_discrete(name="Year")
g <- g + labs(title="Population Growth", x="Month",y="Population")
g
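One possible refinement, sketched here as an assumption about what a reader might want and not part of the original plot, is to label the x-axis with month names instead of numbers. The block rebuilds the year and month columns with base R so that it runs on its own; the names econ and p are illustrative.

```r
library(ggplot2)

# Rebuild the year and month columns with base R so the block stands alone
econ <- economics
econ$year  <- as.integer(format(econ$date, "%Y"))
econ$month <- as.integer(format(econ$date, "%m"))
econ2000 <- econ[econ$year >= 2000, ]

p <- ggplot(econ2000, aes(x = month, y = pop)) +
  geom_line(aes(color = factor(year), group = year)) +
  scale_color_discrete(name = "Year") +
  # month.abb is R's built-in vector "Jan", "Feb", ..., "Dec"
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  labs(title = "Population Growth", x = "Month", y = "Population")
p
```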
A great strength of ggplot2 is the ability to use themes to easily change the way plots look.
Building a theme from scratch can be a daunting task, but the ggthemes package has put together themes that recreate commonly used styles of graphs.
The following are a few styles: The Economist, Excel, Edward Tufte and The Wall Street Journal.
require(ggthemes)
## Loading required package: ggthemes
g2 <- ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=color))
## Let's apply a few themes
p1 <- g2 + theme_economist() + scale_color_economist()
p2 <- g2 + theme_excel() + scale_color_excel()
p3 <- g2 + theme_tufte()
p4 <- g2 + theme_wsj()
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
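ggplot2 itself also ships a handful of complete themes, so a different look is possible without any extra package. The sketch below shows two of them plus a single-element tweak with theme; the names p_bw, p_minimal and p_custom are illustrative.

```r
library(ggplot2)

g2 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = color))

# Complete themes shipped with ggplot2 itself
p_bw      <- g2 + theme_bw()
p_minimal <- g2 + theme_minimal()

# theme() tweaks individual elements on top of a complete theme
p_custom <- g2 + theme_bw() + theme(legend.position = "bottom")
p_custom
```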
In this exercise we have seen both base graphics and ggplot2 graphs, the latter being both nicer and easier to create.
We have covered histograms, scatterplots, boxplots, violin plots, line plots and density graphs.
We have also looked at using colors and small multiples for distinguishing data. This is just a humble introduction to ggplot2 and base plotting in R. There are many other features in ggplot2 such as jittering, stacking, dodging and alpha.
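As a brief taste of two of those features, the sketch below uses alpha and jittering to cope with overlapping points. The specific values (0.1 and 0.2) are arbitrary illustrative choices.

```r
library(ggplot2)

# alpha makes points semi-transparent so dense regions appear darker
p_alpha <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

# geom_jitter adds small random noise to spread out overlapping points
p_jitter <- ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.1)
p_jitter
```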