When building a data visualization in R using the ggplot2 package, you construct any graphic by stacking “layers.” We discussed the different layers in a previous RPubs post. There are many good tutorials available on the web for learning ggplot2, but one I found very helpful was put together by Selva Prabhakaran. You can find it at r-statistics.co.

What we are going to do here is create some visualizations, building them up layer upon layer, using the Lahman baseball database. We are going to look at the relationships between home runs, at bats, and other variables to illustrate the layering approach.

As usual, you can find the complete code on GitHub.

To get started, we need to load the necessary package libraries.

library(Lahman)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Then we create a dataframe to work with. We call ours Bat14 because we are looking at offensive statistics for the 2014 season.

You need three layers to create a complete plot. True, it will be basic and unadorned, but that is where we begin. In this example, we use only two layer arguments – data and aes – which lay out the coordinates but nothing else.

Bat14 <- Batting %>% filter(yearID == 2014)
ggplot(data = Bat14, aes(x = AB, y = HR))

To include the data points, we need to specify the geometry of the data. So we add a third layer, a geom_ function. In this case, we want a simple scatterplot, so we use geom_point(). Each point represents one row of Bat14: a player’s batting line for one stint with one team in the 2014 season, including players who hit no home runs.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point()

There you have it, a basic x-y scatterplot of home runs as a function of at bats.
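
One caveat about the points: the Batting table has a stint column, and a player who changed teams during the season has one row per stint, so he appears as more than one point. If one point per player is preferred, the stints can be summed first. A minimal sketch on a made-up data frame (the players and numbers are hypothetical, not Lahman data; only the column names mirror Batting):

```r
library(dplyr)

# Hypothetical rows shaped like Lahman's Batting table.
toy <- data.frame(
        playerID = c("playerA", "playerA", "playerB"),
        stint    = c(1, 2, 1),
        AB       = c(200, 150, 500),
        HR       = c(5, 4, 20)
)

# Collapse stints so each player contributes a single row.
per_player <- toy %>%
        group_by(playerID) %>%
        summarize(AB = sum(AB), HR = sum(HR))
```

With the real data, the same group_by() and summarize() steps could be applied to Bat14 before plotting.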

We could quit here, but instead we will explore what we can do with more layers, or by playing around with the basic layers we are using.

Here, we are going to draw a regression line through our data points to see what kind of relationship might exist. For this, we use another geom_ function, geom_smooth(), and we tell R to use the linear model method (lm). What results is our regression line with a confidence interval built around it (the light gray area surrounding the blue line).

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point() +
        geom_smooth(method = lm)
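
By default, geom_smooth(method = lm) fits the formula y ~ x, which is the same model as a direct call to lm(). A small sketch with made-up numbers (the toy data frame is hypothetical) shows how to inspect the slope and intercept behind such a line:

```r
# Hypothetical at-bat and home run totals for five players.
toy <- data.frame(
        AB = c(100, 200, 300, 400, 500),
        HR = c(2, 6, 9, 15, 18)
)

# The same straight-line model that geom_smooth(method = lm) draws.
fit <- lm(HR ~ AB, data = toy)
coefs <- coef(fit)
coefs
```

On the real data, lm(HR ~ AB, data = Bat14) would give the coefficients for the line in the chart.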

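With our basic elements in place, we can begin to customize the generic R output to fit our needs. First, we want to change the scale of the x- and y-axes. We do that by adding another layer, the xlim() and ylim() functions. These drop anything that falls outside the limits, and ggplot2 warns us when that happens. Here the warning names geom_smooth, which suggests that part of the fitted line, rather than any player, falls outside the y range.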

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point() +
        geom_smooth(method = lm) +
        xlim(c(0, 700)) +
        ylim(c(0, 50))
## Warning: Removed 2 rows containing missing values (geom_smooth).

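As is often the case in R, there is more than one way to get a similar result. Here we pass xlim and ylim to coord_cartesian() instead. The difference is that coord_cartesian() zooms in on the plot without dropping anything, so the regression is still fit to all of the data and no warning appears.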

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point() +
        geom_smooth(method = lm) +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50))
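
The two approaches differ in what happens to values outside the limits: xlim() and ylim() turn them into missing values before anything is drawn, while coord_cartesian() only zooms the view. A minimal sketch on a made-up data frame (hypothetical values), using layer_data() to look at what each plot will actually draw:

```r
library(ggplot2)

# Hypothetical data: x runs from 1 to 10.
toy <- data.frame(x = 1:10, y = 1:10)

p_limits <- ggplot(toy, aes(x = x, y = y)) +
        geom_point() +
        xlim(c(0, 5))

p_zoom <- ggplot(toy, aes(x = x, y = y)) +
        geom_point() +
        coord_cartesian(xlim = c(0, 5))

# Count the points that were censored to NA in each version.
dropped_limits <- sum(is.na(layer_data(p_limits)$x))
dropped_zoom   <- sum(is.na(layer_data(p_zoom)$x))
```

Because a smoother is fit only to the values that survive, the two versions can produce different regression lines whenever real data points fall outside the limits.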

Now we can add a title, a subtitle, axis labels, and a caption. We do this in ggplot2 with the labs() function, another layer. You can see the results below.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point() +
        geom_smooth(method = lm) +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Players Home Runs and At-Bats", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season")

Now it’s starting to look like a finished product. But we can push this further using more arguments in our existing layers.

Here, we are going to introduce color to our data points. We can also control the size of the data point we use, even the shape. We are going to color our regression line, as well.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(col = "blue", size = 2) +
        geom_smooth(method = lm, col = "green") +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Players Home Runs and At-Bats", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season")

We can tell the system to select colors for us, but we need to tell it what category to use. Here, we color our data points by the league each player represents, the National League or the American League. For this, we map color to the lgID column inside aes(). R chooses the colors from its default palette, but there are ways to specify colors. For the present, we will let R drive color selection.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(aes(col = lgID), size = 2) +
        geom_smooth(method = lm, col = "green") +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Players Home Runs vs At-Bats by League", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season")

So now we have introduced another variable to our visualization, players by league. We can also represent players on each team in the two leagues. We do this using teamID instead of lgID. At first, this might sound like a good idea…

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(aes(col = teamID), size = 2) +
        geom_smooth(method = lm, col = "green") +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Home Runs vs At-Bats by Team", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season")

Once we see the results, it’s information overload. There is too much information here, and it’s not well-structured, so the chart makes little sense. Breaking players out by their teams doesn’t add a lot to our understanding. Worse, the presentation actually diminishes understanding because it’s too chaotic.

So, let’s rethink things. How about presenting team statistics instead? You know, aggregate the data we are working with by team?

To do this, we create an object MostHR14 from our dataframe Bat14. We ask R to group the data by teamID, then add up the number of Home Runs and At Bats for each team. We pass na.rm = TRUE to sum() so that any NA values are ignored rather than turning the whole total into NA. We also use the pipe operator (%>%) to simplify the coding task. The pipe is a powerful innovation in R that helps make code more readable and easier to write.

MostHR14 <- Bat14 %>% group_by(teamID) %>% summarize(hr = sum(HR, na.rm = TRUE), ab = sum(AB, na.rm = TRUE))
View(MostHR14)
ggplot(data = MostHR14, aes(x = ab, y = hr)) +
        geom_point(aes(col = teamID), size = 2) +
        geom_smooth(method = lm, col = "green") +
        coord_cartesian(xlim = c(5200,5700), ylim = c(0, 220 )) +
        labs(title = "MLB Home Runs by Team, 2014", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs")

That’s a big improvement over our previous effort.
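
A detail worth checking in any summarize() call is where na.rm goes. It is an argument of sum(); placed at the summarize() level, it is treated as a new column instead. A minimal sketch on a made-up data frame (team names and numbers are hypothetical):

```r
library(dplyr)

toy <- data.frame(
        teamID = c("AAA", "AAA", "BBB"),
        HR     = c(10, NA, 7)
)

# na.rm inside sum(): NA values are skipped when adding.
inside <- toy %>%
        group_by(teamID) %>%
        summarize(hr = sum(HR, na.rm = TRUE))

# na.rm as an argument of summarize(): it becomes a column named na.rm,
# and the NA still propagates through sum().
outside <- toy %>%
        group_by(teamID) %>%
        summarize(hr = sum(HR), na.rm = TRUE)
```

Comparing names(inside) with names(outside) makes the difference easy to spot.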

Let’s go back to an earlier graphic, about home runs by league. The following code recreates that visual, but notice the slight change to the code. We’ve added a scale_ layer, scale_x_continuous, to set the tick marks on the x-axis. We’ll see why in a minute.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(aes(col = lgID), size = 2) +
        geom_smooth(method = lm, col = "green") +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Home Runs vs At-Bats by League", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season") +
        scale_x_continuous(breaks = seq(0, 700, 100))

Using the scale_ layer allows us to reverse the x-axis. For this purpose, we use scale_x_reverse. We can do the same with the y-axis. It’s not really helpful for our baseball data, but it’s useful to know the trick.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(aes(col = lgID), size = 2) +
        geom_smooth(method = lm, col = "green") +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Home Runs vs At-Bats by League", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season") +
        scale_x_reverse(breaks = seq(0, 700, 100))

Beyond the basics, adding layers doesn’t really change the underlying data as much as it improves the presentation of our visual. Once again, here is our home runs by league chart. Notice we’ve added a call to theme_set() ahead of the plot. This allows us to choose a theme for our graphics from a set of available templates in ggplot2. There are eight to choose from. To read more, use the help feature by typing the command ?theme_bw.

Once we have set the theme with theme_set(), it applies to every plot that follows in the session, so we don’t need to worry about customizing it again. Note that theme_set() is called on its own line rather than added to the plot with +. The theme we have been using so far is the ggplot2 default, theme_gray(). Here we switch to Classic.

theme_set(theme_classic())
ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(aes(col = lgID), size = 2) +
        geom_smooth(method = lm, col = "green") +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Home Runs vs At-Bats by League", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season") +
        scale_x_continuous(breaks = seq(0, 700, 100))

But maybe we prefer the cleaner, lighter graphics of theme_bw().

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(aes(col = lgID), size = 2) +
        geom_smooth(method = lm, col = "green") +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Home Runs vs At-Bats by League", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season") +
        scale_x_continuous(breaks = seq(0, 700, 100)) +
        theme_bw()

In all these layers, it helps to play around a bit with the settings to get a good feel for what is possible and to understand the flexibility of choices available in ggplot2.

I think we have illustrated the idea of building up a graphic using the layering principles underlying the ggplot2 structure. We are going to simplify things a bit now by creating an object. Rather than rewriting the code or copying and pasting our lines each time, we will just refer to a single graphic object. First, create the object, Bat14HRplot.

Bat14HRplot <- ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(aes(col = lgID, size = H)) +
        geom_smooth(method = loess, col = "green") +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Home Runs vs At-Bats by League and Hits", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season") +
        scale_x_continuous(breaks = seq(0, 700, 100)) +
        theme_bw()

Notice that we changed the code a bit. We added a size argument inside aes() so that the size of each data point reflects the number of hits for that player, which introduces another data dimension to our graphic. We also switched the smoother from lm to loess, which fits a local curve instead of a straight line.

Next, call the object to generate the visual.

Bat14HRplot
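
Storing a plot in an object also makes it easy to branch: adding a layer with + returns a new plot and leaves the stored one untouched unless we assign the result back. A minimal sketch with a made-up data frame (the values and object names are hypothetical):

```r
library(ggplot2)

toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

base_plot <- ggplot(toy, aes(x = x, y = y)) +
        geom_point()

# Adding a layer creates a new object; base_plot itself is unchanged.
with_line <- base_plot + geom_smooth(method = lm, se = FALSE)

# Number of geom layers stored in each object.
n_base <- length(base_plot$layers)
n_line <- length(with_line$layers)
```

This is why the later steps assign the result back to Bat14HRplot whenever a change should be kept.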

This size selection adds something to our understanding, but it muddies up the graphic a bit. What’s important is it illustrates the size tool and demonstrates how it can be used to add information to our chart. It works pretty well at this stage, but we will see in a bit how it becomes a problem as we introduce additional layers.

The following chart uses intentional walks for the size argument. This is a better use because it conveys the information without muddying up the visual. Some data work better with this method than others, so our data selections and our decisions on how to present the data are important to our ability to represent what the numbers are saying.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
        geom_point(aes(col = lgID, size = IBB)) +
        geom_smooth(method = loess, col = "green", se = FALSE) +
        coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
        labs(title = "MLB Home Runs vs At-Bats by League and Intentional Walks", subtitle = "from Lahman Batting dataset", 
             x = "At-Bats", y = "Home Runs", caption = "for 2014 season") +
        scale_x_continuous(breaks = seq(0, 700, 100)) +
        theme_bw()

But back to our object, Bat14HRplot.

We saw how to use the default themes by adding the theme_ layer. We can also use the theme() layer to achieve a customized effect if we specify what we want to see in the different arguments. Again, type ?theme to open the help page listing the arguments (about 70 of them) available in ggplot2 for customizing the plot, legends, titles, fonts, colors, and so on.

First, the object.

Bat14HRplot <- Bat14HRplot + theme(plot.title = element_text(size = 12,
                                                         face = "bold",
                                                         color = "black",
                                                         hjust = 0.5,
                                                         lineheight = 1.2),
                               plot.subtitle = element_text(size = 10,
                                                            color = "black",
                                                            hjust = 0.5))

Then, the call.

Bat14HRplot

We included the labs layer as part of our Bat14HRplot object, but we can still modify it in the following manner. We are trying to label the legends on the right-hand side of the chart. First, the object.

Bat14HRplot <- Bat14HRplot + labs(color = "League", size = "Hits")

Then, the call.

Bat14HRplot

Get the hang of it?

Suppose we want to determine the color scheme. In our object, we have asked R to do it for us, but we can override that decision by adding a new layer. This one, scale_color_manual allows us to specify the colors used for our data points. We will also spell out the league names in the legend.

If the object already carries an explicit color scale, ggplot2 will print a message that we are replacing the existing scale, which is what we want.

Bat14HRplot <-  Bat14HRplot + scale_color_manual(name = "League",
                                             labels = c("American", "National"),
                                             values = c("blue", "red"))
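
One way to make the color mapping less dependent on ordering is to name the entries of values, so that each color is tied to a specific lgID value. A minimal sketch on a made-up data frame (hypothetical values), assuming the league codes are AL and NL as in the 2014 data:

```r
library(ggplot2)

toy <- data.frame(
        AB   = c(100, 300, 500),
        HR   = c(2, 9, 18),
        lgID = c("AL", "NL", "AL")
)

p <- ggplot(toy, aes(x = AB, y = HR)) +
        geom_point(aes(col = lgID), size = 2) +
        scale_color_manual(name = "League",
                           labels = c("American", "National"),
                           values = c(AL = "blue", NL = "red"))

# Colors assigned to each row after the scale is applied.
point_colors <- layer_data(p)$colour
```

With named values, swapping the order of the entries does not change which league gets which color.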

If we want to move the League legend into the first position, we can do that too with the guides layer.

Bat14HRplot <- Bat14HRplot + guides(color = guide_legend(order = 1),
                                size = guide_legend(order = 2))

After all that, our finished product looks like this:

Bat14HRplot

That’s pretty good. But it’s still only one chart. There is one more layer we need to know about. The facet_ layer allows us to break up our chart into a number of smaller charts.

First, let’s separate our data into two charts, one for the National League and one for the American League. We do this by adding the facet_wrap layer and identifying it with the lgID column in our dataframe. This will form a matrix of two charts.

Bat14HRplot + facet_wrap( ~ lgID )
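
The layout of the panels can be adjusted as well: facet_wrap() accepts nrow and ncol arguments. A minimal sketch on a made-up data frame (hypothetical values) that stacks the two league panels vertically:

```r
library(ggplot2)

toy <- data.frame(
        AB   = c(100, 300, 500, 200),
        HR   = c(2, 9, 18, 5),
        lgID = c("AL", "NL", "AL", "NL")
)

# One column of panels, so the leagues appear one above the other.
p <- ggplot(toy, aes(x = AB, y = HR)) +
        geom_point() +
        facet_wrap(~ lgID, ncol = 1)

# Number of panels the plot will draw.
n_panels <- nlevels(layer_data(p)$PANEL)
```

Applied to our object, this would be Bat14HRplot + facet_wrap(~ lgID, ncol = 1).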

That’s a useful trick. But we might be regretting our earlier decision of displaying the data points in a bubble chart form by hits.

This next chart is pretty neat and it illustrates how we can use the facet_wrap layer in different ways. But we are beginning to lose our data points in the complex matrix we create.

Bat14HRplot + facet_wrap( ~ teamID)

We will end with the facet_grid layer, which creates a more complex matrix for us. It is not an especially useful choice for our data, since each team plays in only one league and half of the panels are therefore empty, but it is good to see how it works. Try reversing the order of lgID and teamID in the argument and see what happens.

Bat14HRplot + facet_grid(lgID ~ teamID)