When you build a data visualization in R with the ggplot2 package, you construct every graphic out of “layers.” We discussed the different layers in a previous RPubs post. There are many good ggplot2 tutorials on the web, but one I found very helpful was put together by Selva Prabhakaran. You can find it at r-statistics.co.

What we are going to do here is create some visualizations, building them up layer upon layer, using the Lahman baseball database. We are going to look at the relationships between home runs, at-bats, and other variables to illustrate the layering approach.

As usual, you can find the complete code on GitHub.

To get started, we need to load the necessary package libraries.

library(Lahman)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

Then we create a data frame to work with. We call ours Bat14 because we are looking at offensive statistics for the 2014 season.

You need three layers to create a complete plot: the data, the aesthetic mappings, and a geometry. True, the result will be basic and unadorned, but that is where we begin. In this first example, we use only two of them, data and aes, which lay out the coordinates but draw nothing else.

Bat14 <- Batting %>% filter(yearID == 2014)  # yearID is stored as an integer
ggplot(data = Bat14, aes(x = AB, y = HR))
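Before plotting, it can help to look at what Bat14 actually contains. The sketch below is a quick check rather than part of the original walkthrough, and it assumes the Lahman and dplyr packages are installed. Batting holds one row per player per stint, so a player who changed teams during the season appears more than once.

```r
library(Lahman)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# Rows versus distinct players: the difference comes from multi-stint players
nrow(Bat14)
n_distinct(Bat14$playerID)

# How many rows record zero home runs (pitchers, bench players, and so on)
sum(Bat14$HR == 0, na.rm = TRUE)

# Distribution of at-bats, the variable on our x-axis
summary(Bat14$AB)
```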

To include the data points, we need to specify the geometry of the data. So we add a third layer, our geom_ function. In this case, we want a simple scatterplot, so we use geom_point(). Each point represents one row of Bat14, that is, one player's batting line for a team in the 2014 season, including players who hit no home runs.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point()
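The layering idea is easier to see if the plot is stored in an object and extended one step at a time. This is a minimal sketch of that workflow; the names base_plot and scatter are illustrative and do not appear in the original code.

```r
library(Lahman)
library(ggplot2)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# Data and aesthetic mappings only: this draws empty axes
base_plot <- ggplot(data = Bat14, aes(x = AB, y = HR))

# Adding a geometry with + returns a new plot object with the points included
scatter <- base_plot + geom_point()

# Printing the object draws it
print(scatter)
```

Because each step returns a plot object, any later layer can be added to scatter in the same way.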

There you have it, a basic x-y scatterplot of home runs as a function of at bats.

We could quit here, but instead we will explore what we can do with more layers, or by playing around with the basic layers we are using.

Here, we are going to draw a regression line through our data points to see what kind of relationship might exist. For this, we use another geom_ function, geom_smooth(), and we tell it to use the linear method, lm. What results is our regression line with a confidence interval built around it (the light gray area surrounding the blue line).

ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm)
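To see the numbers behind the blue line, the same straight-line model can be fit directly with lm(). This is a sketch for inspection only. It assumes geom_smooth() is left at its default formula, y ~ x, which corresponds to HR ~ AB here.

```r
library(Lahman)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# Simple linear regression of home runs on at-bats
fit <- lm(HR ~ AB, data = Bat14)

# Intercept and slope of the fitted line
coef(fit)

# Share of the variation in home runs explained by at-bats
summary(fit)$r.squared
```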

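With our basic elements in place, we can begin to customize the default output to fit our needs. First, we want to change the scale of the x- and y-axes. We do that by adding the xlim() and ylim() functions. These set the scale limits, and anything that falls outside them is dropped before drawing, so ggplot2 issues a warning. Here the warning names geom_smooth, which suggests that part of the fitted line or its confidence band falls outside the limits.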

ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm) +
  xlim(c(0, 700)) +
  ylim(c(0, 50))
## Warning: Removed 2 rows containing missing values (geom_smooth).

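As is often the case in R, there is more than one way to get a similar result. Here we use coord_cartesian() with its xlim and ylim arguments. Unlike the xlim() and ylim() functions, coord_cartesian() zooms the view without discarding any data, so the regression is still fit to every row.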

ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm) +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50))
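The two approaches can differ when real data points fall outside the limits. The sketch below uses a small made-up data frame (toy, not part of the Lahman data) and base R's lm() as a stand-in for what geom_smooth() is given in each case: scale limits remove out-of-range points before the fit, while coord_cartesian() leaves them in.

```r
library(ggplot2)

# Hypothetical data: the last five points lie well above y = 10
toy <- data.frame(x = 1:10, y = c(1, 2, 3, 4, 5, 20, 25, 30, 35, 40))

# Scale limits: points above 10 are dropped, so the fit sees only five rows
p_scale <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = lm) +
  ylim(0, 10)

# Coordinate zoom: all ten rows are kept for the fit, only the view is cropped
p_coord <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = lm) +
  coord_cartesian(ylim = c(0, 10))

# Base R equivalents of the two fits
slope_cropped <- coef(lm(y ~ x, data = subset(toy, y <= 10)))[["x"]]
slope_all <- coef(lm(y ~ x, data = toy))[["x"]]

slope_cropped
slope_all
```

Printing p_scale and p_coord side by side shows two different regression lines over the same visible region.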

Now we can add a title, a subtitle, axis labels, and a caption. We do this in ggplot2 with the labs() function, added as another layer. You can see the results below.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm) +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Players Home Runs and At-Bats", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")

Now it’s starting to look like a finished product. But we can push this further using more arguments in our existing layers.

Here, we are going to introduce color to our data points. We can also control the size of the data point we use, even the shape. We are going to color our regression line, as well.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(col = "blue", size = 2) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Players Home Runs and At-Bats", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")
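The code above sets color and size but not shape. As a small variation, the sketch below adds the shape argument to geom_point(). The value 17, a filled triangle, is an arbitrary choice for illustration, and p_shape is a name introduced here.

```r
library(Lahman)
library(ggplot2)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# shape takes a number (or name); 17 draws filled triangles instead of circles
p_shape <- ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(col = "blue", size = 2, shape = 17) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50))

print(p_shape)
```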

We can tell ggplot2 to select colors for us, but we need to tell it which variable to use. Here, we color our data points by the league each player represents, the National League or the American League. For this, we map color to the lgID variable inside aes(). ggplot2 picks the colors from its default palette, but there are ways to specify colors. For the present, we will let it drive color selection.

ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(aes(col = lgID), size = 2) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Players Home Runs vs At-Bats by League", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")
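For readers who do want to pick the colors themselves, one option is scale_color_manual(). This is a sketch of that approach, not something the original post uses. The two color names are arbitrary, and p_league is a name introduced here.

```r
library(Lahman)
library(ggplot2)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# Map each league code to a chosen color; names must match the lgID values
league_colors <- c(AL = "firebrick", NL = "steelblue")

p_league <- ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(aes(col = lgID), size = 2) +
  scale_color_manual(values = league_colors) +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50))

print(p_league)
```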

So now we have introduced another variable to our visualization, players by league. We can also represent players on each team in the two leagues. We do this by using teamID instead of lgID. At first, this might sound like a good idea…

ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(aes(col = teamID), size = 2) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Home Runs vs At-Bats by Team", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")
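A quick count shows how many categories the color scale is being asked to handle in this plot. The sketch below is only a check on the data, and the name n_teams is introduced here for illustration.

```r
library(Lahman)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# Number of distinct teams, each of which gets its own color and legend entry
n_teams <- n_distinct(Bat14$teamID)
n_teams

# Number of batting rows per team
Bat14 %>% count(teamID)
```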

Once we see the results, it’s information overload. Thirty team colors are too many to tell apart, and the points are not well-structured, so the chart makes little sense. Breaking players out by team doesn’t add much to our understanding. Worse, the chaotic presentation actually diminishes it.

So, let’s rethink things. How about presenting team statistics instead? You know, aggregate the data we are working with by team?

To do this, we create an object MostHR14 from our data frame Bat14. We ask R to group the data by teamID and then add up the home runs and at-bats for each team. We use na.rm = TRUE so that any NA values are ignored when summing. We also use the pipe operator (%>%) to chain the steps together. The pipe helps make code more readable and easier to write.

# na.rm = TRUE belongs inside sum(); passed to summarize() it would create a column named na.rm
MostHR14 <- Bat14 %>%
  group_by(teamID) %>%
  summarize(hr = sum(HR, na.rm = TRUE), ab = sum(AB, na.rm = TRUE))
View(MostHR14)  # opens the data viewer in an interactive session
ggplot(data = MostHR14, aes(x = ab, y = hr)) +
  geom_point(aes(col = teamID), size = 2) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(5200, 5700), ylim = c(0, 220)) +
  labs(title = "MLB Home Runs by Team, 2014", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs")
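Since the team totals now sit in a small data frame, they can also be read as a ranked table. The sketch below rebuilds MostHR14 so that it runs on its own, then sorts it with arrange(). The name top_teams is introduced here for illustration.

```r
library(Lahman)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# One row per team, with home runs and at-bats summed across its players
MostHR14 <- Bat14 %>%
  group_by(teamID) %>%
  summarize(hr = sum(HR, na.rm = TRUE), ab = sum(AB, na.rm = TRUE))

# Teams ordered from most to fewest home runs; show the top five
top_teams <- MostHR14 %>% arrange(desc(hr))
head(top_teams, 5)
```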