When building a data visualization in *R* using the *ggplot2* package, you construct any graphic by creating “layers.” We discussed the different layers in a previous *RPubs* post. There are many good tutorials available on the web for learning *ggplot2*, but one set I found very helpful was put together by Selva Prabhakaran. You can find it at the site *r-statistics.co*.

What we are going to do here is create some visualizations, building them up layer upon layer, using the Lahman baseball database. We are going to look at the relationships between home runs, at bats, and other variables to illustrate the layering approach.

As usual, you can find the complete code on GitHub.

To get started, we need to load the necessary package libraries.

```
library(Lahman)
library(ggplot2)
library(dplyr)
```

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

Then we create a dataframe to work with. We call ours *Bat14* because we are looking at offensive statistics for the 2014 season.

You need three layers to create a complete plot. True, it will be basic and unadorned, but that is where we begin. In this example, we use only two layer arguments, *data* and *aes*, which lay out the coordinates but nothing else.

```
# yearID is stored as a number, so compare against 2014 rather than "2014"
Bat14 <- Batting %>% filter(yearID == 2014)
ggplot(data = Bat14, aes(x = AB, y = HR))
```

To include the data points, we need to specify the geometry of the data. So we add a third layer, a *geom_* function. In this case, we want a simple scatterplot, so we use *geom_point()*. Each point represents one row of *Bat14*, that is, one player's batting line for one stint with a team in the 2014 season, whether or not he hit a home run.

```
ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point()
```

There you have it, a basic x-y scatterplot of home runs as a function of at bats.
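To get a feel for what the points are, a quick count can help. The sketch below assumes the *Batting* table keeps one row per player per stint (it has a *stint* column), so a player who changed teams during the season shows up more than once, and many rows have no home runs at all.

```r
library(Lahman)
library(dplyr)

# Same filter as in the text: all batting rows for the 2014 season
Bat14 <- Batting %>% filter(yearID == 2014)

nrow(Bat14)                       # number of points in the scatterplot
n_distinct(Bat14$playerID)        # number of distinct players behind those points
sum(Bat14$HR == 0, na.rm = TRUE)  # rows where the player hit no home runs
```

If the first number is larger than the second, the difference comes from players with more than one stint in 2014.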

We could quit here, but instead we will explore what we can do with more layers, or by playing around with the basic layers we are using.

Here, we are going to draw a regression line through our data points to see what kind of relationship might exist. For this, we use another *geom_* function, *geom_smooth()*, and we tell *R* to use the linear method. What results is our regression line with a confidence interval built around it (the light gray area surrounding the blue line).

```
ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm)
```
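The blue line is an ordinary linear regression of home runs on at bats. As a rough cross-check, the same model can be fitted directly with *lm()*; this sketch assumes the default *y ~ x* formula that *geom_smooth()* uses with the linear method, and the name *fit* is just for illustration.

```r
library(Lahman)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# Simple linear regression of home runs on at bats
fit <- lm(HR ~ AB, data = Bat14)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # share of the variation in HR explained by AB
```

If the gray band is distracting, *geom_smooth(method = lm, se = FALSE)* draws the line without it.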

With our basic elements in place, we can begin to customize the generic *R* output to fit our needs. First, we want to change the scale of the x- and y-axes. We do that by adding another layer, the *xlim()* and *ylim()* functions. Anything that falls outside the new limits is dropped, which triggers a warning. The warning below names *geom_smooth*, so in this case it is part of the fitted line or its confidence band that was cut off.

```
ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm) +
  xlim(c(0, 700)) +
  ylim(c(0, 50))
```

`## Warning: Removed 2 rows containing missing values (geom_smooth).`

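As is often the case in *R*, we can get a similar picture in a different way. Here we use a *coord_* function, *coord_cartesian()*, with *xlim* and *ylim* arguments. Unlike the *xlim()* and *ylim()* layers, it zooms the view without dropping anything, so no warning appears.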

```
ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm) +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50))
```
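The two approaches differ more visibly with tighter limits. Scale limits set with *xlim()* discard out-of-range rows before the regression is fitted, while *coord_cartesian()* fits on everything and then zooms. The sketch below uses an arbitrary cutoff of 300 at bats, chosen only for illustration, and mimics the two behaviors with *lm()* so the coefficients can be compared side by side.

```r
library(Lahman)
library(ggplot2)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

p <- ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm)

# Scale limits: rows with AB above 300 are removed before the line is fitted
p_scale <- p + xlim(0, 300)

# Coordinate limits: the line is fitted to all rows, then the view is zoomed
p_coord <- p + coord_cartesian(xlim = c(0, 300))

# The same contrast expressed directly with lm()
fit_cropped <- lm(HR ~ AB, data = filter(Bat14, AB <= 300))
fit_all     <- lm(HR ~ AB, data = Bat14)
rbind(cropped = coef(fit_cropped), all = coef(fit_all))
```

Printing *p_scale* and *p_coord* one after the other should show two different lines over the same range of at bats.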

Now we can add titles, axis labels, and a caption. We do this in *ggplot2* with the *labs()* function, another layer. You can see the results below.

```
ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point() +
  geom_smooth(method = lm) +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Players Home Runs and At-Bats", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")
```

Now it’s starting to look like a finished product. But we can push this further using more arguments in our existing layers.

Here, we are going to introduce color to our data points. We can also control the size of the data point we use, even the shape. We are going to color our regression line, as well.

```
ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(col = "blue", size = 2) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Players Home Runs and At-Bats", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")
```
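The blocks so far repeat the same layers each time. Because a *ggplot2* plot is an ordinary *R* object, one option is to store the shared layers once and add only what changes. This is a sketch of that idea; *base_plot*, *plain*, and *colored* are hypothetical names, not part of the original code.

```r
library(Lahman)
library(ggplot2)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# Layers shared by every version of the plot
base_plot <- ggplot(data = Bat14, aes(x = AB, y = HR)) +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Players Home Runs and At-Bats", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")

# Default styling
plain <- base_plot +
  geom_point() +
  geom_smooth(method = lm)

# Blue points and a green regression line
colored <- base_plot +
  geom_point(col = "blue", size = 2) +
  geom_smooth(method = lm, col = "green")

colored  # printing the object draws the plot
```

Spelling out every layer in each block, as the rest of this post does, keeps each example readable on its own; storing the base is mainly a convenience when experimenting.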

We can tell the system to select colors for us, but we need to tell it what category to use. Here, we color our data points by the league each player represents, the National League or the American League. For this, we use the *lgID* variable. *R* chooses the colors it uses from its default palettes, but there are ways to specify colors. For the present, we will let *R* drive color selection.

```
ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(aes(col = lgID), size = 2) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Players Home Runs vs At-Bats by League", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")
```
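For readers who do want to pick the colors, one common route is *scale_color_manual()*, which maps each category to a named color. The sketch below assumes the 2014 rows carry only the league codes *AL* and *NL*; the two colors are arbitrary choices for illustration.

```r
library(Lahman)
library(ggplot2)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# One color per league code; the names must match the values in lgID
league_colors <- c(AL = "firebrick", NL = "steelblue")

by_league <- ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(aes(col = lgID), size = 2) +
  scale_color_manual(values = league_colors) +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50))

by_league
```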

So now we have introduced another variable to our visualization, players by league. We can also represent players on each team in the two leagues. We do this using *teamID* instead of *lgID*. At first, this might sound like a good idea…

```
ggplot(data = Bat14, aes(x = AB, y = HR)) +
  geom_point(aes(col = teamID), size = 2) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(0, 700), ylim = c(0, 50)) +
  labs(title = "MLB Home Runs vs At-Bats by Team", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs", caption = "for 2014 season")
```

Once we see the results, it’s information overload. There is too much information here, and it’s not well-structured, so the chart makes little sense. Breaking players out by their teams doesn’t add a lot to our understanding. Worse, the presentation actually diminishes understanding because it’s too chaotic.

So, let’s rethink things. How about presenting team statistics instead? You know, aggregate the data we are working with by team?

To do this, we create an object *MostHR14* from our dataframe *Bat14*. We ask *R* to group the data by *teamID*, then add up the number of Home Runs and At Bats within each team. We pass *na.rm = TRUE* to each *sum()* call so that any NA values are ignored rather than turning the whole total into NA. We also use the pipe operator (*%>%*) to simplify the coding task. The pipe is a powerful innovation in *R* that helps make code more readable and easier to write.

```
# na.rm belongs inside sum(); as a separate argument to summarize()
# it would only create an extra column named na.rm
MostHR14 <- Bat14 %>%
  group_by(teamID) %>%
  summarize(hr = sum(HR, na.rm = TRUE), ab = sum(AB, na.rm = TRUE))
View(MostHR14)
ggplot(data = MostHR14, aes(x = ab, y = hr)) +
  geom_point(aes(col = teamID), size = 2) +
  geom_smooth(method = lm, col = "green") +
  coord_cartesian(xlim = c(5200, 5700), ylim = c(0, 220)) +
  labs(title = "MLB Home Runs by Team, 2014", subtitle = "from Lahman Batting dataset",
       x = "At-Bats", y = "Home Runs")
```
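After aggregating, it is worth a quick sanity check that nothing was lost along the way. The sketch below rebuilds the team totals, confirms there is one row per team (MLB had 30 teams in 2014), checks that the team totals add up to the league total, and lists the five teams with the most home runs.

```r
library(Lahman)
library(dplyr)

Bat14 <- Batting %>% filter(yearID == 2014)

# Team totals, with NA values ignored inside each sum
MostHR14 <- Bat14 %>%
  group_by(teamID) %>%
  summarize(hr = sum(HR, na.rm = TRUE), ab = sum(AB, na.rm = TRUE))

nrow(MostHR14)                                   # one row per team
sum(MostHR14$hr) == sum(Bat14$HR, na.rm = TRUE)  # team totals match the league total

# Five teams with the most home runs
MostHR14 %>% arrange(desc(hr)) %>% head(5)
```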