The dplyr package is a tool kit with great versatility, especially when it comes to data manipulation. Here, we are going to explore some basic ways to use dplyr for working with data and creating graphics. In particular, we are going to look at select(), group_by, summarize(), mutate(), and arrange().

You can learn more about any of these by typing ?name in your R console or editor. For example, ?select brings up help on the topic select().

As usual, you can find the code for this at GitHub

We start by loading the packages we need to do our work. We are looking at Major League Baseball statistics, so we need the Lahman package.

library(dplyr)
library(Lahman)
library(stringr)
library(ggplot2)
library(forcats)

We create a subset of the Batting data for players in the 2014 season. We want to get an idea how large our new data subset is, so we count() it.

Bat14 <- Batting %>% filter(yearID == "2014")
Bat14 %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1435

It’s always useful to get an overview of the data we are working with, so let’s look at the data structure. The str() command allows us to see all the variables (columns in the data frame) and some of the observations.

str(Bat14)
## 'data.frame':    1435 obs. of  22 variables:
##  $ playerID: chr  "abadfe01" "abreubo01" "abreujo02" "abreuto01" ...
##  $ yearID  : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ stint   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ teamID  : Factor w/ 149 levels "ALT","ANA","ARI",..: 96 94 33 117 93 79 116 51 45 66 ...
##  $ lgID    : Factor w/ 7 levels "AA","AL","FL",..: 2 5 2 5 2 2 2 5 2 2 ...
##  $ G       : int  69 78 145 3 10 7 143 7 6 6 ...
##  $ AB      : int  0 133 556 4 0 0 502 15 0 3 ...
##  $ R       : int  0 12 80 0 0 0 64 1 0 1 ...
##  $ H       : int  0 33 176 0 0 0 123 1 0 0 ...
##  $ X2B     : int  0 9 35 0 0 0 27 0 0 0 ...
##  $ X3B     : int  0 0 2 0 0 0 4 0 0 0 ...
##  $ HR      : int  0 1 36 0 0 0 14 0 0 0 ...
##  $ RBI     : int  0 14 107 0 0 0 65 0 0 0 ...
##  $ SB      : int  0 1 3 0 0 0 8 0 0 0 ...
##  $ CS      : int  0 0 1 0 0 0 4 0 0 0 ...
##  $ BB      : int  0 20 51 0 0 0 32 0 0 0 ...
##  $ SO      : int  0 21 131 0 0 0 90 5 0 2 ...
##  $ IBB     : int  0 0 15 0 0 0 1 0 0 0 ...
##  $ HBP     : int  0 0 11 0 0 0 3 0 0 0 ...
##  $ SH      : int  0 0 0 0 0 0 3 0 0 0 ...
##  $ SF      : int  0 2 4 0 0 0 2 0 0 0 ...
##  $ GIDP    : int  0 3 14 1 0 0 10 1 0 0 ...

The first dplyr function we want to work with is select(). This allows us to isolate particular columns for our use. We add a head() command to avoid producing a list of every row and observation under that column variable.

Bat14 %>% select(teamID) %>% head(10)
##    teamID
## 1     OAK
## 2     NYN
## 3     CHA
## 4     SFN
## 5     NYA
## 6     MIN
## 7     SEA
## 8     COL
## 9     CLE
## 10    KCA

We can specify any combination of variables we like with select().

Bat14 %>% select(teamID, lgID) %>% head(10)
##    teamID lgID
## 1     OAK   AL
## 2     NYN   NL
## 3     CHA   AL
## 4     SFN   NL
## 5     NYA   AL
## 6     MIN   AL
## 7     SEA   AL
## 8     COL   NL
## 9     CLE   AL
## 10    KCA   AL

We can also exclude specific columns if we wish.

Bat14 %>% select(-teamID, -lgID, -yearID, -stint) %>% head(10)
##     playerID   G  AB  R   H X2B X3B HR RBI SB CS BB  SO IBB HBP SH SF GIDP
## 1   abadfe01  69   0  0   0   0   0  0   0  0  0  0   0   0   0  0  0    0
## 2  abreubo01  78 133 12  33   9   0  1  14  1  0 20  21   0   0  0  2    3
## 3  abreujo02 145 556 80 176  35   2 36 107  3  1 51 131  15  11  0  4   14
## 4  abreuto01   3   4  0   0   0   0  0   0  0  0  0   0   0   0  0  0    1
## 5  aceveal01  10   0  0   0   0   0  0   0  0  0  0   0   0   0  0  0    0
## 6  achteaj01   7   0  0   0   0   0  0   0  0  0  0   0   0   0  0  0    0
## 7  ackledu01 143 502 64 123  27   4 14  65  8  4 32  90   1   3  3  2   10
## 8  adamecr01   7  15  1   1   0   0  0   0  0  0  0   5   0   0  0  0    1
## 9  adamsau01   6   0  0   0   0   0  0   0  0  0  0   0   0   0  0  0    0
## 10 adamsla01   6   3  1   0   0   0  0   0  0  0  0   2   0   0  0  0    0

Here is another way to achieve the same result.

Bat14 %>% select(c(-(2:5))) %>% head(10)
##     playerID   G  AB  R   H X2B X3B HR RBI SB CS BB  SO IBB HBP SH SF GIDP
## 1   abadfe01  69   0  0   0   0   0  0   0  0  0  0   0   0   0  0  0    0
## 2  abreubo01  78 133 12  33   9   0  1  14  1  0 20  21   0   0  0  2    3
## 3  abreujo02 145 556 80 176  35   2 36 107  3  1 51 131  15  11  0  4   14
## 4  abreuto01   3   4  0   0   0   0  0   0  0  0  0   0   0   0  0  0    1
## 5  aceveal01  10   0  0   0   0   0  0   0  0  0  0   0   0   0  0  0    0
## 6  achteaj01   7   0  0   0   0   0  0   0  0  0  0   0   0   0  0  0    0
## 7  ackledu01 143 502 64 123  27   4 14  65  8  4 32  90   1   3  3  2   10
## 8  adamecr01   7  15  1   1   0   0  0   0  0  0  0   5   0   0  0  0    1
## 9  adamsau01   6   0  0   0   0   0  0   0  0  0  0   0   0   0  0  0    0
## 10 adamsla01   6   3  1   0   0   0  0   0  0  0  0   2   0   0  0  0    0

Either way works just fine. It’s a judgement call. Use the method you are more comfortable with, and which other users might find easier to understand. For example, the first method lets users know what variables are missing, but it is a little less elegant code. The second method is more elegant than the first, but the changes we made aren’t evident. It’s a good idea to add a note to code to help explain things.

Now we turn to group_by which allows us to roll-up the data into chosen groups. Here, we look at the two leagues and how many observations fall under each.

Bat14 %>% group_by(lgID) %>% count()
## # A tibble: 2 x 2
## # Groups:   lgID [2]
##     lgID     n
##   <fctr> <int>
## 1     AL   746
## 2     NL   689

Likewise, we can group by team.

Bat14 %>% group_by(teamID) %>% count()
## # A tibble: 30 x 2
## # Groups:   teamID [30]
##    teamID     n
##    <fctr> <int>
##  1    ARI    52
##  2    ATL    39
##  3    BAL    44
##  4    BOS    55
##  5    CHA    44
##  6    CHN    48
##  7    CIN    45
##  8    CLE    46
##  9    COL    49
## 10    DET    48
## # ... with 20 more rows

We can also use group_by to arrange by different groups. Here, we are telling R to count observations by team and sort them by league. This is the kind of thing a pivot table does in Excel or Google Sheets.

Bat14 %>% group_by(lgID, teamID) %>% count()
## # A tibble: 30 x 3
## # Groups:   lgID, teamID [30]
##      lgID teamID     n
##    <fctr> <fctr> <int>
##  1     AL    BAL    44
##  2     AL    BOS    55
##  3     AL    CHA    44
##  4     AL    CLE    46
##  5     AL    DET    48
##  6     AL    HOU    48
##  7     AL    KCA    48
##  8     AL    LAA    55
##  9     AL    MIN    48
## 10     AL    NYA    58
## # ... with 20 more rows

We can call in any of the variables (columns) we want from our data frame. We will be calculating Runs per Game soon, so we pull in the columns we need to work with. Note that now the observations are by each player because that’s how the original data is structured.

Bat14 %>% group_by(teamID, lgID, R, G) %>% count()
## # A tibble: 1,279 x 5
## # Groups:   teamID, lgID, R, G [1,279]
##    teamID   lgID     R     G     n
##    <fctr> <fctr> <int> <int> <int>
##  1    ARI     NL     0     2     2
##  2    ARI     NL     0     3     3
##  3    ARI     NL     0     4     1
##  4    ARI     NL     0     6     2
##  5    ARI     NL     0     7     1
##  6    ARI     NL     0    18     2
##  7    ARI     NL     0    22     1
##  8    ARI     NL     0    25     1
##  9    ARI     NL     0    29     1
## 10    ARI     NL     0    32     1
## # ... with 1,269 more rows

If we want to aggregate the data by team (as below) or by another factor, we can use the summarize() command. We can run any math operation through this. Here, we want to know the total number of walks (base on balls) for each team.

Bat14 %>% group_by(teamID) %>% summarize(Total_BB = sum(BB))
## # A tibble: 30 x 2
##    teamID Total_BB
##    <fctr>    <int>
##  1    ARI      398
##  2    ATL      472
##  3    BAL      401
##  4    BOS      535
##  5    CHA      417
##  6    CHN      442
##  7    CIN      415
##  8    CLE      504
##  9    COL      397
## 10    DET      443
## # ... with 20 more rows

We can do the same for any variable. In this case, we aggregate hits.

Bat14 %>% group_by(teamID) %>% summarize(Total_H = sum(H))
## # A tibble: 30 x 2
##    teamID Total_H
##    <fctr>   <int>
##  1    ARI    1379
##  2    ATL    1316
##  3    BAL    1434
##  4    BOS    1355
##  5    CHA    1400
##  6    CHN    1315
##  7    CIN    1282
##  8    CLE    1411
##  9    COL    1551
## 10    DET    1557
## # ... with 20 more rows

Notice that R gives us results in alphabetical order. We can change that by using the arrange() command. Here, we create two objects we will use for graphing in a bit. They are Runs and RpG,

Runs <- Bat14 %>% group_by(teamID) %>% summarize(Runs = sum(R)) %>% arrange(desc(Runs))
RpG <- Bat14 %>% group_by(teamID) %>% summarize(RpG = sum(R) / 162) %>% arrange(desc(RpG))

Time to graph!

We want to display the number of Runs per Game each team produced during the 2014 season. We also want to plot them in order of size. To do this, we specify in the aes() command the reorder() subcommand. We use -RpG to indicated sort in descending order.

If you want to build this graph layer by layer to see how ggplot works, start with the code through geom_col() then build it up from there.

ggplot(RpG, aes(x = reorder(teamID, -RpG), y = RpG)) + geom_col() + aes(fill = teamID) +
  labs(title = "Runs per Game by Team, 2014 Season", x = "MLB Team", y = "R/G", caption = "from Lahman data base") + guides(fill = FALSE)

It is redundant to consider Total Runs and Runs per Game because RpG is just Total Runs divided by the number of games played, which is a constant 162 in baseball. But it is another way to look at the same data. In this case, we are sorting in ascending order.

ggplot(Runs, aes(x = reorder(teamID, Runs), y = Runs)) + geom_col() + aes(fill = teamID) +
  labs(title = "Total Runs by Team, 2014 Season", x = "MLB Team", y = "Total Runs", caption = "from Lahman data base") + guides(fill = FALSE)

There are different ways to accomplish the same ends in programming. There is no one right way that is better than the others. Every programmer is different. All we are trying to do is improve on how we write code.

So here is an alternate approach to producing the exact same graphs as above. We will limit ourselves to reproducing the RpG chart. The point is, there may be a million ways to do the same thing, so find the one you are most comfortable with which you think is also easy for others to use.

We create a new object, AltRpG.

AltRpG <- Bat14 %>% group_by(teamID) %>% summarize(Runs = sum(R)) %>% mutate("Runs/Game" = Runs / 162) %>% arrange(desc(Runs))

Then we graph it. Notice it is identical to the Runs per Game chart we created earlier.

ggplot(AltRpG, aes(x = reorder(teamID, -`Runs/Game`), y = `Runs/Game`)) + geom_col() + aes(fill = teamID) +
  labs(title = "Runs per Game by Team, 2014 Season", x = "MLB Team", y = "R/G", caption = "from Lahman data base") + guides(fill = FALSE)

The datascience+ site is a helpful source for lessons on using R. Some ideas for this RPub came from that site.