The dplyr package is a tool kit with great versatility, especially when it comes to data manipulation. Here, we are going to explore some basic ways to use dplyr for working with data and creating graphics. In particular, we are going to look at select(), group_by, summarize(), mutate(), and arrange().
You can learn more about any of these by typing ?name in your R console or editor. For example, ?select brings up help on the topic select().
As usual, you can find the code for this at GitHub
We start by loading the packages we need to do our work. We are looking at Major League Baseball statistics, so we need the Lahman package.
library(dplyr)
library(Lahman)
library(stringr)
library(ggplot2)
library(forcats)
We create a subset of the Batting data for players in the 2014 season. We want to get an idea how large our new data subset is, so we count() it.
Bat14 <- Batting %>% filter(yearID == "2014")
Bat14 %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 1435
It’s always useful to get an overview of the data we are working with, so let’s look at the data structure. The str() command allows us to see all the variables (columns in the data frame) and some of the observations.
str(Bat14)
## 'data.frame': 1435 obs. of 22 variables:
## $ playerID: chr "abadfe01" "abreubo01" "abreujo02" "abreuto01" ...
## $ yearID : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ stint : int 1 1 1 1 1 1 1 1 1 1 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 96 94 33 117 93 79 116 51 45 66 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 2 5 2 5 2 2 2 5 2 2 ...
## $ G : int 69 78 145 3 10 7 143 7 6 6 ...
## $ AB : int 0 133 556 4 0 0 502 15 0 3 ...
## $ R : int 0 12 80 0 0 0 64 1 0 1 ...
## $ H : int 0 33 176 0 0 0 123 1 0 0 ...
## $ X2B : int 0 9 35 0 0 0 27 0 0 0 ...
## $ X3B : int 0 0 2 0 0 0 4 0 0 0 ...
## $ HR : int 0 1 36 0 0 0 14 0 0 0 ...
## $ RBI : int 0 14 107 0 0 0 65 0 0 0 ...
## $ SB : int 0 1 3 0 0 0 8 0 0 0 ...
## $ CS : int 0 0 1 0 0 0 4 0 0 0 ...
## $ BB : int 0 20 51 0 0 0 32 0 0 0 ...
## $ SO : int 0 21 131 0 0 0 90 5 0 2 ...
## $ IBB : int 0 0 15 0 0 0 1 0 0 0 ...
## $ HBP : int 0 0 11 0 0 0 3 0 0 0 ...
## $ SH : int 0 0 0 0 0 0 3 0 0 0 ...
## $ SF : int 0 2 4 0 0 0 2 0 0 0 ...
## $ GIDP : int 0 3 14 1 0 0 10 1 0 0 ...
The first dplyr function we want to work with is select(). This allows us to isolate particular columns for our use. We add a head() command to avoid producing a list of every row and observation under that column variable.
Bat14 %>% select(teamID) %>% head(10)
## teamID
## 1 OAK
## 2 NYN
## 3 CHA
## 4 SFN
## 5 NYA
## 6 MIN
## 7 SEA
## 8 COL
## 9 CLE
## 10 KCA
We can specify any combination of variables we like with select().
Bat14 %>% select(teamID, lgID) %>% head(10)
## teamID lgID
## 1 OAK AL
## 2 NYN NL
## 3 CHA AL
## 4 SFN NL
## 5 NYA AL
## 6 MIN AL
## 7 SEA AL
## 8 COL NL
## 9 CLE AL
## 10 KCA AL
We can also exclude specific columns if we wish.
Bat14 %>% select(-teamID, -lgID, -yearID, -stint) %>% head(10)
## playerID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
## 1 abadfe01 69 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 abreubo01 78 133 12 33 9 0 1 14 1 0 20 21 0 0 0 2 3
## 3 abreujo02 145 556 80 176 35 2 36 107 3 1 51 131 15 11 0 4 14
## 4 abreuto01 3 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 5 aceveal01 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 achteaj01 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 ackledu01 143 502 64 123 27 4 14 65 8 4 32 90 1 3 3 2 10
## 8 adamecr01 7 15 1 1 0 0 0 0 0 0 0 5 0 0 0 0 1
## 9 adamsau01 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 10 adamsla01 6 3 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0
Here is another way to achieve the same result.
Bat14 %>% select(c(-(2:5))) %>% head(10)
## playerID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
## 1 abadfe01 69 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 abreubo01 78 133 12 33 9 0 1 14 1 0 20 21 0 0 0 2 3
## 3 abreujo02 145 556 80 176 35 2 36 107 3 1 51 131 15 11 0 4 14
## 4 abreuto01 3 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 5 aceveal01 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 achteaj01 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 ackledu01 143 502 64 123 27 4 14 65 8 4 32 90 1 3 3 2 10
## 8 adamecr01 7 15 1 1 0 0 0 0 0 0 0 5 0 0 0 0 1
## 9 adamsau01 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 10 adamsla01 6 3 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0
Either way works just fine. It’s a judgement call. Use the method you are more comfortable with, and which other users might find easier to understand. For example, the first method lets users know what variables are missing, but it is a little less elegant code. The second method is more elegant than the first, but the changes we made aren’t evident. It’s a good idea to add a note to code to help explain things.
Now we turn to group_by which allows us to roll-up the data into chosen groups. Here, we look at the two leagues and how many observations fall under each.
Bat14 %>% group_by(lgID) %>% count()
## # A tibble: 2 x 2
## # Groups: lgID [2]
## lgID n
## <fctr> <int>
## 1 AL 746
## 2 NL 689
Likewise, we can group by team.
Bat14 %>% group_by(teamID) %>% count()
## # A tibble: 30 x 2
## # Groups: teamID [30]
## teamID n
## <fctr> <int>
## 1 ARI 52
## 2 ATL 39
## 3 BAL 44
## 4 BOS 55
## 5 CHA 44
## 6 CHN 48
## 7 CIN 45
## 8 CLE 46
## 9 COL 49
## 10 DET 48
## # ... with 20 more rows
We can also use group_by to arrange by different groups. Here, we are telling R to count observations by team and sort them by league. This is the kind of thing a pivot table does in Excel or Google Sheets.
Bat14 %>% group_by(lgID, teamID) %>% count()
## # A tibble: 30 x 3
## # Groups: lgID, teamID [30]
## lgID teamID n
## <fctr> <fctr> <int>
## 1 AL BAL 44
## 2 AL BOS 55
## 3 AL CHA 44
## 4 AL CLE 46
## 5 AL DET 48
## 6 AL HOU 48
## 7 AL KCA 48
## 8 AL LAA 55
## 9 AL MIN 48
## 10 AL NYA 58
## # ... with 20 more rows
We can call in any of the variables (columns) we want from our data frame. We will be calculating Runs per Game soon, so we pull in the columns we need to work with. Note that now the observations are by each player because that’s how the original data is structured.
Bat14 %>% group_by(teamID, lgID, R, G) %>% count()
## # A tibble: 1,279 x 5
## # Groups: teamID, lgID, R, G [1,279]
## teamID lgID R G n
## <fctr> <fctr> <int> <int> <int>
## 1 ARI NL 0 2 2
## 2 ARI NL 0 3 3
## 3 ARI NL 0 4 1
## 4 ARI NL 0 6 2
## 5 ARI NL 0 7 1
## 6 ARI NL 0 18 2
## 7 ARI NL 0 22 1
## 8 ARI NL 0 25 1
## 9 ARI NL 0 29 1
## 10 ARI NL 0 32 1
## # ... with 1,269 more rows
If we want to aggregate the data by team (as below) or by another factor, we can use the summarize() command. We can run any math operation through this. Here, we want to know the total number of walks (base on balls) for each team.
Bat14 %>% group_by(teamID) %>% summarize(Total_BB = sum(BB))
## # A tibble: 30 x 2
## teamID Total_BB
## <fctr> <int>
## 1 ARI 398
## 2 ATL 472
## 3 BAL 401
## 4 BOS 535
## 5 CHA 417
## 6 CHN 442
## 7 CIN 415
## 8 CLE 504
## 9 COL 397
## 10 DET 443
## # ... with 20 more rows
We can do the same for any variable. In this case, we aggregate hits.
Bat14 %>% group_by(teamID) %>% summarize(Total_H = sum(H))
## # A tibble: 30 x 2
## teamID Total_H
## <fctr> <int>
## 1 ARI 1379
## 2 ATL 1316
## 3 BAL 1434
## 4 BOS 1355
## 5 CHA 1400
## 6 CHN 1315
## 7 CIN 1282
## 8 CLE 1411
## 9 COL 1551
## 10 DET 1557
## # ... with 20 more rows
Notice that R gives us results in alphabetical order. We can change that by using the arrange() command. Here, we create two objects we will use for graphing in a bit. They are Runs and RpG,
Runs <- Bat14 %>% group_by(teamID) %>% summarize(Runs = sum(R)) %>% arrange(desc(Runs))
RpG <- Bat14 %>% group_by(teamID) %>% summarize(RpG = sum(R) / 162) %>% arrange(desc(RpG))
Time to graph!
We want to display the number of Runs per Game each team produced during the 2014 season. We also want to plot them in order of size. To do this, we specify in the aes() command the reorder() subcommand. We use -RpG to indicated sort in descending order.
If you want to build this graph layer by layer to see how ggplot works, start with the code through geom_col() then build it up from there.
ggplot(RpG, aes(x = reorder(teamID, -RpG), y = RpG)) + geom_col() + aes(fill = teamID) +
labs(title = "Runs per Game by Team, 2014 Season", x = "MLB Team", y = "R/G", caption = "from Lahman data base") + guides(fill = FALSE)
It is redundant to consider Total Runs and Runs per Game because RpG is just Total Runs divided by the number of games played, which is a constant 162 in baseball. But it is another way to look at the same data. In this case, we are sorting in ascending order.
ggplot(Runs, aes(x = reorder(teamID, Runs), y = Runs)) + geom_col() + aes(fill = teamID) +
labs(title = "Total Runs by Team, 2014 Season", x = "MLB Team", y = "Total Runs", caption = "from Lahman data base") + guides(fill = FALSE)
There are different ways to accomplish the same ends in programming. There is no one right way that is better than the others. Every programmer is different. All we are trying to do is improve on how we write code.
So here is an alternate approach to producing the exact same graphs as above. We will limit ourselves to reproducing the RpG chart. The point is, there may be a million ways to do the same thing, so find the one you are most comfortable with which you think is also easy for others to use.
We create a new object, AltRpG.
AltRpG <- Bat14 %>% group_by(teamID) %>% summarize(Runs = sum(R)) %>% mutate("Runs/Game" = Runs / 162) %>% arrange(desc(Runs))
Then we graph it. Notice it is identical to the Runs per Game chart we created earlier.
ggplot(AltRpG, aes(x = reorder(teamID, -`Runs/Game`), y = `Runs/Game`)) + geom_col() + aes(fill = teamID) +
labs(title = "Runs per Game by Team, 2014 Season", x = "MLB Team", y = "R/G", caption = "from Lahman data base") + guides(fill = FALSE)
The datascience+ site is a helpful source for lessons on using R. Some ideas for this RPub came from that site.