I just learned how to use dplyr this morning, so I put together this document that covers some dplyr basics.

background info

To illustrate some dplyr basics, we will use data on mountain dusky salamanders. Scientists at the University of Chicago collected mountain dusky salamanders from two locations. Salamanders from one location are called Rough Butt salamanders, and salamanders from the other location are called White Side salamanders. Rough Butt and White Side salamanders are the same species, but the two populations haven’t interacted or mated for a while. The scientists wanted to know whether salamanders prefer mating with salamanders from their own location.

setting things up

The salamander mating data set is in the glmm package. The data were published by McCullagh and Nelder in their book Generalized Linear Models (1989, Section 14.5).

library(glmm)
## Warning: package 'glmm' was built under R version 3.4.2
## Loading required package: trust
## Loading required package: mvtnorm
## Loading required package: Matrix
data(salamander)

Let’s learn a little bit about the data before we start working with it. First of all, we have four variables. The names of these variables are in the R output below.

names(salamander)
## [1] "Mate"   "Cross"  "Female" "Male"

The Mate variable is 1 if the pair of salamanders mated and 0 if they didn’t. The Cross variable specifies the types of salamanders in the trial: the first letter is for the female salamander and the second is for the male salamander. For example, R/W indicates that the scientists put a female Rough Butt with a male White Side. Finally, the Female and Male variables give the identification numbers of the female and male salamanders in the trial.

str(salamander)
## 'data.frame':    360 obs. of  4 variables:
##  $ Mate  : num  1 1 1 1 1 1 0 0 0 0 ...
##  $ Cross : Factor w/ 4 levels "R/R","R/W","W/R",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ Female: Factor w/ 60 levels "10","11","12",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Male  : Factor w/ 60 levels "10","11","12",..: 1 5 2 4 3 19 18 16 20 17 ...
head(salamander)
##   Mate Cross Female Male
## 1    1   R/R     10   10
## 2    1   R/R     11   14
## 3    1   R/R     12   11
## 4    1   R/R     13   13
## 5    1   R/R     14   12
## 6    1   R/W     15   28

We also need the dplyr package. Loading the tidyverse package attaches dplyr along with several other packages. Overkill, but whatever.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.2
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.3.4     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'ggplot2' was built under R version 3.4.2
## Warning: package 'tibble' was built under R version 3.4.2
## Warning: package 'tidyr' was built under R version 3.4.2
## Warning: package 'readr' was built under R version 3.4.2
## Warning: package 'purrr' was built under R version 3.4.2
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.2
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
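
The conflicts listing above says that, once these packages are loaded, a plain filter() call refers to dplyr's version rather than the one in stats. Here is a minimal sketch of how either version can still be called explicitly with `::`. The `toy` data frame is made up for illustration and is not the real salamander data.

```r
library(dplyr)  # dplyr alone provides select(), group_by(), summarise(), and mutate()

# Hypothetical toy data, not the real salamander data
toy <- data.frame(
  Mate  = c(1, 0, 1, 1),
  Cross = c("R/R", "R/W", "R/R", "W/W")
)

# dplyr::filter() keeps the rows that satisfy a condition
mated <- dplyr::filter(toy, Mate == 1)

# stats::filter() is an unrelated time-series function
# (here it applies a 2-point moving average to a numeric vector)
smoothed <- stats::filter(c(1, 2, 3, 4), rep(1 / 2, 2))

nrow(mated)
```

Writing the package name in front of the function is only needed when we want the masked version, or when we want to be explicit about which one we mean.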

start the dplyr fun!

create a new data set by selecting (or dropping) variables

For this activity, we don’t need the female and male identification numbers, so we can create a new data set that drops those two variables. To do this, we use the %>% symbol (a pipe) and the select function. The pipe passes the salamander data set to the function that follows it, as that function’s first argument. The minus sign tells select to keep every variable EXCEPT the third and fourth (Female and Male). We save the result as minisal.

minisal <- salamander %>% select(-c(3,4))
summary(minisal)
##       Mate       Cross   
##  Min.   :0.000   R/R:90  
##  1st Qu.:0.000   R/W:90  
##  Median :1.000   W/R:90  
##  Mean   :0.525   W/W:90  
##  3rd Qu.:1.000           
##  Max.   :1.000
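
To see what the pipe is doing, it may help to compare the piped call with the same call written without a pipe. This is a small sketch on a made-up data frame (`toy` is hypothetical and only borrows the four column names from salamander).

```r
library(dplyr)

# Hypothetical toy data with the same four column names as salamander
toy <- data.frame(
  Mate   = c(1, 0, 1),
  Cross  = c("R/R", "R/W", "W/W"),
  Female = c("10", "11", "12"),
  Male   = c("10", "14", "11")
)

piped  <- toy %>% select(-c(3, 4))  # pipe form: toy becomes select()'s first argument
nested <- select(toy, -c(3, 4))     # the same call written without the pipe

# Both forms should give the same two-column result
identical(piped, nested)
```

The pipe form starts to pay off once several steps are chained, because the steps then read left to right in the order they happen.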

Alternatively, we can create this mini data set by selecting the variables we DO want. We could have run the following to end up with just the first and second variables.

other_minisal <- salamander %>% select(1:2)
summary(other_minisal)
##       Mate       Cross   
##  Min.   :0.000   R/R:90  
##  1st Qu.:0.000   R/W:90  
##  Median :1.000   W/R:90  
##  Mean   :0.525   W/W:90  
##  3rd Qu.:1.000           
##  Max.   :1.000
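
Selecting by position works, but it quietly depends on the column order. As a sketch of an alternative, select also accepts column names, which can be easier to read. The `toy` data frame below is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same four column names as salamander
toy <- data.frame(
  Mate   = c(1, 0, 1),
  Cross  = c("R/R", "R/W", "W/W"),
  Female = c("10", "11", "12"),
  Male   = c("10", "14", "11")
)

by_drop <- toy %>% select(-Female, -Male)  # drop the two ID columns by name
by_keep <- toy %>% select(Mate, Cross)     # keep the two wanted columns by name

names(by_drop)
names(by_keep)
```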

group rows together using group_by

Let’s calculate the number of trials for each type of cross. Additionally, let’s calculate the number of successful matings for each type of cross. We do this by using the mini data set (minisal), grouping the observations by the type of cross, and then calculating the two quantities for each of the four crosses. The n() function counts the observations for each type of cross. Because Mate is 1 for a successful mating and 0 otherwise, sum(Mate) counts the successful matings for each type of cross. We save these new variables as Cross_n and successes, respectively.

out1 <- minisal %>% group_by(Cross) %>% summarise(Cross_n = n(), successes = sum(Mate))

Let’s check out the results!

out1
## # A tibble: 4 x 3
##    Cross Cross_n successes
##   <fctr>   <int>     <dbl>
## 1    R/R      90        60
## 2    R/W      90        50
## 3    W/R      90        19
## 4    W/W      90        60

We have a table with 4 rows (because we have 4 crosses) and 3 columns (because we now have 3 variables).
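
To see what group_by contributes, here is a small sketch that runs the same summarise call with and without grouping. The `toy` data frame is made up (three crosses, six trials) and is not the real salamander data.

```r
library(dplyr)

# Hypothetical toy data: six trials spread over three crosses
toy <- data.frame(
  Mate  = c(1, 0, 1, 1, 0, 0),
  Cross = c("R/R", "R/R", "R/W", "R/W", "R/W", "W/W")
)

# Without group_by(), summarise() collapses the whole data set to one row
overall <- toy %>%
  summarise(Cross_n = n(), successes = sum(Mate))

# With group_by(), the same summaries are computed once per cross
by_cross <- toy %>%
  group_by(Cross) %>%
  summarise(Cross_n = n(), successes = sum(Mate))

nrow(overall)   # one row for the whole data set
nrow(by_cross)  # one row per cross
```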

create a new variable using mutate

Let’s say we now want to calculate the proportion of times salamanders of a certain cross-type mated. We have the numerator and denominator for these proportions already, so we can use mutate to divide one by the other to calculate these proportions.

out2 <- out1 %>% mutate(props = successes/Cross_n)
## Warning: package 'bindrcpp' was built under R version 3.4.2
out2
## # A tibble: 4 x 4
##    Cross Cross_n successes     props
##   <fctr>   <int>     <dbl>     <dbl>
## 1    R/R      90        60 0.6666667
## 2    R/W      90        50 0.5555556
## 3    W/R      90        19 0.2111111
## 4    W/W      90        60 0.6666667
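
Since Mate only takes the values 0 and 1, its mean within a group is the same as successes divided by trials. The sketch below checks that on a made-up data frame (`toy` is hypothetical), comparing the summarise-then-mutate route with a single summarise call.

```r
library(dplyr)

# Hypothetical toy data: six trials spread over three crosses
toy <- data.frame(
  Mate  = c(1, 0, 1, 1, 0, 0),
  Cross = c("R/R", "R/R", "R/W", "R/W", "R/W", "W/W")
)

# Two steps: count trials and successes, then divide with mutate()
two_step <- toy %>%
  group_by(Cross) %>%
  summarise(Cross_n = n(), successes = sum(Mate)) %>%
  mutate(props = successes / Cross_n)

# One step: the mean of a 0/1 variable is the proportion of 1s
one_step <- toy %>%
  group_by(Cross) %>%
  summarise(props = mean(Mate))

all.equal(two_step$props, one_step$props)
```

The two-step version is still handy when we want to keep the counts alongside the proportions, as in the table above.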

That’s all I’ve learned so far. :)