An illustration of happy dplyr pipes.

Introduction

This tutorial is Part 3 of my Stats in R tutorial series. If you missed the last tutorial, you can check it out at the following link. In the previous tutorial, we learned a lot of descriptive statistics, how to simulate data, and how variables operate in R. Now that we have a good understanding of a lot of useful functions, we can talk about the tidyverse, arguably one of the most important and useful set of packages you will find in R for data wrangling. We will focus on one package in the tidyverse, the dplyr package, and you will then be able to learn how to:

Apply the fundamentals of the dplyr package and the %>% operator.
Start chaining together complex commands with dplyr.
Filter data for specific values.
Group your data for more specific descriptives.
Create and rename variables.
Summarize data using the dplyr method.

The tidyverse is unfortunately one of the more difficult parts of R to master. Learning it now will pay off immensely, else you will have a much harder time wrangling and visualizing data the way you want. While this tutorial is certainly not exhaustive, it at least covers the core functions that are useful to learn as a beginner.

The “Pipe”

The core functionality of dplyr (short for “data plyer”), is that it chains commands so that you can do quick and sophisticated data analysis without having to resort to several base R commands that require many intermediate steps to get to your goal. While the following code does not contain real commands, the general structure of a dplyr flow usually looks something like this:

data %>% 
  function.x() %>% 
  function.y() %>% 
  function.z()

Here is how it works. First, dplyr takes everything to the left of the %>% pipe, and runs everything after on it. Here data represents a typical data frame that will be tidied up by our dplyr commands. The %>% basically says “take this data and pipe it through to the next command.” The functions thereafter are applied after the pipe in sequential order, with each pipe adding an additional step. So here in order, our above command says “take data, then run function.x, then after that run function.y, and then finally run function z.

To visualize this, you can think of these pipes like little tunnels that your data goes through, with each function serving as a junction point where the data is manipulated in some way before it continues down the tunnel to the next command. Much of this will appear abstract until we apply this to actual dplyr commands, so it may be a good idea to mention our first function: select.

Select

Select is just like it sounds. If you want to select specific variables in your data, select achieves this goal. This is useful because it allows you to select only the variables that are useful to you, and you can apply more sophisticated selections by using multiple commands. We will first load the tidyverse, which automatically loads dplyr. Remember to use install.packages("tidyverse") if you haven’t installed this package already, otherwise you can just run library(tidyverse) if it is already installed.

install.packages("tidyverse")
library(tidyverse)

Now that our package is loaded, we will pick a dataset to use. I think a good one would be the airquality dataset. It comes with R so you can literally run airquality by itself and the data will show up in your Console. We will also convert airquality into a tibble called air, which just transforms this data frame into a more readable format.

air <- as_tibble(airquality)
air

## # A tibble: 153 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    36     118   8      72     5     2
##  3    12     149  12.6    74     5     3
##  4    18     313  11.5    62     5     4
##  5    NA      NA  14.3    56     5     5
##  6    28      NA  14.9    66     5     6
##  7    23     299   8.6    65     5     7
##  8    19      99  13.8    59     5     8
##  9     8      19  20.1    61     5     9
## 10    NA     194   8.6    69     5    10
## # … with 143 more rows

You can see air contains the following variables:

Ozone: ozone in parts per billion (ppb)
Solar.R: solar radiation in Langleys (lang)
Wind: wind by miles per hour (mph)
Temp: temperature in degrees F
Month: the calendar month when data was recorded (1 to 12)
Day: the day of month when data was recorded (1 to 31)

Let’s say we want to select the Temp variable. All we have to do is use this function:

select(air, Temp)

## # A tibble: 153 × 1
##     Temp
##    <int>
##  1    67
##  2    72
##  3    74
##  4    62
##  5    56
##  6    66
##  7    65
##  8    59
##  9    61
## 10    69
## # … with 143 more rows

The alternative way, and arguably better way, is what we learned before. First, type in the air data. Then use the %>% operator (the shortcut for this is Ctrl + Shift + M). Finally, run select on the variable of interest, this time Ozone.

air %>% 
  select(Ozone)

## # A tibble: 153 × 1
##    Ozone
##    <int>
##  1    41
##  2    36
##  3    12
##  4    18
##  5    NA
##  6    28
##  7    23
##  8    19
##  9     8
## 10    NA
## # … with 143 more rows

You may be asking yourself why it’s necessary to write these additional lines. This will become much more clear as you combine more steps. For example, what if we want to only select one variable but also drop it’s missing (NA) values? We can do so by simply adding one more line with drop_na.

air %>% 
  select(Ozone) %>% 
  drop_na()

## # A tibble: 116 × 1
##    Ozone
##    <int>
##  1    41
##  2    36
##  3    12
##  4    18
##  5    28
##  6    23
##  7    19
##  8     8
##  9     7
## 10    16
## # … with 106 more rows

Congratulations! You have chained together your first command in the tidyverse. Notice also that the tibble now lists less rows. This is because we applied listwise deletion, a form of missing data removal in statistics. This usually isn’t the best way to handle missing data (a topic that could be a tutorial of it’s own), but for now you may just note that this is one way to deal with it. Now to select multiple variables, we just add more in the select function.

air %>% 
  select(Ozone, Wind)

## # A tibble: 153 × 2
##    Ozone  Wind
##    <int> <dbl>
##  1    41   7.4
##  2    36   8  
##  3    12  12.6
##  4    18  11.5
##  5    NA  14.3
##  6    28  14.9
##  7    23   8.6
##  8    19  13.8
##  9     8  20.1
## 10    NA   8.6
## # … with 143 more rows

What if we want to just remove some variables from the data? We can do this by adding - to the variable in our select function.

air %>% 
  select(-Ozone, -Wind)

## # A tibble: 153 × 4
##    Solar.R  Temp Month   Day
##      <int> <int> <int> <int>
##  1     190    67     5     1
##  2     118    72     5     2
##  3     149    74     5     3
##  4     313    62     5     4
##  5      NA    56     5     5
##  6      NA    66     5     6
##  7     299    65     5     7
##  8      99    59     5     8
##  9      19    61     5     9
## 10     194    69     5    10
## # … with 143 more rows

Now we have removed both ozone and wind variables from our data, with everything else remaining. Now this is all nice and helpful, but what if we have a dataset with a bajillion names? This is can be difficult to deal with and requires more sophisticated selection tools. I will introduce you to at least a few to help.

First off, a base R function that is helpful for quickly obtaining all of your variable names is colnames).

colnames(air)

## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

Once you have your list, you can start picking what you want. Two useful functions are starts_with and ends_with. These take a variable’s beginning or ending letters to find it.

air %>% 
  select(starts_with("O"))

## # A tibble: 153 × 1
##    Ozone
##    <int>
##  1    41
##  2    36
##  3    12
##  4    18
##  5    NA
##  6    28
##  7    23
##  8    19
##  9     8
## 10    NA
## # … with 143 more rows

The contains command is more direct in that it just takes any specified character string in a variable and selects it.

air %>% 
  select(contains("O"))

## # A tibble: 153 × 3
##    Ozone Solar.R Month
##    <int>   <int> <int>
##  1    41     190     5
##  2    36     118     5
##  3    12     149     5
##  4    18     313     5
##  5    NA      NA     5
##  6    28      NA     5
##  7    23     299     5
##  8    19      99     5
##  9     8      19     5
## 10    NA     194     5
## # … with 143 more rows

Notice something? The contains function isn’t case-specific, so you can use both undercase or uppercase for this specific search. It is also more generic in what it selects, so be careful how you word things with this function.

Filter

Now that we know how to select, we are going to add filter to our arsenal. Filter takes the variables you have selected and designates what values they can have. It’s namesake is similar to what happens with a coffee filter. You put the data through the filter (the initial water and coffee grounds), and the result after filtering is a different form of the data (the coffee). Let’s say we only want wind speeds that exceed 10 mph. We would use the following function.

air %>% 
  filter(Wind > 10)

## # A tibble: 72 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    12     149  12.6    74     5     3
##  2    18     313  11.5    62     5     4
##  3    NA      NA  14.3    56     5     5
##  4    28      NA  14.9    66     5     6
##  5    19      99  13.8    59     5     8
##  6     8      19  20.1    61     5     9
##  7    14     274  10.9    68     5    14
##  8    18      65  13.2    58     5    15
##  9    14     334  11.5    64     5    16
## 10    34     307  12      66     5    17
## # … with 62 more rows

If we want “greater than or equal to” 10 mph, we would just add >= instead.

air %>% 
  filter(Wind >= 10)

## # A tibble: 72 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    12     149  12.6    74     5     3
##  2    18     313  11.5    62     5     4
##  3    NA      NA  14.3    56     5     5
##  4    28      NA  14.9    66     5     6
##  5    19      99  13.8    59     5     8
##  6     8      19  20.1    61     5     9
##  7    14     274  10.9    68     5    14
##  8    18      65  13.2    58     5    15
##  9    14     334  11.5    64     5    16
## 10    34     307  12      66     5    17
## # … with 62 more rows

How about less than or equal to?

air %>% 
  filter(Wind <= 10)

## # A tibble: 81 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    36     118   8      72     5     2
##  3    23     299   8.6    65     5     7
##  4    NA     194   8.6    69     5    10
##  5     7      NA   6.9    74     5    11
##  6    16     256   9.7    69     5    12
##  7    11     290   9.2    66     5    13
##  8    11      44   9.7    62     5    20
##  9     1       8   9.7    59     5    21
## 10     4      25   9.7    61     5    23
## # … with 71 more rows

If we want something to equal exactly 20.1 mph, we would use the == operator. Remember, do not use = or R will act drunk and try to save it as an object.

air %>% 
  filter(Wind == 20.1)

## # A tibble: 1 × 6
##   Ozone Solar.R  Wind  Temp Month   Day
##   <int>   <int> <dbl> <int> <int> <int>
## 1     8      19  20.1    61     5     9

And if we want every value except for 20.1?

air %>% 
  filter(Wind != 20.1)

## # A tibble: 152 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    36     118   8      72     5     2
##  3    12     149  12.6    74     5     3
##  4    18     313  11.5    62     5     4
##  5    NA      NA  14.3    56     5     5
##  6    28      NA  14.9    66     5     6
##  7    23     299   8.6    65     5     7
##  8    19      99  13.8    59     5     8
##  9    NA     194   8.6    69     5    10
## 10     7      NA   6.9    74     5    11
## # … with 142 more rows

Of course, the point of this tutorial is to learn how to combine these things. Let’s select Month and Day, then filter for just the month of May (5).

air %>% 
  select(Month, Day) %>% 
  filter(Month == 5)

## # A tibble: 31 × 2
##    Month   Day
##    <int> <int>
##  1     5     1
##  2     5     2
##  3     5     3
##  4     5     4
##  5     5     5
##  6     5     6
##  7     5     7
##  8     5     8
##  9     5     9
## 10     5    10
## # … with 21 more rows

Now that’s all great, but what if we want to apply multiple filters? We can just add them in like we did with the select function. So if we want to filter for the first ten days of July, we would use the following command:

air %>% 
  select(Month, Day) %>% 
  filter(Month == 7, Day <= 10)

## # A tibble: 10 × 2
##    Month   Day
##    <int> <int>
##  1     7     1
##  2     7     2
##  3     7     3
##  4     7     4
##  5     7     5
##  6     7     6
##  7     7     7
##  8     7     8
##  9     7     9
## 10     7    10

However, if we want data that has either July dates or the first ten days of any given month, we would use the | operator. Be careful here: make sure to include the name of the variable each time or R will again wig out and forget how to work.

air %>% 
  select(Month, Day) %>% 
  filter(Month == 7 | Day <= 10)

## # A tibble: 71 × 2
##    Month   Day
##    <int> <int>
##  1     5     1
##  2     5     2
##  3     5     3
##  4     5     4
##  5     5     5
##  6     5     6
##  7     5     7
##  8     5     8
##  9     5     9
## 10     5    10
## # … with 61 more rows

Special care must be taken with missing values. If we want to filter them from a variable, it is helpful to use is.na as a wrapper for whatever variable we want to remove the missingness from. This will obtain all values that are NA in Wind:

air %>% 
  filter(is.na(Wind))

## # A tibble: 0 × 6
## # … with 6 variables: Ozone <int>, Solar.R <int>, Wind <dbl>, Temp <int>,
## #   Month <int>, Day <int>

And including the ! here will negate this by only finding data that isn’t missing.

air %>% 
  filter(!is.na(Wind))

## # A tibble: 153 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    41     190   7.4    67     5     1
##  2    36     118   8      72     5     2
##  3    12     149  12.6    74     5     3
##  4    18     313  11.5    62     5     4
##  5    NA      NA  14.3    56     5     5
##  6    28      NA  14.9    66     5     6
##  7    23     299   8.6    65     5     7
##  8    19      99  13.8    59     5     8
##  9     8      19  20.1    61     5     9
## 10    NA     194   8.6    69     5    10
## # … with 143 more rows

Arrange

A very simple function to add to our arsenal is arrange. This one requires less brainpower to use, as all it does is sort data in order (there is actually a base R version too called sort that does the same thing for vectors). If we want to arrange our Solar.R variable in ascending order, we run the following command:

air %>% 
  arrange(Solar.R)

## # A tibble: 153 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    16       7   6.9    74     7    21
##  2     1       8   9.7    59     5    21
##  3    23      13  12      67     5    28
##  4    23      14   9.2    71     9    22
##  5     8      19  20.1    61     5     9
##  6    14      20  16.6    63     9    25
##  7     9      24  13.8    81     8     2
##  8     9      24  10.9    71     9    14
##  9     4      25   9.7    61     5    23
## 10    13      27  10.3    76     9    18
## # … with 143 more rows

To arrange in descending order, simply add desc inside the arrange function.

air %>% 
  arrange(desc(Solar.R))

## # A tibble: 153 × 6
##    Ozone Solar.R  Wind  Temp Month   Day
##    <int>   <int> <dbl> <int> <int> <int>
##  1    14     334  11.5    64     5    16
##  2    NA     332  13.8    80     6    14
##  3    39     323  11.5    87     6    10
##  4    30     322  11.5    68     5    19
##  5    NA     322  11.5    79     6    15
##  6    11     320  16.6    73     5    22
##  7    40     314  10.9    83     7     6
##  8    18     313  11.5    62     5     4
##  9    34     307  12      66     5    17
## 10    23     299   8.6    65     5     7
## # … with 143 more rows

Let’s combine all of what we know now. We would like to select solar radiation, ozone, and wind. Then we want to filter for solar radiation that is below 200. Finally we want to arrange the wind variable in descending order. This can be achieved with the following sequence of code.

air %>% 
  select(Ozone, Solar.R, Wind) %>% 
  filter(Solar.R < 200) %>% 
  arrange(desc(Wind))

## # A tibble: 71 × 3
##    Ozone Solar.R  Wind
##    <int>   <int> <dbl>
##  1     8      19  20.1
##  2     6      78  18.4
##  3    NA      66  16.6
##  4    14      20  16.6
##  5    32      92  15.5
##  6    21     191  14.9
##  7    NA      31  14.9
##  8    27     175  14.9
##  9     7      48  14.3
## 10     9      36  14.3
## # … with 61 more rows

We are starting to get more sophisticated with our data wrangling. Let’s continue to learn some more useful functions to add to our tidyverse vocabulary.

Grouping and Transformation

So far we haven’t done anything magical in terms of transforming our data, but now we can start doing so. The first step to this is learning group_by and summarise. The group_by function takes data and groups it into categories. The summarise function then applies transformations to these groups. For example, if we wanted to know what each month’s mean temperature is, we can achieve this with these two functions.

Let’s go ahead and do this. To group by month, we will use group_by(Month). Then we essentially create a new variable of our choosing in summarise by first assigning a name. Then we use the = operator after the name, and then finally the transformation we want after the = operator. Here we want the mean of the temperature for each month, so we just use mean(Temp), which will be our new variable called Mean_Temp.

air %>% 
  group_by(Month) %>% 
  summarise(Mean_Temp = mean(Temp))

## # A tibble: 5 × 2
##   Month Mean_Temp
##   <int>     <dbl>
## 1     5      65.5
## 2     6      79.1
## 3     7      83.9
## 4     8      84.0
## 5     9      76.9

The summarise function does something important here: it collapses the data frame into just the variables we have created. This makes it a very effective tool for really concise summaries of large datasets. Here we just have what we need. The mean temperatures for each month are listed from May to September. If we want to arrange these means by the hottest temperature, we can go ahead and do so with arrange. Remember to use our new variable name.

air %>% 
  group_by(Month) %>% 
  summarise(Mean_Temp = mean(Temp)) %>% 
  arrange(desc(Mean_Temp))

## # A tibble: 5 × 2
##   Month Mean_Temp
##   <int>     <dbl>
## 1     8      84.0
## 2     7      83.9
## 3     6      79.1
## 4     9      76.9
## 5     5      65.5

We can see that August is the hottest month on average and May is the coolest on average. What if we want multiple descriptives for our data? Remember that we learned quite a few of these functions in the last tutorial. Let’s go ahead and add some here.

air %>% 
  group_by(Month) %>% 
  summarise(Mean_Temp = mean(Temp),
            SD_Temp = sd(Temp),
            IQR_Temp = IQR(Temp))

## # A tibble: 5 × 4
##   Month Mean_Temp SD_Temp IQR_Temp
##   <int>     <dbl>   <dbl>    <dbl>
## 1     5      65.5    6.85     9   
## 2     6      79.1    6.60     6.75
## 3     7      83.9    4.32     4.5 
## 4     8      84.0    6.59     9.5 
## 5     9      76.9    8.36    10

We can start adding a bit more complexity. Let’s select the month and wind speeds for each month. Then we will filter for just wind values above 10 mph. Then we will group my month and get the variance of wind values for each month. We will then arrange the variance in descending order. Keep in mind that the order of operations matters here. Make sure you use the proper order or R will bring back the wrong data.

air %>% 
  select(Month, Wind) %>% 
  group_by(Month) %>% 
  summarise(Var_Wind = var(Wind)) %>% 
  arrange(desc(Var_Wind))

## # A tibble: 5 × 2
##   Month Var_Wind
##   <int>    <dbl>
## 1     6    14.2 
## 2     5    12.5 
## 3     9    12.0 
## 4     8    10.4 
## 5     7     9.22

We can see that the variance for wind isn’t too extreme between months, but we can at least visually inspect it here.

Mutate

Now that we have learned summarise, we can use something conceptually similar called mutate. This function creates variables like summarise, but does so without collapsing the data frame. This means mutate has two functions. First, you can simply create new variables with mutate without grouping. Second, you can make grouped transformations of the data and add them to your dataset without having to eliminate all the other variables in your data in the process. This may not make a ton of sense just yet, but we will make some mutate transformations to see why this is helpful. First, we can make a very simple transformation by just creating a new variable. Let’s convert the temperature variable from Fahrenheit to Celcius. We can apply the following formula:

\[ \text{Celcius} = (\text{Degrees Fahrenheit} - 32) \times 5/9 \] To achieve this, we will use the following code:

air %>% 
  mutate(Celcius = (Temp-32)*(5/9))

## # A tibble: 153 × 7
##    Ozone Solar.R  Wind  Temp Month   Day Celcius
##    <int>   <int> <dbl> <int> <int> <int>   <dbl>
##  1    41     190   7.4    67     5     1    19.4
##  2    36     118   8      72     5     2    22.2
##  3    12     149  12.6    74     5     3    23.3
##  4    18     313  11.5    62     5     4    16.7
##  5    NA      NA  14.3    56     5     5    13.3
##  6    28      NA  14.9    66     5     6    18.9
##  7    23     299   8.6    65     5     7    18.3
##  8    19      99  13.8    59     5     8    15  
##  9     8      19  20.1    61     5     9    16.1
## 10    NA     194   8.6    69     5    10    20.6
## # … with 143 more rows

You can see that it has maintained the data frame while creating this variable. Now perhaps we are interested in the mean differences in monthly temperature. We can just add the group_by variable and create a new mutation that takes the mean of our new variable. We will name it Mean_Celcius.

air %>% 
  mutate(Celcius = (Temp-32)*(5/9)) %>% 
  group_by(Month) %>% 
  mutate(Mean_Celcius = mean(Celcius))

## # A tibble: 153 × 8
## # Groups:   Month [5]
##    Ozone Solar.R  Wind  Temp Month   Day Celcius Mean_Celcius
##    <int>   <int> <dbl> <int> <int> <int>   <dbl>        <dbl>
##  1    41     190   7.4    67     5     1    19.4         18.6
##  2    36     118   8      72     5     2    22.2         18.6
##  3    12     149  12.6    74     5     3    23.3         18.6
##  4    18     313  11.5    62     5     4    16.7         18.6
##  5    NA      NA  14.3    56     5     5    13.3         18.6
##  6    28      NA  14.9    66     5     6    18.9         18.6
##  7    23     299   8.6    65     5     7    18.3         18.6
##  8    19      99  13.8    59     5     8    15           18.6
##  9     8      19  20.1    61     5     9    16.1         18.6
## 10    NA     194   8.6    69     5    10    20.6         18.6
## # … with 143 more rows

You can see now that we have effectively created a column-wise mean for each month. However, if you want to make further transformations with mutate but don’t want to continue using the same grouping, you should use ungroup before doing another mutation. For example, perhaps I want to calculate wind-adjusted temperature for each day, but I don’t want to calculate it by month. I can do so with this manipulation:

air %>% 
  mutate(Celcius = (Temp-32)*(5/9)) %>% 
  group_by(Month) %>% 
  mutate(Mean_Celcius = mean(Celcius)) %>% 
  ungroup() %>% 
  mutate(Adj_Temp = Temp/Wind)

## # A tibble: 153 × 9
##    Ozone Solar.R  Wind  Temp Month   Day Celcius Mean_Celcius Adj_Temp
##    <int>   <int> <dbl> <int> <int> <int>   <dbl>        <dbl>    <dbl>
##  1    41     190   7.4    67     5     1    19.4         18.6     9.05
##  2    36     118   8      72     5     2    22.2         18.6     9   
##  3    12     149  12.6    74     5     3    23.3         18.6     5.87
##  4    18     313  11.5    62     5     4    16.7         18.6     5.39
##  5    NA      NA  14.3    56     5     5    13.3         18.6     3.92
##  6    28      NA  14.9    66     5     6    18.9         18.6     4.43
##  7    23     299   8.6    65     5     7    18.3         18.6     7.56
##  8    19      99  13.8    59     5     8    15           18.6     4.28
##  9     8      19  20.1    61     5     9    16.1         18.6     3.03
## 10    NA     194   8.6    69     5    10    20.6         18.6     8.02
## # … with 143 more rows

One final easy transformation function is rename. This just allows you to quickly rename variables.

air %>% 
  rename(Solar_Radiation = Solar.R)

## # A tibble: 153 × 6
##    Ozone Solar_Radiation  Wind  Temp Month   Day
##    <int>           <int> <dbl> <int> <int> <int>
##  1    41             190   7.4    67     5     1
##  2    36             118   8      72     5     2
##  3    12             149  12.6    74     5     3
##  4    18             313  11.5    62     5     4
##  5    NA              NA  14.3    56     5     5
##  6    28              NA  14.9    66     5     6
##  7    23             299   8.6    65     5     7
##  8    19              99  13.8    59     5     8
##  9     8              19  20.1    61     5     9
## 10    NA             194   8.6    69     5    10
## # … with 143 more rows

This can be pretty useful if you have funky names for your variables (which is very often the case if you are getting it from somebody else).

Pivot Data

We have done a lot already, but I wanted to highlight two very important functions before concluding. Oftentimes data will come in either wide or long format. It is generally advised to make your data in long format, which means that every variable should be in its own column, and every observation should be in rows. Consider for example the following data from the datarium package, which records subject ID’s, the treatment group for a depression study, and four time points.

library(datarium)
depression

## # A tibble: 24 × 6
##    id    treatment    t0    t1    t2    t3
##    <fct> <fct>     <dbl> <dbl> <dbl> <dbl>
##  1 1     ctr         296   175   187   242
##  2 2     ctr         376   329   236   126
##  3 3     ctr         309   238   150   173
##  4 4     ctr         222    60    82   135
##  5 5     ctr         150   271   250   266
##  6 6     ctr         316   291   238   194
##  7 7     ctr         321   364   270   358
##  8 8     ctr         447   402   294   266
##  9 9     ctr         220    70    95   137
## 10 10    ctr         375   335   334   129
## # … with 14 more rows

This can sometimes be problematic because we want to select a single factor of “time” when doing data manipulation. It may also unnecessarily clutter up the data frame by forcing us to scroll through several columns that are unnecessary. We can fix this with the pivot_longer function. This involves multiple steps, but generally goes like this:

Select the columns to pivot (here we will use 3:6 to select the t variables).
What names to give the factor (we will use “time” here).
What names to give the numeric variable that is listed with them.

A direct example is more illustrative of how this works. To use this for later, we will save this as long_depression.

long_depression <- depression %>% 
  pivot_longer(cols = 3:6,
               names_to = "time",
               values_to = "score")
long_depression

## # A tibble: 96 × 4
##    id    treatment time  score
##    <fct> <fct>     <chr> <dbl>
##  1 1     ctr       t0      296
##  2 1     ctr       t1      175
##  3 1     ctr       t2      187
##  4 1     ctr       t3      242
##  5 2     ctr       t0      376
##  6 2     ctr       t1      329
##  7 2     ctr       t2      236
##  8 2     ctr       t3      126
##  9 3     ctr       t0      309
## 10 3     ctr       t1      238
## # … with 86 more rows

Now all of our times are neatly in one column and we can use this directly if we want to manipulate or plot this data. However, you may sometimes need to convert data into wide format. The equivalent pivot_wider helps you accomplish this goal. This time, we use the names_from argument to pull column names from the time variable and the values_from argument to fill the score values into these columns.

long_depression %>% 
  pivot_wider(names_from = time, 
              values_from = score)

## # A tibble: 24 × 6
##    id    treatment    t0    t1    t2    t3
##    <fct> <fct>     <dbl> <dbl> <dbl> <dbl>
##  1 1     ctr         296   175   187   242
##  2 2     ctr         376   329   236   126
##  3 3     ctr         309   238   150   173
##  4 4     ctr         222    60    82   135
##  5 5     ctr         150   271   250   266
##  6 6     ctr         316   291   238   194
##  7 7     ctr         321   364   270   358
##  8 8     ctr         447   402   294   266
##  9 9     ctr         220    70    95   137
## 10 10    ctr         375   335   334   129
## # … with 14 more rows

Generally speaking, its much more useful to have data in long format, so invest some time into learning pivot_longer when you can.

Conclusion

You now have some mastery of the most essential functions of dplyr. If you want to learn more about the tidyverse, there is no better resource than the wonderful book R for Data Science, written by the author of the tidyverse Hadley Wickham. I also recommend reading his article Tidy Data if you want to understand why long format data is important. I will be following this tutorial with another instructional article on the essential package from the tidyverse called ggplot2, which provides unparalleled data visualization and fancy plots for statisticians to use.

My Other RPubs

Thank you for reading this tutorial. If you felt that any part of this tutorial was helpful, and you would like to support me, please consider buying me a cup of coffee.

Check out my other RPubs if you want to learn more about stats in R.

More content is on the way, which I will continue to try to update each week or two when possible. Keep an eye out for it by checking my LinkedIn and other relevant sites.

Stats in R Part 3: Learning DPLYR

Shawn Hemelstrand

2023-02-12