An illustration of happy dplyr pipes.
This tutorial is Part 3 of my Stats in R tutorial series. If you
missed the last tutorial, you can check it out at the
following link. In the previous tutorial, we learned a lot of
descriptive statistics, how to simulate data, and how variables operate
in R. Now that we have a good understanding of a lot of useful
functions, we can talk about the tidyverse, arguably one of
the most important and useful set of packages you will find in R for
data wrangling. We will focus on one package in the
tidyverse, the dplyr package, and you will
then be able to learn how to:
dplyr package and the
%>% operator.dplyr.dplyr method.The tidyverse is unfortunately one of the more difficult
parts of R to master. Learning it now will pay off immensely, else you
will have a much harder time wrangling and visualizing data the way you
want. While this tutorial is certainly not exhaustive, it at least
covers the core functions that are useful to learn as a beginner.
The core functionality of dplyr (short for “data
plyer”), is that it chains commands so that you can do quick and
sophisticated data analysis without having to resort to several base R
commands that require many intermediate steps to get to your goal. While
the following code does not contain real commands, the general structure
of a dplyr flow usually looks something like this:
data %>%
function.x() %>%
function.y() %>%
function.z()
Here is how it works. First, dplyr takes everything to
the left of the %>% pipe, and runs everything after on
it. Here data represents a typical data frame that will be
tidied up by our dplyr commands. The %>%
basically says “take this data and pipe it through to the next command.”
The functions thereafter are applied after the pipe in sequential order,
with each pipe adding an additional step. So here in order, our above
command says “take data, then run function.x,
then after that run function.y, and then finally run
function z.
To visualize this, you can think of these pipes like little tunnels
that your data goes through, with each function serving as a junction
point where the data is manipulated in some way before it continues down
the tunnel to the next command. Much of this will appear abstract until
we apply this to actual dplyr commands, so it may be a good
idea to mention our first function: select.
Select is just like it sounds. If you want to select specific
variables in your data, select achieves this goal. This is
useful because it allows you to select only the variables that are
useful to you, and you can apply more sophisticated selections by using
multiple commands. We will first load the tidyverse, which
automatically loads dplyr. Remember to use
install.packages("tidyverse") if you haven’t installed this
package already, otherwise you can just run
library(tidyverse) if it is already installed.
install.packages("tidyverse")
library(tidyverse)
Now that our package is loaded, we will pick a dataset to use. I
think a good one would be the airquality dataset. It comes
with R so you can literally run airquality by itself and
the data will show up in your Console. We will also convert
airquality into a tibble called air, which
just transforms this data frame into a more readable format.
air <- as_tibble(airquality)
air
## # A tibble: 153 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## # … with 143 more rows
You can see air contains the following variables:
Let’s say we want to select the Temp variable. All we
have to do is use this function:
select(air, Temp)
## # A tibble: 153 × 1
## Temp
## <int>
## 1 67
## 2 72
## 3 74
## 4 62
## 5 56
## 6 66
## 7 65
## 8 59
## 9 61
## 10 69
## # … with 143 more rows
The alternative way, and arguably better way, is what we learned
before. First, type in the air data. Then use the
%>% operator (the shortcut for this is
Ctrl + Shift + M). Finally, run select on the
variable of interest, this time Ozone.
air %>%
select(Ozone)
## # A tibble: 153 × 1
## Ozone
## <int>
## 1 41
## 2 36
## 3 12
## 4 18
## 5 NA
## 6 28
## 7 23
## 8 19
## 9 8
## 10 NA
## # … with 143 more rows
You may be asking yourself why it’s necessary to write these
additional lines. This will become much more clear as you combine more
steps. For example, what if we want to only select one variable but also
drop it’s missing (NA) values? We can do so by simply adding one more
line with drop_na.
air %>%
select(Ozone) %>%
drop_na()
## # A tibble: 116 × 1
## Ozone
## <int>
## 1 41
## 2 36
## 3 12
## 4 18
## 5 28
## 6 23
## 7 19
## 8 8
## 9 7
## 10 16
## # … with 106 more rows
Congratulations! You have chained together your first command in the
tidyverse. Notice also that the tibble now lists less rows.
This is because we applied listwise deletion, a form of
missing data removal in statistics. This usually isn’t the best way to
handle missing data (a topic that could be a tutorial of it’s own), but
for now you may just note that this is one way to deal with it. Now to
select multiple variables, we just add more in the select
function.
air %>%
select(Ozone, Wind)
## # A tibble: 153 × 2
## Ozone Wind
## <int> <dbl>
## 1 41 7.4
## 2 36 8
## 3 12 12.6
## 4 18 11.5
## 5 NA 14.3
## 6 28 14.9
## 7 23 8.6
## 8 19 13.8
## 9 8 20.1
## 10 NA 8.6
## # … with 143 more rows
What if we want to just remove some variables from the data? We can
do this by adding - to the variable in our
select function.
air %>%
select(-Ozone, -Wind)
## # A tibble: 153 × 4
## Solar.R Temp Month Day
## <int> <int> <int> <int>
## 1 190 67 5 1
## 2 118 72 5 2
## 3 149 74 5 3
## 4 313 62 5 4
## 5 NA 56 5 5
## 6 NA 66 5 6
## 7 299 65 5 7
## 8 99 59 5 8
## 9 19 61 5 9
## 10 194 69 5 10
## # … with 143 more rows
Now we have removed both ozone and wind variables from our data, with everything else remaining. Now this is all nice and helpful, but what if we have a dataset with a bajillion names? This is can be difficult to deal with and requires more sophisticated selection tools. I will introduce you to at least a few to help.
First off, a base R function that is helpful for quickly obtaining
all of your variable names is colnames).
colnames(air)
## [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
Once you have your list, you can start picking what you want. Two
useful functions are starts_with and
ends_with. These take a variable’s beginning or ending
letters to find it.
air %>%
select(starts_with("O"))
## # A tibble: 153 × 1
## Ozone
## <int>
## 1 41
## 2 36
## 3 12
## 4 18
## 5 NA
## 6 28
## 7 23
## 8 19
## 9 8
## 10 NA
## # … with 143 more rows
The contains command is more direct in that it just
takes any specified character string in a variable and selects it.
air %>%
select(contains("O"))
## # A tibble: 153 × 3
## Ozone Solar.R Month
## <int> <int> <int>
## 1 41 190 5
## 2 36 118 5
## 3 12 149 5
## 4 18 313 5
## 5 NA NA 5
## 6 28 NA 5
## 7 23 299 5
## 8 19 99 5
## 9 8 19 5
## 10 NA 194 5
## # … with 143 more rows
Notice something? The contains function isn’t
case-specific, so you can use both undercase or uppercase for this
specific search. It is also more generic in what it selects, so be
careful how you word things with this function.
Now that we know how to select, we are going to add
filter to our arsenal. Filter takes the variables you have
selected and designates what values they can have. It’s namesake is
similar to what happens with a coffee filter. You put the data through
the filter (the initial water and coffee grounds), and the result after
filtering is a different form of the data (the coffee). Let’s say we
only want wind speeds that exceed 10 mph. We would use the following
function.
air %>%
filter(Wind > 10)
## # A tibble: 72 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 12 149 12.6 74 5 3
## 2 18 313 11.5 62 5 4
## 3 NA NA 14.3 56 5 5
## 4 28 NA 14.9 66 5 6
## 5 19 99 13.8 59 5 8
## 6 8 19 20.1 61 5 9
## 7 14 274 10.9 68 5 14
## 8 18 65 13.2 58 5 15
## 9 14 334 11.5 64 5 16
## 10 34 307 12 66 5 17
## # … with 62 more rows
If we want “greater than or equal to” 10 mph, we would just add
>= instead.
air %>%
filter(Wind >= 10)
## # A tibble: 72 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 12 149 12.6 74 5 3
## 2 18 313 11.5 62 5 4
## 3 NA NA 14.3 56 5 5
## 4 28 NA 14.9 66 5 6
## 5 19 99 13.8 59 5 8
## 6 8 19 20.1 61 5 9
## 7 14 274 10.9 68 5 14
## 8 18 65 13.2 58 5 15
## 9 14 334 11.5 64 5 16
## 10 34 307 12 66 5 17
## # … with 62 more rows
How about less than or equal to?
air %>%
filter(Wind <= 10)
## # A tibble: 81 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8 72 5 2
## 3 23 299 8.6 65 5 7
## 4 NA 194 8.6 69 5 10
## 5 7 NA 6.9 74 5 11
## 6 16 256 9.7 69 5 12
## 7 11 290 9.2 66 5 13
## 8 11 44 9.7 62 5 20
## 9 1 8 9.7 59 5 21
## 10 4 25 9.7 61 5 23
## # … with 71 more rows
If we want something to equal exactly 20.1 mph, we would use
the == operator. Remember, do not use = or R
will act drunk and try to save it as an object.
air %>%
filter(Wind == 20.1)
## # A tibble: 1 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 8 19 20.1 61 5 9
And if we want every value except for 20.1?
air %>%
filter(Wind != 20.1)
## # A tibble: 152 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 NA 194 8.6 69 5 10
## 10 7 NA 6.9 74 5 11
## # … with 142 more rows
Of course, the point of this tutorial is to learn how to combine
these things. Let’s select Month and Day, then
filter for just the month of May (5).
air %>%
select(Month, Day) %>%
filter(Month == 5)
## # A tibble: 31 × 2
## Month Day
## <int> <int>
## 1 5 1
## 2 5 2
## 3 5 3
## 4 5 4
## 5 5 5
## 6 5 6
## 7 5 7
## 8 5 8
## 9 5 9
## 10 5 10
## # … with 21 more rows
Now that’s all great, but what if we want to apply multiple filters?
We can just add them in like we did with the select
function. So if we want to filter for the first ten days of July, we
would use the following command:
air %>%
select(Month, Day) %>%
filter(Month == 7, Day <= 10)
## # A tibble: 10 × 2
## Month Day
## <int> <int>
## 1 7 1
## 2 7 2
## 3 7 3
## 4 7 4
## 5 7 5
## 6 7 6
## 7 7 7
## 8 7 8
## 9 7 9
## 10 7 10
However, if we want data that has either July dates or the
first ten days of any given month, we would use the |
operator. Be careful here: make sure to include the name of the variable
each time or R will again wig out and forget how to work.
air %>%
select(Month, Day) %>%
filter(Month == 7 | Day <= 10)
## # A tibble: 71 × 2
## Month Day
## <int> <int>
## 1 5 1
## 2 5 2
## 3 5 3
## 4 5 4
## 5 5 5
## 6 5 6
## 7 5 7
## 8 5 8
## 9 5 9
## 10 5 10
## # … with 61 more rows
Special care must be taken with missing values. If we want to filter
them from a variable, it is helpful to use is.na as a
wrapper for whatever variable we want to remove the missingness from.
This will obtain all values that are NA in
Wind:
air %>%
filter(is.na(Wind))
## # A tibble: 0 × 6
## # … with 6 variables: Ozone <int>, Solar.R <int>, Wind <dbl>, Temp <int>,
## # Month <int>, Day <int>
And including the ! here will negate this by only
finding data that isn’t missing.
air %>%
filter(!is.na(Wind))
## # A tibble: 153 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## # … with 143 more rows
A very simple function to add to our arsenal is arrange.
This one requires less brainpower to use, as all it does is sort data in
order (there is actually a base R version too called sort
that does the same thing for vectors). If we want to arrange our
Solar.R variable in ascending order, we run the following
command:
air %>%
arrange(Solar.R)
## # A tibble: 153 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 16 7 6.9 74 7 21
## 2 1 8 9.7 59 5 21
## 3 23 13 12 67 5 28
## 4 23 14 9.2 71 9 22
## 5 8 19 20.1 61 5 9
## 6 14 20 16.6 63 9 25
## 7 9 24 13.8 81 8 2
## 8 9 24 10.9 71 9 14
## 9 4 25 9.7 61 5 23
## 10 13 27 10.3 76 9 18
## # … with 143 more rows
To arrange in descending order, simply add desc inside
the arrange function.
air %>%
arrange(desc(Solar.R))
## # A tibble: 153 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 14 334 11.5 64 5 16
## 2 NA 332 13.8 80 6 14
## 3 39 323 11.5 87 6 10
## 4 30 322 11.5 68 5 19
## 5 NA 322 11.5 79 6 15
## 6 11 320 16.6 73 5 22
## 7 40 314 10.9 83 7 6
## 8 18 313 11.5 62 5 4
## 9 34 307 12 66 5 17
## 10 23 299 8.6 65 5 7
## # … with 143 more rows
Let’s combine all of what we know now. We would like to select solar radiation, ozone, and wind. Then we want to filter for solar radiation that is below 200. Finally we want to arrange the wind variable in descending order. This can be achieved with the following sequence of code.
air %>%
select(Ozone, Solar.R, Wind) %>%
filter(Solar.R < 200) %>%
arrange(desc(Wind))
## # A tibble: 71 × 3
## Ozone Solar.R Wind
## <int> <int> <dbl>
## 1 8 19 20.1
## 2 6 78 18.4
## 3 NA 66 16.6
## 4 14 20 16.6
## 5 32 92 15.5
## 6 21 191 14.9
## 7 NA 31 14.9
## 8 27 175 14.9
## 9 7 48 14.3
## 10 9 36 14.3
## # … with 61 more rows
We are starting to get more sophisticated with our data wrangling.
Let’s continue to learn some more useful functions to add to our
tidyverse vocabulary.
So far we haven’t done anything magical in terms of transforming our
data, but now we can start doing so. The first step to this is learning
group_by and summarise. The
group_by function takes data and groups it into categories.
The summarise function then applies transformations to
these groups. For example, if we wanted to know what each month’s mean
temperature is, we can achieve this with these two functions.
Let’s go ahead and do this. To group by month, we will use
group_by(Month). Then we essentially create a new variable
of our choosing in summarise by first assigning a name.
Then we use the = operator after the name, and then finally
the transformation we want after the = operator. Here we
want the mean of the temperature for each month, so we just use
mean(Temp), which will be our new variable called
Mean_Temp.
air %>%
group_by(Month) %>%
summarise(Mean_Temp = mean(Temp))
## # A tibble: 5 × 2
## Month Mean_Temp
## <int> <dbl>
## 1 5 65.5
## 2 6 79.1
## 3 7 83.9
## 4 8 84.0
## 5 9 76.9
The summarise function does something important here: it
collapses the data frame into just the variables we have created. This
makes it a very effective tool for really concise summaries of large
datasets. Here we just have what we need. The mean temperatures for each
month are listed from May to September. If we want to arrange these
means by the hottest temperature, we can go ahead and do so with
arrange. Remember to use our new variable name.
air %>%
group_by(Month) %>%
summarise(Mean_Temp = mean(Temp)) %>%
arrange(desc(Mean_Temp))
## # A tibble: 5 × 2
## Month Mean_Temp
## <int> <dbl>
## 1 8 84.0
## 2 7 83.9
## 3 6 79.1
## 4 9 76.9
## 5 5 65.5
We can see that August is the hottest month on average and May is the coolest on average. What if we want multiple descriptives for our data? Remember that we learned quite a few of these functions in the last tutorial. Let’s go ahead and add some here.
air %>%
group_by(Month) %>%
summarise(Mean_Temp = mean(Temp),
SD_Temp = sd(Temp),
IQR_Temp = IQR(Temp))
## # A tibble: 5 × 4
## Month Mean_Temp SD_Temp IQR_Temp
## <int> <dbl> <dbl> <dbl>
## 1 5 65.5 6.85 9
## 2 6 79.1 6.60 6.75
## 3 7 83.9 4.32 4.5
## 4 8 84.0 6.59 9.5
## 5 9 76.9 8.36 10
We can start adding a bit more complexity. Let’s select the month and wind speeds for each month. Then we will filter for just wind values above 10 mph. Then we will group my month and get the variance of wind values for each month. We will then arrange the variance in descending order. Keep in mind that the order of operations matters here. Make sure you use the proper order or R will bring back the wrong data.
air %>%
select(Month, Wind) %>%
group_by(Month) %>%
summarise(Var_Wind = var(Wind)) %>%
arrange(desc(Var_Wind))
## # A tibble: 5 × 2
## Month Var_Wind
## <int> <dbl>
## 1 6 14.2
## 2 5 12.5
## 3 9 12.0
## 4 8 10.4
## 5 7 9.22
We can see that the variance for wind isn’t too extreme between months, but we can at least visually inspect it here.
Now that we have learned summarise, we can use something
conceptually similar called mutate. This function creates
variables like summarise, but does so without collapsing
the data frame. This means mutate has two functions. First,
you can simply create new variables with mutate without
grouping. Second, you can make grouped transformations of the data and
add them to your dataset without having to eliminate all the other
variables in your data in the process. This may not make a ton of sense
just yet, but we will make some mutate transformations to
see why this is helpful. First, we can make a very simple transformation
by just creating a new variable. Let’s convert the temperature variable
from Fahrenheit to Celcius. We can apply the following formula:
\[ \text{Celcius} = (\text{Degrees Fahrenheit} - 32) \times 5/9 \] To achieve this, we will use the following code:
air %>%
mutate(Celcius = (Temp-32)*(5/9))
## # A tibble: 153 × 7
## Ozone Solar.R Wind Temp Month Day Celcius
## <int> <int> <dbl> <int> <int> <int> <dbl>
## 1 41 190 7.4 67 5 1 19.4
## 2 36 118 8 72 5 2 22.2
## 3 12 149 12.6 74 5 3 23.3
## 4 18 313 11.5 62 5 4 16.7
## 5 NA NA 14.3 56 5 5 13.3
## 6 28 NA 14.9 66 5 6 18.9
## 7 23 299 8.6 65 5 7 18.3
## 8 19 99 13.8 59 5 8 15
## 9 8 19 20.1 61 5 9 16.1
## 10 NA 194 8.6 69 5 10 20.6
## # … with 143 more rows
You can see that it has maintained the data frame while creating this
variable. Now perhaps we are interested in the mean differences in
monthly temperature. We can just add the group_by variable
and create a new mutation that takes the mean of our new variable. We
will name it Mean_Celcius.
air %>%
mutate(Celcius = (Temp-32)*(5/9)) %>%
group_by(Month) %>%
mutate(Mean_Celcius = mean(Celcius))
## # A tibble: 153 × 8
## # Groups: Month [5]
## Ozone Solar.R Wind Temp Month Day Celcius Mean_Celcius
## <int> <int> <dbl> <int> <int> <int> <dbl> <dbl>
## 1 41 190 7.4 67 5 1 19.4 18.6
## 2 36 118 8 72 5 2 22.2 18.6
## 3 12 149 12.6 74 5 3 23.3 18.6
## 4 18 313 11.5 62 5 4 16.7 18.6
## 5 NA NA 14.3 56 5 5 13.3 18.6
## 6 28 NA 14.9 66 5 6 18.9 18.6
## 7 23 299 8.6 65 5 7 18.3 18.6
## 8 19 99 13.8 59 5 8 15 18.6
## 9 8 19 20.1 61 5 9 16.1 18.6
## 10 NA 194 8.6 69 5 10 20.6 18.6
## # … with 143 more rows
You can see now that we have effectively created a column-wise mean
for each month. However, if you want to make further transformations
with mutate but don’t want to continue using the same
grouping, you should use ungroup before doing another
mutation. For example, perhaps I want to calculate wind-adjusted
temperature for each day, but I don’t want to calculate it by month. I
can do so with this manipulation:
air %>%
mutate(Celcius = (Temp-32)*(5/9)) %>%
group_by(Month) %>%
mutate(Mean_Celcius = mean(Celcius)) %>%
ungroup() %>%
mutate(Adj_Temp = Temp/Wind)
## # A tibble: 153 × 9
## Ozone Solar.R Wind Temp Month Day Celcius Mean_Celcius Adj_Temp
## <int> <int> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 41 190 7.4 67 5 1 19.4 18.6 9.05
## 2 36 118 8 72 5 2 22.2 18.6 9
## 3 12 149 12.6 74 5 3 23.3 18.6 5.87
## 4 18 313 11.5 62 5 4 16.7 18.6 5.39
## 5 NA NA 14.3 56 5 5 13.3 18.6 3.92
## 6 28 NA 14.9 66 5 6 18.9 18.6 4.43
## 7 23 299 8.6 65 5 7 18.3 18.6 7.56
## 8 19 99 13.8 59 5 8 15 18.6 4.28
## 9 8 19 20.1 61 5 9 16.1 18.6 3.03
## 10 NA 194 8.6 69 5 10 20.6 18.6 8.02
## # … with 143 more rows
One final easy transformation function is rename. This
just allows you to quickly rename variables.
air %>%
rename(Solar_Radiation = Solar.R)
## # A tibble: 153 × 6
## Ozone Solar_Radiation Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 41 190 7.4 67 5 1
## 2 36 118 8 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## # … with 143 more rows
This can be pretty useful if you have funky names for your variables (which is very often the case if you are getting it from somebody else).
We have done a lot already, but I wanted to highlight two very
important functions before concluding. Oftentimes data will come in
either wide or long format. It is
generally advised to make your data in long format, which means that
every variable should be in its own column, and
every observation should be in rows. Consider for
example the following data from the datarium package, which
records subject ID’s, the treatment group for a depression study, and
four time points.
library(datarium)
depression
## # A tibble: 24 × 6
## id treatment t0 t1 t2 t3
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 ctr 296 175 187 242
## 2 2 ctr 376 329 236 126
## 3 3 ctr 309 238 150 173
## 4 4 ctr 222 60 82 135
## 5 5 ctr 150 271 250 266
## 6 6 ctr 316 291 238 194
## 7 7 ctr 321 364 270 358
## 8 8 ctr 447 402 294 266
## 9 9 ctr 220 70 95 137
## 10 10 ctr 375 335 334 129
## # … with 14 more rows
This can sometimes be problematic because we want to select a single
factor of “time” when doing data manipulation. It may also unnecessarily
clutter up the data frame by forcing us to scroll through several
columns that are unnecessary. We can fix this with the
pivot_longer function. This involves multiple steps, but
generally goes like this:
3:6 to
select the t variables).A direct example is more illustrative of how this works. To use this
for later, we will save this as long_depression.
long_depression <- depression %>%
pivot_longer(cols = 3:6,
names_to = "time",
values_to = "score")
long_depression
## # A tibble: 96 × 4
## id treatment time score
## <fct> <fct> <chr> <dbl>
## 1 1 ctr t0 296
## 2 1 ctr t1 175
## 3 1 ctr t2 187
## 4 1 ctr t3 242
## 5 2 ctr t0 376
## 6 2 ctr t1 329
## 7 2 ctr t2 236
## 8 2 ctr t3 126
## 9 3 ctr t0 309
## 10 3 ctr t1 238
## # … with 86 more rows
Now all of our times are neatly in one column and we can use this
directly if we want to manipulate or plot this data. However, you may
sometimes need to convert data into wide format. The equivalent
pivot_wider helps you accomplish this goal. This time, we
use the names_from argument to pull column names from the
time variable and the values_from argument to
fill the score values into these columns.
long_depression %>%
pivot_wider(names_from = time,
values_from = score)
## # A tibble: 24 × 6
## id treatment t0 t1 t2 t3
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 ctr 296 175 187 242
## 2 2 ctr 376 329 236 126
## 3 3 ctr 309 238 150 173
## 4 4 ctr 222 60 82 135
## 5 5 ctr 150 271 250 266
## 6 6 ctr 316 291 238 194
## 7 7 ctr 321 364 270 358
## 8 8 ctr 447 402 294 266
## 9 9 ctr 220 70 95 137
## 10 10 ctr 375 335 334 129
## # … with 14 more rows
Generally speaking, its much more useful to have data in long format,
so invest some time into learning pivot_longer when you
can.
You now have some mastery of the most essential functions of
dplyr. If you want to learn more about the
tidyverse, there is no better resource than the wonderful
book R for Data Science, written by the author of the
tidyverse Hadley Wickham. I also recommend reading his
article Tidy Data if
you want to understand why long format data is important. I will be
following this tutorial with another instructional article on the
essential package from the tidyverse called
ggplot2, which provides unparalleled data visualization and
fancy plots for statisticians to use.
Thank you for reading this tutorial. If you felt that any part of this tutorial was helpful, and you would like to support me, please consider buying me a cup of coffee.
Check out my other RPubs if you want to learn more about stats in R.
More content is on the way, which I will continue to try to update each week or two when possible. Keep an eye out for it by checking my LinkedIn and other relevant sites.