Introduction
Oftentimes, I have to convert data from the wide format to the long
format in order to perform a longitudinal analysis (Figure 1). In the
wide format, longitudinal data can be labelled from x_01
to
x_12
as depicted in Figure 1. When this is transformed from
the wide to the long data format, the number sequence gets its own
column called time
where the values for x_01
to x_12
are
aligned. See Figure 1 for an example of a wide data format transforming
to a long data format.
Figure 1. Wide to Long data.
R
has many functions that will allow you to do this
transformation. However, I find that that the
pivot_longer()
function from the tidyverse
package is a useful tool to perform this transformation.
In this tutorial, I’ll review a couple of examples of using the
pivot_longer()
function to transform data from the wide to
the long format.
Motivating example
We will use data that is available on my GitHub site.
Step 1: Load data
We can load the data using the following command:
### Load data
url_file1 <- "https://raw.githubusercontent.com/mbounthavong/R-tutorials/refs/heads/main/Data/wide_data1.csv"
data.wide1 <- read.csv(url_file1, header = TRUE)
Once the data is loaded, we can visualize its contents.
### Visualize data
str(data.wide1) # Structure of data
## 'data.frame': 10 obs. of 28 variables:
## $ patientid: int 1 2 3 4 5 6 7 8 9 10
## $ age : int 45 66 73 32 19 21 35 31 22 34
## $ sex : int 1 1 1 0 0 1 0 0 1 1
## $ race : int 1 2 3 3 2 1 1 1 1 2
## $ x_01 : int 38 54 45 28 97 16 7 80 71 15
## $ x_02 : int 39 72 11 98 54 6 40 67 82 52
## $ x_03 : int 54 51 45 66 54 66 59 85 27 20
## $ x_04 : int 31 23 8 15 98 19 39 25 44 36
## $ x_05 : int 86 9 67 7 28 16 34 90 20 49
## $ x_06 : int 6 3 44 24 17 58 37 24 53 29
## $ x_07 : int 31 61 73 45 56 80 66 73 37 51
## $ x_08 : int 50 41 50 80 47 32 13 63 12 33
## $ x_09 : int 27 98 68 48 79 57 41 82 80 70
## $ x_10 : int 84 47 91 7 45 66 14 42 25 84
## $ x_11 : int 11 6 20 8 60 76 33 23 90 72
## $ x_12 : int 67 87 63 36 88 76 13 6 26 92
## $ t_01 : int 86 54 60 52 58 36 46 81 87 98
## $ t_02 : int 66 14 6 42 40 55 25 73 2 68
## $ t_03 : int 8 62 45 19 72 90 76 45 94 23
## $ t_04 : int 99 65 25 28 51 41 54 39 8 77
## $ t_05 : int 56 95 46 62 7 42 24 20 86 33
## $ t_06 : int 64 31 15 38 11 18 15 14 58 41
## $ t_07 : int 18 38 11 29 6 17 74 23 44 6
## $ t_08 : int 3 61 51 33 76 87 4 23 86 11
## $ t_09 : int 7 49 63 67 59 11 47 85 92 37
## $ t_10 : int 69 16 20 6 2 15 21 6 94 34
## $ t_11 : int 6 36 31 67 74 42 12 93 30 87
## $ t_12 : int 72 95 22 63 46 49 23 50 55 51
head(data.wide1) # Visualize the first six rows
## patientid age sex race x_01 x_02 x_03 x_04 x_05 x_06 x_07 x_08 x_09 x_10 x_11
## 1 1 45 1 1 38 39 54 31 86 6 31 50 27 84 11
## 2 2 66 1 2 54 72 51 23 9 3 61 41 98 47 6
## 3 3 73 1 3 45 11 45 8 67 44 73 50 68 91 20
## 4 4 32 0 3 28 98 66 15 7 24 45 80 48 7 8
## 5 5 19 0 2 97 54 54 98 28 17 56 47 79 45 60
## 6 6 21 1 1 16 6 66 19 16 58 80 32 57 66 76
## x_12 t_01 t_02 t_03 t_04 t_05 t_06 t_07 t_08 t_09 t_10 t_11 t_12
## 1 67 86 66 8 99 56 64 18 3 7 69 6 72
## 2 87 54 14 62 65 95 31 38 61 49 16 36 95
## 3 63 60 6 45 25 46 15 11 51 63 20 31 22
## 4 36 52 42 19 28 62 38 29 33 67 6 67 63
## 5 88 58 40 72 51 7 11 6 76 59 2 74 46
## 6 76 36 55 90 41 42 18 17 87 11 15 42 49
Notice the convention used to name the columns. The number sequence
is separated from the variable prefix by _
. Additionally,
each column name has two digits (e.g., 01
, 02
,
…, 12
). The naming convention of the column name will be
exploited to transform the data from the wide to the long format. For
instance, we can separate the prefix and number using the
name_sep
argument of the pivot_longer()
function.
But first, we need to locate the position of the variables we want to transform on the dataframe.
Step 2: Locate position of variables on the dataframe
Since R
treats the dataframe like a matrix, we want to
locate the position of the variables that we want to transform from the
wide to the long format. There are two sequences of variables that we
want:
We want to identify where x_0
and t_12
are
located on the dataframe. This will allow us to input the range of
variables that will undergo the transformation from wide to long.
### Identify location of variable in the dataframe
grep("x_01", names(data.wide1))
## [1] 5
grep("t_12", names(data.wide1))
## [1] 28
Using the grep()
function we identified the starting and
ending positions of x_01
and t_12
,
respectively, as 5 and 28. We will use this to inform the
pivot_long()
function which columns to transform from wide
to long format.
Step 3: Transform from wide to long using
pivot_longer()
Now we can use the pivot_longer()
function to transform
the data from wide to long. First, we need to load the
tidyverse
package. Then, we enter the necessary options
into the pivot_longer()
function such as
names_to =
and names_sep =
.
The cols = 5:28
option allows us to determine which
columns will be transformed from wide to long format. The remaining
columns will remain constant. The names_to =
option will
use the digits on the right of the argument
names_to = c(".value", "time")
as the value. For example,
x_02
will be transformed to time = 02
. The
names_sep = c("\\_")
separates the digit by using the
_
in the x_02
column name.
Putting this altogether, we can write out the entire
pivot_longer()
function as:
### Load library
library("tidyverse")
### Reshape from wide to long
data.long1 <- data.wide1 %>%
pivot_longer(cols = 5:28,
names_to = c(".value", "time"),
names_sep = c("\\_"))
The data.long1
object contains the transformed data.
Afterwards, you will notice that time
is a character or
string variable. We want to change this to a numeric.
### Convert time to numeric
data.long1$time <- as.numeric(data.long1$time)
Step 4: Verify the transformation is successful
We can visualize the data using the head()
function.
### Convert time to numeric
head(data.long1)
## # A tibble: 6 × 7
## patientid age sex race time x t
## <int> <int> <int> <int> <dbl> <int> <int>
## 1 1 45 1 1 1 38 86
## 2 1 45 1 1 2 39 66
## 3 1 45 1 1 3 54 8
## 4 1 45 1 1 4 31 99
## 5 1 45 1 1 5 86 56
## 6 1 45 1 1 6 6 64
Other column naming conventions
In this example, we used the column naming convention
x_01
where the digits are separated from the prefix by
_
. But what is we had a different column naming convention
such that there is not _
separating the digit from the
prefix?
### Load data
url_file2 <- "https://raw.githubusercontent.com/mbounthavong/R-tutorials/refs/heads/main/Data/wide_data2.csv"
data.wide2 <- read.csv(url_file2, header = TRUE)
### Visualize data
str(data.wide2) # Structure of data
## 'data.frame': 10 obs. of 28 variables:
## $ patientid: int 1 2 3 4 5 6 7 8 9 10
## $ age : int 45 66 73 32 19 21 35 31 22 34
## $ sex : int 1 1 1 0 0 1 0 0 1 1
## $ race : int 1 2 3 3 2 1 1 1 1 2
## $ x1 : int 38 54 45 28 97 16 7 80 71 15
## $ x2 : int 39 72 11 98 54 6 40 67 82 52
## $ x3 : int 54 51 45 66 54 66 59 85 27 20
## $ x4 : int 31 23 8 15 98 19 39 25 44 36
## $ x5 : int 86 9 67 7 28 16 34 90 20 49
## $ x6 : int 6 3 44 24 17 58 37 24 53 29
## $ x7 : int 31 61 73 45 56 80 66 73 37 51
## $ x8 : int 50 41 50 80 47 32 13 63 12 33
## $ x9 : int 27 98 68 48 79 57 41 82 80 70
## $ x10 : int 84 47 91 7 45 66 14 42 25 84
## $ x11 : int 11 6 20 8 60 76 33 23 90 72
## $ x12 : int 67 87 63 36 88 76 13 6 26 92
## $ t1 : int 86 54 60 52 58 36 46 81 87 98
## $ t2 : int 66 14 6 42 40 55 25 73 2 68
## $ t3 : int 8 62 45 19 72 90 76 45 94 23
## $ t4 : int 99 65 25 28 51 41 54 39 8 77
## $ t5 : int 56 95 46 62 7 42 24 20 86 33
## $ t6 : int 64 31 15 38 11 18 15 14 58 41
## $ t7 : int 18 38 11 29 6 17 74 23 44 6
## $ t8 : int 3 61 51 33 76 87 4 23 86 11
## $ t9 : int 7 49 63 67 59 11 47 85 92 37
## $ t10 : int 69 16 20 6 2 15 21 6 94 34
## $ t11 : int 6 36 31 67 74 42 12 93 30 87
## $ t12 : int 72 95 22 63 46 49 23 50 55 51
head(data.wide2) # Visualize the first six rows
## patientid age sex race x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 t1 t2 t3 t4 t5
## 1 1 45 1 1 38 39 54 31 86 6 31 50 27 84 11 67 86 66 8 99 56
## 2 2 66 1 2 54 72 51 23 9 3 61 41 98 47 6 87 54 14 62 65 95
## 3 3 73 1 3 45 11 45 8 67 44 73 50 68 91 20 63 60 6 45 25 46
## 4 4 32 0 3 28 98 66 15 7 24 45 80 48 7 8 36 52 42 19 28 62
## 5 5 19 0 2 97 54 54 98 28 17 56 47 79 45 60 88 58 40 72 51 7
## 6 6 21 1 1 16 6 66 19 16 58 80 32 57 66 76 76 36 55 90 41 42
## t6 t7 t8 t9 t10 t11 t12
## 1 64 18 3 7 69 6 72
## 2 31 38 61 49 16 36 95
## 3 15 11 51 63 20 31 22
## 4 38 29 33 67 6 67 63
## 5 11 6 76 59 2 74 46
## 6 18 17 87 11 15 42 49
We can address this using the
names_pattern = "(.)(\\d+)"
option in
pivot_longer()
function. The (.)
denotes the
prefix, and (\\d+)
denotes the digits after the prefix.
For example, the column name x12
has the prefix of
x
and the digit of 12
. The d+
is
important because it will generate a digit regardless if there is a
single digit 1
or multiple digits 12
. If you
don’t include the +
after the d
, the
pivot_function()
will only generate values from
0
to 9
and not for 10
or
greater.
#### Reshape from wide to long
data.long2 <- data.wide2 %>%
pivot_longer(cols = 5:28,
names_to = c(".value", "time"),
names_pattern = "(.)(\\d+)")
### Convert time to numeric
data.long2$time <- as.numeric(data.long1$time)
Once you’ve completed this part, you can visualize the data:
str(data.long2)
## tibble [120 × 7] (S3: tbl_df/tbl/data.frame)
## $ patientid: int [1:120] 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int [1:120] 45 45 45 45 45 45 45 45 45 45 ...
## $ sex : int [1:120] 1 1 1 1 1 1 1 1 1 1 ...
## $ race : int [1:120] 1 1 1 1 1 1 1 1 1 1 ...
## $ time : num [1:120] 1 2 3 4 5 6 7 8 9 10 ...
## $ x : int [1:120] 38 39 54 31 86 6 31 50 27 84 ...
## $ t : int [1:120] 86 66 8 99 56 64 18 3 7 69 ...
head(data.long2)
## # A tibble: 6 × 7
## patientid age sex race time x t
## <int> <int> <int> <int> <dbl> <int> <int>
## 1 1 45 1 1 1 38 86
## 2 1 45 1 1 2 39 66
## 3 1 45 1 1 3 54 8
## 4 1 45 1 1 4 31 99
## 5 1 45 1 1 5 86 56
## 6 1 45 1 1 6 6 64
Conclusions
Transforming data from the wide to long format is convenient when
using the pivot_longer()
function. One needs to be careful
with naming convention of the columns to simplify the process. I
recommend using the same naming convention when constructing your data.
It will make transforming from the wide to long format much more
convenient.
Next time, we will learn how to transform our data from the long to the wide format.
Acknowledgements
I used the R CRAN project page on Pivoting to help me learn how to transform data from wide to long.
I learned about advanced pivoting from an online reference on Data
Wrangling developed by Sara Altman, Bill Behrman, and Hadley
Wickham. They provide an excellent explanation of the
names_to =
and names_pattern =
options in
excellent detail in the Advance pivoting
chapter.
Disclaimers
This is for educational purposes only.
This is a work in progress and updates are expected in the future.