Introduction

I have used R for more than 10 years now, and I really like the environment. I found that all of my analytic tasks in epidemiology and genetics could be done within R. IN recent years, I have ‘discovered’ a package called tidyverse written by Hadley Wickham (the inventor of ggplot2 and many others). This package is really useful for data manipulation prior to a formal analysis by a specific package.

As far as I know, tidyverse incorporates the following packages: ggplot2, dplyr, tidyr, readr, tibble, and margrittr. That means when we load tidyverse, all of those packages are also loaded. Nice and clean. The main tidyverse functions are as follows:

mutate() for adding new variables in a dataframe
recode() for recoding data
select() for selecting variables
filter() for selecting rows
summarise() for calculating summary stat
arrange() for sorting rows of data
sample() and sample_n() for sampling data from a dataset

The great thing about tidyverse is that it allows users to do multiple tasks within a line of command. The tasks must be seprated by the sign %>% (called ‘pipe’) which was developed by Stefan Milton Bache. However, the pipe %>% can only be used with tidyverse functions; it cannot be used for basic R functions. In order to use other R functions, we need to use %$% which is just another pipe.

In this RMarkdown, I show you my experience with tidyverse by using various datasets for illustration. I do hope you like it.

Loading library

library(tidyverse, quietly=T)

## Warning: package 'tidyverse' was built under R version 4.0.2

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.0.2

## Warning: package 'tibble' was built under R version 4.0.2

## Warning: package 'tidyr' was built under R version 4.0.2

## Warning: package 'readr' was built under R version 4.0.2

## Warning: package 'dplyr' was built under R version 4.0.2

## Warning: package 'stringr' was built under R version 4.0.2

## Warning: package 'forcats' was built under R version 4.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(gapminder, quietly=T) # to get gapminder dataset
library(readxl, quietly=T) # for reading excel files
library(MASS, quietly=T) # for interesting datasets

## Warning: package 'MASS' was built under R version 4.0.2

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(magrittr, quietly=T) # for interesting datasets

## Warning: package 'magrittr' was built under R version 4.0.2

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

Loading data

Let us load 3 datasets: birthwt (within the MASS package), ChickWeight (within the MASS package) and gapminder (within the gapminder package).

The birthwt dataset has the following variables (they are all numeric): - low (0 = normal, 1=low birthweight) - age (age of mother) - lwt (mother’s weight in pounds) - race (1=white, 2=black, 3=other) - smoke (1=yes, 0=no) - ptl (number of previous premature labors) - ht (history of hypertension, 1=yes, 0=no) - ui (presence of uterine irritability, 0/1) - ftv (number of physician visits during the first trimester) - bwt (birthweight in grams)

The ChickWeight dataset has 4 variables:

weight (numeric)
Time (numeric)
Chick (factor)
Diet (factor)

The gapminder dataset has 6 variables as follows:

country (character variable)
continent (character variable)
year (numeric)
lifeExp (numeric)
pop (numeric)
gdpPercap (numeric)

We will use tidyverse to work with the datasets. Note that I am going to rename the datasets as cw and gm.

bw = as.data.frame(birthwt)
head(bw)

##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

cw = as.data.frame(ChickWeight)
head(cw)

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

gm = as.data.frame(gapminder)
head(gm)

##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia 1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia 1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia 1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia 1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia 1977  38.438 14880372  786.1134

Recoding data with recode

The way to recode data in R is very straightforward. In tidyverse we use the function recode as follows:

temp = cw %>% mutate(Group = recode(Diet, "1"="A", "2"="B", "3"="C", "4"="D"))
head(temp)

##   weight Time Chick Diet Group
## 1     42    0     1    1     A
## 2     51    2     1    1     A
## 3     59    4     1    1     A
## 4     64    6     1    1     A
## 5     76    8     1    1     A
## 6     93   10     1    1     A

# for the birthwt data, we want to code the race and smoke variables into factor variables:

temp1 = bw %>% mutate(Race = recode(race, "1"="White", "2"="Black", "3"="Other"), Smoking=recode(smoke, "0"="No", "1"="Yes"))
head(temp1)

##    low age lwt race smoke ptl ht ui ftv  bwt  Race Smoking
## 85   0  19 182    2     0   0  0  1   0 2523 Black      No
## 86   0  33 155    3     0   0  0  0   3 2551 Other      No
## 87   0  20 105    1     1   0  0  0   1 2557 White     Yes
## 88   0  21 108    1     1   0  0  1   2 2594 White     Yes
## 89   0  18 107    1     1   0  0  1   0 2600 White     Yes
## 91   0  21 124    3     0   0  0  0   0 2622 Other      No

Creating a new variable with mutate

Similarly, we can recode a numeric variable into a character variable. Let’s group weight into 4 groups with values Q1, Q2, Q3 and Q4 and call the new variable as “Wgroup”. The function is case_when:

temp = cw %>% mutate(Wgroup = case_when(
  weight <63 ~ "Q1",
  weight >= 63 & weight <103 ~ "Q2",
  weight >= 103 & weight <163 ~ "Q3",
  weight >= 163 ~ "Q4"))
head(temp)

##   weight Time Chick Diet Wgroup
## 1     42    0     1    1     Q1
## 2     51    2     1    1     Q1
## 3     59    4     1    1     Q1
## 4     64    6     1    1     Q2
## 5     76    8     1    1     Q2
## 6     93   10     1    1     Q2

One can also create a new variable as a calculated one. In the example below we scale variable weight so that it has mean 0 and variance 1.

temp = cw %>% mutate(Zweight = scale(weight))
head(temp)

##   weight Time Chick Diet    Zweight
## 1     42    0     1    1 -1.1230637
## 2     51    2     1    1 -0.9964315
## 3     59    4     1    1 -0.8838695
## 4     64    6     1    1 -0.8135183
## 5     76    8     1    1 -0.6446753
## 6     93   10     1    1 -0.4054811

Selecting variables and rows of data

The function select allows us to select the variables of interest, whereas the function filter allows us to pick up the rows of interest. We can perform 2 tasks simultaneously in a command line. In the following example, we select 3 variables (low, age, lwt and bwt) for those with bwt < 2500. Note that I use dplyr::select to specifically tell R that the function select is from dplyr:

temp = bw %>% dplyr::select(low, age, lwt, bwt) %>% filter(bwt < 2500)

We can also use %in% to select specific rows we want:

temp = gapminder %>% filter(country %in% c("Vietnam", "Korea, Rep.", "Thailand")) %>% dplyr::select(country, year, lifeExp)

Removing missing values

Sometimes, we want to remove missing data from a variable, and this can be done by drop_na function as follows:

# Generate 5 rows of missing values from the bw dataset 
bw[1:5, 3] = NA
# remove the missing values
temp = bw %>% drop_na(lwt)
head(temp)

##    low age lwt race smoke ptl ht ui ftv  bwt
## 91   0  21 124    3     0   0  0  0   0 2622
## 92   0  22 118    1     0   0  0  0   1 2637
## 93   0  17 103    3     0   0  0  0   1 2637
## 94   0  29 123    1     1   0  0  0   1 2663
## 95   0  26 113    1     1   0  0  0   0 2665
## 96   0  19  95    3     0   0  0  0   0 2722

Sorting data

In tidyverse, we can sort the data by multiple variables, and this can be done via the function arrange as follows:

temp = bw %>% dplyr::select(age, bwt) %>% dplyr::arrange(age, bwt)
head(temp)

##     age  bwt
## 78   14 2466
## 81   14 2495
## 213  14 3941
## 57   15 2353
## 62   15 2381
## 102  15 2778

Sampling data

Random selecting 50 rows from bw

temp = sample_n(bw, 50)

Random selecting 40% rows from bw

temp = sample_frac(bw, 0.40)

Summary by group

Another useful function is summarise. This function allows us to summary multilevel data by group which can then be used for further analysis. Back to the chicken weight data, we want to calculate weight for each diet group:

# Summarising continuous variables 
t = cw %>% group_by(Diet) %>% summarise(n = n(), mean = mean(weight), var=var(weight))
# or 
t = cw %>% dplyr::group_by(Diet) %>% dplyr::summarise(n = n(), mean = mean(weight), var=var(weight))
t

## # A tibble: 4 x 4
##   Diet      n  mean   var
##   <fct> <int> <dbl> <dbl>
## 1 1       220  103. 3210.
## 2 2       120  123. 5128.
## 3 3       120  143. 7489.
## 4 4       118  135. 4737.

# Count and percentage
cw %>% count(Diet)

##   Diet   n
## 1    1 220
## 2    2 120
## 3    3 120
## 4    4 118

cw %>% count(Diet) %>% mutate(percent = round(n/sum(n)*100, 1))

##   Diet   n percent
## 1    1 220    38.1
## 2    2 120    20.8
## 3    3 120    20.8
## 4    4 118    20.4

# Count nested within group 
head(bw)

##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19  NA    2     0   0  0  1   0 2523
## 86   0  33  NA    3     0   0  0  0   3 2551
## 87   0  20  NA    1     1   0  0  0   1 2557
## 88   0  21  NA    1     1   0  0  1   2 2594
## 89   0  18  NA    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

bw %>% count(race, low) %>% group_by(race) %>% mutate(pct = n/sum(n)*100)

## # A tibble: 6 x 4
## # Groups:   race [3]
##    race   low     n   pct
##   <int> <int> <int> <dbl>
## 1     1     0    73  76.0
## 2     1     1    23  24.0
## 3     2     0    15  57.7
## 4     2     1    11  42.3
## 5     3     0    42  62.7
## 6     3     1    25  37.3

Now, we can also summarise data stratified by time:

t = cw %>% dplyr::group_by(Diet, Time) %>% dplyr::summarise(n = n(), mean = mean(weight), var=var(weight))

## `summarise()` has grouped output by 'Diet'. You can override using the `.groups` argument.

## # A tibble: 48 x 5
## # Groups:   Diet [4]
##    Diet   Time     n  mean      var
##    <fct> <dbl> <int> <dbl>    <dbl>
##  1 1         0    20  41.4    0.989
##  2 1         2    20  47.2   18.3  
##  3 1         4    19  56.5   17.0  
##  4 1         6    19  66.8   60.2  
##  5 1         8    19  79.7  190.   
##  6 1        10    19  93.1  508.   
##  7 1        12    19 109.  1064.   
##  8 1        14    18 123.  1398.   
##  9 1        16    17 145.  2028.   
## 10 1        18    17 159.  2423.   
## # … with 38 more rows

Multiple tasks using pipe %>%

As mentioned above, it is possible to do multiple tasks with the pipe %>%. Here is an example: select 3 countries from the gapminder dataset; summarise the life expectancy by country; then boxplot the data:

gapminder %>% 
  filter(country %in% c("Vietnam", "Korea, Rep.", "Thailand"))  %>% 
  dplyr::group_by(country) %>% 
  dplyr::summarise(meanLE = round(mean(lifeExp), 1)) %>% 
  ggplot(aes(x=country, y=meanLE, fill=country, label=meanLE)) + geom_bar(stat="identity") + geom_text(size=3, col="white", position=position_stack(vjust=0.9)) + labs(x="Country", y="Life expectancy")

However, this does not work for base R functions. For base functions we need to use a different pipe: %$%. In the example below we calculate the correlation between lwt and bwt for those with bwt < 2500:

bw %$% subset(bw, bwt<2500) %$% cor(lwt, bwt, use = "pairwise.complete.obs")

## [1] -0.1061342

Data manipulation using ‘tidyverse’

T Nguyen

02/10/2021