I have used R for more than 10 years now, and I really like the environment. I found that all of my analytic tasks in epidemiology and genetics could be done within R. IN recent years, I have ‘discovered’ a package called tidyverse written by Hadley Wickham (the inventor of ggplot2 and many others). This package is really useful for data manipulation prior to a formal analysis by a specific package.
As far as I know, tidyverse incorporates the following packages: ggplot2, dplyr, tidyr, readr, tibble, and margrittr. That means when we load tidyverse, all of those packages are also loaded. Nice and clean. The main tidyverse functions are as follows:
The great thing about tidyverse is that it allows users to do multiple tasks within a line of command. The tasks must be seprated by the sign %>% (called ‘pipe’) which was developed by Stefan Milton Bache. However, the pipe %>% can only be used with tidyverse functions; it cannot be used for basic R functions. In order to use other R functions, we need to use %$% which is just another pipe.
In this RMarkdown, I show you my experience with tidyverse by using various datasets for illustration. I do hope you like it.
library(tidyverse, quietly=T)
## Warning: package 'tidyverse' was built under R version 4.0.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'tibble' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'readr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## Warning: package 'stringr' was built under R version 4.0.2
## Warning: package 'forcats' was built under R version 4.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gapminder, quietly=T) # to get gapminder dataset
library(readxl, quietly=T) # for reading excel files
library(MASS, quietly=T) # for interesting datasets
## Warning: package 'MASS' was built under R version 4.0.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(magrittr, quietly=T) # for interesting datasets
## Warning: package 'magrittr' was built under R version 4.0.2
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
Let us load 3 datasets: birthwt (within the MASS package), ChickWeight (within the MASS package) and gapminder (within the gapminder package).
The birthwt dataset has the following variables (they are all numeric): - low (0 = normal, 1=low birthweight) - age (age of mother) - lwt (mother’s weight in pounds) - race (1=white, 2=black, 3=other) - smoke (1=yes, 0=no) - ptl (number of previous premature labors) - ht (history of hypertension, 1=yes, 0=no) - ui (presence of uterine irritability, 0/1) - ftv (number of physician visits during the first trimester) - bwt (birthweight in grams)
The ChickWeight dataset has 4 variables:
The gapminder dataset has 6 variables as follows:
We will use tidyverse to work with the datasets. Note that I am going to rename the datasets as cw and gm.
bw = as.data.frame(birthwt)
head(bw)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
cw = as.data.frame(ChickWeight)
head(cw)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
gm = as.data.frame(gapminder)
head(gm)
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
The way to recode data in R is very straightforward. In tidyverse we use the function recode as follows:
temp = cw %>% mutate(Group = recode(Diet, "1"="A", "2"="B", "3"="C", "4"="D"))
head(temp)
## weight Time Chick Diet Group
## 1 42 0 1 1 A
## 2 51 2 1 1 A
## 3 59 4 1 1 A
## 4 64 6 1 1 A
## 5 76 8 1 1 A
## 6 93 10 1 1 A
# for the birthwt data, we want to code the race and smoke variables into factor variables:
temp1 = bw %>% mutate(Race = recode(race, "1"="White", "2"="Black", "3"="Other"), Smoking=recode(smoke, "0"="No", "1"="Yes"))
head(temp1)
## low age lwt race smoke ptl ht ui ftv bwt Race Smoking
## 85 0 19 182 2 0 0 0 1 0 2523 Black No
## 86 0 33 155 3 0 0 0 0 3 2551 Other No
## 87 0 20 105 1 1 0 0 0 1 2557 White Yes
## 88 0 21 108 1 1 0 0 1 2 2594 White Yes
## 89 0 18 107 1 1 0 0 1 0 2600 White Yes
## 91 0 21 124 3 0 0 0 0 0 2622 Other No
Similarly, we can recode a numeric variable into a character variable. Let’s group weight into 4 groups with values Q1, Q2, Q3 and Q4 and call the new variable as “Wgroup”. The function is case_when:
temp = cw %>% mutate(Wgroup = case_when(
weight <63 ~ "Q1",
weight >= 63 & weight <103 ~ "Q2",
weight >= 103 & weight <163 ~ "Q3",
weight >= 163 ~ "Q4"))
head(temp)
## weight Time Chick Diet Wgroup
## 1 42 0 1 1 Q1
## 2 51 2 1 1 Q1
## 3 59 4 1 1 Q1
## 4 64 6 1 1 Q2
## 5 76 8 1 1 Q2
## 6 93 10 1 1 Q2
One can also create a new variable as a calculated one. In the example below we scale variable weight so that it has mean 0 and variance 1.
temp = cw %>% mutate(Zweight = scale(weight))
head(temp)
## weight Time Chick Diet Zweight
## 1 42 0 1 1 -1.1230637
## 2 51 2 1 1 -0.9964315
## 3 59 4 1 1 -0.8838695
## 4 64 6 1 1 -0.8135183
## 5 76 8 1 1 -0.6446753
## 6 93 10 1 1 -0.4054811
The function select allows us to select the variables of interest, whereas the function filter allows us to pick up the rows of interest. We can perform 2 tasks simultaneously in a command line. In the following example, we select 3 variables (low, age, lwt and bwt) for those with bwt < 2500. Note that I use dplyr::select to specifically tell R that the function select is from dplyr:
temp = bw %>% dplyr::select(low, age, lwt, bwt) %>% filter(bwt < 2500)
We can also use %in% to select specific rows we want:
temp = gapminder %>% filter(country %in% c("Vietnam", "Korea, Rep.", "Thailand")) %>% dplyr::select(country, year, lifeExp)
Sometimes, we want to remove missing data from a variable, and this can be done by drop_na function as follows:
# Generate 5 rows of missing values from the bw dataset
bw[1:5, 3] = NA
# remove the missing values
temp = bw %>% drop_na(lwt)
head(temp)
## low age lwt race smoke ptl ht ui ftv bwt
## 91 0 21 124 3 0 0 0 0 0 2622
## 92 0 22 118 1 0 0 0 0 1 2637
## 93 0 17 103 3 0 0 0 0 1 2637
## 94 0 29 123 1 1 0 0 0 1 2663
## 95 0 26 113 1 1 0 0 0 0 2665
## 96 0 19 95 3 0 0 0 0 0 2722
In tidyverse, we can sort the data by multiple variables, and this can be done via the function arrange as follows:
temp = bw %>% dplyr::select(age, bwt) %>% dplyr::arrange(age, bwt)
head(temp)
## age bwt
## 78 14 2466
## 81 14 2495
## 213 14 3941
## 57 15 2353
## 62 15 2381
## 102 15 2778
Random selecting 50 rows from bw
temp = sample_n(bw, 50)
Random selecting 40% rows from bw
temp = sample_frac(bw, 0.40)
Another useful function is summarise. This function allows us to summary multilevel data by group which can then be used for further analysis. Back to the chicken weight data, we want to calculate weight for each diet group:
# Summarising continuous variables
t = cw %>% group_by(Diet) %>% summarise(n = n(), mean = mean(weight), var=var(weight))
# or
t = cw %>% dplyr::group_by(Diet) %>% dplyr::summarise(n = n(), mean = mean(weight), var=var(weight))
t
## # A tibble: 4 x 4
## Diet n mean var
## <fct> <int> <dbl> <dbl>
## 1 1 220 103. 3210.
## 2 2 120 123. 5128.
## 3 3 120 143. 7489.
## 4 4 118 135. 4737.
# Count and percentage
cw %>% count(Diet)
## Diet n
## 1 1 220
## 2 2 120
## 3 3 120
## 4 4 118
cw %>% count(Diet) %>% mutate(percent = round(n/sum(n)*100, 1))
## Diet n percent
## 1 1 220 38.1
## 2 2 120 20.8
## 3 3 120 20.8
## 4 4 118 20.4
# Count nested within group
head(bw)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 NA 2 0 0 0 1 0 2523
## 86 0 33 NA 3 0 0 0 0 3 2551
## 87 0 20 NA 1 1 0 0 0 1 2557
## 88 0 21 NA 1 1 0 0 1 2 2594
## 89 0 18 NA 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
bw %>% count(race, low) %>% group_by(race) %>% mutate(pct = n/sum(n)*100)
## # A tibble: 6 x 4
## # Groups: race [3]
## race low n pct
## <int> <int> <int> <dbl>
## 1 1 0 73 76.0
## 2 1 1 23 24.0
## 3 2 0 15 57.7
## 4 2 1 11 42.3
## 5 3 0 42 62.7
## 6 3 1 25 37.3
Now, we can also summarise data stratified by time:
t = cw %>% dplyr::group_by(Diet, Time) %>% dplyr::summarise(n = n(), mean = mean(weight), var=var(weight))
## `summarise()` has grouped output by 'Diet'. You can override using the `.groups` argument.
t
## # A tibble: 48 x 5
## # Groups: Diet [4]
## Diet Time n mean var
## <fct> <dbl> <int> <dbl> <dbl>
## 1 1 0 20 41.4 0.989
## 2 1 2 20 47.2 18.3
## 3 1 4 19 56.5 17.0
## 4 1 6 19 66.8 60.2
## 5 1 8 19 79.7 190.
## 6 1 10 19 93.1 508.
## 7 1 12 19 109. 1064.
## 8 1 14 18 123. 1398.
## 9 1 16 17 145. 2028.
## 10 1 18 17 159. 2423.
## # … with 38 more rows
As mentioned above, it is possible to do multiple tasks with the pipe %>%. Here is an example: select 3 countries from the gapminder dataset; summarise the life expectancy by country; then boxplot the data:
gapminder %>%
filter(country %in% c("Vietnam", "Korea, Rep.", "Thailand")) %>%
dplyr::group_by(country) %>%
dplyr::summarise(meanLE = round(mean(lifeExp), 1)) %>%
ggplot(aes(x=country, y=meanLE, fill=country, label=meanLE)) + geom_bar(stat="identity") + geom_text(size=3, col="white", position=position_stack(vjust=0.9)) + labs(x="Country", y="Life expectancy")
However, this does not work for base R functions. For base functions we need to use a different pipe: %$%. In the example below we calculate the correlation between lwt and bwt for those with bwt < 2500:
bw %$% subset(bw, bwt<2500) %$% cor(lwt, bwt, use = "pairwise.complete.obs")
## [1] -0.1061342