## Introduction

I have used R for more than 10 years now, and I really like the environment. I found that all of my analytic tasks in epidemiology and genetics could be done within R. IN recent years, I have ‘discovered’ a package called tidyverse written by Hadley Wickham (the inventor of ggplot2 and many others). This package is really useful for data manipulation prior to a formal analysis by a specific package.

As far as I know, tidyverse incorporates the following packages: ggplot2, dplyr, tidyr, readr, tibble, and margrittr. That means when we load tidyverse, all of those packages are also loaded. Nice and clean. The main tidyverse functions are as follows:

• mutate() for adding new variables in a dataframe
• recode() for recoding data
• select() for selecting variables
• filter() for selecting rows
• summarise() for calculating summary stat
• arrange() for sorting rows of data
• sample() and sample_n() for sampling data from a dataset

The great thing about tidyverse is that it allows users to do multiple tasks within a line of command. The tasks must be seprated by the sign %>% (called ‘pipe’) which was developed by Stefan Milton Bache. However, the pipe %>% can only be used with tidyverse functions; it cannot be used for basic R functions. In order to use other R functions, we need to use %$% which is just another pipe. In this RMarkdown, I show you my experience with tidyverse by using various datasets for illustration. I do hope you like it. # Loading library library(tidyverse, quietly=T) ## Warning: package 'tidyverse' was built under R version 4.0.2 ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ## ✓ tibble 3.1.2 ✓ dplyr 1.0.7 ## ✓ tidyr 1.1.3 ✓ stringr 1.4.0 ## ✓ readr 1.4.0 ✓ forcats 0.5.1 ## Warning: package 'ggplot2' was built under R version 4.0.2 ## Warning: package 'tibble' was built under R version 4.0.2 ## Warning: package 'tidyr' was built under R version 4.0.2 ## Warning: package 'readr' was built under R version 4.0.2 ## Warning: package 'dplyr' was built under R version 4.0.2 ## Warning: package 'stringr' was built under R version 4.0.2 ## Warning: package 'forcats' was built under R version 4.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() library(gapminder, quietly=T) # to get gapminder dataset library(readxl, quietly=T) # for reading excel files library(MASS, quietly=T) # for interesting datasets  ## Warning: package 'MASS' was built under R version 4.0.2 ## ## Attaching package: 'MASS' ## The following object is masked from 'package:dplyr': ## ## select library(magrittr, quietly=T) # for interesting datasets ## Warning: package 'magrittr' was built under R version 4.0.2 ## ## Attaching package: 'magrittr' ## The following object is masked from 'package:purrr': ## ## set_names ## The following object is masked from 'package:tidyr': ## ## extract ## Loading data Let us load 3 datasets: birthwt (within the MASS package), ChickWeight (within the MASS package) and gapminder (within the gapminder package). The birthwt dataset has the following variables (they are all numeric): - low (0 = normal, 1=low birthweight) - age (age of mother) - lwt (mother’s weight in pounds) - race (1=white, 2=black, 3=other) - smoke (1=yes, 0=no) - ptl (number of previous premature labors) - ht (history of hypertension, 1=yes, 0=no) - ui (presence of uterine irritability, 0/1) - ftv (number of physician visits during the first trimester) - bwt (birthweight in grams) The ChickWeight dataset has 4 variables: • weight (numeric) • Time (numeric) • Chick (factor) • Diet (factor) The gapminder dataset has 6 variables as follows: • country (character variable) • continent (character variable) • year (numeric) • lifeExp (numeric) • pop (numeric) • gdpPercap (numeric) We will use tidyverse to work with the datasets. Note that I am going to rename the datasets as cw and gm. bw = as.data.frame(birthwt) head(bw) ## low age lwt race smoke ptl ht ui ftv bwt ## 85 0 19 182 2 0 0 0 1 0 2523 ## 86 0 33 155 3 0 0 0 0 3 2551 ## 87 0 20 105 1 1 0 0 0 1 2557 ## 88 0 21 108 1 1 0 0 1 2 2594 ## 89 0 18 107 1 1 0 0 1 0 2600 ## 91 0 21 124 3 0 0 0 0 0 2622 cw = as.data.frame(ChickWeight) head(cw) ## weight Time Chick Diet ## 1 42 0 1 1 ## 2 51 2 1 1 ## 3 59 4 1 1 ## 4 64 6 1 1 ## 5 76 8 1 1 ## 6 93 10 1 1 gm = as.data.frame(gapminder) head(gm) ## country continent year lifeExp pop gdpPercap ## 1 Afghanistan Asia 1952 28.801 8425333 779.4453 ## 2 Afghanistan Asia 1957 30.332 9240934 820.8530 ## 3 Afghanistan Asia 1962 31.997 10267083 853.1007 ## 4 Afghanistan Asia 1967 34.020 11537966 836.1971 ## 5 Afghanistan Asia 1972 36.088 13079460 739.9811 ## 6 Afghanistan Asia 1977 38.438 14880372 786.1134 ## Recoding data with recode The way to recode data in R is very straightforward. In tidyverse we use the function recode as follows: temp = cw %>% mutate(Group = recode(Diet, "1"="A", "2"="B", "3"="C", "4"="D")) head(temp) ## weight Time Chick Diet Group ## 1 42 0 1 1 A ## 2 51 2 1 1 A ## 3 59 4 1 1 A ## 4 64 6 1 1 A ## 5 76 8 1 1 A ## 6 93 10 1 1 A # for the birthwt data, we want to code the race and smoke variables into factor variables: temp1 = bw %>% mutate(Race = recode(race, "1"="White", "2"="Black", "3"="Other"), Smoking=recode(smoke, "0"="No", "1"="Yes")) head(temp1) ## low age lwt race smoke ptl ht ui ftv bwt Race Smoking ## 85 0 19 182 2 0 0 0 1 0 2523 Black No ## 86 0 33 155 3 0 0 0 0 3 2551 Other No ## 87 0 20 105 1 1 0 0 0 1 2557 White Yes ## 88 0 21 108 1 1 0 0 1 2 2594 White Yes ## 89 0 18 107 1 1 0 0 1 0 2600 White Yes ## 91 0 21 124 3 0 0 0 0 0 2622 Other No ## Creating a new variable with mutate Similarly, we can recode a numeric variable into a character variable. Let’s group weight into 4 groups with values Q1, Q2, Q3 and Q4 and call the new variable as “Wgroup”. The function is case_when: temp = cw %>% mutate(Wgroup = case_when( weight <63 ~ "Q1", weight >= 63 & weight <103 ~ "Q2", weight >= 103 & weight <163 ~ "Q3", weight >= 163 ~ "Q4")) head(temp) ## weight Time Chick Diet Wgroup ## 1 42 0 1 1 Q1 ## 2 51 2 1 1 Q1 ## 3 59 4 1 1 Q1 ## 4 64 6 1 1 Q2 ## 5 76 8 1 1 Q2 ## 6 93 10 1 1 Q2 One can also create a new variable as a calculated one. In the example below we scale variable weight so that it has mean 0 and variance 1. temp = cw %>% mutate(Zweight = scale(weight)) head(temp) ## weight Time Chick Diet Zweight ## 1 42 0 1 1 -1.1230637 ## 2 51 2 1 1 -0.9964315 ## 3 59 4 1 1 -0.8838695 ## 4 64 6 1 1 -0.8135183 ## 5 76 8 1 1 -0.6446753 ## 6 93 10 1 1 -0.4054811 ## Selecting variables and rows of data The function select allows us to select the variables of interest, whereas the function filter allows us to pick up the rows of interest. We can perform 2 tasks simultaneously in a command line. In the following example, we select 3 variables (low, age, lwt and bwt) for those with bwt < 2500. Note that I use dplyr::select to specifically tell R that the function select is from dplyr: temp = bw %>% dplyr::select(low, age, lwt, bwt) %>% filter(bwt < 2500) We can also use %in% to select specific rows we want: temp = gapminder %>% filter(country %in% c("Vietnam", "Korea, Rep.", "Thailand")) %>% dplyr::select(country, year, lifeExp) ## Removing missing values Sometimes, we want to remove missing data from a variable, and this can be done by drop_na function as follows: # Generate 5 rows of missing values from the bw dataset bw[1:5, 3] = NA # remove the missing values temp = bw %>% drop_na(lwt) head(temp) ## low age lwt race smoke ptl ht ui ftv bwt ## 91 0 21 124 3 0 0 0 0 0 2622 ## 92 0 22 118 1 0 0 0 0 1 2637 ## 93 0 17 103 3 0 0 0 0 1 2637 ## 94 0 29 123 1 1 0 0 0 1 2663 ## 95 0 26 113 1 1 0 0 0 0 2665 ## 96 0 19 95 3 0 0 0 0 0 2722 ## Sorting data In tidyverse, we can sort the data by multiple variables, and this can be done via the function arrange as follows: temp = bw %>% dplyr::select(age, bwt) %>% dplyr::arrange(age, bwt) head(temp) ## age bwt ## 78 14 2466 ## 81 14 2495 ## 213 14 3941 ## 57 15 2353 ## 62 15 2381 ## 102 15 2778 ## Sampling data Random selecting 50 rows from bw temp = sample_n(bw, 50) Random selecting 40% rows from bw temp = sample_frac(bw, 0.40) ## Summary by group Another useful function is summarise. This function allows us to summary multilevel data by group which can then be used for further analysis. Back to the chicken weight data, we want to calculate weight for each diet group: # Summarising continuous variables t = cw %>% group_by(Diet) %>% summarise(n = n(), mean = mean(weight), var=var(weight)) # or t = cw %>% dplyr::group_by(Diet) %>% dplyr::summarise(n = n(), mean = mean(weight), var=var(weight)) t ## # A tibble: 4 x 4 ## Diet n mean var ## <fct> <int> <dbl> <dbl> ## 1 1 220 103. 3210. ## 2 2 120 123. 5128. ## 3 3 120 143. 7489. ## 4 4 118 135. 4737. # Count and percentage cw %>% count(Diet) ## Diet n ## 1 1 220 ## 2 2 120 ## 3 3 120 ## 4 4 118 cw %>% count(Diet) %>% mutate(percent = round(n/sum(n)*100, 1)) ## Diet n percent ## 1 1 220 38.1 ## 2 2 120 20.8 ## 3 3 120 20.8 ## 4 4 118 20.4 # Count nested within group head(bw) ## low age lwt race smoke ptl ht ui ftv bwt ## 85 0 19 NA 2 0 0 0 1 0 2523 ## 86 0 33 NA 3 0 0 0 0 3 2551 ## 87 0 20 NA 1 1 0 0 0 1 2557 ## 88 0 21 NA 1 1 0 0 1 2 2594 ## 89 0 18 NA 1 1 0 0 1 0 2600 ## 91 0 21 124 3 0 0 0 0 0 2622 bw %>% count(race, low) %>% group_by(race) %>% mutate(pct = n/sum(n)*100)  ## # A tibble: 6 x 4 ## # Groups: race [3] ## race low n pct ## <int> <int> <int> <dbl> ## 1 1 0 73 76.0 ## 2 1 1 23 24.0 ## 3 2 0 15 57.7 ## 4 2 1 11 42.3 ## 5 3 0 42 62.7 ## 6 3 1 25 37.3 Now, we can also summarise data stratified by time: t = cw %>% dplyr::group_by(Diet, Time) %>% dplyr::summarise(n = n(), mean = mean(weight), var=var(weight)) ## summarise() has grouped output by 'Diet'. You can override using the .groups argument. t ## # A tibble: 48 x 5 ## # Groups: Diet [4] ## Diet Time n mean var ## <fct> <dbl> <int> <dbl> <dbl> ## 1 1 0 20 41.4 0.989 ## 2 1 2 20 47.2 18.3 ## 3 1 4 19 56.5 17.0 ## 4 1 6 19 66.8 60.2 ## 5 1 8 19 79.7 190. ## 6 1 10 19 93.1 508. ## 7 1 12 19 109. 1064. ## 8 1 14 18 123. 1398. ## 9 1 16 17 145. 2028. ## 10 1 18 17 159. 2423. ## # … with 38 more rows # Multiple tasks using pipe %>% As mentioned above, it is possible to do multiple tasks with the pipe %>%. Here is an example: select 3 countries from the gapminder dataset; summarise the life expectancy by country; then boxplot the data: gapminder %>% filter(country %in% c("Vietnam", "Korea, Rep.", "Thailand")) %>% dplyr::group_by(country) %>% dplyr::summarise(meanLE = round(mean(lifeExp), 1)) %>% ggplot(aes(x=country, y=meanLE, fill=country, label=meanLE)) + geom_bar(stat="identity") + geom_text(size=3, col="white", position=position_stack(vjust=0.9)) + labs(x="Country", y="Life expectancy") However, this does not work for base R functions. For base functions we need to use a different pipe: %$%. In the example below we calculate the correlation between lwt and bwt for those with bwt < 2500:

bw %$% subset(bw, bwt<2500) %$% cor(lwt, bwt, use = "pairwise.complete.obs")  
## [1] -0.1061342