Introduction to the tidyverse

The tidyverse is a collection of packages by the creators of RStudio that share an approach to data science.

The tidyverse packages replace some base R functions with alternatives intended to be more user friendly for data scientists who follow the data science life cycle of importing, tidying, transforming, visualizing, modeling, and communicating data.

We will only be covering a few of the packages from the tidyverse.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

birthweight <- read.csv("birthweight.csv")

Pipes: combining tidyverse functions

The tidyverse uses piping to send the output of one function to the next function, rather than the nesting used in base R. The “pipe” is written with a greater-than symbol sandwiched between two percent signs, like this: %>%.

birthweight %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker)

##   birth.date length birthweight smoker
## 1  3/23/1967     50        2.51    yes
## 2   1/8/1968     47        2.66    yes
## 3   4/2/1968     48        2.37    yes
## 4  7/18/1968     46        2.05    yes
## 5  9/16/1968     48        1.92    yes
## 6  9/27/1968     43        2.65     no

# equivalent base R subsetting (the original row numbers are kept):
birthweight[birthweight$low.birthweight == TRUE, c("birth.date", "length", "birthweight", "smoker")]

##    birth.date length birthweight smoker
## 6   3/23/1967     50        2.51    yes
## 22   1/8/1968     47        2.66    yes
## 28   4/2/1968     48        2.37    yes
## 32  7/18/1968     46        2.05    yes
## 37  9/16/1968     48        1.92    yes
## 38  9/27/1968     43        2.65     no
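To make the contrast with nesting concrete, here is a minimal sketch that runs the same two steps both ways. The `toy` data frame and its values are hypothetical stand-ins for birthweight.csv, not the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for birthweight.csv
toy <- data.frame(
  id = 1:4,
  low.birthweight = c(0, 1, 0, 1),
  birthweight = c(3.2, 2.4, 3.6, 2.1)
)

# Piped: read top to bottom, one step per line
piped <- toy %>%
  filter(low.birthweight == 1) %>%
  select(id, birthweight)

# Nested: the same calls, read from the inside out
nested <- select(filter(toy, low.birthweight == 1), id, birthweight)

# Both forms produce the same data frame
identical(piped, nested)
```

The pipe simply passes the left-hand side as the first argument of the function on the right, so the choice between the two forms is about readability rather than results.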

7.3 Transforming data

The separate() function splits the “birth.date” column into “month,” “day,” and “year” columns in a single step.

birthweight %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker) %>%
  separate(col = birth.date, sep = "[/]", into = c("month", "day", "year"))

##   month day year length birthweight smoker
## 1     3  23 1967     50        2.51    yes
## 2     1   8 1968     47        2.66    yes
## 3     4   2 1968     48        2.37    yes
## 4     7  18 1968     46        2.05    yes
## 5     9  16 1968     48        1.92    yes
## 6     9  27 1968     43        2.65     no
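By default, separate() returns the new columns as text. If numeric columns are wanted, the `convert = TRUE` argument can be used, as sketched below. The `toy` data frame is a hypothetical example that mimics the date format shown above.

```r
library(tidyr)

# Hypothetical toy data; the dates mimic the format in birthweight.csv
toy <- data.frame(birth.date = c("3/23/1967", "1/8/1968"),
                  stringsAsFactors = FALSE)

# Default: the new columns are character vectors
as_text <- separate(toy, col = birth.date, sep = "/",
                    into = c("month", "day", "year"))
class(as_text$year)

# convert = TRUE runs type conversion on the new columns
as_number <- separate(toy, col = birth.date, sep = "/",
                      into = c("month", "day", "year"), convert = TRUE)
class(as_number$year)
```

Converting at this step avoids a later as.numeric() call when the year or month is needed for sorting or arithmetic.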

The mutate() function adds a new column based on data contained in the existing columns.

birthweight %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker) %>%
  mutate(d = birthweight / length)

##   birth.date length birthweight smoker          d
## 1  3/23/1967     50        2.51    yes 0.05020000
## 2   1/8/1968     47        2.66    yes 0.05659574
## 3   4/2/1968     48        2.37    yes 0.04937500
## 4  7/18/1968     46        2.05    yes 0.04456522
## 5  9/16/1968     48        1.92    yes 0.04000000
## 6  9/27/1968     43        2.65     no 0.06162791
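A single mutate() call can create several columns, and later columns can refer to earlier ones. The sketch below assumes a hypothetical `toy` data frame; the column names `weight.per.length` and `scaled` are illustrative and do not come from the original data.

```r
library(dplyr)

# Hypothetical toy data using the same column names as above
toy <- data.frame(
  length = c(50, 47, 48),
  birthweight = c(2.51, 2.66, 2.37)
)

result <- toy %>%
  mutate(
    # A descriptive name makes the new column easier to interpret
    weight.per.length = birthweight / length,
    # A column created earlier in the same call can be reused
    scaled = weight.per.length * 1000
  )

result
```

The existing columns are left unchanged; mutate() only appends the new ones to the right.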

7.4 Summarizing data

The group_by() and summarize() functions work together: group_by() defines groups using one or more categorical variables, and summarize() applies a function to each group.

birthweight %>%
  group_by(smoker) %>%
  summarize(mean.birthweight = mean(birthweight))

## # A tibble: 2 × 2
##   smoker mean.birthweight
##   <chr>             <dbl>
## 1 no                 3.51
## 2 yes                3.13

birthweight %>%
  group_by(smoker, low.birthweight) %>%
  summarize(mean.birthweight = mean(birthweight))

## `summarise()` has grouped output by 'smoker'. You can override using the
## `.groups` argument.

## # A tibble: 4 × 3
## # Groups:   smoker [2]
##   smoker low.birthweight mean.birthweight
##   <chr>            <int>            <dbl>
## 1 no                   0             3.55
## 2 no                   1             2.65
## 3 yes                  0             3.38
## 4 yes                  1             2.30
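The message about grouped output can be avoided by stating what should happen to the grouping. A minimal sketch, using a hypothetical `toy` data frame, that also counts the rows in each group with n():

```r
library(dplyr)

# Hypothetical toy data using the same column names as above
toy <- data.frame(
  smoker = c("no", "no", "no", "yes", "yes", "yes"),
  low.birthweight = c(0, 0, 1, 0, 1, 1),
  birthweight = c(3.6, 3.4, 2.6, 3.3, 2.4, 2.2)
)

by_group <- toy %>%
  group_by(smoker, low.birthweight) %>%
  summarize(n = n(),                              # rows per group
            mean.birthweight = mean(birthweight), # mean per group
            .groups = "drop")                     # return an ungrouped tibble

by_group
```

Dropping the groups is a reasonable default when the summary is the end result; keeping them only matters if further per-group steps follow.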

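To change the order of rows, use arrange(). To return one or more rows by position, use slice(). Variants such as slice_max(), used below, return the rows with the largest values of a variable within each group.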

birthweight %>%
  group_by(smoker) %>%
  select(smoker, birthweight, length, head.circumference, weeks.gestation) %>%
  slice_max(order_by = birthweight, n = 5)

## # A tibble: 10 × 5
## # Groups:   smoker [2]
##    smoker birthweight length head.circumference weeks.gestation
##    <chr>        <dbl>  <int>              <int>           <int>
##  1 no            4.55     56                 34              44
##  2 no            4.32     53                 36              40
##  3 no            4.1      58                 39              41
##  4 no            4.07     53                 38              44
##  5 no            3.94     54                 37              42
##  6 yes           4.57     58                 39              41
##  7 yes           3.87     50                 33              45
##  8 yes           3.86     52                 36              39
##  9 yes           3.64     53                 38              40
## 10 yes           3.59     53                 34              40
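The relationship between arrange(), slice(), and slice_max() can be sketched on a small example. The `toy` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  id = 1:5,
  birthweight = c(3.2, 2.4, 3.6, 2.1, 4.0)
)

# arrange() sorts rows; desc() requests descending order
heaviest_first <- toy %>% arrange(desc(birthweight))

# slice() picks rows by position, here the first two
top_two <- heaviest_first %>% slice(1:2)

# slice_max() does the sorting and the selection in one step
top_two_direct <- toy %>% slice_max(order_by = birthweight, n = 2)

top_two$id
top_two_direct$id
```

Both approaches select the same two rows here; slice_max() is shorter, while arrange() followed by slice() makes each step explicit.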

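The pivot_longer() and pivot_wider() functions rearrange data: pivot_longer() decreases the number of columns by stacking them into rows, and pivot_wider() increases it by spreading rows into columns. The usefulness of these functions will become more evident during visualization.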

birthweight %>%
  filter(low.birthweight == TRUE) %>%
  select(smoker, length, birthweight) %>%
  pivot_longer(cols = c(length, birthweight),
               names_to = "measurement",
               values_to = "value")

## # A tibble: 12 × 3
##    smoker measurement value
##    <chr>  <chr>       <dbl>
##  1 yes    length      50   
##  2 yes    birthweight  2.51
##  3 yes    length      47   
##  4 yes    birthweight  2.66
##  5 yes    length      48   
##  6 yes    birthweight  2.37
##  7 yes    length      46   
##  8 yes    birthweight  2.05
##  9 yes    length      48   
## 10 yes    birthweight  1.92
## 11 no     length      43   
## 12 no     birthweight  2.65
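pivot_wider() reverses the operation, provided each original row can still be identified after lengthening. The sketch below assumes a hypothetical `toy` data frame with an `id` column for that purpose; the names `measurement` and `value` are illustrative.

```r
library(tidyr)

# Hypothetical toy data; id identifies each original row
toy <- data.frame(
  id = c(1, 2),
  length = c(50, 47),
  birthweight = c(2.51, 2.66)
)

# Longer: one row per id and measurement (4 rows, 3 columns)
long <- pivot_longer(toy, cols = c(length, birthweight),
                     names_to = "measurement", values_to = "value")

# Wider: back to one row per id (2 rows, 3 columns)
wide <- pivot_wider(long, names_from = measurement, values_from = value)

dim(long)
dim(wide)
```

Without an identifying column such as `id`, rows that share the same remaining values cannot be told apart, and pivot_wider() cannot rebuild the original rows unambiguously.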

7.5 Exercise 4: converting between base R and tidyverse

Reproduce Table 7.5 or Table 7.6 using base R. Then use tidyverse functions to answer the question you addressed in Exercise 3.

Reference: https://ucdavis-bioinformatics-training.github.io/2022_February_Introduction_to_R_for_Bioinformatics/introduction-to-the-tidyverse.html

Sefti Agustini (220605220003)

Malang, 2023-09-25
