The tidyverse is a collection of packages by the creators of RStudio that share an approach to data science.
The tidyverse packages offer alternatives to some base R functions that are intended to be more user-friendly for data scientists who are following this life cycle.
We will only be covering a few of the packages from the tidyverse.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
birthweight <- read.csv("birthweight.csv")
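The masking messages printed when dplyr is attached mean that calling filter() or lag() now reaches the dplyr version rather than the stats version. The following is a minimal sketch of how either version can still be called explicitly with the :: operator. The data frame toy is hypothetical and is not part of the birthweight data.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(toy, x > 3)

# stats::filter() is unrelated: it applies a linear filter
# (here a 3-point moving average) to a numeric series
smoothed <- stats::filter(toy$x, rep(1 / 3, 3))

kept
smoothed
```

Writing the package name in front of the function removes any ambiguity about which version is being used.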
The tidyverse employs piping to send the output of one function to
another function, rather than the nesting used in base R. The “pipe” is
written with a greater-than symbol sandwiched between two percent signs,
like this: %>%.
birthweight %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker)
## birth.date length birthweight smoker
## 1 3/23/1967 50 2.51 yes
## 2 1/8/1968 47 2.66 yes
## 3 4/2/1968 48 2.37 yes
## 4 7/18/1968 46 2.05 yes
## 5 9/16/1968 48 1.92 yes
## 6 9/27/1968 43 2.65 no
# equivalent to (base R keeps the original row names, filter() renumbers them):
birthweight[birthweight$low.birthweight == TRUE, c("birth.date", "length", "birthweight", "smoker")]
## birth.date length birthweight smoker
## 6 3/23/1967 50 2.51 yes
## 22 1/8/1968 47 2.66 yes
## 28 4/2/1968 48 2.37 yes
## 32 7/18/1968 46 2.05 yes
## 37 9/16/1968 48 1.92 yes
## 38 9/27/1968 43 2.65 no
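To make the contrast between nesting and piping concrete, here is a small sketch that writes the same filter-then-select operation both ways and checks that the results match. The data frame toy and its columns are hypothetical stand-ins for birthweight.csv.

```r
library(dplyr)

# Hypothetical toy data standing in for birthweight.csv
toy <- data.frame(
  id     = 1:4,
  weight = c(2.5, 3.4, 2.1, 3.9),
  low    = c(TRUE, FALSE, TRUE, FALSE)
)

# Nested calls are read from the inside out
nested <- select(filter(toy, low == TRUE), id, weight)

# Piped calls are read from top to bottom; x %>% f(y) means f(x, y)
piped <- toy %>%
  filter(low == TRUE) %>%
  select(id, weight)

# Both forms produce the same data frame
identical(nested, piped)
```

The piped form lists the steps in the order in which they are carried out, which tends to be easier to follow as the number of steps grows.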
The separate() function makes the conversion of the
“birth.date” column into “month,” “day,” and “year” trivial.
birthweight %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker) %>%
  separate(col = birth.date, sep = "[/]", into = c("month", "day", "year"))
## month day year length birthweight smoker
## 1 3 23 1967 50 2.51 yes
## 2 1 8 1968 47 2.66 yes
## 3 4 2 1968 48 2.37 yes
## 4 7 18 1968 46 2.05 yes
## 5 9 16 1968 48 1.92 yes
## 6 9 27 1968 43 2.65 no
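By default, separate() returns the new columns as character vectors, which matters if the month, day, or year will be used in calculations. The sketch below, on a hypothetical two-row data frame called toy, shows the convert argument, which asks separate() to guess a more suitable type.

```r
library(tidyr)

# Hypothetical toy data with dates in month/day/year form
toy <- data.frame(birth.date = c("3/23/1967", "1/8/1968"))

# Default behavior: the new columns are character vectors
as_text <- separate(toy, col = birth.date, sep = "/",
                    into = c("month", "day", "year"))

# convert = TRUE asks separate() to guess a better type (integer here)
as_int <- separate(toy, col = birth.date, sep = "/",
                   into = c("month", "day", "year"), convert = TRUE)

class(as_text$year)
class(as_int$year)
```

Recent tidyr releases mark separate() as superseded by separate_wider_delim(); separate() remains available and behaves as shown here.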
The mutate() function adds a new column based on data
contained in the existing columns.
birthweight %>%
  filter(low.birthweight == TRUE) %>%
  select(birth.date, length, birthweight, smoker) %>%
  mutate(d = birthweight / length)
## birth.date length birthweight smoker d
## 1 3/23/1967 50 2.51 yes 0.05020000
## 2 1/8/1968 47 2.66 yes 0.05659574
## 3 4/2/1968 48 2.37 yes 0.04937500
## 4 7/18/1968 46 2.05 yes 0.04456522
## 5 9/16/1968 48 1.92 yes 0.04000000
## 6 9/27/1968 43 2.65 no 0.06162791
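A single mutate() call can create several columns, and later columns can refer to ones created earlier in the same call. The sketch below illustrates this on a hypothetical two-row data frame; the column names weight.per.length and above.mean are made up for the example.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  length      = c(50, 47),
  birthweight = c(2.51, 2.66)
)

result <- toy %>%
  mutate(
    weight.per.length = birthweight / length,
    # columns created earlier in the same mutate() call can be reused
    above.mean = weight.per.length > mean(weight.per.length)
  )

result
```

Giving the new column a descriptive name, rather than a single letter, makes the result easier to read later.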
The group_by() and summarize() functions work together:
group_by() splits the data into groups defined by one or more categorical
variables, and summarize() applies a function to each group.
birthweight %>%
  group_by(smoker) %>%
  summarize(mean.birthweight = mean(birthweight))
## # A tibble: 2 × 2
## smoker mean.birthweight
## <chr> <dbl>
## 1 no 3.51
## 2 yes 3.13
birthweight %>%
  group_by(smoker, low.birthweight) %>%
  summarize(mean.birthweight = mean(birthweight))
## `summarise()` has grouped output by 'smoker'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: smoker [2]
## smoker low.birthweight mean.birthweight
## <chr> <int> <dbl>
## 1 no 0 3.55
## 2 no 1 2.65
## 3 yes 0 3.38
## 4 yes 1 2.30
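When grouping by more than one variable, summarize() prints a message because its result is still grouped by the first variable. The sketch below, on hypothetical toy data, computes two summaries per group and uses the .groups argument to return an ungrouped result, which also silences the message.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  smoker = c("no", "no", "yes", "yes", "yes"),
  low    = c(0, 1, 0, 0, 1),
  weight = c(3.6, 2.6, 3.4, 3.2, 2.3)
)

by_group <- toy %>%
  group_by(smoker, low) %>%
  summarize(
    n           = n(),           # number of rows in each group
    mean.weight = mean(weight),  # mean of weight within each group
    .groups     = "drop"         # return an ungrouped tibble, with no message
  )

by_group
```

Dropping the grouping is usually the safer choice when the summary table will be passed on to further steps that should treat it as a whole.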
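To change the order of rows, use arrange(). To return
one or more specified rows, use slice(). The example below uses
slice_max(), a variant of slice() that returns the rows with the largest
values of a variable; because the data are grouped by smoker, it returns
the five highest birthweights within each group.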
birthweight %>%
  group_by(smoker) %>%
  select(smoker, birthweight, length, head.circumference, weeks.gestation) %>%
  slice_max(order_by = birthweight, n = 5)
## # A tibble: 10 × 5
## # Groups: smoker [2]
## smoker birthweight length head.circumference weeks.gestation
## <chr> <dbl> <int> <int> <int>
## 1 no 4.55 56 34 44
## 2 no 4.32 53 36 40
## 3 no 4.1 58 39 41
## 4 no 4.07 53 38 44
## 5 no 3.94 54 37 42
## 6 yes 4.57 58 39 41
## 7 yes 3.87 50 33 45
## 8 yes 3.86 52 36 39
## 9 yes 3.64 53 38 40
## 10 yes 3.59 53 34 40
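For comparison, here is a short sketch of arrange() and slice() themselves, again on hypothetical toy data. Sorting in descending order and then taking the first two rows selects the same rows as slice_max() with n = 2.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(
  id     = 1:5,
  weight = c(3.1, 4.5, 2.2, 3.9, 4.0)
)

# arrange() sorts rows; desc() reverses the order to largest first
heaviest_first <- toy %>% arrange(desc(weight))

# slice() picks rows by position
top_two <- heaviest_first %>% slice(1:2)

# slice_max() combines both steps
same_rows <- toy %>% slice_max(order_by = weight, n = 2)

top_two
same_rows
```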
The pivot_longer() and pivot_wider()
functions rearrange data: pivot_longer() decreases the number of columns
by stacking them into rows, and pivot_wider() increases it by spreading
rows into columns. Their usefulness will become more evident during
visualization.
birthweight %>%
  filter(low.birthweight == TRUE) %>%
  select(smoker, length, birthweight) %>%
  pivot_longer(cols = c(length, birthweight),
               names_to = "measurement",
               values_to = "value")
## # A tibble: 12 × 3
## smoker measurement value
## <chr> <chr> <dbl>
## 1 yes length 50
## 2 yes birthweight 2.51
## 3 yes length 47
## 4 yes birthweight 2.66
## 5 yes length 48
## 6 yes birthweight 2.37
## 7 yes length 46
## 8 yes birthweight 2.05
## 9 yes length 48
## 10 yes birthweight 1.92
## 11 no length 43
## 12 no birthweight 2.65
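The reverse operation, pivot_wider(), is sketched below on hypothetical toy data. The id column is an assumption added for this example: pivot_wider() needs some column that identifies which long rows belong to the same original row, otherwise the values cannot be matched up uniquely.

```r
library(tidyr)

# Hypothetical toy data; id identifies each baby
toy <- data.frame(
  id          = 1:3,
  length      = c(50, 47, 48),
  birthweight = c(2.51, 2.66, 2.37)
)

# Longer: one row per baby per measurement
long <- pivot_longer(toy, cols = c(length, birthweight),
                     names_to = "measurement", values_to = "value")

# Wider: back to one row per baby, one column per measurement
wide <- pivot_wider(long, names_from = measurement, values_from = value)

long
wide
```

The long form has one row per measurement, which is the layout that plotting functions typically expect when a variable is mapped to color or to a panel.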
Reproduce Table 7.5 or Table 7.6 using base R. Use tidyverse functions to answer the question you addressed in Exercise 3.