# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 A 0.7 2
2 B 37 80
3 C 212 213
2023-05-11
You can view this presentation at https://rpubs.com/alperyilmaz/tidyr-slides
Source: R for Data Science book, Chapter 12 Tidy Data
The data is often not tidy so we need to to tidy it. For that, we’ll be using {tidyr} package.
Which table is suitable for ggplot, do you think?
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 A 0.7 2
2 B 37 80
3 C 212 213
How can we plot a line plot showing years on x-axis and cases on y-axis?
+ geom_line(aes(x="1999"))
won’t work and also if we have many years, we shouldn’t be using column names individually.
We need a column which keeps year data
How about this new shape?
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 A 1999 0.7
2 A 2000 2
3 B 1999 37
4 B 2000 80
5 C 1999 212
6 C 2000 213
Having year and cases values in its own column will both help data analysis and plotting.
Let’s plot the line plot that we wanted to plot originally.
Consider the following data, a single gene is associated with multiple diseases. Since number of diseases is not constant, we won’t be able to use separate
. However, separate_rows
is the ideal function for this purpose.
Here’s the result of separate_rows function
# A tibble: 5 × 2
gene_name disease
<chr> <chr>
1 A d1
2 A d2
3 A d3
4 B d1
5 B d3
This is tidy data, we can group_by gene or disease name, we can do join on disease column easily.
Quite often you’ll notice that, after pivoting the data, we need to separate the new names into multiple columns.
Consider the sample data below(the numbers are completely arbitrary, they don’t represent actual medal counts)
After pivot_longer “game” column still contains multiple variables, year and season, it needs separate
# A tibble: 8 × 3
country game medals
<chr> <chr> <dbl>
1 USA Summer_2016 5
2 USA Summer_2020 20
3 USA Winter_2014 17
4 USA Winter_2018 7
5 Russia Summer_2016 7
6 Russia Summer_2020 21
7 Russia Winter_2014 15
8 Russia Winter_2018 9
Here’s result of pivot_longer and separate which is suitable for group_by, inner_join operations or plotting.
medals |>
pivot_longer(-country,
names_to = "game",
values_to = "medals") |>
separate(game, sep = "_", into=c("season","year"))
# A tibble: 8 × 4
country season year medals
<chr> <chr> <chr> <dbl>
1 USA Summer 2016 5
2 USA Summer 2020 20
3 USA Winter 2014 17
4 USA Winter 2018 7
5 Russia Summer 2016 7
6 Russia Summer 2020 21
7 Russia Winter 2014 15
8 Russia Winter 2018 9
Here’s another example, we have gene expression data for replicate tissues.
gene_tissue <- tibble(gene=c("gene1","gene2"),
Brain.1=c(120,20),
Brain.2=c(100,30),
Brain.3=c(140,50),
Liver.1=c(1200,500),
Liver.2=c(1000,400))
gene_tissue
# A tibble: 2 × 6
gene Brain.1 Brain.2 Brain.3 Liver.1 Liver.2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 gene1 120 100 140 1200 1000
2 gene2 20 30 50 500 400
Question: What is the average expression of each gene in Brain and Liver?
In this case, the numbers after tissue names were not significant, thus we separated data into single column and dropped the extra data.
gene_tissue |>
pivot_longer(-gene,
names_to = "tissue",
values_to = "expression") |>
separate(tissue, into=c("tissue"), extra = "drop")
# A tibble: 10 × 3
gene tissue expression
<chr> <chr> <dbl>
1 gene1 Brain 120
2 gene1 Brain 100
3 gene1 Brain 140
4 gene1 Liver 1200
5 gene1 Liver 1000
6 gene2 Brain 20
7 gene2 Brain 30
8 gene2 Brain 50
9 gene2 Liver 500
10 gene2 Liver 400
By adding the following part we can get average values, even plot it
# A tibble: 4 × 3
# Groups: gene [2]
gene tissue avg_exp
<chr> <chr> <dbl>
1 gene1 Brain 120
2 gene1 Liver 1100
3 gene2 Brain 33.3
4 gene2 Liver 450
By adding the following part we plot our findings
... |>
group_by(gene, tissue) |>
summarize(avg_exp=mean(expression)) |>
ggplot(aes(gene, avg_exp)) +geom_col() +
facet_wrap(~tissue)
Tidy format might be great for analysis and plotting but for reporting we might want to see the tissue information in indivudal columns. Let’s use pivot_wider for that
gene_tissue |>
pivot_longer(-gene,
names_to = "tissue",
values_to = "expression") |>
separate(tissue, into=c("tissue"), extra = "drop") |>
group_by(gene, tissue) |>
summarize(avg_exp=mean(expression)) |>
pivot_wider(names_from = tissue, values_from = avg_exp) |>
knitr::kable() # this is for pretty printing
gene | Brain | Liver |
---|---|---|
gene1 | 120.00000 | 1100 |
gene2 | 33.33333 | 450 |
Since, pivot_longer() |> separate()
is common, pivot_longer()
has lots of arguments to avoid most of additional steps.
Here’s an example data from WHO summarizing tuberculosis diagnosis cases
# A tibble: 6 × 60
country iso2 iso3 year new_sp…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
<chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
1 Afghanistan AF AFG 1980 NA NA NA NA NA NA
2 Afghanistan AF AFG 1981 NA NA NA NA NA NA
3 Afghanistan AF AFG 1982 NA NA NA NA NA NA
4 Afghanistan AF AFG 1983 NA NA NA NA NA NA
5 Afghanistan AF AFG 1984 NA NA NA NA NA NA
6 Afghanistan AF AFG 1985 NA NA NA NA NA NA
# … with 50 more variables: new_sp_m65 <int>, new_sp_f014 <int>,
# new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
# new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
# new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
# new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
# new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
# new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>, …
The column names are were un-tidy. They contain so much variable. For instance new_sp_m014
means:
With elegant arguments in pivot_longer
the data can be pivotted and separated at the same time
who |>
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
)
# A tibble: 405,440 × 8
country iso2 iso3 year diagnosis gender age count
<chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
1 Afghanistan AF AFG 1980 sp m 014 NA
2 Afghanistan AF AFG 1980 sp m 1524 NA
3 Afghanistan AF AFG 1980 sp m 2534 NA
4 Afghanistan AF AFG 1980 sp m 3544 NA
5 Afghanistan AF AFG 1980 sp m 4554 NA
6 Afghanistan AF AFG 1980 sp m 5564 NA
7 Afghanistan AF AFG 1980 sp m 65 NA
8 Afghanistan AF AFG 1980 sp f 014 NA
9 Afghanistan AF AFG 1980 sp f 1524 NA
10 Afghanistan AF AFG 1980 sp f 2534 NA
# … with 405,430 more rows
Let’s go over sample data analysis using most of the verbs we have learned:
df <- data.frame(
gene = c("GeneA", "GeneB", "GeneC", "GeneA", "GeneB", "GeneC"),
tissue_1 = c(10, 15, 8, 12, 18, 9),
tissue_2 = c(5, 8, 6, 10, 12, 7),
tissue_3 = c(15, 20, 10, 8, 15, 12)
)
df
gene tissue_1 tissue_2 tissue_3
1 GeneA 10 5 15
2 GeneB 15 8 20
3 GeneC 8 6 10
4 GeneA 12 10 8
5 GeneB 18 12 15
6 GeneC 9 7 12
df |>
pivot_longer(cols = starts_with("tissue"),
names_to = "tissue",
values_to = "expression") |>
separate(tissue, into = c("var", "tissue"), sep = "_") |>
mutate(tissue = as.integer(tissue)) |>
group_by(gene) |>
mutate(avg_expression = mean(expression), total_expression = sum(expression)) |>
ungroup() |>
select(gene, tissue, avg_expression, total_expression) |>
pivot_longer(cols = c(avg_expression, total_expression),
names_to = "variable",
values_to = "value") |>
ggplot( aes(x = tissue, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ gene, ncol = 1) +
labs(x = "Tissue", y = "Value", fill = "Variable") +
theme_minimal()
purrr::map() can operate on nested data
# A tibble: 3 × 2
# Groups: Species [3]
Species data
<fct> <list>
1 setosa <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica <tibble [50 × 4]>
Notice that data column is a list column and “<tibble [50 × 4]>” means that each cell contains tibble (aka data frame) of 50 rows, 4 columns
Refer to purrr cheatsheat for more information about list columns
If you want, you can access the data as if you’re accessing a cell value in a data frame.
[[1]]
# A tibble: 50 × 4
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2
2 4.9 3 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3
8 5 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
10 4.9 3.1 1.5 0.1
# … with 40 more rows
Let’s apply a function to elements of vec
and keep the results.
apply
family of functionsReturns a vector or array or list of values obtained by applying a function to margins of an array or matrix
purrr
is tidyverse equivalent of apply family functionsmap()
from {purrr} package is apply
equivalent. Both apply()
and map()
are useful for applying functions in R, while map()
is particularly useful for iterating over individual elements of a data structure, such as a vector or list and map()
is pipe-friendly and integral to tidyverse suite of packages.
map
always returns list. but it has type stable versions, map_dbl
, map_chr
, etc.
When we write
we mean:
map()
can process, lists, columns or cellsmap()
can process list elements
if you use map()
with a dataframe, the function will be applied to each column (try running dput(mtcars)
, which might explain why each column)
Finally, we can run a function per each cell in a column using the format below:
as_tibble(mtcars) |> # mtcars is dataframe, this is for cosmetics
mutate(squareroot= map(mpg, sqrt) ) |>
head()
# A tibble: 6 × 12
mpg cyl disp hp drat wt qsec vs am gear carb squareroot
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 <dbl [1]>
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 <dbl [1]>
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 <dbl [1]>
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 <dbl [1]>
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 <dbl [1]>
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 <dbl [1]>
# A tibble: 6 × 12
mpg cyl disp hp drat wt qsec vs am gear carb squareroot
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 4.58
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 4.58
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4.77
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 4.63
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 4.32
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 4.25
# A tibble: 3 × 3
# Groups: Species [3]
Species data rows
<fct> <list> <list>
1 setosa <tibble [50 × 4]> <int [1]>
2 versicolor <tibble [50 × 4]> <int [1]>
3 virginica <tibble [50 × 4]> <int [1]>
lm()
is used for linear modelling.
The formula below models wt
(weight) based on disp
(displacement).
Stats about a model can be extracted via summary()
Call:
lm(formula = wt ~ disp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.89044 -0.29775 -0.00684 0.33428 0.66525
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.5998146 0.1729964 9.248 2.74e-10 ***
disp 0.0070103 0.0006629 10.576 1.22e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4574 on 30 degrees of freedom
Multiple R-squared: 0.7885, Adjusted R-squared: 0.7815
F-statistic: 111.8 on 1 and 30 DF, p-value: 1.222e-11
This is not tidy! The individual elements can be extracted as shown below:
broom
package for tidying lm()
Instead of summary()
function, tidy()
, glance()
and augment()
functions from broom package can be used to get stats in “tidy” format.
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.60 0.173 9.25 2.74e-10
2 disp 0.00701 0.000663 10.6 1.22e-11
# A tibble: 1 × 12
r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.789 0.781 0.457 112. 1.22e-11 1 -19.3 44.7 49.1 6.28
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
# variable names ¹adj.r.squared, ²statistic, ³deviance
# A tibble: 32 × 9
.rownames wt disp .fitted .resid .hat .sigma .cooksd .std.r…¹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 2.62 160 2.72 -0.101 0.0418 0.465 0.00112 -0.227
2 Mazda RX4 Wag 2.88 160 2.72 0.154 0.0418 0.464 0.00256 0.343
3 Datsun 710 2.32 108 2.36 -0.0369 0.0629 0.465 0.000233 -0.0834
4 Hornet 4 Drive 3.22 258 3.41 -0.193 0.0328 0.464 0.00314 -0.430
5 Hornet Sportabout 3.44 360 4.12 -0.684 0.0663 0.446 0.0850 -1.55
6 Valiant 3.46 225 3.18 0.283 0.0313 0.462 0.00638 0.628
7 Duster 360 3.57 360 4.12 -0.554 0.0663 0.453 0.0557 -1.25
8 Merc 240D 3.19 147. 2.63 0.562 0.0461 0.453 0.0382 1.26
9 Merc 230 3.15 141. 2.59 0.563 0.0482 0.453 0.0403 1.26
10 Merc 280 3.44 168. 2.77 0.665 0.0396 0.448 0.0454 1.48
# … with 22 more rows, and abbreviated variable name ¹.std.resid
The following code will group and nest the data according to cyl
(cylinder) column. Then 3 separate nested data will be modeled with lm
and “r.squared” value will be extracted from model results and will be sorted accordingly.
mtcars |>
group_by(cyl) |>
nest() |>
mutate(model= map(data, ~ lm(wt ~ disp, data=.x))) |>
mutate(rsq = map(model, glance)) |>
unnest(rsq) |>
arrange(desc(r.squared))
# A tibble: 3 × 15
# Groups: cyl [3]
cyl data model r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC
<dbl> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 <tibble> <lm> 0.734 0.704 0.310 24.8 7.61e-4 1 -1.62 9.23
2 8 <tibble> <lm> 0.570 0.535 0.518 15.9 1.79e-3 1 -9.58 25.2
3 6 <tibble> <lm> 0.223 0.0679 0.344 1.44 2.84e-1 1 -1.29 8.57
# … with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>, and abbreviated variable names ¹r.squared, ²adj.r.squared,
# ³statistic
please refer to chapter 25 of R for Data Science book and Hadley Wickham’s talk Many Models
In the talk, Hadley Wickham mentions that “plots does not scale, models do”. He means that although plots are visually great to process going through lots of images is not sustainable. But if develop a simple model and extract a measure from it, hundreds or thousands of cases/groups can be analyzed.
Let’s take the case of lifeExp
in gapminder
data. If we aim to detect any discrepencies in the data, we can plot them.
Such a plot will be difficult to extract important information.
Instead, let’s do lm()
modeling of life expextancy with year and then see which countries has worst model (meaning that their plot is not a straight line)
For more readability, let’s define a function for modelling and use it in map()
Let’s nest the data per country
# A tibble: 6 × 3
# Groups: country, continent [6]
country continent data
<fct> <fct> <list>
1 Afghanistan Asia <tibble [12 × 4]>
2 Albania Europe <tibble [12 × 4]>
3 Algeria Africa <tibble [12 × 4]>
4 Angola Africa <tibble [12 × 4]>
5 Argentina Americas <tibble [12 × 4]>
6 Australia Oceania <tibble [12 × 4]>
We’re ready for modeling and extracting model performance for each country. All this is taking place in a tibble/data frame without any loops or intermediate steps.
by_country |>
mutate(model = map(data, country_model)) |>
mutate(glance = map(model, broom::glance)) |>
unnest(glance) |>
arrange(r.squared)
# A tibble: 142 × 16
# Groups: country, continent [142]
country conti…¹ data model r.squ…² adj.r.…³ sigma stati…⁴ p.value df
<fct> <fct> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Rwanda Africa <tibble> <lm> 0.0172 -0.0811 6.56 0.175 0.685 1
2 Botswana Africa <tibble> <lm> 0.0340 -0.0626 6.11 0.352 0.566 1
3 Zimbabwe Africa <tibble> <lm> 0.0562 -0.0381 7.21 0.596 0.458 1
4 Zambia Africa <tibble> <lm> 0.0598 -0.0342 4.53 0.636 0.444 1
5 Swaziland Africa <tibble> <lm> 0.0682 -0.0250 6.64 0.732 0.412 1
6 Lesotho Africa <tibble> <lm> 0.0849 -0.00666 5.93 0.927 0.358 1
7 Cote d'I… Africa <tibble> <lm> 0.283 0.212 3.93 3.95 0.0748 1
8 South Af… Africa <tibble> <lm> 0.312 0.244 4.74 4.54 0.0588 1
9 Uganda Africa <tibble> <lm> 0.342 0.276 3.19 5.20 0.0457 1
10 Congo, D… Africa <tibble> <lm> 0.348 0.283 2.43 5.34 0.0434 1
# … with 132 more rows, 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
# deviance <dbl>, df.residual <int>, nobs <int>, and abbreviated variable
# names ¹continent, ²r.squared, ³adj.r.squared, ⁴statistic
What happened in Rwanda?
Please visit Rwandan genocide (Apr-Jul 1994) page at Wikipedia.
map
in parallellIf the function in mutate(model = map(data, country_model))
was slow, we had to wait for each country’s model to be modeled sequentially. By installing furrr
package and using future_map
instead of map
, modeling will be performed in parallel.
The updated code will look like this
For more information, visit furrr package site