Rehshaping and Nesting Data

alper yilmaz

2023-05-11

You can view this presentation at https://rpubs.com/alperyilmaz/tidyr-slides

Data should be tidy

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

The {tidyr} package

The data is often not tidy so we need to to tidy it. For that, we’ll be using {tidyr} package.

It can reshape the table (pivot_longer, pivot_wider functions)
It can separate (or unite) cells (separate, separate_rows, unite)
It can nest data (nest and unnest_ functions)
It can expand tables with possible combinations (expand and complete functions)
It can handle missing data (drop_na, fill and replace_na functions)

Cheatsheet

Pivot longer

Pivot wider

Why we need pivotting

 table4a <- tibble(country=c("A","B","C"), 
                   "1999"=c(0.7,37,212), 
                   "2000"=c(2,80,213))

 table4a

# A tibble: 3 × 3
  country `1999` `2000`
  <chr>    <dbl>  <dbl>
1 A          0.7      2
2 B         37       80
3 C        212      213

How can we plot a line plot showing years on x-axis and cases on y-axis?

+ geom_line(aes(x="1999")) won’t work and also if we have many years, we shouldn’t be using column names individually.

We need a column which keeps year data

How about this new shape?

table4a |>
  pivot_longer(-country, 
               names_to = "year", 
               values_to = "cases")

# A tibble: 6 × 3
  country year  cases
  <chr>   <chr> <dbl>
1 A       1999    0.7
2 A       2000    2  
3 B       1999   37  
4 B       2000   80  
5 C       1999  212  
6 C       2000  213

Having year and cases values in its own column will both help data analysis and plotting.

Let’s plot the line plot that we wanted to plot originally.

table4a |>
  pivot_longer(-country, 
               names_to = "year", 
               values_to = "cases") |>
  ggplot(aes(x=year, y=cases, group=country)) +
  geom_line()

Unite cells

Separate cells

Consider the following data, a single gene is associated with multiple diseases. Since number of diseases is not constant, we won’t be able to use separate. However, separate_rows is the ideal function for this purpose.

gene_table <- tibble(gene_name=c("A","B"), 
                     disease=c("d1,d2,d3","d1,d3"))

gene_table

# A tibble: 2 × 2
  gene_name disease 
  <chr>     <chr>   
1 A         d1,d2,d3
2 B         d1,d3

Here’s the result of separate_rows function

gene_table |>
  separate_rows(disease, sep = ",")

# A tibble: 5 × 2
  gene_name disease
  <chr>     <chr>  
1 A         d1     
2 A         d2     
3 A         d3     
4 B         d1     
5 B         d3

This is tidy data, we can group_by gene or disease name, we can do join on disease column easily.

pivot_longer |> separate

Quite often you’ll notice that, after pivoting the data, we need to separate the new names into multiple columns.

Consider the sample data below(the numbers are completely arbitrary, they don’t represent actual medal counts)

medals <- tibble(country=c("USA","Russia"), 
                 Summer_2016= c(5,7), 
                 Summer_2020=c(20,21), 
                 Winter_2014=c(17,15), 
                 Winter_2018=c(7,9))

medals

# A tibble: 2 × 5
  country Summer_2016 Summer_2020 Winter_2014 Winter_2018
  <chr>         <dbl>       <dbl>       <dbl>       <dbl>
1 USA               5          20          17           7
2 Russia            7          21          15           9

After pivot_longer “game” column still contains multiple variables, year and season, it needs separate

medals |> 
  pivot_longer(-country, 
               names_to = "game",
               values_to = "medals")

# A tibble: 8 × 3
  country game        medals
  <chr>   <chr>        <dbl>
1 USA     Summer_2016      5
2 USA     Summer_2020     20
3 USA     Winter_2014     17
4 USA     Winter_2018      7
5 Russia  Summer_2016      7
6 Russia  Summer_2020     21
7 Russia  Winter_2014     15
8 Russia  Winter_2018      9

Here’s result of pivot_longer and separate which is suitable for group_by, inner_join operations or plotting.

medals |> 
  pivot_longer(-country, 
               names_to = "game",
               values_to = "medals") |>
  separate(game, sep = "_", into=c("season","year"))

# A tibble: 8 × 4
  country season year  medals
  <chr>   <chr>  <chr>  <dbl>
1 USA     Summer 2016       5
2 USA     Summer 2020      20
3 USA     Winter 2014      17
4 USA     Winter 2018       7
5 Russia  Summer 2016       7
6 Russia  Summer 2020      21
7 Russia  Winter 2014      15
8 Russia  Winter 2018       9

Here’s another example, we have gene expression data for replicate tissues.

gene_tissue <- tibble(gene=c("gene1","gene2"), 
                      Brain.1=c(120,20), 
                      Brain.2=c(100,30), 
                      Brain.3=c(140,50), 
                      Liver.1=c(1200,500), 
                      Liver.2=c(1000,400))
gene_tissue

# A tibble: 2 × 6
  gene  Brain.1 Brain.2 Brain.3 Liver.1 Liver.2
  <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 gene1     120     100     140    1200    1000
2 gene2      20      30      50     500     400

Question: What is the average expression of each gene in Brain and Liver?

In this case, the numbers after tissue names were not significant, thus we separated data into single column and dropped the extra data.

gene_tissue |> 
  pivot_longer(-gene, 
               names_to = "tissue", 
               values_to = "expression") |>
  separate(tissue, into=c("tissue"), extra = "drop")

# A tibble: 10 × 3
   gene  tissue expression
   <chr> <chr>       <dbl>
 1 gene1 Brain         120
 2 gene1 Brain         100
 3 gene1 Brain         140
 4 gene1 Liver        1200
 5 gene1 Liver        1000
 6 gene2 Brain          20
 7 gene2 Brain          30
 8 gene2 Brain          50
 9 gene2 Liver         500
10 gene2 Liver         400

By adding the following part we can get average values, even plot it

... |>
  group_by(gene, tissue) |> 
  summarize(avg_exp=mean(expression))

# A tibble: 4 × 3
# Groups:   gene [2]
  gene  tissue avg_exp
  <chr> <chr>    <dbl>
1 gene1 Brain    120  
2 gene1 Liver   1100  
3 gene2 Brain     33.3
4 gene2 Liver    450

By adding the following part we plot our findings

... |>
  group_by(gene, tissue) |> 
  summarize(avg_exp=mean(expression)) |>
  ggplot(aes(gene, avg_exp)) +geom_col() + 
  facet_wrap(~tissue)

Tidy format might be great for analysis and plotting but for reporting we might want to see the tissue information in indivudal columns. Let’s use pivot_wider for that

gene_tissue |> 
  pivot_longer(-gene, 
               names_to = "tissue", 
               values_to = "expression") |>
  separate(tissue, into=c("tissue"), extra = "drop") |>
  group_by(gene, tissue) |> 
  summarize(avg_exp=mean(expression)) |>
  pivot_wider(names_from = tissue, values_from = avg_exp) |>
  knitr::kable()   # this is for pretty printing

gene	Brain	Liver
gene1	120.00000	1100
gene2	33.33333	450

Since, pivot_longer() |> separate() is common, pivot_longer() has lots of arguments to avoid most of additional steps.

Here’s an example data from WHO summarizing tuberculosis diagnosis cases

who |> head()

# A tibble: 6 × 60
  country     iso2  iso3   year new_sp…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
  <chr>       <chr> <chr> <int>    <int>   <int>   <int>   <int>   <int>   <int>
1 Afghanistan AF    AFG    1980       NA      NA      NA      NA      NA      NA
2 Afghanistan AF    AFG    1981       NA      NA      NA      NA      NA      NA
3 Afghanistan AF    AFG    1982       NA      NA      NA      NA      NA      NA
4 Afghanistan AF    AFG    1983       NA      NA      NA      NA      NA      NA
5 Afghanistan AF    AFG    1984       NA      NA      NA      NA      NA      NA
6 Afghanistan AF    AFG    1985       NA      NA      NA      NA      NA      NA
# … with 50 more variables: new_sp_m65 <int>, new_sp_f014 <int>,
#   new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
#   new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
#   new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
#   new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
#   new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
#   new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>, …

The column names are were un-tidy. They contain so much variable. For instance new_sp_m014 means:

m: male (versus f: female)
sp: positive pulmonary smear (other options are: “rel”, “sn” and “ep”)
014: ages between 0-14 (there are other age ranges such as 15-24, 25-34, 35-44, 45-54, 55-64 and 65-)

With elegant arguments in pivot_longer the data can be pivotted and separated at the same time

who |> 
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count"
  )

# A tibble: 405,440 × 8
   country     iso2  iso3   year diagnosis gender age   count
   <chr>       <chr> <chr> <int> <chr>     <chr>  <chr> <int>
 1 Afghanistan AF    AFG    1980 sp        m      014      NA
 2 Afghanistan AF    AFG    1980 sp        m      1524     NA
 3 Afghanistan AF    AFG    1980 sp        m      2534     NA
 4 Afghanistan AF    AFG    1980 sp        m      3544     NA
 5 Afghanistan AF    AFG    1980 sp        m      4554     NA
 6 Afghanistan AF    AFG    1980 sp        m      5564     NA
 7 Afghanistan AF    AFG    1980 sp        m      65       NA
 8 Afghanistan AF    AFG    1980 sp        f      014      NA
 9 Afghanistan AF    AFG    1980 sp        f      1524     NA
10 Afghanistan AF    AFG    1980 sp        f      2534     NA
# … with 405,430 more rows

(Almost) all verbs combined

Let’s go over sample data analysis using most of the verbs we have learned:

df <- data.frame(
      gene = c("GeneA", "GeneB", "GeneC", "GeneA", "GeneB", "GeneC"),
      tissue_1 = c(10, 15, 8, 12, 18, 9),
      tissue_2 = c(5, 8, 6, 10, 12, 7),
      tissue_3 = c(15, 20, 10, 8, 15, 12)
    )
df

   gene tissue_1 tissue_2 tissue_3
1 GeneA       10        5       15
2 GeneB       15        8       20
3 GeneC        8        6       10
4 GeneA       12       10        8
5 GeneB       18       12       15
6 GeneC        9        7       12

df |>
  pivot_longer(cols = starts_with("tissue"), 
               names_to = "tissue", 
               values_to = "expression") |>
  separate(tissue, into = c("var", "tissue"), sep = "_") |>
  mutate(tissue = as.integer(tissue)) |>
  group_by(gene) |>
  mutate(avg_expression = mean(expression), total_expression = sum(expression)) |>
  ungroup() |>
  select(gene, tissue, avg_expression, total_expression) |>
  pivot_longer(cols = c(avg_expression, total_expression), 
               names_to = "variable", 
               values_to = "value") |>
  ggplot( aes(x = tissue, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ gene, ncol = 1) +
  labs(x = "Tissue", y = "Value", fill = "Variable") +
  theme_minimal()

Nest data

n_storms <- storms %>%
group_by(name) %>%
nest()

Unnest data

Operate on nested data

purrr::map() can operate on nested data

list columns, case of nested iris

n_iris <- iris |>  
  group_by(Species) |> 
  nest()

n_iris

# A tibble: 3 × 2
# Groups:   Species [3]
  Species    data             
  <fct>      <list>           
1 setosa     <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica  <tibble [50 × 4]>

Notice that data column is a list column and “<tibble [50 × 4]>” means that each cell contains tibble (aka data frame) of 50 rows, 4 columns

If you want, you can access the data as if you’re accessing a cell value in a data frame.

n_iris$data[1]

[[1]]
# A tibble: 50 × 4
   Sepal.Length Sepal.Width Petal.Length Petal.Width
          <dbl>       <dbl>        <dbl>       <dbl>
 1          5.1         3.5          1.4         0.2
 2          4.9         3            1.4         0.2
 3          4.7         3.2          1.3         0.2
 4          4.6         3.1          1.5         0.2
 5          5           3.6          1.4         0.2
 6          5.4         3.9          1.7         0.4
 7          4.6         3.4          1.4         0.3
 8          5           3.4          1.5         0.2
 9          4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1
# … with 40 more rows

Loops and functional programming with purrr::map()

for loops in R

for (x in seq(5)) {
  print(sqrt(x))
}

[1] 1
[1] 1.414214
[1] 1.732051
[1] 2
[1] 2.236068

Let’s apply a function to elements of vec and keep the results.

vec <- seq(1,10,3)
result <- 0
for(i in 1:length(vec)){
  result[i] <- sqrt(vec[i])
}

result

[1] 1.000000 2.000000 2.645751 3.162278

`apply` family of functions

Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix

sapply(vec, sqrt)

[1] 1.000000 2.000000 2.645751 3.162278

`purrr` is tidyverse equivalent of apply family functions

map() from {purrr} package is apply equivalent. Both apply() and map() are useful for applying functions in R, while map() is particularly useful for iterating over individual elements of a data structure, such as a vector or list and map() is pipe-friendly and integral to tidyverse suite of packages.

map always returns list. but it has type stable versions, map_dbl, map_chr, etc.

# library(purrr)   # no need to load purrr separately
                   # it's part of tidyverse
map(vec,sqrt)

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 2.645751

[[4]]
[1] 3.162278

map_dbl(vec,sqrt)

[1] 1.000000 2.000000 2.645751 3.162278

When we write

result = map(some_numbers, some_function)

we mean:

result = vector("list", length(some_numbers))

for(i in seq_along(some_numbers)){
  result[[i]] = some_function(some_numbers[[i]])
}

print(result)

`map()` can process, lists, columns or cells

map() can process list elements

test <- list(a=c(1,2,3), 
             b=10, 
             c=seq(1:10))
test

$a
[1] 1 2 3

$b
[1] 10

$c
 [1]  1  2  3  4  5  6  7  8  9 10

map(test, sum)

$a
[1] 6

$b
[1] 10

$c
[1] 55

if you use map() with a dataframe, the function will be applied to each column (try running dput(mtcars), which might explain why each column)

mtcars |> map_dbl(sum)

     mpg      cyl     disp       hp     drat       wt     qsec       vs 
 642.900  198.000 7383.100 4694.000  115.090  102.952  571.160   14.000 
      am     gear     carb 
  13.000  118.000   90.000

Finally, we can run a function per each cell in a column using the format below:

as_tibble(mtcars) |>    # mtcars is dataframe, this is for cosmetics
  mutate(squareroot= map(mpg, sqrt) ) |>
  head()

# A tibble: 6 × 12
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb squareroot
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>    
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4 <dbl [1]> 
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4 <dbl [1]> 
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1 <dbl [1]> 
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1 <dbl [1]> 
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2 <dbl [1]> 
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1 <dbl [1]>

as_tibble(mtcars) |>
  mutate(squareroot= map_dbl(mpg, sqrt) ) |>
  head()

# A tibble: 6 × 12
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb squareroot
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4       4.58
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4       4.58
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1       4.77
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1       4.63
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2       4.32
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1       4.25

applying function to nested data

n_iris %>% 
  mutate(rows = map(data, nrow))

# A tibble: 3 × 3
# Groups:   Species [3]
  Species    data              rows     
  <fct>      <list>            <list>   
1 setosa     <tibble [50 × 4]> <int [1]>
2 versicolor <tibble [50 × 4]> <int [1]>
3 virginica  <tibble [50 × 4]> <int [1]>

n_iris %>% 
  mutate(rows = map_dbl(data, nrow))

# A tibble: 3 × 3
# Groups:   Species [3]
  Species    data               rows
  <fct>      <list>            <dbl>
1 setosa     <tibble [50 × 4]>    50
2 versicolor <tibble [50 × 4]>    50
3 virginica  <tibble [50 × 4]>    50

purrr and lm()

lm() is used for linear modelling.

The formula below models wt (weight) based on disp (displacement).

lm(wt ~ disp, data=mtcars)


Call:
lm(formula = wt ~ disp, data = mtcars)

Coefficients:
(Intercept)         disp  
    1.59981      0.00701

Stats about a model can be extracted via summary()

fit <- lm(wt ~ disp, data=mtcars)
summary(fit)


Call:
lm(formula = wt ~ disp, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89044 -0.29775 -0.00684  0.33428  0.66525 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.5998146  0.1729964   9.248 2.74e-10 ***
disp        0.0070103  0.0006629  10.576 1.22e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4574 on 30 degrees of freedom
Multiple R-squared:  0.7885,    Adjusted R-squared:  0.7815 
F-statistic: 111.8 on 1 and 30 DF,  p-value: 1.222e-11

This is not tidy! The individual elements can be extracted as shown below:

summary(fit)$r.squared

[1] 0.7885083

`broom` package for tidying `lm()`

Instead of summary() function, tidy(), glance() and augment() functions from broom package can be used to get stats in “tidy” format.

library(broom)
tidy(fit)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  1.60     0.173         9.25 2.74e-10
2 disp         0.00701  0.000663     10.6  1.22e-11

glance(fit)

# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.789        0.781 0.457    112. 1.22e-11     1  -19.3  44.7  49.1    6.28
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹adj.r.squared, ²statistic, ³deviance

augment(fit)

# A tibble: 32 × 9
   .rownames            wt  disp .fitted  .resid   .hat .sigma  .cooksd .std.r…¹
   <chr>             <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
 1 Mazda RX4          2.62  160     2.72 -0.101  0.0418  0.465 0.00112   -0.227 
 2 Mazda RX4 Wag      2.88  160     2.72  0.154  0.0418  0.464 0.00256    0.343 
 3 Datsun 710         2.32  108     2.36 -0.0369 0.0629  0.465 0.000233  -0.0834
 4 Hornet 4 Drive     3.22  258     3.41 -0.193  0.0328  0.464 0.00314   -0.430 
 5 Hornet Sportabout  3.44  360     4.12 -0.684  0.0663  0.446 0.0850    -1.55  
 6 Valiant            3.46  225     3.18  0.283  0.0313  0.462 0.00638    0.628 
 7 Duster 360         3.57  360     4.12 -0.554  0.0663  0.453 0.0557    -1.25  
 8 Merc 240D          3.19  147.    2.63  0.562  0.0461  0.453 0.0382     1.26  
 9 Merc 230           3.15  141.    2.59  0.563  0.0482  0.453 0.0403     1.26  
10 Merc 280           3.44  168.    2.77  0.665  0.0396  0.448 0.0454     1.48  
# … with 22 more rows, and abbreviated variable name ¹.std.resid

The following code will group and nest the data according to cyl (cylinder) column. Then 3 separate nested data will be modeled with lm and “r.squared” value will be extracted from model results and will be sorted accordingly.

mtcars |>
  group_by(cyl) |> 
  nest() |>
  mutate(model= map(data, ~ lm(wt ~ disp, data=.x))) |> 
  mutate(rsq = map(model, glance)) |>
  unnest(rsq) |>
  arrange(desc(r.squared))

# A tibble: 3 × 15
# Groups:   cyl [3]
    cyl data     model  r.squ…¹ adj.r…² sigma stati…³ p.value    df logLik   AIC
  <dbl> <list>   <list>   <dbl>   <dbl> <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl>
1     4 <tibble> <lm>     0.734  0.704  0.310   24.8  7.61e-4     1  -1.62  9.23
2     8 <tibble> <lm>     0.570  0.535  0.518   15.9  1.79e-3     1  -9.58 25.2 
3     6 <tibble> <lm>     0.223  0.0679 0.344    1.44 2.84e-1     1  -1.29  8.57
# … with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>, and abbreviated variable names ¹r.squared, ²adj.r.squared,
#   ³statistic

Many models

please refer to chapter 25 of R for Data Science book and Hadley Wickham’s talk Many Models

In the talk, Hadley Wickham mentions that “plots does not scale, models do”. He means that although plots are visually great to process going through lots of images is not sustainable. But if develop a simple model and extract a measure from it, hundreds or thousands of cases/groups can be analyzed.

Let’s take the case of lifeExp in gapminder data. If we aim to detect any discrepencies in the data, we can plot them.

gapminder |>
  ggplot(aes(year, lifeExp, group = country)) +
  geom_line(alpha = 1/3)

Such a plot will be difficult to extract important information.

Instead, let’s do lm() modeling of life expextancy with year and then see which countries has worst model (meaning that their plot is not a straight line)

For more readability, let’s define a function for modelling and use it in map()

country_model <- function(df) {
      lm(lifeExp ~ year, data = df)
    }

Let’s nest the data per country

by_country <- gapminder |>
  group_by(country, continent) |>
  nest()

by_country |> head()

# A tibble: 6 × 3
# Groups:   country, continent [6]
  country     continent data             
  <fct>       <fct>     <list>           
1 Afghanistan Asia      <tibble [12 × 4]>
2 Albania     Europe    <tibble [12 × 4]>
3 Algeria     Africa    <tibble [12 × 4]>
4 Angola      Africa    <tibble [12 × 4]>
5 Argentina   Americas  <tibble [12 × 4]>
6 Australia   Oceania   <tibble [12 × 4]>

We’re ready for modeling and extracting model performance for each country. All this is taking place in a tibble/data frame without any loops or intermediate steps.

 by_country |>
  mutate(model = map(data, country_model)) |>
  mutate(glance = map(model, broom::glance)) |>
  unnest(glance) |>
  arrange(r.squared)

# A tibble: 142 × 16
# Groups:   country, continent [142]
   country   conti…¹ data     model r.squ…² adj.r.…³ sigma stati…⁴ p.value    df
   <fct>     <fct>   <list>   <lis>   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <dbl>
 1 Rwanda    Africa  <tibble> <lm>   0.0172 -0.0811   6.56   0.175  0.685      1
 2 Botswana  Africa  <tibble> <lm>   0.0340 -0.0626   6.11   0.352  0.566      1
 3 Zimbabwe  Africa  <tibble> <lm>   0.0562 -0.0381   7.21   0.596  0.458      1
 4 Zambia    Africa  <tibble> <lm>   0.0598 -0.0342   4.53   0.636  0.444      1
 5 Swaziland Africa  <tibble> <lm>   0.0682 -0.0250   6.64   0.732  0.412      1
 6 Lesotho   Africa  <tibble> <lm>   0.0849 -0.00666  5.93   0.927  0.358      1
 7 Cote d'I… Africa  <tibble> <lm>   0.283   0.212    3.93   3.95   0.0748     1
 8 South Af… Africa  <tibble> <lm>   0.312   0.244    4.74   4.54   0.0588     1
 9 Uganda    Africa  <tibble> <lm>   0.342   0.276    3.19   5.20   0.0457     1
10 Congo, D… Africa  <tibble> <lm>   0.348   0.283    2.43   5.34   0.0434     1
# … with 132 more rows, 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>, and abbreviated variable
#   names ¹continent, ²r.squared, ³adj.r.squared, ⁴statistic

What happened in Rwanda?

gapminder |> 
  filter(country=="Rwanda") |> 
  ggplot(aes(year,lifeExp,group=country)) + 
  geom_line()

`map` in parallell

If the function in mutate(model = map(data, country_model)) was slow, we had to wait for each country’s model to be modeled sequentially. By installing furrr package and using future_map instead of map, modeling will be performed in parallel.

The updated code will look like this

library(furrr)
 by_country |>
  mutate(model = future_map(data, country_model)) |>
  mutate(glance = map(model, broom::glance)) |>
  unnest(glance) |>
  arrange(r.squared)