Using nest, unnest, map, and tidy functions to model and compare nested data

Nest and unnest - creating lists within dataframes and tidy data for modelling

This is an introduction to the nest and unnest functions from the ‘tidyr’ package, which is included in the tidyverse.

When you nest a data frame you create a column that contains a list of data frames. Nesting works as a summarizing function since you get one row for each group defined by the non-nested columns.

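You can create nested data frames using tidyr::nest(). In a call such as df %>% nest(data = c(x, y)), the argument specifies the columns to be nested (here x and y) and names the resulting list-column data. The older form df %>% nest(x, y) is deprecated as of tidyr 1.0.0.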

When nesting is used in conjunction with the ‘purrr’ and ‘broom’ packages, you can apply operations to each data frame in the list-column.
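As a minimal sketch of the idea, the example below nests and then unnests a small made-up data frame. The object toy and its columns group, x, and y are hypothetical and are not part of the happiness data used later.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: two groups with three rows each
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  x = 1:6,
  y = c(2, 4, 6, 8, 10, 12)
)

# Nest x and y into a list-column called `data`
nested <- toy %>% nest(data = c(x, y))
nested  # one row per group; each `data` entry is a 3 x 2 tibble

# unnest() reverses the operation and restores the original rows
restored <- nested %>% unnest(data)
```

Nesting leaves one row per value of group, which is why it behaves like a summarizing step, and unnest brings back all six original rows.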

Loading the libraries and the dataset

For this example I will be using several packages, including tidyr, magrittr, broom, and purrr, so we load these first.

library(tidyverse)
library(magrittr)
library(broom)
library(purrr)

Next we load the CSV data that will be used for our examples. This data was obtained from Kaggle and is originally sourced from The World Happiness Report published by the Sustainable Development Solutions Network. This version of the data can be found here: https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021?select=world-happiness-report.csv

df <- as.data.frame(read.delim("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/data/world-happiness-report.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",", fileEncoding = "UTF-8-BOM"))

Setting up the model and demonstrating the tidy model format

First we start with an example with a single country, in this case Afghanistan, to demonstrate what we will ultimately want to do to all countries in the dataset. We will do this by filtering the dataset for the country name Afghanistan. Then we will run a simple linear model with the outcome variable life expectancy and the predictor variable of year. Finally, we can use the tidy function from the broom package to view the linear regression information in a tidy model format.

Afghanistan_by_year <- df %>% filter(Country.name == "Afghanistan")

Afghanistan_lm <- lm(Healthy.life.expectancy.at.birth ~ year, data = Afghanistan_by_year)

tidy(Afghanistan_lm)

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -280.      79.8        -3.51 0.00562
## 2 year           0.165    0.0396      4.17 0.00193

Creating nested dataframes using the nest function

To prepare to run our analysis on all countries, we can create nested dataframes. The code below indicates that we want to nest all columns besides the country name column into a column named data. So for each country, there will be a dataframe containing the other 10 variables for that country.

First, we use the map function to count the NAs in each column. For our outcome variable of interest, healthy life expectancy, there are 55 NA values, which we drop so that the model can run.

Then we nest all variables except the country name, leaving us with the aforementioned data column to run our linear model on.

#identify na values that may be an issue 

map(df, ~sum(is.na(.)))

## $Country.name
## [1] 0
## 
## $year
## [1] 0
## 
## $Life.Ladder
## [1] 0
## 
## $Log.GDP.per.capita
## [1] 36
## 
## $Social.support
## [1] 13
## 
## $Healthy.life.expectancy.at.birth
## [1] 55
## 
## $Freedom.to.make.life.choices
## [1] 32
## 
## $Generosity
## [1] 89
## 
## $Perceptions.of.corruption
## [1] 110
## 
## $Positive.affect
## [1] 22
## 
## $Negative.affect
## [1] 16
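Because each count is a single integer, the same information can also be returned as a named integer vector with map_int from purrr, which prints more compactly than a list. The sketch below uses a small hypothetical data frame (toy and its columns are made up for illustration), not the happiness data.

```r
library(purrr)

# Hypothetical data frame with a few missing values
toy <- data.frame(
  id = 1:4,
  score = c(1.5, NA, 3.2, NA),
  label = c("a", "b", NA, "d")
)

# map_int() returns a named integer vector instead of a list
na_counts <- map_int(toy, ~ sum(is.na(.x)))
na_counts
```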

#drop na values in outcome variable column 

by_country <- df %>% drop_na(Healthy.life.expectancy.at.birth)

#nest the dataframe by country 

by_country %<>%
  nest(data = !Country.name)

Run models on nested dataframes using map function

Now that we have nested dataframes for each country, we can use the purrr package and the map function to run the linear regression for each country.

In general, map applies an operation to each item in a list.

For example, given the list a <- list(1, 2, 3, 4), calling map(a, ~ . * 2) multiplies each item in “a” by 2 and returns a list containing 2, 4, 6, and 8, as seen below.

a <- list(1, 2, 3, 4)

map(a, ~ . * 2)

## [[1]]
## [1] 2
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 8
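map always returns a list. When every result is a single number, the typed variant map_dbl from purrr returns a plain numeric vector instead. A minimal sketch with the same list:

```r
library(purrr)

a <- list(1, 2, 3, 4)

# map_dbl() applies the same function but returns a numeric vector
doubled <- map_dbl(a, ~ .x * 2)
doubled
## [1] 2 4 6 8
```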

Returning to the life expectancy example, we can use the map function to run a simple linear regression for each country and store the fitted models in a new column named model.

# Use map to run the linear regression model for each country in the dataframe using the nested dataframes 

by_country_model <- by_country %>% mutate(model = map(data, ~ lm(Healthy.life.expectancy.at.birth ~ year, data = .x)))
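Each entry of a model column built this way is an ordinary lm object, so it can be pulled out and inspected like any other model. The sketch below illustrates this on a small hypothetical dataset (toy, group, and y are made up for illustration) rather than the happiness data.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: y rises by 2 per year in group "a", by 0.5 in group "b"
toy <- tibble(
  group = rep(c("a", "b"), each = 4),
  year = rep(2001:2004, times = 2),
  y = c(10, 12, 14, 16, 5, 5.5, 6, 6.5)
)

# Same pattern as above: nest by group, then fit one model per nested data frame
toy_models <- toy %>%
  nest(data = !group) %>%
  mutate(model = map(data, ~ lm(y ~ year, data = .x)))

# Extract a single fitted model from the list-column
first_model <- toy_models$model[[1]]
coef(first_model)

# Pull the slope on `year` out of every model as a numeric vector
slopes <- map_dbl(toy_models$model, ~ coef(.x)[["year"]])
```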

Tidy models using tidy and unnest functions

To take this one step further, we can tidy our model column, which is a list-column of lm objects. We use map again, this time with the tidy function, to turn each model into a dataframe of coefficients stored in a new column called tidied. Finally, we use unnest on the tidied column so that we can easily see the coefficients of the model run for each country.

# Here we run the same models as earlier but tidy and unnest the results 

by_country_model <- by_country %>%
  mutate(model = map(data, ~ lm(Healthy.life.expectancy.at.birth ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

# View our tidied model results 

head(by_country_model)

## # A tibble: 6 x 8
##   Country.name data         model term     estimate std.error statistic  p.value
##   <chr>        <list>       <lis> <chr>       <dbl>     <dbl>     <dbl>    <dbl>
## 1 Afghanistan  <tibble [12~ <lm>  (Interc~ -280.     79.8         -3.51 5.62e- 3
## 2 Afghanistan  <tibble [12~ <lm>  year        0.165   0.0396       4.17 1.93e- 3
## 3 Albania      <tibble [13~ <lm>  (Interc~ -490.     10.1        -48.3  3.68e-14
## 4 Albania      <tibble [13~ <lm>  year        0.277   0.00504     55.0  8.93e-15
## 5 Algeria      <tibble [8 ~ <lm>  (Interc~ -291.      6.63       -43.9  9.31e- 9
## 6 Algeria      <tibble [8 ~ <lm>  year        0.177   0.00329     53.8  2.77e- 9
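Once the coefficients are in one long table, comparing groups is a matter of ordinary dplyr verbs. As one possible approach, the sketch below keeps only the slope rows and sorts them, using a small hypothetical dataset (toy, group, and y are made up for illustration) in place of the happiness data.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical toy data with three groups and different trends
toy <- tibble(
  group = rep(c("a", "b", "c"), each = 5),
  year = rep(2001:2005, times = 3),
  y = c(10.1, 11.9, 14.2, 15.8, 18.1,  # roughly +2 per year
        5.0, 5.6, 5.9, 6.6, 7.0,       # roughly +0.5 per year
        9.0, 8.1, 6.9, 6.1, 5.0)       # roughly -1 per year
)

slopes <- toy %>%
  nest(data = !group) %>%
  mutate(
    model = map(data, ~ lm(y ~ year, data = .x)),
    tidied = map(model, tidy)
  ) %>%
  unnest(tidied) %>%
  filter(term == "year") %>%                        # keep only the slope rows
  select(group, estimate, std.error, p.value) %>%
  arrange(desc(estimate))                           # steepest increase first

slopes
```

The same filter and arrange steps could be applied to by_country_model to rank countries by their estimated yearly change in healthy life expectancy.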

Cassandra Coste TidyVerse

Cassandra Coste

4/11/2021