This is an introduction to the nest() and unnest() functions in the ‘tidyr’ package, which is part of the tidyverse.
When you nest a data frame you create a column that contains a list of data frames. Nesting works as a summarizing function since you get one row for each group defined by the non-nested columns.
You can create nested data frames using tidyr::nest(). For example, df %>% nest(data = c(x, y)) specifies that the columns x and y are to be nested into a list-column named data.
When used in conjunction with the ‘purrr’ and ‘broom’ packages, nesting lets you apply operations to each data frame in the list.
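As a minimal sketch of what nesting does, the toy data frame below (the names toy, g, x, and y are made up for illustration and are not part of the happiness data) is collapsed to one row per group, with the remaining columns stored in a list-column.

```r
library(tidyr)

# Hypothetical toy data: two made-up groups with two value columns
toy <- tibble(
  g = c("a", "a", "b", "b", "b"),
  x = 1:5,
  y = c(10, 20, 30, 40, 50)
)

# Nest x and y into a list-column named data; g stays as the grouping key
nested <- toy %>% nest(data = c(x, y))

nrow(nested)            # one row per distinct value of g
nrow(nested$data[[2]])  # number of rows that belonged to group "b"
```

Each element of nested$data is itself a small data frame, which is what makes the later map-based steps possible.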
For this example I will be using several tidyverse packages, including tidyr, magrittr, broom, and purrr, which we load first.
library(tidyverse)
library(magrittr)
library(broom)
library(purrr)
Next we load the data, stored as a CSV file, that will be used for our examples. The data was obtained from Kaggle and originally comes from the World Happiness Report published by the Sustainable Development Solutions Network. This version of the data can be found here: https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021?select=world-happiness-report.csv
# read.csv() already returns a data frame with header = TRUE and sep = ","
df <- read.csv("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/data/world-happiness-report.csv", stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
First we start with an example with a single country, in this case Afghanistan, to demonstrate what we will ultimately want to do to all countries in the dataset. We will do this by filtering the dataset for the country name Afghanistan. Then we will run a simple linear model with the outcome variable life expectancy and the predictor variable of year. Finally, we can use the tidy function from the broom package to view the linear regression information in a tidy model format.
Afghanistan_by_year <- df %>% filter(Country.name == "Afghanistan")
Afghanistan_lm <- lm(Healthy.life.expectancy.at.birth ~ year, data = Afghanistan_by_year)
tidy(Afghanistan_lm)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -280. 79.8 -3.51 0.00562
## 2 year 0.165 0.0396 4.17 0.00193
To prepare to run our analysis on all countries, we can create nested dataframes. The code below indicates that we want to nest all columns besides the country name column into a column named data, so that for each country there will be a dataframe containing the other 10 variables for that country.
First, we use the map function to count the NA values in each column and see that our outcome variable of interest, healthy life expectancy, has 55 NA values, which we drop so that our model can run.
Then we nest all variables except for the country name, leaving us with the aforementioned data column to run our linear model on.
#identify na values that may be an issue
map(df, ~sum(is.na(.)))
## $Country.name
## [1] 0
##
## $year
## [1] 0
##
## $Life.Ladder
## [1] 0
##
## $Log.GDP.per.capita
## [1] 36
##
## $Social.support
## [1] 13
##
## $Healthy.life.expectancy.at.birth
## [1] 55
##
## $Freedom.to.make.life.choices
## [1] 32
##
## $Generosity
## [1] 89
##
## $Perceptions.of.corruption
## [1] 110
##
## $Positive.affect
## [1] 22
##
## $Negative.affect
## [1] 16
#drop na values in outcome variable column
by_country <- df %>% drop_na(Healthy.life.expectancy.at.birth)
#nest the dataframe by country
by_country %<>%
nest(data = !Country.name)
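Nesting loses no information, since unnest() reverses it. The sketch below uses a small made-up data frame (toy, country, year, and value are hypothetical stand-ins, not the happiness data) to show the round trip.

```r
library(tidyr)

# Hypothetical toy data with the same shape as the country-by-year table
toy <- tibble(
  country = c("A", "A", "B"),
  year    = c(2005, 2006, 2005),
  value   = c(1.5, 1.7, 3.2)
)

# Nest everything except the key column, then expand it back out
nested   <- toy %>% nest(data = !country)
restored <- nested %>% unnest(data)

nrow(nested)    # one row per country
nrow(restored)  # same number of rows as the original toy data
```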
Now that we have nested dataframes for each country, we can use the purrr package and the map function to run the linear regression for each country.
In general, map() applies an operation to each item in a list and returns the results as a list.
For example, given the list a <- list(1, 2, 3, 4),
we can use map to multiply each item by 2:
map(a, ~ . * 2) returns a list containing each item of “a” multiplied by 2, that is 2, 4, 6, 8, as seen below.
a <- list(1, 2, 3, 4)
map(a, ~ . * 2)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 6
##
## [[4]]
## [1] 8
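When every result is a single number, purrr's typed variant map_dbl() may be more convenient, since it returns a plain numeric vector instead of a list. A minimal sketch using the same list:

```r
library(purrr)

a <- list(1, 2, 3, 4)

# map_dbl() applies the same operation but simplifies the result
# to a numeric (double) vector rather than a list
doubled <- map_dbl(a, ~ .x * 2)
doubled
```

The list-returning map() is still what we want for the models below, because an lm object cannot be stored in an atomic vector.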
Returning to the life expectancy example, we can use the map function to run a simple linear regression for each country and store the fitted models in a new column named model.
# Use map to run the linear regression model for each country in the dataframe using the nested dataframes
by_country_model <- by_country %>% mutate(model = map(data, ~ lm(Healthy.life.expectancy.at.birth ~ year, data = .x)))
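Each element of the model column is an ordinary lm object, so base R accessors such as coef() work on it. As a sketch under assumed toy data (the names toy, country, and life are hypothetical, and the values are constructed to be exactly linear), a single slope per group can be pulled out with map_dbl():

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: two made-up countries with exactly linear trends
toy <- tibble(
  country = rep(c("A", "B"), each = 4),
  year    = rep(2001:2004, times = 2),
  life    = c(50, 51, 52, 53, 70, 70.5, 71, 71.5)
)

toy_models <- toy %>%
  nest(data = !country) %>%
  mutate(
    # one fitted model per nested data frame
    model = map(data, ~ lm(life ~ year, data = .x)),
    # extract the coefficient on year from each model as a number
    slope = map_dbl(model, ~ coef(.x)[["year"]])
  )

toy_models %>% select(country, slope)
```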
To take this one step further, we can tidy our model column, which contains a list of fitted lm objects. We use map again with the tidy function to turn each model into a dataframe of coefficients, stored in a new column called tidied. Finally, we use unnest on the tidied column so that we can easily see the coefficients of the model run for each country.
# Here we run the same models as earlier but tidy and unnest the results
by_country_model <- by_country %>%
  mutate(
    model = map(data, ~ lm(Healthy.life.expectancy.at.birth ~ year, data = .)),
    tidied = map(model, tidy)
  ) %>%
  unnest(tidied)
# View our tidied model results
head(by_country_model)
## # A tibble: 6 x 8
## Country.name data model term estimate std.error statistic p.value
## <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan <tibble [12~ <lm> (Interc~ -280. 79.8 -3.51 5.62e- 3
## 2 Afghanistan <tibble [12~ <lm> year 0.165 0.0396 4.17 1.93e- 3
## 3 Albania <tibble [13~ <lm> (Interc~ -490. 10.1 -48.3 3.68e-14
## 4 Albania <tibble [13~ <lm> year 0.277 0.00504 55.0 8.93e-15
## 5 Algeria <tibble [8 ~ <lm> (Interc~ -291. 6.63 -43.9 9.31e- 9
## 6 Algeria <tibble [8 ~ <lm> year 0.177 0.00329 53.8 2.77e- 9
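Because the unnested result has one row per coefficient, a natural follow-up is to keep only the year rows and sort by the estimated slope. The sketch below illustrates the idea on made-up data (toy, country, and life are hypothetical names and values, not the happiness dataset).

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical toy data standing in for the happiness dataset
toy <- tibble(
  country = rep(c("A", "B"), each = 4),
  year    = rep(2001:2004, times = 2),
  life    = c(50, 51.2, 51.9, 53.1, 70, 70.4, 71.1, 71.4)
)

slopes <- toy %>%
  nest(data = !country) %>%
  mutate(
    model  = map(data, ~ lm(life ~ year, data = .x)),
    tidied = map(model, tidy)
  ) %>%
  unnest(tidied) %>%
  filter(term == "year") %>%              # keep only the slope rows
  select(country, estimate, p.value) %>%
  arrange(desc(estimate))                 # steepest increase first

slopes
```

The same filter and arrange steps could be applied to by_country_model to see which countries have the fastest estimated growth in healthy life expectancy.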