Train and analyze many models for TidyTuesday crop fields

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from just getting started to tuning more complex models. Today’s screencast explores how to fluently and tidy data principles to the tasks of building many models using with this week’s TidyTuesday dataset on crop yields.

Here is the coe I used i nthe video, for those who prefer reading instead of or in addition to video.

Explore the data

Our modeling goal is to estimate how crops yields are changing around the world using this week. We can build many models for the country-crop combinations we are interested in.

First, Let’s read in two of the dataset for this week

library(tidyverse)
## -- Attaching packages --------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
theme_set(theme_light())

key_crop_yields <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv")
## Parsed with column specification:
## cols(
##   Entity = col_character(),
##   Code = col_character(),
##   Year = col_double(),
##   `Wheat (tonnes per hectare)` = col_double(),
##   `Rice (tonnes per hectare)` = col_double(),
##   `Maize (tonnes per hectare)` = col_double(),
##   `Soybeans (tonnes per hectare)` = col_double(),
##   `Potatoes (tonnes per hectare)` = col_double(),
##   `Beans (tonnes per hectare)` = col_double(),
##   `Peas (tonnes per hectare)` = col_double(),
##   `Cassava (tonnes per hectare)` = col_double(),
##   `Barley (tonnes per hectare)` = col_double(),
##   `Cocoa beans (tonnes per hectare)` = col_double(),
##   `Bananas (tonnes per hectare)` = col_double()
## )
land_use <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/land_use_vs_yield_change_in_cereal_production.csv")
## Parsed with column specification:
## cols(
##   Entity = col_character(),
##   Code = col_character(),
##   Year = col_character(),
##   `Cereal yield index` = col_double(),
##   `Change to land area used for cereal production since 1961` = col_double(),
##   `Total population (Gapminder)` = col_double()
## )

I’m going to use the land_use dataset only t find the top population countries. Let’s create a vector of their names.

top_countries <- land_use %>% 
               janitor::clean_names() %>% 
               filter(!is.na(code), entity !='World') %>% 
               group_by(entity) %>% 
               filter(year == max(year)) %>% 
               ungroup() %>% 
               slice_max(total_population_gapminder, n = 30) %>% 
               pull(entity)

Now let’s create a tidy version of the crop yields data, for the countries and crops I am interested in.

tidy_yields <- key_crop_yields %>% 
               janitor::clean_names() %>% 
               pivot_longer(wheat_tonnes_per_hectare:bananas_tonnes_per_hectare,
                            names_to = 'crop', values_to = 'yield') %>% 
               mutate(crop = str_remove(crop, '_tonnes_per_hectare')) %>% 
               filter(
                              crop %in% c('wheat', 'rice', 'maize', 'barley'),
                              entity %in% top_countries,
                              !is.na(yield)
               )


tidy_yields
## # A tibble: 6,032 x 5
##    entity     code   year crop   yield
##    <chr>      <chr> <dbl> <chr>  <dbl>
##  1 Bangladesh BGD    1961 wheat  0.574
##  2 Bangladesh BGD    1961 rice   1.70 
##  3 Bangladesh BGD    1961 maize  0.799
##  4 Bangladesh BGD    1961 barley 0.577
##  5 Bangladesh BGD    1962 wheat  0.675
##  6 Bangladesh BGD    1962 rice   1.53 
##  7 Bangladesh BGD    1962 maize  0.738
##  8 Bangladesh BGD    1962 barley 0.544
##  9 Bangladesh BGD    1963 wheat  0.607
## 10 Bangladesh BGD    1963 rice   1.77 
## # ... with 6,022 more rows

This data structure is just right for plotting crop yield over time!

tidy_yields %>% 
               ggplot(aes(year, yield, color =  crop)) + 
               geom_line(alpha = 0.7, size = 1.5) + 
               geom_point() +
               facet_wrap(~entity, ncol = 5) +
               scale_x_continuous(guide = guide_axis(angle = 90)) + 
labs(x= NULL, y =' yield (tons per hectare)')

Notice that not all countries produce all crop, but that overall crop yields are increasing

Many models

Now let’s fit a linear model to each country-crop combination

library(tidymodels)
## -- Attaching packages -------------------------------- tidymodels 0.1.1 --
## v broom     0.7.0      v recipes   0.1.13
## v dials     0.0.8      v rsample   0.0.7 
## v infer     0.5.3      v tune      0.1.1 
## v modeldata 0.0.2      v workflows 0.1.2 
## v parsnip   0.1.2      v yardstick 0.0.7
## -- Conflicts ----------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
tidy_lm <- tidy_yields %>% 
               nest(yields = c(year, yield)) %>% 
               mutate(model = map(yields, ~lm(yield ~ year, data = .x)))
tidy_lm
## # A tibble: 111 x 5
##    entity     code  crop   yields            model 
##    <chr>      <chr> <chr>  <list>            <list>
##  1 Bangladesh BGD   wheat  <tibble [58 x 2]> <lm>  
##  2 Bangladesh BGD   rice   <tibble [58 x 2]> <lm>  
##  3 Bangladesh BGD   maize  <tibble [58 x 2]> <lm>  
##  4 Bangladesh BGD   barley <tibble [58 x 2]> <lm>  
##  5 Brazil     BRA   wheat  <tibble [58 x 2]> <lm>  
##  6 Brazil     BRA   rice   <tibble [58 x 2]> <lm>  
##  7 Brazil     BRA   maize  <tibble [58 x 2]> <lm>  
##  8 Brazil     BRA   barley <tibble [58 x 2]> <lm>  
##  9 China      CHN   wheat  <tibble [58 x 2]> <lm>  
## 10 China      CHN   rice   <tibble [58 x 2]> <lm>  
## # ... with 101 more rows

Next let’s tidy() those models to get out the coefficients, and adjust the p-values for multiple comparisons while we’re at it

slopes <- tidy_lm %>% 
               mutate(coefs = map(model, tidy)) %>% 
               unnest(coefs) %>%
               filter(term == 'year') %>% 
               mutate(p.value = p.adjust(p.value))

slopes
## # A tibble: 111 x 10
##    entity  code  crop  yields  model term  estimate std.error statistic  p.value
##    <chr>   <chr> <chr> <list>  <lis> <chr>    <dbl>     <dbl>     <dbl>    <dbl>
##  1 Bangla~ BGD   wheat <tibbl~ <lm>  year   0.0389   0.00253      15.4  5.11e-20
##  2 Bangla~ BGD   rice  <tibbl~ <lm>  year   0.0600   0.00231      26.0  6.05e-31
##  3 Bangla~ BGD   maize <tibbl~ <lm>  year   0.122    0.0107       11.3  1.82e-14
##  4 Bangla~ BGD   barl~ <tibbl~ <lm>  year   0.00505  0.000596      8.47 4.34e-10
##  5 Brazil  BRA   wheat <tibbl~ <lm>  year   0.0366   0.00222      16.5  2.55e-21
##  6 Brazil  BRA   rice  <tibbl~ <lm>  year   0.0755   0.00490      15.4  4.96e-20
##  7 Brazil  BRA   maize <tibbl~ <lm>  year   0.0709   0.00395      18.0  4.37e-23
##  8 Brazil  BRA   barl~ <tibbl~ <lm>  year   0.0466   0.00319      14.6  5.05e-19
##  9 China   CHN   wheat <tibbl~ <lm>  year   0.0880   0.00141      62.6  1.72e-51
## 10 China   CHN   rice  <tibbl~ <lm>  year   0.0843   0.00289      29.2  1.47e-33
## # ... with 101 more rows

Explore results

Now we can visualize the results of this modeling, which is estimating how crop yields are changing around the world.

library(ggrepel)

slopes %>% ggplot(aes(estimate, p.value, label = entity)) + 
               geom_vline(
                              xintercept = 0, lty = 2,
                              size = 1.5, alpha = 0.7, color = 'gray50'
               ) + 
               geom_point(aes(color = crop), alpha = 0.8, size = 2.5, show.legend = FALSE) +
               scale_y_log10() +
               facet_wrap(~crop) + 
               geom_text_repel(size = 3, family = 'IBMPlexSans') +
               theme_light(base_family = 'IBMPlexSans') +
               theme(strip.text = element_text(family = 'IBMPlexSans-Bold', size = 12)) +
               labs(x = 'Increase in tons per hectare per year')
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

We can extend this to check out how well these models fit the data with glance(). This approach for using statistical models to estimate changes in many subgroups at once has been so helpful to me in many situations!