Train and analyze many models for TidyTuesday crop fields
Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from just getting started to tuning more complex models. Today’s screencast explores how to fluently and tidy data principles to the tasks of building many models using with this week’s TidyTuesday dataset on crop yields.
Here is the coe I used i nthe video, for those who prefer reading instead of or in addition to video.
Our modeling goal is to estimate how crops yields are changing around the world using this week. We can build many models for the country-crop combinations we are interested in.
First, Let’s read in two of the dataset for this week
library(tidyverse)
## -- Attaching packages --------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.0
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
theme_set(theme_light())
key_crop_yields <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv")
## Parsed with column specification:
## cols(
## Entity = col_character(),
## Code = col_character(),
## Year = col_double(),
## `Wheat (tonnes per hectare)` = col_double(),
## `Rice (tonnes per hectare)` = col_double(),
## `Maize (tonnes per hectare)` = col_double(),
## `Soybeans (tonnes per hectare)` = col_double(),
## `Potatoes (tonnes per hectare)` = col_double(),
## `Beans (tonnes per hectare)` = col_double(),
## `Peas (tonnes per hectare)` = col_double(),
## `Cassava (tonnes per hectare)` = col_double(),
## `Barley (tonnes per hectare)` = col_double(),
## `Cocoa beans (tonnes per hectare)` = col_double(),
## `Bananas (tonnes per hectare)` = col_double()
## )
land_use <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/land_use_vs_yield_change_in_cereal_production.csv")
## Parsed with column specification:
## cols(
## Entity = col_character(),
## Code = col_character(),
## Year = col_character(),
## `Cereal yield index` = col_double(),
## `Change to land area used for cereal production since 1961` = col_double(),
## `Total population (Gapminder)` = col_double()
## )
I’m going to use the land_use dataset only t find the top population countries. Let’s create a vector of their names.
top_countries <- land_use %>%
janitor::clean_names() %>%
filter(!is.na(code), entity !='World') %>%
group_by(entity) %>%
filter(year == max(year)) %>%
ungroup() %>%
slice_max(total_population_gapminder, n = 30) %>%
pull(entity)
Now let’s create a tidy version of the crop yields data, for the countries and crops I am interested in.
tidy_yields <- key_crop_yields %>%
janitor::clean_names() %>%
pivot_longer(wheat_tonnes_per_hectare:bananas_tonnes_per_hectare,
names_to = 'crop', values_to = 'yield') %>%
mutate(crop = str_remove(crop, '_tonnes_per_hectare')) %>%
filter(
crop %in% c('wheat', 'rice', 'maize', 'barley'),
entity %in% top_countries,
!is.na(yield)
)
tidy_yields
## # A tibble: 6,032 x 5
## entity code year crop yield
## <chr> <chr> <dbl> <chr> <dbl>
## 1 Bangladesh BGD 1961 wheat 0.574
## 2 Bangladesh BGD 1961 rice 1.70
## 3 Bangladesh BGD 1961 maize 0.799
## 4 Bangladesh BGD 1961 barley 0.577
## 5 Bangladesh BGD 1962 wheat 0.675
## 6 Bangladesh BGD 1962 rice 1.53
## 7 Bangladesh BGD 1962 maize 0.738
## 8 Bangladesh BGD 1962 barley 0.544
## 9 Bangladesh BGD 1963 wheat 0.607
## 10 Bangladesh BGD 1963 rice 1.77
## # ... with 6,022 more rows
This data structure is just right for plotting crop yield over time!
tidy_yields %>%
ggplot(aes(year, yield, color = crop)) +
geom_line(alpha = 0.7, size = 1.5) +
geom_point() +
facet_wrap(~entity, ncol = 5) +
scale_x_continuous(guide = guide_axis(angle = 90)) +
labs(x= NULL, y =' yield (tons per hectare)')
Notice that not all countries produce all crop, but that overall crop yields are increasing
Now let’s fit a linear model to each country-crop combination
library(tidymodels)
## -- Attaching packages -------------------------------- tidymodels 0.1.1 --
## v broom 0.7.0 v recipes 0.1.13
## v dials 0.0.8 v rsample 0.0.7
## v infer 0.5.3 v tune 0.1.1
## v modeldata 0.0.2 v workflows 0.1.2
## v parsnip 0.1.2 v yardstick 0.0.7
## -- Conflicts ----------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
tidy_lm <- tidy_yields %>%
nest(yields = c(year, yield)) %>%
mutate(model = map(yields, ~lm(yield ~ year, data = .x)))
tidy_lm
## # A tibble: 111 x 5
## entity code crop yields model
## <chr> <chr> <chr> <list> <list>
## 1 Bangladesh BGD wheat <tibble [58 x 2]> <lm>
## 2 Bangladesh BGD rice <tibble [58 x 2]> <lm>
## 3 Bangladesh BGD maize <tibble [58 x 2]> <lm>
## 4 Bangladesh BGD barley <tibble [58 x 2]> <lm>
## 5 Brazil BRA wheat <tibble [58 x 2]> <lm>
## 6 Brazil BRA rice <tibble [58 x 2]> <lm>
## 7 Brazil BRA maize <tibble [58 x 2]> <lm>
## 8 Brazil BRA barley <tibble [58 x 2]> <lm>
## 9 China CHN wheat <tibble [58 x 2]> <lm>
## 10 China CHN rice <tibble [58 x 2]> <lm>
## # ... with 101 more rows
Next let’s tidy() those models to get out the coefficients, and adjust the p-values for multiple comparisons while we’re at it
slopes <- tidy_lm %>%
mutate(coefs = map(model, tidy)) %>%
unnest(coefs) %>%
filter(term == 'year') %>%
mutate(p.value = p.adjust(p.value))
slopes
## # A tibble: 111 x 10
## entity code crop yields model term estimate std.error statistic p.value
## <chr> <chr> <chr> <list> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bangla~ BGD wheat <tibbl~ <lm> year 0.0389 0.00253 15.4 5.11e-20
## 2 Bangla~ BGD rice <tibbl~ <lm> year 0.0600 0.00231 26.0 6.05e-31
## 3 Bangla~ BGD maize <tibbl~ <lm> year 0.122 0.0107 11.3 1.82e-14
## 4 Bangla~ BGD barl~ <tibbl~ <lm> year 0.00505 0.000596 8.47 4.34e-10
## 5 Brazil BRA wheat <tibbl~ <lm> year 0.0366 0.00222 16.5 2.55e-21
## 6 Brazil BRA rice <tibbl~ <lm> year 0.0755 0.00490 15.4 4.96e-20
## 7 Brazil BRA maize <tibbl~ <lm> year 0.0709 0.00395 18.0 4.37e-23
## 8 Brazil BRA barl~ <tibbl~ <lm> year 0.0466 0.00319 14.6 5.05e-19
## 9 China CHN wheat <tibbl~ <lm> year 0.0880 0.00141 62.6 1.72e-51
## 10 China CHN rice <tibbl~ <lm> year 0.0843 0.00289 29.2 1.47e-33
## # ... with 101 more rows
Now we can visualize the results of this modeling, which is estimating how crop yields are changing around the world.
library(ggrepel)
slopes %>% ggplot(aes(estimate, p.value, label = entity)) +
geom_vline(
xintercept = 0, lty = 2,
size = 1.5, alpha = 0.7, color = 'gray50'
) +
geom_point(aes(color = crop), alpha = 0.8, size = 2.5, show.legend = FALSE) +
scale_y_log10() +
facet_wrap(~crop) +
geom_text_repel(size = 3, family = 'IBMPlexSans') +
theme_light(base_family = 'IBMPlexSans') +
theme(strip.text = element_text(family = 'IBMPlexSans-Bold', size = 12)) +
labs(x = 'Increase in tons per hectare per year')
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
On the x-axis is the slope of these models. Notice that most countries are on the positive side, with increasing crop yields. The further to the right a country is the larger the increase in crop yield over this time period. Corn yields have increased the most.
On the y-axis is the p-value, a measure of how surprising the effect we see is under the assumption of no relationship (no change with time). Countries lower in the plots have smaller p-values, we are certain those are real relationships.
We can extend this to check out how well these models fit the data with glance(). This approach for using statistical models to estimate changes in many subgroups at once has been so helpful to me in many situations!