In this sample vignette, I will walk through some of the functionality of the purrr Tidyverse package using an agriculture dataset from Kaggle.
According to the Tidyverse site, purrr “enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.” Its collection of functions, centered around the map function, replaces many conventional for loops with a compact, easy-to-read syntax.
In this tutorial, I will go over the following functions:
- map
- map_* family
- map2
- pmap
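To make the comparison with a for loop concrete, here is a minimal sketch that computes the same result both ways. The `values` list is a hypothetical toy input, not part of the farm dataset.

```r
library(purrr)

# Hypothetical toy input, not taken from the farm dataset
values <- list(a = 1:3, b = 4:6, c = 7:9)

# Base R: pre-allocate a named list, then fill it in a for loop
loop_result <- vector("list", length(values))
names(loop_result) <- names(values)
for (name in names(values)) {
  loop_result[[name]] <- sum(values[[name]])
}

# purrr: the same computation in a single call
map_result <- map(values, sum)

# Both approaches should produce the same named list
identical(loop_result, map_result)
```

The loop needs separate steps for allocation, naming, and assignment, while map handles all three.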
Before we get started, let's read in our CSV file and prep the farm_data data frame.
library(tidyverse) # Provides %>%, rename_all, the stringr helpers, purrr, and ggplot2
farm_data <- read.csv('https://raw.githubusercontent.com/mcastro64/d607-assignments/refs/heads/master/assignment-07/agriculture_dataset.csv') %>%
rename_all(~str_to_lower(.)) %>% # Convert all column names to lowercase
rename_all(~str_replace_all(., "\\.", "_")) %>% # Replace the dots that read.csv puts in column names with underscores
rename_all(~str_replace_all(., "_$", "")) # Remove a trailing underscore
glimpse(farm_data)
## Rows: 50
## Columns: 10
## $ farm_id <chr> "F001", "F002", "F003", "F004", "F005", "F006…
## $ crop_type <chr> "Cotton", "Carrot", "Sugarcane", "Tomato", "T…
## $ farm_area_acres <dbl> 329.40, 18.67, 306.03, 380.21, 135.56, 12.50,…
## $ irrigation_type <chr> "Sprinkler", "Manual", "Flood", "Rain-fed", "…
## $ fertilizer_used_tons <dbl> 8.14, 4.77, 2.91, 3.32, 8.33, 6.42, 1.83, 5.1…
## $ pesticide_used_kg <dbl> 2.21, 4.36, 0.56, 4.35, 4.48, 2.25, 2.37, 0.9…
## $ yield_tons <dbl> 14.44, 42.91, 33.44, 34.08, 43.28, 38.18, 44.…
## $ soil_type <chr> "Loamy", "Peaty", "Silty", "Silty", "Clay", "…
## $ season <chr> "Kharif", "Kharif", "Kharif", "Zaid", "Zaid",…
## $ water_usage_cubic_meters <dbl> 76648.20, 68725.54, 75538.56, 45401.23, 93718…
The map function applies a function to each element of a list or atomic vector and returns the results as a list. It is called in the format map(.x, .f), where .x is our list/vector and .f is our function.
In Example 1, we use purrr to calculate the mean of each numeric column in our sample data frame. We first pass the data frame to the dplyr function “select”, using “where” with is.numeric to keep only the columns with numeric values. Because a data frame is a list of columns, we can pass the result into the map function using the .x parameter and pass the function “mean” using the .f parameter. Example 2 shows an abbreviated version of the same calculation that pipes the selected columns straight into map.
Example 1
ex1_list <- farm_data |>
select(where(is.numeric))
map(.x = ex1_list, .f = mean)
## $farm_area_acres
## [1] 254.9638
##
## $fertilizer_used_tons
## [1] 4.9054
##
## $pesticide_used_kg
## [1] 2.398
##
## $yield_tons
## [1] 27.0592
##
## $water_usage_cubic_meters
## [1] 56724.3
Example 2
ex2_list <- farm_data |>
select(where(is.numeric)) |>
map(mean)
ex2_list
## $farm_area_acres
## [1] 254.9638
##
## $fertilizer_used_tons
## [1] 4.9054
##
## $pesticide_used_kg
## [1] 2.398
##
## $yield_tons
## [1] 27.0592
##
## $water_usage_cubic_meters
## [1] 56724.3
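The .f argument does not have to be a bare function name. As a minimal sketch, assuming a toy list with a missing value (the `toy` object is hypothetical, not from the farm data), an anonymous function lets us set extra arguments such as na.rm:

```r
library(purrr)

# Hypothetical toy columns, one of which contains a missing value
toy <- list(area = c(10, 20, NA), yield = c(1, 2, 3))

# With a bare function name, the NA propagates into the mean of `area`
plain_means <- map(toy, mean)

# An anonymous function lets us pass na.rm = TRUE to mean
clean_means <- map(toy, \(col) mean(col, na.rm = TRUE))

clean_means
```

The `\(col)` shorthand requires R 4.1 or later, the same version that introduced the `|>` pipe used in the examples.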
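In addition to the map function, which always returns a list, purrr has dedicated functions that return specific vector types. Each function is named with the “map_” prefix followed by an abbreviation of the vector type (for example, map_dbl returns a double vector). Example 3 uses map_dbl to convert each farm area from acres to square miles (1 square mile = 640 acres).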
Example 3
farm_area_miles <- map_dbl(farm_data$farm_area_acres, function(.x) {
return(.x/640)
})
farm_area_miles
## [1] 0.51468750 0.02917188 0.47817187 0.59407813 0.21181250 0.01953125
## [7] 0.56259375 0.72593750 0.60839062 0.28807813 0.43742187 0.22706250
## [13] 0.51421875 0.38440625 0.47679687 0.09409375 0.44376562 0.20035937
## [19] 0.72020313 0.09195312 0.58914062 0.14479687 0.02448437 0.75606250
## [25] 0.11818750 0.25356250 0.58609375 0.40029687 0.45081250 0.44768750
## [31] 0.21275000 0.54753125 0.69806250 0.41268750 0.41567187 0.69712500
## [37] 0.24390625 0.67378125 0.34450000 0.26065625 0.57935938 0.65467188
## [43] 0.12310938 0.13143750 0.51045312 0.17625000 0.54321875 0.12092187
## [49] 0.72245313 0.45664063
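The other typed variants work the same way. The sketch below uses a short hypothetical vector of crop names to show map_int, map_chr, and map_lgl side by side.

```r
library(purrr)

# Hypothetical input: three crop names
crops <- c("Cotton", "Carrot", "Sugarcane")

# map_int returns an integer vector (nchar returns integers)
name_lengths <- map_int(crops, nchar)

# map_chr returns a character vector
upper_names <- map_chr(crops, toupper)

# map_lgl returns a logical vector
is_long <- map_lgl(crops, \(x) nchar(x) > 6)
```

Each variant raises an error if the function returns a value that does not fit the requested type, which catches mistakes that a plain list from map would hide.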
purrr can also return a data frame using the map_df function. Example 4 modifies the map_dbl example to return a data frame of farm area in acres and square miles (the “miles” column). This example also uses an inline function whose argument .x represents one element of the vector (“farm_data$farm_area_acres”) that was passed into map_df. The function returns a one-row data frame for each element, and map_df binds those rows into a single data frame.
Example 4
map_area_df <- map_df(farm_data$farm_area_acres, function(.x) {
return(data.frame(acres = .x, miles = .x/640))
})
map_area_df
## acres miles
## 1 329.40 0.51468750
## 2 18.67 0.02917188
## 3 306.03 0.47817187
## 4 380.21 0.59407813
## 5 135.56 0.21181250
## 6 12.50 0.01953125
## 7 360.06 0.56259375
## 8 464.60 0.72593750
## 9 389.37 0.60839062
## 10 184.37 0.28807813
## 11 279.95 0.43742187
## 12 145.32 0.22706250
## 13 329.10 0.51421875
## 14 246.02 0.38440625
## 15 305.15 0.47679687
## 16 60.22 0.09409375
## 17 284.01 0.44376562
## 18 128.23 0.20035937
## 19 460.93 0.72020313
## 20 58.85 0.09195312
## 21 377.05 0.58914062
## 22 92.67 0.14479687
## 23 15.67 0.02448437
## 24 483.88 0.75606250
## 25 75.64 0.11818750
## 26 162.28 0.25356250
## 27 375.10 0.58609375
## 28 256.19 0.40029687
## 29 288.52 0.45081250
## 30 286.52 0.44768750
## 31 136.16 0.21275000
## 32 350.42 0.54753125
## 33 446.76 0.69806250
## 34 264.12 0.41268750
## 35 266.03 0.41567187
## 36 446.16 0.69712500
## 37 156.10 0.24390625
## 38 431.22 0.67378125
## 39 220.48 0.34450000
## 40 166.82 0.26065625
## 41 370.79 0.57935938
## 42 418.99 0.65467188
## 43 78.79 0.12310938
## 44 84.12 0.13143750
## 45 326.69 0.51045312
## 46 112.80 0.17625000
## 47 347.66 0.54321875
## 48 77.39 0.12092187
## 49 462.37 0.72245313
## 50 292.25 0.45664063
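In purrr 1.0.0 and later, map_df (an alias of map_dfr) is marked as superseded in favor of calling map and then list_rbind. A minimal sketch of that pattern, assuming purrr 1.0.0 or later and using only the first three farm areas from the output above:

```r
library(purrr)

# First three farm areas, copied from the output above
acres <- c(329.40, 18.67, 306.03)

# Build one single-row data frame per value, then bind the rows together
area_df <- acres |>
  map(\(a) data.frame(acres = a, sq_miles = a / 640)) |>
  list_rbind()

area_df
```

The column name `sq_miles` is my own choice for this sketch. map_df still works, so switching is optional.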
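purrr also has two functions for handling more than one list at a time:
- map2: applies a function to pairs of elements taken from two lists or vectors
- pmap: applies a function to the elements of any number of lists or vectors, supplied together as a single list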
Working with map2
Example 4 below shows how to work with map2. We pass two vectors as .x and .y. map2 steps through them in parallel and calls the function once per pair of elements, which the function body receives as its .x and .y arguments. map2 returns a list.
Example 4
# map2 example
map2_ex <- map2(.x = farm_data$yield_tons, .y = farm_data$farm_area_acres, function(.x, .y) {
return(.x /.y)
})
# print type and first 3 elements
class(map2_ex)
## [1] "list"
print(map2_ex[1:3])
## [[1]]
## [1] 0.04383728
##
## [[2]]
## [1] 2.29834
##
## [[3]]
## [1] 0.1092703
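Like map, map2 has typed variants. As a minimal sketch, map2_dbl returns a double vector instead of a list. The inputs below are the first three yield and area values from the dataset preview.

```r
library(purrr)

# First three values of each column, copied from the glimpse output
yield_tons <- c(14.44, 42.91, 33.44)
farm_area_acres <- c(329.40, 18.67, 306.03)

# map2_dbl returns a numeric vector rather than a list
yield_per_acre <- map2_dbl(yield_tons, farm_area_acres, \(y, a) y / a)

yield_per_acre
```

For simple arithmetic like this, the vectorized expression `yield_tons / farm_area_acres` gives the same result. map2 becomes useful when the function is not vectorized.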
Working with pmap
pmap allows you to loop through more than two vectors at once by passing them in as a single list. Rather than use .x or .y, you name the function's arguments after the elements of that list. Example 5 generates randomized sample data with rnorm using pmap. The first half of the code chunk builds a data frame with columns for the mean (means), standard deviation (sds), and sample size (samplesize), where each farm's row holds the yield statistics of its crop type from the farm_data data frame. Because a data frame is a list of columns, we can pass it directly to pmap, which calls our function once per row.
Example 5
# construct list for this example
calcs_list <- farm_data |>
group_by(crop_type) |>
mutate(
means = mean(yield_tons),
sds = sd(yield_tons),
samplesize = 50
) |>
subset(select=c(means, sds, samplesize))
# pmap_example
set.seed(1234)
pmap_ex <- pmap(calcs_list, function(means, sds, samplesize) {
data.frame(sample = rnorm(n=samplesize, mean=means, sd=sds))
})
ggplot(pmap_ex[[1]], aes(x=sample))+
geom_histogram(bins = 50)
ggplot(pmap_ex[[2]], aes(x=sample))+
geom_histogram(bins = 50)
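To see the name matching in isolation, here is a minimal sketch with a small hand-made parameter list. The values in `params` are invented for illustration and do not come from farm_data.

```r
library(purrr)

# Hypothetical parameters: two distributions with different sample sizes
params <- list(
  means = c(10, 50),
  sds = c(1, 5),
  samplesize = c(5, 8)
)

set.seed(1234)

# The function's argument names must match the names of the list elements
samples <- pmap(params, \(means, sds, samplesize) {
  rnorm(n = samplesize, mean = means, sd = sds)
})

# One result per parameter set, each with the requested sample size
lengths(samples)
```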
Nesting and combining map functions
The purrr functions can also be used in combination with other Tidyverse packages and functions. Example 6 uses the map function within mutate to fit a linear model with the lm function for every “crop_type” in our farm data. Before we can pass the data to map, the example uses nest to convert the grouped data frame into a nested data frame, where the group is in the first column and the remaining columns from our original data frame are “nested” as a list-column of data frames in the second column. The returned object adds the fitted models (lms) as a third column, which can be further cleaned up and unnested for other applications.
Example 6
farm_data_lms <- farm_data |>
group_by(crop_type) |>
nest() |>
mutate(lms = map(data, ~lm(yield_tons ~ fertilizer_used_tons + pesticide_used_kg, data = .)))
farm_data_lms[[3]][1]
## [[1]]
##
## Call:
## lm(formula = yield_tons ~ fertilizer_used_tons + pesticide_used_kg,
## data = .)
##
## Coefficients:
## (Intercept) fertilizer_used_tons pesticide_used_kg
## 51.305 -2.207 -7.480
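One possible cleanup step is to pull a single number out of each fitted model with map_dbl. The sketch below uses the built-in mtcars dataset as a stand-in so that it runs on its own. The column names `fit`, `slope`, and `r_squared` are my own choices, not part of the example above.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Stand-in data: fit one model per cylinder group in the built-in mtcars dataset
models <- mtcars |>
  group_by(cyl) |>
  nest() |>
  mutate(
    fit = map(data, \(df) lm(mpg ~ wt, data = df)),       # list-column of models
    slope = map_dbl(fit, \(m) coef(m)[["wt"]]),            # one coefficient per model
    r_squared = map_dbl(fit, \(m) summary(m)$r.squared)    # one R-squared per model
  )

# Drop the list-columns to get a plain summary table
models |>
  select(cyl, slope, r_squared)
```

The same pattern should carry over to farm_data_lms by replacing `fit` with `lms` and `wt` with one of the predictors from Example 6.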
The purrr package is a great alternative to using for loops. One advantage is its more human-friendly syntax, which compresses multiple lines of code into an easier-to-read format.
This tutorial only covers some of the most common functions of the purrr package, but the package includes many more tools to make your R programming more efficient, including walk, modify, and other map variants like imap and map_if. For more information, visit the purrr documentation or review the Posit purrr cheatsheet.
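As a brief taste of those additional functions, here is a minimal sketch of imap, walk, and map_if. The `yields` and `mixed` lists are hypothetical inputs made up for illustration.

```r
library(purrr)

# Hypothetical named list of yields per crop
yields <- list(Cotton = c(14.4, 20.1), Carrot = c(42.9, 38.2))

# imap passes each element together with its name (or index) to the function
labels <- imap_chr(yields, \(values, crop) paste0(crop, ": ", mean(values)))

# walk calls the function for its side effect (here, printing)
walk(labels, print)

# map_if applies the function only to elements where the predicate is TRUE
mixed <- list(a = 1:3, b = "text")
doubled <- map_if(mixed, is.numeric, \(x) x * 2)
```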