Challenge 10 Instructions

challenge_10

purrr

Author

Sean Conway

Published

January 10, 2024

Challenge Overview

The purrr package is a powerful tool for functional programming. It lets the user apply a single function to every element of a list or vector, replacing a for loop with one more readable, more concise function call.

For example, we can draw a random sample of size n from each of 10 normal distributions, using a vector of 10 means.

n <- 100 # sample size
m <- seq(1, 10) # means
# n is passed by name, so each element of m fills rnorm()'s next argument, mean
samps <- map(m, rnorm, n = n)

We can then use map_dbl to verify that this worked correctly by computing the mean for each sample.

samps %>%
  map_dbl(mean)

 [1] 1.068130 2.015383 2.992601 3.767358 4.921564 5.856775 7.055993 8.128033
 [9] 9.044389 9.994806
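
To make the for-loop comparison concrete, here is a minimal sketch that builds the same list of samples both ways. The seed value is an assumption added only so the two versions draw the same random numbers; it is not part of the original example.

```r
library(purrr)

n <- 100          # sample size
m <- seq(1, 10)   # means

# for-loop version: pre-allocate a list, then fill it one element at a time
set.seed(1)       # assumed seed, only to make the comparison reproducible
samps_loop <- vector("list", length(m))
for (i in seq_along(m)) {
  samps_loop[[i]] <- rnorm(n, mean = m[i])
}

# purrr version: one call; each element of m is matched to rnorm()'s mean
set.seed(1)       # reset so both versions draw the same random numbers
samps_map <- map(m, rnorm, n = n)

# both approaches produce the same list of 10 samples
identical(samps_loop, samps_map)
```

The loop needs explicit pre-allocation and indexing, while map() handles both and always returns a list of the same length as its input.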

purrr is tricky to learn (but beyond useful once you get a handle on it). Therefore, it’s imperative that you complete the purrr and map readings before attempting this challenge.

The challenge

Use purrr with a function to perform some data science task. The task itself is up to you. It could involve computing summary statistics, reading in multiple datasets, running a random process multiple times, or anything else you might need to do in your work as a data analyst. You might consider using purrr with a function you wrote for challenge 9.

Solutions

Reading the Data

The working directory for RStudio has been set with setwd() so that “eggs_tidy.csv” sits at the root of the working directory, where the call below looks for it.

eggs <- read_csv(here("eggs_tidy.csv"))
eggs

# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>

Data Description

High Level Description

The data set comprises 120 rows and 6 columns.

The data set has one <chr> column (month); the remaining five columns are of the <dbl> type. The month and year variables give the month and year of each observation. large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen each represent one type of egg, combining a size with a carton amount. Each case gives the count for each type of egg collected in that month and year.

How was the Data likely collected?

The dataset seems to provide a count of the total number of eggs of each of the 4 types for every month and year combination. It appears to be pre-cleaned, since no NA values are visible in the preview. The data was likely collected from official or unofficial sources that report egg counts for a poultry facility.
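
The column types and the absence of NA values described above can be checked column by column with purrr rather than read off the preview. The sketch below is hedged: eggs_demo is a hypothetical two-row stand-in for eggs with illustrative values, and one NA is planted on purpose to show what a non-zero count looks like. On the real data, the same two calls would be applied to eggs.

```r
library(tibble)
library(purrr)

# Hypothetical stand-in for `eggs`; the values are illustrative only
eggs_demo <- tibble(
  month = c("January", "March"),
  year = c(2004, 2004),
  large_half_dozen = c(126, 131),
  large_dozen = c(230, 225),
  extra_large_half_dozen = c(132, 137),
  extra_large_dozen = c(230, NA)  # NA planted deliberately for the demo
)

# one class name per column, returned as a named character vector
col_types <- map_chr(eggs_demo, class)

# number of missing values per column, returned as a named integer vector
na_counts <- map_int(eggs_demo, ~ sum(is.na(.x)))

col_types
na_counts
```

A total of zero across na_counts on the full data set would support the claim that no cleaning for missing values is needed.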

Tidying the Data

The dataset needs a date variable and also needs to be pivoted to a long and narrow form for ease of analysis. The following pipeline does both and stores the result as eggs_tidy.

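eggs_tidy <- eggs %>%
  # stack the four value columns (columns 3 to 6) into name/price pairs
  pivot_longer(cols = 3:6,
               values_to = "price") %>%
  # swap the underscores inside "extra_large" and "half_dozen" for spaces,
  # leaving a single "_" between size and amount
  mutate(name = str_replace(name, "extra_large", "extra large"),
         name = str_replace(name, "half_dozen", "half dozen")) %>%
  separate(name, into = c("size", "amount"), sep = "_") %>%
  # build a Date from "Month Year"; my() sets the day to the first of the month
  mutate(date = str_c(month, year, sep = " "),
         date = my(date))
eggs_tidy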

# A tibble: 480 × 6
   month     year size        amount     price date      
   <chr>    <dbl> <chr>       <chr>      <dbl> <date>    
 1 January   2004 large       half dozen  126  2004-01-01
 2 January   2004 large       dozen       230  2004-01-01
 3 January   2004 extra large half dozen  132  2004-01-01
 4 January   2004 extra large dozen       230  2004-01-01
 5 February  2004 large       half dozen  128. 2004-02-01
 6 February  2004 large       dozen       226. 2004-02-01
 7 February  2004 extra large half dozen  134. 2004-02-01
 8 February  2004 extra large dozen       230  2004-02-01
 9 March     2004 large       half dozen  131  2004-03-01
10 March     2004 large       dozen       225  2004-03-01
# ℹ 470 more rows
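
The name-splitting step is easier to follow in isolation. The sketch below applies the same str_replace() and separate() calls to just the four original column names, then parses one month-year string with my(). The cols and split_names objects are introduced here purely for illustration.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)
library(lubridate)

# the four column names that pivot_longer() moves into `name`
cols <- tibble(name = c("large_half_dozen", "large_dozen",
                        "extra_large_half_dozen", "extra_large_dozen"))

split_names <- cols %>%
  # after these two replacements only one "_" is left in each name
  mutate(name = str_replace(name, "extra_large", "extra large"),
         name = str_replace(name, "half_dozen", "half dozen")) %>%
  # so splitting on "_" yields exactly two pieces: size and amount
  separate(name, into = c("size", "amount"), sep = "_")

split_names

# my() parses month-year text into a Date on the first day of the month
parsed <- my("January 2004")
parsed
```

Without the two replacements, splitting on every underscore would break a name like extra_large_half_dozen into four pieces instead of two.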

Creating a Function

The following function computes summary statistics (mean, median and standard deviation) of price. It takes egg_size and the dataframe as parameters, and returns a dataframe of statistics for that egg_size, with one row per amount.

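compute_stats <- function(egg_size, dataframe){
  stats <- dataframe %>%
    # keep only the rows for the requested egg size
    filter(size == egg_size) %>%
    group_by(size, amount) %>%
    # summarize() drops the last grouping level, so the result stays grouped by size
    summarize(mean_price = mean(price),
              median_price = median(price),
              sd_price = sd(price))
  return(stats)
}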
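
Before handing the function to purrr, it can help to call it once by hand. The sketch below repeats the function definition so that it runs on its own, and uses a hypothetical four-row tibble named toy, with made-up prices chosen so the statistics are easy to verify mentally.

```r
library(dplyr)
library(tibble)

# same function as above, repeated so the sketch is self-contained
compute_stats <- function(egg_size, dataframe){
  stats <- dataframe %>%
    filter(size == egg_size) %>%
    group_by(size, amount) %>%
    summarize(mean_price = mean(price),
              median_price = median(price),
              sd_price = sd(price))
  return(stats)
}

# hypothetical toy data: two prices per size, all for one carton amount
toy <- tibble(
  size = c("large", "large", "extra large", "extra large"),
  amount = "dozen",
  price = c(1, 3, 2, 6)
)

# a single call returns one row per amount for the requested size
large_stats <- compute_stats("large", toy)
large_stats
```

For the two "large" prices 1 and 3, both the mean and the median are 2, and the standard deviation is the square root of 2.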

Making use of purrr

The map_dfr function from the purrr package applies compute_stats to every element of egg_sizes in a single call, instead of calling the function once per size by hand. It then row-binds the dataframes returned by compute_stats into a single dataframe.

egg_sizes <- c("large", "extra large")
map_dfr(egg_sizes, compute_stats, dataframe=eggs_tidy)

# A tibble: 4 × 5
# Groups:   size [2]
  size        amount     mean_price median_price sd_price
  <chr>       <chr>           <dbl>        <dbl>    <dbl>
1 large       dozen            254.         268.     18.5
2 large       half dozen       155.         174.     22.6
3 extra large dozen            267.         286.     22.8
4 extra large half dozen       164.         186.     24.7
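
As of purrr 1.0.0, map_dfr() is superseded in favor of map() followed by list_rbind(). The sketch below shows that pattern, assuming purrr 1.0.0 or later is installed. The toy tibble and the mean_by_size helper are hypothetical, simplified stand-ins for eggs_tidy and compute_stats, and the helper drops its grouping with .groups = "drop", which differs from the function above.

```r
library(dplyr)
library(purrr)
library(tibble)

# hypothetical toy data standing in for eggs_tidy
toy <- tibble(
  size = rep(c("large", "extra large"), each = 2),
  amount = "dozen",
  price = c(1, 3, 2, 6)
)

# simplified stand-in for compute_stats(); returns an ungrouped tibble
mean_by_size <- function(egg_size, dataframe) {
  dataframe %>%
    filter(size == egg_size) %>%
    group_by(size, amount) %>%
    summarize(mean_price = mean(price), .groups = "drop")
}

# map() returns a list of tibbles; list_rbind() stacks them row-wise
combined <- map(c("large", "extra large"), mean_by_size, dataframe = toy) %>%
  list_rbind()
combined
```

Separating the mapping step from the binding step also makes it easy to inspect the intermediate list when one of the calls misbehaves.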