Challenge 10 Instructions
Challenge Overview
The purrr package is a powerful tool for functional programming. It lets the user apply a single function across multiple objects, and it can replace for loops with a single function call that is more compact and easier to read.
For example, we can draw a random sample of size n from each of 10 normal distributions using a vector of 10 means.
library(tidyverse) # loads purrr and the %>% pipe
n <- 100 # sample size
m <- seq(1,10) # means
samps <- map(m,rnorm,n=n) # each element of m is passed to rnorm() as the mean
We can then use map_dbl to verify that this worked correctly by computing the mean of each sample.
samps %>%
  map_dbl(mean)
[1] 1.068130 2.015383 2.992601 3.767358 4.921564 5.856775 7.055993 8.128033
[9] 9.044389 9.994806
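To make the "replace a for loop" claim concrete, here is a minimal sketch that builds the same list of samples twice, once with an explicit loop and once with map. The seed value and the names samps_loop and samps_map are assumptions added for illustration; they are not part of the original challenge.

```r
library(purrr)

n <- 100         # sample size
m <- seq(1, 10)  # means

# for-loop version: pre-allocate a list, then fill it one mean at a time
set.seed(1)  # arbitrary seed, used only to make the comparison reproducible
samps_loop <- vector("list", length(m))
for (i in seq_along(m)) {
  samps_loop[[i]] <- rnorm(n, mean = m[i])
}

# purrr version: a single call; each element of m is passed to rnorm() as `mean`
set.seed(1)
samps_map <- map(m, rnorm, n = n)

# both approaches draw the same values in the same order
identical(samps_loop, samps_map)
```

The loop needs a pre-allocated container and an index variable, while map handles both, which is where most of the readability gain comes from.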
purrr is tricky to learn (but beyond useful once you get a handle on it). Therefore, it’s imperative that you complete the purrr and map readings before attempting this challenge.
The challenge
Use purrr with a function to perform some data science task. The choice of task is up to you. It could involve computing summary statistics, reading in multiple datasets, running a random process multiple times, or anything else you might need to do in your work as a data analyst. You might consider using purrr with a function you wrote for challenge 9.
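As one possible reading of "reading in multiple datasets", the sketch below writes two tiny CSV files to a temporary directory and reads them back with map. The file names, column names and values are all made up for illustration, and list_rbind assumes purrr 1.0.0 or later.

```r
library(purrr)
library(readr)

# Write two tiny made-up CSV files to a temporary directory
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
write_csv(data.frame(id = 1:2, value = c(10, 20)), file.path(demo_dir, "a.csv"))
write_csv(data.frame(id = 3:4, value = c(30, 40)), file.path(demo_dir, "b.csv"))

# Collect the file paths (list.files returns them in alphabetical order)
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a tibble, then row-bind the results into one tibble
combined <- paths %>%
  map(read_csv, show_col_types = FALSE) %>%
  list_rbind()

combined
```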
Solutions
Reading the Data
The working directory for RStudio has been set with setwd() so that “eggs_tidy.csv” sits at its root. The here() call below then builds the path to the file, assuming the project root that here() detects is the same directory.
eggs <- read_csv(here("eggs_tidy.csv"))
eggs
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137
 4 April      2004             131         225                    137
 5 May        2004             131         225                    137
 6 June       2004             134.        231.                   137
 7 July       2004             134.        234.                   137
 8 August     2004             134.        234.                   137
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>
Data Description
High Level Description
The data set comprises 120 rows and 6 columns.
The data set has one <chr> column (month); the remaining five columns are of type <dbl>. The month and year variables give the month and year of observation. large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen each combine an egg size (large or extra large) with a quantity (half dozen or dozen). Each case holds one value per egg type for that month and year; the tidying step later in this document stores these values in a column named price.
How was the Data likely collected?
The dataset seems to provide one value for each of the 4 egg types for every month and year combination. These values could be read as counts of eggs collected, although the analysis that follows treats them as prices. The dataset appears to be pre-cleaned, since no NA values are visible in the rows shown. The data is likely to have been collected from official or unofficial sources that report egg figures, for example for a poultry facility.
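The "no NA values" observation rests on the ten printed rows only. A minimal sketch of how the check could be extended to every column is shown here; the tibble toy is a hypothetical stand-in for eggs, with made-up values and deliberately inserted NAs so that the counts are non-zero.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for `eggs`; values are invented for illustration
toy <- tibble(
  month = c("January", "February", "March"),
  year = c(2004, 2004, NA),
  large_dozen = c(230, NA, 225)
)

# A data frame is a list of columns, so map_int() visits each column in turn
# and returns one integer per column: the number of missing values
na_counts <- map_int(toy, ~ sum(is.na(.x)))
na_counts
```

Running the same map_int call on eggs would return all zeros if the data set really contains no missing values.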
Tidying the Data
The dataset needs a date variable and also needs to be pivoted to a long, narrow form for ease of analysis. The following pipeline does both and stores the result as eggs_tidy.
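eggs_tidy <- eggs %>%
  # stack the four egg-type columns; their names go to the default `name` column
  pivot_longer(cols=3:6,
               values_to = "price") %>%
  # replace the underscores inside "extra_large" and "half_dozen" with spaces,
  # leaving a single "_" between size and amount
  mutate(name=str_replace(name,"extra_large","extra large"),
         name=str_replace(name,"half_dozen","half dozen")) %>%
  separate(name,into=c("size","amount"),sep="_") %>%
  # build a date from month and year; my() parses "January 2004" as 2004-01-01
  mutate(date = str_c(month, year, sep=" "),
         date = my(date))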
eggs_tidy
# A tibble: 480 × 6
   month     year size        amount     price date
   <chr>    <dbl> <chr>       <chr>      <dbl> <date>
 1 January   2004 large       half dozen  126  2004-01-01
 2 January   2004 large       dozen       230  2004-01-01
 3 January   2004 extra large half dozen  132  2004-01-01
 4 January   2004 extra large dozen       230  2004-01-01
 5 February  2004 large       half dozen  128. 2004-02-01
 6 February  2004 large       dozen       226. 2004-02-01
 7 February  2004 extra large half dozen  134. 2004-02-01
 8 February  2004 extra large dozen       230  2004-02-01
 9 March     2004 large       half dozen  131  2004-03-01
10 March     2004 large       dozen       225  2004-03-01
# ℹ 470 more rows
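To see why the two str_replace calls are needed before separate, the sketch below runs the same steps on a single row. The row reuses the January 2004 values printed above; the names toy, toy_long and toy_tidy are assumptions introduced for this illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# One row with the same column layout as `eggs`
toy <- tibble::tibble(
  month = "January", year = 2004,
  large_half_dozen = 126, large_dozen = 230,
  extra_large_half_dozen = 132, extra_large_dozen = 230
)

# Step 1: pivot; the old column names land in the default `name` column
toy_long <- toy %>%
  pivot_longer(cols = 3:6, values_to = "price")
toy_long$name

# Step 2: names such as "extra_large_half_dozen" contain three underscores,
# so they are reduced to a single "_" before splitting into size and amount
toy_tidy <- toy_long %>%
  mutate(name = str_replace(name, "extra_large", "extra large"),
         name = str_replace(name, "half_dozen", "half dozen")) %>%
  separate(name, into = c("size", "amount"), sep = "_") %>%
  mutate(date = my(str_c(month, year, sep = " ")))

toy_tidy
```

Without the replacements, separate would find more than two pieces in most names and would not be able to assign them cleanly to size and amount.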
Creating a Function
The following function computes summary statistics (mean, median and standard deviation) of price, taking egg_size and the dataframe as parameters. It returns a dataframe of these statistics for the given egg_size, with one row per amount.
compute_stats <- function(egg_size, dataframe){
  stats <- dataframe %>%
    filter(size == egg_size) %>%   # keep rows for one egg size
    group_by(size, amount) %>%     # one group per amount within that size
    summarize(mean_price = mean(price),
              median_price = median(price),
              sd_price = sd(price))
  return(stats)
}
Making use of purrr
The map_dfr function from the purrr package applies compute_stats to every element of egg_sizes, so the function does not have to be called by hand once per size. It row-binds the dataframes returned by compute_stats and returns a single dataframe.
egg_sizes <- c("large", "extra large")
map_dfr(egg_sizes, compute_stats, dataframe=eggs_tidy)
# A tibble: 4 × 5
# Groups:   size [2]
  size        amount     mean_price median_price sd_price
  <chr>       <chr>           <dbl>        <dbl>    <dbl>
1 large       dozen            254.         268.     18.5
2 large       half dozen       155.         174.     22.6
3 extra large dozen            267.         286.     22.8
4 extra large half dozen       164.         186.     24.7
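Since purrr 1.0.0, map_dfr is marked as superseded in favour of map followed by list_rbind. The sketch below shows that variant as an alternative, not as the approach used above. It runs on a small made-up tibble (toy) in the same long format as eggs_tidy, and it adds .groups = "drop" to summarize, which is an assumption that removes the leftover grouping seen in the output above.

```r
library(dplyr)
library(purrr)

# Made-up prices in the same long format as `eggs_tidy`
toy <- tibble::tibble(
  size = rep(c("large", "extra large"), each = 4),
  amount = rep(c("dozen", "half dozen"), times = 4),
  price = c(230, 126, 226, 128, 230, 132, 232, 134)
)

# Same statistics as compute_stats above; .groups = "drop" is an added
# assumption so that the result is returned ungrouped
compute_stats <- function(egg_size, dataframe) {
  dataframe %>%
    filter(size == egg_size) %>%
    group_by(size, amount) %>%
    summarize(mean_price = mean(price),
              median_price = median(price),
              sd_price = sd(price),
              .groups = "drop")
}

egg_sizes <- c("large", "extra large")

# map() returns a list of tibbles; list_rbind() stacks them into one
stats <- egg_sizes %>%
  map(compute_stats, dataframe = toy) %>%
  list_rbind()

stats
```

The result has the same shape as the map_dfr output: one row per size and amount combination, in the order the sizes appear in egg_sizes.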