Challenge 9 Instructions

challenge_9
eggs
Creating a function
Author

Sean Conway

Published

January 10, 2024

library(tidyverse)
library(ggplot2)
library(dplyr)
library(here)
library(ggsci)
library(ggthemes)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is simple. Create a function, and use it to perform a data analysis / cleaning / visualization task.

Examples of such functions are:

1) A function that reads in and cleans a dataset.
2) A function that computes summary statistics (e.g., computes the z score for a variable).
3) A function that plots a histogram.
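
As a minimal sketch of the second example, a z score function could look like the following. The name z_score and the input values are illustrative assumptions, not part of the challenge materials.

```r
# Hypothetical helper: standardise a numeric vector to mean 0 and sd 1.
z_score <- function(x) {
  # na.rm = TRUE so missing values do not turn the whole result into NA
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Illustrative values only
prices <- c(126, 131, 134)
z <- z_score(prices)
z
```

The result has the same length as the input, so it can be added to a dataframe as a new column with mutate().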

That’s it!

This is an intentionally straightforward challenge. You can use any dataset in the challenge datasets folder.

Solutions

Reading the Data

The working directory for RStudio has been set with setwd() so that “eggs_tidy.csv” sits at its root. The file is then read through here(), which builds the path relative to the project root.

eggs <- read_csv(here("eggs_tidy.csv"))
eggs
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>

Data Description

High Level Description

The data set comprises 120 rows and 6 columns.

The data set has one <chr> column (month); the remaining five columns are of the <dbl> type. The month and year variables give the month and year of each observation. large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen each combine an egg size (large or extra large) with a carton quantity (half dozen or dozen). Each case holds one value per size and quantity combination for that month and year. The tidying step stores these values in a column named price, so they are best read as prices rather than counts.

How was the Data likely collected?

The dataset seems to provide one value for each of the 4 size and quantity combinations per month and year. No NA values appear in the rows shown, which suggests the data is pre-cleaned. The data was likely compiled from official/unofficial sources that report egg prices.
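
Only the first ten rows are printed, so the absence of NA values is easier to confirm with a small function than by eye. The sketch below is one way to do that; count_missing is a hypothetical helper, and toy_eggs is a made-up dataframe with one NA inserted on purpose so the check has something to find.

```r
# Hypothetical helper: number of NA values in each column of a dataframe.
count_missing <- function(dataframe) {
  # is.na() gives a logical matrix; colSums() counts the TRUEs per column
  colSums(is.na(dataframe))
}

# Made-up example data (not the real eggs file), with one deliberate NA
toy_eggs <- data.frame(
  month = c("January", "February"),
  year = c(2004, 2004),
  large_dozen = c(230, NA)
)

missing_counts <- count_missing(toy_eggs)
missing_counts
```

Calling count_missing(eggs) on the real data and seeing all zeros would support the claim that the file is pre-cleaned.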

Tidying the Data

The dataset needs a date variable and also needs to be pivoted to a long and narrow form for ease of analysis. The following pipeline achieves this and stores the resulting dataframe as eggs_tidy.

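eggs_tidy <- eggs %>%
  # Stack columns 3 to 6; their names go to the default "name" column
  pivot_longer(cols=3:6,
               values_to = "price") %>%
  # Replace the underscores inside "extra_large" and "half_dozen" with spaces so
  # that only the underscore between size and amount is left to split on
  mutate(name=str_replace(name,"extra_large","extra large"),
         name=str_replace(name,"half_dozen","half dozen")) %>%
  separate(name,into=c("size","amount"),sep="_") %>%
  # Build a Date from month and year; my() (lubridate) parses month-year strings
  mutate(date = str_c(month, year, sep=" "),
         date = my(date))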
eggs_tidy
# A tibble: 480 × 6
   month     year size        amount     price date      
   <chr>    <dbl> <chr>       <chr>      <dbl> <date>    
 1 January   2004 large       half dozen  126  2004-01-01
 2 January   2004 large       dozen       230  2004-01-01
 3 January   2004 extra large half dozen  132  2004-01-01
 4 January   2004 extra large dozen       230  2004-01-01
 5 February  2004 large       half dozen  128. 2004-02-01
 6 February  2004 large       dozen       226. 2004-02-01
 7 February  2004 extra large half dozen  134. 2004-02-01
 8 February  2004 extra large dozen       230  2004-02-01
 9 March     2004 large       half dozen  131  2004-03-01
10 March     2004 large       dozen       225  2004-03-01
# ℹ 470 more rows
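
In the spirit of the challenge, the same tidying steps can also be wrapped in a function. The sketch below is one possible version: tidy_eggs is a hypothetical name, the toy input uses whole-number values loosely based on the printed rows, and the price columns are selected by excluding month and year rather than by position.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)

# Hypothetical wrapper around the tidying pipeline shown above.
tidy_eggs <- function(eggs_wide) {
  eggs_wide %>%
    # Pivot every column except month and year into name/price pairs
    pivot_longer(cols = -c(month, year), names_to = "name", values_to = "price") %>%
    # Leave a single underscore between size and amount before splitting
    mutate(name = str_replace(name, "extra_large", "extra large"),
           name = str_replace(name, "half_dozen", "half dozen")) %>%
    separate(name, into = c("size", "amount"), sep = "_") %>%
    # Parse "January 2004" style strings into the first day of that month
    mutate(date = my(str_c(month, year, sep = " ")))
}

# Toy input with illustrative values (decimals dropped), not the real file
toy <- tibble(
  month = c("January", "February"),
  year = c(2004, 2004),
  large_half_dozen = c(126, 128),
  large_dozen = c(230, 226),
  extra_large_half_dozen = c(132, 134),
  extra_large_dozen = c(230, 230)
)

toy_tidy <- tidy_eggs(toy)
toy_tidy
```

Each input row produces four output rows, one per size and amount combination, which matches the growth from 120 rows to 480 rows seen above.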

Creating a Function

The following function plots a dataframe as a grouped bar chart. It accepts as parameters the dataframe to be used, the X and Y axes, the fill variable and the variable to create facets on, with each variable given as a column-name string.

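plot_grouped_barchart <- function(dataframe, x_axis, y_axis, fill_variable, facet_wrap_var){
  # Column names arrive as strings, so look them up with the .data pronoun
  # (aes_string() is deprecated in current ggplot2 releases)
  ggplot(dataframe,aes(x=.data[[x_axis]],y=.data[[y_axis]],fill=.data[[fill_variable]]))+
  geom_bar(position="dodge", stat="identity") +
  scale_x_date()+
  scale_y_continuous(labels=scales::label_dollar(),limits=c(0,300))+
  # Build the facet formula, e.g. ~ size, from the string argument
  facet_wrap(as.formula(paste("~", facet_wrap_var)))+
  # The bars are mapped to fill, so the palette has to be a fill scale
  ggsci::scale_fill_rickandmorty() +
  # Title, axis labels and legend title all follow the arguments passed in
  labs(title=paste("Grouped Bar Plot by ", facet_wrap_var, "\n", y_axis, " vs ", x_axis), x=x_axis,y=y_axis,fill=fill_variable)+
  ggthemes::theme_few()+
  theme(plot.title = element_text(hjust=0.5),
        axis.text.x=element_text(angle=90))
}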

The function can now be called to create a grouped bar plot using the eggs_tidy dataframe.

plot_grouped_barchart(dataframe=eggs_tidy, x_axis="date", y_axis="price", fill_variable="amount", facet_wrap_var="size")

Because the title and axis labels are built from the parameters passed to the function, they stay in sync with whichever columns are plotted.
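
To see how such a dynamic title comes together, the title string can be built on its own. The sketch below uses a hypothetical helper, make_plot_title, and paste0() so that no separator spaces are inserted between the pieces; the plotting function above uses paste(), which adds a space between each argument.

```r
# Hypothetical helper: build the two-line plot title from the column names.
make_plot_title <- function(facet_wrap_var, x_axis, y_axis) {
  # paste0() concatenates without adding separators between the pieces
  paste0("Grouped Bar Plot by ", facet_wrap_var, "\n", y_axis, " vs ", x_axis)
}

plot_title <- make_plot_title(facet_wrap_var = "size", x_axis = "date", y_axis = "price")
# cat() prints the string with the line break rendered
cat(plot_title, "\n")
```

Passing different column names, for example facet_wrap_var = "amount", changes the title without touching the function body.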