Challenge 4 Instructions

challenge_4

abc_poll

eggs

fed_rates

hotel_bookings

debt

More data wrangling: pivoting

Author

Sean Conway

Published

December 27, 2023

Code

library(tidyverse)
library(readxl)
library(here)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
tidy data (as needed, including sanity checks)
identify variables that need to be mutated
mutate variables and sanity check all mutations

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

abc_poll.csv ⭐
poultry_tidy.xlsx or organiceggpoultry.xls⭐⭐
FedFundsRate.csv⭐⭐⭐
hotel_bookings.csv⭐⭐⭐⭐
debt_in_trillions.xlsx ⭐⭐⭐⭐⭐

Briefly describe the data

Tidy Data (as needed)

Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.

Any additional comments?

Identify variables that need to be mutated

Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?

Document your work here.

Any additional comments?

Solutions

Reading the Data

The working directory for RStudio has been set such that “poultry_tidy.xlsx” can be found at the root of the working directory using the setwd() method.

Code

poultry <- read_excel(here("poultry_tidy.xlsx"))
poultry

# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# ℹ 590 more rows

Data Description

High Level Description

The data set comprises of 600 rows with 4 columns.

Code

poultry

# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# ℹ 590 more rows

The data set has a total of 2 <chr> type column and the remaining columns are of the <dbl> type. The Month and Year variables represent the month and year of observation respectively. Product and Price_Dollar are variables that represent the type and price in dollars of the product respectively. Each case represents the price for each type of product for that month and year.

How was the Data likely collected?

The dataset seems to provide the price of a certain unit of poultry product for a month and year combination. The data is likely to have been collected using official/unofficial sources providing product count for a poultry facility.

The following query gives the total distinct Product types:

Code

poultry %>% distinct(Product)

# A tibble: 5 × 1
  Product       
  <chr>         
1 Whole         
2 B/S Breast    
3 Bone-in Breast
4 Whole Legs    
5 Thighs

We see there are total 5 distinct product types - “Whole”, “B/S Breast”, “Bone-in Breast”, “Whole Legs” and “Thighs”.

Tidying the Data

In the current form, the data is long and narrow, however a shorter and wider form is much more readable. The Month and Price column values can be expanded into columns of their own using pivot_wider().

Anticipating the End Result

The current dimensions of the dataset can be obtained using the following query:

Code

dim(poultry)

[1] 600   4

We see that this is a 600 x 4 dataset. To pivot the data into a shorter and wider form, the Month column can be expanded. Consequently, the new dimensions would become 50 x 14. This form is much more readable and tidy.

Pivoting the Data

The pivot_wider() function is used to pivot the data into the shorter and wider form using the below query. Additionally, the below query arranges data in ascending order of Year.

Code

poultry_wide <- poultry %>% 
  pivot_wider(names_from = "Month", values_from = "Price_Dollar") %>%
  arrange(Year)
poultry_wide

# A tibble: 50 × 14
   Product  Year January February March April   May  June  July August September
   <chr>   <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
 1 Whole    2004    1.98     1.98  2.09  2.12  2.14  2.16  2.17   2.17      2.17
 2 B/S Br…  2004    6.46     6.42  6.42  6.42  6.42  6.41  6.42   6.42      6.42
 3 Bone-i…  2004   NA       NA    NA    NA    NA    NA     3.90   3.90      3.90
 4 Whole …  2004    1.94     1.94  1.94  1.94  1.94  2.02  2.04   2.04      2.04
 5 Thighs   2004   NA        2.03  2.03  2.03  2.03  2.00  2.00   2.00      2.00
 6 Whole    2005    2.17     2.17  2.17  2.17  2.17  2.17  2.17   2.17      2.17
 7 B/S Br…  2005    6.44     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 8 Bone-i…  2005    3.90     3.90  3.90  3.90  3.90  3.90  3.90   3.90      3.90
 9 Whole …  2005    2.04     2.04  2.04  2.04  2.04  2.04  2.04   2.04      2.04
10 Thighs   2005    2.13     2.22  2.22  2.22  2.22  2.22  2.22   2.22      2.22
# ℹ 40 more rows
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>

The dimensions of the pivoted dataset can be checked to match the anticipated end result as a sanity check.

Code

dim(poultry_wide)

[1] 50 14

Identifying Variables that Need to be Mutated

We can perform two mutations to make it easier to analyze and understand the dataset: - Reduce “B/S Breast” Product to “Boneless Breast”. - Generate Total_Annual_Price and Avg_Monthly_Price for each Product for the entire Year.

The Product variable can be mutated to reduce “B/S Breast” to “Boneless Breast”. The variables January:December can be mutated to generate the Total_Annual_Price and Avg_Monthly_Price variables.

Mutating the Variables

Reducing “B/S Breast” to “Boneless Breast”

The following query generates a mutation where “B/S Breast” is reduced to “Boneless Breast”.

Code

poultry_reduced <- poultry_wide %>% 
  mutate(Product = if_else(Product == "B/S Breast", "Boneless Breast", Product))
poultry_reduced

# A tibble: 50 × 14
   Product  Year January February March April   May  June  July August September
   <chr>   <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
 1 Whole    2004    1.98     1.98  2.09  2.12  2.14  2.16  2.17   2.17      2.17
 2 Bonele…  2004    6.46     6.42  6.42  6.42  6.42  6.41  6.42   6.42      6.42
 3 Bone-i…  2004   NA       NA    NA    NA    NA    NA     3.90   3.90      3.90
 4 Whole …  2004    1.94     1.94  1.94  1.94  1.94  2.02  2.04   2.04      2.04
 5 Thighs   2004   NA        2.03  2.03  2.03  2.03  2.00  2.00   2.00      2.00
 6 Whole    2005    2.17     2.17  2.17  2.17  2.17  2.17  2.17   2.17      2.17
 7 Bonele…  2005    6.44     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 8 Bone-i…  2005    3.90     3.90  3.90  3.90  3.90  3.90  3.90   3.90      3.90
 9 Whole …  2005    2.04     2.04  2.04  2.04  2.04  2.04  2.04   2.04      2.04
10 Thighs   2005    2.13     2.22  2.22  2.22  2.22  2.22  2.22   2.22      2.22
# ℹ 40 more rows
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>

Sanity Check

The following query filters poultry_reduced on “Boneless Breast” and poultry_wide on “B/S Breast” to give a side-by-side comparison for sanity check.

Code

poultry_reduced %>% filter(Product == "Boneless Breast")

# A tibble: 10 × 14
   Product  Year January February March April   May  June  July August September
   <chr>   <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
 1 Bonele…  2004    6.46     6.42  6.42  6.42  6.42  6.41  6.42   6.42      6.42
 2 Bonele…  2005    6.44     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 3 Bonele…  2006    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 4 Bonele…  2007    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 5 Bonele…  2008    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 6 Bonele…  2009    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 7 Bonele…  2010    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.42      6.38
 8 Bonele…  2011    6.38     6.38  6.38  6.38  6.38  6.38  6.38   6.38      6.38
 9 Bonele…  2012    6.38     7     7     7     7     7     7      7         7   
10 Bonele…  2013    7.04     7.04  7.04  7.04  7.04  7.04  7.04   7.04      7.04
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>

Code

poultry_wide %>% filter(Product == "B/S Breast")

# A tibble: 10 × 14
   Product  Year January February March April   May  June  July August September
   <chr>   <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
 1 B/S Br…  2004    6.46     6.42  6.42  6.42  6.42  6.41  6.42   6.42      6.42
 2 B/S Br…  2005    6.44     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 3 B/S Br…  2006    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 4 B/S Br…  2007    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 5 B/S Br…  2008    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 6 B/S Br…  2009    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 7 B/S Br…  2010    6.46     6.46  6.46  6.46  6.46  6.46  6.46   6.42      6.38
 8 B/S Br…  2011    6.38     6.38  6.38  6.38  6.38  6.38  6.38   6.38      6.38
 9 B/S Br…  2012    6.38     7     7     7     7     7     7      7         7   
10 B/S Br…  2013    7.04     7.04  7.04  7.04  7.04  7.04  7.04   7.04      7.04
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>

Generating Total_Annual_Price and Avg_Monthly_Price

The following query generates Total_Annual_Price and Avg_Monthly_Price variables by mutating the month variables - January:December.

Code

poultry_stat <- poultry_reduced %>% 
  mutate(
    Total_Annual_Price = rowSums(select(., January:December), na.rm = T),
    Avg_Monthly_Price = rowMeans(select(., January:December), na.rm = T))
poultry_stat

# A tibble: 50 × 16
   Product  Year January February March April   May  June  July August September
   <chr>   <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
 1 Whole    2004    1.98     1.98  2.09  2.12  2.14  2.16  2.17   2.17      2.17
 2 Bonele…  2004    6.46     6.42  6.42  6.42  6.42  6.41  6.42   6.42      6.42
 3 Bone-i…  2004   NA       NA    NA    NA    NA    NA     3.90   3.90      3.90
 4 Whole …  2004    1.94     1.94  1.94  1.94  1.94  2.02  2.04   2.04      2.04
 5 Thighs   2004   NA        2.03  2.03  2.03  2.03  2.00  2.00   2.00      2.00
 6 Whole    2005    2.17     2.17  2.17  2.17  2.17  2.17  2.17   2.17      2.17
 7 Bonele…  2005    6.44     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 8 Bone-i…  2005    3.90     3.90  3.90  3.90  3.90  3.90  3.90   3.90      3.90
 9 Whole …  2005    2.04     2.04  2.04  2.04  2.04  2.04  2.04   2.04      2.04
10 Thighs   2005    2.13     2.22  2.22  2.22  2.22  2.22  2.22   2.22      2.22
# ℹ 40 more rows
# ℹ 5 more variables: October <dbl>, November <dbl>, December <dbl>,
#   Total_Annual_Price <dbl>, Avg_Monthly_Price <dbl>

Sanity Check

We can perform a sanity check for the above mutation by picking a Product and Year combination and verifying if the Total_Annual_Price and the Avg_Monthly_Price variables match up.

Code

poultry_wide %>% 
  filter(Year == 2007 & Product == "Thighs") %>%
  summarize(total_annual_price = sum(c_across(January:December), na.rm = T),
            avg_monthly_price = mean(c_across(January:December), na.rm = T))

# A tibble: 1 × 2
  total_annual_price avg_monthly_price
               <dbl>             <dbl>
1               26.6              2.22

This matches with the mutations above and therefore checks in.