Code
library(tidyverse)
library(readxl)
library(here)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)library(tidyverse)
library(readxl)
library(here)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Today’s challenge is to:
Read in one (or more) of the following datasets, using the correct R package and command.
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Any additional comments?
Are there any variables that require mutation to be usable in your analysis stream? For example, are all time variables correctly coded as dates? Are all string variables reduced and cleaned to sensible categories? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
Any additional comments?
The working directory for RStudio has been set such that “poultry_tidy.xlsx” can be found at the root of the working directory using the setwd() method.
poultry <- read_excel(here("poultry_tidy.xlsx"))
poultry# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# ℹ 590 more rows
The data set comprises of 600 rows with 4 columns.
poultry# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# ℹ 590 more rows
The data set has a total of 2 <chr> type column and the remaining columns are of the <dbl> type. The Month and Year variables represent the month and year of observation respectively. Product and Price_Dollar are variables that represent the type and price in dollars of the product respectively. Each case represents the price for each type of product for that month and year.
The dataset seems to provide the price of a certain unit of poultry product for a month and year combination. The data is likely to have been collected using official/unofficial sources providing product count for a poultry facility.
The following query gives the total distinct Product types:
poultry %>% distinct(Product)# A tibble: 5 × 1
Product
<chr>
1 Whole
2 B/S Breast
3 Bone-in Breast
4 Whole Legs
5 Thighs
We see there are total 5 distinct product types - “Whole”, “B/S Breast”, “Bone-in Breast”, “Whole Legs” and “Thighs”.
In the current form, the data is long and narrow, however a shorter and wider form is much more readable. The Month and Price column values can be expanded into columns of their own using pivot_wider().
The current dimensions of the dataset can be obtained using the following query:
dim(poultry)[1] 600 4
We see that this is a 600 x 4 dataset. To pivot the data into a shorter and wider form, the Month column can be expanded. Consequently, the new dimensions would become 50 x 14. This form is much more readable and tidy.
The pivot_wider() function is used to pivot the data into the shorter and wider form using the below query. Additionally, the below query arranges data in ascending order of Year.
poultry_wide <- poultry %>%
pivot_wider(names_from = "Month", values_from = "Price_Dollar") %>%
arrange(Year)
poultry_wide# A tibble: 50 × 14
Product Year January February March April May June July August September
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Whole 2004 1.98 1.98 2.09 2.12 2.14 2.16 2.17 2.17 2.17
2 B/S Br… 2004 6.46 6.42 6.42 6.42 6.42 6.41 6.42 6.42 6.42
3 Bone-i… 2004 NA NA NA NA NA NA 3.90 3.90 3.90
4 Whole … 2004 1.94 1.94 1.94 1.94 1.94 2.02 2.04 2.04 2.04
5 Thighs 2004 NA 2.03 2.03 2.03 2.03 2.00 2.00 2.00 2.00
6 Whole 2005 2.17 2.17 2.17 2.17 2.17 2.17 2.17 2.17 2.17
7 B/S Br… 2005 6.44 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
8 Bone-i… 2005 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90
9 Whole … 2005 2.04 2.04 2.04 2.04 2.04 2.04 2.04 2.04 2.04
10 Thighs 2005 2.13 2.22 2.22 2.22 2.22 2.22 2.22 2.22 2.22
# ℹ 40 more rows
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>
The dimensions of the pivoted dataset can be checked to match the anticipated end result as a sanity check.
dim(poultry_wide)[1] 50 14
We can perform two mutations to make it easier to analyze and understand the dataset: - Reduce “B/S Breast” Product to “Boneless Breast”. - Generate Total_Annual_Price and Avg_Monthly_Price for each Product for the entire Year.
The Product variable can be mutated to reduce “B/S Breast” to “Boneless Breast”. The variables January:December can be mutated to generate the Total_Annual_Price and Avg_Monthly_Price variables.
The following query generates a mutation where “B/S Breast” is reduced to “Boneless Breast”.
poultry_reduced <- poultry_wide %>%
mutate(Product = if_else(Product == "B/S Breast", "Boneless Breast", Product))
poultry_reduced# A tibble: 50 × 14
Product Year January February March April May June July August September
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Whole 2004 1.98 1.98 2.09 2.12 2.14 2.16 2.17 2.17 2.17
2 Bonele… 2004 6.46 6.42 6.42 6.42 6.42 6.41 6.42 6.42 6.42
3 Bone-i… 2004 NA NA NA NA NA NA 3.90 3.90 3.90
4 Whole … 2004 1.94 1.94 1.94 1.94 1.94 2.02 2.04 2.04 2.04
5 Thighs 2004 NA 2.03 2.03 2.03 2.03 2.00 2.00 2.00 2.00
6 Whole 2005 2.17 2.17 2.17 2.17 2.17 2.17 2.17 2.17 2.17
7 Bonele… 2005 6.44 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
8 Bone-i… 2005 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90
9 Whole … 2005 2.04 2.04 2.04 2.04 2.04 2.04 2.04 2.04 2.04
10 Thighs 2005 2.13 2.22 2.22 2.22 2.22 2.22 2.22 2.22 2.22
# ℹ 40 more rows
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>
The following query filters poultry_reduced on “Boneless Breast” and poultry_wide on “B/S Breast” to give a side-by-side comparison for sanity check.
poultry_reduced %>% filter(Product == "Boneless Breast")# A tibble: 10 × 14
Product Year January February March April May June July August September
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Bonele… 2004 6.46 6.42 6.42 6.42 6.42 6.41 6.42 6.42 6.42
2 Bonele… 2005 6.44 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
3 Bonele… 2006 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
4 Bonele… 2007 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
5 Bonele… 2008 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
6 Bonele… 2009 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
7 Bonele… 2010 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.42 6.38
8 Bonele… 2011 6.38 6.38 6.38 6.38 6.38 6.38 6.38 6.38 6.38
9 Bonele… 2012 6.38 7 7 7 7 7 7 7 7
10 Bonele… 2013 7.04 7.04 7.04 7.04 7.04 7.04 7.04 7.04 7.04
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>
poultry_wide %>% filter(Product == "B/S Breast")# A tibble: 10 × 14
Product Year January February March April May June July August September
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 B/S Br… 2004 6.46 6.42 6.42 6.42 6.42 6.41 6.42 6.42 6.42
2 B/S Br… 2005 6.44 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
3 B/S Br… 2006 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
4 B/S Br… 2007 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
5 B/S Br… 2008 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
6 B/S Br… 2009 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
7 B/S Br… 2010 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.42 6.38
8 B/S Br… 2011 6.38 6.38 6.38 6.38 6.38 6.38 6.38 6.38 6.38
9 B/S Br… 2012 6.38 7 7 7 7 7 7 7 7
10 B/S Br… 2013 7.04 7.04 7.04 7.04 7.04 7.04 7.04 7.04 7.04
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>
The following query generates Total_Annual_Price and Avg_Monthly_Price variables by mutating the month variables - January:December.
poultry_stat <- poultry_reduced %>%
mutate(
Total_Annual_Price = rowSums(select(., January:December), na.rm = T),
Avg_Monthly_Price = rowMeans(select(., January:December), na.rm = T))
poultry_stat# A tibble: 50 × 16
Product Year January February March April May June July August September
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Whole 2004 1.98 1.98 2.09 2.12 2.14 2.16 2.17 2.17 2.17
2 Bonele… 2004 6.46 6.42 6.42 6.42 6.42 6.41 6.42 6.42 6.42
3 Bone-i… 2004 NA NA NA NA NA NA 3.90 3.90 3.90
4 Whole … 2004 1.94 1.94 1.94 1.94 1.94 2.02 2.04 2.04 2.04
5 Thighs 2004 NA 2.03 2.03 2.03 2.03 2.00 2.00 2.00 2.00
6 Whole 2005 2.17 2.17 2.17 2.17 2.17 2.17 2.17 2.17 2.17
7 Bonele… 2005 6.44 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
8 Bone-i… 2005 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90
9 Whole … 2005 2.04 2.04 2.04 2.04 2.04 2.04 2.04 2.04 2.04
10 Thighs 2005 2.13 2.22 2.22 2.22 2.22 2.22 2.22 2.22 2.22
# ℹ 40 more rows
# ℹ 5 more variables: October <dbl>, November <dbl>, December <dbl>,
# Total_Annual_Price <dbl>, Avg_Monthly_Price <dbl>
We can perform a sanity check for the above mutation by picking a Product and Year combination and verifying if the Total_Annual_Price and the Avg_Monthly_Price variables match up.
poultry_wide %>%
filter(Year == 2007 & Product == "Thighs") %>%
summarize(total_annual_price = sum(c_across(January:December), na.rm = T),
avg_monthly_price = mean(c_across(January:December), na.rm = T))# A tibble: 1 × 2
total_annual_price avg_monthly_price
<dbl> <dbl>
1 26.6 2.22
This matches with the mutations above and therefore checks in.