Challenge 4

Author

Jaden Busch

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

Read in data

I chose to use the poultry_tidy.xlsx dataset.

dataset <- read_excel("../challenge_datasets/poultry_tidy.xlsx")
dataset
# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# ℹ 590 more rows

Briefly describe the data

The dataset records monthly prices, in dollars, for 5 poultry products from 2004 to 2013. It has 600 rows and 4 columns: Product, Year, Month, and Price_Dollar.

Tidy Data (as needed)

While some variables do need to be mutated, the dataset does not need to be tidied. Each row represents a single observation of one poultry product, with its corresponding time and price, which is the layout that we want.
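
One way to check the "one row per observation" claim is to count rows for each Product, Year, and Month combination and look for any count above 1. The sketch below is a minimal illustration, assuming a hypothetical toy tibble built from the first three rows shown above rather than the full dataset.

```r
library(dplyr)

# Toy stand-in for the dataset: the first three rows printed above
toy <- tibble(
  Product = c("Whole", "Whole", "Whole"),
  Year = c(2013, 2013, 2013),
  Month = c("January", "February", "March"),
  Price_Dollar = c(2.38, 2.38, 2.38)
)

# Any Product/Year/Month combination appearing more than once
# would mean a row is not a single observation
duplicates <- toy %>%
  count(Product, Year, Month) %>%
  filter(n > 1)

nrow(duplicates)
```

Running the same count on the full dataset should give zero rows if the data is already tidy.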

Identify variables that need to be mutated

The first column, Product, is stored as a character string, but it is a categorical variable. There are exactly 5 products:

unique(dataset$Product)
[1] "Whole"          "B/S Breast"     "Bone-in Breast" "Whole Legs"    
[5] "Thighs"        

So, using mutate and factor, we can convert this column into a factor, listing the levels explicitly so that their order is fixed:

dataset <- dataset %>%
  mutate(Product = factor(Product, levels = c("Whole", "B/S Breast", "Bone-in Breast", "Whole Legs", "Thighs")))
dataset
# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <fct>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# ℹ 590 more rows
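
To see what the explicit levels do, here is a small base R sketch on a toy vector (the names product, product_levels, and typo_fct are illustrative and not part of the original analysis). It shows that the levels keep the order they were given in, and that any value not listed in levels is silently turned into NA, which is worth checking for after the conversion.

```r
# Toy vector standing in for the Product column
product <- c("Whole", "Thighs", "B/S Breast", "Whole")
product_levels <- c("Whole", "B/S Breast", "Bone-in Breast", "Whole Legs", "Thighs")

product_fct <- factor(product, levels = product_levels)

levels(product_fct)      # all five levels, in the order given above
as.integer(product_fct)  # underlying codes: 1 5 2 1

# A misspelled value ("Thigh") is not in the levels, so it becomes NA
typo_fct <- factor(c("Whole", "Thigh"), levels = product_levels)
sum(is.na(typo_fct))     # 1
```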

Next, the Year and Month columns together describe when each price was recorded. For analysis, it is more convenient to have a single Date column. To build it, we can use the make_date function from lubridate. Since make_date expects a month number, we first convert each month name to its number with match, which returns the position of the name in R's built-in month.name vector. Passing Year and that month number to make_date produces the corresponding date; because no day is supplied, it defaults to the 1st of the month. Lastly, we keep only Product, Date, and Price_Dollar, dropping the Year and Month columns as they are now redundant.

dataset <- dataset %>%
  mutate(Date = make_date(Year, month = match(Month, month.name))) %>%
  select(Product, Date, Price_Dollar)
dataset
# A tibble: 600 × 3
   Product Date       Price_Dollar
   <fct>   <date>            <dbl>
 1 Whole   2013-01-01         2.38
 2 Whole   2013-02-01         2.38
 3 Whole   2013-03-01         2.38
 4 Whole   2013-04-01         2.38
 5 Whole   2013-05-01         2.38
 6 Whole   2013-06-01         2.38
 7 Whole   2013-07-01         2.38
 8 Whole   2013-08-01         2.38
 9 Whole   2013-09-01         2.38
10 Whole   2013-10-01         2.38
# ℹ 590 more rows
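
The month conversion can be illustrated in isolation. The following is a minimal sketch on a toy vector (months_chr, month_num, and dates are illustrative names, and the year 2013 is taken from the rows shown above). It also shows that match returns NA for a name that is not in month.name, such as an abbreviation, so a quick anyNA check on the new Date column is a reasonable safeguard.

```r
library(lubridate)

# Toy month names standing in for the Month column
months_chr <- c("January", "October")

# Position of each name in the built-in month.name vector
month_num <- match(months_chr, month.name)  # 1 10

# Day is not supplied, so make_date uses its default of 1
dates <- make_date(year = 2013, month = month_num)
format(dates)  # "2013-01-01" "2013-10-01"

# A name that is not a full English month name does not match
is.na(match("Sept", month.name))  # TRUE
```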

Now the data is tidy and each variable is stored in a form suited to analysis. No change is needed for Price_Dollar: its values are continuous, so storing it as a double is appropriate.