Challenge 7 Instructions

challenge_7
eggs
Visualizing Multiple Dimensions
Author: Sean Conway

Published: January 10, 2024

library(tidyverse)  # attaches ggplot2, dplyr, and tidyr, among others
library(readxl)     # read_excel()
library(here)       # here() for building the file path
library(khroma)     # scale_color_okabeito()

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. Read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc.).
  2. Tidy data as needed.
  3. Mutate variables as needed.
  4. Make at least two graphs using ggplot functionality (color, shape, line, facet, etc.), in particular trying to use functionality that you haven’t used before.
     • Explain why you chose the specific graph type.
  5. If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features.

Solutions

Reading the Data

The RStudio working directory has been set with setwd() so that “poultry_tidy.xlsx” sits at its root. The file is then read with read_excel(), with here() building the path to it.

poultry <- read_excel(here("poultry_tidy.xlsx"))
poultry
# A tibble: 600 × 4
   Product  Year Month     Price_Dollar
   <chr>   <dbl> <chr>            <dbl>
 1 Whole    2013 January           2.38
 2 Whole    2013 February          2.38
 3 Whole    2013 March             2.38
 4 Whole    2013 April             2.38
 5 Whole    2013 May               2.38
 6 Whole    2013 June              2.38
 7 Whole    2013 July              2.38
 8 Whole    2013 August            2.38
 9 Whole    2013 September         2.38
10 Whole    2013 October           2.38
# ℹ 590 more rows

Data Description

High Level Description

The data set comprises 600 rows and 4 columns.

The data set has two <chr> columns (Product and Month) and two <dbl> columns (Year and Price_Dollar). Month and Year give the month and year of the observation. Product gives the type of product, and Price_Dollar gives its price in dollars. Each case is the price of one product type in a given month and year.

How was the Data likely collected?

The dataset seems to provide the price of a certain unit of each poultry product for every month and year combination. The data was likely collected from official or unofficial sources that report poultry product prices.

The following query lists the distinct Product types:

poultry %>% distinct(Product)
# A tibble: 5 × 1
  Product       
  <chr>         
1 Whole         
2 B/S Breast    
3 Bone-in Breast
4 Whole Legs    
5 Thighs        

There are 5 distinct product types: “Whole”, “B/S Breast”, “Bone-in Breast”, “Whole Legs” and “Thighs”.
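A natural follow-up is to check that every product has one row per month in every year. The sketch below shows one way to do that with count(). It runs on a small made-up tibble (toy_poultry, a hypothetical stand-in with invented prices) rather than on the real file.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the poultry data:
# 2 products x 2 years x 12 months, with a made-up constant price
toy_poultry <- expand_grid(
  Product = c("Whole", "Thighs"),
  Year = c(2012, 2013),
  Month = month.name
) %>%
  mutate(Price_Dollar = 2)

# One row per Product-Year pair with the number of monthly observations
obs_per_year <- toy_poultry %>%
  count(Product, Year, name = "n_months")

obs_per_year
```

Applied to poultry, the same count() call should report 12 months for each of the Product and Year pairs if the table is complete.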

Tidying the Data

Filling in Missing Values

The missing values can be filled upwards, since the gaps are observed in the initial months of a year for a product. Filling upwards copies the next available price below into each gap.

poultry_clean <- poultry %>%
  fill(Price_Dollar, .direction = "up")
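Called this way, fill() does not know where one product ends and the next begins, so a gap in the last rows of a product would borrow a price from the product below it. Whether that can happen in the real file depends on where its gaps sit, which is not checked here. The sketch below uses a made-up tibble (toy, hypothetical) to show the difference between an ungrouped and a grouped fill.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the last "A" price and the first "B" price are missing
toy <- tibble(
  Product = c("A", "A", "A", "B", "B", "B"),
  Price_Dollar = c(1.0, 1.1, NA, NA, 5.0, 5.1)
)

# Ungrouped: the trailing NA of product A is filled from product B
ungrouped <- toy %>%
  fill(Price_Dollar, .direction = "up")

# Grouped: values are only borrowed from rows of the same product,
# so the trailing NA of product A stays missing
grouped <- toy %>%
  group_by(Product) %>%
  fill(Price_Dollar, .direction = "up") %>%
  ungroup()
```

Grouping by Product is the more conservative choice: a gap that cannot be filled from the same product stays NA and remains visible.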

In its current form the data is long and narrow. A shorter, wider layout is easier to read as a table: with pivot_wider(), the Month values become column names and the Price_Dollar values fill those columns.

Anticipating the End Result

The current dimensions of the dataset can be obtained using the following query:

dim(poultry_clean)
[1] 600   4

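This is a 600 x 4 dataset. Pivoting spreads the 12 Month values into columns of their own, so each Product and Year pair collapses into a single row: 600 / 12 = 50 rows. The Month and Price_Dollar columns are replaced by 12 month columns, giving 4 - 2 + 12 = 14 columns. The new dimensions would therefore be 50 x 14. This form is easier to read as a table, although in the tidyverse sense the long form is the tidy one, and it is what the plots below use.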
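The arithmetic behind the expected 50 x 14 shape can be written out as a short sketch. It assumes that every Product and Year pair has all 12 months present; the variable names are illustrative.

```r
n_rows_long <- 600               # rows in poultry_clean
n_cols_long <- 4                 # Product, Year, Month, Price_Dollar
n_months <- length(month.name)   # 12 Month values become 12 columns

# Each Product-Year pair collapses from 12 rows into 1
expected_rows <- n_rows_long / n_months

# Month and Price_Dollar disappear, 12 month columns appear
expected_cols <- n_cols_long - 2 + n_months

c(expected_rows, expected_cols)
```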

Pivoting the Data

The query below uses pivot_wider() to pivot the data into the shorter, wider form and then arranges the rows in ascending order of Year.

poultry_wide <- poultry_clean %>% 
  pivot_wider(names_from = "Month", values_from = "Price_Dollar") %>%
  arrange(Year)
poultry_wide
# A tibble: 50 × 14
   Product  Year January February March April   May  June  July August September
   <chr>   <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
 1 Whole    2004    1.98     1.98  2.09  2.12  2.14  2.16  2.17   2.17      2.17
 2 B/S Br…  2004    6.46     6.42  6.42  6.42  6.42  6.41  6.42   6.42      6.42
 3 Bone-i…  2004    3.90     3.90  3.90  3.90  3.90  3.90  3.90   3.90      3.90
 4 Whole …  2004    1.94     1.94  1.94  1.94  1.94  2.02  2.04   2.04      2.04
 5 Thighs   2004    2.03     2.03  2.03  2.03  2.03  2.00  2.00   2.00      2.00
 6 Whole    2005    2.17     2.17  2.17  2.17  2.17  2.17  2.17   2.17      2.17
 7 B/S Br…  2005    6.44     6.46  6.46  6.46  6.46  6.46  6.46   6.46      6.46
 8 Bone-i…  2005    3.90     3.90  3.90  3.90  3.90  3.90  3.90   3.90      3.90
 9 Whole …  2005    2.04     2.04  2.04  2.04  2.04  2.04  2.04   2.04      2.04
10 Thighs   2005    2.13     2.22  2.22  2.22  2.22  2.22  2.22   2.22      2.22
# ℹ 40 more rows
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>

As a sanity check, the dimensions of the pivoted dataset can be compared with the anticipated end result.

dim(poultry_wide)
[1] 50 14
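The same check can be made to fail loudly instead of relying on reading the output. The sketch below does this with stopifnot() on a small made-up tibble (toy_long, a hypothetical stand-in with invented prices); the expected shape is derived from the long data rather than typed in by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in: 2 products x 2 years x 12 months
toy_long <- expand_grid(
  Product = c("Whole", "Thighs"),
  Year = c(2012, 2013),
  Month = month.name
) %>%
  mutate(Price_Dollar = 2)

toy_wide <- toy_long %>%
  pivot_wider(names_from = Month, values_from = Price_Dollar)

# Expected shape: one row per Product-Year pair,
# two id columns plus one column per distinct month
expected_dim <- c(
  nrow(distinct(toy_long, Product, Year)),
  2 + n_distinct(toy_long$Month)
)

# Stops with an error if the pivot did not produce the expected shape
stopifnot(all(dim(toy_wide) == expected_dim))
```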

Visualizing the Data

Neither of the following plots has been made for the “poultry_tidy” dataset in any of the previous challenges. facet_wrap() in particular is something I’ll be using for the first time.

# Order the months chronologically instead of alphabetically
poultry_clean$Month <- factor(poultry_clean$Month, levels = month.name)

# Mean price per product per year
mean_prices <- poultry_clean %>%
  group_by(Product, Year) %>%
  summarize(mean_price = mean(Price_Dollar, na.rm = TRUE), .groups = "drop")

# One panel per product, each with its own axis ranges
# (linewidth replaces size for lines in ggplot2 3.4.0 and later)
ggplot(mean_prices, aes(x = Year, y = mean_price, color = Product)) +
  geom_line(linewidth = 1, linetype = "dashed") +
  facet_wrap(~ Product, scales = "free") +
  scale_color_okabeito(name="Chicken Cuts") +
  labs(x = "Year", y = "Mean Price", title = "Trend of Mean Prices\n Across Years by Product") +
  ggthemes::theme_few() +
  theme(plot.title = element_text(hjust=0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

Why Choose a Line Plot?

A line plot shows the trend of the mean price of each product across multiple years. Using facet_wrap() splits the plot into one panel per product. Each panel draws a dashed line of width 1, and the plot uses theme_few() from the ggthemes package. The title has also been centered to make it easier to read.

# Every monthly price as a point; Year is treated as a discrete axis
ggplot(poultry_clean, aes(x = as.factor(Year), y = Price_Dollar, color = Product)) +
  # Jitter horizontally only, so the plotted prices stay exact
  geom_point(position = position_jitter(width = 0.3, height = 0), size = 2, alpha=.65) +
  labs(title = "Scatter Plot of Prices (2004-2013)\n for All Months by Product",
       x = "Year",
       y = "Price") +
  scale_color_okabeito(name="Chicken Cuts") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5))

Why Choose a Scatter Plot?

A scatter plot shows every row of the table in a single plot, instead of a consolidated value like a mean. The trend for each product across years is visible, and so is the spread of its monthly prices within a year. Setting height = 0 in position_jitter() keeps each point at its true price, while width = 0.3 spreads the points horizontally within a year so that they overlap less. The point size and transparency in geom_point() were chosen by experimentation to make the plot as clear and visually appealing as possible. Additionally, the plot title is centered.
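The effect of the jitter settings can be checked directly. The sketch below builds a small plot from made-up data (toy, hypothetical) and compares the plotted y values with the original prices using layer_data(). The seed argument is an addition for reproducibility and is not part of the plot above.

```r
library(ggplot2)

# Hypothetical toy data with repeated prices within a year
toy <- data.frame(
  Year = factor(rep(c(2012, 2013), each = 3)),
  Price_Dollar = c(2.0, 2.0, 2.1, 2.2, 2.2, 2.3)
)

# Horizontal jitter only; seed makes the jitter the same on every run
p <- ggplot(toy, aes(x = Year, y = Price_Dollar)) +
  geom_point(position = position_jitter(width = 0.3, height = 0, seed = 1))

# Data as it is actually drawn, after the position adjustment
plotted <- layer_data(p)

# With height = 0 the y values should match the original prices
y_unchanged <- isTRUE(all.equal(
  sort(as.numeric(plotted$y)),
  sort(toy$Price_Dollar)
))
y_unchanged
```

Without a seed, the points land in slightly different horizontal positions each time the plot is rendered, which is harmless but makes two renders of the same figure differ.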