library(tidyverse)
library(ggplot2)
library(readxl)
library(here)
library(dplyr)
library(khroma)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 7 Instructions
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data as needed.
- mutate variables as needed.
- Make at least two graphs using ggplot functionality (color, shape, line, facet, etc), in particular trying to use functionality that you haven’t used before.
- Explain why you chose the specific graph type.
- If you haven’t tried in previous weeks, work this week to make your graphs “publication” ready with titles, captions, and pretty axis labels and other viewer-friendly features.
Solutions
Reading the Data
“poultry_tidy.xlsx” sits at the root of the RStudio working directory, which was set with setwd(). The here() function then builds the file path relative to the project root.
poultry <- read_excel(here("poultry_tidy.xlsx"))
poultry
# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# ℹ 590 more rows
Data Description
High Level Description
The data set comprises 600 rows and 4 columns.
poultry
# A tibble: 600 × 4
Product Year Month Price_Dollar
<chr> <dbl> <chr> <dbl>
1 Whole 2013 January 2.38
2 Whole 2013 February 2.38
3 Whole 2013 March 2.38
4 Whole 2013 April 2.38
5 Whole 2013 May 2.38
6 Whole 2013 June 2.38
7 Whole 2013 July 2.38
8 Whole 2013 August 2.38
9 Whole 2013 September 2.38
10 Whole 2013 October 2.38
# ℹ 590 more rows
The data set has two <chr> columns (Product and Month) and two <dbl> columns (Year and Price_Dollar). Month and Year give the month and year of the observation. Product gives the type of poultry product, and Price_Dollar gives its price in dollars. Each case is the price of one product type in a given month and year.
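The case definition above can be checked directly: if each case is one product in one month of one year, no Product/Year/Month combination should appear more than once. The sketch below uses a small stand-in tibble named toy with made-up values, not the real data; the same pipeline could be pointed at poultry.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `poultry` (hypothetical values, for illustration only)
toy <- tibble(
  Product = c("Whole", "Whole", "Thighs", "Thighs"),
  Year = c(2013, 2013, 2013, 2013),
  Month = c("January", "February", "January", "February"),
  Price_Dollar = c(2.38, 2.38, 2.00, 2.00)
)

# Any Product/Year/Month combination that appears more than once
duplicates <- toy %>%
  count(Product, Year, Month) %>%
  filter(n > 1)

# Zero rows here means each combination identifies exactly one case
nrow(duplicates)
```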
How was the Data likely collected?
The dataset appears to give the price of a unit of each poultry product for every month and year combination. The data was likely compiled from official or unofficial sources that report prices for poultry products.
The following query lists the distinct Product types:
poultry %>% distinct(Product)
# A tibble: 5 × 1
Product
<chr>
1 Whole
2 B/S Breast
3 Bone-in Breast
4 Whole Legs
5 Thighs
There are 5 distinct product types: “Whole”, “B/S Breast”, “Bone-in Breast”, “Whole Legs” and “Thighs”.
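Beyond listing the product types, a per-product summary of row counts and missing prices is a natural follow-up before tidying. The sketch below is one way to do that, shown on a toy tibble with made-up values; the column names follow the real data, but toy and product_summary are names introduced here for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `poultry` (hypothetical values, for illustration only)
toy <- tibble(
  Product = c("Whole", "Whole", "Thighs", "Thighs", "Thighs"),
  Price_Dollar = c(2.38, 2.38, 2.00, NA, 2.03)
)

# Rows per product, plus how many prices are missing for each product
product_summary <- toy %>%
  group_by(Product) %>%
  summarize(
    n_rows = n(),
    n_missing = sum(is.na(Price_Dollar)),
    .groups = "drop"
  )

product_summary
```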
Tidying the Data
Filling in Missing Values
Missing prices can be filled upward: the missing values occur in the initial months of a year for a given product, so the next observed price for that product lies below them in the table.
poultry_clean <- poultry %>%
  fill(Price_Dollar, .direction = "up")
In its current form the data is long and narrow; a shorter, wider form is easier to read. The values of the Month column can become columns of their own, holding the Price_Dollar values, using pivot_wider().
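One caveat with an upward fill is that, without grouping, a missing value at the end of one product's rows is filled from the next product's rows. The sketch below contrasts the ungrouped fill with a grouped one on toy data with made-up values. The group_by(Product) step is a possible variation, not part of the original code, and is only needed if the missing values can sit at a product boundary.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data (hypothetical values): the last Thighs price is missing
toy <- tibble(
  Product = c("Thighs", "Thighs", "Thighs", "Whole", "Whole", "Whole"),
  Price_Dollar = c(NA, 2.00, NA, NA, 2.38, 2.40)
)

# Ungrouped: the trailing Thighs NA is filled from the Whole rows
ungrouped <- toy %>%
  fill(Price_Dollar, .direction = "up")

# Grouped: each product is filled only from its own rows,
# so the trailing Thighs NA stays missing
grouped <- toy %>%
  group_by(Product) %>%
  fill(Price_Dollar, .direction = "up") %>%
  ungroup()

ungrouped$Price_Dollar
grouped$Price_Dollar
```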
Anticipating the End Result
The current dimensions of the dataset can be obtained using the following query:
dim(poultry_clean)
[1] 600 4
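This is a 600 x 4 dataset. To pivot the data into a shorter and wider form, the Month column is expanded into one column per month. The 5 products and 10 years (2004-2013) give 5 x 10 = 50 rows, and the 2 identifier columns (Product and Year) plus 12 month columns give 14 columns, so the new dimensions should be 50 x 14. This form is easier to read as a table.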
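The anticipated dimensions can also be computed from the long data rather than worked out by hand. The sketch below does this on a toy long-format tibble with made-up prices; toy, expected_rows and expected_cols are names introduced here, and the same two expressions could be applied to poultry_clean.

```r
library(dplyr)
library(tibble)

# Toy long-format data: 2 products x 2 years x 3 months = 12 rows
# (hypothetical prices, for illustration only)
toy <- expand.grid(
  Month = c("January", "February", "March"),
  Year = c(2012, 2013),
  Product = c("Whole", "Thighs"),
  stringsAsFactors = FALSE
) %>%
  as_tibble() %>%
  mutate(Price_Dollar = 2)

# After pivoting there is one row per Product/Year pair
expected_rows <- nrow(distinct(toy, Product, Year))

# Two identifier columns (Product, Year) plus one column per distinct month
expected_cols <- 2 + n_distinct(toy$Month)

c(expected_rows, expected_cols)
```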
Pivoting the Data
The pivot_wider() function pivots the data into the shorter and wider form. The query below also arranges the rows in ascending order of Year.
poultry_wide <- poultry_clean %>%
  pivot_wider(names_from = "Month", values_from = "Price_Dollar") %>%
  arrange(Year)
poultry_wide
# A tibble: 50 × 14
Product Year January February March April May June July August September
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Whole 2004 1.98 1.98 2.09 2.12 2.14 2.16 2.17 2.17 2.17
2 B/S Br… 2004 6.46 6.42 6.42 6.42 6.42 6.41 6.42 6.42 6.42
3 Bone-i… 2004 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90
4 Whole … 2004 1.94 1.94 1.94 1.94 1.94 2.02 2.04 2.04 2.04
5 Thighs 2004 2.03 2.03 2.03 2.03 2.03 2.00 2.00 2.00 2.00
6 Whole 2005 2.17 2.17 2.17 2.17 2.17 2.17 2.17 2.17 2.17
7 B/S Br… 2005 6.44 6.46 6.46 6.46 6.46 6.46 6.46 6.46 6.46
8 Bone-i… 2005 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90 3.90
9 Whole … 2005 2.04 2.04 2.04 2.04 2.04 2.04 2.04 2.04 2.04
10 Thighs 2005 2.13 2.22 2.22 2.22 2.22 2.22 2.22 2.22 2.22
# ℹ 40 more rows
# ℹ 3 more variables: October <dbl>, November <dbl>, December <dbl>
As a sanity check, the dimensions of the pivoted dataset can be compared with the anticipated result.
dim(poultry_wide)
[1] 50 14
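A stricter check than comparing dimensions is a round trip: pivoting the wide table back to long form with pivot_longer() should recover the original rows. The sketch below shows the idea on toy data with made-up prices; toy_long, toy_wide, toy_back and round_trip_ok are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long-format data (hypothetical prices, for illustration only)
toy_long <- tibble(
  Product = rep(c("Whole", "Thighs"), each = 3),
  Year = 2013,
  Month = rep(c("January", "February", "March"), times = 2),
  Price_Dollar = c(2.38, 2.38, 2.40, 2.00, 2.00, 2.03)
)

# Long to wide: one column per month
toy_wide <- toy_long %>%
  pivot_wider(names_from = "Month", values_from = "Price_Dollar")

# Wide back to long: month columns return to Month/Price_Dollar pairs
toy_back <- toy_wide %>%
  pivot_longer(
    cols = -c(Product, Year),
    names_to = "Month",
    values_to = "Price_Dollar"
  )

# TRUE when the round trip reproduces the original table
round_trip_ok <- isTRUE(all.equal(
  as.data.frame(toy_long),
  as.data.frame(toy_back)
))

dim(toy_wide)
round_trip_ok
```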
Visualizing the Data
I have not made the following plots for the “poultry_tidy” dataset in any previous challenge. In particular, this is the first time I am using facet_wrap().
# Order months chronologically rather than alphabetically
poultry_clean$Month <- factor(poultry_clean$Month, levels = month.name)
# Mean price per product and year
mean_prices <- poultry_clean %>%
  group_by(Product, Year) %>%
  summarize(mean_price = mean(Price_Dollar, na.rm = TRUE), .groups = "drop")
# Create the plot
ggplot(mean_prices, aes(x = Year, y = mean_price, color = Product)) +
  geom_line(linewidth = 1, linetype = "dashed") +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  facet_wrap(~ Product, scales = "free") +
  scale_color_okabeito(name = "Chicken Cuts") +
  labs(x = "Year", y = "Mean Price", title = "Trend of Mean Prices\n Across Years by Product") +
  ggthemes::theme_few() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))
Why Choose a Line Plot?
A line plot shows how the mean price of each product changes across the years. The facet_wrap() function gives each product its own panel, and scales = "free" lets each panel choose its own axis ranges. This makes the trend within each product easier to see, but it also means that price levels should not be compared across panels by eye. The line is dashed with a width of 1, the plot uses theme_few() from the ggthemes package, and the title is centered to make it easier to read.
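One variation worth considering for the faceted plot is scales = "free_y", which frees only the price axis and keeps the Year axis shared across panels. The sketch below illustrates this on toy data with made-up prices; it is an alternative under that assumption, not the plot used above, and toy, toy_means and p are names introduced here.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy data (hypothetical prices): two observations per product and year
toy <- tibble(
  Product = rep(c("Whole", "Thighs"), each = 4),
  Year = rep(c(2012, 2012, 2013, 2013), times = 2),
  Price_Dollar = c(2.30, 2.40, 2.38, 2.38, 2.00, 2.10, 2.03, 2.03)
)

# Mean price per product and year
toy_means <- toy %>%
  group_by(Product, Year) %>%
  summarize(mean_price = mean(Price_Dollar, na.rm = TRUE), .groups = "drop")

# scales = "free_y": each panel gets its own price range,
# while all panels share the same Year axis
p <- ggplot(toy_means, aes(x = Year, y = mean_price)) +
  geom_line() +
  facet_wrap(~ Product, scales = "free_y")

p
```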
ggplot(poultry_clean, aes(x = as.factor(Year), y = Price_Dollar, color = Product)) +
  geom_point(position = position_jitter(width = 0.3, height = 0), size = 2, alpha = 0.65) +
  labs(title = "Scatter Plot of Prices (2004-2013)\n for All Months by Product",
       x = "Year",
       y = "Price") +
  scale_color_okabeito(name = "Chicken Cuts") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
Why Choose a Scatter Plot?
A scatter plot shows every row of the table, rather than a consolidated value such as a mean, in a single plot. Both the trend for a product across years and the variation in its prices within a year are visible. In position_jitter(), height = 0 keeps every point at its true price, while width = 0.3 spreads the points horizontally within each year so that they overlap less. The point size and transparency in geom_point() were chosen by experimentation to keep the plot clear, and the title is centered to improve the visual.
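Because jitter is random, the horizontal positions change on every render. A minimal sketch of a reproducible version is given below, assuming the seed argument of position_jitter() is acceptable for the plot. It also checks that height = 0 leaves the plotted prices unchanged. The data are toy values, and toy, jitter, p and built are names introduced here.

```r
library(ggplot2)
library(tibble)

# Toy data (hypothetical prices, for illustration only)
toy <- tibble(
  Year = factor(rep(c(2012, 2013), each = 3)),
  Price_Dollar = c(2.38, 2.38, 2.38, 2.40, 2.40, 2.41)
)

# height = 0 leaves prices untouched; seed fixes the horizontal offsets
jitter <- position_jitter(width = 0.3, height = 0, seed = 1)

p <- ggplot(toy, aes(x = Year, y = Price_Dollar)) +
  geom_point(position = jitter, size = 2, alpha = 0.65)

# Compare the y values ggplot2 will draw with the original prices
built <- layer_data(p)
isTRUE(all.equal(sort(built$y), sort(toy$Price_Dollar)))
```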