Challenge 3 Instructions

challenge_3
animal_weights
eggs
australian_marriage
usa_households
sce_labor
Tidy Data: Pivoting
Author

Sean Conway

Published

December 24, 2023

Code
library(tidyverse)
library(here)
library(readr)
library(knitr)
library(dplyr)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Challenge Overview

Today’s challenge is to:

  1. read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
  2. identify what needs to be done to tidy the current data
  3. anticipate the shape of pivoted data
  4. pivot the data into tidy format using pivot_longer() or pivot_wider().

Read in data

Read in one (or more) of the following datasets, using the correct R package and command.

  • animal_weights.csv ⭐
  • eggs_tidy.csv ⭐⭐ or organiceggpoultry.xls ⭐⭐⭐
  • australian_marriage*.xls ⭐⭐⭐
  • USA Households*.xlsx ⭐⭐⭐⭐
  • sce_labor_chart_data_public.xlsx 🌟🌟🌟🌟🌟

Briefly describe the data

Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”.

Anticipate the End Result

The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.

One easy way to do this is to think about the dimensions of your current dataset and identify how you want a new one to look.

Example: pivoting a dataset

Lets see if this works with a simple example.

Code
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
           year = rep(c(1980,1990), 3), 
           trade = rep(c("NAFTA", "NAFTA", "EU"),2),
           outgoing = rnorm(6, mean=1000, sd=500),
           incoming = rlogis(6, location=1000, 
                             scale = 400))
df
# A tibble: 6 × 5
  country  year trade outgoing incoming
  <chr>   <dbl> <chr>    <dbl>    <dbl>
1 Mexico   1980 NAFTA    -171.    1845.
2 USA      1990 NAFTA     829.     926.
3 France   1980 EU       1148.    1410.
4 Mexico   1990 NAFTA    1239.    1489.
5 USA      1980 NAFTA     312.     702.
6 France   1990 EU       1721.     813.

In this example, we want the names outgoing and ingoing to be in a single column indicating the trade direction, with the values being stored in a column indicating the value of the trade.

Pivot the Data

Now we will pivot the data!

Example

Code
df1<-pivot_longer(df, col = c(outgoing, incoming),
                 names_to="trade_direction",
                 values_to = "trade_value")
df1
# A tibble: 12 × 5
   country  year trade trade_direction trade_value
   <chr>   <dbl> <chr> <chr>                 <dbl>
 1 Mexico   1980 NAFTA outgoing              -171.
 2 Mexico   1980 NAFTA incoming              1845.
 3 USA      1990 NAFTA outgoing               829.
 4 USA      1990 NAFTA incoming               926.
 5 France   1980 EU    outgoing              1148.
 6 France   1980 EU    incoming              1410.
 7 Mexico   1990 NAFTA outgoing              1239.
 8 Mexico   1990 NAFTA incoming              1489.
 9 USA      1980 NAFTA outgoing               312.
10 USA      1980 NAFTA incoming               702.
11 France   1990 EU    outgoing              1721.
12 France   1990 EU    incoming               813.

Yes, once it is pivoted long, it will be much easier to work with.

Solutions

Reading the Data

The working directory for RStudio has been set such that “eggs_tidy.csv” can be found at the root of the working directory using the setwd() method.

Code
eggs <- read_csv(here("eggs_tidy.csv"))
eggs
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>

Data Description

High Level Description

The data set comprises of 120 rows with 6 columns.

Code
eggs
# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>

The data set has a total of 1 <chr> type column and the remaining columns are of the <dbl> type. The month and year variables represent the month and year of observation respectively. large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen are variables that represent the type of eggs. Each case represents the count for each type of egg collected for that month and year.

How was the Data likely collected?

The dataset seems to provide a count of the total number of eggs for each of the 4 types collected for a month and year combination. The dataset is pre-cleaned since no NA values are seen. The data is likely to have been collected using official/unofficial sources providing egg count for a poultry facility.

Why Pivot the Data?

In its current form, while the data is easy to read owing to its wide nature (6 variables), a longer and narrower form is much more suitable for data analysis. Additionally, a longer and narrower form is much more scalable in that the addition of a new type of egg would not require the addition of an entire new column. All egg types can be grouped into a single egg_type column with the corresponding values in egg_quantity column.

The pivot_longer operation can be applied to the data to result in a longer and narrower form.

Anticipating the End Result

The current dimensions of the dataset can be obtained using the following query:

Code
dim(eggs)
[1] 120   6

We see that this is a 120 x 6 dataset. To pivot the data into a longer and narrower form, each variable representing the type of egg should be compacted into a single egg_type column with quantities represented in a egg_quantity column. Consequently, the new dimensions would become 480 x 4. This form is much more suited for data analysis.

Pivoting the Data

The pivot_longer function transforms the data into a longer and narrower form. The names of the columns to be changed to observation values are specified using the cols parameter. The names_to and values_to parameters are used for the name of the superseding columns and represent the previous column names and their values respectively.

Code
eggs_long_narrow <- pivot_longer(eggs, 
                                 cols = c(large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen),
                                 names_to = "egg_type",
                                 values_to = "egg_quantity")
eggs_long_narrow
# A tibble: 480 × 4
   month     year egg_type               egg_quantity
   <chr>    <dbl> <chr>                         <dbl>
 1 January   2004 large_half_dozen               126 
 2 January   2004 large_dozen                    230 
 3 January   2004 extra_large_half_dozen         132 
 4 January   2004 extra_large_dozen              230 
 5 February  2004 large_half_dozen               128.
 6 February  2004 large_dozen                    226.
 7 February  2004 extra_large_half_dozen         134.
 8 February  2004 extra_large_dozen              230 
 9 March     2004 large_half_dozen               131 
10 March     2004 large_dozen                    225 
# ℹ 470 more rows

The data is now much easier to work with. The dimensions of 480 x 4 match our pivot estimates from the previous section. The egg_type and egg_quantity variables are of the <chr> and <dbl> types respectively.