Code
library(tidyverse)
library(here)
library(readr)
library(knitr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)library(tidyverse)
library(here)
library(readr)
library(knitr)
library(dplyr)
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)Today’s challenge is to:
pivot_longer() or pivot_wider().Read in one (or more) of the following datasets, using the correct R package and command.
Describe the data, and be sure to comment on why you are planning to pivot it to make it “tidy”.
The first step in pivoting the data is to try to come up with a concrete vision of what the end product should look like - that way you will know whether or not your pivoting was successful.
One easy way to do this is to think about the dimensions of your current dataset and identify how you want a new one to look.
Lets see if this works with a simple example.
df<-tibble(country = rep(c("Mexico", "USA", "France"),2),
year = rep(c(1980,1990), 3),
trade = rep(c("NAFTA", "NAFTA", "EU"),2),
outgoing = rnorm(6, mean=1000, sd=500),
incoming = rlogis(6, location=1000,
scale = 400))
df# A tibble: 6 × 5
country year trade outgoing incoming
<chr> <dbl> <chr> <dbl> <dbl>
1 Mexico 1980 NAFTA -171. 1845.
2 USA 1990 NAFTA 829. 926.
3 France 1980 EU 1148. 1410.
4 Mexico 1990 NAFTA 1239. 1489.
5 USA 1980 NAFTA 312. 702.
6 France 1990 EU 1721. 813.
In this example, we want the names outgoing and ingoing to be in a single column indicating the trade direction, with the values being stored in a column indicating the value of the trade.
Now we will pivot the data!
df1<-pivot_longer(df, col = c(outgoing, incoming),
names_to="trade_direction",
values_to = "trade_value")
df1# A tibble: 12 × 5
country year trade trade_direction trade_value
<chr> <dbl> <chr> <chr> <dbl>
1 Mexico 1980 NAFTA outgoing -171.
2 Mexico 1980 NAFTA incoming 1845.
3 USA 1990 NAFTA outgoing 829.
4 USA 1990 NAFTA incoming 926.
5 France 1980 EU outgoing 1148.
6 France 1980 EU incoming 1410.
7 Mexico 1990 NAFTA outgoing 1239.
8 Mexico 1990 NAFTA incoming 1489.
9 USA 1980 NAFTA outgoing 312.
10 USA 1980 NAFTA incoming 702.
11 France 1990 EU outgoing 1721.
12 France 1990 EU incoming 813.
Yes, once it is pivoted long, it will be much easier to work with.
The working directory for RStudio has been set such that “eggs_tidy.csv” can be found at the root of the working directory using the setwd() method.
eggs <- read_csv(here("eggs_tidy.csv"))
eggs# A tibble: 120 × 6
month year large_half_dozen large_dozen extra_large_half_dozen
<chr> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132
2 February 2004 128. 226. 134.
3 March 2004 131 225 137
4 April 2004 131 225 137
5 May 2004 131 225 137
6 June 2004 134. 231. 137
7 July 2004 134. 234. 137
8 August 2004 134. 234. 137
9 September 2004 130. 234. 136.
10 October 2004 128. 234. 136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>
The data set comprises of 120 rows with 6 columns.
eggs# A tibble: 120 × 6
month year large_half_dozen large_dozen extra_large_half_dozen
<chr> <dbl> <dbl> <dbl> <dbl>
1 January 2004 126 230 132
2 February 2004 128. 226. 134.
3 March 2004 131 225 137
4 April 2004 131 225 137
5 May 2004 131 225 137
6 June 2004 134. 231. 137
7 July 2004 134. 234. 137
8 August 2004 134. 234. 137
9 September 2004 130. 234. 136.
10 October 2004 128. 234. 136.
# ℹ 110 more rows
# ℹ 1 more variable: extra_large_dozen <dbl>
The data set has a total of 1 <chr> type column and the remaining columns are of the <dbl> type. The month and year variables represent the month and year of observation respectively. large_half_dozen, large_dozen, extra_large_half_dozen and extra_large_dozen are variables that represent the type of eggs. Each case represents the count for each type of egg collected for that month and year.
The dataset seems to provide a count of the total number of eggs for each of the 4 types collected for a month and year combination. The dataset is pre-cleaned since no NA values are seen. The data is likely to have been collected using official/unofficial sources providing egg count for a poultry facility.
In its current form, while the data is easy to read owing to its wide nature (6 variables), a longer and narrower form is much more suitable for data analysis. Additionally, a longer and narrower form is much more scalable in that the addition of a new type of egg would not require the addition of an entire new column. All egg types can be grouped into a single egg_type column with the corresponding values in egg_quantity column.
The pivot_longer operation can be applied to the data to result in a longer and narrower form.
The current dimensions of the dataset can be obtained using the following query:
dim(eggs)[1] 120 6
We see that this is a 120 x 6 dataset. To pivot the data into a longer and narrower form, each variable representing the type of egg should be compacted into a single egg_type column with quantities represented in a egg_quantity column. Consequently, the new dimensions would become 480 x 4. This form is much more suited for data analysis.
The pivot_longer function transforms the data into a longer and narrower form. The names of the columns to be changed to observation values are specified using the cols parameter. The names_to and values_to parameters are used for the name of the superseding columns and represent the previous column names and their values respectively.
eggs_long_narrow <- pivot_longer(eggs,
cols = c(large_half_dozen, large_dozen, extra_large_half_dozen, extra_large_dozen),
names_to = "egg_type",
values_to = "egg_quantity")
eggs_long_narrow# A tibble: 480 × 4
month year egg_type egg_quantity
<chr> <dbl> <chr> <dbl>
1 January 2004 large_half_dozen 126
2 January 2004 large_dozen 230
3 January 2004 extra_large_half_dozen 132
4 January 2004 extra_large_dozen 230
5 February 2004 large_half_dozen 128.
6 February 2004 large_dozen 226.
7 February 2004 extra_large_half_dozen 134.
8 February 2004 extra_large_dozen 230
9 March 2004 large_half_dozen 131
10 March 2004 large_dozen 225
# ℹ 470 more rows
The data is now much easier to work with. The dimensions of 480 x 4 match our pivot estimates from the previous section. The egg_type and egg_quantity variables are of the <chr> and <dbl> types respectively.