Tidyverse is a collection of R packages that provide useful functions for common tasks in data science, including data import, tidying, manipulation, visualization, and programming.
Packages for each of these tasks include:
Import: readr for reading files in various formats
Tidy: tidyr for tidying data
Transformation: dplyr, along with companion packages for specific data types (e.g., stringr for strings, forcats for factors, lubridate for dates and times)
Visualization: ggplot2 for plots
Programming: tibble for tibbles, magrittr for the %>% pipe operator
Here, I demonstrate how to use the across() function
from the dplyr package with a Kaggle dataset
of airline passenger reviews of Ryanair, a low-cost airline in
Europe.
Data were downloaded from Kaggle as a CSV file, pushed to a GitHub repository, and then read in.
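library(tidyverse)   # readr, dplyr, and the %>% pipe
library(kableExtra)  # kbl(), kable_minimal(), and scroll_box() used for the tables below
ryanair_reviews_raw <- read_csv('https://media.githubusercontent.com/media/alexandersimon1/Data607/main/Tidyverse%20CREATE/ryanair_reviews.csv', show_col_types = FALSE)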
This is what the data look like:[1]
glimpse(ryanair_reviews)
## Rows: 2,249
## Columns: 19
## $ Date_publish <date> 2024-02-03, 2024-01-26, 2024-01-20, 2024-01-07,…
## $ Comment_title <chr> "\"bang on time and smooth flights\"", "\"Anothe…
## $ Comment <chr> "Flew back from Faro to London Luton Friday 2nd …
## $ Rating_seat_comfort <dbl> 4, 3, 5, 3, 4, 2, 2, NA, 1, 1, 3, 1, 3, 1, 1, 4,…
## $ Rating_service_cabin <dbl> 5, 5, 5, 2, 5, 2, 5, NA, NA, 1, 5, 1, 2, 1, 2, 5…
## $ Rating_food <dbl> 3, 3, 4, 1, NA, 2, 2, NA, NA, NA, NA, NA, 1, 1, …
## $ Rating_service_ground <dbl> 4, 5, 5, 3, 4, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 5, …
## $ Rating_entertainment <dbl> NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, 1…
## $ Rating_wifi <dbl> NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, 1…
## $ Rating_value <dbl> 4, 5, 5, 3, 5, 1, 1, 1, 1, 1, 5, 1, 3, 1, 2, 5, …
## $ Rating_overall <dbl> 10, 10, 10, 6, 10, 1, 5, 1, 1, 1, 8, 1, 3, 1, 1,…
## $ Recommended <chr> "yes", "yes", "yes", "yes", "yes", "no", "yes", …
## $ Flight_date <chr> "February 2024", "January 2024", "October 2023",…
## $ Origin <chr> "Faro", "Belfast", "Edinburgh", "Faro", "Dublin"…
## $ Destination <chr> "Luton", "Alicante", "Paris Beauvais", "Liverpoo…
## $ Aircraft <chr> "Boeing 737 900", NA, "Boeing 737-800", "Boeing …
## $ Traveler_type <chr> "Family Leisure", "Couple Leisure", "Couple Leis…
## $ Seat_type <chr> "Economy Class", "Economy Class", "Economy Class…
## $ Trip_verified <chr> "Not Verified", "Trip Verified", "Trip Verified"…
R is optimized for operations over columns. across()
simplifies the application of a function to multiple columns. Although
the same results can be obtained in other ways, using across() results
in cleaner, more concise, and more scalable code.
across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)
Arguments:
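.cols - The columns of the dataframe to transform. Columns can be specified explicitly, by a range (e.g., c(x:z) for columns ‘x’ through ‘z’), or with a selection helper such as starts_with()
.fns - The function(s) to apply to each column specified in .cols. This can be a single function, a formula-style lambda such as ~ mean(.x, na.rm = TRUE), or a named list of these
.names - A template that controls the names of the new columns produced by the functions specified in .fns
.unpack - When TRUE, data frames returned by the functions in .fns are expanded into individual columns. See the tidyr documentation for pack() and unpack() for more information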
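The arguments above can be tried out on a small example before moving to the review data. The sketch below uses a hypothetical toy tibble, `scores`, that is not part of the Ryanair dataset, and it assumes dplyr 1.1.0 or later, since `.unpack` is not available in older versions. It uses `{.col}`, the placeholder form given in the dplyr documentation.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not from the Ryanair dataset)
scores <- tibble(
  id = 1:3,
  x = c(1, 2, 3),
  y = c(4, 5, 6),
  z = c(7, 8, NA)
)

# .cols as a range, .fns as a formula-style lambda, .names as a naming template
scores_mean <- scores %>%
  summarise(
    across(
      .cols = x:z,
      .fns = ~ mean(.x, na.rm = TRUE),
      .names = "{.col}_mean"
    )
  )

# .unpack = TRUE: the lambda returns a one-row tibble, and its columns
# are spread out into separate output columns
scores_range <- scores %>%
  summarise(
    across(
      .cols = x:y,
      .fns = ~ tibble(min = min(.x), max = max(.x)),
      .unpack = TRUE
    )
  )

scores_mean
scores_range
```

Here `id` is left alone because it falls outside the selected range, which is the main reason to be explicit about `.cols`.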
Without across() - A separate expression is needed for each rating. The code is repetitive, hard to read, and prone to errors (e.g., typos).
ryanair_reviews %>%
  summarise(
    Rating_seat_comfort_mean = round(mean(Rating_seat_comfort, na.rm = TRUE), 2),
    Rating_service_cabin_mean = round(mean(Rating_service_cabin, na.rm = TRUE), 2),
    Rating_food_mean = round(mean(Rating_food, na.rm = TRUE), 2),
    Rating_service_ground_mean = round(mean(Rating_service_ground, na.rm = TRUE), 2),
    Rating_entertainment_mean = round(mean(Rating_entertainment, na.rm = TRUE), 2),
    Rating_wifi_mean = round(mean(Rating_wifi, na.rm = TRUE), 2),
    Rating_value_mean = round(mean(Rating_value, na.rm = TRUE), 2),
    Rating_overall_mean = round(mean(Rating_overall, na.rm = TRUE), 2)
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")
| Rating_seat_comfort_mean | Rating_service_cabin_mean | Rating_food_mean | Rating_service_ground_mean | Rating_entertainment_mean | Rating_wifi_mean | Rating_value_mean | Rating_overall_mean |
|---|---|---|---|---|---|---|---|
| 2.37 | 2.75 | 1.92 | 2.16 | 1.16 | 1.12 | 2.73 | 4.38 |
With across() - The code is much simpler, regardless of the number of columns.
ryanair_reviews %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = ~ round(mean(.x, na.rm = TRUE), 2),  # ".x" is a placeholder for the column
      .names = "{col}_mean"                       # append "_mean" to each column name
    )
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")
| Rating_seat_comfort_mean | Rating_service_cabin_mean | Rating_food_mean | Rating_service_ground_mean | Rating_entertainment_mean | Rating_wifi_mean | Rating_value_mean | Rating_overall_mean |
|---|---|---|---|---|---|---|---|
| 2.37 | 2.75 | 1.92 | 2.16 | 1.16 | 1.12 | 2.73 | 4.38 |
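The same pattern is not limited to summarise(). As a minimal sketch, the block below applies across() inside mutate() to rescale ratings to a 0-1 range. The tibble `ratings` is hypothetical toy data whose column names mirror the review dataset, and dividing by 5 is only an illustrative choice.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the review dataset
ratings <- tibble(
  Origin = c("Faro", "Belfast"),
  Rating_food = c(3, NA),
  Rating_value = c(4, 5)
)

# Without .names, across() overwrites the selected columns in place;
# the non-rating column (Origin) is left untouched
ratings_scaled <- ratings %>%
  mutate(across(starts_with("Rating"), ~ .x / 5))

ratings_scaled
```

Supplying `.names` in the same call would instead keep the original columns and add the rescaled ones alongside them.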
The .fns argument of across() can also take
multiple functions, supplied as a named list.
Without across() - Again, the code block is long and repetitive.
ryanair_reviews %>%
  summarise(
    Rating_seat_comfort_min = min(Rating_seat_comfort, na.rm = TRUE),
    Rating_seat_comfort_max = max(Rating_seat_comfort, na.rm = TRUE),
    Rating_service_cabin_min = min(Rating_service_cabin, na.rm = TRUE),
    Rating_service_cabin_max = max(Rating_service_cabin, na.rm = TRUE),
    Rating_food_min = min(Rating_food, na.rm = TRUE),
    Rating_food_max = max(Rating_food, na.rm = TRUE),
    Rating_service_ground_min = min(Rating_service_ground, na.rm = TRUE),
    Rating_service_ground_max = max(Rating_service_ground, na.rm = TRUE),
    Rating_entertainment_min = min(Rating_entertainment, na.rm = TRUE),
    Rating_entertainment_max = max(Rating_entertainment, na.rm = TRUE),
    Rating_wifi_min = min(Rating_wifi, na.rm = TRUE),
    Rating_wifi_max = max(Rating_wifi, na.rm = TRUE),
    Rating_value_min = min(Rating_value, na.rm = TRUE),
    Rating_value_max = max(Rating_value, na.rm = TRUE),
    Rating_overall_min = min(Rating_overall, na.rm = TRUE),
    Rating_overall_max = max(Rating_overall, na.rm = TRUE)
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")
| Rating_seat_comfort_min | Rating_seat_comfort_max | Rating_service_cabin_min | Rating_service_cabin_max | Rating_food_min | Rating_food_max | Rating_service_ground_min | Rating_service_ground_max | Rating_entertainment_min | Rating_entertainment_max | Rating_wifi_min | Rating_wifi_max | Rating_value_min | Rating_value_max | Rating_overall_min | Rating_overall_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 0 | 5 | 0 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 0 | 5 | 1 | 10 |
With across() - The code is much simpler and easily scalable to more columns and/or functions.
ryanair_reviews %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = list(
        min = ~ min(.x, na.rm = TRUE),
        max = ~ max(.x, na.rm = TRUE)
      ),
      .names = "{col}_{fn}"  # append function name ({fn}) to each column name
    )
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")
| Rating_seat_comfort_min | Rating_seat_comfort_max | Rating_service_cabin_min | Rating_service_cabin_max | Rating_food_min | Rating_food_max | Rating_service_ground_min | Rating_service_ground_max | Rating_entertainment_min | Rating_entertainment_max | Rating_wifi_min | Rating_wifi_max | Rating_value_min | Rating_value_max | Rating_overall_min | Rating_overall_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 0 | 5 | 0 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 0 | 5 | 1 | 10 |
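A named list of functions also combines naturally with group_by(). The following is a sketch under stated assumptions: `reviews` is a hypothetical toy tibble (not the real data), and the choice of a mean plus a count of non-missing ratings is only for illustration. The names given in the list (`mean`, `n`) are what `{.fn}` expands to in the output column names.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the review dataset
reviews <- tibble(
  Traveler_type = c("Solo Leisure", "Solo Leisure", "Business"),
  Rating_food = c(1, 3, NA),
  Rating_value = c(2, 4, 5)
)

# One row per traveler type, two summary columns per rating column
by_type <- reviews %>%
  group_by(Traveler_type) %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = list(
        mean = ~ mean(.x, na.rm = TRUE),
        n = ~ sum(!is.na(.x))  # number of non-missing ratings
      ),
      .names = "{.col}_{.fn}"
    )
  )

by_type
```

Reporting the count next to the mean makes it visible when a group's average rests on few or no ratings, as with the "Business" row for `Rating_food` here.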
Although there are alternative approaches to coding the examples
above, across() is an attractive function for creating
scalable and easy-to-read code to transform multiple columns in a
dataframe.
“Apply a function (or functions) across multiple columns” https://dplyr.tidyverse.org/reference/across.html
“Column-wise operations in dplyr” https://www.r4epi.com/column-wise-operations-in-dplyr
[1] Data tidying isn’t shown because it isn’t the focus of the vignette. Please see the R Markdown file.