Introduction

The tidyverse is a collection of R packages that provide functions for common data science tasks, including data import, tidying, manipulation, visualization, and programming.

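Core packages for these tasks include readr (data import), tidyr (tidying), dplyr (manipulation), ggplot2 (visualization), and purrr (functional programming).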

Here, I demonstrate how to use the across() function from the dplyr package with a Kaggle dataset of airline passenger reviews of Ryanair, a low-cost airline in Europe.


Data

Data were downloaded from Kaggle as a CSV file, pushed to a GitHub repository, and then read in.

library(tidyverse)  # provides read_csv(), glimpse(), and the dplyr verbs used below
ryanair_reviews_raw <- read_csv('https://media.githubusercontent.com/media/alexandersimon1/Data607/main/Tidyverse%20CREATE/ryanair_reviews.csv', show_col_types = FALSE)

This is what the data look like after tidying (see note 1):

glimpse(ryanair_reviews)
## Rows: 2,249
## Columns: 19
## $ Date_publish          <date> 2024-02-03, 2024-01-26, 2024-01-20, 2024-01-07,…
## $ Comment_title         <chr> "\"bang on time and smooth flights\"", "\"Anothe…
## $ Comment               <chr> "Flew back from Faro to London Luton Friday 2nd …
## $ Rating_seat_comfort   <dbl> 4, 3, 5, 3, 4, 2, 2, NA, 1, 1, 3, 1, 3, 1, 1, 4,…
## $ Rating_service_cabin  <dbl> 5, 5, 5, 2, 5, 2, 5, NA, NA, 1, 5, 1, 2, 1, 2, 5…
## $ Rating_food           <dbl> 3, 3, 4, 1, NA, 2, 2, NA, NA, NA, NA, NA, 1, 1, …
## $ Rating_service_ground <dbl> 4, 5, 5, 3, 4, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 5, …
## $ Rating_entertainment  <dbl> NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, 1…
## $ Rating_wifi           <dbl> NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, 1…
## $ Rating_value          <dbl> 4, 5, 5, 3, 5, 1, 1, 1, 1, 1, 5, 1, 3, 1, 2, 5, …
## $ Rating_overall        <dbl> 10, 10, 10, 6, 10, 1, 5, 1, 1, 1, 8, 1, 3, 1, 1,…
## $ Recommended           <chr> "yes", "yes", "yes", "yes", "yes", "no", "yes", …
## $ Flight_date           <chr> "February 2024", "January 2024", "October 2023",…
## $ Origin                <chr> "Faro", "Belfast", "Edinburgh", "Faro", "Dublin"…
## $ Destination           <chr> "Luton", "Alicante", "Paris Beauvais", "Liverpoo…
## $ Aircraft              <chr> "Boeing 737 900", NA, "Boeing 737-800", "Boeing …
## $ Traveler_type         <chr> "Family Leisure", "Couple Leisure", "Couple Leis…
## $ Seat_type             <chr> "Economy Class", "Economy Class", "Economy Class…
## $ Trip_verified         <chr> "Not Verified", "Trip Verified", "Trip Verified"…


What is the purpose of across()?

R is optimized for operations over columns. across() simplifies applying a function to multiple columns. Although the same operations can be written without across(), using it results in cleaner, more concise, and more scalable code.
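
To make the comparison concrete, here is a minimal sketch on a small made-up tibble (toy_ratings and its values are hypothetical, not the Ryanair data). It computes column means once with base R's sapply() and once with across(). The base R version returns a named numeric vector, while across() keeps the result as a data frame inside the dplyr pipeline.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_ratings <- tibble(
  Rating_food = c(1, 3, NA),
  Rating_wifi = c(2, NA, 4),
  Origin      = c("Faro", "Dublin", "Luton")
)

# Base R alternative: pick the columns by name pattern, then apply mean() to each
rating_cols <- grep("^Rating", names(toy_ratings), value = TRUE)
means_base <- sapply(toy_ratings[rating_cols], mean, na.rm = TRUE)

# dplyr: the same selection and function expressed in a single across() call
means_across <- toy_ratings %>%
  summarise(across(starts_with("Rating"), ~ mean(.x, na.rm = TRUE)))

means_base    # named numeric vector
means_across  # one-row tibble holding the same values
```

Both approaches work. The across() version has the advantage that the column selection and the function sit in one place and the output can be piped straight into further dplyr steps.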


Usage

across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)

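Arguments:

.cols - Columns to transform, chosen with tidy-select syntax (for example, starts_with("Rating")).
.fns - Function(s) to apply to each selected column: a function name, a formula or anonymous function, or a named list of these.
... - Additional arguments passed on to .fns (deprecated in recent dplyr versions).
.names - A glue specification for naming the output columns, where {.col} (or {col}) stands for the column name and {.fn} (or {fn}) for the function name.
.unpack - Whether data frame results returned by .fns are expanded into separate columns (default FALSE).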
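
As a rough illustration of how the arguments fit together, the sketch below calls across() inside summarise() on a made-up tibble (toy_ratings and its values are hypothetical). Here .cols selects two columns by name, .fns supplies a single function, and .names controls the output names. The ... and .unpack arguments are left at their defaults.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_ratings <- tibble(
  Rating_food = c(1, 3, 5),
  Rating_wifi = c(2, 2, 5),
  Origin      = c("Faro", "Dublin", "Luton")
)

result <- toy_ratings %>%
  summarise(
    across(
      .cols  = c(Rating_food, Rating_wifi),  # tidy-select: explicit column names
      .fns   = median,                       # a bare function name is accepted
      .names = "median_{.col}"               # {.col} stands for the input column name
    )
  )

names(result)
## [1] "median_Rating_food" "median_Rating_wifi"
```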


Examples

Example 1. Calculate the mean rating of each rating category

Without across() - A separate expression is needed for each rating. The code is repetitive, hard to read, and prone to errors (e.g., typos).

ryanair_reviews %>%
  summarise(
    Rating_seat_comfort_mean = round(mean(Rating_seat_comfort, na.rm = TRUE), 2),
    Rating_service_cabin_mean = round(mean(Rating_service_cabin, na.rm = TRUE), 2),
    Rating_food_mean = round(mean(Rating_food, na.rm = TRUE), 2),
    Rating_service_ground_mean = round(mean(Rating_service_ground, na.rm = TRUE), 2),    
    Rating_entertainment_mean = round(mean(Rating_entertainment, na.rm = TRUE), 2),
    Rating_wifi_mean = round(mean(Rating_wifi, na.rm = TRUE), 2),
    Rating_value_mean = round(mean(Rating_value, na.rm = TRUE), 2),
    Rating_overall_mean = round(mean(Rating_overall, na.rm = TRUE), 2)
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")
Rating_seat_comfort_mean | Rating_service_cabin_mean | Rating_food_mean | Rating_service_ground_mean | Rating_entertainment_mean | Rating_wifi_mean | Rating_value_mean | Rating_overall_mean
2.37 | 2.75 | 1.92 | 2.16 | 1.16 | 1.12 | 2.73 | 4.38


With across() - The code is much simpler, regardless of the number of columns.

ryanair_reviews %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = ~ round(mean(.x, na.rm = TRUE), 2),  # ".x" is a placeholder for the column
      .names = "{col}_mean"  # append "_mean" to each column name
    )
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")
Rating_seat_comfort_mean | Rating_service_cabin_mean | Rating_food_mean | Rating_service_ground_mean | Rating_entertainment_mean | Rating_wifi_mean | Rating_value_mean | Rating_overall_mean
2.37 | 2.75 | 1.92 | 2.16 | 1.16 | 1.12 | 2.73 | 4.38
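
The formula shorthand (~ with .x) is one of several ways to pass a function to .fns. As a sketch, assuming R 4.1 or later, the same kind of summary can be written with the base R anonymous-function syntax \(x). The tibble below is made up for illustration (toy_ratings is hypothetical), and the final line checks that both spellings give the same result.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_ratings <- tibble(
  Rating_food = c(1, 2, NA),
  Rating_wifi = c(1, NA, 4)
)

# Formula shorthand: .x refers to the current column
means_formula <- toy_ratings %>%
  summarise(across(starts_with("Rating"),
                   ~ round(mean(.x, na.rm = TRUE), 2),
                   .names = "{col}_mean"))

# Anonymous function: x refers to the current column (requires R >= 4.1)
means_lambda <- toy_ratings %>%
  summarise(across(starts_with("Rating"),
                   \(x) round(mean(x, na.rm = TRUE), 2),
                   .names = "{col}_mean"))

# Should be TRUE, since both spellings apply the same function
all.equal(means_formula, means_lambda)
```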


Example 2. Determine the minimum and maximum rating of each rating category

The .fns argument of across() also accepts a named list of multiple functions.

Without across() - Again, the code block is long and repetitive.

ryanair_reviews %>%
  summarise(
    Rating_seat_comfort_min = min(Rating_seat_comfort, na.rm = TRUE),
    Rating_seat_comfort_max = max(Rating_seat_comfort, na.rm = TRUE),    
    Rating_service_cabin_min = min(Rating_service_cabin, na.rm = TRUE),
    Rating_service_cabin_max = max(Rating_service_cabin, na.rm = TRUE),    
    Rating_food_min = min(Rating_food, na.rm = TRUE),
    Rating_food_max = max(Rating_food, na.rm = TRUE),    
    Rating_service_ground_min = min(Rating_service_ground, na.rm = TRUE),
    Rating_service_ground_max = max(Rating_service_ground, na.rm = TRUE),    
    Rating_entertainment_min = min(Rating_entertainment, na.rm = TRUE),
    Rating_entertainment_max = max(Rating_entertainment, na.rm = TRUE),    
    Rating_wifi_min = min(Rating_wifi, na.rm = TRUE),
    Rating_wifi_max = max(Rating_wifi, na.rm = TRUE),    
    Rating_value_min = min(Rating_value, na.rm = TRUE),
    Rating_value_max = max(Rating_value, na.rm = TRUE),    
    Rating_overall_min = min(Rating_overall, na.rm = TRUE),
    Rating_overall_max = max(Rating_overall, na.rm = TRUE)   
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")
Rating_seat_comfort_min | Rating_seat_comfort_max | Rating_service_cabin_min | Rating_service_cabin_max | Rating_food_min | Rating_food_max | Rating_service_ground_min | Rating_service_ground_max | Rating_entertainment_min | Rating_entertainment_max | Rating_wifi_min | Rating_wifi_max | Rating_value_min | Rating_value_max | Rating_overall_min | Rating_overall_max
0 | 5 | 0 | 5 | 0 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 0 | 5 | 1 | 10


With across() - The code is much simpler and easily scalable to more columns and/or functions.

ryanair_reviews %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = list(min = ~ min(.x, na.rm = TRUE), 
                  max = ~ max(.x, na.rm = TRUE)),
      .names = "{col}_{fn}"  # append function name ({fn}) to each column name
    )
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")
Rating_seat_comfort_min | Rating_seat_comfort_max | Rating_service_cabin_min | Rating_service_cabin_max | Rating_food_min | Rating_food_max | Rating_service_ground_min | Rating_service_ground_max | Rating_entertainment_min | Rating_entertainment_max | Rating_wifi_min | Rating_wifi_max | Rating_value_min | Rating_value_max | Rating_overall_min | Rating_overall_max
0 | 5 | 0 | 5 | 0 | 5 | 1 | 5 | 1 | 5 | 1 | 5 | 0 | 5 | 1 | 10
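
To see how the named list in .fns and the .names template interact, here is a small sketch on made-up data (toy_ratings is hypothetical). Each selected column is paired with each function in turn, so the output columns are ordered column by column, with the list names filling the {fn} slot.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy_ratings <- tibble(
  Rating_food = c(1, 3, NA),
  Rating_wifi = c(2, NA, 4)
)

range_summary <- toy_ratings %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = list(min = ~ min(.x, na.rm = TRUE),
                  max = ~ max(.x, na.rm = TRUE)),
      .names = "{col}_{fn}"  # list names ("min", "max") fill the {fn} slot
    )
  )

names(range_summary)
## [1] "Rating_food_min" "Rating_food_max" "Rating_wifi_min" "Rating_wifi_max"
```

Adding another statistic, such as a median, only requires one more entry in the list. The column selection and naming template stay unchanged.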


Summary

Although there are alternative approaches to coding the examples above, across() offers a scalable, easy-to-read way to transform multiple columns in a data frame.


References

“Apply a function (or functions) across multiple columns” https://dplyr.tidyverse.org/reference/across.html

“Column-wise operations in dplyr” https://www.r4epi.com/column-wise-operations-in-dplyr


  1. Data tidying isn’t shown because it isn’t the focus of the vignette. Please see the R markdown file.