DATA607 TidyVerse CREATE Assignment

Introduction

Tidyverse is a collection of R packages that provide useful functions for common tasks in data science, including data import, tidying, manipulation, visualization, and programming.

Packages for each of these tasks include:

Import: readr for reading files in various formats
Tidy: tidyr for tidying data
Transformation: dplyr and its related packages (eg, stringr for strings, forcats for factors, lubridate for date and time)
Visualization: ggplot2 for plots
Programming: tibble for tibbles, magrittr for the %>% pipe operator

Here, I demonstrate how to use the across() function from the dplyr package with a Kaggle dataset of airline passenger reviews of Ryanair, a low-cost airline in Europe.

Data

Data were downloaded from Kaggle as a CSV file, pushed to a GitHub repository, and then read in.

library(tidyverse)  # provides read_csv(), glimpse(), and the dplyr verbs used below
ryanair_reviews_raw <- read_csv('https://media.githubusercontent.com/media/alexandersimon1/Data607/main/Tidyverse%20CREATE/ryanair_reviews.csv', show_col_types = FALSE)

This is what the data look like:¹

glimpse(ryanair_reviews)

## Rows: 2,249
## Columns: 19
## $ Date_publish          <date> 2024-02-03, 2024-01-26, 2024-01-20, 2024-01-07,…
## $ Comment_title         <chr> "\"bang on time and smooth flights\"", "\"Anothe…
## $ Comment               <chr> "Flew back from Faro to London Luton Friday 2nd …
## $ Rating_seat_comfort   <dbl> 4, 3, 5, 3, 4, 2, 2, NA, 1, 1, 3, 1, 3, 1, 1, 4,…
## $ Rating_service_cabin  <dbl> 5, 5, 5, 2, 5, 2, 5, NA, NA, 1, 5, 1, 2, 1, 2, 5…
## $ Rating_food           <dbl> 3, 3, 4, 1, NA, 2, 2, NA, NA, NA, NA, NA, 1, 1, …
## $ Rating_service_ground <dbl> 4, 5, 5, 3, 4, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 5, …
## $ Rating_entertainment  <dbl> NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, 1…
## $ Rating_wifi           <dbl> NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, 1…
## $ Rating_value          <dbl> 4, 5, 5, 3, 5, 1, 1, 1, 1, 1, 5, 1, 3, 1, 2, 5, …
## $ Rating_overall        <dbl> 10, 10, 10, 6, 10, 1, 5, 1, 1, 1, 8, 1, 3, 1, 1,…
## $ Recommended           <chr> "yes", "yes", "yes", "yes", "yes", "no", "yes", …
## $ Flight_date           <chr> "February 2024", "January 2024", "October 2023",…
## $ Origin                <chr> "Faro", "Belfast", "Edinburgh", "Faro", "Dublin"…
## $ Destination           <chr> "Luton", "Alicante", "Paris Beauvais", "Liverpoo…
## $ Aircraft              <chr> "Boeing 737 900", NA, "Boeing 737-800", "Boeing …
## $ Traveler_type         <chr> "Family Leisure", "Couple Leisure", "Couple Leis…
## $ Seat_type             <chr> "Economy Class", "Economy Class", "Economy Class…
## $ Trip_verified         <chr> "Not Verified", "Trip Verified", "Trip Verified"…

What is the purpose of across()?

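R is optimized for column-wise (vectorized) operations, and many data-wrangling tasks apply the same operation to several columns. across() simplifies applying a function to multiple columns inside dplyr verbs such as summarise() and mutate(). Although the same results can be obtained in other ways, using across() results in cleaner, more concise, and more scalable code.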
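As a minimal sketch of the idea, the snippet below applies one function to two columns in a single call. The toy tibble and its column names (id, score_a, score_b) are hypothetical and are not part of the Ryanair dataset.

```r
library(dplyr)

# Hypothetical toy data (not the Ryanair reviews)
toy <- tibble(
  id = 1:3,
  score_a = c(1, 2, 3),
  score_b = c(4, 5, 6)
)

# One across() call doubles both score columns; id is left untouched
toy_doubled <- toy %>%
  mutate(across(.cols = c(score_a, score_b), .fns = ~ .x * 2))

toy_doubled
```

Because no .names rule is given, mutate() overwrites the selected columns in place instead of creating new ones.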

Usage

across(.cols, .fns, ..., .names = NULL, .unpack = FALSE)

Arguments:

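.cols - The columns of the dataframe to transform. Columns can be specified explicitly (eg, c(x, z)), by a range (eg, x:z for every column from ‘x’ through ‘z’ in dataframe order), or with a selection helper such as starts_with()
.fns - The function(s) to be applied to each column specified in .cols. This can be a single function, a formula-style lambda (eg, ~ mean(.x, na.rm = TRUE)), or a named list of functions
.names - A glue specification for naming the output columns, in which {.col} stands for the column name and {.fn} for the function name (eg, "{.col}_{.fn}")
.unpack - When TRUE, expand data frames returned by the functions in .fns into individual columns. See tidyr documentation for pack() and unpack() for more information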
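The arguments above can be sketched on a small example. The tibble, its column names, and the helper range_fn() are hypothetical, and the .unpack part assumes dplyr 1.1.0 or later, where that argument is available.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  name = c("a", "b", "c"),
  x = c(1, 2, 3),
  y = c(4, 5, 6),
  z = c(7, 8, 9)
)

# .cols as a range: x:z selects x, y, and z because they are adjacent
# .names: build each output name from the input column name
range_means <- toy %>%
  summarise(across(.cols = x:z, .fns = mean, .names = "{.col}_mean"))

# .cols with a selection helper: every numeric column
numeric_max <- toy %>%
  summarise(across(.cols = where(is.numeric), .fns = max))

# .unpack = TRUE: range_fn() (hypothetical helper) returns a one-row tibble,
# which across() expands into separate columns per input column
range_fn <- function(v) tibble(lo = min(v), hi = max(v))
unpacked <- toy %>%
  summarise(across(.cols = x:y, .fns = range_fn, .unpack = TRUE))

names(unpacked)
```

With .unpack = TRUE, the new column names join the input column name and the inner column name with an underscore.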

Examples

Example 1. Calculate the mean rating of each rating category

Without across() - A separate command is needed for each rating. The code is repetitive, hard to read, and prone to errors (eg, typos).

ryanair_reviews %>%
  summarise(
    Rating_seat_comfort_mean = round(mean(Rating_seat_comfort, na.rm = TRUE), 2),
    Rating_service_cabin_mean = round(mean(Rating_service_cabin, na.rm = TRUE), 2),
    Rating_food_mean = round(mean(Rating_food, na.rm = TRUE), 2),
    Rating_service_ground_mean = round(mean(Rating_service_ground, na.rm = TRUE), 2),    
    Rating_entertainment_mean = round(mean(Rating_entertainment, na.rm = TRUE), 2),
    Rating_wifi_mean = round(mean(Rating_wifi, na.rm = TRUE), 2),
    Rating_value_mean = round(mean(Rating_value, na.rm = TRUE), 2),
    Rating_overall_mean = round(mean(Rating_overall, na.rm = TRUE), 2)
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")

Rating_seat_comfort_mean	Rating_service_cabin_mean	Rating_food_mean	Rating_service_ground_mean	Rating_entertainment_mean	Rating_wifi_mean	Rating_value_mean	Rating_overall_mean
2.37	2.75	1.92	2.16	1.16	1.12	2.73	4.38

With across() - The code is much simpler, regardless of the number of columns.

ryanair_reviews %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = ~ round(mean(.x, na.rm = TRUE), 2),  # ".x" is a placeholder for the column
      .names = "{.col}_mean"  # append "_mean" to each column name
    )
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")

Rating_seat_comfort_mean	Rating_service_cabin_mean	Rating_food_mean	Rating_service_ground_mean	Rating_entertainment_mean	Rating_wifi_mean	Rating_value_mean	Rating_overall_mean
2.37	2.75	1.92	2.16	1.16	1.12	2.73	4.38
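For readers who want to run the comparison without downloading the dataset, here is a small self-contained sketch. The tibble toy_reviews is a hypothetical stand-in with made-up values; only the column-naming pattern mirrors the real data.

```r
library(dplyr)

# Hypothetical stand-in for the ratings columns (values are made up)
toy_reviews <- tibble(
  Rating_food  = c(1, 3, NA, 4),
  Rating_value = c(5, 2, 2, NA),
  Recommended  = c("no", "yes", "no", "yes")
)

# Without across(): one expression per column
manual <- toy_reviews %>%
  summarise(
    Rating_food_mean  = round(mean(Rating_food, na.rm = TRUE), 2),
    Rating_value_mean = round(mean(Rating_value, na.rm = TRUE), 2)
  )

# With across(): one expression for all columns that start with "Rating"
with_across <- toy_reviews %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = ~ round(mean(.x, na.rm = TRUE), 2),
      .names = "{.col}_mean"
    )
  )

# Compare the two results column by column
all.equal(as.data.frame(manual), as.data.frame(with_across))
```

The Recommended column is ignored by starts_with("Rating"), so adding more rating columns to the data would not require any change to the across() version.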

Example 2. Determine the minimum and maximum rating of each rating category

The .fns argument of across() also accepts a named list of multiple functions, each of which is applied to every selected column.

Without across() - Again, the code block is long and repetitive.

ryanair_reviews %>%
  summarise(
    Rating_seat_comfort_min = min(Rating_seat_comfort, na.rm = TRUE),
    Rating_seat_comfort_max = max(Rating_seat_comfort, na.rm = TRUE),    
    Rating_service_cabin_min = min(Rating_service_cabin, na.rm = TRUE),
    Rating_service_cabin_max = max(Rating_service_cabin, na.rm = TRUE),    
    Rating_food_min = min(Rating_food, na.rm = TRUE),
    Rating_food_max = max(Rating_food, na.rm = TRUE),    
    Rating_service_ground_min = min(Rating_service_ground, na.rm = TRUE),
    Rating_service_ground_max = max(Rating_service_ground, na.rm = TRUE),    
    Rating_entertainment_min = min(Rating_entertainment, na.rm = TRUE),
    Rating_entertainment_max = max(Rating_entertainment, na.rm = TRUE),    
    Rating_wifi_min = min(Rating_wifi, na.rm = TRUE),
    Rating_wifi_max = max(Rating_wifi, na.rm = TRUE),    
    Rating_value_min = min(Rating_value, na.rm = TRUE),
    Rating_value_max = max(Rating_value, na.rm = TRUE),    
    Rating_overall_min = min(Rating_overall, na.rm = TRUE),
    Rating_overall_max = max(Rating_overall, na.rm = TRUE)   
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")

Rating_seat_comfort_min	Rating_seat_comfort_max	Rating_service_cabin_min	Rating_service_cabin_max	Rating_food_min	Rating_food_max	Rating_service_ground_min	Rating_service_ground_max	Rating_entertainment_min	Rating_entertainment_max	Rating_wifi_min	Rating_wifi_max	Rating_value_min	Rating_value_max	Rating_overall_min	Rating_overall_max
0	5	0	5	0	5	1	5	1	5	1	5	0	5	1	10

With across() - The code is much simpler and easily scalable to more columns and/or functions.

ryanair_reviews %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = list(min = ~ min(.x, na.rm = TRUE), 
                  max = ~ max(.x, na.rm = TRUE)),
      .names = "{.col}_{.fn}"  # append function name ({.fn}) to each column name
    )
  ) %>%
  kbl() %>%
  kable_minimal() %>%
  scroll_box(width = "100%")

Rating_seat_comfort_min	Rating_seat_comfort_max	Rating_service_cabin_min	Rating_service_cabin_max	Rating_food_min	Rating_food_max	Rating_service_ground_min	Rating_service_ground_max	Rating_entertainment_min	Rating_entertainment_max	Rating_wifi_min	Rating_wifi_max	Rating_value_min	Rating_value_max	Rating_overall_min	Rating_overall_max
0	5	0	5	0	5	1	5	1	5	1	5	0	5	1	10
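To illustrate how this scales to more functions, the sketch below adds a third summary to the list. The toy data and the n_missing entry are hypothetical additions for illustration and do not appear in the original analysis.

```r
library(dplyr)

# Hypothetical stand-in for the ratings columns (values are made up)
toy_reviews <- tibble(
  Rating_food  = c(1, 3, NA, 4),
  Rating_value = c(5, 2, 2, NA)
)

min_max_missing <- toy_reviews %>%
  summarise(
    across(
      .cols = starts_with("Rating"),
      .fns = list(
        min = ~ min(.x, na.rm = TRUE),
        max = ~ max(.x, na.rm = TRUE),
        n_missing = ~ sum(is.na(.x))  # hypothetical extra summary
      ),
      .names = "{.col}_{.fn}"  # list names supply the {.fn} part
    )
  )

names(min_max_missing)
```

The output columns are grouped by input column, with one column per function in list order, so adding a function means adding a single list entry.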

Summary

Although there are alternative approaches to coding the examples above, across() is an attractive function for creating scalable and easy-to-read code to transform multiple columns in a dataframe.
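The vignette does not say which alternatives it has in mind. One possibility, sketched here as an assumption, is to reshape the ratings to long format with tidyr's pivot_longer() and then summarise by group. The toy data and the names category, rating, and mean_rating are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the ratings columns (values are made up)
toy_reviews <- tibble(
  Rating_food  = c(1, 3, NA, 4),
  Rating_value = c(5, 2, 2, NA)
)

# Reshape to one row per (category, rating) pair, then summarise per category
long_means <- toy_reviews %>%
  pivot_longer(
    cols = starts_with("Rating"),
    names_to = "category",
    values_to = "rating"
  ) %>%
  group_by(category) %>%
  summarise(mean_rating = round(mean(rating, na.rm = TRUE), 2))

long_means
```

This returns one row per rating category instead of one column per category, which may suit plotting better, while across() keeps the wide layout shown in the examples.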

References

“Apply a function (or functions) across multiple columns” https://dplyr.tidyverse.org/reference/across.html

“Column-wise operations in dplyr” https://www.r4epi.com/column-wise-operations-in-dplyr

¹ Data tidying isn’t shown because it isn’t the focus of the vignette. Please see the R markdown file.