Lab 2: Data Tidying

Author

Amanda Rose Knudsen

Overview

In this assignment you will work to tidy, clean, and analyze two different datasets, the first is a small dataset contained in a csv file called flightdelays.csv, and the second called MixedDrinkRecipes-Prep.csv.

The most important book chapters which cover the techniques you will practice here are R4DS Chapters 5 and 7. Also helpful are the tidyr vignette on pivoting and the ggplot help page on the geom_dotplot.

Submit your completed assignment on the course brightspace page by uploading your .qmd file and a compiled pdf or link to a compiled html, which you could host on your github or rpubs page as you wish.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Part 1: Airplane flight delays

Consider the following dataset:

		Los_Angeles	Phoenix	San_Diego	San_Francisco	Seattle
ALASKA	On_Time	497	221	212	503	1841
	Delayed	62	12	20	102	305
AM WEST	On_Time	694	4840	383	320	301
	Delayed	117	415	65	129	61

The above table describes arrival delays for two different airlines across several destinations. The numbers correspond to the number of flights that were in either the delayed category or the on time category.

Problems

Problem 1: Read the information from flightdelays.csv into R, and use tidyr and dplyr to convert this data into a tidy/tall format with names and complete data for all columns. Your final data frame should have City, On_Time_Flights and Delayed_Flights as columns (the exact names are up to you). In addition to pivot_longer, pivot_wider and rename, you might find the tidyr function fill helpful for completing this task efficiently. Although this is a small dataset that you could easily reshape by hand, you should solve this problem using tidyverse functions that do the work for you.

flightdelays = read_csv("Lab02_Data/flightdelays.csv")

flightdelays_clean <- flightdelays |> 
  rename(Airline = "...1", Status = "...2") |> 
  fill(Airline) |> 
  pivot_longer(
    cols = Los_Angeles:Seattle,
    names_to = "City",
    values_to = "Count_of_Flights"
  ) |> 
  pivot_wider(
    names_from = Status,
    values_from = Count_of_Flights
  ) |> 
  group_by(City) |> 
  summarize(
    On_Time_Flights = sum(On_Time),
    Delayed_Flights = sum(Delayed)
  ) |> 
  select(
    City, On_Time_Flights, Delayed_Flights
    ) # the 3 columns as requested in Problem 1

flightdelays_clean

# A tibble: 5 × 3
  City          On_Time_Flights Delayed_Flights
  <chr>                   <dbl>           <dbl>
1 Los_Angeles              1191             179
2 Phoenix                  5061             427
3 San_Diego                 595              85
4 San_Francisco             823             231
5 Seattle                  2142             366

Problem 2: Take the data-frame that you tidied and cleaned in Problem 1 and create additional columns which contain the fraction of on-time and delayed flights at each airport. Then create a Cleveland Multiway Dot Plot (see this tutorial page for a description for how) to visualize the difference in flight delays between the two airlines at each city in the dataset. Compare the airlines and airports using the dot-plot- what are your conclusions?

Optional: If you want to make a fancier visualization consider adding text labels containing the airline names above the dots using geom_text and position = position_nudge(...) with appropriate arguments.

Note from Amanda: Problem 1 had us eliminate Airline (instruction was to have columns for City, On Time Flights, and Delayed Flights), so rather than re-using that dataframe I’m going to create a new one that retains the Airline…

flightdelays_proportions_without_Airline <- flightdelays_clean |> 
  mutate(
    Total_Flights = On_Time_Flights + Delayed_Flights
  ) |> 
  mutate(
    Fraction_On_Time = round(On_Time_Flights / Total_Flights, digits = 3),
    Fraction_Delayed = round(Delayed_Flights / Total_Flights, digits = 3)
  ) # for the proportions of total flights per City

flightdelays_proportions_without_Airline # total flights per City

# A tibble: 5 × 6
  City          On_Time_Flights Delayed_Flights Total_Flights Fraction_On_Time
  <chr>                   <dbl>           <dbl>         <dbl>            <dbl>
1 Los_Angeles              1191             179          1370            0.869
2 Phoenix                  5061             427          5488            0.922
3 San_Diego                 595              85           680            0.875
4 San_Francisco             823             231          1054            0.781
5 Seattle                  2142             366          2508            0.854
# ℹ 1 more variable: Fraction_Delayed <dbl>

flightdelays_proportions_with_Airline <- flightdelays |> 
  rename(Airline = "...1", Status = "...2") |> 
  fill(Airline) |> 
  pivot_longer(
    cols = Los_Angeles:Seattle,
    names_to = "City",
    values_to = "Count_of_Flights"
  ) |> 
  pivot_wider(
    names_from = Status,
    values_from = Count_of_Flights
  ) |> 
  mutate(
    Total_Flights = On_Time + Delayed
  ) |> 
  mutate(
    Fraction_On_Time = round(On_Time / Total_Flights, digits = 3),
    Fraction_Delayed = round(Delayed / Total_Flights, digits = 3)
  )

print(flightdelays_proportions_with_Airline)

# A tibble: 10 × 7
   Airline City  On_Time Delayed Total_Flights Fraction_On_Time Fraction_Delayed
   <chr>   <chr>   <dbl>   <dbl>         <dbl>            <dbl>            <dbl>
 1 ALASKA  Los_…     497      62           559            0.889            0.111
 2 ALASKA  Phoe…     221      12           233            0.948            0.052
 3 ALASKA  San_…     212      20           232            0.914            0.086
 4 ALASKA  San_…     503     102           605            0.831            0.169
 5 ALASKA  Seat…    1841     305          2146            0.858            0.142
 6 AM WEST Los_…     694     117           811            0.856            0.144
 7 AM WEST Phoe…    4840     415          5255            0.921            0.079
 8 AM WEST San_…     383      65           448            0.855            0.145
 9 AM WEST San_…     320     129           449            0.713            0.287
10 AM WEST Seat…     301      61           362            0.831            0.169

right_label <- flightdelays_proportions_with_Airline |> 
  group_by(City) |> 
  arrange(desc(Fraction_Delayed)) |> 
  top_n(1)

Selecting by Fraction_Delayed

left_label <- flightdelays_proportions_with_Airline |> 
  group_by(City) |> 
  arrange(desc(Fraction_Delayed)) |> 
  slice(2)
  
ggplot(flightdelays_proportions_with_Airline, aes(Fraction_Delayed, City)) +
  geom_line(aes(group = City)) +
  geom_point(aes(color = Airline), size = 1.5) +
  geom_text(
    data = right_label, 
    aes(color = Airline, label = round(
      Fraction_Delayed, 2)), size = 3, hjust = -.5) +
  geom_text(
    data = left_label, 
    aes(color = Airline, label = round(
      Fraction_Delayed, 2)), size = 3, hjust = 1.5) +
  scale_x_continuous(limits = c(-0.05, 0.35)) + 
  labs(title = "Airlines' Proportion of Delayed Flights in 5 Cities")

When we look at the proportion of delayed flights across five cities, we can see a better representation than if we compared the raw numbers for number of flights delayed. If we were to solely look at the number of flights delayed in each airport to compare the ‘performance’ of each airline, we would not get the same story we do when we look at the proportion of delayed flights. The ‘raw numbers’ in ‘Flights Delayed’ appear to indicate Phoenix is problematic for AM WEST, with the largest number of delayed flights of all the airports, recorded at 415 delayed flights. However, the number of on-time flights for AM WEST at Phoenix is 4,840: the proportion of delayed flights, as visible in the Cleveland Dot Plot above, is less than 10% for both airlines (.08 or 8% for AM WEST and .05 or 5% for ALASKA).

Using the Cleveland Dot Plot and the Fraction of Delayed Flights per airport in each city, we can see a clearer picture of where the highest proportion of delayed flights occurs for both ALASKA and AM WEST: San Francisco. We wouldn’t see this information if we merely looked at the counts of delayed flights and compared them: For ALASKA, 17% of its flights are delayed in San Francisco representing 102 flights; for AM WEST, a significant 29% of its flights are delayed in San Francisco, representing 129 flights. Using indicators of fractions of delayed flights gives us a better picture of performance when comparing airlines’ delayed flights.

right_label2 <- flightdelays_proportions_with_Airline |> 
  group_by(City) |> 
  arrange(desc(Delayed)) |> 
  top_n(1)

Selecting by Fraction_Delayed

left_label2 <- flightdelays_proportions_with_Airline |> 
  group_by(City) |> 
  arrange(desc(Delayed)) |> 
  slice(2)

plot_delay_counts <- ggplot(flightdelays_proportions_with_Airline, 
                            aes(Delayed, City)) +
  geom_line(aes(group = City)) +
  geom_point(aes(color = Airline), size = 1.5) +
  geom_text(data = right_label2, 
            aes(color = Airline, label = Delayed), size = 4, hjust = -.5) +
  geom_text(data = left_label2, 
            aes(color = Airline, label = Delayed), size = 4, hjust = 1.5) +
  scale_x_continuous(limits = c(-20, 475)) + 
  labs(title = "Airlines' Count of Delayed Flights in 5 Cities")

plot_delay_counts

plot_data <- ggplot_build(plot_delay_counts) |> 
  pluck("data" , 1)

print(plot_data)

   group   x y PANEL flipped_aes colour linewidth linetype alpha
1      1  62 1     1       FALSE  black       0.5        1    NA
2      1 117 1     1       FALSE  black       0.5        1    NA
3      2  12 2     1       FALSE  black       0.5        1    NA
4      2 415 2     1       FALSE  black       0.5        1    NA
5      3  20 3     1       FALSE  black       0.5        1    NA
6      3  65 3     1       FALSE  black       0.5        1    NA
7      4 102 4     1       FALSE  black       0.5        1    NA
8      4 129 4     1       FALSE  black       0.5        1    NA
9      5  61 5     1       FALSE  black       0.5        1    NA
10     5 305 5     1       FALSE  black       0.5        1    NA

As you can see in the above table of data, the ‘correct’ numbers (including 305 for Seattle / ALASKA airline) are part of the data that creates the ggplot. I’m pretty stumped at why it’s not showing up the way I’d like it to. But, that’s OK, because this is more of an ‘extra’ view – the proportional delays is what I’d like to highlight instead, and that plot turned out as I wanted it to.

Part 2: Mixed Drink Recipes

mixeddrink = read_csv("Lab02_Data/MixedDrinkRecipes-prep.csv")

In the second part of this assignment we will be working with a dataset containing ingredients for different types of mixed drinks. This dataset is untidy and messy- it is in a wide data format and contains some inconsistencies that should be fixed.

Problems

Problem 3 Load the mixed drink recipe dataset into R from the file MixedDrinkRecipes-prep.csv, which you can download from my github page by clicking here. The variables ingredient1 through ingredient6 list the ingredients of the cocktail listed in the name column. Notice that there are many NA values in the ingredient columns, indicating that most cocktails have under 6 ingredients.

Tidy this dataset using pivot_longer to create a new data frame where each there is a row corresponding to each ingredient of all the cocktails, and an additional variable specifying the “rank” of that cocktail in the original recipe, i.e. it should look like this:

name	category	Ingredient_Rank	Ingredient
Gauguin	Cocktail Classics	1	Light Rum
Gauguin	Cocktail Classics	2	Passion Fruit Syrup
Gauguin	Cocktail Classics	3	Lemon Juice
Gauguin	Cocktail Classics	4	Lime Juice
Fort Lauderdale	Cocktail Classics	1	Light Rum

where the data-type of Ingredient_Rank is an integer. Hint: Use the parse_number() function in mutate after your initial pivot.

mixeddrink_pivoted <- mixeddrink |> 
  pivot_longer(
    cols = starts_with("ingr"),
    names_to = "Ingredient_Rank",
    values_to = "Ingredient",
    values_drop_na = TRUE
  ) |> 
  mutate(
    Ingredient_Rank = parse_number(Ingredient_Rank)
  )

mixeddrink_pivoted

# A tibble: 3,934 × 4
   name            category              Ingredient_Rank Ingredient         
   <chr>           <chr>                           <dbl> <chr>              
 1 Gauguin         Cocktail Classics                   1 Light Rum          
 2 Gauguin         Cocktail Classics                   2 Passion Fruit Syrup
 3 Gauguin         Cocktail Classics                   3 Lemon Juice        
 4 Gauguin         Cocktail Classics                   4 Lime Juice         
 5 Fort Lauderdale Cocktail Classics                   1 Light Rum          
 6 Fort Lauderdale Cocktail Classics                   2 Sweet Vermouth     
 7 Fort Lauderdale Cocktail Classics                   3 Juice of Orange    
 8 Fort Lauderdale Cocktail Classics                   4 Juice of a Lime    
 9 Apple Pie       Cordials and Liqueurs               1 Apple schnapps     
10 Apple Pie       Cordials and Liqueurs               2 Cinnamon schnapps  
# ℹ 3,924 more rows

Problem 4: Some of the ingredients in the ingredient list have different names, but are nearly the same thing. An example include Lemon Juice versus Juice of a lemon. Make a list of the ingredients appearing in the ingredient list ranked by how commonly they occur along with the number of occurrences, and print the first 10 elements of the list here. Then check more ingredients (I suggest looking at more ingredients and even sorting them alphabetically using arrange(asc(ingredient))) and see if you can spot pairs of ingredients that are similar but have different names. Use if_else( click here for if_else ) in combination with mutate to make it so that the pairs of ingredients you found have the same name. You don’t have to find all pairs, but find at least 5 pairs of ingredients to rename. Because the purpose of this renaming is to facilitate a hypothetical future analysis, you can choose your own criteria for similarity as long as it is somewhat justifiable.

Notice that there are some ingredients that appear to be two or more ingredients strung together with commas. These would be candidates for more cleaning though this exercise doesn’t ask you to fix them.

Problem 3, Part 1:

ingredient_list <- mixeddrink_pivoted |> 
  group_by(Ingredient) |> 
  summarize(
    number_of_occurences = n()
  ) |> 
  arrange(desc(number_of_occurences))

head(ingredient_list, n = 10)

# A tibble: 10 × 2
   Ingredient        number_of_occurences
   <chr>                            <int>
 1 Gin                                176
 2 Fresh lemon juice                  138
 3 Simple Syrup                       115
 4 Light Rum                          114
 5 Vodka                              114
 6 Dry Vermouth                       107
 7 Fresh Lime Juice                   107
 8 Triple Sec                         107
 9 Powdered Sugar                      90
10 Grenadine                           85

Problem 3, Part 2:

ingredient_list_alpha <- ingredient_list |> 
  arrange(Ingredient)

ingredient_list_alpha

# A tibble: 673 × 2
   Ingredient                              number_of_occurences
   <chr>                                                  <int>
 1 100-proof Vodka                                            1
 2 151-Proof Rum                                              5
 3 17-year-old J. Wray and Nephew Ltd. Rum                    1
 4 7-Up                                                       1
 5 Absinthe                                                   8
 6 Absinthe Substitute                                        7
 7 Absinthe or pastis                                         4
 8 Acai berry flavored vodka                                  1
 9 African rum                                                1
10 Agave nectar                                              10
# ℹ 663 more rows

Problem 3, Part 3:

mixeddrink_pivoted_clean <- mixeddrink_pivoted |> 
  mutate(
    Ingredient = if_else(
      Ingredient == "Amaretto di Saronno", "Amaretto", Ingredient
    )
  ) |> 
  mutate(
    Ingredient = if_else(
      Ingredient == "Amaro Nonino", "Amaro", Ingredient
    )
  ) |> 
  mutate(
    Ingredient = if_else(
      Ingredient == "Angostura bitters", "Angostura Bitters", Ingredient
    )
  ) |> 
  mutate(
    Ingredient = if_else(
      Ingredient == "Juice of a Lime" | 
        Ingredient == "Fresh Lime Juice","Lime Juice", Ingredient
    )
  ) |> 
  mutate (
    Ingredient = if_else(
      Ingredient == "Juice of Orange" | 
        Ingredient == "Fresh orange juice", "Orange Juice", Ingredient
    )
  ) |> 
  mutate(
    Ingredient = if_else(
      Ingredient == "Juice of a Lemon" | 
        Ingredient == "Fresh lemon juice", "Lemon Juice", Ingredient
    )
  ) 

mixeddrink_pivoted_clean

# A tibble: 3,934 × 4
   name            category              Ingredient_Rank Ingredient         
   <chr>           <chr>                           <dbl> <chr>              
 1 Gauguin         Cocktail Classics                   1 Light Rum          
 2 Gauguin         Cocktail Classics                   2 Passion Fruit Syrup
 3 Gauguin         Cocktail Classics                   3 Lemon Juice        
 4 Gauguin         Cocktail Classics                   4 Lime Juice         
 5 Fort Lauderdale Cocktail Classics                   1 Light Rum          
 6 Fort Lauderdale Cocktail Classics                   2 Sweet Vermouth     
 7 Fort Lauderdale Cocktail Classics                   3 Orange Juice       
 8 Fort Lauderdale Cocktail Classics                   4 Lime Juice         
 9 Apple Pie       Cordials and Liqueurs               1 Apple schnapps     
10 Apple Pie       Cordials and Liqueurs               2 Cinnamon schnapps  
# ℹ 3,924 more rows

Fun! Now we can see the information in another way.

Problem 5: Some operations are easier to do on wide data rather than tall data. Find the 10 most common pairs of ingredients occurring in the top 2 ingredients in a recipe. It is much easier to do this with a wide dataset, so use pivot_wider to change the data so that each row contains all of the ingredients of a single cocktail, just like in the format of the original data-set. Then use count on the 1 and 2 columns to determine the most common pairs (see chapter 3 for a refresher on count).

Note: You may be interested to read about the widyr package here: widyr page. It is designed to solve problems like this one and uses internal pivot steps to accomplish it so that the final result is tidy. I’m actually unaware of any easy ways of solving problem 5 without pivoting to a wide dataset.

most_common_ingredient_pairs <- mixeddrink_pivoted_clean |> 
  pivot_wider(
    id_cols = name,
    names_from = Ingredient_Rank,
    values_from = Ingredient
  ) |> 
  rename(Ingredient1 = "1") |> 
  rename(Ingredient2 = "2") |> 
  summarize(
    Pair_Count = n(),
    .by = c(Ingredient1, Ingredient2)
  ) |> 
  arrange(desc(Pair_Count))

most_common_ingredient_pairs

# A tibble: 688 × 3
   Ingredient1     Ingredient2     Pair_Count
   <chr>           <chr>                <int>
 1 Lemon Juice     Powdered Sugar          24
 2 Gin             Dry Vermouth            23
 3 Light Rum       Lime Juice              14
 4 Whole Egg       Powdered Sugar          13
 5 Gin             Triple Sec               9
 6 Bourbon whiskey Lemon Juice              8
 7 Light Rum       Sweet Vermouth           7
 8 Gin             Sweet Vermouth           7
 9 Brandy          Sweet Vermouth           7
10 Light Rum       Pineapple Juice          7
# ℹ 678 more rows

We can see that the most common pairs of ingredients in the Cocktail list, after performing all the above operations, are Lemon Juice and Powdered Sugar, Gin and Dry Vermouth, and Light Rum and Lime Juice.