In this assignment you will work to tidy, clean, and analyze two different datasets, the first is a small dataset contained in a csv file called flightdelays.csv, and the second called MixedDrinkRecipes-Prep.csv.
The most important book chapters which cover the techniques you will practice here are R4DS Chapters 5 and 7. Also helpful are the tidyr vignette on pivoting and the ggplot help page on the geom_dotplot.
Submit your completed assignment on the course brightspace page by uploading your .qmd file and a compiled pdf or link to a compiled html, which you could host on your github or rpubs page as you wish.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Part 1: Airplane flight delays
Consider the following dataset:
Los_Angeles
Phoenix
San_Diego
San_Francisco
Seattle
ALASKA
On_Time
497
221
212
503
1841
Delayed
62
12
20
102
305
AM WEST
On_Time
694
4840
383
320
301
Delayed
117
415
65
129
61
The above table describes arrival delays for two different airlines across several destinations. The numbers correspond to the number of flights that were in either the delayed category or the on time category.
Problems
Problem 1: Read the information from flightdelays.csv into R, and use tidyr and dplyr to convert this data into a tidy/tall format with names and complete data for all columns. Your final data frame should have City, On_Time_Flights and Delayed_Flights as columns (the exact names are up to you). In addition to pivot_longer, pivot_wider and rename, you might find the tidyr function fill helpful for completing this task efficiently. Although this is a small dataset that you could easily reshape by hand, you should solve this problem using tidyverse functions that do the work for you.
Problem 2: Take the data-frame that you tidied and cleaned in Problem 1 and create additional columns which contain the fraction of on-time and delayed flights at each airport. Then create a Cleveland Multiway Dot Plot (see this tutorial page for a description for how) to visualize the difference in flight delays between the two airlines at each city in the dataset. Compare the airlines and airports using the dot-plot- what are your conclusions?
Optional: If you want to make a fancier visualization consider adding text labels containing the airline names above the dots using geom_text and position = position_nudge(...) with appropriate arguments.
Note from Amanda: Problem 1 had us eliminate Airline (instruction was to have columns for City, On Time Flights, and Delayed Flights), so rather than re-using that dataframe I’m going to create a new one that retains the Airline…
flightdelays_proportions_without_Airline <- flightdelays_clean |>mutate(Total_Flights = On_Time_Flights + Delayed_Flights ) |>mutate(Fraction_On_Time =round(On_Time_Flights / Total_Flights, digits =3),Fraction_Delayed =round(Delayed_Flights / Total_Flights, digits =3) ) # for the proportions of total flights per Cityflightdelays_proportions_without_Airline # total flights per City
When we look at the proportion of delayed flights across five cities, we can see a better representation than if we compared the raw numbers for number of flights delayed. If we were to solely look at the number of flights delayed in each airport to compare the ‘performance’ of each airline, we would not get the same story we do when we look at the proportion of delayed flights. The ‘raw numbers’ in ‘Flights Delayed’ appear to indicate Phoenix is problematic for AM WEST, with the largest number of delayed flights of all the airports, recorded at 415 delayed flights. However, the number of on-time flights for AM WEST at Phoenix is 4,840: the proportion of delayed flights, as visible in the Cleveland Dot Plot above, is less than 10% for both airlines (.08 or 8% for AM WEST and .05 or 5% for ALASKA).
Using the Cleveland Dot Plot and the Fraction of Delayed Flights per airport in each city, we can see a clearer picture of where the highest proportion of delayed flights occurs for both ALASKA and AM WEST: San Francisco. We wouldn’t see this information if we merely looked at the counts of delayed flights and compared them: For ALASKA, 17% of its flights are delayed in San Francisco representing 102 flights; for AM WEST, a significant 29% of its flights are delayed in San Francisco, representing 129 flights. Using indicators of fractions of delayed flights gives us a better picture of performance when comparing airlines’ delayed flights.
group x y PANEL flipped_aes colour linewidth linetype alpha
1 1 62 1 1 FALSE black 0.5 1 NA
2 1 117 1 1 FALSE black 0.5 1 NA
3 2 12 2 1 FALSE black 0.5 1 NA
4 2 415 2 1 FALSE black 0.5 1 NA
5 3 20 3 1 FALSE black 0.5 1 NA
6 3 65 3 1 FALSE black 0.5 1 NA
7 4 102 4 1 FALSE black 0.5 1 NA
8 4 129 4 1 FALSE black 0.5 1 NA
9 5 61 5 1 FALSE black 0.5 1 NA
10 5 305 5 1 FALSE black 0.5 1 NA
As you can see in the above table of data, the ‘correct’ numbers (including 305 for Seattle / ALASKA airline) are part of the data that creates the ggplot. I’m pretty stumped at why it’s not showing up the way I’d like it to. But, that’s OK, because this is more of an ‘extra’ view – the proportional delays is what I’d like to highlight instead, and that plot turned out as I wanted it to.
In the second part of this assignment we will be working with a dataset containing ingredients for different types of mixed drinks. This dataset is untidy and messy- it is in a wide data format and contains some inconsistencies that should be fixed.
Problems
Problem 3 Load the mixed drink recipe dataset into R from the file MixedDrinkRecipes-prep.csv, which you can download from my github page by clicking here. The variables ingredient1 through ingredient6 list the ingredients of the cocktail listed in the name column. Notice that there are many NA values in the ingredient columns, indicating that most cocktails have under 6 ingredients.
Tidy this dataset using pivot_longer to create a new data frame where each there is a row corresponding to each ingredient of all the cocktails, and an additional variable specifying the “rank” of that cocktail in the original recipe, i.e. it should look like this:
name
category
Ingredient_Rank
Ingredient
Gauguin
Cocktail Classics
1
Light Rum
Gauguin
Cocktail Classics
2
Passion Fruit Syrup
Gauguin
Cocktail Classics
3
Lemon Juice
Gauguin
Cocktail Classics
4
Lime Juice
Fort Lauderdale
Cocktail Classics
1
Light Rum
where the data-type of Ingredient_Rank is an integer. Hint: Use the parse_number() function in mutate after your initial pivot.
# A tibble: 3,934 × 4
name category Ingredient_Rank Ingredient
<chr> <chr> <dbl> <chr>
1 Gauguin Cocktail Classics 1 Light Rum
2 Gauguin Cocktail Classics 2 Passion Fruit Syrup
3 Gauguin Cocktail Classics 3 Lemon Juice
4 Gauguin Cocktail Classics 4 Lime Juice
5 Fort Lauderdale Cocktail Classics 1 Light Rum
6 Fort Lauderdale Cocktail Classics 2 Sweet Vermouth
7 Fort Lauderdale Cocktail Classics 3 Juice of Orange
8 Fort Lauderdale Cocktail Classics 4 Juice of a Lime
9 Apple Pie Cordials and Liqueurs 1 Apple schnapps
10 Apple Pie Cordials and Liqueurs 2 Cinnamon schnapps
# ℹ 3,924 more rows
Problem 4: Some of the ingredients in the ingredient list have different names, but are nearly the same thing. An example include Lemon Juice versus Juice of a lemon. Make a list of the ingredients appearing in the ingredient list ranked by how commonly they occur along with the number of occurrences, and print the first 10 elements of the list here. Then check more ingredients (I suggest looking at more ingredients and even sorting them alphabetically using arrange(asc(ingredient))) and see if you can spot pairs of ingredients that are similar but have different names. Use if_else( click here for if_else ) in combination with mutate to make it so that the pairs of ingredients you found have the same name. You don’t have to find all pairs, but find at least 5 pairs of ingredients to rename. Because the purpose of this renaming is to facilitate a hypothetical future analysis, you can choose your own criteria for similarity as long as it is somewhat justifiable.
Notice that there are some ingredients that appear to be two or more ingredients strung together with commas. These would be candidates for more cleaning though this exercise doesn’t ask you to fix them.
Problem 3, Part 1:
ingredient_list <- mixeddrink_pivoted |>group_by(Ingredient) |>summarize(number_of_occurences =n() ) |>arrange(desc(number_of_occurences))head(ingredient_list, n =10)
# A tibble: 3,934 × 4
name category Ingredient_Rank Ingredient
<chr> <chr> <dbl> <chr>
1 Gauguin Cocktail Classics 1 Light Rum
2 Gauguin Cocktail Classics 2 Passion Fruit Syrup
3 Gauguin Cocktail Classics 3 Lemon Juice
4 Gauguin Cocktail Classics 4 Lime Juice
5 Fort Lauderdale Cocktail Classics 1 Light Rum
6 Fort Lauderdale Cocktail Classics 2 Sweet Vermouth
7 Fort Lauderdale Cocktail Classics 3 Orange Juice
8 Fort Lauderdale Cocktail Classics 4 Lime Juice
9 Apple Pie Cordials and Liqueurs 1 Apple schnapps
10 Apple Pie Cordials and Liqueurs 2 Cinnamon schnapps
# ℹ 3,924 more rows
Fun! Now we can see the information in another way.
Problem 5: Some operations are easier to do on wide data rather than tall data. Find the 10 most common pairs of ingredients occurring in the top 2 ingredients in a recipe. It is much easier to do this with a wide dataset, so use pivot_wider to change the data so that each row contains all of the ingredients of a single cocktail, just like in the format of the original data-set. Then use count on the 1 and 2 columns to determine the most common pairs (see chapter 3 for a refresher on count).
Note: You may be interested to read about the widyr package here: widyr page. It is designed to solve problems like this one and uses internal pivot steps to accomplish it so that the final result is tidy. I’m actually unaware of any easy ways of solving problem 5 without pivoting to a wide dataset.
# A tibble: 688 × 3
Ingredient1 Ingredient2 Pair_Count
<chr> <chr> <int>
1 Lemon Juice Powdered Sugar 24
2 Gin Dry Vermouth 23
3 Light Rum Lime Juice 14
4 Whole Egg Powdered Sugar 13
5 Gin Triple Sec 9
6 Bourbon whiskey Lemon Juice 8
7 Light Rum Sweet Vermouth 7
8 Gin Sweet Vermouth 7
9 Brandy Sweet Vermouth 7
10 Light Rum Pineapple Juice 7
# ℹ 678 more rows
We can see that the most common pairs of ingredients in the Cocktail list, after performing all the above operations, are Lemon Juice and Powdered Sugar, Gin and Dry Vermouth, and Light Rum and Lime Juice.