Tidyverse Extend

Title: “Tidyverse Create”
Author: Lawrence Yu
Date: 2025-05-04
Output: openintro::lab_report

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

library(openintro)

Sample Vignette

This is an example of how to use tidyverse packages such as dplyr, forcats, and ggplot2. Our goal is to answer the question: does education level impact a person’s steak preference?

Start by loading the data. The resulting dataframe will require a bit of tidying before it can be used. The data is from a steak survey at https://github.com/fivethirtyeight/data/tree/master/steak-survey.

steak_url <- 'https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/tidyverse/steak-risk-survey.csv'
raw_steak <- read.csv(steak_url, header = TRUE, sep = ',')

Using Dplyr

Using rename from dplyr in tidyverse, we can rename the otherwise long and unwieldy column names. Filter, also from dplyr, keeps only the respondents who say they eat steak, because we are looking for their opinions. Subset, which comes from base R rather than dplyr, then removes the eat_steak column as it no longer provides meaningful information. The dplyr equivalent would be select(-eat_steak).

steak_df <- raw_steak %>% 
  rename(
    respondent_id = 1,
    lottery_risk = 2,
    smoke = 3,
    alcohol = 4,
    gamble = 5,
    skydiving = 6,
    drive_limit = 7,
    cheat = 8, 
    eat_steak = 9,
    steak_prepared = 10,
    gender = 11,
    age = 12,
    income = 13,
    education = 14,
    location = 15
  ) %>%
  filter(eat_steak == 'Yes') %>%
  subset(select = -eat_steak)
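The same three steps can be sketched on a small made-up data frame. The `toy` data and its column names below are hypothetical stand-ins, not the real survey, and `select()` is used in place of `subset()` only to show the dplyr form.

```r
library(dplyr)

# Hypothetical stand-in for the raw survey: long column names, one filter column.
toy <- data.frame(
  RespondentID = c(1, 2, 3),
  Do.you.eat.steak. = c("Yes", "No", "Yes"),
  How.do.you.like.your.steak.prepared. = c("Rare", "", "Medium"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  rename(respondent_id = 1, eat_steak = 2, steak_prepared = 3) %>%  # rename by column position
  filter(eat_steak == "Yes") %>%                                    # keep steak eaters only
  select(-eat_steak)                                                # drop the now-constant column

toy_clean
```

Renaming by position avoids typing the long original names, but it silently depends on the column order of the source file.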

Checking education levels, we can see that most respondents have at least some college background. Only one respondent reported less than a high school degree, and 22 left the question blank. There are also 5 preparation styles for steak in this dataset.

as.data.frame(table(steak_df$education))

##                               Var1 Freq
## 1                                    22
## 2                  Bachelor degree  132
## 3                  Graduate degree  109
## 4               High school degree   32
## 5     Less than high school degree    1
## 6 Some college or Associate degree  134

as.data.frame(table(steak_df$steak_prepared))

##          Var1 Freq
## 1      Medium  132
## 2 Medium rare  166
## 3 Medium Well   74
## 4        Rare   23
## 5        Well   35

Using Forcats

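We will use fct_lump_min from forcats to group the blank and less than high school degree responses together into a single "Other" level, as they represent a small share of the respondents. The threshold is the count for high school degree, so any level with fewer responses than that is lumped. Ggplot from ggplot2 within tidyverse can then be used to show how these counts compare.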

steak_df <- steak_df %>% 
  mutate(education = fct_lump_min(education, table(steak_df$education)['High school degree'])) 
steak_df %>%
  ggplot(aes(y = education)) +
  geom_bar()
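To see what fct_lump_min does in isolation, here is a minimal sketch on a made-up factor. The `edu` values and the threshold of 3 are assumptions chosen for illustration; the vignette itself uses the high school degree count as the threshold.

```r
library(forcats)

# Hypothetical education responses: two common levels, one rare level, one blank.
edu <- factor(c(rep("Bachelor", 5), rep("Graduate", 3), "Less than HS", ""))

# Levels that appear fewer than `min` times are collapsed into "Other".
lumped <- fct_lump_min(edu, min = 3)

table(lumped)
```

A level that appears exactly `min` times is kept, which is why passing the high school degree count as the threshold leaves that level intact.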

We can then check to see which education and steak style combinations are most common.

steak_df %>% count(steak_prepared) %>% arrange(desc(n))

##   steak_prepared   n
## 1    Medium rare 166
## 2         Medium 132
## 3    Medium Well  74
## 4           Well  35
## 5           Rare  23

steak_df %>% 
  group_by(education) %>% 
  count(steak_prepared) %>% 
  arrange(desc(n))

## # A tibble: 25 × 3
## # Groups:   education [5]
##    education                        steak_prepared     n
##    <fct>                            <chr>          <int>
##  1 Some college or Associate degree Medium rare       51
##  2 Bachelor degree                  Medium rare       47
##  3 Graduate degree                  Medium rare       46
##  4 Bachelor degree                  Medium            45
##  5 Some college or Associate degree Medium            39
##  6 Graduate degree                  Medium            33
##  7 Some college or Associate degree Medium Well       26
##  8 Bachelor degree                  Medium Well       21
##  9 Graduate degree                  Medium Well       18
## 10 High school degree               Medium rare       14
## # ℹ 15 more rows

Medium rare and medium were the most preferred steak preparation styles. Counting each combination shows that medium rare was the most common choice among respondents with some college or an associate degree, a bachelor degree, or a graduate degree. At first glance, this suggests that people who go to college prefer medium rare the most.
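The grouped count used above can also be written as a single count call over both columns. The sketch below uses a made-up `toy` tibble, not the survey data, to show that the two forms give the same counts.

```r
library(dplyr)

# Hypothetical mini-survey for illustration only.
toy <- tibble(
  education      = c("HS", "HS", "BA", "BA", "BA", "Grad"),
  steak_prepared = c("Rare", "Medium", "Medium", "Medium", "Rare", "Medium")
)

# Grouped form, as in the vignette.
grouped <- toy %>%
  group_by(education) %>%
  count(steak_prepared) %>%
  arrange(desc(n))

# Ungrouped form: count both columns and sort in one call.
flat <- toy %>%
  count(education, steak_prepared, sort = TRUE)

flat
```

The ungrouped form returns a plain tibble, so later steps do not need to account for leftover grouping.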

Using Ggplot

This result can be better understood with visualizations from ggplot.

steak_df %>% 
  group_by(education) %>% 
  filter(steak_prepared == 'Medium rare') %>%
  ggplot(aes(y = education)) +
  stat_count() +
  labs(title = 'Medium rare steak choosers by education level')

The counts for medium rare look similar to the education distribution of the whole sample. We should check that the high number of medium rare choosers who have gone to college is not simply a result of the population distribution.

medium_rare_count <- steak_df %>% 
  group_by(education) %>% 
  filter(steak_prepared == 'Medium rare') %>% 
  count() %>%
  group_by(education) %>% 
  summarise(total = sum(n))

education_count <- steak_df %>% 
  count(education) %>%
  group_by(education) %>% 
  summarise(total = sum(n))
  
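# Join on education rather than relying on both tables having the same row order
education_count <- education_count %>%
  left_join(rename(medium_rare_count, mr_count = total), by = 'education')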

education_count <- education_count %>%
  mutate(percentage = mr_count / total) %>%
  arrange(desc(percentage))

education_count

## # A tibble: 5 × 4
##   education                        total mr_count percentage
##   <fct>                            <int>    <int>      <dbl>
## 1 High school degree                  32       14      0.438
## 2 Graduate degree                    109       46      0.422
## 3 Some college or Associate degree   134       51      0.381
## 4 Bachelor degree                    132       47      0.356
## 5 Other                               23        8      0.348

education_count %>%
  ggplot(aes(x = percentage, y = education)) +
  geom_col() + 
  labs(title = 'Medium rare steak choosers by education level percentage')

These last results are much closer, but they also paint a different picture. Among the respondents in this data set, high school graduates were the most likely to select medium rare as their favorite way of having steak, at about 44%. The other groups fell between roughly 35% and 42%.
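As a quick arithmetic check, the percentages can be recomputed directly from the counts printed in the table above and compared with the overall medium rare rate. The `shares` and `overall` names below are introduced here for illustration only.

```r
library(dplyr)

# Counts copied from the printed education_count table.
shares <- tibble(
  education = c("High school degree", "Graduate degree",
                "Some college or Associate degree", "Bachelor degree", "Other"),
  total    = c(32, 109, 134, 132, 23),
  mr_count = c(14, 46, 51, 47, 8)
) %>%
  mutate(percentage = mr_count / total)

# Overall rate: all medium rare choosers over all respondents (166 of 430).
overall <- sum(shares$mr_count) / sum(shares$total)

shares %>%
  mutate(gap_from_overall = percentage - overall)
```

The overall rate is about 38.6%, so each group sits within a few percentage points of it. With only 32 high school graduates and 23 respondents in the Other group, the ranking at the top and bottom rests on small samples.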

Conclusions

Multiple tidyverse packages often fit together in a single workflow. Here we tidied the original steak data with dplyr, regrouped the education values with forcats, and plotted medium rare choosers by education level with ggplot2.

We hoped to answer whether education level impacts a person’s steak preference. Initially it appeared that medium rare steaks were more popular with those with college experience or a college degree. However, that was because the sample population was skewed toward college. When we accounted for this skew by comparing percentages within each education group, the groups chose medium rare at broadly similar rates.

Extension: Gender Differences in Steak Preference

By Tyler Graham

To build on Lawrence Yu’s excellent (and very amusing) analysis, we now explore whether gender also influences steak preparation preference.

# Remove rows with missing or empty gender
steak_df <- steak_df %>%
  filter(gender != "" & !is.na(gender))

# Grouping by gender and steak_prepared to count combinations
steak_df %>%
  group_by(gender, steak_prepared) %>%
  summarise(count = n(), .groups = "drop") %>%
  ggplot(aes(x = steak_prepared, y = count, fill = gender)) +
  geom_col(position = "dodge") +
  labs(
    title = "Steak Preparation Preference by Gender",
    x = "Steak Preparation Style",
    y = "Count",
    fill = "Gender"
  ) +
  theme_minimal()

This bar chart lets us visually compare steak preferences across gender groups.

# Total number of respondents by gender
gender_counts <- steak_df %>%
  count(gender)

# Number of medium rare steak choosers by gender
medium_rare_by_gender <- steak_df %>%
  filter(steak_prepared == "Medium rare") %>%
  count(gender) %>%
  rename(mr_count = n)

# Join and calculate percentage
gender_percentages <- left_join(gender_counts, medium_rare_by_gender, by = "gender") %>%
  mutate(
    mr_count = replace_na(mr_count, 0),
    percentage = mr_count / n
  )

# Plot the percentage of each gender that prefers medium rare
gender_percentages %>%
  ggplot(aes(x = gender, y = percentage, fill = gender)) +
  geom_col() +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),
            vjust = -0.5) +
  labs(
    title = "Medium Rare Steak Preference by Gender (Percentage)",
    x = "Gender",
    y = "Percent Choosing Medium Rare"
  ) +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()

As shown, both male and female respondents preferred medium rare at similar rates, around 40% of each group. This suggests that steak preference may not vary much by gender, at least within this dataset.
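The replace_na step in the join above matters whenever a group has no medium rare responses at all, because left_join then leaves that group's count as NA. A minimal sketch on made-up data (the `toy` tibble and its "Other" group are hypothetical):

```r
library(dplyr)
library(tidyr)  # replace_na()

# Hypothetical responses; the "Other" group has no medium rare answers.
toy <- tibble(
  gender         = c("Male", "Male", "Female", "Female", "Other"),
  steak_prepared = c("Medium rare", "Rare", "Medium rare", "Medium rare", "Well")
)

totals <- count(toy, gender)

mr <- toy %>%
  filter(steak_prepared == "Medium rare") %>%
  count(gender, name = "mr_count")

# "Other" is missing from `mr`, so its mr_count is NA until it is replaced with 0.
shares <- left_join(totals, mr, by = "gender") %>%
  mutate(mr_count = replace_na(mr_count, 0L),
         percentage = mr_count / n)

shares
```

Without the replacement, the percentage for that group would be NA and the bar would be dropped from the plot rather than shown as zero.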

Conclusion

This extension shows how tidyverse tools can be used to layer additional demographic factors, such as gender, into a tidy data analysis. While overall counts tell part of the story, normalizing by group size is essential for comparing preferences across groups of different sizes.