---
title: "Tidyverse Create"
author: "Lawrence Yu"
date: "2025-05-04"
output: openintro::lab_report
---
library(tidyverse)
library(openintro)
This is an example of how to use tidyverse packages such as dplyr, forcats, and ggplot2. Our goal is to answer the question: does education level impact a person’s steak preference?
Start by loading the data. The resulting dataframe will require a bit of tidying before it can be used. The data is from a steak survey at https://github.com/fivethirtyeight/data/tree/master/steak-survey.
steak_url <- 'https://raw.githubusercontent.com/Megabuster/Data607/refs/heads/main/data/tidyverse/steak-risk-survey.csv'
raw_steak <- read.csv(steak_url, header = TRUE, sep = ',')
Using rename() from dplyr in tidyverse, we can replace the otherwise long and unwieldy column names. filter(), also from dplyr, keeps only the respondents who say they eat steak, because theirs are the opinions we are looking for. subset(), which comes from base R rather than dplyr (dplyr's select() would work equally well), then removes the eat_steak column, as it no longer provides meaningful information.
steak_df <- raw_steak %>%
rename(
respondent_id = 1,
lottery_risk = 2,
smoke = 3,
alcohol = 4,
gamble = 5,
skydiving = 6,
drive_limit = 7,
cheat = 8,
eat_steak = 9,
steak_prepared = 10,
gender = 11,
age = 12,
income = 13,
education = 14,
location = 15
) %>%
filter(eat_steak == 'Yes') %>%
subset(select = -eat_steak)
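The rename, filter, and drop-column steps can be tried on a small made-up data frame first. This is only a sketch: the toy column names and values below are invented for illustration and are not the survey's real headers, and select(-eat_steak) stands in for the base R subset() call.

```r
library(dplyr)

# Toy data standing in for the raw survey; names and values are hypothetical.
toy_raw <- data.frame(
  RespondentID = 1:4,
  Do.you.eat.steak. = c("Yes", "No", "Yes", ""),
  How.do.you.like.your.steak.prepared. = c("Rare", "", "Medium", ""),
  stringsAsFactors = FALSE
)

toy_df <- toy_raw %>%
  rename(respondent_id = 1, eat_steak = 2, steak_prepared = 3) %>% # rename by column position
  filter(eat_steak == "Yes") %>%                                   # keep steak eaters only
  select(-eat_steak)                                               # drop the now-constant column

toy_df
```

Renaming by position is compact, but it assumes the column order of the CSV never changes, so it may be worth checking names(raw_steak) before relying on it.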
Checking education levels, we can see that most respondents have at least some college background. Only 1 person reported less than a high school degree, and 22 left the question blank. There are also 5 preparation styles for steak in this dataset.
as.data.frame(table(steak_df$education))
##                               Var1 Freq
## 1                                    22
## 2                  Bachelor degree  132
## 3                  Graduate degree  109
## 4               High school degree   32
## 5     Less than high school degree    1
## 6 Some college or Associate degree  134
as.data.frame(table(steak_df$steak_prepared))
## Var1 Freq
## 1 Medium 132
## 2 Medium rare 166
## 3 Medium Well 74
## 4 Rare 23
## 5 Well 35
We will use fct_lump_min() from forcats to group the blank (unknown) and less than high school degree respondents into a single "Other" level, as together they represent a small share of the respondents. ggplot() from ggplot2 within tidyverse can then be used to show how these counts compare.
steak_df <- steak_df %>%
mutate(education = fct_lump_min(education, table(steak_df$education)['High school degree']))
steak_df %>%
ggplot(aes(y = education)) +
geom_bar()
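To make the lumping rule concrete, here is a minimal sketch on a made-up factor. fct_lump_min() collapses every level that appears fewer than min times into "Other". In the analysis above the threshold is the High school degree count (32), so the blank level (22) and the less than high school level (1) fall below it.

```r
library(forcats)

# Toy factor: "a" appears 5 times, "b" 3 times, "c" and "d" once each.
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Levels with fewer than `min` observations are collapsed into "Other".
lumped <- fct_lump_min(f, min = 3)

table(lumped)
```

Here "c" and "d" end up together in "Other", while "a" and "b" are kept because they meet the threshold.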
We can then check to see which education and steak style combinations are most common.
steak_df %>% count(steak_prepared) %>% arrange(desc(n))
## steak_prepared n
## 1 Medium rare 166
## 2 Medium 132
## 3 Medium Well 74
## 4 Well 35
## 5 Rare 23
steak_df %>%
group_by(education) %>%
count(steak_prepared) %>%
arrange(desc(n))
## # A tibble: 25 × 3
## # Groups: education [5]
## education steak_prepared n
## <fct> <chr> <int>
## 1 Some college or Associate degree Medium rare 51
## 2 Bachelor degree Medium rare 47
## 3 Graduate degree Medium rare 46
## 4 Bachelor degree Medium 45
## 5 Some college or Associate degree Medium 39
## 6 Graduate degree Medium 33
## 7 Some college or Associate degree Medium Well 26
## 8 Bachelor degree Medium Well 21
## 9 Graduate degree Medium Well 18
## 10 High school degree Medium rare 14
## # ℹ 15 more rows
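The grouped count above can also be written as a single count() call with two columns. The sketch below compares both forms on made-up data; the toy values are assumptions for illustration only.

```r
library(dplyr)

toy <- tibble(
  education = c("HS", "HS", "BA", "BA", "BA", "BA"),
  steak_prepared = c("Rare", "Medium", "Medium", "Medium", "Rare", "Medium")
)

# Grouped count followed by an explicit sort, as in the analysis above
long_form <- toy %>%
  group_by(education) %>%
  count(steak_prepared) %>%
  arrange(desc(n)) %>%
  ungroup()

# The same counts in one call; sort = TRUE orders by descending n
short_form <- toy %>%
  count(education, steak_prepared, sort = TRUE)

short_form
```

The one-call form also returns an ungrouped result, which avoids carrying the grouping into later steps by accident.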
Medium rare and medium were the most preferred steak preparation styles. This was apparent when counting out each combination as medium rare was the most common order for some college/associate degree, bachelor degree, and graduate degree. This implies that people who go to college prefer medium rare the most.
This result can be better understood with visualizations from ggplot.
steak_df %>%
group_by(education) %>%
filter(steak_prepared == 'Medium rare') %>%
ggplot(aes(y = education)) +
stat_count() +
labs(title = 'Medium rare steak choosers by education level')
The counts for medium rare look similar to the education distribution of the whole sample. We should check that the large number of medium rare choosers who have gone to college is not simply a reflection of how many respondents went to college.
# Count all respondents and medium rare choosers per education level in one
# pass, so the two counts cannot become misaligned between rows.
education_count <- steak_df %>%
  group_by(education) %>%
  summarise(
    total = n(),
    mr_count = sum(steak_prepared == 'Medium rare')
  ) %>%
  mutate(percentage = mr_count / total) %>%
  arrange(desc(percentage))
education_count
## # A tibble: 5 × 4
## education total mr_count percentage
## <fct> <int> <int> <dbl>
## 1 High school degree 32 14 0.438
## 2 Graduate degree 109 46 0.422
## 3 Some college or Associate degree 134 51 0.381
## 4 Bachelor degree 132 47 0.356
## 5 Other 23 8 0.348
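One thing to watch when computing group shares like these: a group with no medium rare respondents disappears entirely from a filter-then-count result. The sketch below uses made-up data (groups "A", "B", and "C" are hypothetical) to show the effect, and one way to keep every group by counting inside summarise().

```r
library(dplyr)

# Toy data: nobody in group "C" chose medium rare.
toy <- tibble(
  education = c("A", "A", "B", "B", "B", "C"),
  steak_prepared = c("Medium rare", "Rare", "Medium rare",
                     "Medium rare", "Well", "Well")
)

# filter() + count() drops group "C", so this has 2 rows, not 3.
mr_only <- toy %>%
  filter(steak_prepared == "Medium rare") %>%
  count(education, name = "mr_count")

# Counting inside summarise() keeps every group, including zero counts.
shares <- toy %>%
  group_by(education) %>%
  summarise(
    total = n(),
    mr_count = sum(steak_prepared == "Medium rare"),
    percentage = mr_count / total
  )

shares
```

In the steak data every education level has at least one medium rare respondent, so the issue does not arise there, but the single-pass form is safer if the data changes.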
education_count %>%
ggplot(aes(x = percentage, y = education)) +
geom_col() +
labs(title = 'Medium rare steak choosers by education level percentage')
These normalized results are much closer together, and they paint a different picture. Among the respondents in this data set, high school graduates were the most likely to select medium rare as their favorite way of having steak (about 44%). Every group fell between roughly 35% and 44% in favor of medium rare.
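As a quick arithmetic check, the shares can be recomputed directly from the counts printed in the table above. This sketch simply copies those counts into a small tibble; nothing here is new data.

```r
library(dplyr)

# Counts copied from the printed table (total respondents and medium rare choosers).
printed <- tibble(
  education = c("High school degree", "Graduate degree",
                "Some college or Associate degree", "Bachelor degree", "Other"),
  total    = c(32, 109, 134, 132, 23),
  mr_count = c(14, 46, 51, 47, 8)
)

printed <- printed %>%
  mutate(percentage = mr_count / total)

# Smallest and largest share across the five groups
range(printed$percentage)
```

The smallest share is 8/23 (Other) and the largest is 14/32 (High school degree). The Other and High school groups are small, so their percentages are the most sensitive to a few responses.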
It is common to combine multiple tidyverse packages within a single workflow. Here we tidied the original steak data with dplyr, reorganized the education values with forcats, and plotted education levels for medium rare steak choosers with ggplot2.
We hoped to answer if education level impacted a person’s steak preference. Initially it appeared that medium rare steaks were more popular with those with college experience or a college degree. However, that was because the sample population was skewed toward college. When we accounted for this sk
By Tyler Graham
To build on Lawrence Yu’s excellent (and very amusing) analysis, we now explore whether gender also influences steak preparation preference…
# Remove rows with missing or empty gender
steak_df <- steak_df %>%
filter(gender != "" & !is.na(gender))
# Grouping by gender and steak_prepared to count combinations
steak_df %>%
group_by(gender, steak_prepared) %>%
summarise(count = n(), .groups = "drop") %>%
ggplot(aes(x = steak_prepared, y = count, fill = gender)) +
geom_col(position = "dodge") +
labs(
title = "Steak Preparation Preference by Gender",
x = "Steak Preparation Style",
y = "Count",
fill = "Gender"
) +
theme_minimal()
This bar chart lets us visually compare steak preferences across gender
groups.
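A note on the gender filter used before this chart: dplyr's filter() keeps only rows where the condition evaluates to TRUE, so rows with a missing gender are already dropped by the `gender != ""` comparison. The explicit `!is.na(gender)` mainly documents the intent. A minimal sketch on made-up rows:

```r
library(dplyr)

toy <- tibble(
  gender = c("Male", "Female", "", NA),
  steak_prepared = c("Rare", "Medium", "Well", "Medium")
)

# NA != "" evaluates to NA, and filter() drops rows whose condition is NA.
implicit <- toy %>% filter(gender != "")

# Same result, with the missing-value handling spelled out.
explicit <- toy %>% filter(gender != "" & !is.na(gender))

explicit
```

Both versions keep only the "Male" and "Female" rows in this toy example.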
# Total number of respondents by gender
gender_counts <- steak_df %>%
count(gender)
# Number of medium rare steak choosers by gender
medium_rare_by_gender <- steak_df %>%
filter(steak_prepared == "Medium rare") %>%
count(gender) %>%
rename(mr_count = n)
# Join and calculate percentage
gender_percentages <- left_join(gender_counts, medium_rare_by_gender, by = "gender") %>%
mutate(
mr_count = replace_na(mr_count, 0),
percentage = mr_count / n
)
# Plot the percentage of each gender that prefers medium rare
gender_percentages %>%
ggplot(aes(x = gender, y = percentage, fill = gender)) +
geom_col() +
geom_text(aes(label = scales::percent(percentage, accuracy = 1)),
vjust = -0.5) +
labs(
title = "Medium Rare Steak Preference by Gender (Percentage)",
x = "Gender",
y = "Percent Choosing Medium Rare"
) +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
As shown, both male and female respondents preferred medium rare at
similar rates—around 40% of each group. This suggests that steak
preference may not vary significantly by gender, at least within this
dataset.
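The join-then-divide pattern used for the gender percentages can be sketched on made-up totals. The numbers below are hypothetical and chosen so that one group has no medium rare rows, which is the case replace_na() guards against.

```r
library(dplyr)
library(tidyr)

# Hypothetical totals per group
totals <- tibble(gender = c("Female", "Male"), n = c(4, 5))

# No medium rare respondents recorded for "Male" in this toy example.
mr <- tibble(gender = "Female", mr_count = 2)

shares <- totals %>%
  left_join(mr, by = "gender") %>%        # unmatched groups get NA
  mutate(
    mr_count = replace_na(mr_count, 0),   # treat "no rows" as a count of zero
    percentage = mr_count / n
  )

shares
```

Using left_join() from the totals table keeps every group in the result, so a group with zero medium rare choosers shows up as 0% rather than vanishing from the plot.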
This extension shows how tidyverse tools can be used to layer in additional demographic factors—like gender—into a tidy data analysis. While overall counts tell part of the story, normalizing by group size is essential to uncover true preferences.