There is a well-known gender gap in certain college majors. Some fields have far more women graduates, while others are dominated by men. In this code-through tutorial, we will use R and the Tidyverse to explore and visualize these differences using a dataset of recent college graduates by major. The dataset (from FiveThirtyEight) includes the number of men and women graduating in each major, among other details.
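What will we do in this tutorial?
- Load the dataset and inspect its structure.
- Compute a female-to-male ratio for each major.
- Identify the most female-dominated and most male-dominated majors.
- Compare the proportion of women across major categories with a bar chart.
- Plot the number of women against the number of men for every major in a scatter plot.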
Throughout the tutorial, narrative explanations are provided before and after each code chunk to clarify what the code is doing and what the results mean. This will help you not only see the R code in action but also understand the insights from the output.
First, we need to load the necessary R packages and import the
dataset. We will use the tidyverse collection of
packages (which includes readr for reading data and
dplyr, ggplot2, tidyr for data
manipulation and visualization). The dataset of recent college graduates
by major is available as a CSV file on FiveThirtyEight’s GitHub
repository. We will read it directly from the URL.
# Load necessary packages
library(tidyverse) # Loads dplyr, ggplot2, tidyr, readr, etc.
# Read the dataset of recent college graduates
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv"
recent_grads <- read_csv(url)
# Take a quick look at the data structure
glimpse(recent_grads)
## Rows: 173
## Columns: 21
## $ Rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Major_code <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, 5001, 2…
## $ Major <chr> "PETROLEUM ENGINEERING", "MINING AND MINERAL ENGI…
## $ Total <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 1792, 91…
## $ Men <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 832, 803…
## $ Women <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, 10907, …
## $ Major_category <chr> "Engineering", "Engineering", "Engineering", "Eng…
## $ ShareWomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132, 0.341…
## $ Sample_size <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, 399, 14…
## $ Employed <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 1526, 764…
## $ Full_time <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 1085, 71…
## $ Part_time <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 13101, 1…
## $ Full_time_year_round <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 827, 5463…
## $ Unemployed <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, 3895, 2…
## $ Unemployment_rate <dbl> 0.018380527, 0.117241379, 0.024096386, 0.05012531…
## $ Median <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 62000,…
## $ P25th <dbl> 95000, 55000, 50000, 43000, 50000, 50000, 53000, …
## $ P75th <dbl> 125000, 90000, 105000, 80000, 75000, 102000, 7200…
## $ College_jobs <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 972, 5284…
## $ Non_college_jobs <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 16384, 1…
## $ Low_wage_jobs <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3170, 98…
# Display the first six rows of the dataset
head(recent_grads)
## # A tibble: 6 × 21
## Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 2419 PETR… 2339 2057 282 Engineering 0.121 36
## 2 2 2416 MINI… 756 679 77 Engineering 0.102 7
## 3 3 2415 META… 856 725 131 Engineering 0.153 3
## 4 4 2417 NAVA… 1258 1123 135 Engineering 0.107 16
## 5 5 2405 CHEM… 32260 21239 11021 Engineering 0.342 289
## 6 6 2418 NUCL… 2573 2200 373 Engineering 0.145 17
## # ℹ 12 more variables: Employed <dbl>, Full_time <dbl>, Part_time <dbl>,
## # Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## # Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## # Non_college_jobs <dbl>, Low_wage_jobs <dbl>
Explanation: We loaded the tidyverse library (which
conveniently loads sub-packages like ggplot2,
dplyr, readr, and tidyr in one
go). Then we used read_csv() to import the dataset from the
provided URL. The glimpse(recent_grads) command shows us
the structure of the dataframe, including the columns and data types, so
we can understand what information is available. Key columns include
Major, Major_category, Total
graduates, Men, Women, and others like median
salary. For this analysis, our focus will be on the number of men and
women in each major.
Using head(), we print a sample of the dataset. Each row
represents a college major. We can see columns such as the major name,
its category, the total number of graduates in that field, and how many
of those are men and women. This gives us an initial sense of the data:
some majors have a relatively balanced number of men and women, while
others show an obvious disparity at first glance.
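Before computing anything from Men and Women, it can help to check whether those columns contain missing values, since a later step filters them out. The following is a minimal sketch on a small stand-in table: toy_grads and its values are hypothetical and are not taken from the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for recent_grads, with one incomplete row
toy_grads <- tibble(
  Major = c("MAJOR A", "MAJOR B", "MAJOR C"),
  Men   = c(120, NA, 45),
  Women = c(80, NA, 300)
)

# Count the missing values in every column
na_counts <- toy_grads %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Running the same summarise() call on recent_grads would show, column by column, how many majors lack a value.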
To quantify the gender dominance in each major, we will create a new
column female_to_male_ratio defined as the number of women
divided by the number of men in that major. A ratio greater than 1 means
there are more women than men (female-dominated major), while a ratio
less than 1 means more men than women (male-dominated major).
# Add a new column for female-to-male ratio
recent_grads <- recent_grads %>%
  mutate(
    female_to_male_ratio = if_else(
      Men == 0,
      NA_real_,
      Women / Men
    )
  )
# Check the updated data to see the new column
head(recent_grads)
## # A tibble: 6 × 22
## Rank Major_code Major Total Men Women Major_category ShareWomen Sample_size
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 2419 PETR… 2339 2057 282 Engineering 0.121 36
## 2 2 2416 MINI… 756 679 77 Engineering 0.102 7
## 3 3 2415 META… 856 725 131 Engineering 0.153 3
## 4 4 2417 NAVA… 1258 1123 135 Engineering 0.107 16
## 5 5 2405 CHEM… 32260 21239 11021 Engineering 0.342 289
## 6 6 2418 NUCL… 2573 2200 373 Engineering 0.145 17
## # ℹ 13 more variables: Employed <dbl>, Full_time <dbl>, Part_time <dbl>,
## # Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## # Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## # Non_college_jobs <dbl>, Low_wage_jobs <dbl>, female_to_male_ratio <dbl>
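Explanation: We used mutate() from dplyr to
create female_to_male_ratio. For each row (major) this is
Women / Men, except that if_else() returns NA when Men is 0,
so the ratio never involves a division by zero. The new column is listed
among the additional variables at the bottom of the head() output. We
expect to see values greater than 1 for majors where women outnumber men,
and values less than 1 for the opposite.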
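To see how the if_else() guard behaves on edge cases, here is a small sketch using a hypothetical table (toy_majors and its counts are made up for illustration). It covers a female-dominated row, a male-dominated row, a row with zero men, and a row with missing counts.

```r
library(dplyr)

# Hypothetical counts chosen to exercise each case
toy_majors <- tibble(
  Major = c("A", "B", "C", "D"),
  Women = c(30, 10, 5, NA),
  Men   = c(10, 40, 0, NA)
)

toy_ratios <- toy_majors %>%
  mutate(
    # NA when Men is 0; NA also propagates when the counts are missing
    ratio = if_else(Men == 0, NA_real_, Women / Men)
  )

toy_ratios
```

Row A gets a ratio of 3 and row B a ratio of 0.25, while rows C and D both end up as NA: C because of the guard, D because a comparison with NA is itself NA.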
Now that we have the female-to-male ratio, let’s identify the extremes: the majors with the highest ratios (most female-dominated) and those with the lowest ratios (most male-dominated). This will give us a concrete idea of which specific majors have the largest gender imbalances.
# Top 5 majors with highest female-to-male ratios (female-dominated majors)
top5_female_dom <- recent_grads %>%
  arrange(desc(female_to_male_ratio)) %>%
  select(Major, Major_category, Women, Men, female_to_male_ratio) %>%
  mutate(female_to_male_ratio = round(female_to_male_ratio, 2)) %>%
  head(5)
top5_female_dom
## # A tibble: 5 × 5
## Major Major_category Women Men female_to_male_ratio
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 EARLY CHILDHOOD EDUCATION Education 36422 1167 31.2
## 2 COMMUNICATION DISORDERS SCIE… Health 37054 1225 30.2
## 3 MEDICAL ASSISTING SERVICES Health 10320 803 12.8
## 4 ELEMENTARY EDUCATION Education 157833 13029 12.1
## 5 FAMILY AND CONSUMER SCIENCES Industrial Ar… 52835 5166 10.2
Explanation: Here we sorted the recent_grads data in
descending order of the ratio and then took the first five entries. We
selected a few relevant columns for clarity: the major name, its
category, the number of women and men, and the ratio (rounded to two
decimal places for readability). The output shows the five majors with
the highest female-to-male ratios.
# Top 5 majors with lowest female-to-male ratios (male-dominated majors)
top5_male_dom <- recent_grads %>%
  arrange(female_to_male_ratio) %>%
  select(Major, Major_category, Women, Men, female_to_male_ratio) %>%
  mutate(female_to_male_ratio = round(female_to_male_ratio, 2)) %>%
  head(5)
top5_male_dom
## # A tibble: 5 × 5
## Major Major_category Women Men female_to_male_ratio
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 MILITARY TECHNOLOGIES Industrial Ar… 0 124 0
## 2 MECHANICAL ENGINEERING RELATE… Engineering 371 4419 0.08
## 3 CONSTRUCTION SERVICES Industrial Ar… 1678 16820 0.1
## 4 MINING AND MINERAL ENGINEERING Engineering 77 679 0.11
## 5 NAVAL ARCHITECTURE AND MARINE… Engineering 135 1123 0.12
Explanation: We performed a similar procedure but sorted in ascending order of the ratio to find the smallest values. The output lists the five majors with the lowest ratios of women to men.
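The arrange() plus head() pattern used for both lists can also be written with dplyr's slice_max() and slice_min(). This is an alternative sketch rather than the approach used in this tutorial, and toy_ratios with its values is hypothetical. Setting with_ties = FALSE keeps the result at exactly n rows.

```r
library(dplyr)

# Hypothetical ratios, including one missing value
toy_ratios <- tibble(
  Major = c("A", "B", "C", "D"),
  female_to_male_ratio = c(3.0, 0.25, NA, 12.5)
)

# Two highest ratios
top2 <- toy_ratios %>%
  slice_max(female_to_male_ratio, n = 2, with_ties = FALSE)

# Two lowest ratios
bottom2 <- toy_ratios %>%
  slice_min(female_to_male_ratio, n = 2, with_ties = FALSE)

top2
bottom2
```

In both calls the row with a missing ratio is left out, because there are enough non-missing values to fill the requested two rows. arrange() behaves similarly in that it places NA values last, whether or not desc() is used.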
By comparing these two lists, it is clear that different fields attract very different gender mixes. The most female-dominated majors are in education, health, and family and consumer sciences, while the most male-dominated majors are in engineering, construction, and military technologies.
Individual majors have their specific ratios, but it is also informative to aggregate by major category (grouping similar majors together). The dataset groups majors into categories like Engineering, Education, Humanities & Arts, and so on. We will calculate the overall proportion of graduates who are women in each major category and then create a bar chart to compare these categories.
# Summarize gender totals by major category
category_summary <- recent_grads %>%
  group_by(Major_category) %>%
  summarise(
    total_women = sum(Women, na.rm = TRUE),
    total_men = sum(Men, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    proportion_women = total_women / (total_women + total_men),
    majority_gender = if_else(proportion_women >= 0.5, "Female majority", "Male majority")
  )
category_summary
## # A tibble: 16 × 5
## Major_category total_women total_men proportion_women majority_gender
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Agriculture & Natural… 35263 40357 0.466 Male majority
## 2 Arts 222740 134390 0.624 Female majority
## 3 Biology & Life Science 268943 184919 0.593 Female majority
## 4 Business 634524 667852 0.487 Male majority
## 5 Communications & Jour… 260680 131921 0.664 Female majority
## 6 Computers & Mathemati… 90283 208725 0.302 Male majority
## 7 Education 455603 103526 0.815 Female majority
## 8 Engineering 129276 408307 0.240 Male majority
## 9 Health 387713 75517 0.837 Female majority
## 10 Humanities & Liberal … 440622 272846 0.618 Female majority
## 11 Industrial Arts & Con… 126011 103781 0.548 Female majority
## 12 Interdisciplinary 9479 2817 0.771 Female majority
## 13 Law & Public Policy 87978 91129 0.491 Male majority
## 14 Physical Sciences 90089 95390 0.486 Male majority
## 15 Psychology & Social W… 382892 98115 0.796 Female majority
## 16 Social Science 273132 256834 0.515 Female majority
Explanation: Using group_by() and
summarise(), we aggregated the data by
Major_category. For each category, we calculated the total
number of female graduates and total number of male graduates across all
majors in that category. Then we computed proportion_women
as the fraction of graduates in that category who are women. We also
added a majority_gender label.
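Note that proportion_women is a pooled share: it divides total women by total graduates in the category, so large majors carry more weight than small ones. It is not the same as averaging each major's own share of women. The sketch below illustrates the difference on a hypothetical category with one small and one large major; toy_category and its counts are made up.

```r
library(dplyr)

# Hypothetical category: a small, mostly female major and a large, mostly male one
toy_category <- tibble(
  Major_category = c("X", "X"),
  Women = c(90, 100),
  Men   = c(10, 900)
)

toy_summary <- toy_category %>%
  group_by(Major_category) %>%
  summarise(
    # Pooled share, computed the same way as proportion_women above
    pooled_share = sum(Women) / sum(Women + Men),
    # Unweighted average of each major's own share of women
    mean_major_share = mean(Women / (Women + Men)),
    .groups = "drop"
  )

toy_summary
```

Here the pooled share is 190 / 1100, about 0.17, while the unweighted mean of the two majors' shares is 0.5. Which one is appropriate depends on whether the question is about graduates or about majors.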
library(scales) # for percentage formatting of axis
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
ggplot(category_summary, aes(x = reorder(Major_category, proportion_women),
                             y = proportion_women,
                             fill = majority_gender)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Proportion of Women by Major Category",
    x = "Major Category",
    y = "Percent of Graduates Who Are Women",
    fill = "Majority Gender"
  ) +
  theme_minimal()
Explanation: In this ggplot, each bar represents a major
category. Because of coord_flip(), the bars run horizontally, and the
length of each bar indicates the percentage of graduates who are
women in that category. The categories are ordered by that percentage.
Bars are colored by majority_gender, so we can quickly see which
categories are female-majority and which are male-majority.
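One optional refinement is a reference line at 50%, which makes the parity threshold visible without relying on the fill colors alone. The following is a sketch on hypothetical data (toy_summary and its values are invented), not a change to the chart above.

```r
library(ggplot2)

# Hypothetical category-level proportions
toy_summary <- data.frame(
  Major_category = c("Category A", "Category B", "Category C"),
  proportion_women = c(0.24, 0.52, 0.81)
)

p <- ggplot(toy_summary, aes(x = reorder(Major_category, proportion_women),
                             y = proportion_women)) +
  geom_col() +
  # Dashed line at 50%; it appears vertical after coord_flip()
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  coord_flip() +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1),
                     limits = c(0, 1)) +
  labs(x = "Major Category", y = "Percent of Graduates Who Are Women") +
  theme_minimal()

p
```

Fixing the axis limits to 0 through 1 is a design choice that keeps the 50% line in the middle of the panel.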
Next, we will create a scatter plot to examine the distribution of
men and women across individual majors. This plot will have each major
as a point, with the x-axis representing the number of male graduates
and the y-axis representing the number of female graduates in that
major. We will also add a reference line where y = x (a
45-degree line). Points on this line would represent majors with equal
numbers of men and women.
Additionally, we will annotate two particular majors:
- Early Childhood Education: the most female-dominated major in the list above.
- Mechanical Engineering: a highly male-dominated major.
# Identify specific majors to label
# (Major names are stored in upper case in this dataset, so the match must be too)
labels_df <- recent_grads %>%
  filter(Major %in% c("EARLY CHILDHOOD EDUCATION", "MECHANICAL ENGINEERING"))
# Remove rows where Men or Women is missing, so we do not get warnings
recent_grads_clean <- recent_grads %>%
  filter(!is.na(Men), !is.na(Women))
# Scatter plot of number of women vs. men in each major
ggplot(recent_grads_clean, aes(x = Men, y = Women)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  geom_text(data = labels_df, aes(label = Major),
            vjust = -0.5, hjust = 0.5, size = 3.5) +
  labs(
    title = "Gender Balance in College Majors: Women vs. Men",
    x = "Number of Men Graduates (per Major)",
    y = "Number of Women Graduates (per Major)"
  ) +
  theme_minimal()
Explanation: Each point represents a single major, positioned according to its number of male graduates (x-axis) and female graduates (y-axis). The diagonal line is a reference for equal numbers of men and women. Points above the line represent majors with more women than men, and points below the line represent majors with more men than women. The labels for Early Childhood Education and Mechanical Engineering highlight majors that are extreme examples of this gender gap.
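One detail matters when selecting majors to label: the Major column stores names in upper case, as the glimpse() output shows, and %in% compares strings exactly. A filter written with title-case names therefore returns no rows, and geom_text() would silently draw no labels. The sketch below demonstrates this on a hypothetical table; toy_majors and its counts are invented.

```r
library(dplyr)

# Hypothetical rows that mimic the upper-case naming in the dataset
toy_majors <- tibble(
  Major = c("EARLY CHILDHOOD EDUCATION", "MECHANICAL ENGINEERING", "NURSING"),
  Men   = c(10, 90, 20),
  Women = c(300, 8, 180)
)

wanted <- c("Early Childhood Education", "Mechanical Engineering")

# Exact matching is case-sensitive, so title case finds nothing
exact_match <- toy_majors %>% filter(Major %in% wanted)

# Converting the wanted names to upper case finds both majors
upper_match <- toy_majors %>% filter(Major %in% toupper(wanted))

nrow(exact_match)
nrow(upper_match)
```

Checking nrow() on the label data frame before plotting is a quick way to catch this kind of mismatch.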
In this tutorial, we analyzed the gender composition of college majors and found striking differences across fields. Some majors are overwhelmingly female, while others are overwhelmingly male. Categories such as Health and Education have a high proportion of female graduates (above 80%), whereas Engineering and Computers & Mathematics are clearly male-majority (roughly 24% and 30% women). Several categories, including Business, Law & Public Policy, Physical Sciences, and Social Science, sit within a few percentage points of parity.
By visualizing the data, we gained a clearer understanding of how pronounced the gender divides are in various fields of study. This analysis provides a foundation for further exploration, such as investigating why certain majors attract more men or women, or examining how these gaps have changed over time.