Introduction

There is a well-known gender gap in certain college majors. Some fields have far more women graduates, while others are dominated by men. In this code-through tutorial, we will use R and the Tidyverse to explore and visualize these differences using a dataset of recent college graduates by major. The dataset (from FiveThirtyEight) includes the number of men and women graduating in each major, among other details.

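What will we do in this tutorial?

- Load the dataset and inspect its structure.
- Add a female-to-male ratio for each major.
- Identify the five most female-dominated and the five most male-dominated majors.
- Compute the proportion of women in each major category and plot it as a bar chart.
- Draw a scatter plot of women versus men for every major.

Each code chunk is preceded and followed by a short explanation of what the code does and what the results mean. This will help you not only see the R code in action but also understand the insights from the output.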

Loading Packages and Data

First, we need to load the necessary R packages and import the dataset. We will use the tidyverse collection of packages, which includes readr for reading data, dplyr and tidyr for data manipulation, and ggplot2 for visualization. The dataset of recent college graduates by major is available as a CSV file in FiveThirtyEight’s GitHub repository, and we will read it directly from the URL.

# Load necessary packages
library(tidyverse)  # Loads dplyr, ggplot2, tidyr, readr, etc.

# Read the dataset of recent college graduates
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv"
recent_grads <- read_csv(url)

# Take a quick look at the data structure
glimpse(recent_grads)
## Rows: 173
## Columns: 21
## $ Rank                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Major_code           <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, 5001, 2…
## $ Major                <chr> "PETROLEUM ENGINEERING", "MINING AND MINERAL ENGI…
## $ Total                <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 1792, 91…
## $ Men                  <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 832, 803…
## $ Women                <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, 10907, …
## $ Major_category       <chr> "Engineering", "Engineering", "Engineering", "Eng…
## $ ShareWomen           <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132, 0.341…
## $ Sample_size          <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, 399, 14…
## $ Employed             <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 1526, 764…
## $ Full_time            <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 1085, 71…
## $ Part_time            <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 13101, 1…
## $ Full_time_year_round <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 827, 5463…
## $ Unemployed           <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, 3895, 2…
## $ Unemployment_rate    <dbl> 0.018380527, 0.117241379, 0.024096386, 0.05012531…
## $ Median               <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 62000,…
## $ P25th                <dbl> 95000, 55000, 50000, 43000, 50000, 50000, 53000, …
## $ P75th                <dbl> 125000, 90000, 105000, 80000, 75000, 102000, 7200…
## $ College_jobs         <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 972, 5284…
## $ Non_college_jobs     <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 16384, 1…
## $ Low_wage_jobs        <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3170, 98…
# Display the first six rows of the dataset
head(recent_grads)
## # A tibble: 6 × 21
##    Rank Major_code Major Total   Men Women Major_category ShareWomen Sample_size
##   <dbl>      <dbl> <chr> <dbl> <dbl> <dbl> <chr>               <dbl>       <dbl>
## 1     1       2419 PETR…  2339  2057   282 Engineering         0.121          36
## 2     2       2416 MINI…   756   679    77 Engineering         0.102           7
## 3     3       2415 META…   856   725   131 Engineering         0.153           3
## 4     4       2417 NAVA…  1258  1123   135 Engineering         0.107          16
## 5     5       2405 CHEM… 32260 21239 11021 Engineering         0.342         289
## 6     6       2418 NUCL…  2573  2200   373 Engineering         0.145          17
## # ℹ 12 more variables: Employed <dbl>, Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>

Explanation: We loaded the tidyverse library (which conveniently loads ggplot2, dplyr, readr, and tidyr in one go). Then we used read_csv() to import the dataset from the URL. The glimpse(recent_grads) command shows the structure of the data frame: 173 rows (one per major) and 21 columns, all numeric except Major and Major_category. Key columns include Major, Major_category, Total (total graduates), Men, Women, and earnings measures such as Median. For this analysis, our focus will be on the number of men and women in each major.

Using head(), we print the first six rows. Each row represents a college major, with columns such as the major name, its category, the total number of graduates in that field, and how many of those are men and women. The rows are ordered by Rank, which follows the Median earnings column, so these six are all high-earning Engineering majors. In every one of them men clearly outnumber women: ShareWomen ranges from about 0.10 to 0.34.
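As a quick sanity check on these columns, the sketch below rebuilds the first four rows by hand from the printed output and confirms that Men and Women add up to Total and that ShareWomen is Women divided by Total. The object names sample_rows and checked are only for illustration; the numbers are copied from the output above.

```r
library(dplyr)

# First four rows, typed in from the glimpse()/head() output (illustrative copy)
sample_rows <- tibble(
  Rank       = 1:4,
  Total      = c(2339, 756, 856, 1258),
  Men        = c(2057, 679, 725, 1123),
  Women      = c(282, 77, 131, 135),
  ShareWomen = c(0.1205643, 0.1018519, 0.1530374, 0.1073132)
)

checked <- sample_rows %>%
  mutate(
    counts_add_up = (Men + Women) == Total,            # do the parts sum to the whole?
    share_check   = Women / Total,                     # recompute the share of women
    share_matches = abs(share_check - ShareWomen) < 1e-6
  )

checked
```

Running the same mutate() on the full recent_grads tibble would be one way to spot rows where the counts are missing or inconsistent before building on them.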

Adding a Female-to-Male Ratio

To quantify the gender dominance in each major, we will create a new column female_to_male_ratio defined as the number of women divided by the number of men in that major. A ratio greater than 1 means there are more women than men (female-dominated major), while a ratio less than 1 means more men than women (male-dominated major).

# Add a new column for female-to-male ratio
recent_grads <- recent_grads %>%
  mutate(
    female_to_male_ratio = if_else(
      Men == 0,
      NA_real_,  # no men recorded: avoid dividing by zero
      Women / Men
    )
  )

# Check the updated data to see the new column
head(recent_grads)
## # A tibble: 6 × 22
##    Rank Major_code Major Total   Men Women Major_category ShareWomen Sample_size
##   <dbl>      <dbl> <chr> <dbl> <dbl> <dbl> <chr>               <dbl>       <dbl>
## 1     1       2419 PETR…  2339  2057   282 Engineering         0.121          36
## 2     2       2416 MINI…   756   679    77 Engineering         0.102           7
## 3     3       2415 META…   856   725   131 Engineering         0.153           3
## 4     4       2417 NAVA…  1258  1123   135 Engineering         0.107          16
## 5     5       2405 CHEM… 32260 21239 11021 Engineering         0.342         289
## 6     6       2418 NUCL…  2573  2200   373 Engineering         0.145          17
## # ℹ 13 more variables: Employed <dbl>, Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>, female_to_male_ratio <dbl>

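Explanation: We used mutate() from dplyr to create female_to_male_ratio. For each row (major), this is Women / Men, except that if_else() returns NA when Men is 0, which avoids dividing by zero. The new column is appended at the end of the tibble, so head() lists it among the variables that do not fit on screen (now 13 instead of 12) rather than printing its values. We expect to see values greater than 1 for majors where women outnumber men, and values less than 1 for the opposite.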
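To see why the if_else() guard in the chunk above is useful, here is a minimal sketch on a made-up three-row tibble (the majors "A", "B", and "C" and their counts are hypothetical). It compares the plain division with the guarded version.

```r
library(dplyr)

# Hypothetical counts, chosen to include zero denominators
toy <- tibble(
  Major = c("A", "B", "C"),
  Women = c(30, 12, 0),
  Men   = c(10, 0, 0)
)

toy_ratios <- toy %>%
  mutate(
    raw_ratio     = Women / Men,                              # plain division
    guarded_ratio = if_else(Men == 0, NA_real_, Women / Men)  # NA when Men is 0
  )

toy_ratios
```

In R, dividing a positive number by zero gives Inf and 0 / 0 gives NaN, so the unguarded column mixes ordinary ratios with special values. The guarded column reports NA in both cases, which is easier to filter out or to skip with na.rm = TRUE later on.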

Exploring Gender Ratios in Majors

Now that we have the female-to-male ratio, let’s identify the extremes: the majors with the highest ratios (most female-dominated) and those with the lowest ratios (most male-dominated). This will give us a concrete idea of which specific majors have the largest gender imbalances.

Top 5 Female-Dominated Majors

# Top 5 majors with highest female-to-male ratios (female-dominated majors)
top5_female_dom <- recent_grads %>%
  arrange(desc(female_to_male_ratio)) %>%
  select(Major, Major_category, Women, Men, female_to_male_ratio) %>%
  mutate(female_to_male_ratio = round(female_to_male_ratio, 2)) %>%
  head(5)

top5_female_dom
## # A tibble: 5 × 5
##   Major                         Major_category  Women   Men female_to_male_ratio
##   <chr>                         <chr>           <dbl> <dbl>                <dbl>
## 1 EARLY CHILDHOOD EDUCATION     Education       36422  1167                 31.2
## 2 COMMUNICATION DISORDERS SCIE… Health          37054  1225                 30.2
## 3 MEDICAL ASSISTING SERVICES    Health          10320   803                 12.8
## 4 ELEMENTARY EDUCATION          Education      157833 13029                 12.1
## 5 FAMILY AND CONSUMER SCIENCES  Industrial Ar…  52835  5166                 10.2

Explanation: Here we sorted the recent_grads data in descending order of the ratio and then took the first five entries. We selected a few relevant columns for clarity: the major name, its category, the number of women and men, and the ratio (rounded to two decimal places for readability). The output shows the five majors with the highest female-to-male ratios. Early Childhood Education tops the list with about 31 women for every man, and even the fifth entry, Family and Consumer Sciences, has about 10.
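A ratio can be harder to read than a percentage, so it may help to convert between the two. If r is the female-to-male ratio, the share of women is r / (1 + r). The sketch below checks this on four rows copied from the table above (the row with a truncated major name is left out); the object names are only illustrative.

```r
library(dplyr)

# Counts copied from the top-5 table above
top5_counts <- tibble(
  Major = c("EARLY CHILDHOOD EDUCATION", "MEDICAL ASSISTING SERVICES",
            "ELEMENTARY EDUCATION", "FAMILY AND CONSUMER SCIENCES"),
  Women = c(36422, 10320, 157833, 52835),
  Men   = c(1167, 803, 13029, 5166)
)

top5_shares <- top5_counts %>%
  mutate(
    ratio             = Women / Men,
    share_from_counts = Women / (Women + Men),  # direct definition
    share_from_ratio  = ratio / (1 + ratio)     # same quantity, derived from the ratio
  )

top5_shares
```

Both share columns agree, and they show that a ratio of about 31 corresponds to roughly 97% women, while a ratio of about 10 corresponds to roughly 91%.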

Top 5 Male-Dominated Majors

# Top 5 majors with lowest female-to-male ratios (male-dominated majors)
top5_male_dom <- recent_grads %>%
  arrange(female_to_male_ratio) %>%
  select(Major, Major_category, Women, Men, female_to_male_ratio) %>%
  mutate(female_to_male_ratio = round(female_to_male_ratio, 2)) %>%
  head(5)

top5_male_dom
## # A tibble: 5 × 5
##   Major                          Major_category Women   Men female_to_male_ratio
##   <chr>                          <chr>          <dbl> <dbl>                <dbl>
## 1 MILITARY TECHNOLOGIES          Industrial Ar…     0   124                 0   
## 2 MECHANICAL ENGINEERING RELATE… Engineering      371  4419                 0.08
## 3 CONSTRUCTION SERVICES          Industrial Ar…  1678 16820                 0.1 
## 4 MINING AND MINERAL ENGINEERING Engineering       77   679                 0.11
## 5 NAVAL ARCHITECTURE AND MARINE… Engineering      135  1123                 0.12

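Explanation: We performed a similar procedure but sorted in ascending order of the ratio to find the smallest values. The output lists the five majors with the lowest ratios of women to men. Military Technologies has a ratio of exactly 0 because it records no women among its 124 graduates; the other four majors have roughly 8 to 12 men for every woman.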
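One detail worth knowing when sorting a column that can contain NA, as female_to_male_ratio can: dplyr's arrange() places missing values at the end whether the sort is ascending or descending, so they never show up in head(5). A minimal sketch with hypothetical majors "A" to "D":

```r
library(dplyr)

# Hypothetical ratios, including one missing value
toy <- tibble(
  Major = c("A", "B", "C", "D"),
  ratio = c(0.5, NA, 2, 0)
)

ascending  <- toy %>% arrange(ratio)        # smallest first, NA last
descending <- toy %>% arrange(desc(ratio))  # largest first, NA still last

ascending
descending
```

This is why neither top-5 list needs an explicit filter(!is.na(...)) step, although adding one would make the intent more visible.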

By comparing these two lists, it is clear that different fields attract very different gender mixes. The most female-dominated majors are in education, health, and family services, while the most male-dominated are in engineering, construction, and military technology.

Visualizing Gender Distribution by Major Category

Individual majors have their specific ratios, but it is also informative to aggregate by major category (grouping similar majors together). The dataset groups majors into 16 categories such as Engineering, Education, and Humanities & Liberal Arts. We will calculate the overall proportion of graduates who are women in each major category and then create a bar chart to compare these categories.

Calculating Proportion of Women per Category

# Summarize gender totals by major category
category_summary <- recent_grads %>%
  group_by(Major_category) %>%
  summarise(
    total_women = sum(Women, na.rm = TRUE),
    total_men = sum(Men, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    proportion_women = total_women / (total_women + total_men),
    majority_gender = if_else(proportion_women >= 0.5, "Female majority", "Male majority")
  )

category_summary
## # A tibble: 16 × 5
##    Major_category         total_women total_men proportion_women majority_gender
##    <chr>                        <dbl>     <dbl>            <dbl> <chr>          
##  1 Agriculture & Natural…       35263     40357            0.466 Male majority  
##  2 Arts                        222740    134390            0.624 Female majority
##  3 Biology & Life Science      268943    184919            0.593 Female majority
##  4 Business                    634524    667852            0.487 Male majority  
##  5 Communications & Jour…      260680    131921            0.664 Female majority
##  6 Computers & Mathemati…       90283    208725            0.302 Male majority  
##  7 Education                   455603    103526            0.815 Female majority
##  8 Engineering                 129276    408307            0.240 Male majority  
##  9 Health                      387713     75517            0.837 Female majority
## 10 Humanities & Liberal …      440622    272846            0.618 Female majority
## 11 Industrial Arts & Con…      126011    103781            0.548 Female majority
## 12 Interdisciplinary             9479      2817            0.771 Female majority
## 13 Law & Public Policy          87978     91129            0.491 Male majority  
## 14 Physical Sciences            90089     95390            0.486 Male majority  
## 15 Psychology & Social W…      382892     98115            0.796 Female majority
## 16 Social Science              273132    256834            0.515 Female majority

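Explanation: Using group_by() and summarise(), we aggregated the data by Major_category. For each category, we summed the female and male graduates across all of its majors, using na.rm = TRUE so that a major with missing counts does not turn the whole category total into NA. Then we computed proportion_women as the fraction of graduates in that category who are women, and labeled each category "Female majority" when that fraction is at least 0.5 and "Male majority" otherwise. Health (0.837) and Education (0.815) have the highest proportions; Engineering (0.240) has the lowest.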
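Summing the counts first and dividing afterwards weights every major by its size. Averaging the per-major shares instead would give a different answer. The sketch below illustrates the difference using only the six Engineering rows printed earlier (not the full category), with illustrative object names.

```r
library(dplyr)

# Men and Women for the six Engineering majors shown in the head() output
eng_rows <- tibble(
  Rank  = 1:6,
  Men   = c(2057, 679, 725, 1123, 21239, 2200),
  Women = c(282, 77, 131, 135, 11021, 373)
)

comparison <- eng_rows %>%
  summarise(
    pooled_share   = sum(Women) / sum(Women + Men),  # approach used in this tutorial
    mean_of_shares = mean(Women / (Women + Men))     # unweighted alternative
  )

comparison
```

For these six rows the pooled share is about 0.30 while the unweighted mean is about 0.16, because the one large major (rank 5) has a much higher share of women than the five small ones. The pooled version answers "what fraction of graduates in this group are women?", which is the question the bar chart is meant to address.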

Bar Plot: Proportion of Women by Major Category

library(scales)  # for percentage formatting of axis
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
ggplot(category_summary, aes(x = reorder(Major_category, proportion_women),
                             y = proportion_women,
                             fill = majority_gender)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Proportion of Women by Major Category",
    x = "Major Category",
    y = "Percent of Graduates Who Are Women",
    fill = "Majority Gender"
  ) +
  theme_minimal()

Explanation: In this ggplot, each bar represents a major category. Because coord_flip() turns the chart on its side, the bar's length (not its height) shows the percentage of graduates in that category who are women, and reorder() sorts the bars by that percentage. Bars are colored by majority_gender, so we can quickly see which categories are female-majority and which are male-majority.
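One optional refinement, not part of the original chart, is a dashed reference line at 50% so that parity is visible without relying on the fill colors. The sketch below is a suggestion under that assumption; it uses a handful of categories and proportions copied from the summary table above, and the data frame name cat_subset is hypothetical.

```r
library(ggplot2)

# A few categories and proportions copied from the category summary above
cat_subset <- data.frame(
  Major_category   = c("Health", "Education", "Social Science", "Business", "Engineering"),
  proportion_women = c(0.837, 0.815, 0.515, 0.487, 0.240)
)

p <- ggplot(cat_subset, aes(x = reorder(Major_category, proportion_women),
                            y = proportion_women)) +
  geom_col() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +  # parity marker at 50%
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1)) +
  labs(x = "Major Category", y = "Percent of Graduates Who Are Women") +
  theme_minimal()

p
```

Fixing the axis to the full 0 to 100% range is a design choice: it keeps the 50% line in the middle of the panel, so bars that cross it are immediately recognizable as female-majority categories.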

Scatter Plot: Women vs Men in Each Major

Next, we will create a scatter plot to examine the distribution of men and women across individual majors. This plot will have each major as a point, with the x-axis representing the number of male graduates and the y-axis representing the number of female graduates in that major. We will also add a reference line where y = x (a 45-degree line). Points on this line would represent majors with equal numbers of men and women.

Additionally, we will annotate two particular majors:

- Early Childhood Education: known to be one of the most female-dominated majors.
- Mechanical Engineering: a highly male-dominated major.

# Identify specific majors to label
# Major names are stored in upper case, so convert the names before matching
labels_df <- recent_grads %>%
  filter(Major %in% toupper(c("Early Childhood Education", "Mechanical Engineering")))

# Remove rows where Men or Women is missing, so we do not get warnings
recent_grads_clean <- recent_grads %>%
  filter(!is.na(Men), !is.na(Women))

# Scatter plot of number of women vs. men in each major
ggplot(recent_grads_clean, aes(x = Men, y = Women)) +
  geom_point(alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +
  geom_text(data = labels_df, aes(label = Major),
            vjust = -0.5, hjust = 0.5, size = 3.5) +
  labs(
    title = "Gender Balance in College Majors: Women vs. Men",
    x = "Number of Men Graduates (per Major)",
    y = "Number of Women Graduates (per Major)"
  ) +
  theme_minimal()

Explanation: Each point represents a single major, positioned according to its number of male graduates (x-axis) and female graduates (y-axis). The diagonal line is a reference for equal numbers of men and women. Points above the line represent majors with more women than men, and points below the line represent majors with more men than women. The labels for Early Childhood Education and Mechanical Engineering highlight majors that are extreme examples of this gender gap.
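When picking majors to label, keep in mind that the Major column stores names in upper case and that %in% compares strings exactly, including case. The sketch below uses a few major names copied from the earlier output; the vector names majors and wanted are only for illustration.

```r
# Major names as they appear in the dataset output above (upper case)
majors <- c("EARLY CHILDHOOD EDUCATION", "ELEMENTARY EDUCATION",
            "CONSTRUCTION SERVICES", "MILITARY TECHNOLOGIES")

# Names written in title case, as one might naturally type them
wanted <- c("Early Childhood Education", "Construction Services")

exact_match <- majors %in% wanted           # case-sensitive comparison
upper_match <- majors %in% toupper(wanted)  # convert to upper case first

exact_match
upper_match
```

The case-sensitive comparison finds nothing, which in a plot would mean geom_text() silently draws no labels. Converting the wanted names with toupper() before matching recovers both majors.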

Conclusion and Key Takeaways

In this tutorial, we analyzed the gender composition of college majors and found striking differences across fields. Some majors are overwhelmingly female, while others are overwhelmingly male. Categories such as Health and Education have a high proportion of female graduates (above 80%), whereas Engineering and Computers & Mathematics are strongly male-majority (about 24% and 30% women). Several categories, including Business, Law & Public Policy, and Physical Sciences, sit within a few percentage points of parity.

By visualizing the data, we gained a clearer understanding of how pronounced the gender divides are in various fields of study. This analysis provides a foundation for further exploration, such as investigating why certain majors attract more men or women, or examining how these gaps have changed over time.