Key Notes

For better readability of the Rmd file, the default code chunk options have been set to hide the code. If you wish to view the entire code for each statistic or graph, please click the “Show” button located on the right side of each section.

Executive Summary

This report investigates the correlation between women’s tertiary education enrollment and their participation in STEM fields. Despite the global rise in women’s educational attainment, our analysis reveals that this quantitative expansion has not translated into a proportional increase in STEM participation.

Using data from the World Bank and OECD, we identified a persistent “structural gap” across countries regardless of their education levels. Specifically, Korea exemplifies this paradox: possessing top-tier female enrollment rates yet exhibiting one of the largest gender gaps in STEM. Our findings suggest that deeply rooted occupational segregation—where fields like Education and Humanities remain female-dominated while Engineering is male-dominated—is the primary driver of this inequality.

We conclude that policy interventions must go beyond merely increasing educational access and instead address socio-cultural structures and labor market conditions to dismantle the gender division of labor.

Research Question

The starting point of our investigation, as well as the research question we seek to clarify, is as follows:
“Does an increase in women’s tertiary education enrollment lead to a corresponding increase in women’s participation in STEM fields?”

STEM fields are core industries that determine a nation’s technological competitiveness and economic growth. Nevertheless, in many countries, women’s participation in STEM remains notably low. This issue has been consistently highlighted, as it not only affects gender-based wage structures but also undermines industrial diversity and long-term development potential. Meanwhile, as women’s rights have steadily improved, women’s tertiary education enrollment rates have continued to rise worldwide. Within this trend, it is natural to expect that the expansion of educational opportunities would directly lead to an increased proportion of women in STEM fields.

However, in the case of Korea, despite women’s tertiary education enrollment rates being nearly equal to those of men, a substantial gender gap persists in STEM fields. Therefore, it is important to empirically verify whether expanded educational opportunities actually translate into participation in STEM. Furthermore, such an analysis should distinguish whether the gender gap in STEM is primarily a matter of educational attainment, or whether it stems from more fundamental factors such as major and industry structures, sociocultural expectations, and labor-market conditions. Only by drawing this distinction can we develop concrete recommendations. We expect that our investigation will provide a foundation for setting policy directions aimed at mitigating gender inequality in STEM fields.

Accordingly, to analyze the research question stated above, we set the following three specific sub-questions and tasks:

1. How does higher education enrollment generally relate to entry into STEM fields by gender?
2. For women, does a higher rate of tertiary education also lead to a higher rate of graduation in STEM fields?
3. What factors influence women’s participation in STEM fields more than education rates?

To address these research tasks, we utilized four distinct datasets sourced from the World Bank and OECD, covering tertiary enrollment rates and STEM graduation statistics. The specific contents of each are detailed below.

Data Cleaning

# Load Libraries
library(tidyverse)
library(ggrepel)
library(scales)

# Load Datasets 
df_enroll_all    <- read_csv("school_enrollment_tertiary_all.csv")
df_enroll_female <- read_csv("school_enrollment_tertiary_female.csv")
df_stem          <- read_csv("oecd_stem_gender_2013-2023.csv")
df_kpg           <- read_csv("kor_pol_grc_grad_2023.csv")

# Define list of aggregated regions to exclude
# Since the World Bank dataset contains a vast number of non-country aggregate groups, we defined a separate list during the common preprocessing stage to ensure code efficiency
remove_list <- c("World", "OECD members", "High income", "Low income", 
  "Middle income", "Upper middle income", "Lower middle income",
  "East Asia & Pacific", "Euro area", "European Union", 
  "North America", "Sub-Saharan Africa", "Arab World",
  "Latin America & Caribbean", "Europe & Central Asia",
  "East Asia & Pacific (excluding high income)",
  "Europe & Central Asia (excluding high income)",
  "Latin America & Caribbean (excluding high income)",
  "Middle East & North Africa",
  "South Asia",
  "Early-demographic dividend", "Late-demographic dividend",
  "Post-demographic dividend", "Pre-demographic dividend",
  "IDA & IBRD total", "IDA total", "IDA blend", "IDA only", "IBRD only",
  "Least developed countries: UN classification",
  "Heavily indebted poor countries (HIPC)",
  "Fragile and conflict affected situations",
  "Small states", "Other small states", "Pacific island small states",
  "Caribbean small states",
  "Low & middle income",
  "Middle East, North Africa, Afghanistan & Pakistan",
  "East Asia & Pacific (IDA & IBRD countries)",
  "Europe & Central Asia (IDA & IBRD countries)",
  "Latin America & the Caribbean (IDA & IBRD countries)",
  "Middle East & North Africa (IDA & IBRD countries)",
  "South Asia (IDA & IBRD)",
  "Sub-Saharan Africa (IDA & IBRD countries)",
  "Africa Eastern and Southern", "Africa Western and Central")

All data preprocessing and visualization were conducted using the tidyverse package in R.

To ensure data integrity and conduct a strict country-level analysis, we first addressed the issue of aggregated regions present in the raw datasets. Since the data included non-country entities such as “World,” “OECD members,” and “High income,” we defined a comprehensive remove_list to filter these out. Notably, this comprehensive removal list was essential for the World Bank enrollment datasets, which contained a vast number of non-country aggregates. In contrast, the OECD STEM dataset included significantly fewer such entities; therefore, we handled the filtering for the STEM data directly within its individual code chunk for efficiency. Additionally, we refined the dataset by focusing specifically on the year 2023 and removing observations with missing values to ensure analytical consistency.
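One way to sanity-check a name-based exclusion list like this is to look at which entries matched nothing in the data, since a renamed or misspelled aggregate would otherwise pass through the filter unnoticed. The following is a minimal base R sketch on made-up data; `toy_wb` and `toy_remove` are hypothetical stand-ins for a World Bank table and for remove_list, and the numbers carry no meaning.

```r
# Hypothetical stand-in for a World Bank table (values are made up)
toy_wb <- data.frame(
  Name = c("Korea, Rep.", "World", "Poland", "High income", "Greece"),
  Enroll_Total = c(100, 40, 75, 80, 150),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for remove_list; "Euro area" does not occur in toy_wb
toy_remove <- c("World", "High income", "Euro area")

# Same filter pattern as in the report: keep rows whose Name is not listed
toy_countries <- toy_wb[!toy_wb$Name %in% toy_remove, ]

# Entries of the exclusion list that matched no row (possible typos, or
# aggregates that are simply not present in this extract)
unmatched <- setdiff(toy_remove, toy_wb$Name)

print(toy_countries$Name)
print(unmatched)
```

An unmatched entry is not necessarily an error, but it is a cheap prompt to confirm the spelling against the raw file.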

For the specific requirements of our analysis, further structural transformations were applied. To enable the correlation analysis in Figures 1 and 2, we merged the World Bank enrollment data with the OECD STEM datasets using ISO country codes via an inner_join. Furthermore, to facilitate the group comparison in Figure 1, we categorized countries into “Low,” “Medium,” and “High” tiers based on the quantiles of their total tertiary enrollment rates using the cut() function. Finally, for Figure 3, we reshaped the data using pivot_wider and derived a new Gap metric (calculated as the female percentage minus 50%) to intuitively visualize both the direction and magnitude of gender segregation across different fields of study.
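The join-then-bin step described here can be illustrated on a small scale. The sketch below is base R only and uses invented country codes and values: `merge()` stands in for inner_join, and all `toy_` names are hypothetical. It is meant to show how tertile breakpoints from `quantile()` feed into `cut()`, not to reproduce the report's figures.

```r
# Hypothetical country codes and values, for illustration only
toy_enroll <- data.frame(
  Code = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG"),
  Enroll_Total = c(20, 35, 50, 65, 80, 95, 40),
  stringsAsFactors = FALSE
)
toy_stem <- data.frame(
  Code = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "ZZZ"),
  STEM_Rate = c(30, 28, 33, 35, 31, 36, 29),
  stringsAsFactors = FALSE
)

# merge() with default arguments keeps only codes present in both tables,
# which is the base R counterpart of an inner join
toy_merged <- merge(toy_enroll, toy_stem, by = "Code")

# Tertile breakpoints of the enrollment values that survived the join
toy_breaks <- quantile(toy_merged$Enroll_Total, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum value inside the first bin
toy_merged$Edu_Level <- cut(
  toy_merged$Enroll_Total,
  breaks = toy_breaks,
  labels = c("Low", "Medium", "High"),
  include.lowest = TRUE
)

print(table(toy_merged$Edu_Level))
```

Because the breakpoints are computed after the join, the tier boundaries depend on which countries appear in both sources rather than on the full World Bank sample.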

More detailed data cleaning steps are noted as brief comments within each code block.

The following sections provide a brief description and basic statistics for each dataset.

Data Description

1. Global Tertiary Enrollment Overview

school_enrollment_tertiary_all.csv

Sourced from the World Bank, the Tertiary Enrollment dataset provides the gross tertiary enrollment ratio for the total population, regardless of gender. For our analysis, we specifically focused on the Country Name and the enrollment figures for the year 2023. This dataset serves as a crucial baseline to categorize countries into distinct education levels (“Low,” “Medium,” and “High”). By establishing these groups, we aim to empirically verify whether the gender gap in STEM is a universal phenomenon that persists independently of a country’s general educational development.

# Data Processing for dataset_1
# Data Cleaning (Using distinct variable names to avoid conflicts)
clean_enroll_all_eda <- df_enroll_all %>%
  select(Name = `Country Name`, Enroll_Total = `2023`) %>%
  filter(!is.na(Enroll_Total)) %>%
  filter(!Name %in% remove_list)

# Calculate Descriptive Statistics
stats_summary <- clean_enroll_all_eda %>%
  summarise(
    Count = n(),
    Mean = round(mean(Enroll_Total), 2),
    SD = round(sd(Enroll_Total), 2),
    Median = round(median(Enroll_Total), 2),
    Min = round(min(Enroll_Total), 2),
    Max = round(max(Enroll_Total), 2)
  )

# Output Table
knitr::kable(stats_summary, caption = "Descriptive Statistics: Total Tertiary Enrollment (2023)")

Descriptive Statistics: Total Tertiary Enrollment (2023)
Count	Mean	SD	Median	Min	Max
126	52.87	30.92	53.98	4.94	165.11

# Histogram Visualization
p0 <- ggplot(clean_enroll_all_eda, aes(x = Enroll_Total)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "white", alpha = 0.8) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_vline(aes(xintercept = mean(Enroll_Total)), color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Total Tertiary Enrollment Rates (2023)",
       subtitle = "Red dashed line indicates the global mean",
       x = "Tertiary Enrollment Rate (% Gross)", 
       y = "Count of Countries") +
  theme_minimal()

p0

2. Female Tertiary Enrollment Trends

school_enrollment_tertiary_female.csv

This dataset, also sourced from the World Bank, specifically focuses on the female tertiary enrollment ratio using data from 2023. It functions as the key independent variable in our analysis, serving as the foundation to test our primary hypothesis: “Does higher educational access for women lead to higher STEM participation?” By isolating female enrollment rates, we aim to quantitatively measure the correlation between the expansion of general educational opportunities for women and their specific entry into STEM fields.

# Data Processing for dataset_2
# Use _eda suffix to avoid variable name conflicts

# Data Cleaning (Female Dataset)
clean_enroll_female_eda <- df_enroll_female %>%
  select(Name = `Country Name`, Enroll_Rate = `2023`) %>%
  filter(!is.na(Enroll_Rate)) %>%
  filter(!Name %in% remove_list) # Use the previously defined remove_list

# Calculate Descriptive Statistics
stats_female_summary <- clean_enroll_female_eda %>%
  summarise(
    Count = n(),
    Mean = round(mean(Enroll_Rate), 2),
    SD = round(sd(Enroll_Rate), 2),
    Median = round(median(Enroll_Rate), 2),
    Min = round(min(Enroll_Rate), 2),
    Max = round(max(Enroll_Rate), 2)
  )

# Output Table
knitr::kable(stats_female_summary, caption = "Descriptive Statistics: Female Tertiary Enrollment (2023)")

Descriptive Statistics: Female Tertiary Enrollment (2023)
Count	Mean	SD	Median	Min	Max
125	60.92	35.49	63.11	4.66	171.11

# Histogram Visualization
p_female_hist <- ggplot(clean_enroll_female_eda, aes(x = Enroll_Rate)) +
  geom_histogram(binwidth = 10, fill = "#FF9999", color = "white", alpha = 0.8) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
  geom_vline(aes(xintercept = mean(Enroll_Rate)), color = "darkred", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Female Tertiary Enrollment Rates (2023)",
       subtitle = "Red dashed line indicates the global mean for females",
       x = "Female Tertiary Enrollment Rate (% Gross)", 
       y = "Count of Countries") +
  theme_minimal()

p_female_hist

After data cleaning, the total-enrollment dataset contained 126 countries. However, Lebanon’s female enrollment data were missing, so it was dropped from the female dataset. Consequently, the analysis of total enrollment covers 126 countries, while the analysis restricted to female enrollment covers 125 countries.
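A discrepancy of this kind can be traced with a set difference on the country names. The sketch below uses short hypothetical name vectors in place of the real `Name` columns, so it only illustrates the lookup pattern.

```r
# Hypothetical name vectors standing in for the Name columns of the two
# cleaned tables (the real ones hold 126 and 125 countries)
toy_total_names  <- c("Greece", "Korea, Rep.", "Lebanon", "Poland")
toy_female_names <- c("Greece", "Korea, Rep.", "Poland")

# Names present in the total table but absent from the female table
missing_in_female <- setdiff(toy_total_names, toy_female_names)
print(missing_in_female)

# With the report's objects, the analogous call would be:
# setdiff(clean_enroll_all_eda$Name, clean_enroll_female_eda$Name)
```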

3. OECD STEM Graduation Rates

oecd_stem_gender_2013-2023.csv

Sourced from the OECD, the STEM Graduates dataset measures the share of graduates in Science, Technology, Engineering, and Mathematics fields by gender. In our research framework, this dataset serves as the primary dependent variable. By merging these figures with the World Bank enrollment data, we are able to calculate the gender gap and identify significant outliers, such as Japan and Korea. The analysis primarily utilizes the Reference area (Country), Sex (Male/Female), and Observation value to quantify and compare STEM participation rates across different nations.

# Data Processing for dataset_3
# Use '_stem_eda' suffix to avoid variable name conflicts

# Data Cleaning
clean_stem_female_eda <- df_stem %>%
  filter(TIME_PERIOD == 2023, Sex == "Female") %>%
  # Select only total mobility data (excluding international students)
  filter(Mobility == "_T" | Mobility == "Total") %>%
  # Remove aggregate groups (e.g., OECD average, EU, G20)
  filter(!`Reference area` %in% c("OECD", "European Union (25 countries)", "G20", "EU27 (from 2020)")) %>%
  select(Country = `Reference area`, Value = `OBS_VALUE`) %>%
  mutate(Value = as.numeric(Value)) %>%
  filter(!is.na(Value))

# Calculate Descriptive Statistics
stats_stem_summary <- clean_stem_female_eda %>%
  summarise(
    Count = n(),
    Mean = round(mean(Value), 2),
    SD = round(sd(Value), 2),
    Median = round(median(Value), 2),
    Min = round(min(Value), 2),
    Max = round(max(Value), 2)
  )

# Output Table
knitr::kable(stats_stem_summary, caption = "Descriptive Statistics: Share of Female Graduates in STEM (2023, OECD)")

Descriptive Statistics: Share of Female Graduates in STEM (2023, OECD)
Count	Mean	SD	Median	Min	Max
41	33.7	5.51	34.42	18.08	43.29

# Boxplot Visualization
p_stem_box <- ggplot(clean_stem_female_eda, aes(x = "", y = Value)) +
  # Median, Quartiles
  geom_boxplot(fill = "lavender", color = "purple", width = 0.5, outlier.shape = NA) +
  # Jitter
  geom_jitter(width = 0.1, size = 2, alpha = 0.6, color = "darkslateblue") +
  
  # Indicate Mean value (Red Diamond)
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, color = "red") +
  
  labs(title = "Distribution of Female STEM Graduates in OECD Countries (2023)",
       subtitle = "Boxplot with individual country points (Red diamond = Mean)",
       x = "", 
       y = "Share of Female Graduates in STEM (%)") +
  theme_minimal()

p_stem_box

4. Field-Specific Graduate Statistics (Focus Countries)

kor_pol_grc_grad_2023.csv

Finally, the Graduate Statistics dataset provides high-granularity data specifically for our three focus countries: Korea, Poland, and Greece. Unlike the previous broad indicators, this dataset breaks down graduate statistics by detailed Field of education, such as Education, ICT, and Engineering. While the earlier datasets established the existence of a gender gap, this specific data helps us explain why it exists by revealing the structural segregation of majors within these nations. By analyzing variables such as Sex and Observation value across these specific fields, we can pinpoint the exact academic disciplines that drive the observed gender disparity.

# Data Processing for dataset_4
# Descriptive Statistics (Disaggregated by Country), Dumbbell Chart (Aggregated)
# Use '_eda' and '_combined' suffixes to avoid variable name conflicts

# Data Cleaning
clean_kpg_eda <- df_kpg %>%
  filter(Mobility == "Total") %>%
  filter(Sex %in% c("Female", "Male")) %>%
  select(Country = `Reference area`, 
         Field = `Field of education`, 
         Sex, 
         Percentage = `OBS_VALUE`) %>%
  mutate(Percentage = as.numeric(Percentage)) %>%
  filter(!is.na(Percentage))

# Descriptive Statistics (Disaggregated by Country)
stats_kpg_individual <- clean_kpg_eda %>%
  group_by(Country, Sex) %>%
  summarise(
    Mean = round(mean(Percentage), 2),
    SD = round(sd(Percentage), 2),
    Min = round(min(Percentage), 2),
    Max = round(max(Percentage), 2),
    .groups = 'drop'
  )

# Output Table
knitr::kable(stats_kpg_individual, caption = "Descriptive Statistics: Graduates Distribution by Country & Gender (2023)")

Descriptive Statistics: Graduates Distribution by Country & Gender (2023)
Country	Sex	Mean	SD	Min	Max
Greece	Female	57.40	15.95	35.34	85.10
Greece	Male	42.60	15.95	14.90	64.66
Korea	Female	50.51	18.00	22.50	76.17
Korea	Male	49.49	18.00	23.83	77.50
Poland	Female	60.47	19.31	22.65	85.76
Poland	Male	39.53	19.31	14.24	77.35

# Aggregated Dumbbell Chart (Average of 3 Countries)


# Data Aggregation
kpg_combined_data <- clean_kpg_eda %>%
  group_by(Field, Sex) %>%
  summarise(
    Avg_Percentage = mean(Percentage, na.rm = TRUE), # Use the average across the three countries
    .groups = 'drop'
  )

# Reshape Data for Dumbbell Chart (Wide Format)
kpg_wide_combined <- kpg_combined_data %>%
  pivot_wider(names_from = Sex, values_from = Avg_Percentage)

# Visualize Aggregated Dumbbell Chart
p_kpg_combined_dumbbell <- ggplot(kpg_wide_combined) +
  # Dumbbell Segment (Gap)
  geom_segment(aes(y = reorder(Field, Female), yend = reorder(Field, Female),
                   x = Male, xend = Female),
               color = "gray60", linewidth = 1.2) + # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  
  # Male Points (Blue)
  geom_point(aes(y = reorder(Field, Female), x = Male), 
             color = "#6699CC", size = 4) + 
  
  # Female Points (Red)
  geom_point(aes(y = reorder(Field, Female), x = Female), 
             color = "#FF9999", size = 4) + 
  
  # Labels
  scale_y_discrete(labels = function(x) str_wrap(x, width = 35)) +
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  
  labs(title = "Average Gender Gap by Field of Study",
       subtitle = "Aggregated View (Avg of Korea, Poland, Greece): Blue=Male, Red=Female",
       x = "Average Share of Graduates (%)", 
       y = "",
       caption = "Data Source: OECD (2023)") +
  
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 10, face = "bold"),
    panel.grid.major.y = element_blank(),
    legend.position = "top"
  )

p_kpg_combined_dumbbell

Analysis Results

Figure 1: How does higher education enrollment generally relate to entry into STEM fields by gender?

To effectively identify overall trends and improve visual readability, we moved away from simple scatter plots, which proved to be too cluttered and failed to show a meaningful regression pattern. Instead, we classified countries into three distinct tiers based on their total tertiary enrollment rates: “Low” (21.5%–76.0%), “Medium” (76.0%–81.4%), and “High” (81.4%–165.1%). Notably, Korea falls into the “High” group with an enrollment rate of 106.7%. We selected boxplots as the primary visualization structure because they offer the most direct method to examine whether countries with similar levels of tertiary enrollment exhibit similar gender distributions in STEM graduation rates. This approach provides statistical clarity by allowing for a quick comparison of the median and overall distribution across the different educational tiers.

# Data Processing for Figure 1
# Data Cleaning
clean_enroll_all <- df_enroll_all %>%
  select(Code = `Country Code`, Name = `Country Name`, Enroll_Total = `2023`) %>%
  filter(!is.na(Enroll_Total)) %>%
  filter(!Name %in% remove_list)

# Data Cleaning
clean_stem_both <- df_stem %>%
  filter(TIME_PERIOD == 2023) %>%
  filter(Sex %in% c("Female", "Male")) %>%
  filter(Mobility == "_T" | Mobility == "Total") %>% 
  filter(!`Reference area` %in% c("OECD", "European Union (25 countries)")) %>% 
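  select(Code = REF_AREA, Sex, STEM_Rate = OBS_VALUE) %>%
  # Coerce to numeric, as in the other chunks, in case the column is read as text
  mutate(STEM_Rate = as.numeric(STEM_Rate)) %>%
  filter(!is.na(STEM_Rate))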

# Data Merging & Grouping
merged_boxplot <- inner_join(clean_enroll_all, clean_stem_both, by = "Code") %>%
  mutate(
    Edu_Level = cut(Enroll_Total, 
                    breaks = quantile(Enroll_Total, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
                    labels = c("Low Enrollment", "Medium Enrollment", "High Enrollment"),
                    include.lowest = TRUE)
  )

# Visualization Graph 1
p1 <- ggplot(merged_boxplot, aes(x = Edu_Level, y = STEM_Rate, fill = Sex)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 19, outlier.size = 2) +
  
  # Highlight Label for Japan
  geom_text_repel(data = subset(merged_boxplot, Name == "Japan"),
                  aes(label = Name),
                  size = 5, fontface = "bold", color = "black",
                  min.segment.length = 0, # Always draw the connecting segment
                  nudge_x = 0.4, # Nudge the label sideways
                  show.legend = FALSE) +
  
  scale_fill_manual(values = c("Female" = "#FF9999", "Male" = "#6699CC")) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  labs(title = "Global STEM Gender Gap by Education Level (2023)",
       subtitle = "Gender segregation in STEM persists regardless of national education levels.",
       x = "Tertiary Education Enrollment Level (Quantiles)", 
       y = "Share of Graduates in STEM (%)",
       fill = "Gender",
       caption = "Data Source: World Bank & OECD\n(Groups defined by enrollment rate tertiles)") +
  theme_minimal() +
  theme(legend.position = "bottom")

p1

The visual analysis reveals a persistent structural gap that exists regardless of a country’s tertiary enrollment level. Whether enrollment rates were high or low, the gender disparity remained consistent, with women accounting for a median of approximately 30% of STEM graduates compared to 70% for men. We also identified an extreme outlier within the “Low” enrollment group: Japan. With men making up 81.92% of STEM graduates and women only 18.08%, Japan illustrates that having an enrollment level similar to that of its peers does not guarantee gender parity. While this specific case warrants deeper sociological investigation, it serves here as a stark example that extreme gender disparities can persist even within comparable educational contexts.
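The medians read off the boxplots can also be tabulated directly. The following base R sketch shows one way to do this with `aggregate()`; the data frame `toy_box` and all of its values are invented for illustration and do not come from the merged dataset.

```r
# Invented shares for two enrollment tiers, three countries per tier and sex
toy_box <- data.frame(
  Edu_Level = rep(c("Low", "High"), each = 6),
  Sex = rep(rep(c("Female", "Male"), each = 3), times = 2),
  STEM_Rate = c(28, 30, 35, 72, 70, 65,   # Low tier: female, then male
                31, 33, 29, 69, 67, 71),  # High tier: female, then male
  stringsAsFactors = FALSE
)

# Median share of STEM graduates for every tier-by-sex combination
toy_medians <- aggregate(STEM_Rate ~ Edu_Level + Sex, data = toy_box, FUN = median)
print(toy_medians)
```

Applied to the report's merged_boxplot object, the same formula would give the tier-level medians behind the figure.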

Figure 2: For women, does a higher rate of tertiary education also lead to a higher rate of graduation in STEM fields?

To further investigate our research question, we narrowed the scope of our analysis specifically to the female population. We utilized a scatterplot to visually examine the correlation between female tertiary enrollment (the independent variable) and the share of female graduates in STEM (the dependent variable). Additionally, a regression line was superimposed on the plot to identify whether a global trend exists connecting educational access to STEM participation. Within this global context, we highlighted three specific countries to better situate Korea’s position: Korea, our primary subject of interest; Poland, which shares a similar enrollment rate to Korea yet exhibits higher STEM participation; and Greece, which outperforms Korea in both enrollment metrics and STEM graduation rates.

# Data Processing for Figure 2
# Data Cleaning
clean_enroll_female <- df_enroll_female %>%
  select(Code = `Country Code`, Name = `Country Name`, Enroll_Female = `2023`) %>%
  filter(!is.na(Enroll_Female)) %>%
  filter(!Name %in% remove_list)

# Filter STEM Data for Female Graduates
clean_stem_female <- clean_stem_both %>%
  filter(Sex == "Female")

# Data Merging & Defining Highlight Groups
merged_female <- inner_join(clean_enroll_female, clean_stem_female, by = "Code") %>%
  mutate(
    Highlight = case_when(
      Code %in% c("KOR", "POL", "GRC") ~ "Focus Country",
      TRUE ~ "Others"
    )
  )

# Visualization Figure 2
p2 <- ggplot(merged_female, aes(x = Enroll_Female, y = STEM_Rate)) +
  geom_smooth(method = "lm", color = "darkgray", fill = "lightgray", alpha = 0.5) +
  
  # Configure Point Color and Size
  geom_point(aes(color = Highlight, size = Highlight), alpha = 0.8) +
  scale_color_manual(values = c("Focus Country" = "#7B1FA2", "Others" = "#FF9999")) +
  scale_size_manual(values = c("Focus Country" = 4, "Others" = 2.5)) +
  
  # Labeling for Focus Countries
  geom_text_repel(data = subset(merged_female, Highlight == "Focus Country"),
                  aes(label = Name), 
                  size = 5, fontface = "bold", box.padding = 0.5, color = "black") +
  
  labs(title = "Female Tertiary Enrollment vs. Female STEM Share (2023)",
       subtitle = "Highlight: Korea, Poland, Greece vs. Global Trend",
       x = "Female Tertiary Enrollment Rate (% Gross)", 
       y = "Share of Female Graduates in STEM (%)",
       caption = "Data Source: World Bank & OECD",
       color = "Group", size = "Group") +
  theme_minimal() +
  theme(legend.position = "bottom")

p2

The resulting plot displays a slight positive trend, indicated by the upward slope of the regression line, which suggests that higher female enrollment is generally associated with a higher share of female STEM graduates. However, the correlation is relatively weak, as evidenced by the wide scatter of data points around the line; this variance implies that educational access alone does not guarantee high STEM participation. Most notably, Korea lies well below the regression line compared to its peers with similar enrollment levels. This discrepancy highlights a specific underperformance in STEM participation despite high enrollment rates, which motivates the subsequent focused comparison with Poland and Greece to uncover the underlying structural causes.
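The strength of the trend and a country's distance from the fitted line can be quantified with a correlation coefficient and regression residuals. The sketch below is a base R illustration on invented data; `toy_f` and its codes and values are hypothetical, so the resulting numbers say nothing about the actual countries.

```r
# Invented enrollment and STEM shares for six hypothetical countries
toy_f <- data.frame(
  Code = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF"),
  Enroll_Female = c(40, 60, 80, 100, 120, 140),
  STEM_Rate = c(30, 36, 29, 38, 27, 40),
  stringsAsFactors = FALSE
)

# Pearson correlation between enrollment and STEM share
toy_r <- cor(toy_f$Enroll_Female, toy_f$STEM_Rate)

# Ordinary least squares fit, the same model geom_smooth(method = "lm") draws
toy_fit <- lm(STEM_Rate ~ Enroll_Female, data = toy_f)

# Residual = observed share minus fitted share; negative means below the line
toy_f$Residual <- as.numeric(residuals(toy_fit))

print(toy_r)
print(coef(toy_fit))
print(toy_f[order(toy_f$Residual), c("Code", "Residual")])
```

Sorting by residual puts the countries furthest below the line first, which is the sense in which a country "lies below the regression line" in the scatterplot.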

Figure 3: What factors influence women’s participation in STEM fields more than education rates?

To visualize the data effectively, we prioritized clarity and intuition over complexity. Although we initially experimented with dumbbell charts, we found them too cluttered for comparing multiple fields simultaneously. Consequently, we selected a diverging bar chart as the most appropriate visualization method. Furthermore, because the female and male percentages sum to 100%, displaying both raw figures would be redundant. Instead, we defined and visualized a Gap metric (calculated as the female percentage minus 50%). This approach explicitly highlights both the magnitude and direction of gender segregation, allowing for an immediate visual understanding of which gender dominates a specific field of study.
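Since the two shares sum to 100%, the female percentage minus 50 equals half of the female-minus-male difference, which is why a single signed value is enough. The short base R sketch below checks this on invented field names and shares; `toy_share` is a hypothetical data frame, not part of the report's data.

```r
# Invented female shares for three hypothetical fields
toy_share <- data.frame(
  Field = c("Field A", "Field B", "Field C"),
  Female = c(80, 50, 25),
  stringsAsFactors = FALSE
)

# Male share follows from the 100% total
toy_share$Male <- 100 - toy_share$Female

# Gap as defined in the report, and the equivalent half-difference form
toy_share$Gap <- toy_share$Female - 50
toy_share$Half_Diff <- (toy_share$Female - toy_share$Male) / 2

# Same labeling rule as the Figure 3 code
toy_share$Dominance <- ifelse(toy_share$Gap > 0, "Female-Dominated", "Male-Dominated")

print(toy_share)
```

Note that a field at exactly 50% has a Gap of 0 and, under the `Gap > 0` rule, is labeled Male-Dominated; whether such ties occur in the real data is not checked here.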

# Data Processing for Figure 3
# Data Cleaning
clean_kpg <- df_kpg %>%
  filter(Mobility == "Total") %>% 
  filter(Sex %in% c("Female", "Male")) %>% 
  select(Country = `Reference area`, 
         Field = `Field of education`, 
         Sex, 
         Percentage = `OBS_VALUE`) %>%
  mutate(Percentage = as.numeric(Percentage)) %>%
  filter(!is.na(Percentage))

# Reshape to Wide Format & Calculate Gender Gap
kpg_gap <- clean_kpg %>%
  pivot_wider(names_from = Sex, values_from = Percentage) %>%
  mutate(Gap = Female - 50) %>% 
  mutate(Dominance = ifelse(Gap > 0, "Female-Dominated", "Male-Dominated"))

# Visualization Figure 3
p3 <- ggplot(kpg_gap, aes(x = reorder(Field, Gap), y = Gap, fill = Dominance)) +
  geom_col(width = 0.7) +
  facet_wrap(~ Country, ncol = 3) +
  coord_flip() +
  scale_fill_manual(values = c("Female-Dominated" = "#FF9999", 
                               "Male-Dominated" = "#6699CC")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 30)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Gender Gap by Field of Study (2023)",
       subtitle = "Right (Red) = Female Majority / Left (Blue) = Male Majority",
       x = "", 
       y = "Gender Gap (Female Share - 50%)", 
       fill = "Field Dominance",
       caption = "Data Source: OECD") +
  theme_bw() +
  theme(
    strip.text = element_text(size = 12, face = "bold"), 
    legend.position = "bottom",
    axis.text.y = element_text(size = 8)
  )

p3

The analysis reveals a consistent pattern of “gender segregation” across fields of study, which is observable regardless of the specific country in question. Specifically, fields such as Education, Healthcare, and Humanities are heavily female-dominated, whereas Engineering, ICT, and Manufacturing remain overwhelmingly male-dominated. This persistent structural distinction implies that simply increasing overall educational attainment does not automatically guarantee higher female participation in STEM industries. Instead, our findings suggest that deeply rooted gender-based structures, such as socio-cultural norms or labor market conditions, may play a more critical role in shaping these occupational outcomes than educational access alone.

Conclusion: Beyond Educational Access

In conclusion, our analysis demonstrates that simply raising women’s educational attainment is insufficient to achieve gender equality in STEM fields. While tertiary enrollment rates for women have increased globally, a distinct “gendered trend” persists across different fields of study.

Specifically, our focus-country analysis (Figure 3) reveals a deeply rooted occupational segregation: fields such as Education, Healthcare, and Humanities remain heavily female-dominated, whereas Engineering, ICT, and Manufacturing are overwhelmingly male-dominated. This structural barrier helps explain the paradox observed in Korea, where top-tier female enrollment rates do not translate into STEM participation.

Therefore, we suggest that closing the gender gap requires more than just educational access. Future interventions should address broader structural factors, such as socio-cultural norms or labor market conditions, that perpetuate the gender division of labor.

A Study on the Correlation Between Women’s Education and Entry into STEM fields

KiMin Yi, Seol Yoon

December 9, 2025