Overview of the Data

Data is taken from data.lacity.org and contains information about job applicants from the 2013-2014 and 2014-2015 fiscal years. Information includes applicants’ gender and ethnicity. The data was last updated December 1, 2016, and the metadata was last updated November 30, 2020.

For more details see https://data.lacity.org/Administration-Finance/Job-Applicants-by-Gender-and-Ethnicity/mkf9-fagf/about_data.

Initial look at the data + Pre-processing

There’s a lot of interesting data to explore here, but for this project I want to explore gender differences between applicants.

Upon opening the data in spreadsheet software, I noticed that there are many types of jobs for LA county. To maintain consistency, I added a column for occupation type based on the 2018 Standard Occupational Classification system from the U.S. Bureau of Labor Statistics.

Let’s take a look at the number of applications for the 2013-2014 fiscal year versus the 2014-2015 fiscal year.

Fewer Applicants in 2014-2015 versus 2013-2014

The total number of applicants decreased significantly from 2013-2014 to 2014-2015. Let’s see if the ratio between women and men applying changed in any way.

Fewer Female Applicants in the 2014-2015 Fiscal Year

Interesting. While the 2013-2014 fiscal year had a more balanced distribution between male and female applicants, the 2014-2015 fiscal year shows a large disparity between the two.

Let’s open up this data in R to find out why.

Set up our environment

# Load relevant libraries
library(tidyverse)  # Includes ggplot2, dplyr, stringr, etc.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)    # For cleaning column names and data
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggrepel)    # For better label placement in ggplot2

# Load up the data
df <- read_csv("../data/Los Angeles County Job Applicants Dataset.csv") %>%
  clean_names()
## Rows: 187 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Fiscal Year, HR Designations, Occupation Type
## dbl (1): Unknown Gender
## num (3): Apps Received, Female, Male
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s look at the column types.

glimpse(df)
## Rows: 187
## Columns: 7
## $ fiscal_year     <chr> "2013-2014", "2013-2014", "2013-2014", "2013-2014", "2…
## $ hr_designations <chr> "OP", "P", "OP", "P", "O", "P", "OP", "O", "P", "O", "…
## $ occupation_type <chr> "Management", "Office and Administrative Support", "Ma…
## $ apps_received   <dbl> 54, 648, 51, 48, 40, 161, 102, 702, 105, 897, 329, 104…
## $ female          <dbl> 20, 488, 13, 9, 15, 89, 53, 430, 3, 467, 27, 27, 2, 7,…
## $ male            <dbl> 31, 152, 37, 38, 24, 66, 48, 240, 101, 411, 294, 75, 1…
## $ unknown_gender  <dbl> 3, 8, 1, 1, 1, 6, 1, 32, 1, 19, 8, 2, 2, 1, 1, 0, 0, 1…
# Check for rows with missing values (an empty result means none were found)
df %>% filter(if_any(everything(), is.na))

Looks about right.
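
Beyond the missing-value check, another quick sanity test is whether the gender columns add up to the applications received. The sketch below is only an illustration: the first two rows are copied from the `glimpse()` output above, the third row is invented to show what a mismatch looks like, and the `gender_sum` and `mismatch` columns are hypothetical names. It also assumes the three gender columns are meant to sum to `apps_received`, which the source data description does not state.

```r
library(dplyr)
library(tibble)

# Rows 1-2 come from the glimpse() output; row 3 is invented for illustration
toy <- tibble(
  apps_received  = c(54, 648, 100),
  female         = c(20, 488, 40),
  male           = c(31, 152, 50),
  unknown_gender = c(3, 8, NA)
)

# Count missing values per column
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

# Flag rows where the gender columns do not add up to apps_received
checked <- toy %>%
  mutate(
    gender_sum = female + male + coalesce(unknown_gender, 0),
    mismatch   = gender_sum != apps_received
  )

print(na_counts)
print(filter(checked, mismatch))
```

To run the same check on the real data, `toy` would be replaced by `df`.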

Let’s create a helper function to summarize data by occupation type.

# Helper Function: Summarizing by occupation
summarize_by_occupation <- function(df) {
  df %>% 
    group_by(occupation_type) %>% 
    summarize(
      total_male = sum(male, na.rm=TRUE), 
      total_female= sum(female, na.rm=TRUE), 
      total_apps = sum(apps_received, na.rm=TRUE),
      .groups = 'drop')
} 
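
The same grouping pattern can reproduce the year-over-year comparison made in the spreadsheet earlier. This is a sketch only: `summarize_by_year()` and `female_share` are hypothetical names that are not part of the original analysis, and the numbers are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: totals and female share per fiscal year
summarize_by_year <- function(df) {
  df %>%
    group_by(fiscal_year) %>%
    summarize(
      total_male   = sum(male, na.rm = TRUE),
      total_female = sum(female, na.rm = TRUE),
      total_apps   = sum(apps_received, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    # Share of women among applicants with a recorded gender
    mutate(female_share = total_female / (total_female + total_male))
}

# Made-up rows, for illustration only
toy <- tibble(
  fiscal_year   = c("2013-2014", "2013-2014", "2014-2015"),
  apps_received = c(100, 50, 80),
  female        = c(60, 15, 20),
  male          = c(40, 35, 60)
)

by_year <- summarize_by_year(toy)
print(by_year)
```

Calling `summarize_by_year(df)` on the loaded data would give one row per fiscal year.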

Looking into the distribution of applicants by job category

# Helper: Truncate and wrap text
truncate_and_wrap <- function(text, truncate_at = 25, wrap_width = 15) {
  short <- ifelse(nchar(text) > truncate_at, paste0(substr(text, 1, truncate_at), "..."), text)
  str_wrap(short, width = wrap_width)
}

# Prepare summarized and processed data
df_gender_diff <- df %>% 
  summarize_by_occupation() %>% 
  mutate(
    gender_dominance = ifelse(total_male > total_female, "male", "female"),
    gender_diff = abs(total_male - total_female)
  ) %>% 
  select(occupation_type, gender_diff, gender_dominance) %>% 
  arrange(desc(gender_dominance), desc(gender_diff)) %>%
  mutate(
    id = row_number(),
    wrapped_short_label = truncate_and_wrap(occupation_type),
    log_gender_diff = log1p(gender_diff),
    angle = 90 - 360 * (id - 0.5) / n(),
    hjust = ifelse(angle < -90, 1, 0),
    angle = ifelse(angle < -90, angle + 180, angle)
  )

# Create the plot
ggplot(df_gender_diff, aes(x = factor(id), y = log_gender_diff, fill = gender_dominance)) +
  geom_bar(stat = "identity", alpha = 0.7) +

  # Inside bar value labels
  geom_text(aes(label = gender_diff, y = log_gender_diff / 2), 
            color = "black", size = 2.5) +

  # Outside wrapped labels
  geom_text(
    aes(
      label = wrapped_short_label,
      y = log_gender_diff + 0.5,
      angle = angle,
      hjust = hjust
    ),
    size = 1.8
  ) +

  coord_polar(start = 0) +
  ylim(-6, 12) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    plot.title = element_text(face = "bold")
  ) +
  scale_fill_manual(
    name = "Gender",
    values = c("male" = "#1F77B4", "female" = "#FF69B4")
  ) +
  labs(
    title = "Gender Dominance in Applicants by Occupation",
    subtitle = "Only 5 out of 19 Categories are Dominated by Women"
  )

It appears that women dominate applications in only five of the 19 job groups.
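
One way to list those groups by name is to filter the summarized table on `gender_dominance`. The sketch below uses made-up totals shaped like the output of `summarize_by_occupation()`; only the occupation names come from this analysis.

```r
library(dplyr)
library(tibble)

# Made-up totals, shaped like the output of summarize_by_occupation()
toy_summary <- tibble(
  occupation_type = c("Office and Administrative Support",
                      "Protective Service",
                      "Management"),
  total_male   = c(200, 900, 120),
  total_female = c(700, 150, 100)
)

female_dominated <- toy_summary %>%
  mutate(
    gender_dominance = ifelse(total_male > total_female, "male", "female"),
    gender_diff      = abs(total_male - total_female)
  ) %>%
  filter(gender_dominance == "female") %>%
  arrange(desc(gender_diff)) %>%
  pull(occupation_type)

print(female_dominated)
```

On the real data, `df_gender_diff %>% filter(gender_dominance == "female")` gives the same kind of list. Note that with this `ifelse()` rule, a tie between male and female totals is labeled "female".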

What if we look into the top 5 job categories with greatest gender differences?

Top 5 job categories with greatest gender difference

df %>%
  summarize_by_occupation() %>%  # Summarize data by occupation
  # Determine which gender dominates by comparing totals
  # Calculate absolute difference between male and female totals
  mutate(
    gender_dominance = ifelse(total_male > total_female, "male", "female"),
    gender_diff = abs(total_male - total_female)
  ) %>%
  # Keep only relevant columns for plotting, sort, and slice top 5
  select(occupation_type, gender_diff, gender_dominance) %>%
  arrange(desc(gender_diff)) %>%
  slice_head(n = 5) %>%
  # Mutate occupation_type for graphing by wrapping names, and add a factor level for charting
  mutate(
    occupation_type = str_wrap(occupation_type, width = 15),
    occupation_type = fct_reorder(occupation_type, gender_diff, .desc = TRUE)
  ) %>%
  # CREATING THE CHART
  ggplot(aes(x = occupation_type, y = gender_diff, fill = gender_dominance)) +
  geom_col() +
  # Add numeric labels above bars for exact difference values
  geom_text(aes(label = gender_diff), vjust = -0.3, size = 3.5) +
  theme_minimal() +
  labs(
    title = "Top 5 Occupations with Greatest Difference in Gender",
    subtitle = "Protective Service and Office and Administrative Support stand out",
    x = "Occupation Type",
    y = "Difference in # of Applicants"
  ) +
  # Bold plot title and custom colors
  theme(plot.title = element_text(face = "bold")) +
  scale_fill_manual(name = "Gender", values = c("male" = "#1F77B4", "female" = "#FF69B4"))

Looking at the top 5 occupations with the greatest gender differences, we can see that Protective Service and Office and Administrative Support stand out the most.

Let’s take a closer look at the fiscal year differences between the two.

Prepare data for visualization

# Split original df by year
df_2013_2014_raw <- filter(df, fiscal_year == "2013-2014")
df_2014_2015_raw <- filter(df, fiscal_year == "2014-2015")

# Helper function: summarize and label
summarize_and_tag <- function(df, label) {
  summarize_by_occupation(df) %>%
    mutate(fiscal_year = label)
}

# Summarize and label each year
df_2013_2014 <- summarize_and_tag(df_2013_2014_raw, "2013_2014")
df_2014_2015 <- summarize_and_tag(df_2014_2015_raw, "2014_2015")

# --- Create yearly gender difference table ---
df_yearly_gender_diff <- full_join(
  df_2013_2014 %>% select(-fiscal_year),
  df_2014_2015 %>% select(-fiscal_year),
  by = "occupation_type"
) %>%
  # Replace missing values in the numeric columns with 0
  # (the character key column occupation_type is left untouched)
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  # Calculate differences
  mutate(
    male_diff = total_male.y - total_male.x,
    female_diff = total_female.y - total_female.x,
    total_diff = total_apps.y - total_apps.x
  ) %>%
  select(occupation_type, male_diff, female_diff, total_diff) %>%
  arrange(female_diff)

# --- Create long-format data for plotting ---
df_yearly_apps <- bind_rows(df_2013_2014, df_2014_2015) %>%
  pivot_longer(
    cols = c(total_male, total_female),
    names_to = "gender",
    values_to = "num_of_apps"
  ) %>%
  mutate(
    gender = str_remove(gender, "total_"),
    log_num_of_apps = log1p(num_of_apps),
    year_numeric = if_else(fiscal_year == "2013_2014", 2013, 2015)
  )
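
The `full_join()` plus `replace_na()` step matters when an occupation appears in only one fiscal year: the join leaves `NA` on the missing side, and replacing it with 0 lets the difference still be computed. A minimal sketch with invented numbers, where the occupation names `A` and `B` are placeholders:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Invented per-year summaries; "B" has no rows in the second year
year_1 <- tibble(occupation_type = c("A", "B"), total_female = c(50, 10))
year_2 <- tibble(occupation_type = "A", total_female = 30)

yearly_diff <- full_join(year_1, year_2, by = "occupation_type") %>%
  # Only numeric columns are filled; the key column stays character
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  # .x is the first year, .y is the second year
  mutate(female_diff = total_female.y - total_female.x)

print(yearly_diff)
```

Without the replacement, the difference for `B` would be `NA` rather than a drop of 10.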

Office and Administrative Support

Let’s first look into Office and Administrative Support by creating a slope graph between the 2013-2014 and 2014-2015 fiscal years.

## CHART: Office and Administrative Support

df_male <- df_yearly_apps %>% filter(gender == "male")
df_female <- df_yearly_apps %>% filter(gender == "female")
selected_occupation <- c("Office and Administrative Support")


# 1. Start with a single ggplot() call.
ggplot() + 
  # 2. Add the light grey lines for male and female
  geom_line(
    data = df_male, 
    aes(x = year_numeric, y = num_of_apps, group = occupation_type),
    color = "grey",
    alpha = 0.5
  ) +
  geom_line(
    data = df_female, 
    aes(x = year_numeric, y = num_of_apps, group = occupation_type), 
    color = "grey",
    alpha = 0.5
  ) +
  # 3. Add darker blue line for relevant occupation
  geom_line(
    data = df_male %>% 
      filter(occupation_type %in% selected_occupation), 
    aes(x = year_numeric, y = num_of_apps, group = occupation_type), 
    color = "#1F77B4",
    linetype = "solid",
    linewidth = .75, 
    lineend = "round"
  ) +
  # 4. Add darker pink line for relevant occupation
  geom_line(
    data = df_female %>% 
      filter(occupation_type %in% selected_occupation), 
    aes(x = year_numeric, y = num_of_apps, group = occupation_type), 
    color = "#FF69B4",
    linetype = "solid",
    linewidth = .75, 
    lineend = "round"
  ) +
  # 5. Add labels for num_of_apps LEFT SIDE
  geom_text_repel(
    data = df_male %>% filter(year_numeric == 2013 & 
                                occupation_type %in% selected_occupation),
    aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
    box.padding = unit(0.5, "lines"),
    point.padding = unit(0.5, "lines"),
    segment.color = "#1F77B4",
    color = "#1F77B4",
    nudge_x = -0.25, # Assign a numeric value to nudge_x
    direction = "y", 
    hjust = "right"
  ) +
  geom_text_repel(
    data = df_female %>% filter(year_numeric == 2013 &
                                  occupation_type %in% selected_occupation),
    aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
    box.padding = unit(0.5, "lines"),
    point.padding = unit(0.5, "lines"),
    segment.color = "#FF69B4",
    color = "#FF69B4",
    nudge_x = -0.25,
    direction = "y",
    hjust = "right"
  ) +
  # 6. Add labels for num_of_apps RIGHT SIDE
  geom_text_repel(
    data = df_male %>% filter(year_numeric == 2015 & 
                                occupation_type %in% selected_occupation),
    aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
    box.padding = unit(0.5, "lines"),
    point.padding = unit(0.5, "lines"),
    segment.color = "#1F77B4",
    color = "#1F77B4",
    nudge_x = 0.25, # Assign a numeric value to nudge_x
    direction = "y", 
    hjust = "right"
  ) +
  geom_text_repel(
    data = df_female %>% filter(year_numeric == 2015 &
                                  occupation_type %in% selected_occupation),
    aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
    box.padding = unit(0.5, "lines"),
    point.padding = unit(0.5, "lines"),
    segment.color = "#FF69B4",
    color = "#FF69B4",
    nudge_x = 0.25,
    nudge_y = 1000,
    direction = "y",
    hjust = "right"
  ) +
  # 7. Set optional theme and axis labels.
  labs(
    title = "Decrease in Applicants for Office and Administrative Support",
    subtitle = "Both male and female applicants decreased for the 2014-2015 fiscal year",
    x = "Year",
    y = "Number of Applications"
  )+
  theme_minimal() +
  theme(
    plot.title = element_text(face="bold"),
    panel.grid = element_blank()) +
  geom_line(aes(x = NA, y = NA, color = "Male"), show.legend = TRUE) +
  geom_line(aes(x = NA, y = NA, color = "Female"), show.legend = TRUE) +
  scale_color_manual(
    name = "Gender",  # Legend title
    values = c("Male" = "#1F77B4", "Female" = "#FF69B4")
  ) 
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).


Office and Administrative Support applications decreased in the 2014-2015 fiscal year, even though this category is the largest contributor of female applicants.
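
Since the same chart is needed for more than one occupation, the plotting code could be wrapped in a function. The following is a simplified sketch, not the original chart: `plot_slope()` is a hypothetical helper, the data is invented, plain `geom_text()` stands in for `ggrepel`, and color is mapped to `gender` so the legend is generated by the scale.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Hypothetical helper: grey context lines plus highlighted lines for one occupation
plot_slope <- function(data, selected_occupation, title) {
  highlight <- filter(data, occupation_type %in% selected_occupation)

  ggplot(data, aes(x = year_numeric, y = num_of_apps,
                   group = interaction(occupation_type, gender))) +
    # All occupations in light grey for context
    geom_line(color = "grey", alpha = 0.5) +
    # Selected occupation, colored by gender
    geom_line(data = highlight, aes(color = gender), linewidth = 0.75) +
    geom_text(data = highlight,
              aes(label = num_of_apps, color = gender),
              vjust = -0.5, show.legend = FALSE) +
    scale_color_manual(
      name = "Gender",
      values = c("male" = "#1F77B4", "female" = "#FF69B4")
    ) +
    scale_x_continuous(breaks = unique(data$year_numeric)) +
    labs(title = title, x = "Year", y = "Number of Applications") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold"),
          panel.grid = element_blank())
}

# Invented data in the same shape as df_yearly_apps
toy <- tibble(
  occupation_type = rep(c("Protective Service", "Management"), each = 4),
  gender          = rep(c("male", "female"), times = 4),
  year_numeric    = rep(c(2013, 2013, 2015, 2015), times = 2),
  num_of_apps     = c(900, 300, 1100, 150, 120, 100, 90, 80)
)

p <- plot_slope(toy, "Protective Service", "Protective Service (toy data)")
print(p)
```

With the real data, the call would look like `plot_slope(df_yearly_apps, "Office and Administrative Support", "...")`, and a second call with a different occupation would replace the duplicated chunk.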

## CHART: Protective Service

df_male <- df_yearly_apps %>% filter(gender == "male")
df_female <- df_yearly_apps %>% filter(gender == "female")
selected_occupation <- c("Protective Service")

# 1. Start with a single ggplot() call.
ggplot() + 
  # 2. Add the light grey lines for male and female
  geom_line(
    data = df_male, 
    aes(x = year_numeric, y = num_of_apps, group = occupation_type),
    color = "grey",
    alpha = 0.5
  ) +
  geom_line(
    data = df_female, 
    aes(x = year_numeric, y = num_of_apps, group = occupation_type), 
    color = "grey",
    alpha = 0.5
  ) +
  # 3. Add darker blue line for relevant occupation
  geom_line(
    data = df_male %>% 
      filter(occupation_type %in% selected_occupation), 
    aes(x = year_numeric, y = num_of_apps, group = occupation_type), 
    color = "#1F77B4",
    linetype = "solid",
    linewidth = .75, 
    lineend = "round"
  ) +
  # 4. Add darker pink line for relevant occupation
  geom_line(
    data = df_female %>% 
      filter(occupation_type %in% selected_occupation), 
    aes(x = year_numeric, y = num_of_apps, group = occupation_type), 
    color = "#FF69B4",
    linetype = "solid",
    linewidth = .75, 
    lineend = "round"
  ) +
  # 5. Add labels for num_of_apps LEFT SIDE
  geom_text_repel(
    data = df_male %>% filter(year_numeric == 2013 & 
                                occupation_type %in% selected_occupation),
    aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
    box.padding = unit(0.5, "lines"),
    point.padding = unit(0.5, "lines"),
    segment.color = "#1F77B4",
    color = "#1F77B4",
    nudge_x = -0.25, # Assign a numeric value to nudge_x
    direction = "y", 
    hjust = "right"
  ) +
  geom_text_repel(
    data = df_female %>% filter(year_numeric == 2013 &
                                  occupation_type %in% selected_occupation),
    aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
    box.padding = unit(0.5, "lines"),
    point.padding = unit(0.5, "lines"),
    segment.color = "#FF69B4",
    color = "#FF69B4",
    nudge_x = -0.25,
    direction = "y",
    hjust = "right"
  ) +
  # 6. Add labels for num_of_apps RIGHT SIDE
  geom_text_repel(
    data = df_male %>% filter(year_numeric == 2015 & 
                                occupation_type %in% selected_occupation),
    aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
    box.padding = unit(0.5, "lines"),
    point.padding = unit(0.5, "lines"),
    segment.color = "#1F77B4",
    color = "#1F77B4",
    nudge_x = 0.25, # Assign a numeric value to nudge_x
    direction = "y", 
    hjust = "right"
  ) +
  geom_text_repel(
    data = df_female %>% filter(year_numeric == 2015 &
                                  occupation_type %in% selected_occupation),
    aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
    box.padding = unit(0.5, "lines"),
    point.padding = unit(0.5, "lines"),
    segment.color = "#FF69B4",
    color = "#FF69B4",
    nudge_x = 0.25,
    nudge_y = 1000,
    direction = "y",
    hjust = "right"
  ) +
  # 7. Set optional theme and axis labels.
  labs(
    title = "Increase in Male Applicants for Protective Service",
    subtitle = "Protective Service is the only job category that showed growth in male applicants\ndespite a sharp decline in female applicants for the 2014-2015 fiscal year",
    x = "Year",
    y = "Number of Applications"
  )+
  theme_minimal() +
  theme(
    plot.title = element_text(face="bold"),
    panel.grid = element_blank()) +
  geom_line(aes(x = NA, y = NA, color = "Male"), show.legend = TRUE) +
  geom_line(aes(x = NA, y = NA, color = "Female"), show.legend = TRUE) +
  scale_color_manual(
    name = "Gender",  # Legend title
    values = c("Male" = "#1F77B4", "Female" = "#FF69B4")
  ) 
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).


Male applicants for Protective Service increased in the 2014-2015 fiscal year, even though every other job category showed a decrease in applicants. The category is an outlier in another way too: fewer women applied to it in the 2014-2015 fiscal year.
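
The claim that only one category grew can be checked by filtering the year-over-year differences. A sketch on invented numbers in the same shape as `df_yearly_gender_diff` (only the occupation names come from this analysis):

```r
library(dplyr)
library(tibble)

# Invented differences (2014-2015 minus 2013-2014), for illustration only
toy_diff <- tibble(
  occupation_type = c("Protective Service", "Management",
                      "Office and Administrative Support"),
  male_diff   = c(200, -30, -400),
  female_diff = c(-150, -20, -900)
)

# Categories where male applications rose
male_growth <- toy_diff %>%
  filter(male_diff > 0) %>%
  pull(occupation_type)

# Categories where the two genders moved in opposite directions
diverging <- toy_diff %>%
  filter(sign(male_diff) != sign(female_diff)) %>%
  pull(occupation_type)

print(male_growth)
print(diverging)
```

On the real data, `df_yearly_gender_diff %>% filter(male_diff > 0)` should return a single row if the claim holds.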

Conclusion

And so the gender disparity in the 2014-2015 fiscal year is explained by both:

  • The drastic decrease in applicants for Office and Administrative Support roles, which are typically dominated by women
  • An increase in male applicants in Protective Service despite all other categories showing a decrease

These two categories contribute the largest gender disparities among applicants.

Recommendations?

  • Promote applications from women in Protective Service, as workplace culture might contribute to a confidence gap for women in this category.
  • Develop targeted recruitment campaigns to attract applicants from underrepresented genders and ethnicities in the fields with the largest demographic gaps. This supports equitable opportunities and a more balanced applicant pool across all departments.