The data comes from data.lacity.org and contains information about job applicants from the 2013-2014 and 2014-2015 fiscal years, including applicants’ gender and ethnicity. The data was last updated on December 1, 2016, and its metadata was last updated on November 30, 2020.
For more details see https://data.lacity.org/Administration-Finance/Job-Applicants-by-Gender-and-Ethnicity/mkf9-fagf/about_data.
There’s a lot of interesting data to explore here, but for this project I want to focus on gender differences among applicants.
Upon opening the data in spreadsheet software, I noticed that it covers many types of jobs for LA County. To keep the categories consistent, I added a column for occupation type based on the 2018 Standard Occupational Classification System from the U.S. Bureau of Labor Statistics.
Let’s take a look at the number of applications for the 2013-2014 fiscal year vs the 2014-2015 fiscal year.
The total number of applicants decreased significantly from 2013-2014 to 2014-2015. Let’s see if the ratio between women and men applying changed in any way.
Interesting. While the 2013-2014 fiscal year had a more balanced distribution between male and female applicants, the 2014-2015 fiscal year shows a large disparity between the two.
Let’s open up this data in R to find out why.
# Load relevant libraries
library(tidyverse) # Includes ggplot2, dplyr, stringr, etc.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) # For cleaning column names and data
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggrepel) # For better label placement in ggplot2
# Load up the data
df <- read_csv("../data/Los Angeles County Job Applicants Dataset.csv") %>%
clean_names()
## Rows: 187 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Fiscal Year, HR Designations, Occupation Type
## dbl (1): Unknown Gender
## num (3): Apps Received, Female, Male
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s look at the column types.
glimpse(df)
## Rows: 187
## Columns: 7
## $ fiscal_year <chr> "2013-2014", "2013-2014", "2013-2014", "2013-2014", "2…
## $ hr_designations <chr> "OP", "P", "OP", "P", "O", "P", "OP", "O", "P", "O", "…
## $ occupation_type <chr> "Management", "Office and Administrative Support", "Ma…
## $ apps_received <dbl> 54, 648, 51, 48, 40, 161, 102, 702, 105, 897, 329, 104…
## $ female <dbl> 20, 488, 13, 9, 15, 89, 53, 430, 3, 467, 27, 27, 2, 7,…
## $ male <dbl> 31, 152, 37, 38, 24, 66, 48, 240, 101, 411, 294, 75, 1…
## $ unknown_gender <dbl> 3, 8, 1, 1, 1, 6, 1, 32, 1, 19, 8, 2, 2, 1, 1, 0, 0, 1…
# Check that everything loaded properly: list any rows that contain a missing value
df %>% filter(if_any(everything(), is.na))
Looks about right.
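The fiscal-year comparison made earlier in the spreadsheet can also be reproduced in R. Below is a minimal sketch, assuming the cleaned column names shown by glimpse(). The toy_df values are made up for illustration and are not taken from the real dataset, and female_share is a name introduced here, not one used elsewhere in this analysis.

```r
library(dplyr)

# Made-up stand-in for df, using the same column names as the cleaned data
toy_df <- tibble(
  fiscal_year    = c("2013-2014", "2013-2014", "2014-2015", "2014-2015"),
  apps_received  = c(100, 50, 60, 40),
  female         = c(55, 20, 20, 10),
  male           = c(40, 28, 38, 29),
  unknown_gender = c(5, 2, 2, 1)
)

# Total applications per fiscal year, plus the female share among
# applicants whose gender is known
yearly_totals <- toy_df %>%
  group_by(fiscal_year) %>%
  summarize(
    total_apps   = sum(apps_received, na.rm = TRUE),
    total_female = sum(female, na.rm = TRUE),
    total_male   = sum(male, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(female_share = total_female / (total_female + total_male))

yearly_totals
```

Swapping toy_df for df should give the same kind of table for the real data, with one row per fiscal year.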
Let’s create a helper function to summarize data by occupation type.
# Helper Function: Summarizing by occupation
summarize_by_occupation <- function(df) {
df %>%
group_by(occupation_type) %>%
summarize(
total_male = sum(male, na.rm=TRUE),
total_female= sum(female, na.rm=TRUE),
total_apps = sum(apps_received, na.rm=TRUE),
.groups = 'drop')
}
# Helper: Truncate and wrap text
truncate_and_wrap <- function(text, truncate_at = 25, wrap_width = 15) {
short <- ifelse(nchar(text) > truncate_at, paste0(substr(text, 1, truncate_at), "..."), text)
str_wrap(short, width = wrap_width)
}
# Prepare summarized and processed data
df_gender_diff <- df %>%
summarize_by_occupation() %>%
mutate(
gender_dominance = ifelse(total_male > total_female, "male", "female"),
gender_diff = abs(total_male - total_female)
) %>%
select(occupation_type, gender_diff, gender_dominance) %>%
arrange(desc(gender_dominance), desc(gender_diff)) %>%
mutate(
id = row_number(),
wrapped_short_label = truncate_and_wrap(occupation_type),
log_gender_diff = log1p(gender_diff),
angle = 90 - 360 * (id - 0.5) / n(),
hjust = ifelse(angle < -90, 1, 0),
angle = ifelse(angle < -90, angle + 180, angle)
)
# Create the plot
ggplot(df_gender_diff, aes(x = factor(id), y = log_gender_diff, fill = gender_dominance)) +
geom_bar(stat = "identity", alpha = 0.7) +
# Inside bar value labels
geom_text(aes(label = gender_diff, y = log_gender_diff / 2),
color = "black", size = 2.5) +
# Outside wrapped labels
geom_text(
aes(
label = wrapped_short_label,
y = log_gender_diff + 0.5,
angle = angle,
hjust = hjust
),
size = 1.8
) +
coord_polar(start = 0) +
ylim(-6, 12) +
theme_minimal() +
theme(
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(face = "bold")
) +
scale_fill_manual(
name = "Gender",
values = c("male" = "#1F77B4", "female" = "#FF69B4")
) +
labs(
title = "Gender Dominance in Applicants by Occupation",
subtitle = "Only 5 out of 19 Categories are Dominated by Women"
)
It appears that women dominate applications in only five job groups:
Office and Administrative
Business and Financial
Community and Social Service
Educational Instruction
Healthcare Practitioners
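The label rotation used in the circular chart above is easier to follow on a small case. The sketch below applies the same angle formula to four hypothetical bars; n_bars, label_angles, and raw_angle are names introduced only for this illustration.

```r
library(dplyr)

n_bars <- 4  # hypothetical number of bars, chosen to keep the arithmetic simple

label_angles <- tibble(id = seq_len(n_bars)) %>%
  mutate(
    # Each bar is centered at 360 * (id - 0.5) / n degrees clockwise from the top;
    # subtracting from 90 makes the text point outward along the bar
    raw_angle = 90 - 360 * (id - 0.5) / n(),
    # Labels on the left half would be upside down, so they are right-aligned...
    hjust = ifelse(raw_angle < -90, 1, 0),
    # ...and rotated by 180 degrees to stay readable
    angle = ifelse(raw_angle < -90, raw_angle + 180, raw_angle)
  )

label_angles
# raw_angle: 45, -45, -135, -225
# hjust:      0,   0,    1,    1
# angle:     45, -45,   45,  -45
```

The two bars on the right half keep their raw angle, while the two on the left half are flipped and anchored at their right edge.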
What if we look into the top 5 job categories with the greatest gender differences?
df %>%
summarize_by_occupation() %>% # Summarize data by occupation
# Determine which gender dominates by comparing totals
# Calculate absolute difference between male and female totals
mutate(
gender_dominance = ifelse(total_male > total_female, "male", "female"),
gender_diff = abs(total_male - total_female)
) %>%
# Keep only relevant columns for plotting, sort, and slice top 5
select(occupation_type, gender_diff, gender_dominance) %>%
arrange(desc(gender_diff)) %>%
slice_head(n = 5) %>%
# Mutate occupation_type for graphing by wrapping names, and add a factor level for charting
mutate(
occupation_type = str_wrap(occupation_type, width = 15),
occupation_type = fct_reorder(occupation_type, gender_diff, .desc = TRUE)
) %>%
# CREATING THE CHART
ggplot(aes(x = occupation_type, y = gender_diff, fill = gender_dominance)) +
geom_col() +
# Add numeric labels above bars for exact difference values
geom_text(aes(label = gender_diff), vjust = -0.3, size = 3.5) +
theme_minimal() +
labs(
title = "Top 5 Occupations with Greatest Difference in Gender",
subtitle = "The categories Protective Service and Office Administrative stand out",
x = "Occupation Type",
y = "Difference in # of Applicants"
) +
# Bold plot title and custom colors
theme(plot.title = element_text(face = "bold")) +
scale_fill_manual(name = "Gender", values = c("male" = "#1F77B4", "female" = "#FF69B4"))
Looking at the top 5 occupations with the greatest gender differences, Protective Service and Office and Administrative Support stand out the most.
Let’s take a closer look at how these two categories changed between the fiscal years.
# Split original df by year
df_2013_2014_raw <- filter(df, fiscal_year == "2013-2014")
df_2014_2015_raw <- filter(df, fiscal_year == "2014-2015")
# Helper function: summarize and label
summarize_and_tag <- function(df, label) {
summarize_by_occupation(df) %>%
mutate(fiscal_year = label)
}
# Summarize and label each year
df_2013_2014 <- summarize_and_tag(df_2013_2014_raw, "2013_2014")
df_2014_2015 <- summarize_and_tag(df_2014_2015_raw, "2014_2015")
# --- Create yearly gender difference table ---
df_yearly_gender_diff <- full_join(
df_2013_2014 %>% select(-fiscal_year),
df_2014_2015 %>% select(-fiscal_year),
by = "occupation_type"
) %>%
# Replace missing counts with 0 (numeric columns only, so occupation_type is left untouched)
mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
# Calculate differences
mutate(
male_diff = total_male.y - total_male.x,
female_diff = total_female.y - total_female.x,
total_diff = total_apps.y - total_apps.x
) %>%
select(occupation_type, male_diff, female_diff, total_diff) %>%
arrange(female_diff)
# --- Create long-format data for plotting ---
df_yearly_apps <- bind_rows(df_2013_2014, df_2014_2015) %>%
pivot_longer(
cols = c(total_male, total_female),
names_to = "gender",
values_to = "num_of_apps"
) %>%
mutate(
gender = str_remove(gender, "total_"),
log_num_of_apps = log1p(num_of_apps),
year_numeric = if_else(fiscal_year == "2013_2014", 2013, 2015)
)
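The join-and-difference step above can be checked on a small made-up example. This is only a sketch: toy_y1, toy_y2, and toy_diff are hypothetical names, and the counts are invented. It shows that a category present in only one year ends up with a count of 0 for the other year rather than NA.

```r
library(dplyr)
library(tidyr)

# Invented per-occupation totals for two fiscal years
toy_y1 <- tibble(
  occupation_type = c("A", "B"),
  total_male      = c(10, 5),
  total_female    = c(8, 7)
)
toy_y2 <- tibble(
  occupation_type = c("A", "C"),
  total_male      = c(12, 3),
  total_female    = c(4, 6)
)

toy_diff <- full_join(toy_y1, toy_y2, by = "occupation_type") %>%
  # Categories missing from one year get NA after the join; treat them as 0
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
  # .x columns come from the first year, .y columns from the second
  mutate(
    male_diff   = total_male.y - total_male.x,
    female_diff = total_female.y - total_female.x
  ) %>%
  select(occupation_type, male_diff, female_diff) %>%
  arrange(female_diff)

toy_diff
```

Category "B" only exists in the first year and "C" only in the second, so their differences equal the full count with a negative or positive sign respectively.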
Let’s first look into Office and Administrative Support by creating a slope graph between the 2013-2014 and 2014-2015 fiscal years.
## CHART: Office and Administrative Support
df_male <- df_yearly_apps %>% filter(gender == "male")
df_female <- df_yearly_apps %>% filter(gender == "female")
selected_occupation <- c("Office and Administrative Support")
# 1. Start with a single ggplot() call.
ggplot() +
# 2. Add the light grey lines for male and female
geom_line(
data = df_male,
aes(x = year_numeric, y = num_of_apps, group = occupation_type),
color = "grey",
alpha = 0.5
) +
geom_line(
data = df_female,
aes(x = year_numeric, y = num_of_apps, group = occupation_type),
color = "grey",
alpha = 0.5
) +
# 3. Add darker blue line for relevant occupation
geom_line(
data = df_male %>%
filter(occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, group = occupation_type),
color = "#1F77B4",
linetype = "solid",
linewidth = .75,
lineend = "round"
) +
# 4. Add darker pink line for relevant occupation
geom_line(
data = df_female %>%
filter(occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, group = occupation_type),
color = "#FF69B4",
linetype = "solid",
linewidth = .75,
lineend = "round"
) +
# 5. Add labels for num_of_apps LEFT SIDE
geom_text_repel(
data = df_male %>% filter(year_numeric == 2013 &
occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
box.padding = unit(0.5, "lines"),
point.padding = unit(0.5, "lines"),
segment.color = "#1F77B4",
color = "#1F77B4",
nudge_x = -0.25, # Assign a numeric value to nudge_x
direction = "y",
hjust = "right"
) +
geom_text_repel(
data = df_female %>% filter(year_numeric == 2013 &
occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
box.padding = unit(0.5, "lines"),
point.padding = unit(0.5, "lines"),
segment.color = "#FF69B4",
color = "#FF69B4",
nudge_x = -0.25,
direction = "y",
hjust = "right"
) +
# 6. Add labels for num_of_apps RIGHT SIDE
geom_text_repel(
data = df_male %>% filter(year_numeric == 2015 &
occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
box.padding = unit(0.5, "lines"),
point.padding = unit(0.5, "lines"),
segment.color = "#1F77B4",
color = "#1F77B4",
nudge_x = 0.25, # Assign a numeric value to nudge_x
direction = "y",
hjust = "right"
) +
geom_text_repel(
data = df_female %>% filter(year_numeric == 2015 &
occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
box.padding = unit(0.5, "lines"),
point.padding = unit(0.5, "lines"),
segment.color = "#FF69B4",
color = "#FF69B4",
nudge_x = 0.25,
nudge_y = 1000,
direction = "y",
hjust = "right"
) +
# 7. Set optional theme and axis labels.
labs(
title = "Decrease in Applicants for Office and Administrative Support",
subtitle = "Both male and female applicants decreased for the 2014-2015 fiscal year",
x = "Year",
y = "Number of Applications"
)+
theme_minimal() +
theme(
plot.title = element_text(face="bold"),
panel.grid = element_blank()) +
geom_line(aes(x = NA, y = NA, color = "Male"), show.legend = TRUE) +
geom_line(aes(x = NA, y = NA, color = "Female"), show.legend = TRUE) +
scale_color_manual(
name = "Gender", # Legend title
values = c("Male" = "#1F77B4", "Female" = "#FF69B4")
)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
Applications for Office and Administrative Support decreased in the 2014-2015 fiscal year, even though this category is the largest contributor of female applicants.
## CHART: Protective Service
df_male <- df_yearly_apps %>% filter(gender == "male")
df_female <- df_yearly_apps %>% filter(gender == "female")
selected_occupation <- c("Protective Service")
# 1. Start with a single ggplot() call.
ggplot() +
# 2. Add the light grey lines for male and female
geom_line(
data = df_male,
aes(x = year_numeric, y = num_of_apps, group = occupation_type),
color = "grey",
alpha = 0.5
) +
geom_line(
data = df_female,
aes(x = year_numeric, y = num_of_apps, group = occupation_type),
color = "grey",
alpha = 0.5
) +
# 3. Add darker blue line for relevant occupation
geom_line(
data = df_male %>%
filter(occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, group = occupation_type),
color = "#1F77B4",
linetype = "solid",
linewidth = .75,
lineend = "round"
) +
# 4. Add darker pink line for relevant occupation
geom_line(
data = df_female %>%
filter(occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, group = occupation_type),
color = "#FF69B4",
linetype = "solid",
linewidth = .75,
lineend = "round"
) +
# 5. Add labels for num_of_apps LEFT SIDE
geom_text_repel(
data = df_male %>% filter(year_numeric == 2013 &
occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
box.padding = unit(0.5, "lines"),
point.padding = unit(0.5, "lines"),
segment.color = "#1F77B4",
color = "#1F77B4",
nudge_x = -0.25, # Assign a numeric value to nudge_x
direction = "y",
hjust = "right"
) +
geom_text_repel(
data = df_female %>% filter(year_numeric == 2013 &
occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
box.padding = unit(0.5, "lines"),
point.padding = unit(0.5, "lines"),
segment.color = "#FF69B4",
color = "#FF69B4",
nudge_x = -0.25,
direction = "y",
hjust = "right"
) +
# 6. Add labels for num_of_apps RIGHT SIDE
geom_text_repel(
data = df_male %>% filter(year_numeric == 2015 &
occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
box.padding = unit(0.5, "lines"),
point.padding = unit(0.5, "lines"),
segment.color = "#1F77B4",
color = "#1F77B4",
nudge_x = 0.25, # Assign a numeric value to nudge_x
direction = "y",
hjust = "right"
) +
geom_text_repel(
data = df_female %>% filter(year_numeric == 2015 &
occupation_type %in% selected_occupation),
aes(x = year_numeric, y = num_of_apps, label = num_of_apps),
box.padding = unit(0.5, "lines"),
point.padding = unit(0.5, "lines"),
segment.color = "#FF69B4",
color = "#FF69B4",
nudge_x = 0.25,
nudge_y = 1000,
direction = "y",
hjust = "right"
) +
# 7. Set optional theme and axis labels.
labs(
title = "Increase in Male Applicants for Protective Service",
subtitle = "Protective Service is the only job category that showed growth in male applicants\ndespite a sharp decline in female applicants for the 2014-2015 fiscal year",
x = "Year",
y = "Number of Applications"
)+
theme_minimal() +
theme(
plot.title = element_text(face="bold"),
panel.grid = element_blank()) +
geom_line(aes(x = NA, y = NA, color = "Male"), show.legend = TRUE) +
geom_line(aes(x = NA, y = NA, color = "Female"), show.legend = TRUE) +
scale_color_manual(
name = "Gender", # Legend title
values = c("Male" = "#1F77B4", "Female" = "#FF69B4")
)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
Male applicants for Protective Service increased in the 2014-2015 fiscal year, even though all other job categories showed a decrease in applicants. This makes Protective Service an outlier, especially since fewer women applied in the 2014-2015 fiscal year.
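The two slope charts above differ only in the selected occupation and the titles, so the shared layers could be wrapped in a function. The sketch below is one possible way to do that, not the code used for the charts above: plot_occupation_slope and toy_apps are hypothetical, the data is invented, and it uses plain geom_text() instead of ggrepel to stay self-contained. It also maps color to gender inside aes(), which produces the legend without extra placeholder layers.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: grey context lines plus highlighted lines for one occupation.
# Expects columns occupation_type, gender, year_numeric, num_of_apps.
plot_occupation_slope <- function(data, occupation, title, subtitle = NULL) {
  highlighted <- filter(data, occupation_type == occupation)

  ggplot(data, aes(x = year_numeric, y = num_of_apps)) +
    # Context: one grey line per occupation and gender
    geom_line(aes(group = interaction(occupation_type, gender)),
              color = "grey", alpha = 0.5) +
    # Highlight: colored lines for the selected occupation
    geom_line(data = highlighted, aes(color = gender, group = gender),
              linewidth = 0.75, lineend = "round") +
    # Value labels, pushed outward at both ends of the highlighted lines
    geom_text(data = highlighted,
              aes(label = num_of_apps, color = gender,
                  hjust = if_else(year_numeric == min(year_numeric), 1.2, -0.2)),
              show.legend = FALSE) +
    scale_color_manual(name = "Gender",
                       values = c(male = "#1F77B4", female = "#FF69B4")) +
    labs(title = title, subtitle = subtitle,
         x = "Year", y = "Number of Applications") +
    theme_minimal() +
    theme(plot.title = element_text(face = "bold"),
          panel.grid = element_blank())
}

# Invented data in the same shape as df_yearly_apps
toy_apps <- tibble(
  occupation_type = rep(c("Protective Service", "Management"), each = 4),
  gender          = rep(c("male", "female"), times = 4),
  year_numeric    = rep(c(2013, 2013, 2015, 2015), times = 2),
  num_of_apps     = c(900, 400, 1100, 250, 300, 280, 200, 150)
)

p <- plot_occupation_slope(toy_apps, "Protective Service",
                           title = "Toy example: Protective Service")
p
```

With the real data, the same call with df_yearly_apps and either occupation name should draw a chart comparable to the ones above.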
And so the gender disparity in the 2014-2015 fiscal year is explained by both…