Gender in the workforce is a topic of great social and economic significance, as it directly affects the lives of millions of individuals and has implications for the broader society. While progress has been made towards gender equality in recent years, women continue to face challenges and inequalities in the workplace, including but not limited to, wage disparities and underrepresentation in leadership positions.
Therefore, analyzing the dataset related to this topic allows us to gain valuable insights into the current state of gender distribution across various job categories in the United States, which can help us identify areas where further progress is needed. Additionally, it can help us examine the impact of gender on earnings and employment status as the information about women’s earnings and employment status from historical data allows us to understand the trends in the past to predict the upcoming possibilities in future dataset. And ultimately, we can identify areas where further progress is needed to achieve gender equality in the workplace.
Our project investigates the main characteristics of occupation in different gender. We will be using the data available at TidyTuesday.
The Jobs Gender dataset1, part of the TidyTuesday project, provides valuable insights into the current state of gender distribution across various job categories in the United States. With data spanning multiple industries and job types, this dataset allows for a comprehensive analysis of gender representation in the workforce. There are historical data about women’s earnings and employment status, as well as detailed information about specific occupation and earnings from 2013-2016.
This dataset was compiled from two sources: the Census Bureau and the Bureau of Labor Statistics. The data is provided as is, and it recognizes the limitations and issues in defining gender as binary. This dataset provides valuable insights into the current state of gender distribution across various job categories in the United States and can be used for further analysis and policymaking to address issues of gender inequality in the workforce.
| Variable | Class | Discription |
|---|---|---|
| year | interger | Year |
| occupation | character | Specific job/career |
| major_category | character | Broad category of occupation |
| minor_category | character | Fine category of occupation |
| total_workers | double | Total estimated full-time workers > 16 years old |
| workers_male | double | Estimated MALE full-time workers > 16 years old |
| workers_female | double | Estimated FEMALE full-time workers > 16 years old |
| percent_female | double | The percent of females for specific occupation |
| total_earnings | double | Total estimated median earnings for full-time workers > 16 years old |
| total_earnings_male | double | Estimated MALE median earnings for full-time workers > 16 years old |
| total_earnings_female | double | Estimated FEMALE median earnings for full-time workers > 16 years old |
| wage_percent_of_male | double | Female wages as percent of male wages - NA for occupations with small sample size |
jobs_gender <- read.csv("../data/jobs_gender.csv")
head(jobs_gender,5)
glimpse(jobs_gender)
## Rows: 2,088
## Columns: 12
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ occupation <chr> "Chief executives", "General and operations mana…
## $ major_category <chr> "Management, Business, and Financial", "Manageme…
## $ minor_category <chr> "Management", "Management", "Management", "Manag…
## $ total_workers <int> 1024259, 977284, 14815, 43015, 754514, 44198, 10…
## $ workers_male <int> 782400, 681627, 8375, 17775, 440078, 16141, 7287…
## $ workers_female <int> 241859, 295657, 6440, 25240, 314436, 28057, 3683…
## $ percent_female <dbl> 23.6, 30.3, 43.5, 58.7, 41.7, 63.5, 33.6, 27.5, …
## $ total_earnings <int> 120254, 73557, 67155, 61371, 78455, 74114, 62187…
## $ total_earnings_male <int> 126142, 81041, 71530, 75190, 91998, 90071, 66579…
## $ total_earnings_female <int> 95921, 60759, 65325, 55860, 65040, 66052, 55079,…
## $ wage_percent_of_male <dbl> 76.04208, 74.97316, 91.32532, 74.29179, 70.69719…
# check for NA value in the dataset
colSums(is.na(jobs_gender))
## year occupation major_category
## 0 0 0
## minor_category total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female wage_percent_of_male
## 4 65 846
#replace NA value by mean of the column's value in the dataset
jobs_gender <- jobs_gender %>%
mutate(total_earnings_male=ifelse(is.na(total_earnings_male),mean(total_earnings_male,na.rm=TRUE), total_earnings_male),
total_earnings_female=ifelse(is.na(total_earnings_female),mean(total_earnings_female,na.rm=TRUE), total_earnings_female),
wage_percent_of_male=ifelse(is.na(wage_percent_of_male),mean(wage_percent_of_male,na.rm=TRUE), wage_percent_of_male))
head(jobs_gender,5)
# Summary Statistics
summary(jobs_gender)
## year occupation major_category minor_category
## Min. :2013 Length:2088 Length:2088 Length:2088
## 1st Qu.:2014 Class :character Class :character Class :character
## Median :2014 Mode :character Mode :character Mode :character
## Mean :2014
## 3rd Qu.:2015
## Max. :2016
## total_workers workers_male workers_female percent_female
## Min. : 658 Min. : 0 Min. : 0 Min. : 0.00
## 1st Qu.: 18687 1st Qu.: 10765 1st Qu.: 2364 1st Qu.: 10.73
## Median : 58997 Median : 32302 Median : 15238 Median : 32.40
## Mean : 196055 Mean : 111515 Mean : 84540 Mean : 36.00
## 3rd Qu.: 187415 3rd Qu.: 102644 3rd Qu.: 63327 3rd Qu.: 57.31
## Max. :3758629 Max. :2570385 Max. :2290818 Max. :100.00
## total_earnings total_earnings_male total_earnings_female
## Min. : 17266 Min. : 12147 Min. : 7447
## 1st Qu.: 32410 1st Qu.: 35711 1st Qu.: 29364
## Median : 44437 Median : 46837 Median : 40903
## Mean : 49762 Mean : 53138 Mean : 44681
## 3rd Qu.: 61012 3rd Qu.: 64846 3rd Qu.: 53613
## Max. :201542 Max. :231420 Max. :166388
## wage_percent_of_male
## Min. : 50.87
## 1st Qu.: 82.86
## Median : 84.03
## Mean : 84.03
## 3rd Qu.: 86.77
## Max. :117.40
From the summary statistics, we can see that the dataset covers 2,088 rows of occupations and it was collected between 2013 and 2016. There is few notable statistics in the summary above: the total number of workers in the dataset ranges from 658 to 3,758,629 with a mean of 196,055. The percentage of female workers in the dataset ranges from 0% to 100%, with a mean of 36%. This indicates that there is some variation in gender representation across different occupations. Also, the percentage of the male wage that the female wage represents ranges from 50.87% to 117.40%, with a mean of 84.03%. This suggests that, on average, female workers earn about 84% of what male workers earn in the same occupation. However, it’s worth noting that there is likely a lot of variation in this figure across different occupations.
The below data visualizations are formed based on all 2088 rows in the dataset, including duplicated occupations. Detailed exloratory data analysis on different occupations will be analysed in the final write-up data project.
# Histogram of totall earnings across all occupations
ggplot(data = jobs_gender) +
geom_histogram(mapping = aes(x = total_earnings), fill = "steelblue", bins = 30) +
labs(title = "Distribution of the total earnings across all occupations", x = "Total earnings", y = "Count")
This data visualization is a histogram of total earnings, showing the distribution of total earnings through all 2088 occupations in the dataset. The histogram is constructed using 30 bins and a bluesteel fill color. This histogram showed a right-skewed pattern with the range of total earnings from 17266 to 201542. It can be observed that nearly the majority of the occupations earn from above 25000 to over 50000, with the center of the histogram at around 30000 (or 30000 is the earning that nearly 330 occupations are at). And there are just few occupations that can earn above 100000 in a year, which also shows that the values are slightly spread out.
# Bar chart of major categories
ggplot(data=jobs_gender) +
geom_bar(mapping = aes(x=major_category),fill='steelblue') +
labs(title="Number of occupations in each major category", x = "Major Category", y = "Number of occupations") + coord_flip()
The graph above shows that there are 8 different types of major
categories in this dataset. The categories are sorted in alphabetical
order on the x-axis. And the distribution of number of occupations per
major category ranges from over 110 to nearly 450 occupations. In
detail, Production, Transportation, and Material Moving has
most occupations with nearly 450 while
Healthcare Practitioners and Technical has the least
occupations of over 110.
# Scatter plot matrix of earnings and percent female
ggplot(data = jobs_gender) +
geom_point(mapping = aes(x = total_earnings, y = percent_female), alpha = 0.4, color = "steelblue") +
geom_smooth(mapping = aes(x = total_earnings, y = percent_female),method = "lm", se = FALSE, color='red') + labs(title = "Relationship between earnings and percent female", x = "Total earnings", y = "Female percent")
## `geom_smooth()` using formula = 'y ~ x'
The bi-variate data visualization above demonstrates a scatter plot of relationship between earnings and percent female with the best fit line. The best fit line slopes downward, which suggests a negative relationship, meaning that as one variable increases, the other tends to decrease. Also, from the plot, the points on the scatterplot are moderately clustered around the line and tend to spread out at the higher total earnings. This suggests a moderate relationships between total earnings and percent of female. And there seems to be hardly any outliers in the plot.
# Box plot of wage percent of male by major category
ggplot(data = jobs_gender) +
geom_boxplot(mapping = aes(x=major_category,y=wage_percent_of_male),fill='steelblue') +
labs(title = "Wage percent of male by major category",x= "Major category", y = "Wage percent of male") + coord_flip()
From the given boxplot, we can notice that
Production, Transportation, and Material Moving and
Natural Resources, Construction, and Maintenance ’s box
plots are shorter (nealy none) and have more outliers than other
categories. This mean that these two categories have the data less
spread out and more variable than the others, interpreting that the
distribution of wages for those category is highly skewed and that there
are some high earners who are making significantly more than the typical
earners in those category. Moreover, observing other categories, the
positions of the boxplots quite overlap with each other, which means
that the medians are similar. The whiskers of
Management, Business, and Financial and
Education, Legal, Community Service, Arts, and Media are
longer than the others, which may indicate that the data are more
variable in these 2 categories.
From the above exploratory data analysis, we can see that:
The Production, Transportation, and Material Moving
category has the most occupations (nearly 450) while the
Healthcare Practitioners and Technical category has the
least (over 110).Production, Transportation, and Material Moving and
Natural Resources, Construction, and Maintenance categories
have shorter boxplots with more outliers than other categories,
indicating the data is less spread out and more variable.
Management, Business, and Financial and
Education, Legal, Community Service, Arts, and Media have
longer whiskers, suggesting the data is more variable in these
categories.#distinguish the major
jobs_gender %>% distinct(major_category)
# Group the data by major category and year, and calculate the mean percentage of female workers
jobs_gender_df1 <- jobs_gender %>%
group_by(major_category, year) %>%
summarize(mean_percent_female = mean(percent_female))
## `summarise()` has grouped output by 'major_category'. You can override using
## the `.groups` argument.
jobs_gender_df1
#Visualize the data of percent_female in many major categories over time
jobs_gender_df1 %>% ggplot() +
geom_col(aes(x = year, y = mean_percent_female, group = major_category, fill = major_category)) +
labs(x = "Year", y = "Percentage of female workers",
title = "Percent of female in 8 major categories over time",
caption = "Data Source: [1]") + theme_bw()
Natural Resources, Construction, and Maintenance is the
least compared to other industries over time with around 5.6%.
Meanwhile, Healthcare Practitioners and Technical has the
most percent of female in the industry with around 65% throughout the 4
years.Final Conclusion and Answer to the Research Question:In short, it could be said that the female percent in each major category stay nearly the same from 2013 to 2016 in which
Natural Resources, Construction, and Maintenancehas the least andHealthcare Practitioners and Technicalhas the most female workers among the 8 major categories.
For this question, we will pick
Natural Resources, Construction, and Maintenance as the
industry to analyze. From question 1’s answer, it can easily be seen
that this industry has the least female percent among all major
categories. Therefore, we want to see if female_percent will influence
the pay gap between genders along with other factors in this major
category or not.
We gonna check how much is the gap between total earnings of female and male first:
Scatterplots of total earnings for male and female workers
within the
Natural Resources, Construction, and Maintenanceindustry:
#create a dataset with major_category Natural Resources, Construction, and Maintenance
jobs_gender_df2 <- jobs_gender %>%
filter(major_category == "Natural Resources, Construction, and Maintenance")
#Scatterplot of total earnings for male workers within Natural Resources, Construction, and Maintenance:
jobs_gender_df2 %>% ggplot() +
geom_point(aes(x=total_earnings,y=total_earnings_female,color = total_earnings_female)) +
geom_smooth(aes(x=total_earnings,y=total_earnings_female),method="lm",se=FALSE,color="red") +
labs(x="Total earnings", y = "Total earnings of female",
title = "Relationship between total earnings and total earnings of female",
caption = "Data Source: [1]")
## `geom_smooth()` using formula = 'y ~ x'
#Scatterplot of total earnings for male workers within Natural Resources, Construction, and Maintenance:
jobs_gender_df2 %>% ggplot() +
geom_point(aes(x=total_earnings,y=total_earnings_male,color = total_earnings_male)) +
geom_smooth(aes(x=total_earnings,y=total_earnings_male),method="lm",se=FALSE,color="red") +
labs(x="Total earnings", y = "Total earnings of male",
title = "Relationship between total earnings and total earnings of male",
caption = "Data Source: [1]")
## `geom_smooth()` using formula = 'y ~ x'
total_earnings_female and
total_earnings_male with total_earnings with
the best fit line, it can easily be seen that the data in
total_earnings_female is a little bit widespread while that
of total_earnings_male is clustered together. Also, the
range of total_earnings_female is from around 5000 to over
110000 with the highest 160000. Meanwhile, the range of
total_earnings_male is from around 20000 to 90000 with the
higest 90000. Moreover, from the best fit lines of the two graph, we can
see that the best fit line of total_earnings_female seems
to increase slower than the line of
total_earnings_male.It can be concluded that there is a significant difference between the earnings of male and female workers. Female workers seem to have a wider range of earnings than male workers, with some females earning significantly less and others earning significantly more than the male workers. Also, the two best fit lines indicate that male workers have higher earnings than female workers on average as
total_earnings_femaleincreases at a slower rate than the line fortotal_earnings_male.
Now, we will go deep into analysing which factors contributing to this pay gap by using linear regression model on both outcome variables with some predictor variables:
For this analyzing, we take into accounts all other variables to be
the predictor variables except for
total_earnings,total_earnings_female and
total_earnings_male (2 of them are the main outcome
variables, and the other is the sum of these 2 variables). After
regarding the fact that there are too many occupations and
total_workers is the sum of workers_male and
workers_female, we decide to eliminate
occupation, workers_female and
workers_female out of the predictor variables. Therefore,
the predictor variables for the linear models are year,
minor_category, total_workers,
percent_female, and wage_percent_of_male.
Linear regression model with
total_earnings_female as the response
variable:
female_model <- lm(total_earnings_female~
year+minor_category+total_workers+percent_female+wage_percent_of_male,
data = jobs_gender_df2)
summary(female_model)
##
## Call:
## lm(formula = total_earnings_female ~ year + minor_category +
## total_workers + percent_female + wage_percent_of_male, data = jobs_gender_df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29512 -8675 -684 4277 116037
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -3.468e+05 1.397e+06
## year 1.742e+02 6.944e+02
## minor_categoryFarming, Fishing, and Forestry -9.581e+03 3.495e+03
## minor_categoryInstallation, Maintenance, and Repair -1.611e+03 1.632e+03
## total_workers -1.110e-02 3.898e-03
## percent_female -1.057e+02 1.088e+02
## wage_percent_of_male 4.608e+02 1.752e+02
## t value Pr(>|t|)
## (Intercept) -0.248 0.80413
## year 0.251 0.80207
## minor_categoryFarming, Fishing, and Forestry -2.742 0.00646 **
## minor_categoryInstallation, Maintenance, and Repair -0.987 0.32445
## total_workers -2.848 0.00468 **
## percent_female -0.972 0.33181
## wage_percent_of_male 2.631 0.00893 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13960 on 321 degrees of freedom
## Multiple R-squared: 0.09427, Adjusted R-squared: 0.07734
## F-statistic: 5.569 on 6 and 321 DF, p-value: 1.646e-05
The factors (or variables) in the model that contribute to total_earnings_female are:
minor_categoryFarming, Fishing, and Forestry: being in
this occupation category is associated with a decrease in earnings
compared to other categories.total_workers: an increase in total workers is
associated with a decrease in earnings.wage_percent_of_male: an increase in the wage percent
of male workers is associated with an increase in female earnings.The other variables in the model, year and percent_female, do not appear to be significant contributors to female earnings.
However, it is worth noting that the R-squared value of 0.09427 indicates that the model explains only about 9.4% of the variance in the outcome variable, which is relatively low. This indicates that these variables explain only a small proportion of the variation in total earnings of male workers.
Check the assumptions of female_model:
par(mfrow=c(2,2))
plot(female_model)
Linearity: Based on the Residuals vs Fitted scatterplot, we can see that it does not look linear as the residuals seem not to be randomly scattered around zero, most of the data clustered in a point.
Independence: We can assume that the observations are independent as long as each earnings with other predictor variables are likely to be unique and not influenced by the presence of other data.
Normality: Based on the Normal Q-Q plot, the plot shows that the residuals are likely to be normally distributed. The points seem to be near to the line except for the last part where the data is gradually distant from the line. Also, from the Shapiro-Wilk test, as p-value < .05 (2.2e-16 < .05), we reject the null hypothesis and there is evidence to suggest that the residuals are not normally distributed. The normality assumption is likely to be broken.
shapiro.test(summary(female_model)$residuals)
##
## Shapiro-Wilk normality test
##
## data: summary(female_model)$residuals
## W = 0.83743, p-value < 2.2e-16
bptest(female_model)
##
## studentized Breusch-Pagan test
##
## data: female_model
## BP = 8.4992, df = 6, p-value = 0.2038
Based on the assessment of the assumptions, it may not be appropriate to use a linear model for this data. The linearity and normality assumptions appear to be violated. A nonlinear model or a different type of regression model may be more appropriate for this data. It’s also possible that further data cleaning or transformation may be necessary before building a regression model.
Linear regression model with total_earnings_male
as the response variable:
male_model <- lm(total_earnings_male~
year+minor_category+total_workers+percent_female+wage_percent_of_male,
data = jobs_gender_df2)
summary(male_model)
##
## Call:
## lm(formula = total_earnings_male ~ year + minor_category + total_workers +
## percent_female + wage_percent_of_male, data = jobs_gender_df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24373 -7547 -2173 6626 43669
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -1.252e+06 1.197e+06
## year 6.436e+02 5.951e+02
## minor_categoryFarming, Fishing, and Forestry -9.541e+03 2.995e+03
## minor_categoryInstallation, Maintenance, and Repair 1.789e+03 1.399e+03
## total_workers -7.860e-03 3.341e-03
## percent_female -6.561e+01 9.321e+01
## wage_percent_of_male 1.830e+01 1.501e+02
## t value Pr(>|t|)
## (Intercept) -1.046 0.29645
## year 1.081 0.28032
## minor_categoryFarming, Fishing, and Forestry -3.185 0.00159 **
## minor_categoryInstallation, Maintenance, and Repair 1.279 0.20192
## total_workers -2.353 0.01925 *
## percent_female -0.704 0.48202
## wage_percent_of_male 0.122 0.90308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11970 on 321 degrees of freedom
## Multiple R-squared: 0.09813, Adjusted R-squared: 0.08127
## F-statistic: 5.821 on 6 and 321 DF, p-value: 8.949e-06
Based on the coefficients of the model, the variables that have a significant effect on the total earnings of male workers are:
minor_category: Male workers in the “Farming, Fishing,
and Forestry” category earn about 9,541 less than those in the reference
category, which is “Management, Business, and Financial Operations”
category. On the other hand, male workers in the “Installation,
Maintenance, and Repair” category earn about 1,789 more than the
reference category.total_workers: An increase of one worker is associated
with a decrease of $7.86 in total earnings of male workers.wage_percent_of_male: An increase of one percentage
point in the wage of male workers is associated with an increase of
$18.30 in total earnings of male workers.However, it’s worth noting that the R-squared value of the model is low (0.09813), which indicates that these variables explain only a small proportion of the variation in total earnings of male workers.
Check the assumptions of male_model:
par(mfrow=c(2,2))
plot(male_model)
Linearity: Based on the Residuals vs Fitted scatterplot, we can see that it does not look linear as the residuals seem to clustered a lot at the end of the graph, and at the beginning of the graph, the dât are just a little.
Independence: We can assume that the observations are independent as long as each earnings with other predictor variables are likely to be unique and not influenced by the presence of other data.
Normality: Based on the Normal Q-Q plot, the plot shows that the residuals are likely to be normally distributed. The points seem to be near to the line except for the last half part where the data is gradually distant from the line. Also, from the Shapiro-Wilk test, as p-value < .05 (1.386e-07 < .05), we reject the null hypothesis and there is evidence to suggest that the residuals are not normally distributed. The normality assumption is likely to be broken.
shapiro.test(summary(male_model)$residuals)
##
## Shapiro-Wilk normality test
##
## data: summary(male_model)$residuals
## W = 0.96165, p-value = 1.386e-07
bptest(male_model)
##
## studentized Breusch-Pagan test
##
## data: male_model
## BP = 19.996, df = 6, p-value = 0.002774
Based on the violation of multiple assumptions, it may not be appropriate to use a linear model for this data. The linearity and homoscedasticity assumptions seems to be violated. Other models may be more appropriate.
Final Conclusion and Answer to the Research Question:
Based on the analysis, the following factors contribute to the gender
pay gap within the
Natural Resources, Construction, and Maintenance
industry:
minor_category: Workers in the “Farming, Fishing,
and Forestry” category earn less than those in the reference category,
which is the “Management, Business, and Financial Operations” category.
This difference is more pronounced in male workers than female
workers.
total_workers: An increase in total workers is
associated with a decrease in earnings.
wage_percent_of_male: An increase in the wage
percent of male workers is associated with an increase in female
earnings.
It is worth noting that both two model only explains about 9.4% and 9,8% of the variance in the outcome variable, which are relatively low. This indicates that these variables explain only a small proportion of the variation in total earnings of both female and male workers.
Additionally, it may not be appropriate to use a linear model for this data because the linearity and normality assumptions appear to be violated. A nonlinear model or a different type of regression model may be more appropriate for this data. It’s also possible that further data cleaning or transformation may be necessary before building a regression model.
#Median of total earnings by major category for male and female workers
earnings_by_occ_gender <- jobs_gender %>%
group_by(major_category) %>%
summarise(median_earnings_male = median(total_earnings_male),
median_earnings_female = median(total_earnings_female))
# Calculate the difference in median earnings between male and female workers
earnings_by_occ_gender <- earnings_by_occ_gender %>%
mutate(median_earnings_diff = median_earnings_male - median_earnings_female)
ggplot(data = earnings_by_occ_gender)+
geom_col(mapping = aes(x = major_category, y = median_earnings_diff), fill = "orange")+
labs(title = "Median Earnings Difference by Major Category",
x = "Major Category", y = "Median Earnings Difference", caption = "Data Source: [1]") +
coord_flip()
Final Conclusion and Answer to the Research Question:
Based on the result from the research question 3, the largest pay differences based on gender are in the major categories of “Management, Business, and Financial”. Therefore, for this analyzing, we want to explore if there is any relationship between the percentage of female workers in the major categories of “Management, Business, and Financial” and the average earnings of workers in that major.
#Filter the dataset to include only this major category:
df_4 <- filter(jobs_gender, major_category == "Management, Business, and Financial")
#Calculate the correlation coefficient between the percent_female and total_earnings variables
cor <- cor(df_4$percent_female, df_4$total_earnings)
print(cor)
## [1] -0.2954133
A correlation coefficient of -0.2954133 suggests a negative relationship between the percentage of female workers in the “Management, Business, and Financial” major category and the average earnings of workers in that major. A negative correlation coefficient means that as the percentage of female workers in the major category increases, the average earnings tend to decrease.
ggplot(data = df_4) +
geom_point(mapping = aes(x = percent_female, y = total_earnings)) +
labs(x = "Percentage of female workers", y = "Average earnings", title = "Relationship between the percentage of female workers and the average earnings of workers in the major categories of Management, Business, and Financial")
Final Conclusion and Answer to the Research Question:
Overall, based on the above analysis results, it can be summed up that the dataset analyzed indicates that a gender pay gap exists across various industries, with men consistently earning higher salaries than women. The largest pay differences between gender are found in the major categories of “Management, Business, and Financial”, followed by “Engineering” and “Computers & Mathematics”, while the lowest pay differences are found in the major categories of “Natural Resources, Construction, and Maintenance”, followed by “Service”. Factors contributing to the gender pay gap include the measurable factors of educational attainment, occupational segregation, work experience, total workers, and the wage percent of male workers, but other difficult-to-measure factors such as gender discrimination may also contribute to the ongoing wage discrepancy. Additionally, the percentage of female workers in a major category is negatively correlated with average earnings, but the low explanatory power of the regression models suggests that further research is needed to better understand the factors driving the gender pay gap.
For future work:
Commentary of limitations: we can see that there are some limitations to this dataset:
Data Source: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-05 posted by TidyTuesday who collected it from Census Bureau and Bureau of Labor here and here in March 2019.↩︎
Carolina Aragão. (2023). Gender pay gap in U.S. hasn’t changed much in two decades. Pew Research Center.↩︎
Carolina Aragão. (2023). Gender pay gap in U.S. hasn’t changed much in two decades. Pew Research Center.↩︎
Carolina Aragão. (2023). Gender pay gap in U.S. hasn’t changed much in two decades. Pew Research Center.↩︎