Introduction

  1. Research question:
  1. Why we choose this topic?
  1. About the dataset:
  1. Data dictionary:
Variable Class Discription
year interger Year
occupation character Specific job/career
major_category character Broad category of occupation
minor_category character Fine category of occupation
total_workers double Total estimated full-time workers > 16 years old
workers_male double Estimated MALE full-time workers > 16 years old
workers_female double Estimated FEMALE full-time workers > 16 years old
percent_female double The percent of females for specific occupation
total_earnings double Total estimated median earnings for full-time workers > 16 years old
total_earnings_male double Estimated MALE median earnings for full-time workers > 16 years old
total_earnings_female double Estimated FEMALE median earnings for full-time workers > 16 years old
wage_percent_of_male double Female wages as percent of male wages - NA for occupations with small sample size
  1. Load the data and take a look at the first 5 rows:
jobs_gender <- read.csv("../data/jobs_gender.csv")
head(jobs_gender,5)

Preliminary Exploratory Data Analysis

glimpse(jobs_gender)
## Rows: 2,088
## Columns: 12
## $ year                  <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ occupation            <chr> "Chief executives", "General and operations mana…
## $ major_category        <chr> "Management, Business, and Financial", "Manageme…
## $ minor_category        <chr> "Management", "Management", "Management", "Manag…
## $ total_workers         <int> 1024259, 977284, 14815, 43015, 754514, 44198, 10…
## $ workers_male          <int> 782400, 681627, 8375, 17775, 440078, 16141, 7287…
## $ workers_female        <int> 241859, 295657, 6440, 25240, 314436, 28057, 3683…
## $ percent_female        <dbl> 23.6, 30.3, 43.5, 58.7, 41.7, 63.5, 33.6, 27.5, …
## $ total_earnings        <int> 120254, 73557, 67155, 61371, 78455, 74114, 62187…
## $ total_earnings_male   <int> 126142, 81041, 71530, 75190, 91998, 90071, 66579…
## $ total_earnings_female <int> 95921, 60759, 65325, 55860, 65040, 66052, 55079,…
## $ wage_percent_of_male  <dbl> 76.04208, 74.97316, 91.32532, 74.29179, 70.69719…
# check for NA value in the dataset
colSums(is.na(jobs_gender))
##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     4                    65                   846
#replace NA value by mean of the column's value in the dataset
jobs_gender <- jobs_gender %>% 
  mutate(total_earnings_male=ifelse(is.na(total_earnings_male),mean(total_earnings_male,na.rm=TRUE), total_earnings_male),
         total_earnings_female=ifelse(is.na(total_earnings_female),mean(total_earnings_female,na.rm=TRUE), total_earnings_female),
         wage_percent_of_male=ifelse(is.na(wage_percent_of_male),mean(wage_percent_of_male,na.rm=TRUE), wage_percent_of_male))

head(jobs_gender,5)
# Summary Statistics
summary(jobs_gender)
##       year       occupation        major_category     minor_category    
##  Min.   :2013   Length:2088        Length:2088        Length:2088       
##  1st Qu.:2014   Class :character   Class :character   Class :character  
##  Median :2014   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2014                                                           
##  3rd Qu.:2015                                                           
##  Max.   :2016                                                           
##  total_workers      workers_male     workers_female    percent_female  
##  Min.   :    658   Min.   :      0   Min.   :      0   Min.   :  0.00  
##  1st Qu.:  18687   1st Qu.:  10765   1st Qu.:   2364   1st Qu.: 10.73  
##  Median :  58997   Median :  32302   Median :  15238   Median : 32.40  
##  Mean   : 196055   Mean   : 111515   Mean   :  84540   Mean   : 36.00  
##  3rd Qu.: 187415   3rd Qu.: 102644   3rd Qu.:  63327   3rd Qu.: 57.31  
##  Max.   :3758629   Max.   :2570385   Max.   :2290818   Max.   :100.00  
##  total_earnings   total_earnings_male total_earnings_female
##  Min.   : 17266   Min.   : 12147      Min.   :  7447       
##  1st Qu.: 32410   1st Qu.: 35711      1st Qu.: 29364       
##  Median : 44437   Median : 46837      Median : 40903       
##  Mean   : 49762   Mean   : 53138      Mean   : 44681       
##  3rd Qu.: 61012   3rd Qu.: 64846      3rd Qu.: 53613       
##  Max.   :201542   Max.   :231420      Max.   :166388       
##  wage_percent_of_male
##  Min.   : 50.87      
##  1st Qu.: 82.86      
##  Median : 84.03      
##  Mean   : 84.03      
##  3rd Qu.: 86.77      
##  Max.   :117.40

From the summary statistics, we can see that the dataset covers 2,088 rows of occupations and it was collected between 2013 and 2016. There is few notable statistics in the summary above: the total number of workers in the dataset ranges from 658 to 3,758,629 with a mean of 196,055. The percentage of female workers in the dataset ranges from 0% to 100%, with a mean of 36%. This indicates that there is some variation in gender representation across different occupations. Also, the percentage of the male wage that the female wage represents ranges from 50.87% to 117.40%, with a mean of 84.03%. This suggests that, on average, female workers earn about 84% of what male workers earn in the same occupation. However, it’s worth noting that there is likely a lot of variation in this figure across different occupations.

The below data visualizations are formed based on all 2088 rows in the dataset, including duplicated occupations. Detailed exloratory data analysis on different occupations will be analysed in the final write-up data project.

# Histogram of totall earnings across all occupations
ggplot(data = jobs_gender) +
  geom_histogram(mapping = aes(x = total_earnings), fill = "steelblue", bins = 30) +
  labs(title = "Distribution of the total earnings across all occupations", x = "Total earnings", y = "Count")

This data visualization is a histogram of total earnings, showing the distribution of total earnings through all 2088 occupations in the dataset. The histogram is constructed using 30 bins and a bluesteel fill color. This histogram showed a right-skewed pattern with the range of total earnings from 17266 to 201542. It can be observed that nearly the majority of the occupations earn from above 25000 to over 50000, with the center of the histogram at around 30000 (or 30000 is the earning that nearly 330 occupations are at). And there are just few occupations that can earn above 100000 in a year, which also shows that the values are slightly spread out.

# Bar chart of major categories
ggplot(data=jobs_gender) +
  geom_bar(mapping = aes(x=major_category),fill='steelblue') +
  labs(title="Number of occupations in each major category", x = "Major Category", y = "Number of occupations") + coord_flip()

The graph above shows that there are 8 different types of major categories in this dataset. The categories are sorted in alphabetical order on the x-axis. And the distribution of number of occupations per major category ranges from over 110 to nearly 450 occupations. In detail, Production, Transportation, and Material Moving has most occupations with nearly 450 while Healthcare Practitioners and Technical has the least occupations of over 110.

# Scatter plot matrix of earnings and percent female
ggplot(data = jobs_gender) +
  geom_point(mapping = aes(x = total_earnings, y = percent_female), alpha = 0.4, color = "steelblue") +
  geom_smooth(mapping = aes(x = total_earnings, y = percent_female),method = "lm", se = FALSE, color='red') + labs(title = "Relationship between earnings and percent female", x = "Total earnings", y = "Female percent")
## `geom_smooth()` using formula = 'y ~ x'

The bi-variate data visualization above demonstrates a scatter plot of relationship between earnings and percent female with the best fit line. The best fit line slopes downward, which suggests a negative relationship, meaning that as one variable increases, the other tends to decrease. Also, from the plot, the points on the scatterplot are moderately clustered around the line and tend to spread out at the higher total earnings. This suggests a moderate relationships between total earnings and percent of female. And there seems to be hardly any outliers in the plot.

# Box plot of wage percent of male by major category
ggplot(data = jobs_gender) +
  geom_boxplot(mapping = aes(x=major_category,y=wage_percent_of_male),fill='steelblue') +
  labs(title = "Wage percent of male by major category",x= "Major category", y = "Wage percent of male") + coord_flip()

From the given boxplot, we can notice that Production, Transportation, and Material Moving and Natural Resources, Construction, and Maintenance ’s box plots are shorter (nealy none) and have more outliers than other categories. This mean that these two categories have the data less spread out and more variable than the others, interpreting that the distribution of wages for those category is highly skewed and that there are some high earners who are making significantly more than the typical earners in those category. Moreover, observing other categories, the positions of the boxplots quite overlap with each other, which means that the medians are similar. The whiskers of Management, Business, and Financial and Education, Legal, Community Service, Arts, and Media are longer than the others, which may indicate that the data are more variable in these 2 categories.

From the above exploratory data analysis, we can see that:

Research question 1: How has the gender composition of different majors changed over time?

#distinguish the major
jobs_gender %>% distinct(major_category)
# Group the data by major category and year, and calculate the mean percentage of female workers
jobs_gender_df1 <- jobs_gender %>%
  group_by(major_category, year) %>%
  summarize(mean_percent_female = mean(percent_female))
## `summarise()` has grouped output by 'major_category'. You can override using
## the `.groups` argument.
jobs_gender_df1 
#Visualize the data of percent_female in many major categories over time
jobs_gender_df1 %>% ggplot() +
  geom_col(aes(x = year, y = mean_percent_female, group = major_category, fill = major_category)) +
  labs(x = "Year", y = "Percentage of female workers", 
       title = "Percent of female in 8 major categories over time",
       caption = "Data Source: [1]") + theme_bw()

Final Conclusion and Answer to the Research Question:In short, it could be said that the female percent in each major category stay nearly the same from 2013 to 2016 in which Natural Resources, Construction, and Maintenance has the least and Healthcare Practitioners and Technical has the most female workers among the 8 major categories.

Research question 2: What factors contribute to the gender pay gap within a particular industry?

For this question, we will pick Natural Resources, Construction, and Maintenance as the industry to analyze. From question 1’s answer, it can easily be seen that this industry has the least female percent among all major categories. Therefore, we want to see if female_percent will influence the pay gap between genders along with other factors in this major category or not.

We gonna check how much is the gap between total earnings of female and male first:

Scatterplots of total earnings for male and female workers within the Natural Resources, Construction, and Maintenanceindustry:

#create a dataset with major_category Natural Resources, Construction, and Maintenance
jobs_gender_df2 <- jobs_gender %>% 
  filter(major_category == "Natural Resources, Construction, and Maintenance")
#Scatterplot of total earnings for male workers within Natural Resources, Construction, and Maintenance:
jobs_gender_df2 %>% ggplot() +
  geom_point(aes(x=total_earnings,y=total_earnings_female,color = total_earnings_female)) + 
  geom_smooth(aes(x=total_earnings,y=total_earnings_female),method="lm",se=FALSE,color="red") +
  labs(x="Total earnings", y = "Total earnings of female",
       title = "Relationship between total earnings and total earnings of female",
       caption = "Data Source: [1]")
## `geom_smooth()` using formula = 'y ~ x'

#Scatterplot of total earnings for male workers within Natural Resources, Construction, and Maintenance:
jobs_gender_df2 %>% ggplot() +
  geom_point(aes(x=total_earnings,y=total_earnings_male,color = total_earnings_male)) + 
  geom_smooth(aes(x=total_earnings,y=total_earnings_male),method="lm",se=FALSE,color="red") +
  labs(x="Total earnings", y = "Total earnings of male",
       title = "Relationship between total earnings and total earnings of male",
       caption = "Data Source: [1]")
## `geom_smooth()` using formula = 'y ~ x'

It can be concluded that there is a significant difference between the earnings of male and female workers. Female workers seem to have a wider range of earnings than male workers, with some females earning significantly less and others earning significantly more than the male workers. Also, the two best fit lines indicate that male workers have higher earnings than female workers on average as total_earnings_female increases at a slower rate than the line for total_earnings_male.

Now, we will go deep into analysing which factors contributing to this pay gap by using linear regression model on both outcome variables with some predictor variables:

For this analyzing, we take into accounts all other variables to be the predictor variables except for total_earnings,total_earnings_female and total_earnings_male (2 of them are the main outcome variables, and the other is the sum of these 2 variables). After regarding the fact that there are too many occupations and total_workers is the sum of workers_male and workers_female, we decide to eliminate occupation, workers_female and workers_female out of the predictor variables. Therefore, the predictor variables for the linear models are year, minor_category, total_workers, percent_female, and wage_percent_of_male.

Linear regression model with total_earnings_female as the response variable:

female_model <- lm(total_earnings_female~
                     year+minor_category+total_workers+percent_female+wage_percent_of_male, 
                   data = jobs_gender_df2)
summary(female_model)
## 
## Call:
## lm(formula = total_earnings_female ~ year + minor_category + 
##     total_workers + percent_female + wage_percent_of_male, data = jobs_gender_df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29512  -8675   -684   4277 116037 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                         -3.468e+05  1.397e+06
## year                                                 1.742e+02  6.944e+02
## minor_categoryFarming, Fishing, and Forestry        -9.581e+03  3.495e+03
## minor_categoryInstallation, Maintenance, and Repair -1.611e+03  1.632e+03
## total_workers                                       -1.110e-02  3.898e-03
## percent_female                                      -1.057e+02  1.088e+02
## wage_percent_of_male                                 4.608e+02  1.752e+02
##                                                     t value Pr(>|t|)   
## (Intercept)                                          -0.248  0.80413   
## year                                                  0.251  0.80207   
## minor_categoryFarming, Fishing, and Forestry         -2.742  0.00646 **
## minor_categoryInstallation, Maintenance, and Repair  -0.987  0.32445   
## total_workers                                        -2.848  0.00468 **
## percent_female                                       -0.972  0.33181   
## wage_percent_of_male                                  2.631  0.00893 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13960 on 321 degrees of freedom
## Multiple R-squared:  0.09427,    Adjusted R-squared:  0.07734 
## F-statistic: 5.569 on 6 and 321 DF,  p-value: 1.646e-05

The factors (or variables) in the model that contribute to total_earnings_female are:

The other variables in the model, year and percent_female, do not appear to be significant contributors to female earnings.

However, it is worth noting that the R-squared value of 0.09427 indicates that the model explains only about 9.4% of the variance in the outcome variable, which is relatively low. This indicates that these variables explain only a small proportion of the variation in total earnings of male workers.

Check the assumptions of female_model:

par(mfrow=c(2,2))
plot(female_model)

shapiro.test(summary(female_model)$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  summary(female_model)$residuals
## W = 0.83743, p-value < 2.2e-16
bptest(female_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  female_model
## BP = 8.4992, df = 6, p-value = 0.2038

Based on the assessment of the assumptions, it may not be appropriate to use a linear model for this data. The linearity and normality assumptions appear to be violated. A nonlinear model or a different type of regression model may be more appropriate for this data. It’s also possible that further data cleaning or transformation may be necessary before building a regression model.

Linear regression model with total_earnings_male as the response variable:

male_model <- lm(total_earnings_male~
                     year+minor_category+total_workers+percent_female+wage_percent_of_male, 
                   data = jobs_gender_df2)
summary(male_model)
## 
## Call:
## lm(formula = total_earnings_male ~ year + minor_category + total_workers + 
##     percent_female + wage_percent_of_male, data = jobs_gender_df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24373  -7547  -2173   6626  43669 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                         -1.252e+06  1.197e+06
## year                                                 6.436e+02  5.951e+02
## minor_categoryFarming, Fishing, and Forestry        -9.541e+03  2.995e+03
## minor_categoryInstallation, Maintenance, and Repair  1.789e+03  1.399e+03
## total_workers                                       -7.860e-03  3.341e-03
## percent_female                                      -6.561e+01  9.321e+01
## wage_percent_of_male                                 1.830e+01  1.501e+02
##                                                     t value Pr(>|t|)   
## (Intercept)                                          -1.046  0.29645   
## year                                                  1.081  0.28032   
## minor_categoryFarming, Fishing, and Forestry         -3.185  0.00159 **
## minor_categoryInstallation, Maintenance, and Repair   1.279  0.20192   
## total_workers                                        -2.353  0.01925 * 
## percent_female                                       -0.704  0.48202   
## wage_percent_of_male                                  0.122  0.90308   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11970 on 321 degrees of freedom
## Multiple R-squared:  0.09813,    Adjusted R-squared:  0.08127 
## F-statistic: 5.821 on 6 and 321 DF,  p-value: 8.949e-06

Based on the coefficients of the model, the variables that have a significant effect on the total earnings of male workers are:

However, it’s worth noting that the R-squared value of the model is low (0.09813), which indicates that these variables explain only a small proportion of the variation in total earnings of male workers.

Check the assumptions of male_model:

par(mfrow=c(2,2))
plot(male_model)

shapiro.test(summary(male_model)$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  summary(male_model)$residuals
## W = 0.96165, p-value = 1.386e-07
bptest(male_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  male_model
## BP = 19.996, df = 6, p-value = 0.002774

Based on the violation of multiple assumptions, it may not be appropriate to use a linear model for this data. The linearity and homoscedasticity assumptions seems to be violated. Other models may be more appropriate.

Final Conclusion and Answer to the Research Question:

Based on the analysis, the following factors contribute to the gender pay gap within the Natural Resources, Construction, and Maintenance industry:

It is worth noting that both two model only explains about 9.4% and 9,8% of the variance in the outcome variable, which are relatively low. This indicates that these variables explain only a small proportion of the variation in total earnings of both female and male workers.

Additionally, it may not be appropriate to use a linear model for this data because the linearity and normality assumptions appear to be violated. A nonlinear model or a different type of regression model may be more appropriate for this data. It’s also possible that further data cleaning or transformation may be necessary before building a regression model.

Research question 3: How do the earnings of male and female workers vary within the same major, and what factors might explain these differences?

#Median of total earnings by major category for male and female workers
earnings_by_occ_gender <- jobs_gender %>%
  group_by(major_category) %>%
  summarise(median_earnings_male = median(total_earnings_male),
            median_earnings_female = median(total_earnings_female))
# Calculate the difference in median earnings between male and female workers 
earnings_by_occ_gender <- earnings_by_occ_gender %>%
  mutate(median_earnings_diff = median_earnings_male - median_earnings_female)
ggplot(data = earnings_by_occ_gender)+
  geom_col(mapping = aes(x = major_category, y = median_earnings_diff), fill = "orange")+
  labs(title = "Median Earnings Difference by Major Category",
       x = "Major Category", y = "Median Earnings Difference", caption = "Data Source: [1]") +
  coord_flip()

Final Conclusion and Answer to the Research Question:

Research question 4: Is there a relationship between the percentage of female workers in an major and the average earnings of workers in that major?

Based on the result from the research question 3, the largest pay differences based on gender are in the major categories of “Management, Business, and Financial”. Therefore, for this analyzing, we want to explore if there is any relationship between the percentage of female workers in the major categories of “Management, Business, and Financial” and the average earnings of workers in that major.

#Filter the dataset to include only this major category:
df_4 <- filter(jobs_gender, major_category == "Management, Business, and Financial")
#Calculate the correlation coefficient between the percent_female and total_earnings variables
cor <- cor(df_4$percent_female, df_4$total_earnings)
print(cor)
## [1] -0.2954133

A correlation coefficient of -0.2954133 suggests a negative relationship between the percentage of female workers in the “Management, Business, and Financial” major category and the average earnings of workers in that major. A negative correlation coefficient means that as the percentage of female workers in the major category increases, the average earnings tend to decrease.

ggplot(data = df_4) +
  geom_point(mapping = aes(x = percent_female, y = total_earnings)) +
  labs(x = "Percentage of female workers", y = "Average earnings", title = "Relationship between the percentage of female workers and the average earnings of workers in the major categories of Management, Business, and Financial")

Final Conclusion and Answer to the Research Question:

Conclusion

Data Sources and Works Cited


  1. Data Source: https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-03-05 posted by TidyTuesday who collected it from Census Bureau and Bureau of Labor here and here in March 2019.↩︎

  2. Carolina Aragão. (2023). Gender pay gap in U.S. hasn’t changed much in two decades. Pew Research Center.↩︎

  3. Carolina Aragão. (2023). Gender pay gap in U.S. hasn’t changed much in two decades. Pew Research Center.↩︎

  4. Carolina Aragão. (2023). Gender pay gap in U.S. hasn’t changed much in two decades. Pew Research Center.↩︎