1. Synopsis

Women have been challenged by inequality in the workforce over the years. Until modern times, legal and cultural practices, combined with the inertia of longstanding religious and educational conventions, restricted women’s entry and participation in the workforce.

The gender pay gap is the difference between what men and women are paid. Most commonly, it refers to the median annual pay of all women who work full time and year-round, compared to the pay of a similar cohort of men. It is important to identify which employment sectors and occupations have a significant gender pay gap, and whether the gap has narrowed over time.

We performed exploratory data analysis on the historical data about women’s earnings and employment status with the help of summaries and graphs to discover patterns and to spot anomalies.

2. Packages Required

The following packages have been used for the analysis:

library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(readxl)
library(tidyverse)
library(kableExtra)
library(shiny)
library(plotly)
library(ggalt)

tidyr : For changing the layout of the data sets, to convert data into the tidy format.

DT : For HTML display of data.

ggplot2 : For customizable graphical representation.

dplyr : For data manipulation.

readxl : For reading the excel file.

tidyverse : Collection of R packages designed for data science that works harmoniously with other packages.

kableExtra : To display table in a fancy way.

shiny : For interactive graphs and dashboards.

plotly : To convert ggplots into more interactive and stylish ones.

ggalt : For making dumbbell graphs.

3. Data Preparation

a. Data Source

The data about women in the workforce comes from the Bureau of Labor Statistics and the Census Bureau. It includes historical data about women’s earnings and employment status, as well as detailed information about specific occupations and earnings from 2013 to 2016.

b. Explanation of Source Data

The data used in the analysis can be found here. The data consists of three tables.

The first one contains information about the major employment sectors, occupations, proportion of women and the percentage earnings of women in that occupation. It has 2088 observations and 12 variables.

jobs_gender.csv

VARIABLE               CLASS      DESCRIPTION
year                   integer    Year
occupation             character  Specific job/career
major_category         character  Broad category of occupation
minor_category         character  Fine category of occupation
total_workers          double     Total estimated full-time workers > 16 years old
workers_male           double     Estimated MALE full-time workers > 16 years old
workers_female         double     Estimated FEMALE full-time workers > 16 years old
percent_female         double     The percent of females for specific occupation
total_earnings         double     Total estimated median earnings for full-time workers > 16 years old
total_earnings_male    double     Estimated MALE median earnings for full-time workers > 16 years old
total_earnings_female  double     Estimated FEMALE median earnings for full-time workers > 16 years old
wage_percent_of_male   double     Female wages as percent of male wages - NA for occupations with small sample size

The second table describes the percent of earnings of women with respect to men, for different age groups over the span of time. It has 264 observations and 3 variables.

earnings_female.csv
VARIABLE  CLASS      DESCRIPTION
Year      integer    Year
group     character  Age group
percent   double     Female salary percent of male salary

The third table contains the proportion of women and men working part time and full time over the years. It has 49 observations and 7 variables.

employed_gender.csv
VARIABLE          CLASS   DESCRIPTION
year              double  Year
total_full_time   double  Percent of total employed people usually working full time
total_part_time   double  Percent of total employed people usually working part time
full_time_female  double  Percent of employed women usually working full time
part_time_female  double  Percent of employed women usually working part time
full_time_male    double  Percent of employed men usually working full time
part_time_male    double  Percent of employed men usually working part time

c. Data Cleaning Process

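We read the data from the three tables. Since we use most of the character fields as categorical variables, we keep stringsAsFactors set to TRUE. This was the default before R 4.0.0; newer versions of R need the argument passed explicitly.

jobs_gender <- read.csv("jobs_gender.csv", stringsAsFactors = TRUE)
earnings_female <- read.csv("earnings_female.csv", stringsAsFactors = TRUE)
employed_gender <- read.csv("employed_gender.csv", stringsAsFactors = TRUE)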

We now take a look at the structure of the data and also their summary statistics. The summaries would help us spot any anomalies like negative values. It would also indicate the fields with missing values and their counts.

str(jobs_gender)
## 'data.frame':    2088 obs. of  12 variables:
##  $ year                 : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ occupation           : Factor w/ 522 levels "Accountants and auditors",..: 69 218 265 6 289 415 5 87 178 82 ...
##  $ major_category       : Factor w/ 8 levels "Computer, Engineering, and Science",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ minor_category       : Factor w/ 23 levels "Architecture and Engineering",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ total_workers        : int  1024259 977284 14815 43015 754514 44198 109703 489048 990611 14656 ...
##  $ workers_male         : int  782400 681627 8375 17775 440078 16141 72873 354369 460842 3387 ...
##  $ workers_female       : int  241859 295657 6440 25240 314436 28057 36830 134679 529769 11269 ...
##  $ percent_female       : num  23.6 30.3 43.5 58.7 41.7 63.5 33.6 27.5 53.5 76.9 ...
##  $ total_earnings       : int  120254 73557 67155 61371 78455 74114 62187 99167 70456 71927 ...
##  $ total_earnings_male  : int  126142 81041 71530 75190 91998 90071 66579 101318 90278 97552 ...
##  $ total_earnings_female: int  95921 60759 65325 55860 65040 66052 55079 90940 57406 68207 ...
##  $ wage_percent_of_male : num  76 75 91.3 74.3 70.7 ...
summary(jobs_gender)
##       year                                               occupation  
##  Min.   :2013   Accountants and auditors                      :   4  
##  1st Qu.:2014   Actors                                        :   4  
##  Median :2014   Actuaries                                     :   4  
##  Mean   :2014   Adhesive bonding machine operators and tenders:   4  
##  3rd Qu.:2015   Administrative services managers              :   4  
##  Max.   :2016   Advertising and promotions managers           :   4  
##                 (Other)                                       :2064  
##                                           major_category
##  Production, Transportation, and Material Moving :444   
##  Natural Resources, Construction, and Maintenance:328   
##  Sales and Office                                :280   
##  Service                                         :272   
##  Computer, Engineering, and Science              :236   
##  Management, Business, and Financial             :232   
##  (Other)                                         :296   
##                                 minor_category total_workers    
##  Production                            : 308   Min.   :    658  
##  Office and Administrative Support     : 208   1st Qu.:  18687  
##  Construction and Extraction           : 152   Median :  58997  
##  Installation, Maintenance, and Repair : 144   Mean   : 196055  
##  Healthcare Practitioners and Technical: 128   3rd Qu.: 187415  
##  Management                            : 120   Max.   :3758629  
##  (Other)                               :1028                    
##   workers_male     workers_female    percent_female   total_earnings  
##  Min.   :      0   Min.   :      0   Min.   :  0.00   Min.   : 17266  
##  1st Qu.:  10765   1st Qu.:   2364   1st Qu.: 10.73   1st Qu.: 32410  
##  Median :  32302   Median :  15238   Median : 32.40   Median : 44437  
##  Mean   : 111515   Mean   :  84540   Mean   : 36.00   Mean   : 49762  
##  3rd Qu.: 102644   3rd Qu.:  63327   3rd Qu.: 57.31   3rd Qu.: 61012  
##  Max.   :2570385   Max.   :2290818   Max.   :100.00   Max.   :201542  
##                                                                       
##  total_earnings_male total_earnings_female wage_percent_of_male
##  Min.   : 12147      Min.   :  7447        Min.   : 50.88      
##  1st Qu.: 35702      1st Qu.: 28872        1st Qu.: 77.56      
##  Median : 46825      Median : 40191        Median : 85.16      
##  Mean   : 53138      Mean   : 44681        Mean   : 84.03      
##  3rd Qu.: 65015      3rd Qu.: 54813        3rd Qu.: 90.62      
##  Max.   :231420      Max.   :166388        Max.   :117.40      
##  NA's   :4           NA's   :65            NA's   :846

We see that the first table, ‘jobs_gender’, has 4 missing values in the column ‘total_earnings_male’, 65 missing values in ‘total_earnings_female’ and 846 missing values in ‘wage_percent_of_male’.

Since 4 and 65 correspond to 0.19% and 3.11% of the dataset respectively, we can remove those observations from further analysis. However, 846 is a significant fraction (about 40.5%), so we do not remove those observations. Instead, their values can be calculated as total_earnings_female / total_earnings_male × 100.
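The missing-value shares quoted above can be reproduced with a few lines of base R. This is a minimal sketch on a made-up stand-in data frame (`toy` and its values are hypothetical, not rows from jobs_gender); only the final line uses the counts reported in the summary output.

```r
# Toy stand-in for jobs_gender; the values are invented for illustration.
toy <- data.frame(
  total_earnings_male   = c(50000, NA, 42000, 61000),
  total_earnings_female = c(45000, 38000, NA, NA),
  wage_percent_of_male  = c(90, NA, NA, NA)
)

# Number of NA values per column and their share of all rows, in percent.
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy) * 100, 2)
print(data.frame(na_count, na_share))

# The same arithmetic applied to the counts reported for the real table
# (4, 65 and 846 missing values out of 2088 rows).
reported_share <- round(c(4, 65, 846) / 2088 * 100, 2)
print(reported_share)
```

Running `colSums(is.na(jobs_gender))` on the real table should give the same counts that appear in the `NA's` row of the summary.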

We rename the field ‘wage_percent_of_male’ to ‘wage_percent_female_wrt_male’ for clarity.

We also see from summary() of the jobs_gender table that the minimum value of both workers_male and workers_female is 0. This indicates that there are occupations where, in this sample, only men or only women are employed.

jobs_gender %>% filter(workers_female == 0) %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1    18
jobs_gender %>% filter(workers_male == 0) %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1     3

These observations have NA values in their corresponding earnings variables, hence they are handled automatically when we drop rows with missing earnings.

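jobs_gender <- jobs_gender %>% filter(!is.na(total_earnings_male) & !is.na(total_earnings_female)) %>% rename(wage_percent_female_wrt_male = wage_percent_of_male)

# Fill each missing percentage from the same row's earnings; the mask must index both sides of the assignment.
missing_pct <- is.na(jobs_gender$wage_percent_female_wrt_male)
jobs_gender$wage_percent_female_wrt_male[missing_pct] <- jobs_gender$total_earnings_female[missing_pct] / jobs_gender$total_earnings_male[missing_pct] * 100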

We now look at the earnings_female table, which provides the percentage earnings of women in various age groups over the years.

str(earnings_female)
## 'data.frame':    264 obs. of  3 variables:
##  $ Year   : int  1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 ...
##  $ group  : Factor w/ 8 levels "16-19 years",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ percent: num  62.3 64.2 64.4 65.7 66.5 67.6 68.1 69.5 69.8 70.2 ...
summary(earnings_female)
##       Year              group       percent     
##  Min.   :1979   16-19 years:33   Min.   :56.80  
##  1st Qu.:1987   20-24 years:33   1st Qu.:69.40  
##  Median :1995   25-34 years:33   Median :75.50  
##  Mean   :1995   35-44 years:33   Mean   :76.88  
##  3rd Qu.:2003   45-54 years:33   3rd Qu.:86.90  
##  Max.   :2011   55-64 years:33   Max.   :95.40  
##                 (Other)    :66
unique(earnings_female$group)
## [1] Total, 16 years and older 16-19 years              
## [3] 20-24 years               25-34 years              
## [5] 35-44 years               45-54 years              
## [7] 55-64 years               65 years and older       
## 8 Levels: 16-19 years 20-24 years 25-34 years 35-44 years ... Total, 16 years and older

Here we find a group named “Total, 16 years and older” in the group column. This is an aggregate over all ages rather than a specific age group, so it adds nothing to a comparison between age groups. We therefore remove those rows from the data set.

earnings_female <- earnings_female %>% filter(group != "Total, 16 years and older")

We now take a look at the employed_gender table.

str(employed_gender)
## 'data.frame':    49 obs. of  7 variables:
##  $ year            : int  1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
##  $ total_full_time : num  86 85.5 84.8 84.4 84.3 84.4 84.2 83.4 83.3 83.3 ...
##  $ total_part_time : num  14 14.5 15.2 15.6 15.7 15.6 15.8 16.6 16.7 16.7 ...
##  $ full_time_female: num  75.1 74.9 73.9 73.2 73.1 73.2 73.2 72.4 72.5 72.6 ...
##  $ part_time_female: num  24.9 25.1 26.1 26.8 26.9 26.8 26.8 27.6 27.5 27.4 ...
##  $ full_time_male  : num  92.2 91.8 91.5 91.2 91.1 91.4 91.2 90.6 90.6 90.5 ...
##  $ part_time_male  : num  7.8 8.2 8.5 8.8 8.9 8.6 8.8 9.4 9.4 9.5 ...
summary(employed_gender)
##       year      total_full_time total_part_time full_time_female
##  Min.   :1968   Min.   :80.30   Min.   :14.00   Min.   :71.90   
##  1st Qu.:1980   1st Qu.:81.80   1st Qu.:16.80   1st Qu.:73.20   
##  Median :1992   Median :82.60   Median :17.40   Median :73.90   
##  Mean   :1992   Mean   :82.64   Mean   :17.36   Mean   :73.86   
##  3rd Qu.:2004   3rd Qu.:83.20   3rd Qu.:18.20   3rd Qu.:74.70   
##  Max.   :2016   Max.   :86.00   Max.   :19.70   Max.   :75.40   
##  part_time_female full_time_male  part_time_male 
##  Min.   :24.60    Min.   :86.60   Min.   : 7.80  
##  1st Qu.:25.30    1st Qu.:89.00   1st Qu.: 9.60  
##  Median :26.10    Median :89.50   Median :10.50  
##  Mean   :26.14    Mean   :89.49   Mean   :10.51  
##  3rd Qu.:26.80    3rd Qu.:90.40   3rd Qu.:11.00  
##  Max.   :28.10    Max.   :92.20   Max.   :13.40

We use the employed_gender table as is, since it shows no missing or anomalous values.

Now the data is ready for analysis.

d. Cleaned Data

The cleaned data can be found below:

jobs_gender

earnings_female

employed_gender

4. Exploratory Data Analysis

The data has been analysed for patterns and trends in the following:

  • Representation of women in various occupational sectors
  • Earnings of women as compared to men in each of these sectors
  • Representation of women in part time and full time jobs
  • Change in the above factors with respect to time

a. Analysis by occupational category

We grouped the jobs_gender data by major category, which gave us an idea of where the pay gap is largest and smallest. A few interesting observations from the analysis are:

  • In the Production, Transportation, and Material Moving category, the maximum salary earned by women is around 22% higher than the maximum salary earned by men, yet the percentage pay gap between men and women is the largest of any category. Men make up 75% of the workforce in this category, and the minimum salary earned by women is far below the minimum salary earned by men (the analysis reports a difference of 189%).

  • In the Healthcare Practitioners and Technical category, women outnumber men in the workforce but still earn around 20% less than men.

  • Further analysis indicates that the earnings of women are largely independent of the representation of women in the workforce for each category.

  • We can also see the trend that as the age of women increases, the pay gap also increases.

1. We summarize and visualize the mean earnings of women in comparison to men for each major category of occupation.

We see that men earn significantly more than women, on average, in all major occupational categories. The percentage difference is as high as 25% in categories such as Production, Transportation, and Material Moving and Management, Business, and Financial. The smallest difference (around 13%) is in Natural Resources, Construction, and Maintenance. The average pay gap is around 19% across all categories.
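The per-category comparison described above can be sketched with dplyr as follows. The data frame `toy_jobs` and its numbers are made up for illustration, and the column name `gap_pct` is hypothetical; the source does not show the code behind its plot, so this is one plausible way to compute the figures rather than the original implementation.

```r
library(dplyr)

# Toy stand-in for jobs_gender with two categories; values are invented.
toy_jobs <- data.frame(
  major_category        = c("Service", "Service",
                            "Sales and Office", "Sales and Office"),
  total_earnings_male   = c(30000, 40000, 50000, 70000),
  total_earnings_female = c(27000, 33000, 45000, 51000)
)

# Mean earnings per gender within each category, then the percentage
# by which the female mean falls short of the male mean.
gap_by_category <- toy_jobs %>%
  group_by(major_category) %>%
  summarise(
    mean_male   = mean(total_earnings_male),
    mean_female = mean(total_earnings_female)
  ) %>%
  mutate(gap_pct = (mean_male - mean_female) / mean_male * 100)

print(gap_by_category)
```

Taking the mean of occupation-level medians weights every occupation equally regardless of its size; weighting by total_workers would be a reasonable alternative.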

2. We look at the minimum and maximum salaries in each category. We are interested in learning whether these salaries are earned by a woman or a man.

Comparing the maximum and minimum earnings in each major category, we see that women earn the minimum salary in most categories. They earn the maximum salary in about three major categories. In the category of Production, Transportation, and Material Moving, a woman earns the maximum as well as the minimum salary. This is also the category with the largest difference in mean salary.
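A minimal sketch of the minimum and maximum comparison follows, again on invented data. The helper columns `lowest_earner` and `highest_earner` are hypothetical names introduced here, not fields of the original tables.

```r
library(dplyr)

# Toy stand-in for jobs_gender; values are invented for illustration.
toy_jobs <- data.frame(
  major_category        = c("Service", "Service",
                            "Sales and Office", "Sales and Office"),
  total_earnings_male   = c(30000, 40000, 50000, 70000),
  total_earnings_female = c(27000, 45000, 45000, 51000)
)

# Per category: lowest and highest median earnings for each gender,
# and a label saying which gender holds each extreme.
extremes <- toy_jobs %>%
  group_by(major_category) %>%
  summarise(
    min_male   = min(total_earnings_male),
    min_female = min(total_earnings_female),
    max_male   = max(total_earnings_male),
    max_female = max(total_earnings_female)
  ) %>%
  mutate(
    lowest_earner  = ifelse(min_female < min_male, "female", "male"),
    highest_earner = ifelse(max_female > max_male, "female", "male")
  )

print(extremes)
```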

3. We now take a look at the proportion of women in each of the major categories.

We see that the category of Natural Resources, Construction, and Maintenance is highly male-dominated. The Healthcare Practitioners and Technical category has the largest proportion of women but still has a pay gap of 20%. Service; Sales and Office; and Management, Business, and Financial each have about 50% representation of women.
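The share of women per category can be computed from the worker counts. The sketch below pools the counts within each category, which is an assumption; averaging the per-occupation percent_female column would be another option and can give a different number. The data is invented.

```r
library(dplyr)

# Toy stand-in for jobs_gender; worker counts are invented.
toy_jobs <- data.frame(
  major_category = c("Service", "Service",
                     "Natural Resources, Construction, and Maintenance",
                     "Natural Resources, Construction, and Maintenance"),
  workers_male   = c(40, 60, 95, 85),
  workers_female = c(60, 40, 5, 15)
)

# Pooled share of women: all female workers in the category divided by
# all workers in the category, in percent.
female_share <- toy_jobs %>%
  group_by(major_category) %>%
  summarise(
    percent_female = sum(workers_female) /
      sum(workers_male + workers_female) * 100
  )

print(female_share)
```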

4. To study the correlation between the earnings of women and their representation in each category, we look at the correlation values.

## # A tibble: 8 x 2
##   major_category                                            cor
##   <fct>                                                   <dbl>
## 1 Computer, Engineering, and Science                   -0.116  
## 2 Education, Legal, Community Service, Arts, and Media  0.0838 
## 3 Healthcare Practitioners and Technical                0.292  
## 4 Management, Business, and Financial                  -0.101  
## 5 Natural Resources, Construction, and Maintenance     -0.0228 
## 6 Production, Transportation, and Material Moving      -0.133  
## 7 Sales and Office                                      0.272  
## 8 Service                                               0.00473

From the above results, we see that all correlations are weak (below 0.3 in absolute value). This suggests that the representation of women in an occupational category has little linear association with their earnings.
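A per-category correlation table with the same shape as the output above can be produced as sketched below. The source does not show which two columns were correlated; this sketch assumes percent_female against total_earnings_female, and uses invented data in which the relationship is perfectly linear so that the result is easy to check by eye.

```r
library(dplyr)

# Toy stand-in for jobs_gender; values are invented for illustration.
toy_jobs <- data.frame(
  major_category        = rep(c("Service", "Sales and Office"), each = 3),
  percent_female        = c(10, 20, 30, 10, 20, 30),
  total_earnings_female = c(30000, 40000, 50000, 50000, 40000, 30000)
)

# Pearson correlation between representation and earnings of women,
# computed separately within each major category.
cor_by_category <- toy_jobs %>%
  group_by(major_category) %>%
  summarise(cor = cor(percent_female, total_earnings_female))

print(cor_by_category)
```

In the toy data the Service rows rise together (correlation 1) and the Sales and Office rows move in opposite directions (correlation -1); the real table shows values much closer to zero.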

5. Now we use the earnings_female table to visualize the percentage earnings of women with respect to men for various age groups over the years.

Women in the 16-19 and 20-24 age groups face a smaller pay gap than the remaining groups. The pay gap tends to increase with the age of women. However, it narrows again for women aged 65 years and older.

b. Analysis with respect to Time

We grouped the data by year, which gave us an idea of when the pay gap is largest and smallest. A few interesting observations from the analysis are:

  • The salary earned by women in 2016 is, in every major category, less than the salary earned by men in 2013.

  • Even though the proportion of women in each major category is more or less the same from 2013 to 2016, there are some categories where the mean salary earned by women fluctuates over that period.

  • We also see that the share of women working part time is around 3 times the share of men working part time, but their proportion decreases with time.

  • Younger women face a smaller gender pay gap than older women.

1. We check how the salary has changed for each major occupational category over the time period for both men and women.

We see that even though there are certain categories where the percentage increase in salary from 2013 to 2016 (indicated by the number on the dumbbell plot) is larger for women, the actual picture is very different. There is a large pay gap in every year in each major occupational category. This is evident from the fact that in all categories the salary women earn in 2016 is well below the salary men earned four years earlier, in 2013.
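The two quantities discussed above, the percentage increase from 2013 to 2016 and the check of women's 2016 salary against men's 2013 salary, can be sketched in base R. The data frame `toy_change`, its wide layout, and its values are assumptions made for illustration.

```r
# Toy mean salaries per category, gender and year; values are invented.
toy_change <- data.frame(
  major_category = c("Service", "Sales and Office"),
  male_2013      = c(30000, 45000),
  male_2016      = c(31500, 46800),
  female_2013    = c(25000, 38000),
  female_2016    = c(27000, 40000)
)

# Percentage increase from 2013 to 2016 for each gender.
toy_change$male_increase_pct <-
  (toy_change$male_2016 - toy_change$male_2013) / toy_change$male_2013 * 100
toy_change$female_increase_pct <-
  (toy_change$female_2016 - toy_change$female_2013) / toy_change$female_2013 * 100

# TRUE where women's 2016 salary is still below men's 2013 salary.
toy_change$female_2016_below_male_2013 <-
  toy_change$female_2016 < toy_change$male_2013

print(toy_change)
```

In the toy Service row, women's salary grows faster (8% against 5%) while their 2016 salary still sits below the men's 2013 figure, which is the pattern the paragraph describes.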

2. Also, we look at the proportion of women in each occupational category over the period from 2013 to 2016.

We see that the representation of women in each of the categories has remained almost constant over the four years. As seen earlier, Natural Resources is one of the major occupational sectors which is heavily male-dominated over the four years.

We see that the most drastic drop in earnings of women happens in Natural Resources in 2015. The most drastic rise happens in Service in 2014. In the categories of Computer, Healthcare, Education the salary percentages fluctuate while it has remained fairly constant in Management, Production, and Sales.

3. Now we use the employed_gender table to derive insights for our analysis.

From the above output, we see that the share of women working full time is lower than that of men in every year. The ratio of women to men in full-time work has stayed more or less the same, with some improvement in the last few years. In the 1970s the ratio was 100 women to about 123 men, and it improved to 100 women to 117 men in 2010.

From the above two graphs, we see that the split between full-time and part-time work differs greatly between the genders. About 26% of employed women work part time, while only about 10% of employed men do.
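The "men per 100 women" figure can be reproduced as a ratio of the two full-time percentages. The sketch below uses the first three rows of employed_gender as printed by str() above, and assumes the quoted ratio is full_time_male divided by full_time_female; note that both columns are percentages within each gender, so the result compares rates rather than head counts.

```r
# First three years of employed_gender, copied from the str() output.
employed <- data.frame(
  year             = c(1968, 1969, 1970),
  full_time_female = c(75.1, 74.9, 73.9),
  part_time_female = c(24.9, 25.1, 26.1),
  full_time_male   = c(92.2, 91.8, 91.5),
  part_time_male   = c(7.8, 8.2, 8.5)
)

# Full-time rate of men per 100 units of the full-time rate of women.
employed$men_per_100_women <-
  round(employed$full_time_male / employed$full_time_female * 100)

# How many times larger the part-time rate of women is than that of men.
employed$part_time_ratio <-
  round(employed$part_time_female / employed$part_time_male, 1)

print(employed[, c("year", "men_per_100_women", "part_time_ratio")])
```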

4. Now we look at the earnings_female table to derive insights for our analysis.

We see that the percentage of earnings for women with respect to men is increasing for women in the age groups between 25 and 64. The variation is irregular for women aged 16 to 19 years, 20 to 24 years, and 65 years and older.
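One way to put a number on "increasing" versus "irregular" is to fit a straight line per age group and compare the slopes. This is an assumption introduced here, not the method of the original analysis, which reads the trend off a graph. The data frame `toy_earnings` is invented and only mimics the layout of earnings_female.

```r
# Toy stand-in for earnings_female; values are invented for illustration.
toy_earnings <- data.frame(
  Year    = rep(1979:1981, times = 2),
  group   = rep(c("16-19 years", "25-34 years"), each = 3),
  percent = c(90, 85, 90, 70, 72, 74)
)

# Least-squares slope of percent against Year within each age group,
# in percentage points per year.
slope_by_group <- sapply(
  split(toy_earnings, toy_earnings$group),
  function(d) unname(coef(lm(percent ~ Year, data = d))["Year"])
)

print(round(slope_by_group, 2))
```

A clearly positive slope corresponds to a steadily narrowing gap, while a slope near zero with large residuals corresponds to the irregular pattern described for the youngest and oldest groups.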

c. Study of independent occupational categories

Pay gap percentage = (median earnings of men - median earnings of women) / median earnings of men × 100
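Read with the subtraction grouped before the division and the result scaled to percent, the pay gap definition can be written as a small R function. The name `pay_gap_pct` is hypothetical. The second call uses the male and female median earnings from the first row shown in the str() output of jobs_gender.

```r
# Pay gap in percent: how far female median earnings fall short of
# male median earnings, relative to the male median.
pay_gap_pct <- function(median_male, median_female) {
  (median_male - median_female) / median_male * 100
}

# Women earning 270% of the male median gives a negative gap of -170%.
print(pay_gap_pct(100, 270))

# First row of jobs_gender: male 126142, female 95921.
print(round(pay_gap_pct(126142, 95921), 1))
```

With this definition the pay gap and the wage percentage of women relative to men always add up to 100, so a wage percentage of 76 corresponds to a gap of about 24%.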

We have studied some major occupational categories that had significant observations. They are as follows:

Computer, Engineering, and Science

  • The median salary of women working as survey researchers is about 270% of the male median salary. This implies a negative pay gap and, together with good representation, makes it a highly promising occupation for women in 2016. The pay gap decreased from 40% to -170%. Also, the representation of women increased from 57% to 70% in 3 years.

  • The Architecture and Engineering minor category generally has a low representation of women in comparison to the other minor categories. Women agricultural engineers faced a pay gap of 40% in 2015, with representation as low as 0.05%.

  • The pay gap for mathematicians increased from -34% to a striking 60% from 2013 to 2014, and stayed around 40% until 2016. This could be the result of a small sample. The representation of women in the occupation increased, which suggests that more women took up average-paying jobs in mathematics after 2014.

  • The pay gap for women nuclear technicians with respect to men decreased from 35% to about 18% from 2013 to 2015.

Healthcare Practitioners and Technical

  • Podiatrists have only 15% of their workforce as women and have a positive pay gap of 30%.

  • Nurses, an occupation where about 80% of the workforce are women, experience a pay gap of about 10%.

  • Dieticians and Nutritionists, with about 90% of their workforce as women, have a negative pay gap of -17%.

5. Summary

Based on the data at hand, we have studied the trends and patterns in the earnings of women in comparison to men with respect to various factors. This resulted in the following insights.

Factor 1: Major occupational categories: We grouped the whole data by major category, which gave us an idea of where the pay gap is largest and smallest. We see that men earn significantly more than women in all major occupational categories. In six out of eight occupational categories, the minimum median salary is earned by a woman, whereas in five out of these eight categories the maximum salary is earned by a man. This indicates the possibility of men holding a larger proportion of well-paid jobs in each of these sectors.

Factor 2: Representation of Women in the Workforce: We see that the proportion of women in the workforce for each of the categories hasn’t changed significantly over the years. Their representation has no considerable effect on their income. We also notice that they face a pay gap of about 10 to 20% in certain occupations like Nursing, where their representation is above 80%. Women are almost half of the workforce. Yet, on average, women continue to earn considerably less than men.

Factor 3: Full time and part time: Comparing the shares of women and men in part time and full time jobs, we see that women are far more likely than men to work part time, while men are more strongly represented in full time jobs. This suggests the possibility of bias. Women tend to take up part time jobs to meet other household expectations.

Factor 4: Time: We grouped the data by year, which indicates whether the pay gap increases or decreases over time. The trend suggests that although the pay of women relative to men has increased over the years, the change is not very significant for some age groups. It has remained almost the same for the oldest and youngest age groups.