Do gender-based social roles or top salaries shape our choice of career path? Although many countries, such as China, have incorporated women into their labor force and developed strong economies as a result, women still tend to choose careers that align more with gender stereotypes.\(^2\) The personality characteristics often associated with women are sympathy, kindness, and warmth, which all reflect a sense of concern towards other people. On the other hand, the traits frequently associated with men are success and ambition, which are concerned more with accomplishing tasks. These characteristics manifest themselves in the stereotypical association of men with the worker role and women with the family role.\(^1\)
In response to this gender bias, more schools are encouraging girls to enter STEM programs and providing them with various resources to help them succeed in these types of careers. However, despite these efforts, women still tend to choose careers where the median pay is lower. Thus, our research question asks which characteristics of STEM college majors are associated with their median wages. Our goals are to explore the data for STEM college majors and to create a predictive model for median wages.
The data was obtained from the American Community Survey
2010-2012 Public Use Microdata Series and is a subset which only
contains STEM majors. The data dimensions are 76 rows
(STEM majors) by 9 columns (variables). The variables
are: Rank, Major_code, Major,
Major_category, Total, Men,
Women, ShareWomen, and Median.
Below are descriptions of each variable.
- Median: Median earnings of full-time, year-round workers
- Rank: Rank by median earnings
- Major_code: Major code, FOD1P in ACS PUMS
- Major: Major description
- Major_category: Category of major from Carnevale et al.
- Total: Total number of people with major
- Men: Male graduates
- Women: Female graduates
- ShareWomen: Women as share of Total

For the purpose of exploratory data analysis,
Major_category was set as a factor to explore the variation
of the share of women within major categories and the median wages for
those major categories. Within the STEM majors, median wage ranges from
\(\$26,000\) for Zoology to \(\$110,000\) for Petroleum Engineering, with
a median of \(\$44,350\) and a mean of \(\$46,118\). Through a
stacked barplot of gender proportions per major category, we see that
the biggest proportion of women chose fields related to Health
and the biggest proportion of men chose fields related to
Engineering (Figure 1). This is congruent with the gender roles
and personality characteristics associated with women and men.
Figure 1: Gender proportions per major category.
Side-by-side boxplots of each major category were then generated to
show descriptive statistics, such as the interquartile range, and to help
identify STEM majors that are outliers with regard to the
Median variable (see Figure 2). From our jitter plot, we
noticed that the Engineering major category contains an
outlier, Petroleum Engineering, and another outlier,
Astronomy and Astrophysics, can be found in the Physical
Sciences major category. As seen in Figure 2, Petroleum
Engineering has a much smaller proportion of women than men
(ShareWomen = 0.121) with a median wage of \(\$110,000\) (Median = 110000).
Astronomy and Astrophysics has a roughly balanced proportion of
women and men (ShareWomen = 0.536) with a median
wage of \(\$62,000\)
(Median = 62000). These data visualizations suggest that
median wage may differ significantly across major
categories, and that the proportion of women in
a major may be associated with its median wage.
Figure 2: Side-by-side box plot (left) and jitter plot (right) of Median wage ($1000) by major category. Major categories: [1]Biology & Life Science, [2]Computers & Mathematics, [3]Engineering, [4]Health, [5]Physical Sciences.
Analysis of Variance (ANOVA) was done to test whether there is a
statistically significant difference in median wage among our
five major categories. Based on our ANOVA, we rejected the null
hypothesis of no difference, given \(F(4, 71) = 16.7\),
\(p = 1.013 \times 10^{-8} < \alpha = 0.05\), and concluded that at least one
major category differs significantly from the others in median
wage. Since they are irrelevant to our analysis, the columns
Major_code and Rank were removed from our
dataset. Then, we generated a scatterplot matrix which revealed that
there seems to be a negative association between ShareWomen
and Median. The scatterplot matrix also supports our
assumption that there may be an issue of multicollinearity among
Total, Men, Women, and
ShareWomen. This observation makes sense since the column
Total is the sum of the columns Men and
Women. Likewise, ShareWomen refers to the
ratio of Women to Total.
Figure 3: Scatterplot matrix.
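The ANOVA reported above compares the variation between major categories with the variation within them. As a minimal sketch of that computation (not the authors' R script), the Python below computes a one-way ANOVA F statistic by hand. The function name `one_way_anova` and the toy wages are hypothetical and made up for illustration; only the design sizes (5 categories, 76 majors) come from the report.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# HYPOTHETICAL wages in $1000 for three made-up groups (not the report's data).
toy_groups = [
    [30.0, 35.0, 40.0],
    [45.0, 50.0, 55.0],
    [60.0, 70.0, 80.0],
]
f_stat, df1, df2 = one_way_anova(toy_groups)
print(f"toy F({df1}, {df2}) = {f_stat:.1f}")

# Degrees of freedom implied by the report's design: 5 categories, 76 majors.
report_df = (5 - 1, 76 - 5)
print("report degrees of freedom:", report_df)
```

With 5 categories and 76 majors, the degrees of freedom are (4, 71), which matches the reported \(F(4, 71)\).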
We selected the column for median wage, Median,
as our response variable. While checking normality, linearity, and
constant variance, we noticed that the data for Median show
some right skew. Accordingly, a Box-Cox procedure was performed to see if
a transformation was necessary (see Figure 4). The resulting rounded
power was \(-1\), suggesting that an inverse transformation of the response
would help with the right skew. However, this transformation
would later complicate the interpretability of the model.
Figure 4: Box-Cox plot for
Median.
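To make the Box-Cox step concrete, here is a minimal Python sketch. It assumes an intercept-only model rather than the regression model used in the report, and the helper names and toy wages (in $1000) are hypothetical. It shows the transform for a given power and a coarse grid search over the profile log-likelihood.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y for power lam."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def profile_log_likelihood(values, lam):
    """Box-Cox profile log-likelihood, ASSUMING an intercept-only model."""
    n = len(values)
    z = [box_cox(v, lam) for v in values]
    z_bar = sum(z) / n
    variance = sum((zi - z_bar) ** 2 for zi in z) / n
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + jacobian


# HYPOTHETICAL right-skewed wages in $1000 (not the report's data).
toy_wages = [26, 30, 33, 35, 36, 38, 40, 42, 45, 50, 57, 65, 75, 110]

# Coarse grid of candidate powers from -2 to 2 in steps of 0.5.
grid = [x / 2 for x in range(-4, 5)]
best_lam = max(grid, key=lambda lam: profile_log_likelihood(toy_wages, lam))
print("power with the highest profile log-likelihood on the toy data:", best_lam)
```

For a power of -1 the transform equals \(1 - 1/y\), a linear function of the reciprocal, which is why a rounded power of -1 is read as an inverse transformation of the response.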
\[Y^{-1} = \beta_0 + \beta_{Major\_category} + \beta_{Total} \cdot \text{Total} + \beta_{Men} \cdot \text{Men} + \beta_{Women} \cdot \text{Women} + \beta_{ShareWomen} \cdot \text{ShareWomen} + \epsilon \qquad [1]\]
Our full additive model is described by equation [1], where \(Y\) is Median
and \(\beta_{Major\_category}\) is the effect of the major's category relative
to the baseline category. Running this
model through a step-wise selection procedure with AIC as the criterion
removed too many predictors; thus, we added interaction terms to see
whether the expanded model would help with this issue. Then,
another step-wise selection was run to reduce the model’s AIC. This
process resulted in the removal of the predictor Women
because its \(p\text{-value} = 0.7394 > \alpha =
0.05\). The final reduced model is described by:
\(\widehat{\text{Median}^{-1}} = (2.71 \times 10^{-5}) - (3.44 \times 10^{-6}) \cdot \text{Computers } \& \text{ Mathematics} - (8.87 \times 10^{-6}) \cdot \text{Engineering} -\)
\((3.99 \times 10^{-7}) \cdot \text{Health} - (3.09 \times 10^{-6}) \cdot \text{Physical Sciences} - (4.14 \times 10^{-11}) \cdot \text{Men} +\)
\((1.08 \times 10^{-6}) \cdot \text{ShareWomen} + (8.98 \times 10^{-11}) \cdot \text{Men:ShareWomen}\).
We achieved an adjusted \(R^2\)
of 0.5377, which means that roughly 53.77% of the variation in the
inverse of Median can be explained by the model after adjusting for the
number of predictors. While this
value is not especially low, it does indicate that the
model would need improvement before practical use. We also noticed that the predictors
Men, ShareWomen, and the interaction term
Men:ShareWomen are not statistically significant at any
conventional significance level (given their p-values).
As noted earlier, model interpretability is difficult here due
to the nature of the transformation. For example, the
coefficient for the variable Major_categoryEngineering
can be interpreted to mean that if the major being examined is in the
Engineering category (with all other predictors held
constant), the predicted inverse of Median decreases by roughly \(8.866\times10^{-6}\) inverse dollars relative to the baseline category, Biology & Life Science. Because the response is inverted, a decrease on this scale corresponds to a higher predicted median wage.
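One way to see what a coefficient on the inverse scale means in dollars is to evaluate the final reduced model and invert the result. The Python below is a minimal sketch that uses the rounded coefficients printed in the model equation, so its output is approximate. The function names are hypothetical, and treating Biology & Life Science as the baseline category is an assumption based on its absence from the equation.

```python
# Rounded coefficients copied from the final reduced model in the text.
INTERCEPT = 2.71e-5
CATEGORY_EFFECT = {
    "Biology & Life Science": 0.0,  # ASSUMED baseline (no term in the equation)
    "Computers & Mathematics": -3.44e-6,
    "Engineering": -8.87e-6,
    "Health": -3.99e-7,
    "Physical Sciences": -3.09e-6,
}
B_MEN = -4.14e-11
B_SHARE_WOMEN = 1.08e-6
B_INTERACTION = 8.98e-11


def predict_inverse_median(category, men, share_women):
    """Predicted 1/Median (inverse dollars) from the reduced model."""
    return (
        INTERCEPT
        + CATEGORY_EFFECT[category]
        + B_MEN * men
        + B_SHARE_WOMEN * share_women
        + B_INTERACTION * men * share_women
    )


def predict_median(category, men, share_women):
    """Back-transform the prediction to dollars."""
    return 1.0 / predict_inverse_median(category, men, share_women)


# Men and ShareWomen for Petroleum Engineering, taken from Table 1.
men, share_women = 2057, 0.1206
engineering_pred = predict_median("Engineering", men, share_women)
baseline_pred = predict_median("Biology & Life Science", men, share_women)
print(f"Engineering: ${engineering_pred:,.0f}")
print(f"Baseline category, same Men and ShareWomen: ${baseline_pred:,.0f}")
```

With these rounded coefficients, the Engineering prediction comes out near \(\$54,700\) against roughly \(\$36,800\) for the baseline category at the same values of Men and ShareWomen. The size of that gap in dollars changes with the other predictors, which is the interpretability cost of modeling the inverse.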
To verify the results of the model, a plot of the standardized residuals against the model’s fitted values was made in addition to a Q-Q plot of the standardized residuals. In Figure 5, it can be seen on the left that the standardized residuals do not appear to have any discernible relationship with the final model’s predicted values. After confirming this interpretation with the studentized Breusch-Pagan test, given \(p=0.8582 > \alpha=0.05\), it was concluded that the assumption of constant variance for this data set holds up.
As for the Q-Q plot, although some of the data points deviate from the Q-Q line at the tails of the distribution, the standardized residuals do seem to follow the Q-Q line fairly well. To confirm this finding, the Shapiro-Wilk test was used, with resulting \(p=0.6165 > \alpha=0.05\). Therefore, we found no evidence against normality of the standardized residuals, which supports our normality assumption.
Figure 5: The residual plot (left) along with the Q-Q plot (right) for the final model.
To assess whether the goal of creating a predictive model for
median wages was achieved, 95% prediction intervals were generated for
selected majors from the model for (Median)\(^{-1}\) and
expressed in dollars, as seen below in Table 1.
| Major | Major Category | Men | ShareWomen | Median ($) | 95% Prediction Interval ($) |
|---|---|---|---|---|---|
| Statistics & Decision Science | Computers & Mathematics | 2960 | 0.5265 | 45000 | (30997,61595) |
| Petroleum Engineering | Engineering | 2057 | 0.1206 | 110000 | (38461,94271) |
| Zoology | Biology & Life Sciences | 3050 | 0.6373 | 26000 | (28199,50201) |
| Astronomy & Astrophysics | Physical Sciences | 832 | 0.5357 | 62000 | (30976,61690) |
| Nursing | Health | 21773 | 0.8960 | 48000 | (27368,48764) |
Table 1: Prediction intervals for the chosen majors.
Here, we find that the observed median wages for Zoology, Astronomy & Astrophysics, and Petroleum Engineering lie outside their prediction intervals, which is unsurprising given that these majors are outliers or sit at the extremes of the dataset.
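The coverage check described above can be sketched in a few lines of Python using the rows of Table 1. The helper names are hypothetical. The `back_transform` function illustrates the general rule for moving an interval from the inverse scale to dollars, on the assumption that the intervals were originally computed for (Median)\(^{-1}\).

```python
# Rows copied from Table 1: (major, observed median, (lower, upper)) in dollars.
TABLE_1 = [
    ("Statistics & Decision Science", 45000, (30997, 61595)),
    ("Petroleum Engineering", 110000, (38461, 94271)),
    ("Zoology", 26000, (28199, 50201)),
    ("Astronomy & Astrophysics", 62000, (30976, 61690)),
    ("Nursing", 48000, (27368, 48764)),
]


def back_transform(inv_lower, inv_upper):
    """Map an interval for 1/Median to an interval for Median.

    The reciprocal is decreasing for positive values, so the endpoints swap.
    """
    return (1.0 / inv_upper, 1.0 / inv_lower)


def covers(interval, value):
    """True if value lies inside the closed interval."""
    lower, upper = interval
    return lower <= value <= upper


outside = [major for major, median, pi in TABLE_1 if not covers(pi, median)]
print(outside)
# ['Petroleum Engineering', 'Zoology', 'Astronomy & Astrophysics']
```

Petroleum Engineering and Astronomy & Astrophysics fall above their intervals, while Zoology falls below its interval.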
After the data analysis, the results support an association between the share of women in a STEM major and its median wage. Since this association comes from observational data, we cannot conclude that gender composition causes the difference. However, further research designed to identify causes would help explain this discrepancy in median wage.
Our final model can predict the median wage of a STEM major based on its major category, the number of men in the major, and the proportion of women in the major. We tested our model on one major from each major category and had the least success with majors whose median wages are at the extremes of the dataset or are outliers within their major category (namely Petroleum Engineering, Zoology, and Astronomy & Astrophysics).
The most ‘important’ conclusion from our research is that since Petroleum Engineering has the highest median salary in this data set (i.e. \(\$110,000\)), potential students should consider majoring in this field if they only care about the median salary. However, if more women go into this field, the median wage could potentially decrease.
Despite the findings obtained, the data set was found to be
too limited to get a thorough look at associations within STEM college
majors that influence their median wages. For instance, if median wage
were disaggregated by sex, we could examine the
difference in median wage by gender for each major. Another way to
improve future research is to collect time series data so that analysis
could be done to see how median wages change with an influx of women
and/or exodus of men from a given major. Since this project only looked
at STEM majors, it would also be interesting to see if these same
variables (i.e. Major_category, Men,
ShareWomen) are associated with the median wages for all
majors.
1. Etaugh, Claire A., and Judith S. Bridges. Women’s Lives: A Psychological Exploration. 3rd ed., Pearson, 2013.
2. Kristof, Nicholas D. Half the Sky: Turning Oppression into Opportunity for Women Worldwide. Three Rivers Press, 2010.
For supplementary R script, visit https://github.com/lgibson7/Gender-Wage-Inequality-in-STEM