A Comprehensive Analysis of Gender Pay Gap in the Tech Industry

Kristin Lussi and Tony Fraser

December 7th, 2023

Abstract

This research paper uses six years’ worth of the annual Stack Overflow survey data to examine the gender pay gap within the technology sector. It’s important to note that the self-reported responses in this dataset are not necessarily specific to any particular industry but rather focus on individuals’ job functions. This data spans a variety of tech positions such as R programmer, cloud engineer, technical project manager, blockchain developer, etc.

Regarding research methods, we employ descriptive statistics to describe the data at an overview level, along with providing specific examples of the gender pay gap. Additionally, we conduct a deep dive using simple and multiple linear regression for some of our more advanced analyses.

The data clearly indicates the presence of a substantial pay gap. However, despite the extensive nature of this dataset, it is apparent that there is at least one crucial yet unidentified variable that needs to be incorporated to effectively model the pay gap. We propose that without the inclusion of the “CompanyPercentSexist” column in this dataset, gaining a comprehensive understanding and modeling of this pay gap may remain challenging.

Upon completing this study, we recommend persevering and potentially seeking funding for further research and modeling. Our first immediate suggested action would be proposing additional questions to Stack Overflow, particularly those related to geography and industry. While we might not be able to find “CompanyPercentSexist,” narrowing our focus to industry and region could enable us to provide essential information to local politicians and the media.

Introduction

The gender pay gap within the US tech sector has long been a subject of concern, reflecting broader societal issues and potential barriers to gender equity.

This research paper leverages a dataset built from six years (1.55 GB) of Stack Overflow survey data to examine the pay gap across many variables.

Our aim is to answer the question: Is there a significant difference in salary between males and females within the tech industry?

The dependent variable is Annual Salary. We will utilize simple linear regression to first determine if we can reject the null hypothesis.

The null hypothesis (\(H_0\)) is: There is no significant difference in the mean annual salaries between males and females.

The alternative hypothesis (\(H_1\)) is: There is a significant difference in the mean annual salaries between males and females.
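Stated in terms of the simple regression used later, the two hypotheses can be written as below. This is only a restatement under our own notation: \(\text{Female}_i\) is an assumed 0/1 indicator (1 for female respondents, 0 for male respondents), and \(\beta_0\), \(\beta_1\), and \(\varepsilon_i\) are symbols introduced here for illustration.

\[
\text{AnnualSalary}_i = \beta_0 + \beta_1\,\text{Female}_i + \varepsilon_i,
\qquad H_0:\ \beta_1 = 0,
\qquad H_1:\ \beta_1 \neq 0
\]

Under this coding, \(\beta_0\) is the mean male salary and \(\beta_1\) is the difference between the mean female and mean male salaries.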

Once we determine if we can reject the null hypothesis, we will determine which variables are statistically significant in predicting the response variable (Annual Salary) by creating a multiple linear regression model.

Data Overview

Data engineering pipeline

Our pipeline does the following:

  1. For each year, download the raw survey data file from S3.
  2. Unify the columns for each year, and then union all years together.
  3. Explode certain multi-value columns wide, into one yes/no column per value. For example, a single “PlatformWorkedWith” response may contain both AWS and Google Cloud.
  4. Preprocess certain columns. For example, we derive an ethnicity_grouped column whose only values are minority, non-minority, or NA.
  5. Save the CSV file in the root directory so the markdown can load it from cache.
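The "explode wide" step in the list above can be sketched roughly as follows. This is a minimal illustration in base R, not the pipeline's actual code: the helper name explode_wide, the ";" separator, the name-cleaning rule, and the toy input are all assumptions made for the example.

```r
# Hypothetical helper: turn one delimited multi-value column into yes/no columns.
explode_wide <- function(df, col, sep = ";") {
  # Split each response into its individual values.
  parts <- strsplit(as.character(df[[col]]), sep, fixed = TRUE)
  # Normalise values into column-safe names, e.g. "Google Cloud" -> "googlecloud".
  parts <- lapply(parts, function(p) gsub("[^a-z0-9]", "", tolower(p)))
  values <- sort(unique(unlist(parts)))
  values <- values[!is.na(values) & nzchar(values)]
  for (v in values) {
    # TRUE when the respondent listed this value.
    # Simplification: a missing response is coded "no" rather than NA.
    has_value <- vapply(parts, function(p) v %in% p, logical(1))
    df[[v]] <- factor(ifelse(has_value, "yes", "no"), levels = c("no", "yes"))
  }
  df
}

# Toy input; the values are invented for illustration.
raw <- data.frame(
  PlatformWorkedWith = c("AWS;Google Cloud", "Linux", NA),
  stringsAsFactors = FALSE
)
wide <- explode_wide(raw, "PlatformWorkedWith")
print(wide)
```

The result keeps the original column and adds one factor column per distinct value, which is the shape of the aws, googlecloud, and linux columns in the filtered data.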

How we filtered the base data set

We filtered our more than 500K raw records all the way down to 43,655 for this study. Our working dataset includes only those who:

A quick glimpse of filtered data

## Rows: 43,655
## Columns: 67
## $ Year                   <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ Country                <chr> "United States", "United States", "United State…
## $ Gender                 <fct> Male, Male, Female, Female, Male, Male, Male, M…
## $ EdLevel                <fct> Some College, Some College, Bachelors, Bachelor…
## $ DevType                <chr> "Back-end developer;Front-end developer;Full-st…
## $ AnnualSalary           <dbl> 120000, 250000, 44000, 60000, 80000, 74000, 115…
## $ YearsCodeProAvg        <dbl> 10, 30, 7, 4, 13, 10, 16, 4, 19, 13, 1, 7, 7, 1…
## $ OrgSizeAvg             <dbl> 5.00, 299.50, 749.50, 251.00, 59.50, 253.25, 59…
## $ AgeAvg                 <dbl> 21.0, 39.5, 21.0, 29.5, 29.5, 29.5, 29.5, 39.5,…
## $ python                 <fct> no, yes, no, no, yes, yes, no, no, no, yes, no,…
## $ sql                    <fct> no, yes, yes, no, yes, no, no, yes, no, no, yes…
## $ java                   <fct> no, no, yes, no, yes, no, yes, no, no, no, yes,…
## $ javascript             <fct> yes, yes, yes, yes, yes, no, yes, yes, yes, no,…
## $ ruby                   <fct> no, yes, no, no, yes, no, yes, no, no, no, no, …
## $ php                    <fct> no, no, yes, yes, yes, no, no, yes, no, no, no,…
## $ c                      <fct> no, no, no, no, yes, yes, no, no, no, yes, no, …
## $ swift                  <fct> no, no, yes, no, no, no, no, no, no, no, no, ye…
## $ scala                  <fct> no, no, no, no, no, no, yes, no, no, yes, no, n…
## $ r                      <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ rust                   <fct> no, no, no, no, no, yes, no, no, no, no, no, no…
## $ julia                  <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ mysql                  <fct> no, no, yes, no, no, no, no, yes, no, yes, no, …
## $ microsoftsqlserver     <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ mongodb                <fct> yes, no, no, no, no, no, yes, no, no, no, no, n…
## $ postgresql             <fct> no, yes, no, yes, yes, no, no, no, no, yes, yes…
## $ oracle                 <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ ibmdb2                 <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ redis                  <fct> no, yes, no, no, no, no, no, no, no, no, no, no…
## $ sqlite                 <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ mariadb                <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ microsoftazure         <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ googlecloud            <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ ibmcloudorwatson       <fct> no, no, no, no, no, no, no, no, no, yes, no, no…
## $ kubernetes             <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ linux                  <fct> yes, yes, no, no, yes, no, yes, no, no, no, yes…
## $ windows                <fct> no, no, no, no, no, no, no, no, yes, no, no, no…
## $ sexuality_grouped      <fct> straight, straight, straight, straight, straigh…
## $ ethnicity_grouped      <fct> non-minority, non-minority, non-minority, non-m…
## $ aws                    <fct> no, yes, no, no, no, no, yes, no, no, no, no, n…
## $ python_num             <dbl> 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,…
## $ sql_num                <dbl> 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1,…
## $ java_num               <dbl> 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ javascript_num         <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ ruby_num               <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ php_num                <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ c_num                  <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ swift_num              <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ scala_num              <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ r_num                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ rust_num               <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ julia_num              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mysql_num              <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,…
## $ microsoftsqlserver_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mongodb_num            <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ postgresql_num         <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ oracle_num             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ibmdb2_num             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ redis_num              <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ sqlite_num             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mariadb_num            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ microsoftazure_num     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ googlecloud_num        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ibmcloudorwatson_num   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ kubernetes_num         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ linux_num              <dbl> 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ windows_num            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ aws_num                <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,…

A high-level respondent overview

The respondents span many types of careers, all of them highly technical in nature. In addition, many more men than women responded to this survey, which most likely introduces several kinds of bias.

grid.arrange(salByJobType, respGender, ncol = 2)

A non-gender based look at correlation

We converted many independent variables to integers so we could look at their correlation with the dependent variable.
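A minimal sketch of that conversion, assuming a yes/no factor recoded as 0/1 in the way the *_num columns in the filtered data suggest. The toy values are invented, and this is not the code that produced corr_chart.

```r
# Toy data; column names mirror the paper's, the values are invented.
toy <- data.frame(
  AnnualSalary = c(90000, 120000, 70000, 150000, 110000, 60000),
  aws = factor(c("no", "yes", "no", "yes", "yes", "no"), levels = c("no", "yes"))
)

# Recode the yes/no factor as 0/1 so it can enter a correlation.
toy$aws_num <- as.numeric(toy$aws == "yes")

# Pearson correlation between a 0/1 indicator and salary (point-biserial).
r_aws <- cor(toy$aws_num, toy$AnnualSalary)
print(r_aws)
```

A positive value indicates that respondents with the indicator set to 1 tend to report higher salaries; it says nothing about gender on its own.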

grid.arrange(corr_chart, ncol = 1)

A quick look at gender discrepancy

As we alluded to in the project proposal, the pay gap is not limited to just one or two variables; rather, it appears across all of them. Before delving into a more thorough analysis, we present a final series of descriptive box plots.

grid.arrange(aws, linux, oracle, ncol = 3)

grid.arrange(minority_plot, non_minority_plot, lgbqt_plot , straight_plot, ncol = 4)

Data analysis

Simple Linear Regression

To determine if we can reject the null hypothesis, we will begin with constructing a simple linear regression model with annual salary as the dependent variable and gender as the independent variable.

wide_stack$Gender <- relevel(wide_stack$Gender, ref = "Male")
m_salary_gender <- lm(AnnualSalary ~ Gender, data = wide_stack)
summary(m_salary_gender)
## 
## Call:
## lm(formula = AnnualSalary ~ Gender, data = wide_stack)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -119345  -35345   -7345   30655  189768 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  119344.8      246.9  483.45 <0.0000000000000002 ***
## GenderFemale -14112.9      842.5  -16.75 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49310 on 43653 degrees of freedom
## Multiple R-squared:  0.006387,   Adjusted R-squared:  0.006364 
## F-statistic: 280.6 on 1 and 43653 DF,  p-value: < 0.00000000000000022

From the above, we see that the p-value is less than 0.000000000000022% (that is, below 2.2e-16 as a proportion). At a 5% significance level, a p-value below 5% means that we can reject the null hypothesis. The adjusted \(R^2\) value is very small (0.6364%), which means that gender alone explains very little of the variability in annual salary. Because gender is the only predictor and male is the reference level, the intercept (119,344.8) is the mean male salary, and the coefficient -14,112.9 means that, on average, being female is associated with an annual salary $14,112.90 lower than that of males.
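To make the reading of these two coefficients concrete, the sketch below fits the same kind of model on a tiny invented dataset and checks that the intercept equals the mean of the reference group and that the GenderFemale coefficient equals the difference in group means. The data are made up; only the model form follows the paper.

```r
# Toy data, invented for illustration; "Male" is the reference level.
toy <- data.frame(
  Gender = factor(c("Male", "Male", "Male", "Female", "Female"),
                  levels = c("Male", "Female")),
  AnnualSalary = c(100000, 120000, 140000, 90000, 110000)
)

fit <- lm(AnnualSalary ~ Gender, data = toy)
group_means <- tapply(toy$AnnualSalary, toy$Gender, mean)

# Intercept = mean of the reference group (Male).
male_mean <- unname(coef(fit)["(Intercept)"])
# GenderFemale = mean(Female) - mean(Male).
gap <- unname(coef(fit)["GenderFemale"])

print(group_means)
print(c(male_mean = male_mean, gap = gap))
```

Applying the same reading to the reported estimates gives an implied mean female salary of 119,344.8 - 14,112.9 = 105,231.9, a gap of roughly 11.8% of the mean male salary.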

Multiple Linear Regression

Now that we have determined that we can reject the null hypothesis, we will construct a multiple linear regression model which includes all of the variables in our study to determine which are good predictors of annual salary. The inclusion of interaction terms, represented as “Gender:Variable,” gives each variable a separate coefficient for each gender, which allows us to examine the impact of each variable in relation to gender.

wide_stack$EdLevel <- relevel(wide_stack$EdLevel, ref = "Something Else")

m_salary <- lm(AnnualSalary ~ Gender + Gender:AgeAvg + Gender:ethnicity_grouped + Gender:sexuality_grouped +
   Gender:EdLevel + Gender:OrgSizeAvg +
   Gender:YearsCodeProAvg + Gender:Year + Gender:python + Gender:r + Gender:scala + Gender:julia +
   Gender:microsoftazure + Gender:aws + Gender:mariadb + Gender:mongodb +
   Gender:linux + Gender:windows + Gender:mysql + Gender:oracle + Gender:ibmdb2 + 
   Gender:c + Gender:googlecloud + Gender:ibmcloudorwatson + Gender:java + 
   Gender:javascript + Gender:kubernetes + + Gender:microsoftsqlserver + Gender:php +
   Gender:postgresql + Gender:redis + Gender:ruby + Gender:rust + Gender:sqlite +
   Gender:swift, data = wide_stack)

summary(m_salary)
## 
## Call:
## lm(formula = AnnualSalary ~ Gender + Gender:AgeAvg + Gender:ethnicity_grouped + 
##     Gender:sexuality_grouped + Gender:EdLevel + Gender:OrgSizeAvg + 
##     Gender:YearsCodeProAvg + Gender:Year + Gender:python + Gender:r + 
##     Gender:scala + Gender:julia + Gender:microsoftazure + Gender:aws + 
##     Gender:mariadb + Gender:mongodb + Gender:linux + Gender:windows + 
##     Gender:mysql + Gender:oracle + Gender:ibmdb2 + Gender:c + 
##     Gender:googlecloud + Gender:ibmcloudorwatson + Gender:java + 
##     Gender:javascript + Gender:kubernetes + +Gender:microsoftsqlserver + 
##     Gender:php + Gender:postgresql + Gender:redis + Gender:ruby + 
##     Gender:rust + Gender:sqlite + Gender:swift, data = wide_stack)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -198667  -26190   -5315   20592  204972 
## 
## Coefficients: (1 not defined because of singularities)
##                                                  Estimate     Std. Error
## (Intercept)                                -15593717.8116    392326.7922
## GenderFemale                                 1055153.3243   1324697.4099
## GenderMale:AgeAvg                                -11.4028        41.1481
## GenderFemale:AgeAvg                             -254.9622       124.9653
## GenderMale:ethnicity_groupednon-minority       -6204.3484       661.3269
## GenderFemale:ethnicity_groupednon-minority     -5112.7181      1894.4108
## GenderMale:sexuality_groupedstraight            3160.2572       873.0873
## GenderFemale:sexuality_groupedstraight         -1558.5486      1597.9757
## GenderMale:EdLevelAssociates                    3318.7737      6599.2093
## GenderFemale:EdLevelAssociates                 -5706.9969      4771.4491
## GenderMale:EdLevelBachelors                    16665.5007      6524.7049
## GenderFemale:EdLevelBachelors                  11877.7744      2997.5375
## GenderMale:EdLevelDoctorate                    29788.6644      6638.1813
## GenderFemale:EdLevelDoctorate                  28693.4539      5140.7648
## GenderMale:EdLevelMasters                      25948.1762      6539.0454
## GenderFemale:EdLevelMasters                    19306.4405      3256.4361
## GenderMale:EdLevelNo Education                 22880.5619      8568.8382
## GenderFemale:EdLevelNo Education               16709.1622     41489.2264
## GenderMale:EdLevelPrimary                      13698.3498      7733.2068
## GenderFemale:EdLevelPrimary                    14404.1486     24189.9773
## GenderMale:EdLevelProfessional                 19728.2009      7369.8143
## GenderFemale:EdLevelProfessional               12354.4802      9477.7365
## GenderMale:EdLevelSecondary                     8434.8881      6695.5834
## GenderFemale:EdLevelSecondary                   3764.6574      7490.3956
## GenderMale:EdLevelSome College                 10810.8658      6547.7393
## GenderFemale:EdLevelSome College                       NA             NA
## GenderMale:OrgSizeAvg                             -0.1833         1.0729
## GenderFemale:OrgSizeAvg                           13.8690         3.6031
## GenderMale:YearsCodeProAvg                      2068.5699        43.7464
## GenderFemale:YearsCodeProAvg                    1762.6129       144.8698
## GenderMale:Year                                 7760.3671       194.1885
## GenderFemale:Year                               7246.2860       626.7079
## GenderMale:pythonyes                            1353.5161       481.8252
## GenderFemale:pythonyes                          -715.7141      1648.3513
## GenderMale:ryes                                -8005.0421       958.9999
## GenderFemale:ryes                              -8614.8058      2795.4388
## GenderMale:scalayes                            14652.0037      1112.5435
## GenderFemale:scalayes                          10536.9673      3959.7968
## GenderMale:juliayes                            -8849.4946      2677.7737
## GenderFemale:juliayes                          -4053.9224     14756.1703
## GenderMale:microsoftazureyes                    6116.3312       673.9515
## GenderFemale:microsoftazureyes                  9812.1632      2587.1221
## GenderMale:awsyes                              10137.3801       499.5139
## GenderFemale:awsyes                             9247.6344      1710.3190
## GenderMale:mariadbyes                          -5738.0277       795.8890
## GenderFemale:mariadbyes                        -2285.2461      3239.1143
## GenderMale:mongodbyes                          -1576.2923       579.4587
## GenderFemale:mongodbyes                         -655.3025      1934.8371
## GenderMale:linuxyes                             3662.2745       573.4466
## GenderFemale:linuxyes                            480.5524      2054.1250
## GenderMale:windowsyes                          -4989.0588       572.8468
## GenderFemale:windowsyes                        -6696.2747      2059.7840
## GenderMale:mysqlyes                            -4048.3508       523.0716
## GenderFemale:mysqlyes                          -2454.4706      1682.2298
## GenderMale:oracleyes                           -7145.7340       763.8362
## GenderFemale:oracleyes                         -9029.7563      2818.4312
## GenderMale:ibmdb2yes                           -9270.4277      1583.5959
## GenderFemale:ibmdb2yes                        -13383.5044      6940.0395
## GenderMale:cyes                                 -883.9777       662.4677
## GenderFemale:cyes                               2934.1621      3102.4411
## GenderMale:googlecloudyes                       6875.8451       677.9926
## GenderFemale:googlecloudyes                    12847.7626      2367.3459
## GenderMale:ibmcloudorwatsonyes                  1131.4440      1941.3628
## GenderFemale:ibmcloudorwatsonyes                9411.8809      5984.9447
## GenderMale:javayes                              4299.5514       496.2564
## GenderFemale:javayes                            5129.1553      1745.9366
## GenderMale:javascriptyes                       -5124.1743       527.7915
## GenderFemale:javascriptyes                     -7964.2541      1755.6545
## GenderMale:kubernetesyes                       11714.8712      1037.0743
## GenderFemale:kubernetesyes                      7285.1293      3579.2253
## GenderMale:microsoftsqlserveryes               -9124.6077       590.2690
## GenderFemale:microsoftsqlserveryes            -12487.1496      2066.1472
## GenderMale:phpyes                             -11730.2482       660.7160
## GenderFemale:phpyes                           -17690.9342      2103.8932
## GenderMale:postgresqlyes                        2180.3931       515.7995
## GenderFemale:postgresqlyes                      2603.4975      1725.3109
## GenderMale:redisyes                            12363.9187       583.4769
## GenderFemale:redisyes                          10312.9177      2153.3243
## GenderMale:rubyyes                              4647.9992       697.7576
## GenderFemale:rubyyes                            2780.3021      2108.7727
## GenderMale:rustyes                              6651.6929       949.3499
## GenderFemale:rustyes                           17251.6436      5179.4416
## GenderMale:sqliteyes                            -523.8986       563.7727
## GenderFemale:sqliteyes                         -4752.5285      2175.2623
## GenderMale:swiftyes                             5602.7613       879.6365
## GenderFemale:swiftyes                            625.8604      3227.2455
##                                            t value             Pr(>|t|)    
## (Intercept)                                -39.747 < 0.0000000000000002 ***
## GenderFemale                                 0.797             0.425732    
## GenderMale:AgeAvg                           -0.277             0.781693    
## GenderFemale:AgeAvg                         -2.040             0.041331 *  
## GenderMale:ethnicity_groupednon-minority    -9.382 < 0.0000000000000002 ***
## GenderFemale:ethnicity_groupednon-minority  -2.699             0.006961 ** 
## GenderMale:sexuality_groupedstraight         3.620             0.000295 ***
## GenderFemale:sexuality_groupedstraight      -0.975             0.329404    
## GenderMale:EdLevelAssociates                 0.503             0.615034    
## GenderFemale:EdLevelAssociates              -1.196             0.231676    
## GenderMale:EdLevelBachelors                  2.554             0.010647 *  
## GenderFemale:EdLevelBachelors                3.963   0.0000742975004777 ***
## GenderMale:EdLevelDoctorate                  4.487   0.0000072278365953 ***
## GenderFemale:EdLevelDoctorate                5.582   0.0000000239959755 ***
## GenderMale:EdLevelMasters                    3.968   0.0000725499155147 ***
## GenderFemale:EdLevelMasters                  5.929   0.0000000030789904 ***
## GenderMale:EdLevelNo Education               2.670             0.007584 ** 
## GenderFemale:EdLevelNo Education             0.403             0.687145    
## GenderMale:EdLevelPrimary                    1.771             0.076507 .  
## GenderFemale:EdLevelPrimary                  0.595             0.551540    
## GenderMale:EdLevelProfessional               2.677             0.007434 ** 
## GenderFemale:EdLevelProfessional             1.304             0.192403    
## GenderMale:EdLevelSecondary                  1.260             0.207760    
## GenderFemale:EdLevelSecondary                0.503             0.615250    
## GenderMale:EdLevelSome College               1.651             0.098730 .  
## GenderFemale:EdLevelSome College                NA                   NA    
## GenderMale:OrgSizeAvg                       -0.171             0.864334    
## GenderFemale:OrgSizeAvg                      3.849             0.000119 ***
## GenderMale:YearsCodeProAvg                  47.286 < 0.0000000000000002 ***
## GenderFemale:YearsCodeProAvg                12.167 < 0.0000000000000002 ***
## GenderMale:Year                             39.963 < 0.0000000000000002 ***
## GenderFemale:Year                           11.562 < 0.0000000000000002 ***
## GenderMale:pythonyes                         2.809             0.004970 ** 
## GenderFemale:pythonyes                      -0.434             0.664146    
## GenderMale:ryes                             -8.347 < 0.0000000000000002 ***
## GenderFemale:ryes                           -3.082             0.002059 ** 
## GenderMale:scalayes                         13.170 < 0.0000000000000002 ***
## GenderFemale:scalayes                        2.661             0.007794 ** 
## GenderMale:juliayes                         -3.305             0.000951 ***
## GenderFemale:juliayes                       -0.275             0.783527    
## GenderMale:microsoftazureyes                 9.075 < 0.0000000000000002 ***
## GenderFemale:microsoftazureyes               3.793             0.000149 ***
## GenderMale:awsyes                           20.294 < 0.0000000000000002 ***
## GenderFemale:awsyes                          5.407   0.0000000644772639 ***
## GenderMale:mariadbyes                       -7.210   0.0000000000005714 ***
## GenderFemale:mariadbyes                     -0.706             0.480494    
## GenderMale:mongodbyes                       -2.720             0.006525 ** 
## GenderFemale:mongodbyes                     -0.339             0.734848    
## GenderMale:linuxyes                          6.386   0.0000000001717139 ***
## GenderFemale:linuxyes                        0.234             0.815029    
## GenderMale:windowsyes                       -8.709 < 0.0000000000000002 ***
## GenderFemale:windowsyes                     -3.251             0.001151 ** 
## GenderMale:mysqlyes                         -7.740   0.0000000000000102 ***
## GenderFemale:mysqlyes                       -1.459             0.144557    
## GenderMale:oracleyes                        -9.355 < 0.0000000000000002 ***
## GenderFemale:oracleyes                      -3.204             0.001357 ** 
## GenderMale:ibmdb2yes                        -5.854   0.0000000048361452 ***
## GenderFemale:ibmdb2yes                      -1.928             0.053807 .  
## GenderMale:cyes                             -1.334             0.182090    
## GenderFemale:cyes                            0.946             0.344277    
## GenderMale:googlecloudyes                   10.141 < 0.0000000000000002 ***
## GenderFemale:googlecloudyes                  5.427   0.0000000576256647 ***
## GenderMale:ibmcloudorwatsonyes               0.583             0.560025    
## GenderFemale:ibmcloudorwatsonyes             1.573             0.115821    
## GenderMale:javayes                           8.664 < 0.0000000000000002 ***
## GenderFemale:javayes                         2.938             0.003308 ** 
## GenderMale:javascriptyes                    -9.709 < 0.0000000000000002 ***
## GenderFemale:javascriptyes                  -4.536   0.0000057407814317 ***
## GenderMale:kubernetesyes                    11.296 < 0.0000000000000002 ***
## GenderFemale:kubernetesyes                   2.035             0.041818 *  
## GenderMale:microsoftsqlserveryes           -15.458 < 0.0000000000000002 ***
## GenderFemale:microsoftsqlserveryes          -6.044   0.0000000015199263 ***
## GenderMale:phpyes                          -17.754 < 0.0000000000000002 ***
## GenderFemale:phpyes                         -8.409 < 0.0000000000000002 ***
## GenderMale:postgresqlyes                     4.227   0.0000237143991293 ***
## GenderFemale:postgresqlyes                   1.509             0.131306    
## GenderMale:redisyes                         21.190 < 0.0000000000000002 ***
## GenderFemale:redisyes                        4.789   0.0000016797668885 ***
## GenderMale:rubyyes                           6.661   0.0000000000274942 ***
## GenderFemale:rubyyes                         1.318             0.187362    
## GenderMale:rustyes                           7.007   0.0000000000024817 ***
## GenderFemale:rustyes                         3.331             0.000867 ***
## GenderMale:sqliteyes                        -0.929             0.352753    
## GenderFemale:sqliteyes                      -2.185             0.028909 *  
## GenderMale:swiftyes                          6.369   0.0000000001918707 ***
## GenderFemale:swiftyes                        0.194             0.846232    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41160 on 39033 degrees of freedom
##   (4537 observations deleted due to missingness)
## Multiple R-squared:  0.304,  Adjusted R-squared:  0.3025 
## F-statistic: 202.9 on 84 and 39033 DF,  p-value: < 0.00000000000000022

The adjusted \(R^2\) value is 30.25%, which means that approximately 30.25% of the variability in the dependent variable (AnnualSalary) can be explained by the independent variables included in the model, after adjusting for the number of predictors. The remaining 69.75% of the variability is unaccounted for by the model. The p-value of the overall F-test is less than 0.000000000000022%, which is strong evidence that at least one of the predictors in the model has a non-zero effect, so the overall model is statistically significant. Taken together with the simple linear regression above, this supports rejecting the null hypothesis (there is no significant difference in the mean annual salaries between males and females).
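As a consistency check on the summary output, the adjusted value can be recovered from the multiple \(R^2\) with the standard adjustment formula. Here \(n\) and \(p\) are our own symbols: \(p = 84\) is the number of estimated predictors (the numerator degrees of freedom of the F-statistic) and \(n = 39{,}033 + 84 + 1 = 39{,}118\) is the number of observations actually used, which also equals 43,655 rows minus the 4,537 deleted for missingness.

\[
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.304)\,\frac{39{,}117}{39{,}033}
\approx 0.3025
\]

The two values are close because the sample is very large relative to the number of predictors.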

Assuring compliance with model conditions

Having examined the four essential conditions for multiple linear regression (linearity, normality of residuals, constant variability, and independence of residuals), we conclude that the use of multiple linear regression is valid.

Linearity

We can assume linearity, as there is no apparent trend in the residuals when plotted against the fitted values.

ggplot(data = m_salary, aes(x = .fitted, y = .resid)) +
  geom_point()

Normality

We can assume normality, as the points mostly fall on the normal reference line.

ggplot(data = m_salary, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

Constant Variability

Points are scattered with no apparent pattern around 0, indicating that we can assume constant variability.

ggplot(data = m_salary, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
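The plots above pass the fitted model straight to ggplot(), which relies on ggplot2 converting the lm object into a data frame with .fitted and .resid columns. As a hedged alternative, the sketch below builds those two columns explicitly with base R on a toy model; the data and the name diag_df are assumptions made for the example.

```r
# Toy model, invented for illustration.
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- 3 + 2 * toy$x + rnorm(50)
fit <- lm(y ~ x, data = toy)

# Explicit diagnostics table with the same column names the plots use.
diag_df <- data.frame(.fitted = fitted(fit), .resid = resid(fit))

# For least squares with an intercept, residuals average to zero and are
# uncorrelated with the fitted values by construction.
print(mean(diag_df$.resid))
print(cor(diag_df$.fitted, diag_df$.resid))
```

Because these two quantities are zero for any such fit, the residual plots are informative only through their shape, such as curvature or a funnel, which is why the checks above are visual.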

Independence

Given that each case represents an individual response, we can reasonably assume independence. Even though the same individuals may submit survey responses over multiple years, we can expect their salaries to change within a year due to factors such as bonuses, promotions, and annual salary increases.

Conclusion

Findings

Having analyzed gigabytes of data spanning 557 raw variables, which were expanded to over 750 variables, the following conclusions have become evident:

  1. The linear model that we’ve developed only accounts for 30.25% of the variability of the dependent variable.
  2. There is another variable, or variables, not previously considered that plays a pivotal role in describing the wage gap within the technology industry.

In conclusion, there is an obvious gap in salaries between males and females shown in our regression model.

For example, the coefficients for GenderFemale:awsyes and GenderMale:awsyes represent the estimated difference in Annual Salary between respondents who use AWS and those who do not, within each gender, while holding all other variables constant. The coefficients for GenderFemale:awsyes and GenderMale:awsyes are 9,247 and 10,137, respectively. Both coefficients are positive, which means that, on average, both men and women who use AWS tend to have higher salaries compared to those who do not use AWS. However, the magnitude of the coefficient for GenderMale:awsyes is higher than that for GenderFemale:awsyes. This suggests that, on average, men who use AWS have a higher estimated increase in Annual Salary compared to women who use AWS.
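The way a Gender:aws term produces one coefficient per gender can be seen on a small invented dataset. The sketch below uses the same formula structure as the model (Gender plus Gender:aws) and compares the two resulting coefficients. The salaries are made up, and the comparison is a plain difference of estimates, not a significance test.

```r
# Toy data, invented for illustration: 4 men and 4 women, half using AWS.
toy <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 4),
                  levels = c("Male", "Female")),
  aws = factor(rep(c("no", "yes"), times = 4), levels = c("no", "yes")),
  AnnualSalary = c(100, 112, 104, 116, 90, 97, 94, 101) * 1000
)

# Gender + Gender:aws estimates a separate AWS effect within each gender.
fit <- lm(AnnualSalary ~ Gender + Gender:aws, data = toy)
b <- coef(fit)

male_aws <- unname(b["GenderMale:awsyes"])      # AWS effect among men
female_aws <- unname(b["GenderFemale:awsyes"])  # AWS effect among women
print(c(male = male_aws, female = female_aws, difference = male_aws - female_aws))
```

In this toy example each coefficient is simply the within-gender difference in mean salary between AWS users and non-users, which is the interpretation applied to the reported estimates above.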

This trend is shown across several variables in our model. In conjunction with the results from our simple linear regression model, we can conclude that our model supports our alternative hypothesis that there is a significant difference in the mean annual salaries between males and females. Importantly, this difference in mean annual salaries is not confined to a single variable but holds true across multiple variables considered in the study.

Recommendations

We anticipate that delving deeper into the factors already examined in this study may yield inconclusive results. Our main suggestion is therefore to persist with this research, seek funding, or take any other measures necessary to ensure its continuation. We have identified three potential directions to pursue next, and we are confident that exploring them would likely uncover influential variables contributing to the pay gap.

  1. Study salaries by gender and location. Our study lacks consideration for state and county variables. It’s plausible that the wage gap is specific to small towns versus big cities, or possibly confined to a particular location such as Portland, Oregon. The current data does not incorporate this geographical dimension.

  2. Study company salaries by gender. Certainly, not every company, regardless of its size, faces substantial issues with pay disparities. However, specific companies are notorious for having a very large wage gap. Examining the salaries of individual companies and discerning the wage gap within each could be a highly impactful analysis.

  3. Study entire industries by gender and salary. Software developers and project managers performing identical tasks are found in both non-profit and media sectors. It’s possible that the industry itself plays a pivotal role in influencing the wage gap.