Kristin Lussi and Tony Fraser
December 7th, 2023
This research paper uses six years’ worth of the annual Stack Overflow survey data to examine the gender pay gap within the technology sector. It’s important to note that the self-reported responses in this dataset are not necessarily specific to any particular industry but rather focus on individuals’ job functions. This data spans a variety of tech positions such as R programmer, cloud engineer, technical project manager, blockchain developer, etc.
Regarding research methods, we employ descriptive statistics to describe the data at an overview level, along with providing specific examples of the gender pay gap. Additionally, we conduct a deep dive using simple and multiple linear regression for some of our more advanced analyses.
The data clearly indicates the presence of a substantial pay gap. However, despite the extensive nature of this dataset, it is apparent that there is at least one crucial yet unidentified variable that needs to be incorporated to effectively model the pay gap. We propose that without the inclusion of the “CompanyPercentSexist” column in this dataset, gaining a comprehensive understanding and modeling of this pay gap may remain challenging.
Upon completing this study, we recommend persevering and potentially seeking funding for further research and modeling. Our first immediate suggested action would be proposing additional questions to Stack Overflow, particularly those related to geography and industry. While we might not be able to find “CompanyPercentSexist,” narrowing our focus to industry and region could enable us to provide essential information to local politicians and the media.
The gender pay gap within the US tech sector has long been a subject of concern, reflecting broader societal issues and potential barriers to gender equity.
This research paper leverages a dataset built from six years (1.55 GB) of Stack Overflow survey data to examine the pay gap across many variables.
Our aim is to answer the question: is there a significant difference in salary between males and females within the tech industry?
The dependent variable is Annual Salary. We will utilize simple linear regression to first determine if we can reject the null hypothesis.
The null hypothesis (\(H_0\)) is: There is no significant difference in the mean annual salaries between males and females.
The alternative hypothesis (\(H_1\)) is: There is a significant difference in the mean annual salaries between males and females.
Once we determine if we can reject the null hypothesis, we will determine which variables are statistically significant in predicting the response variable (Annual Salary) by creating a multiple linear regression model.
Our pipeline does the following:
We filtered the more than 500K raw records down to 43,655 for this study. Our working dataset includes only respondents who:
Provided their salary
Work in the United States
Identify as male or female
Have full time jobs
Have an annual salary below 300,000
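The filtering criteria above can be sketched in base R as follows. This is a minimal illustration on a toy data frame, not the pipeline used for the study: the name `raw_stack`, the `Employment` column, and its `"Full-time"` label are assumptions made for the example.

```r
# Toy stand-in for the raw survey data (hypothetical values and column names)
raw_stack <- data.frame(
  AnnualSalary = c(120000, NA, 44000, 350000, 80000, 95000, 70000),
  Country      = c("United States", "United States", "United States",
                   "United States", "Canada", "United States", "United States"),
  Gender       = c("Male", "Female", "Female", "Male", "Male", "Non-binary", "Female"),
  Employment   = c("Full-time", "Full-time", "Full-time", "Full-time",
                   "Full-time", "Full-time", "Part-time"),
  stringsAsFactors = FALSE
)

# Keep only rows that satisfy all five inclusion criteria
wide_stack_sketch <- subset(
  raw_stack,
  !is.na(AnnualSalary) &                 # provided their salary
    Country == "United States" &         # work in the United States
    Gender %in% c("Male", "Female") &    # identify as male or female
    Employment == "Full-time" &          # have full-time jobs
    AnnualSalary < 300000                # annual salary below 300,000
)

nrow(wide_stack_sketch)
```

Only the first and third toy rows pass every criterion; each of the others fails exactly one.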
## Rows: 43,655
## Columns: 67
## $ Year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ Country <chr> "United States", "United States", "United State…
## $ Gender <fct> Male, Male, Female, Female, Male, Male, Male, M…
## $ EdLevel <fct> Some College, Some College, Bachelors, Bachelor…
## $ DevType <chr> "Back-end developer;Front-end developer;Full-st…
## $ AnnualSalary <dbl> 120000, 250000, 44000, 60000, 80000, 74000, 115…
## $ YearsCodeProAvg <dbl> 10, 30, 7, 4, 13, 10, 16, 4, 19, 13, 1, 7, 7, 1…
## $ OrgSizeAvg <dbl> 5.00, 299.50, 749.50, 251.00, 59.50, 253.25, 59…
## $ AgeAvg <dbl> 21.0, 39.5, 21.0, 29.5, 29.5, 29.5, 29.5, 39.5,…
## $ python <fct> no, yes, no, no, yes, yes, no, no, no, yes, no,…
## $ sql <fct> no, yes, yes, no, yes, no, no, yes, no, no, yes…
## $ java <fct> no, no, yes, no, yes, no, yes, no, no, no, yes,…
## $ javascript <fct> yes, yes, yes, yes, yes, no, yes, yes, yes, no,…
## $ ruby <fct> no, yes, no, no, yes, no, yes, no, no, no, no, …
## $ php <fct> no, no, yes, yes, yes, no, no, yes, no, no, no,…
## $ c <fct> no, no, no, no, yes, yes, no, no, no, yes, no, …
## $ swift <fct> no, no, yes, no, no, no, no, no, no, no, no, ye…
## $ scala <fct> no, no, no, no, no, no, yes, no, no, yes, no, n…
## $ r <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ rust <fct> no, no, no, no, no, yes, no, no, no, no, no, no…
## $ julia <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ mysql <fct> no, no, yes, no, no, no, no, yes, no, yes, no, …
## $ microsoftsqlserver <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ mongodb <fct> yes, no, no, no, no, no, yes, no, no, no, no, n…
## $ postgresql <fct> no, yes, no, yes, yes, no, no, no, no, yes, yes…
## $ oracle <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ ibmdb2 <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ redis <fct> no, yes, no, no, no, no, no, no, no, no, no, no…
## $ sqlite <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ mariadb <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ microsoftazure <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ googlecloud <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ ibmcloudorwatson <fct> no, no, no, no, no, no, no, no, no, yes, no, no…
## $ kubernetes <fct> no, no, no, no, no, no, no, no, no, no, no, no,…
## $ linux <fct> yes, yes, no, no, yes, no, yes, no, no, no, yes…
## $ windows <fct> no, no, no, no, no, no, no, no, yes, no, no, no…
## $ sexuality_grouped <fct> straight, straight, straight, straight, straigh…
## $ ethnicity_grouped <fct> non-minority, non-minority, non-minority, non-m…
## $ aws <fct> no, yes, no, no, no, no, yes, no, no, no, no, n…
## $ python_num <dbl> 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1,…
## $ sql_num <dbl> 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1,…
## $ java_num <dbl> 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ javascript_num <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ ruby_num <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ php_num <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ c_num <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ swift_num <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ scala_num <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ r_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ rust_num <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ julia_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mysql_num <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,…
## $ microsoftsqlserver_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mongodb_num <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ postgresql_num <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ oracle_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ibmdb2_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ redis_num <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ sqlite_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mariadb_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ microsoftazure_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ googlecloud_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ibmcloudorwatson_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ kubernetes_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ linux_num <dbl> 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ windows_num <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ aws_num <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,…
The respondents span many types of careers, all of them highly technical in nature. Many more men than women responded to this survey, which most likely introduces several kinds of bias.
We converted many independent variables to integers so that we could look at their correlation with the dependent variable.
As a final series of charts, and as we alluded to in the project proposal, the pay gap is not limited to one or two variables; it appears across all of them. Before moving to a more thorough analysis, we include additional descriptive box plots in this presentation.
To determine if we can reject the null hypothesis, we will begin by constructing a simple linear regression model with annual salary as the dependent variable and gender as the independent variable.
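As an illustration of that conversion, the sketch below recodes a yes/no factor as 0/1 and correlates it with salary. It uses the first six `python` and `AnnualSalary` values from the data overview above; the data frame name `toy` is an assumption, and the resulting correlation describes these six rows only.

```r
# First six rows of python and AnnualSalary, copied from the data overview
toy <- data.frame(
  AnnualSalary = c(120000, 250000, 44000, 60000, 80000, 74000),
  python = factor(c("no", "yes", "no", "no", "yes", "yes"),
                  levels = c("no", "yes"))
)

# Recode the factor as a 0/1 indicator, mirroring the *_num columns
toy$python_num <- as.numeric(toy$python == "yes")

# Pearson correlation between the indicator and salary
cor(toy$python_num, toy$AnnualSalary)
```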
wide_stack$Gender <- relevel(wide_stack$Gender, ref = "Male")
m_salary_gender <- lm(AnnualSalary ~ Gender, data = wide_stack)
summary(m_salary_gender)
##
## Call:
## lm(formula = AnnualSalary ~ Gender, data = wide_stack)
##
## Residuals:
## Min 1Q Median 3Q Max
## -119345 -35345 -7345 30655 189768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119344.8 246.9 483.45 <0.0000000000000002 ***
## GenderFemale -14112.9 842.5 -16.75 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49310 on 43653 degrees of freedom
## Multiple R-squared: 0.006387, Adjusted R-squared: 0.006364
## F-statistic: 280.6 on 1 and 43653 DF, p-value: < 0.00000000000000022
From the above, we see that the p-value is below 2.2e-16 (less than 0.000000000000022%). At the 5% significance level, a p-value of less than 5% means that we can reject the null hypothesis. We can see that the adjusted \(R^2\) value is very small (0.6364%), which means that gender alone explains little of the variability of the dependent variable. Because gender is the only predictor in this model, the coefficient -14,112.9 is the raw difference in group means: on average, females in the sample report an annual salary $14,112.90 lower than males, with no other factors controlled for.
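To make the interpretation of the two coefficients concrete, the sketch below fits the same kind of one-predictor model on simulated data and checks that the intercept is the mean of the reference group (males) and the `GenderFemale` coefficient is the difference in group means. The simulated values are assumptions for illustration only and are not drawn from the survey.

```r
set.seed(1)

# Simulated salaries for 60 males and 40 females (illustrative numbers only)
toy <- data.frame(
  Gender = factor(rep(c("Male", "Female"), times = c(60, 40)),
                  levels = c("Male", "Female")),
  AnnualSalary = c(rnorm(60, mean = 119000, sd = 49000),
                   rnorm(40, mean = 105000, sd = 49000))
)

fit   <- lm(AnnualSalary ~ Gender, data = toy)
means <- tapply(toy$AnnualSalary, toy$Gender, mean)

coef(fit)[["(Intercept)"]]    # same value as means[["Male"]]
coef(fit)[["GenderFemale"]]   # same value as means[["Female"]] - means[["Male"]]

# Applied to the reported estimates: implied mean female salary in the sample
implied_female_mean <- 119344.8 - 14112.9
implied_female_mean
```

With the estimates reported above, the implied mean female salary is 105,231.9.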
Now that we have determined that we can reject the null hypothesis, we will construct a multiple linear regression model which includes all of the variables in our study to determine which are good predictors of the variance in annual salary. Each variable enters the model as an interaction with gender, written as “Gender:Variable,” which lets us estimate the effect of each variable separately for males and females.
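The sketch below, on simulated data, shows what the `Gender + Gender:Variable` form does in R's formula syntax: it produces one slope per gender, and it is a reparameterization of the more familiar `Gender * Variable` form with identical fitted values. The data and the effect sizes are assumptions for illustration.

```r
set.seed(2)
n <- 200

# Simulated data: one numeric predictor and a two-level gender factor
toy <- data.frame(
  Gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  YearsCodeProAvg = runif(n, min = 0, max = 30)
)
toy$AnnualSalary <- 60000 + 2000 * toy$YearsCodeProAvg + rnorm(n, sd = 10000)

# Nested form: a separate slope for each gender
nested  <- lm(AnnualSalary ~ Gender + Gender:YearsCodeProAvg, data = toy)
# Crossed form: a common slope plus a female-minus-male slope difference
crossed <- lm(AnnualSalary ~ Gender * YearsCodeProAvg, data = toy)

names(coef(nested))
names(coef(crossed))

# Same model, different parameterization
all.equal(unname(fitted(nested)), unname(fitted(crossed)))
```

In the nested form the coefficients are named `GenderMale:YearsCodeProAvg` and `GenderFemale:YearsCodeProAvg`, matching the naming in the summary output; the interaction coefficient of the crossed form equals the female slope minus the male slope.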
# Use "Something Else" as the reference education level
wide_stack$EdLevel <- relevel(wide_stack$EdLevel, ref = "Something Else")
# Each predictor enters as Gender:Variable, giving one coefficient per gender
m_salary <- lm(
  AnnualSalary ~ Gender + Gender:AgeAvg + Gender:ethnicity_grouped +
    Gender:sexuality_grouped + Gender:EdLevel + Gender:OrgSizeAvg +
    Gender:YearsCodeProAvg + Gender:Year + Gender:python + Gender:r +
    Gender:scala + Gender:julia + Gender:microsoftazure + Gender:aws +
    Gender:mariadb + Gender:mongodb + Gender:linux + Gender:windows +
    Gender:mysql + Gender:oracle + Gender:ibmdb2 + Gender:c +
    Gender:googlecloud + Gender:ibmcloudorwatson + Gender:java +
    Gender:javascript + Gender:kubernetes + Gender:microsoftsqlserver +
    Gender:php + Gender:postgresql + Gender:redis + Gender:ruby +
    Gender:rust + Gender:sqlite + Gender:swift,
  data = wide_stack
)
summary(m_salary)
##
## Call:
## lm(formula = AnnualSalary ~ Gender + Gender:AgeAvg + Gender:ethnicity_grouped +
## Gender:sexuality_grouped + Gender:EdLevel + Gender:OrgSizeAvg +
## Gender:YearsCodeProAvg + Gender:Year + Gender:python + Gender:r +
## Gender:scala + Gender:julia + Gender:microsoftazure + Gender:aws +
## Gender:mariadb + Gender:mongodb + Gender:linux + Gender:windows +
## Gender:mysql + Gender:oracle + Gender:ibmdb2 + Gender:c +
## Gender:googlecloud + Gender:ibmcloudorwatson + Gender:java +
## Gender:javascript + Gender:kubernetes + +Gender:microsoftsqlserver +
## Gender:php + Gender:postgresql + Gender:redis + Gender:ruby +
## Gender:rust + Gender:sqlite + Gender:swift, data = wide_stack)
##
## Residuals:
## Min 1Q Median 3Q Max
## -198667 -26190 -5315 20592 204972
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error
## (Intercept) -15593717.8116 392326.7922
## GenderFemale 1055153.3243 1324697.4099
## GenderMale:AgeAvg -11.4028 41.1481
## GenderFemale:AgeAvg -254.9622 124.9653
## GenderMale:ethnicity_groupednon-minority -6204.3484 661.3269
## GenderFemale:ethnicity_groupednon-minority -5112.7181 1894.4108
## GenderMale:sexuality_groupedstraight 3160.2572 873.0873
## GenderFemale:sexuality_groupedstraight -1558.5486 1597.9757
## GenderMale:EdLevelAssociates 3318.7737 6599.2093
## GenderFemale:EdLevelAssociates -5706.9969 4771.4491
## GenderMale:EdLevelBachelors 16665.5007 6524.7049
## GenderFemale:EdLevelBachelors 11877.7744 2997.5375
## GenderMale:EdLevelDoctorate 29788.6644 6638.1813
## GenderFemale:EdLevelDoctorate 28693.4539 5140.7648
## GenderMale:EdLevelMasters 25948.1762 6539.0454
## GenderFemale:EdLevelMasters 19306.4405 3256.4361
## GenderMale:EdLevelNo Education 22880.5619 8568.8382
## GenderFemale:EdLevelNo Education 16709.1622 41489.2264
## GenderMale:EdLevelPrimary 13698.3498 7733.2068
## GenderFemale:EdLevelPrimary 14404.1486 24189.9773
## GenderMale:EdLevelProfessional 19728.2009 7369.8143
## GenderFemale:EdLevelProfessional 12354.4802 9477.7365
## GenderMale:EdLevelSecondary 8434.8881 6695.5834
## GenderFemale:EdLevelSecondary 3764.6574 7490.3956
## GenderMale:EdLevelSome College 10810.8658 6547.7393
## GenderFemale:EdLevelSome College NA NA
## GenderMale:OrgSizeAvg -0.1833 1.0729
## GenderFemale:OrgSizeAvg 13.8690 3.6031
## GenderMale:YearsCodeProAvg 2068.5699 43.7464
## GenderFemale:YearsCodeProAvg 1762.6129 144.8698
## GenderMale:Year 7760.3671 194.1885
## GenderFemale:Year 7246.2860 626.7079
## GenderMale:pythonyes 1353.5161 481.8252
## GenderFemale:pythonyes -715.7141 1648.3513
## GenderMale:ryes -8005.0421 958.9999
## GenderFemale:ryes -8614.8058 2795.4388
## GenderMale:scalayes 14652.0037 1112.5435
## GenderFemale:scalayes 10536.9673 3959.7968
## GenderMale:juliayes -8849.4946 2677.7737
## GenderFemale:juliayes -4053.9224 14756.1703
## GenderMale:microsoftazureyes 6116.3312 673.9515
## GenderFemale:microsoftazureyes 9812.1632 2587.1221
## GenderMale:awsyes 10137.3801 499.5139
## GenderFemale:awsyes 9247.6344 1710.3190
## GenderMale:mariadbyes -5738.0277 795.8890
## GenderFemale:mariadbyes -2285.2461 3239.1143
## GenderMale:mongodbyes -1576.2923 579.4587
## GenderFemale:mongodbyes -655.3025 1934.8371
## GenderMale:linuxyes 3662.2745 573.4466
## GenderFemale:linuxyes 480.5524 2054.1250
## GenderMale:windowsyes -4989.0588 572.8468
## GenderFemale:windowsyes -6696.2747 2059.7840
## GenderMale:mysqlyes -4048.3508 523.0716
## GenderFemale:mysqlyes -2454.4706 1682.2298
## GenderMale:oracleyes -7145.7340 763.8362
## GenderFemale:oracleyes -9029.7563 2818.4312
## GenderMale:ibmdb2yes -9270.4277 1583.5959
## GenderFemale:ibmdb2yes -13383.5044 6940.0395
## GenderMale:cyes -883.9777 662.4677
## GenderFemale:cyes 2934.1621 3102.4411
## GenderMale:googlecloudyes 6875.8451 677.9926
## GenderFemale:googlecloudyes 12847.7626 2367.3459
## GenderMale:ibmcloudorwatsonyes 1131.4440 1941.3628
## GenderFemale:ibmcloudorwatsonyes 9411.8809 5984.9447
## GenderMale:javayes 4299.5514 496.2564
## GenderFemale:javayes 5129.1553 1745.9366
## GenderMale:javascriptyes -5124.1743 527.7915
## GenderFemale:javascriptyes -7964.2541 1755.6545
## GenderMale:kubernetesyes 11714.8712 1037.0743
## GenderFemale:kubernetesyes 7285.1293 3579.2253
## GenderMale:microsoftsqlserveryes -9124.6077 590.2690
## GenderFemale:microsoftsqlserveryes -12487.1496 2066.1472
## GenderMale:phpyes -11730.2482 660.7160
## GenderFemale:phpyes -17690.9342 2103.8932
## GenderMale:postgresqlyes 2180.3931 515.7995
## GenderFemale:postgresqlyes 2603.4975 1725.3109
## GenderMale:redisyes 12363.9187 583.4769
## GenderFemale:redisyes 10312.9177 2153.3243
## GenderMale:rubyyes 4647.9992 697.7576
## GenderFemale:rubyyes 2780.3021 2108.7727
## GenderMale:rustyes 6651.6929 949.3499
## GenderFemale:rustyes 17251.6436 5179.4416
## GenderMale:sqliteyes -523.8986 563.7727
## GenderFemale:sqliteyes -4752.5285 2175.2623
## GenderMale:swiftyes 5602.7613 879.6365
## GenderFemale:swiftyes 625.8604 3227.2455
## t value Pr(>|t|)
## (Intercept) -39.747 < 0.0000000000000002 ***
## GenderFemale 0.797 0.425732
## GenderMale:AgeAvg -0.277 0.781693
## GenderFemale:AgeAvg -2.040 0.041331 *
## GenderMale:ethnicity_groupednon-minority -9.382 < 0.0000000000000002 ***
## GenderFemale:ethnicity_groupednon-minority -2.699 0.006961 **
## GenderMale:sexuality_groupedstraight 3.620 0.000295 ***
## GenderFemale:sexuality_groupedstraight -0.975 0.329404
## GenderMale:EdLevelAssociates 0.503 0.615034
## GenderFemale:EdLevelAssociates -1.196 0.231676
## GenderMale:EdLevelBachelors 2.554 0.010647 *
## GenderFemale:EdLevelBachelors 3.963 0.0000742975004777 ***
## GenderMale:EdLevelDoctorate 4.487 0.0000072278365953 ***
## GenderFemale:EdLevelDoctorate 5.582 0.0000000239959755 ***
## GenderMale:EdLevelMasters 3.968 0.0000725499155147 ***
## GenderFemale:EdLevelMasters 5.929 0.0000000030789904 ***
## GenderMale:EdLevelNo Education 2.670 0.007584 **
## GenderFemale:EdLevelNo Education 0.403 0.687145
## GenderMale:EdLevelPrimary 1.771 0.076507 .
## GenderFemale:EdLevelPrimary 0.595 0.551540
## GenderMale:EdLevelProfessional 2.677 0.007434 **
## GenderFemale:EdLevelProfessional 1.304 0.192403
## GenderMale:EdLevelSecondary 1.260 0.207760
## GenderFemale:EdLevelSecondary 0.503 0.615250
## GenderMale:EdLevelSome College 1.651 0.098730 .
## GenderFemale:EdLevelSome College NA NA
## GenderMale:OrgSizeAvg -0.171 0.864334
## GenderFemale:OrgSizeAvg 3.849 0.000119 ***
## GenderMale:YearsCodeProAvg 47.286 < 0.0000000000000002 ***
## GenderFemale:YearsCodeProAvg 12.167 < 0.0000000000000002 ***
## GenderMale:Year 39.963 < 0.0000000000000002 ***
## GenderFemale:Year 11.562 < 0.0000000000000002 ***
## GenderMale:pythonyes 2.809 0.004970 **
## GenderFemale:pythonyes -0.434 0.664146
## GenderMale:ryes -8.347 < 0.0000000000000002 ***
## GenderFemale:ryes -3.082 0.002059 **
## GenderMale:scalayes 13.170 < 0.0000000000000002 ***
## GenderFemale:scalayes 2.661 0.007794 **
## GenderMale:juliayes -3.305 0.000951 ***
## GenderFemale:juliayes -0.275 0.783527
## GenderMale:microsoftazureyes 9.075 < 0.0000000000000002 ***
## GenderFemale:microsoftazureyes 3.793 0.000149 ***
## GenderMale:awsyes 20.294 < 0.0000000000000002 ***
## GenderFemale:awsyes 5.407 0.0000000644772639 ***
## GenderMale:mariadbyes -7.210 0.0000000000005714 ***
## GenderFemale:mariadbyes -0.706 0.480494
## GenderMale:mongodbyes -2.720 0.006525 **
## GenderFemale:mongodbyes -0.339 0.734848
## GenderMale:linuxyes 6.386 0.0000000001717139 ***
## GenderFemale:linuxyes 0.234 0.815029
## GenderMale:windowsyes -8.709 < 0.0000000000000002 ***
## GenderFemale:windowsyes -3.251 0.001151 **
## GenderMale:mysqlyes -7.740 0.0000000000000102 ***
## GenderFemale:mysqlyes -1.459 0.144557
## GenderMale:oracleyes -9.355 < 0.0000000000000002 ***
## GenderFemale:oracleyes -3.204 0.001357 **
## GenderMale:ibmdb2yes -5.854 0.0000000048361452 ***
## GenderFemale:ibmdb2yes -1.928 0.053807 .
## GenderMale:cyes -1.334 0.182090
## GenderFemale:cyes 0.946 0.344277
## GenderMale:googlecloudyes 10.141 < 0.0000000000000002 ***
## GenderFemale:googlecloudyes 5.427 0.0000000576256647 ***
## GenderMale:ibmcloudorwatsonyes 0.583 0.560025
## GenderFemale:ibmcloudorwatsonyes 1.573 0.115821
## GenderMale:javayes 8.664 < 0.0000000000000002 ***
## GenderFemale:javayes 2.938 0.003308 **
## GenderMale:javascriptyes -9.709 < 0.0000000000000002 ***
## GenderFemale:javascriptyes -4.536 0.0000057407814317 ***
## GenderMale:kubernetesyes 11.296 < 0.0000000000000002 ***
## GenderFemale:kubernetesyes 2.035 0.041818 *
## GenderMale:microsoftsqlserveryes -15.458 < 0.0000000000000002 ***
## GenderFemale:microsoftsqlserveryes -6.044 0.0000000015199263 ***
## GenderMale:phpyes -17.754 < 0.0000000000000002 ***
## GenderFemale:phpyes -8.409 < 0.0000000000000002 ***
## GenderMale:postgresqlyes 4.227 0.0000237143991293 ***
## GenderFemale:postgresqlyes 1.509 0.131306
## GenderMale:redisyes 21.190 < 0.0000000000000002 ***
## GenderFemale:redisyes 4.789 0.0000016797668885 ***
## GenderMale:rubyyes 6.661 0.0000000000274942 ***
## GenderFemale:rubyyes 1.318 0.187362
## GenderMale:rustyes 7.007 0.0000000000024817 ***
## GenderFemale:rustyes 3.331 0.000867 ***
## GenderMale:sqliteyes -0.929 0.352753
## GenderFemale:sqliteyes -2.185 0.028909 *
## GenderMale:swiftyes 6.369 0.0000000001918707 ***
## GenderFemale:swiftyes 0.194 0.846232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41160 on 39033 degrees of freedom
## (4537 observations deleted due to missingness)
## Multiple R-squared: 0.304, Adjusted R-squared: 0.3025
## F-statistic: 202.9 on 84 and 39033 DF, p-value: < 0.00000000000000022
The adjusted \(R^2\) value is 30.25% (the multiple \(R^2\) is 30.4%), which means that approximately 30% of the variability in the dependent variable (AnnualSalary) can be explained by the independent variables included in the model. The remaining roughly 70% of the variability is unaccounted for by the model. The p-value of the overall F-test is below 2.2e-16, which is strong evidence that at least one of the predictors in the model has a non-zero effect and that the overall model is statistically significant. This is consistent with our earlier rejection of the null hypothesis (there is no significant difference in the mean annual salaries between males and females), although the F-test by itself assesses the model as a whole rather than the gender difference specifically.
We examined the four essential conditions for multiple linear regression (linearity, normality of residuals, constant variability, and independence of residuals) and conclude that multiple linear regression is a valid choice for this data.
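The relationship between the two \(R^2\) figures in the summary can be checked with the standard adjustment formula. The sketch below uses only numbers printed in the output above (multiple \(R^2\), residual degrees of freedom, and the 84 model degrees of freedom); because the printed \(R^2\) is rounded, the result is approximate.

```r
r2       <- 0.304    # multiple R-squared, as printed (rounded)
df_resid <- 39033    # residual degrees of freedom
p        <- 84       # model degrees of freedom from the F-statistic line
n        <- df_resid + p + 1   # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)
```

With more than 39,000 observations the penalty for 84 estimated coefficients is small, which is why the two values differ only in the third decimal place.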
We can assume linearity, as there is no apparent trend in the plotted residuals.
We can assume normality, as the points in the normal Q-Q plot mostly fall on the reference line.
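A normality check of this kind can be sketched with base R's `qqnorm()` and `qqline()`. The example below uses a small simulated model so that it runs on its own; for the study one would pass the residuals of `m_salary` instead. The toy data and object names are assumptions.

```r
set.seed(3)

# Small simulated regression, standing in for the salary model
toy   <- data.frame(x = runif(100))
toy$y <- 1 + 2 * toy$x + rnorm(100)
fit   <- lm(y ~ x, data = toy)

# Standardized residuals against theoretical normal quantiles
res <- rstandard(fit)
qq  <- qqnorm(res, main = "Normal Q-Q plot of standardized residuals")
qqline(res, lty = 2)   # reference line through the first and third quartiles
```

Points that track the dashed line support the normality assumption; systematic curvature at the ends indicates heavy or light tails.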
Points are scattered with no apparent pattern around 0, indicating that we can assume constant variability.
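library(ggplot2)
# ggplot2 converts the lm object into a data frame with .fitted and .resid columns
ggplot(data = m_salary, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")  # reference line at zero residual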
Given that each case represents an individual response, we can reasonably assume independence. Even though the same individuals may submit survey responses over multiple years, we can expect their salaries to change within a year due to factors such as bonuses, promotions, and annual salary increases.
Having analyzed gigabytes of data spanning 557 raw variables, which we expanded to over 750 variables, we draw the following conclusions:
There is an obvious gap in salaries between males and females, as shown in our regression models.
For example, the coefficients for GenderFemale:awsyes and GenderMale:awsyes represent the estimated difference in Annual Salary between respondents who use AWS and those who do not, within each gender, while holding all other variables constant. The coefficients for GenderFemale:awsyes and GenderMale:awsyes are 9,247 and 10,137, respectively. Both coefficients are positive, which means that, on average, both men and women who use AWS tend to have higher salaries compared to those who do not use AWS. However, the magnitude of the coefficient for GenderMale:awsyes is higher than that for GenderFemale:awsyes. This suggests that, on average, men who use AWS have a higher estimated increase in Annual Salary compared to women who use AWS.
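One rough way to gauge whether the two AWS coefficients differ by more than sampling noise is a z-test on their difference, using the estimates and standard errors printed in the summary. This is a sketch under a stated assumption, not part of the original analysis: it treats the male and female estimates as uncorrelated, which is plausible here because every predictor is nested within gender.

```r
# Estimates and standard errors copied from the model summary
b_male   <- 10137.3801; se_male   <- 499.5139
b_female <- 9247.6344;  se_female <- 1710.3190

# Difference between the two coefficients and its standard error
# (assumes zero covariance between the male and female estimates)
gap    <- b_male - b_female
se_gap <- sqrt(se_male^2 + se_female^2)

z       <- gap / se_gap
p_value <- 2 * pnorm(-abs(z))   # two-sided p-value

c(gap = gap, se = se_gap, z = z, p = p_value)
```

With the reported values, z is about 0.5, so on this rough check the AWS coefficients for men and women are not distinguishable at the 5% level. The AWS example therefore illustrates the direction of the estimates, and the same check could be repeated for the other variables to see where the gap is more pronounced.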
This trend is shown across several variables in our model. In conjunction with the results from our simple linear regression model, we can conclude that our model supports our alternative hypothesis that there is a significant difference in the mean annual salaries between males and females. Importantly, this difference in mean annual salaries is not confined to a single variable but holds true across multiple variables considered in the study.
Our main suggestion is to persist with this research, seek funding, or take any necessary measures to ensure its continuation. We anticipate that delving deeper into the factors already examined in this study may yield inconclusive results. We have, however, identified three potential directions to pursue next, and we are confident that exploring them would likely uncover influential variables contributing to the pay gap.
Study salaries by gender and location: Our study lacks consideration for state and county variables. It’s plausible that the wage gap is specific to small towns versus big cities, or possibly confined to a particular location such as Portland, Oregon. The current data does not incorporate this geographical dimension.
Study company salaries by gender: Certainly, not every company, regardless of its size, faces substantial issues with pay disparities. However, some companies are notorious for large wage gaps. Examining the salaries of individual companies and discerning the wage gap within each could be a highly impactful analysis.
Study entire industries by gender and salary: Software developers and project managers performing identical tasks are found in both non-profit and media sectors. It’s possible that the industry itself plays a pivotal role in influencing the wage gap.