According to the ‘opm94’ data, female federal employees earned significantly less than their male counterparts. The regression models in this analysis show that qualifications and experience explain most of the wage gap, but not all of it. The ‘opm2008’ data reveal that the gap between the sexes narrowed over time but still exists. The gender coefficients are statistically significant, so we infer that the gap also exists in the population.
In 1994, the average salary for minorities was lower than for nonminorities; by 2008 the picture had changed. The 2008 sample shows higher minority salaries, with one group (Asian employees) surpassing the nonminority average. The raw gap between minorities and nonminorities is statistically significant in both years, but it is no longer significant once grade, experience, and education are held equal.
The idea of the American dream conveys freedom: the freedom to choose from the plethora of opportunities that America offers. As the country grows, the population becomes more diverse. It’s a melting pot of people of various ethnicities. Yet the concept of freedom doesn’t apply equally to everyone. It’s commonly said that women are paid less than men, and even less if they belong to a minority. This is an issue that needs to be tackled. According to the US census, 50% of US citizens are women and 34% of US citizens are minorities. These are millions of people who are getting unfair pay due to their gender or ethnicity.
Do race and gender determine how much you get paid? And has that changed since 1994?
The variables used in this project come from both ‘opm94’ and ‘opm2008’. We compare how ‘race’, ‘minority’, ‘male’, and ‘female01’ relate to salary (‘sal’ in ‘opm94’, ‘salary’ in ‘opm2008’) in the two years.
Other variables that influence the outcome:
‘edyrs’ - Education (years)
‘yos’ - Federal experience (years)
‘grade’ - Grade level (1-16)
The data come from the United States’ federal employee records for 1994 and 2008. The Office of Personnel Management collects data on the wages and promotions of 1.9 million federal workers in its Central Personnel Data File database.
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("summarytools")
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
## system might not have X11 capabilities; in case of errors when using dfSummary(), set st_options(use.x11 = FALSE)
## For best results, restart R session and update pander using devtools:: or remotes::install_github('rapporter/pander')
library("knitr")
library("ggplot2")
load("/cloud/project/Project data.RData")
First, we regress salary on the ‘female01’ indicator:
lm(sal ~ female01, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ female01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31945 -11537 -3092 9591 71883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46999.4 729.8 64.40 <2e-16 ***
## female01 -12776.6 1046.3 -12.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16500 on 993 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.1305, Adjusted R-squared: 0.1297
## F-statistic: 149.1 on 1 and 993 DF, p-value: < 2.2e-16
For reference, we’re going to be constructing a boxplot:
ggplot(data = opm94, mapping = aes(x = factor(male), y = sal, col = factor(male))) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
According to our 1994 sample, the average salary for females was $12,777 lower than for their male counterparts. Maybe men have more education, more experience, or higher grades than women, which would explain their higher wages.
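With a single 0/1 predictor, the slope of a simple regression is just the difference between the two group means, which is why the coefficient on ‘female01’ (-12776.6) can be read as an average salary gap. A minimal sketch on simulated data (the numbers below are made up and are not from ‘opm94’):

```r
# Simulated data only; these values are hypothetical, not from opm94.
set.seed(1)
n <- 200
female01 <- rbinom(n, 1, 0.5)                      # 1 = female, 0 = male
sal <- 47000 - 12000 * female01 + rnorm(n, sd = 15000)

fit <- lm(sal ~ female01)

# Slope on the dummy versus the plain difference in group means
gap_from_lm <- unname(coef(fit)["female01"])
gap_from_means <- mean(sal[female01 == 1]) - mean(sal[female01 == 0])

# The intercept is the mean of the group coded 0
intercept <- unname(coef(fit)["(Intercept)"])
mean_group0 <- mean(sal[female01 == 0])

all.equal(gap_from_lm, gap_from_means)
all.equal(intercept, mean_group0)
```

The same reading applies to the intercept in the output above: 46999.4 is the average salary of the group coded 0.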
We can calculate that with a Multiple Regression:
lm(sal ~ female01 + grade + yos + edyrs, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ female01 + grade + yos + edyrs, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14363 -4613 -560 3311 45093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14255.58 1601.92 -8.899 < 2e-16 ***
## female01 -1442.57 460.41 -3.133 0.00178 **
## grade 4126.46 88.16 46.808 < 2e-16 ***
## yos 313.91 26.08 12.038 < 2e-16 ***
## edyrs 796.22 123.10 6.468 1.56e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6742 on 990 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8552, Adjusted R-squared: 0.8546
## F-statistic: 1462 on 4 and 990 DF, p-value: < 2.2e-16
According to this sample, females earn on average $1,443 less than males with the same grade, years of service, and education. Qualifications explain most of the raw gap, but not all of it.
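The drop from -12776.6 to -1442.57 once ‘grade’, ‘yos’, and ‘edyrs’ enter the model is what one would expect if women in the sample sit in lower grades on average. A rough sketch of that mechanism on simulated data, where the grade shift and the built-in gap of 1,500 are assumptions chosen only for illustration:

```r
# Simulated data only; the grade shift and the 1,500 gap are hypothetical.
set.seed(2)
n <- 1000
female01 <- rbinom(n, 1, 0.5)

# Assumption: women sit about two grades lower on average (grades kept in 1-16)
grade <- round(pmin(pmax(rnorm(n, mean = 10 - 2 * female01, sd = 2.5), 1), 16))

# Salary depends mostly on grade, plus a smaller direct gap
sal <- -5000 + 4000 * grade - 1500 * female01 + rnorm(n, sd = 5000)

raw <- unname(coef(lm(sal ~ female01))["female01"])
adjusted <- unname(coef(lm(sal ~ female01 + grade))["female01"])

# The raw gap mixes the direct gap with the grade difference;
# the adjusted gap compares people in the same grade.
round(c(raw = raw, adjusted = adjusted))
```

The adjusted coefficient answers a narrower question than the raw one: it says nothing about why the two groups end up in different grades in the first place.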
First, calculate the mean salary for each ‘race’:
opm94 %>% group_by(race) %>% summarise(Mean_Salary = mean(sal, na.rm = TRUE))
## # A tibble: 5 x 2
## race Mean_Salary
## <fct> <dbl>
## 1 American Indian 32846.
## 2 Asian 38440.
## 3 Black 32713.
## 4 Hispanic 36500.
## 5 White 43294.
This gives us an idea of the wage inequality between different races. There are too many categories to compare at once, so we’re going to simplify them to minorities and nonminorities.
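The ‘minority’ variable already exists in the data set, and its exact definition is not shown here. One plausible coding, assumed for this sketch, is 1 for every race other than White and 0 for White; the intercept of the regression that follows (43294) equals the White mean in the table above, which is consistent with that assumption.

```r
# Hypothetical toy vector; the coding rule below is an assumption,
# not the documented definition of the data set's 'minority' column.
race <- factor(c("White", "Black", "Asian", "Hispanic", "American Indian", "White"))

# 1 = any race other than White, 0 = White
minority <- as.integer(race != "White")

data.frame(race, minority)
```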
Now calculate the regression between ‘minority’ and ‘sal’:
lm(sal ~ minority, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ minority, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28240 -13169 -2282 10818 78126
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43294 639 67.75 < 2e-16 ***
## minority -9250 1227 -7.54 1.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17210 on 993 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.05415, Adjusted R-squared: 0.0532
## F-statistic: 56.85 on 1 and 993 DF, p-value: 1.058e-13
And now the box plot:
ggplot(data = opm94, mapping = aes(x = factor(minority), y = sal, col = factor(minority))) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
ggplot(data = opm94, mapping = aes(x = factor(race), y = sal, col = factor(race))) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
According to the data, minorities make on average $9,250 less than nonminorities. Maybe qualifications are a factor in this gap.
Calculate a multiple regression with ‘sal’, ‘minority’, ‘yos’, ‘grade’ and ‘edyrs’:
lm(sal ~ minority + grade + yos + edyrs, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ minority + grade + yos + edyrs, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15344 -4565 -504 3227 45231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15860.88 1526.44 -10.391 < 2e-16 ***
## minority -440.38 497.64 -0.885 0.376
## grade 4172.61 87.52 47.677 < 2e-16 ***
## yos 312.03 26.25 11.885 < 2e-16 ***
## edyrs 838.50 122.85 6.826 1.52e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6773 on 990 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8539, Adjusted R-squared: 0.8533
## F-statistic: 1447 on 4 and 990 DF, p-value: < 2.2e-16
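With the same grade, experience, and education, minorities make on average $440 less than nonminorities. However, this difference is not statistically significant (p = 0.376), so in 1994 we cannot rule out that qualifications account for the entire gap.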
Make a correlation matrix with the variables ‘sal’, ‘female01’, ‘minority’, and ‘promo01’:
opm94 %>% select(sal, female01, minority, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(2)
## sal female01 minority promo01
## sal 1.00 -0.36 -0.23 -0.15
## female01 -0.36 1.00 0.12 0.07
## minority -0.23 0.12 1.00 0.04
## promo01 -0.15 0.07 0.04 1.00
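In this sample, females and minorities tend to earn a lower wage than their counterparts: both indicators are negatively correlated with ‘sal’ (-0.36 and -0.23). Their correlations with ‘promo01’ are weak and slightly positive (0.07 and 0.04), so the matrix does not show that they were promoted less often. In 1994, the raw pay gaps for women and minorities were large. Times were a little different back then. Maybe America has changed since then.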
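When both variables are 0/1 indicators, the sign of their correlation is the sign of the difference in rates between the two groups. Assuming ‘promo01’ is coded 1 for employees who were promoted (the coding is not shown here), a positive entry means the group coded 1 was promoted slightly more often, not less. A small sketch on simulated data with hypothetical rates:

```r
# Simulated data only; the promotion rates are hypothetical.
set.seed(3)
n <- 500
group01 <- rbinom(n, 1, 0.4)

# Assumption: the group coded 1 has a slightly HIGHER rate of outcome01 = 1
outcome01 <- rbinom(n, 1, 0.20 + 0.10 * group01)

r <- cor(group01, outcome01)
rate_diff <- mean(outcome01[group01 == 1]) - mean(outcome01[group01 == 0])

c(correlation = r, rate_difference = rate_diff)

# The two quantities always share a sign
sign(r) == sign(rate_diff)
```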
Moving to the 2008 data, regress salary on ‘female01’:
lm(salary ~ female01, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ female01, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51916 -22631 -5674 18366 139731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74840.5 444.0 168.56 <2e-16 ***
## female01 -10938.5 606.8 -18.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28810 on 9058 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.03463, Adjusted R-squared: 0.03452
## F-statistic: 324.9 on 1 and 9058 DF, p-value: < 2.2e-16
Now the boxplot:
ggplot(data = opm2008, mapping = aes(x = factor(male), y = salary, col = factor(male))) + geom_boxplot()
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).
In the 2008 sample, the average salary for females is $10,938.50 less than for males. In dollar terms that is a little better than 1994, but it isn’t enough to make up for the hard work women put in.
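Because salaries in 2008 are higher across the board, it can help to express each gap as a share of the male average (the intercept) for that year. The numbers below are copied from the two simple regressions above; only the variable names are new.

```r
# Coefficients taken from the lm() outputs above
gap_1994 <- 12776.6    # |female01| coefficient, opm94
male_1994 <- 46999.4   # intercept (male average), opm94
gap_2008 <- 10938.5    # |female01| coefficient, opm2008
male_2008 <- 74840.5   # intercept (male average), opm2008

# Gap as a percentage of the male average in each year
pct <- round(100 * c(`1994` = gap_1994 / male_1994,
                     `2008` = gap_2008 / male_2008), 1)
pct
# 1994 2008
# 27.2 14.6
```

Measured this way, the raw gap shrank from roughly 27% of the male average to roughly 15%, a larger improvement than the dollar figures alone suggest.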
Calculate the multiple regression for ‘salary’ with ‘female01’, ‘grade’, ‘yos’, and ‘edyrs’:
lm(salary ~ female01 + grade + yos + edyrs, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ female01 + grade + yos + edyrs, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31034 -8124 -940 6604 114201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30919.90 813.94 -37.988 < 2e-16 ***
## female01 -988.93 251.05 -3.939 8.24e-05 ***
## grade 7485.34 48.05 155.794 < 2e-16 ***
## yos 381.95 12.04 31.726 < 2e-16 ***
## edyrs 1334.25 59.15 22.559 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11660 on 9055 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.842, Adjusted R-squared: 0.8419
## F-statistic: 1.206e+04 on 4 and 9055 DF, p-value: < 2.2e-16
In this 2008 sample, women still earn $989 less than their male counterparts with the same grade, experience, and education.
Start with the average salary for each race:
opm2008 %>% group_by(race) %>% summarise(Mean_Salary = mean(salary, na.rm = TRUE))
## # A tibble: 5 x 2
## race Mean_Salary
## <fct> <dbl>
## 1 Black 62048.
## 2 White 71930.
## 3 Latino 63861.
## 4 Asian 74138.
## 5 American Indian 53793.
It looks like Asians now have a higher average salary than Whites; however, the gaps between Whites and the other minority groups did not close. Native Americans earn about $18,000 less than Whites. Ridiculous.
Calculate the regression between ‘minority’ and ‘salary’:
lm(salary ~ minority, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ minority, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50239 -23433 -5589 18288 151119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71930.3 377.4 190.59 <2e-16 ***
## minority -8476.9 640.6 -13.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29040 on 9064 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.01895, Adjusted R-squared: 0.01885
## F-statistic: 175.1 on 1 and 9064 DF, p-value: < 2.2e-16
From this regression, we can conclude that the average salary of minorities is $8,477 less than that of nonminorities.
Let’s see this visually with a boxplot:
ggplot(data = opm2008, mapping = aes(x = factor(minority), y = salary, col = factor(minority))) + geom_boxplot()
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).
This boxplot shows that while there are a few outliers that earn more than nonminorities, the majority of minorities earn less.
ggplot(data = opm2008, mapping = aes(x = factor(race), y = salary, col = factor(race))) + geom_boxplot()
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).
Calculate the multiple regression of ‘salary’ on ‘edyrs’, ‘grade’, and ‘race’:
lm(salary ~ edyrs + grade + race, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ edyrs + grade + race, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32514 -8394 -620 6974 109147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22804.82 783.48 -29.107 < 2e-16 ***
## edyrs 838.63 60.36 13.894 < 2e-16 ***
## grade 8013.44 47.69 168.043 < 2e-16 ***
## raceWhite -615.12 328.19 -1.874 0.06093 .
## raceLatino -1873.92 600.66 -3.120 0.00182 **
## raceAsian 107.73 649.22 0.166 0.86822
## raceAmerican Indian -802.19 836.12 -0.959 0.33737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12290 on 9059 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8244, Adjusted R-squared: 0.8243
## F-statistic: 7090 on 6 and 9059 DF, p-value: < 2.2e-16
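Each ‘race’ coefficient compares a group with Black employees, the omitted baseline category, at the same education and grade. Latino employees earn $1,874 less than the baseline (p = 0.002). The estimates for White (-$615), Asian (+$108), and American Indian (-$802) employees are not statistically significant at the 5% level, so the only clear difference from the baseline in this model is for Latino employees.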
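In the output above there is no `raceBlack` row, because `lm()` uses the first level of a factor as the baseline and reports every other level relative to it. To read the coefficients as differences from White employees instead, the factor can be releveled first. A minimal sketch on simulated data (the salaries and the `race_w` name are hypothetical):

```r
# Simulated data only; not the opm2008 values.
set.seed(4)
race <- factor(sample(c("Black", "White", "Latino", "Asian"), 400, replace = TRUE),
               levels = c("Black", "White", "Latino", "Asian"))
salary <- 60000 + 5000 * (race == "White") + rnorm(400, sd = 8000)

# Default: the first level ("Black") is the baseline and gets no coefficient
fit_black <- lm(salary ~ race)
names(coef(fit_black))

# relevel() moves "White" to the front, making it the new baseline
race_w <- relevel(race, ref = "White")
fit_white <- lm(salary ~ race_w)
names(coef(fit_white))

# Same comparison, opposite sign: White vs Black and Black vs White
white_vs_black <- unname(coef(fit_black)["raceWhite"])
black_vs_white <- unname(coef(fit_white)["race_wBlack"])
all.equal(white_vs_black, -black_vs_white)
```

Releveling changes only which comparisons are printed; the fitted values and R-squared stay the same.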
Now calculate the correlation matrix with ‘salary’, ‘female01’, ‘minority’, and ‘promo01’:
opm2008 %>% select(salary, female01, minority, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(2)
## salary female01 minority promo01
## salary 1.00 -0.19 -0.14 -0.22
## female01 -0.19 1.00 0.15 0.04
## minority -0.14 0.15 1.00 0.01
## promo01 -0.22 0.04 0.01 1.00
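In this sample, females and minorities tend to have a lower wage (correlations of -0.19 and -0.14 with ‘salary’). Their correlations with ‘promo01’ are close to zero (0.04 and 0.01), so the matrix shows essentially no relationship between sex or minority status and promotion.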
We have seen that gender and race are related to salary in these samples. Can we generalize from the sample to the population?
Let’s look at the relationship between gender and salary with a regression and its confidence interval:
lm(salary ~ female01, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ female01, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51916 -22631 -5674 18366 139731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74840.5 444.0 168.56 <2e-16 ***
## female01 -10938.5 606.8 -18.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28810 on 9058 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.03463, Adjusted R-squared: 0.03452
## F-statistic: 324.9 on 1 and 9058 DF, p-value: < 2.2e-16
lm(salary ~ female01, data = opm2008) %>% confint()
## 2.5 % 97.5 %
## (Intercept) 73970.19 75710.872
## female01 -12128.09 -9748.986
As we already know, females make about $10,939 less than males in this sample. The coefficient on ‘female01’ is statistically significant (p < 2e-16), with a t-value of -18.02, so we conclude that there is a negative relationship between being female and salary in the population. We are 95% confident that, in the population, the average salary of females is between $9,749 and $12,128 less than that of their male counterparts.
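The interval reported by `confint()` can be rebuilt by hand from the estimate, its standard error, and the residual degrees of freedom in the summary table. The inputs below are copied from that output; small differences in the last digits come from the rounded standard error.

```r
# Values copied from the summary() output above
est <- -10938.5   # coefficient on female01
se <- 606.8       # its standard error (rounded in the printout)
df <- 9058        # residual degrees of freedom

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 1)
```

With this many degrees of freedom the critical value is very close to 1.96, so the interval is roughly the estimate plus or minus two standard errors.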
Let’s add qualifications:
lm(salary ~ female01 + yos + edyrs + grade, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ female01 + yos + edyrs + grade, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31034 -8124 -940 6604 114201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30919.90 813.94 -37.988 < 2e-16 ***
## female01 -988.93 251.05 -3.939 8.24e-05 ***
## yos 381.95 12.04 31.726 < 2e-16 ***
## edyrs 1334.25 59.15 22.559 < 2e-16 ***
## grade 7485.34 48.05 155.794 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11660 on 9055 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.842, Adjusted R-squared: 0.8419
## F-statistic: 1.206e+04 on 4 and 9055 DF, p-value: < 2.2e-16
lm(salary ~ female01 + yos + edyrs + grade, data = opm2008) %>% confint()
## 2.5 % 97.5 %
## (Intercept) -32515.3998 -29324.4034
## female01 -1481.0436 -496.8178
## yos 358.3488 405.5475
## edyrs 1218.3074 1450.1834
## grade 7391.1620 7579.5260
The salary difference has gotten smaller, but there’s still a wage gap, and it remains statistically significant. We are 95% confident that, in the population, females earn between $497 and $1,481 less than males with the same experience, education, and grade.
Let’s regress ‘salary’ on ‘minority’:
lm(salary ~ minority, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ minority, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50239 -23433 -5589 18288 151119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71930.3 377.4 190.59 <2e-16 ***
## minority -8476.9 640.6 -13.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29040 on 9064 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.01895, Adjusted R-squared: 0.01885
## F-statistic: 175.1 on 1 and 9064 DF, p-value: < 2.2e-16
lm(salary ~ minority, data = opm2008) %>% confint()
## 2.5 % 97.5 %
## (Intercept) 71190.499 72670.099
## minority -9732.581 -7221.251
We know minorities earn on average $8,477 less than nonminorities in this sample, so minority status has a negative relationship with salary. Since the coefficient is statistically significant, we are 95% confident that, in the population, minorities earn between $7,221 and $9,733 less than nonminorities.
Add the qualifications:
lm(salary ~ minority + yos + edyrs + grade, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ minority + yos + edyrs + grade, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31736 -8128 -917 6631 114283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32124.26 799.44 -40.184 <2e-16 ***
## minority 340.38 260.92 1.305 0.192
## yos 380.20 12.03 31.593 <2e-16 ***
## edyrs 1354.16 59.16 22.888 <2e-16 ***
## grade 7514.54 47.87 156.987 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11670 on 9061 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8417, Adjusted R-squared: 0.8416
## F-statistic: 1.204e+04 on 4 and 9061 DF, p-value: < 2.2e-16
lm(salary ~ minority + yos + edyrs + grade, data = opm2008) %>% confint()
## 2.5 % 97.5 %
## (Intercept) -33691.3398 -30557.1813
## minority -171.0779 851.8315
## yos 356.6078 403.7879
## edyrs 1238.1839 1470.1373
## grade 7420.7098 7608.3716
From this regression, we can conclude that qualifications account for most of the raw minority gap. With the same experience, education, and grade, the 95% confidence interval runs from minorities earning $171 less to $852 more than nonminorities. Because the interval includes zero (p = 0.192), the coefficient is not statistically significant, and we cannot conclude that there is any difference in the population once qualifications are held equal.
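A quick way to check significance for the ‘minority’ coefficient is to recompute its t-value and two-sided p-value from the summary table and see whether zero falls inside the 95% interval. The inputs are copied from the output above; the variable names are only for this sketch.

```r
# Values copied from the summary() output above
est <- 340.38   # coefficient on minority
se <- 260.92    # its standard error
df <- 9061      # residual degrees of freedom

t_value <- est / se
p_value <- 2 * pt(-abs(t_value), df)   # two-sided p-value
ci <- est + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_value, p = p_value), 3)

# TRUE means zero lies inside the interval, i.e. not significant at the 5% level
zero_inside <- ci[1] < 0 && ci[2] > 0
zero_inside
```

A p-value above 0.05 and a 95% interval that straddles zero are two views of the same result.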
According to both of these samples, females and minorities earn less on average than their counterparts. For females, the salary gap narrowed from 1994 to 2008 but is still considerable. Even with the same experience and qualifications, their pay is still lower. This suggests an apparent lack of respect for women in this country. With half of our citizens unable to earn what they deserve, America needs to work harder for its female citizens.
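For minorities, the 1994 sample shows a lower average salary than for nonminorities. The raw gap was still there in 2008, but Asian employees earned more than Whites on average, and once qualifications are held equal the difference between minorities and nonminorities is not statistically significant. As America progresses, the wage difference between minorities and nonminorities begins to close. This country is finally recognizing the importance and the impact that minorities have on this nation.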