Executive Summary

According to the ‘opm94’ data, female federal employees earned significantly lower salaries than their male counterparts in 1994. The regression models in this analysis show that qualifications and experience account for most of that gap, but a statistically significant difference remains even after controlling for them. The ‘opm2008’ data shows that the gender wage gap narrowed over time but still exists. The gender estimates are statistically significant in both samples, so we infer that the gap is also present in the population of federal employees.

In 1994, the average salary for minorities was lower than for nonminorities. In 2008 the raw gap was still present, although one group (Asian employees) had a higher average salary than White employees. Once grade, experience, and education are controlled for, the difference between minorities and nonminorities is not statistically significant in either year.

Research Questions

The idea of the American dream conveys freedom: the freedom to choose from the plethora of opportunities that America offers. As the country grows, the population becomes more diverse. It’s a melting pot of people of various ethnicities. Yet that freedom does not apply equally to everyone. It is widely reported that women are paid less than men, and even less if they belong to a minority. This is an issue that needs to be tackled. According to the US census, 50% of US citizens are women and 34% of US citizens are minorities. That is millions of people who may be getting unfair pay because of their gender or ethnicity.

Research Question

Do race and gender determine how much you get paid? And has that changed since 1994?

Variables

The variables used in this project come from both ‘opm94’ and ‘opm2008’. We compare how ‘race’, ‘minority’, ‘male’, and ‘female01’ relate to salary (‘sal’ in ‘opm94’, ‘salary’ in ‘opm2008’) in the two years.

Other variables that influence the outcome:

‘edyrs’ - Education (years)
‘yos’ - Federal experience (years)
‘grade’ - Grade level (1-16)

Data

The data come from United States federal employee records from 1994 and 2008. The Office of Personnel Management collects data on wages and promotions for 1.9 million federal workers in its Central Personnel Data File database.

Exploratory Data Analysis

Install and load packages

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("summarytools")
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
## system might not have X11 capabilities; in case of errors when using dfSummary(), set st_options(use.x11 = FALSE)
## For best results, restart R session and update pander using devtools:: or remotes::install_github('rapporter/pander')
library("knitr")
library("ggplot2")
load("/cloud/project/Project data.RData")

Starting with the Data in 1994

We’re going to analyze the wage difference by gender

First we’re going to calculate the regression for males and females:

lm(sal ~ female01, data = opm94) %>% summary()
## 
## Call:
## lm(formula = sal ~ female01, data = opm94)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31945 -11537  -3092   9591  71883 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46999.4      729.8   64.40   <2e-16 ***
## female01    -12776.6     1046.3  -12.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16500 on 993 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.1297 
## F-statistic: 149.1 on 1 and 993 DF,  p-value: < 2.2e-16

For reference, we’re going to be constructing a boxplot:

ggplot(data = opm94, mapping = aes(x = factor(male), y = sal, col = factor(male))) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

According to our 1994 sample, the average salary for females was $12,777 lower than for their male counterparts. Men may have more education, more experience, or higher grades than women, which could explain part of their higher wages.

We can check that with a multiple regression:

lm(sal ~ female01 + grade + yos + edyrs, data = opm94) %>% summary()
## 
## Call:
## lm(formula = sal ~ female01 + grade + yos + edyrs, data = opm94)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14363  -4613   -560   3311  45093 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14255.58    1601.92  -8.899  < 2e-16 ***
## female01     -1442.57     460.41  -3.133  0.00178 ** 
## grade         4126.46      88.16  46.808  < 2e-16 ***
## yos            313.91      26.08  12.038  < 2e-16 ***
## edyrs          796.22     123.10   6.468 1.56e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6742 on 990 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8552, Adjusted R-squared:  0.8546 
## F-statistic:  1462 on 4 and 990 DF,  p-value: < 2.2e-16

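According to this sample, females earn on average $1,443 less than males with the same grade, years of service, and education. The gap is much smaller than the raw difference, but it is still statistically significant (p = 0.00178).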
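To put the two 1994 coefficients side by side, the following sketch works out how much of the raw gap disappears once the controls are added. The two numbers are copied from the ‘female01’ rows of the regression outputs above; the variable names are only illustrative.

```r
# Coefficients on female01 copied from the two 1994 regression outputs.
raw_gap      <- 12776.6   # sal ~ female01
adjusted_gap <- 1442.57   # sal ~ female01 + grade + yos + edyrs

# Share of the raw gap associated with grade, experience, and education.
explained_share <- 1 - adjusted_gap / raw_gap
round(100 * explained_share, 1)
```

This is simple arithmetic on the reported estimates, not a formal decomposition, so it should be read as a rough summary only.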

Now let’s look at salary by race

First, calculate the mean salary for each value of ‘race’:

opm94 %>% group_by(race) %>% summarise(Mean_Salary = mean(sal, na.rm = TRUE))
## # A tibble: 5 x 2
##   race            Mean_Salary
##   <fct>                 <dbl>
## 1 American Indian      32846.
## 2 Asian                38440.
## 3 Black                32713.
## 4 Hispanic             36500.
## 5 White                43294.

This gives us an idea of the wage inequality between different races. There are too many categories to compare at once, so we’re going to simplify them to minorities and nonminorities.
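The datasets already contain a ‘minority’ column, so nothing needs to be computed here. For readers who want to see what such an indicator looks like, this is a minimal sketch on a made-up data frame, assuming that "minority" means every race other than White (an assumption that is consistent with the regression intercept matching the White mean salary).

```r
# Hypothetical toy data; not taken from opm94.
toy <- data.frame(
  race = factor(c("White", "Black", "Asian", "Hispanic", "American Indian", "White"))
)

# Assumed definition: 1 for any race other than White, 0 for White.
toy$minority <- as.integer(toy$race != "White")

# Cross-tabulate to confirm that only White maps to 0.
table(toy$race, toy$minority)
```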

Now calculate the regression between ‘minority’ and ‘sal’:

lm(sal ~ minority, data = opm94) %>% summary()
## 
## Call:
## lm(formula = sal ~ minority, data = opm94)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28240 -13169  -2282  10818  78126 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    43294        639   67.75  < 2e-16 ***
## minority       -9250       1227   -7.54 1.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17210 on 993 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.05415,    Adjusted R-squared:  0.0532 
## F-statistic: 56.85 on 1 and 993 DF,  p-value: 1.058e-13

And now the box plot:

ggplot(data = opm94, mapping = aes(x = factor(minority), y = sal, col = factor(minority))) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

ggplot(data = opm94, mapping = aes(x = factor(race), y = sal, col = factor(race))) + geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).

According to the data, minorities make $9,250 less than nonminorities on average. Differences in qualifications may be a factor in this gap.

Calculate a multiple regression with ‘sal’, ‘minority’, ‘yos’, ‘grade’ and ‘edyrs’:

lm(sal ~ minority + grade + yos + edyrs, data = opm94) %>% summary()
## 
## Call:
## lm(formula = sal ~ minority + grade + yos + edyrs, data = opm94)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15344  -4565   -504   3227  45231 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15860.88    1526.44 -10.391  < 2e-16 ***
## minority      -440.38     497.64  -0.885    0.376    
## grade         4172.61      87.52  47.677  < 2e-16 ***
## yos            312.03      26.25  11.885  < 2e-16 ***
## edyrs          838.50     122.85   6.826 1.52e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6773 on 990 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8539, Adjusted R-squared:  0.8533 
## F-statistic:  1447 on 4 and 990 DF,  p-value: < 2.2e-16

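After controlling for qualifications, minorities make on average $440 less than nonminorities, but this difference is not statistically significant (p = 0.376). Most of the raw gap is associated with differences in grade, experience, and education.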

Make a correlation matrix with the variables ‘sal’, ‘female01’, ‘minority’, and ‘promo01’:

opm94 %>% select(sal, female01, minority, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(2)
##            sal female01 minority promo01
## sal       1.00    -0.36    -0.23   -0.15
## female01 -0.36     1.00     0.12    0.07
## minority -0.23     0.12     1.00    0.04
## promo01  -0.15     0.07     0.04    1.00

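In this sample, being female and being a minority are both negatively correlated with salary (-0.36 and -0.23). Their correlations with ‘promo01’ are small and slightly positive (0.07 and 0.04), so, assuming ‘promo01’ equals 1 for promoted employees, the matrix does not show that these groups were promoted less often. The 1994 salary data do show a clear disadvantage for women and minorities. Times were a little different back then. Maybe America has changed since then.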
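When both variables are 0/1 indicators, the sign of their correlation says which group has the higher rate. The sketch below shows this on a made-up data frame; the values are hypothetical, and it assumes that ‘promo01’ equals 1 for a promoted employee.

```r
# Hypothetical toy data: 4 males (female01 = 0) and 4 females (female01 = 1).
toy <- data.frame(
  female01 = c(0, 0, 0, 0, 1, 1, 1, 1),
  promo01  = c(0, 0, 0, 1, 0, 0, 1, 1)
)

# Promotion rate within each group (mean of a 0/1 variable is a proportion).
rates <- tapply(toy$promo01, toy$female01, mean)

# Correlation between the two indicators.
r <- cor(toy$female01, toy$promo01)

rates
r
```

Here the group coded 1 has the higher promotion rate and the correlation is positive; a negative correlation would mean the opposite.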

Let’s observe the data in 2008

Starting with the wage difference by gender

Calculate the regression between men and women:

lm(salary ~ female01, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ female01, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51916 -22631  -5674  18366 139731 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  74840.5      444.0  168.56   <2e-16 ***
## female01    -10938.5      606.8  -18.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28810 on 9058 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.03463,    Adjusted R-squared:  0.03452 
## F-statistic: 324.9 on 1 and 9058 DF,  p-value: < 2.2e-16

Now the boxplot:

ggplot(data = opm2008, mapping = aes(x = factor(male), y = salary, col = factor(male))) + geom_boxplot()
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).

In this sample, the average salary for females is $10,938.50 lower than for males. In dollar terms the gap is slightly smaller than in 1994, but it remains substantial.
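Because salaries rose between 1994 and 2008, a dollar gap is easier to compare across years when it is expressed as a share of the male average. The sketch below does that with the intercepts (male mean) and ‘female01’ coefficients copied from the two simple regressions above; the variable names are illustrative.

```r
# Male mean salary (regression intercept) and raw female gap, by year.
male_mean  <- c(`1994` = 46999.4, `2008` = 74840.5)
female_gap <- c(`1994` = 12776.6, `2008` = 10938.5)

# Gap as a percentage of the male mean salary.
pct_gap <- round(100 * female_gap / male_mean, 1)
pct_gap
```

Both years are in nominal dollars, so the percentages are only a rough way of putting the two gaps on a common scale.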

Calculate the multiple regression for ‘salary’ with ‘female01’, ‘grade’, ‘yos’, and ‘edyrs’:

lm(salary ~ female01 + grade + yos + edyrs, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ female01 + grade + yos + edyrs, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31034  -8124   -940   6604 114201 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -30919.90     813.94 -37.988  < 2e-16 ***
## female01      -988.93     251.05  -3.939 8.24e-05 ***
## grade         7485.34      48.05 155.794  < 2e-16 ***
## yos            381.95      12.04  31.726  < 2e-16 ***
## edyrs         1334.25      59.15  22.559  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11660 on 9055 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.842,  Adjusted R-squared:  0.8419 
## F-statistic: 1.206e+04 on 4 and 9055 DF,  p-value: < 2.2e-16

In this 2008 sample, women still earn $989 less than their male counterparts with the same grade, experience, and education (p < 0.001).
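One way to test directly whether the gap changed between the two years would be to stack both samples and add a year interaction. This is not done in this report; the sketch below only illustrates the idea on made-up numbers. The data frames `toy94` and `toy08` and the indicator `year2008` are hypothetical; only the column names ‘sal’, ‘salary’, and ‘female01’ come from the real datasets.

```r
# Hypothetical toy data mimicking the two column layouts.
toy94 <- data.frame(sal    = c(90, 110, 70, 90),    female01 = c(0, 0, 1, 1))
toy08 <- data.frame(salary = c(140, 160, 130, 150), female01 = c(0, 0, 1, 1))

# Give both years the same salary column name and add a year indicator.
names(toy94)[names(toy94) == "sal"] <- "salary"
toy94$year2008 <- 0
toy08$year2008 <- 1
pooled <- rbind(toy94, toy08)

# The interaction coefficient is the change in the female gap from 1994 to 2008.
fit <- lm(salary ~ female01 * year2008, data = pooled)
gap_change <- coef(fit)[["female01:year2008"]]
gap_change
```

In the toy data the gap is -20 in the first year and -10 in the second, so the interaction is +10. With the real samples the salaries would still be in nominal dollars for each year, which limits how far such a comparison can be pushed.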

Now let’s calculate minority wage

Start with the average salary for each race:

opm2008 %>% group_by(race) %>% summarise(Mean_Salary = mean(salary, na.rm = TRUE))
## # A tibble: 5 x 2
##   race            Mean_Salary
##   <fct>                 <dbl>
## 1 Black                62048.
## 2 White                71930.
## 3 Latino               63861.
## 4 Asian                74138.
## 5 American Indian      53793.

Asian employees now have a higher average salary than White employees. However, the gaps between White employees and the other minority groups persist, and American Indian employees earn about $18,000 less than White employees on average.

Calculate the regression between ‘minority’ and ‘salary’:

lm(salary ~ minority, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ minority, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50239 -23433  -5589  18288 151119 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71930.3      377.4  190.59   <2e-16 ***
## minority     -8476.9      640.6  -13.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29040 on 9064 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.01895,    Adjusted R-squared:  0.01885 
## F-statistic: 175.1 on 1 and 9064 DF,  p-value: < 2.2e-16

From this regression, we can conclude that the average salary of minorities in this sample is $8,477 less than that of nonminorities.

Let’s see this visually with a boxplot:

ggplot(data = opm2008, mapping = aes(x = factor(minority), y = salary, col = factor(minority))) + geom_boxplot()
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).

This boxplot shows that although some minority employees are high-earning outliers, the majority of minorities earn less than nonminorities.

ggplot(data = opm2008, mapping = aes(x = factor(race), y = salary, col = factor(race))) + geom_boxplot()
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).

Calculate the Multiple Regression:

lm(salary ~ edyrs + grade + race, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ edyrs + grade + race, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32514  -8394   -620   6974 109147 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -22804.82     783.48 -29.107  < 2e-16 ***
## edyrs                  838.63      60.36  13.894  < 2e-16 ***
## grade                 8013.44      47.69 168.043  < 2e-16 ***
## raceWhite             -615.12     328.19  -1.874  0.06093 .  
## raceLatino           -1873.92     600.66  -3.120  0.00182 ** 
## raceAsian              107.73     649.22   0.166  0.86822    
## raceAmerican Indian   -802.19     836.12  -0.959  0.33737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12290 on 9059 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8244, Adjusted R-squared:  0.8243 
## F-statistic:  7090 on 6 and 9059 DF,  p-value: < 2.2e-16

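The race coefficients are measured relative to Black employees, the reference level of ‘race’. With the same education and grade, Latino employees earn about $1,874 less than Black employees, which is the only difference significant at the 5% level. The White (-$615), Asian (+$108), and American Indian (-$802) coefficients are not significantly different from zero at that level.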
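R compares every level of a factor with its first level, which is why the output above has no `raceBlack` row. To compare each group with White employees instead, the reference level can be changed with `relevel()`. The following is a minimal sketch on a made-up data frame; the values and the column name `race_w` are hypothetical.

```r
# Hypothetical toy data with Black as the first (reference) level.
toy <- data.frame(
  salary = c(50, 52, 60, 62, 55, 57),
  race   = factor(c("Black", "Black", "White", "White", "Asian", "Asian"),
                  levels = c("Black", "White", "Asian"))
)

# Default: coefficients are differences from the Black group mean.
fit_b <- lm(salary ~ race, data = toy)

# Make White the reference level; coefficients are now differences from White.
toy$race_w <- relevel(toy$race, ref = "White")
fit_w <- lm(salary ~ race_w, data = toy)

coef(fit_b)
coef(fit_w)
```

The fitted values are identical in both models; only the group that the coefficients are measured against changes.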

Now calculate the correlation matrix with ‘salary’, ‘female01’, ‘minority’, and ‘promo01’:

opm2008 %>% select(salary, female01, minority, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(2)
##          salary female01 minority promo01
## salary     1.00    -0.19    -0.14   -0.22
## female01  -0.19     1.00     0.15    0.04
## minority  -0.14     0.15     1.00    0.01
## promo01   -0.22     0.04     0.01    1.00

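In this sample, being female and being a minority are still negatively correlated with salary (-0.19 and -0.14), although less strongly than in 1994. Their correlations with ‘promo01’ are close to zero (0.04 and 0.01), so the matrix shows no clear link between gender or minority status and promotion.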

Inference

We have seen that gender and race are associated with salary in these samples. Can the samples tell us about the population of federal employees?

Gender

Let’s look at the relationship between gender and salary calculated with a regression:

lm(salary ~ female01, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ female01, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51916 -22631  -5674  18366 139731 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  74840.5      444.0  168.56   <2e-16 ***
## female01    -10938.5      606.8  -18.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28810 on 9058 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.03463,    Adjusted R-squared:  0.03452 
## F-statistic: 324.9 on 1 and 9058 DF,  p-value: < 2.2e-16
lm(salary ~ female01, data = opm2008) %>% confint()
##                 2.5 %    97.5 %
## (Intercept)  73970.19 75710.872
## female01    -12128.09 -9748.986

As we already know, females make about $10,900 less than males in this sample. The coefficient on ‘female01’ is statistically significant (p < 2e-16, t = -18.02), so there is a negative relationship between being female and salary in the population. We are 95% confident that the average salary of females in the population is $9,749 to $12,128 less than that of their male counterparts.
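The interval reported by `confint()` can be reproduced by hand as the estimate plus or minus a t quantile times the standard error. The sketch below uses the rounded estimate, standard error, and residual degrees of freedom from the output above, so it agrees with `confint()` only up to rounding.

```r
# Values copied from the female01 row of the 2008 simple regression.
est <- -10938.5   # coefficient estimate
se  <- 606.8      # standard error (rounded in the printed output)
df  <- 9058       # residual degrees of freedom

# 95% confidence interval: estimate +/- t quantile * standard error.
ci <- est + c(-1, 1) * qt(0.975, df) * se
ci
```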

Let’s add qualifications:

lm(salary ~ female01 + yos + edyrs + grade, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ female01 + yos + edyrs + grade, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31034  -8124   -940   6604 114201 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -30919.90     813.94 -37.988  < 2e-16 ***
## female01      -988.93     251.05  -3.939 8.24e-05 ***
## yos            381.95      12.04  31.726  < 2e-16 ***
## edyrs         1334.25      59.15  22.559  < 2e-16 ***
## grade         7485.34      48.05 155.794  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11660 on 9055 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.842,  Adjusted R-squared:  0.8419 
## F-statistic: 1.206e+04 on 4 and 9055 DF,  p-value: < 2.2e-16
lm(salary ~ female01 + yos + edyrs + grade, data = opm2008) %>% confint()
##                   2.5 %      97.5 %
## (Intercept) -32515.3998 -29324.4034
## female01     -1481.0436   -496.8178
## yos            358.3488    405.5475
## edyrs         1218.3074   1450.1834
## grade         7391.1620   7579.5260

After adding qualifications, the salary difference is much smaller, but a wage gap remains and it is still statistically significant. We are 95% confident that females in the population earn $497 to $1,481 less than males with the same experience, education, and grade.

Race

Let’s regress ‘salary’ on ‘minority’:

lm(salary ~ minority, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ minority, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50239 -23433  -5589  18288 151119 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71930.3      377.4  190.59   <2e-16 ***
## minority     -8476.9      640.6  -13.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29040 on 9064 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.01895,    Adjusted R-squared:  0.01885 
## F-statistic: 175.1 on 1 and 9064 DF,  p-value: < 2.2e-16
lm(salary ~ minority, data = opm2008) %>% confint()
##                 2.5 %    97.5 %
## (Intercept) 71190.499 72670.099
## minority    -9732.581 -7221.251

We know minorities earn on average $8,477 less than nonminorities in this sample. The relationship between minority status and salary is negative and statistically significant, and we are 95% confident that minorities in the population earn $7,221 to $9,733 less than nonminorities.

Add the qualifications:

lm(salary ~ minority + yos + edyrs + grade, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ minority + yos + edyrs + grade, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31736  -8128   -917   6631 114283 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -32124.26     799.44 -40.184   <2e-16 ***
## minority       340.38     260.92   1.305    0.192    
## yos            380.20      12.03  31.593   <2e-16 ***
## edyrs         1354.16      59.16  22.888   <2e-16 ***
## grade         7514.54      47.87 156.987   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11670 on 9061 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8417, Adjusted R-squared:  0.8416 
## F-statistic: 1.204e+04 on 4 and 9061 DF,  p-value: < 2.2e-16
lm(salary ~ minority + yos + edyrs + grade, data = opm2008) %>% confint()
##                   2.5 %      97.5 %
## (Intercept) -33691.3398 -30557.1813
## minority      -171.0779    851.8315
## yos            356.6078    403.7879
## edyrs         1238.1839   1470.1373
## grade         7420.7098   7608.3716

From this regression, we can conclude that qualifications account for most of the raw salary difference between minorities and nonminorities. With the same experience, education, and grade, the 95% confidence interval runs from $171 less to $852 more than nonminorities. The interval includes zero and the coefficient is not statistically significant (p = 0.192), so we cannot conclude that minority and nonminority salaries differ in the population once qualifications are taken into account.
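The p-value for ‘minority’ can be recomputed from the estimate and standard error, which makes it easier to see why the result is not significant. The sketch below uses the rounded values from the output above; the variable names are illustrative.

```r
# Values copied from the minority row of the 2008 multiple regression.
est <- 340.38
se  <- 260.92
df  <- 9061

# t statistic and two-sided p-value.
t_stat  <- est / se
p_value <- 2 * pt(-abs(t_stat), df)

round(t_stat, 3)
round(p_value, 3)
```

A two-sided p-value above 0.05 and a 95% confidence interval that contains zero are two views of the same result.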

Results

According to both samples, females face a wage gap. From 1994 to 2008 the gap narrowed, but women’s salaries are still considerably lower than men’s. Even with the same experience and qualifications, their pay is lower, and the difference is statistically significant. With women making up about half of the population, America needs to work harder for its female citizens.

For minorities, average salaries were lower than those of nonminorities in both the 1994 and 2008 samples. Once qualifications are controlled for, however, the difference is not statistically significant in either year, and in 2008 the point estimate is slightly positive. Asian employees also had a higher average salary than White employees in 2008. These results suggest that the remaining raw gap is largely tied to differences in grade, experience, and education.