Load Libraries

library(dplyr)
library(ggplot2)
library(GGally)

Load Data

load(file = "Datasets/OPM94.RData")
load(file = "Datasets/OPM2008.RData")

EXECUTIVE SUMMARY

Throughout history, women have had fewer rights than men, and in modern society there is ongoing debate over whether politics and the economy are affected by sexism. In terms of wages, there are many claims of a gap between what men and women are paid. Statistics on the gender wage gap have become so prominent that former United States president Barack Obama discussed the matter publicly. Society is working to close the wage gap, so it is important that its patterns are recognized and addressed. This report explores the gender pay gap, defined here as the difference between the average pay of women and men. It also identifies variables that contribute to the gap, including years of education, grade, and years of service. The disparity in pay is statistically significant in both samples, which suggests that it extends to the wider population of federal employees. The data that follow provide substantial evidence that the gender pay gap is a serious issue.

Research Question

Do women receive lower pay for equal work irrespective of the level of qualification?

Data

To answer this question, we will use a random sample of federal employees.

load(file = "Datasets/OPM94.RData")
names(opm94)
##  [1] "x"        "sal"      "grade"    "patco"    "major"    "age"     
##  [7] "male"     "vet"      "handvet"  "hand"     "yos"      "edyrs"   
## [13] "promo"    "exit"     "supmgr"   "race"     "minority" "grade4"  
## [19] "promo01"  "supmgr01" "male01"   "exit01"   "vet01"

Exploratory Analysis

Two datasets are provided: the 2008 dataset has 9,074 observations and 22 variables, and the 1994 dataset has 1,000 observations and the 23 variables listed above. The variables include, but are not limited to, salary, grade, years of education (edyrs), race, sex, years of service (yos), veteran status, and age.

opm94 %>% group_by(male) %>% summarise(Mean_Salary = mean(sal, na.rm = TRUE))
## # A tibble: 2 x 2
##   male   Mean_Salary
##   <fct>        <dbl>
## 1 female      34223.
## 2 male        46999.
opm2008 %>% group_by(male) %>% summarise(Mean_Salary = mean(salary, na.rm = TRUE))
## Warning: Factor `male` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 2
##   male   Mean_Salary
##   <fct>        <dbl>
## 1 Female      63902.
## 2 Male        74841.
## 3 <NA>        73327.
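The 2008 summary contains a third group for rows where gender is missing. A minimal sketch of one way to set those rows aside and compute the raw gap directly is shown here. It uses base R rather than dplyr, and the data frame `toy` and its values are invented for illustration; they are not taken from opm2008.

```r
# Toy data standing in for opm2008; all values are invented for illustration.
toy <- data.frame(
  male   = factor(c("Female", "Male", "Female", "Male", NA)),
  salary = c(60000, 75000, 64000, 71000, 73000)
)

# Keep only rows where gender is recorded, then average salary by gender.
known <- toy[!is.na(toy$male), ]
means <- tapply(known$salary, known$male, mean, na.rm = TRUE)

# Raw (unadjusted) gap: male mean minus female mean.
raw_gap <- unname(means["Male"] - means["Female"])

print(means)
print(raw_gap)
```

Dropping the rows explicitly makes it visible how many observations are excluded, rather than leaving them as an unlabeled third group.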

Bivariate Regression

The results of fitting a bivariate model with salary as the outcome and gender as a predictor for opm94 are shown below:

lm(sal ~ male01, data = opm94) %>% summary()
## 
## Call:
## lm(formula = sal ~ male01, data = opm94)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31945 -11537  -3092   9591  71883 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34222.8      749.9   45.64   <2e-16 ***
## male01       12776.6     1046.3   12.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16500 on 993 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.1297 
## F-statistic: 149.1 on 1 and 993 DF,  p-value: < 2.2e-16
opm94 <- opm94 %>% mutate(female01 = if_else(male01 == 0, 1, 0 ))
lm(sal ~ female01, data = opm94) %>% summary()
## 
## Call:
## lm(formula = sal ~ female01, data = opm94)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31945 -11537  -3092   9591  71883 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46999.4      729.8   64.40   <2e-16 ***
## female01    -12776.6     1046.3  -12.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16500 on 993 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.1305, Adjusted R-squared:  0.1297 
## F-statistic: 149.1 on 1 and 993 DF,  p-value: < 2.2e-16
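The two opm94 fits are the same model written two ways: the slope is the difference in group means, and switching the indicator flips its sign and moves the intercept to the other group's mean (here 34,222.8 + 12,776.6 = 46,999.4). The sketch below illustrates this on a small invented data frame; `toy` and its salaries are assumptions for illustration only.

```r
# Invented toy salaries; male01 is 1 for men and 0 for women.
toy <- data.frame(
  sal    = c(30000, 34000, 38000, 44000, 46000, 51000),
  male01 = c(0, 0, 0, 1, 1, 1)
)
toy$female01 <- 1 - toy$male01

fit_male   <- lm(sal ~ male01, data = toy)
fit_female <- lm(sal ~ female01, data = toy)

# Difference in group means, computed directly.
gap <- mean(toy$sal[toy$male01 == 1]) - mean(toy$sal[toy$male01 == 0])

# The slope equals the gap; flipping the indicator only flips the sign.
print(coef(fit_male))
print(coef(fit_female))
print(gap)
```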

The results of fitting a bivariate model with salary as the outcome and gender as a predictor for opm2008 are shown below:

lm(salary ~ male, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ male, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51916 -22631  -5674  18366 139731 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63902.0      413.7  154.47   <2e-16 ***
## maleMale     10938.5      606.8   18.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28810 on 9058 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.03463,    Adjusted R-squared:  0.03452 
## F-statistic: 324.9 on 1 and 9058 DF,  p-value: < 2.2e-16
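# Caution: in opm2008, `male` is a factor (Female/Male), so `male == 0` is never TRUE.
# `female` is therefore 0 for every non-missing row, which is why its coefficient is NA below.
opm2008 <- opm2008 %>% mutate(female = if_else(male == 0, 1, 0 ))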
lm(salary ~ female, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ female, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49339 -23945  -5550  18553 145587 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    68985        308     224   <2e-16 ***
## female            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29320 on 9059 degrees of freedom
##   (14 observations deleted due to missingness)
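The `female` coefficient above is reported as NA because the indicator has no variation. One possible way to build a usable indicator is to compare against the factor label instead of 0. The sketch below uses base `ifelse()` in place of dplyr's `if_else()`, and the data frame `toy` is invented for illustration; it assumes the labels are exactly "Female" and "Male", as the opm2008 output suggests.

```r
# Toy data standing in for opm2008; all values are invented for illustration.
toy <- data.frame(
  male   = factor(c("Female", "Male", "Male", "Female", NA)),
  salary = c(62000, 76000, 74000, 66000, 70000)
)

# Comparing a factor with 0 is never TRUE, so this indicator is 0
# everywhere (and NA where gender is missing).
bad_female <- ifelse(toy$male == 0, 1, 0)

# Comparing with the label gives a 0/1 indicator that actually varies.
toy$female <- ifelse(toy$male == "Female", 1, 0)

# Rows with a missing indicator are dropped by lm() by default.
fit <- lm(salary ~ female, data = toy)

print(bad_female)
print(coef(fit))
```

With a correctly coded indicator, the opm2008 fit would mirror the `salary ~ male` model: the same magnitude of slope with the opposite sign, and an intercept equal to the male mean.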

Many predictors of salary may also be strongly correlated with gender, the predictor variable in the models above.

Below are a few plots for opm94 showing the relationship between salary and grade, education years, and years of service.

ggplot(data=opm94) + geom_point(mapping = aes(x=grade, y = sal))
## Warning: Removed 5 rows containing missing values (geom_point).

ggplot(data=opm94) + geom_point(mapping = aes(x=yos, y = sal))
## Warning: Removed 5 rows containing missing values (geom_point).

ggplot(data=opm94) + geom_point(mapping = aes(x=edyrs, y = sal))
## Warning: Removed 5 rows containing missing values (geom_point).

Below are a few plots for opm2008 showing the relationship between salary and grade, education years, and years of service.

ggplot(data=opm2008) + geom_point(mapping = aes(x=grade, y = salary))
## Warning: Removed 8 rows containing missing values (geom_point).

ggplot(data=opm2008) + geom_point(mapping = aes(x=yos, y = salary))
## Warning: Removed 8 rows containing missing values (geom_point).

ggplot(data=opm2008) + geom_point(mapping = aes(x=edyrs, y = salary))
## Warning: Removed 8 rows containing missing values (geom_point).

The variables with the most influence on salary for women in opm94 and opm2008 are ranked as follows:

1. Grade
2. Education Years
3. Years of Service
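One way such a ranking could be checked numerically is to compare the correlation of each predictor with salary. The sketch below is an assumption about how that might be done, not the method used in this report. It runs on simulated data: `toy`, the sample size, and the coefficients used to generate `sal` are all invented.

```r
# Simulated data; the effect sizes below are invented for illustration.
set.seed(1)
n     <- 200
grade <- sample(1:15, n, replace = TRUE)
edyrs <- sample(12:20, n, replace = TRUE)
yos   <- sample(0:30, n, replace = TRUE)
sal   <- 10000 + 3000 * grade + 800 * edyrs + 200 * yos + rnorm(n, sd = 4000)
toy   <- data.frame(sal, grade, edyrs, yos)

# Pearson correlation of each predictor with salary, ignoring incomplete rows.
cors <- sapply(toy[c("grade", "edyrs", "yos")],
               function(x) cor(x, toy$sal, use = "complete.obs"))

# Sort by absolute correlation, strongest first.
ranking <- sort(abs(cors), decreasing = TRUE)
print(ranking)
```

Applied to the real data, the same computation could be restricted to women by subsetting the rows first.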

RESULTS

The resulting model has three main predictors: grade, education years, and years of service. According to these statistics, men receive a much greater salary than women. Comparing the mean salaries of the two genders, men made about $12,777 more on average in 1994 and about $10,939 more in 2008. While the gender wage gap has gotten smaller, there is still a statistically significant disparity between the salaries of men and women. The conclusion could be extended to a larger population, as the statistics are consistent with claims that have been made about the wage gap. In other words, these data provide substantial evidence that men receive a higher salary than women.
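The research question asks about pay for equal work, which means comparing men and women with the same grade, education, and years of service. A hedged sketch of how an adjusted gap could be estimated next to the raw one follows. The data are simulated, and every number used to generate them (including the built-in gender effect of 2000) is an assumption for illustration, not an estimate from opm94 or opm2008.

```r
# Simulated data in which men tend to hold higher grades and also
# receive an invented direct premium of 2000.
set.seed(42)
n      <- 500
male01 <- rbinom(n, 1, 0.5)
grade  <- pmin(15, pmax(1, round(7 + 2 * male01 + rnorm(n, sd = 3))))
edyrs  <- sample(12:20, n, replace = TRUE)
yos    <- sample(0:30, n, replace = TRUE)
sal    <- 8000 + 3000 * grade + 500 * edyrs + 150 * yos +
          2000 * male01 + rnorm(n, sd = 3000)
toy    <- data.frame(sal, male01, grade, edyrs, yos)

# Raw gap: gender only. Adjusted gap: gender plus the three controls.
raw      <- lm(sal ~ male01, data = toy)
adjusted <- lm(sal ~ male01 + grade + edyrs + yos, data = toy)

gaps <- c(raw = coef(raw)[["male01"]], adjusted = coef(adjusted)[["male01"]])
print(gaps)
```

In this simulation the adjusted coefficient is smaller than the raw one because part of the raw gap operates through grade. Whether the same holds in the federal data would need to be checked by fitting the model to opm94 and opm2008.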