Setup

Load packages

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
library(knitr)         
## Warning: package 'knitr' was built under R version 3.6.2
library(summarytools)
## Warning: package 'summarytools' was built under R version 3.6.2
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
library(descr)
## Warning: package 'descr' was built under R version 3.6.2
## 
## Attaching package: 'descr'
## The following objects are masked from 'package:summarytools':
## 
##     descr, freq
  setwd("C:/Users/ramin/Desktop/2020 winter/Data Analysis/Project")

Load data

load("Datasets/OPM2008.RData")

EXECUTIVE SUMMARY

According to the data opm2008, and the regression of the analysis between gender pay in salary is affected significantly by edyrs and yos. A difference in salary between Male and Female can be observed in this sample – Out of all the people in this data with the same education years (edyrs) and (yos), Males are expected to make $ 6048.69 more than Females. This effect appears to be statistically significant based on a larger population of people, these data provide evidence that women in the federal government are paid less than men.

ANALYSIS

There are 9074 individuals in this analysis and 20 variables: salary, occupation, state, grade, gender, age, edyrs, yos, promo, race, vet, VETmode, exit, major, agency, racesex, supervisor, patco, gradecat, and educ4.

summary(opm2008$male)
## Female   Male   NA's 
##   4853   4215      6

VARIABLES

The outcome variable in this analysis is income salary measured numerically in US Dollars. Since salary is numeric, we can use a linear model. It ranges from $19,646 to $214,572 US Dollars with a mean of $68,988.

The independent variable of interest is gender (male or female). In the dataset, the class of the variable is categorical, male and female are each measured separately to conduct the difference in Salary. There are 53 percent females and 47 percent males with a mean salary of $63,902 and $74,841 respectively.

Other variables that can potentially affect salary are the following:

• Occupation: The kind of (categorical)job the person has. • State: The State where the person is working(categorical) and getting their salary from. • edyrs: Years of education(numerical) for the person who is working. • yos: Years of service (numerical), the amount of years the person has been working. • Race: The race of the individual (categorical). • Major: What the person has studied in University (categorical). • Agency: The agency the person works for (categorical). • Educ4 : The persons education level (categorical).

EXPLORATORY ANALYSIS

Calculating the means in salary of male and females shows that there is a significant difference of the average Males having a higher salary than Females

opm2008 %>% select(male, salary) %>% group_by(male) %>% summarise(mean_salary = mean(salary, na.rm = T),SD = sd(salary, na.rm = T))
## Warning: Factor `male` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 3
##   male   mean_salary     SD
##   <fct>        <dbl>  <dbl>
## 1 Female      63902. 26928.
## 2 Male        74841. 30834.
## 3 <NA>        73327. 18970.

MODELING

The results of fitting a bivariate model with Salary as the outcome and Male as the single predictors are as follows:

lm(salary ~ male, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ male, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51916 -22631  -5674  18366 139731 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63902.0      413.7  154.47   <2e-16 ***
## maleMale     10938.5      606.8   18.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28810 on 9058 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.03463,    Adjusted R-squared:  0.03452 
## F-statistic: 324.9 on 1 and 9058 DF,  p-value: < 2.2e-16

They show that the expected salary of a Female is $63,902. Males are expected to make $10,938.50 more than Females. This difference is statistically significant at the 97.5% confidence level with the following confidence intervals that outline the differences in the larger population of Male:

lm(salary ~ male, data = opm2008) %>% confint()
##                 2.5 %   97.5 %
## (Intercept) 63091.101 64712.88
## maleMale     9748.986 12128.09

However, there are other characteristics that might influence salary and might be correlated with Male. As a result, the coefiicient on Male could possibly be biased:

opm2008 %>% select(salary, grade, age, edyrs, yos) %>% 
  cor(use = "pairwise.complete.obs") %>%
  round(2)
##        salary grade   age edyrs   yos
## salary   1.00  0.91  0.23  0.52  0.31
## grade    0.91  1.00  0.13  0.52  0.22
## age      0.23  0.13  1.00 -0.02  0.60
## edyrs    0.52  0.52 -0.02  1.00 -0.11
## yos      0.31  0.22  0.60 -0.11  1.00

As the table above shows, many potential predictors of salary are correlated with each other. Using all of them in the model might be not a good idea because of variance inflation. The variables that add little explanatory power to the model will be dropped.

One of the most essential differences among Male and Females that influences salary is their education years edyrs. Edyrs is highly correlated with age and yos. The first adjustment to the model is adding the edyrs variable:

lm(salary ~ male + edyrs, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ male + edyrs, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65917 -17416  -3778  14867 122012 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20124.9     1550.3  -12.98   <2e-16 ***
## maleMale      6070.2      531.0   11.43   <2e-16 ***
## edyrs         5859.4      105.2   55.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24860 on 9057 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.2809, Adjusted R-squared:  0.2808 
## F-statistic:  1769 on 2 and 9057 DF,  p-value: < 2.2e-16

The adjusted model appears to be a significant improvement over the bivariate model. It explains 28.08 % percent of the variation in the outcome as opposed to 3.45% percent of the variation in the first model - this shows a substantial improvement comparing the two models.

Next adjustment is made by adding age to the model:

lm(salary ~ male + edyrs + age, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ male + edyrs + age, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73296 -16620  -3505  14411 111499 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -50822.71    1882.34  -27.00   <2e-16 ***
## maleMale      5234.69     512.16   10.22   <2e-16 ***
## edyrs         5927.22     101.31   58.51   <2e-16 ***
## age            641.22      23.96   26.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23940 on 9056 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.3336, Adjusted R-squared:  0.3334 
## F-statistic:  1511 on 3 and 9056 DF,  p-value: < 2.2e-16

The coefficent on age is small and not too significant. The improvement in the Adjusted R-squared is minor: 33.34 - 28.08 = 5.26 percentage point and adding the variable doesn’t significantly improve the model. Therefore, age will be dropped from the model.

Adding yos to the model yields the following coefficients:

lm(salary ~ male + edyrs + yos, data = opm2008) %>% summary()
## 
## Call:
## lm(formula = salary ~ male + edyrs + yos, data = opm2008)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79104 -15166  -3248  12682 128264 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43377.57    1481.95  -29.27   <2e-16 ***
## maleMale      6048.69     477.55   12.67   <2e-16 ***
## edyrs         6348.08      95.19   66.69   <2e-16 ***
## yos           1007.68      21.77   46.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22360 on 9056 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.4185, Adjusted R-squared:  0.4183 
## F-statistic:  2172 on 3 and 9056 DF,  p-value: < 2.2e-16

The coefficient on yos appears to be stattistically significant and adding the variable improves the Adjusted R-squred.

Adjusting the model by adding grade and age does not significantly improve it.

SUMMARY

The three predictors to this model are Gender (Male), education years (edyrs) and years of service (yos). According to the data in this sample, men are expected to make $6048.69 more than women who have the same education years and years of service. The difference is statistically significant so this data can be extended to larger population of people. Men receive more salary than Females do in equal terms of education years and years of service. These data provide convincing evidence that women in the federal government are paid less than men.

APPENDIX

opm2008 %>% ggplot(mapping = aes(x = male, y = salary)) + geom_boxplot() #boxplot sal between male or female
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).

ggplot(data = opm2008, aes(x = salary)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite values (stat_bin).

ggplot(data = opm2008, aes(x = grade)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = opm2008, aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = opm2008, aes(x = edyrs)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = opm2008, aes(x = yos)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = opm2008, aes(x = grade, y = salary)) + geom_point()
## Warning: Removed 8 rows containing missing values (geom_point).

ggplot(data = opm2008, aes(x = age, y = salary)) + geom_point()
## Warning: Removed 8 rows containing missing values (geom_point).

ggplot(data = opm2008, aes(x = edyrs, y = salary)) + geom_point()
## Warning: Removed 8 rows containing missing values (geom_point).

ggplot(data = opm2008, aes(x = yos, y = salary)) + geom_point()
## Warning: Removed 8 rows containing missing values (geom_point).

plot(x = opm2008$male, y = opm2008$salary)