library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.2
library(summarytools)
## Warning: package 'summarytools' was built under R version 3.6.2
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
library(descr)
## Warning: package 'descr' was built under R version 3.6.2
##
## Attaching package: 'descr'
## The following objects are masked from 'package:summarytools':
##
## descr, freq
setwd("C:/Users/ramin/Desktop/2020 winter/Data Analysis/Project")
load("Datasets/OPM2008.RData")
According to the data opm2008, and the regression of the analysis between gender pay in salary is affected significantly by edyrs and yos. A difference in salary between Male and Female can be observed in this sample – Out of all the people in this data with the same education years (edyrs) and (yos), Males are expected to make $ 6048.69 more than Females. This effect appears to be statistically significant based on a larger population of people, these data provide evidence that women in the federal government are paid less than men.
There are 9074 individuals in this analysis and 20 variables: salary, occupation, state, grade, gender, age, edyrs, yos, promo, race, vet, VETmode, exit, major, agency, racesex, supervisor, patco, gradecat, and educ4.
summary(opm2008$male)
## Female Male NA's
## 4853 4215 6
The outcome variable in this analysis is income salary measured numerically in US Dollars. Since salary is numeric, we can use a linear model. It ranges from $19,646 to $214,572 US Dollars with a mean of $68,988.
The independent variable of interest is gender (male or female). In the dataset, the class of the variable is categorical, male and female are each measured separately to conduct the difference in Salary. There are 53 percent females and 47 percent males with a mean salary of $63,902 and $74,841 respectively.
Other variables that can potentially affect salary are the following:
• Occupation: The kind of (categorical)job the person has. • State: The State where the person is working(categorical) and getting their salary from. • edyrs: Years of education(numerical) for the person who is working. • yos: Years of service (numerical), the amount of years the person has been working. • Race: The race of the individual (categorical). • Major: What the person has studied in University (categorical). • Agency: The agency the person works for (categorical). • Educ4 : The persons education level (categorical).
Calculating the means in salary of male and females shows that there is a significant difference of the average Males having a higher salary than Females
opm2008 %>% select(male, salary) %>% group_by(male) %>% summarise(mean_salary = mean(salary, na.rm = T),SD = sd(salary, na.rm = T))
## Warning: Factor `male` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## # A tibble: 3 x 3
## male mean_salary SD
## <fct> <dbl> <dbl>
## 1 Female 63902. 26928.
## 2 Male 74841. 30834.
## 3 <NA> 73327. 18970.
The results of fitting a bivariate model with Salary as the outcome and Male as the single predictors are as follows:
lm(salary ~ male, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ male, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51916 -22631 -5674 18366 139731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63902.0 413.7 154.47 <2e-16 ***
## maleMale 10938.5 606.8 18.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28810 on 9058 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.03463, Adjusted R-squared: 0.03452
## F-statistic: 324.9 on 1 and 9058 DF, p-value: < 2.2e-16
They show that the expected salary of a Female is $63,902. Males are expected to make $10,938.50 more than Females. This difference is statistically significant at the 97.5% confidence level with the following confidence intervals that outline the differences in the larger population of Male:
lm(salary ~ male, data = opm2008) %>% confint()
## 2.5 % 97.5 %
## (Intercept) 63091.101 64712.88
## maleMale 9748.986 12128.09
However, there are other characteristics that might influence salary and might be correlated with Male. As a result, the coefiicient on Male could possibly be biased:
opm2008 %>% select(salary, grade, age, edyrs, yos) %>%
cor(use = "pairwise.complete.obs") %>%
round(2)
## salary grade age edyrs yos
## salary 1.00 0.91 0.23 0.52 0.31
## grade 0.91 1.00 0.13 0.52 0.22
## age 0.23 0.13 1.00 -0.02 0.60
## edyrs 0.52 0.52 -0.02 1.00 -0.11
## yos 0.31 0.22 0.60 -0.11 1.00
As the table above shows, many potential predictors of salary are correlated with each other. Using all of them in the model might be not a good idea because of variance inflation. The variables that add little explanatory power to the model will be dropped.
One of the most essential differences among Male and Females that influences salary is their education years edyrs. Edyrs is highly correlated with age and yos. The first adjustment to the model is adding the edyrs variable:
lm(salary ~ male + edyrs, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ male + edyrs, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65917 -17416 -3778 14867 122012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20124.9 1550.3 -12.98 <2e-16 ***
## maleMale 6070.2 531.0 11.43 <2e-16 ***
## edyrs 5859.4 105.2 55.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24860 on 9057 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.2809, Adjusted R-squared: 0.2808
## F-statistic: 1769 on 2 and 9057 DF, p-value: < 2.2e-16
The adjusted model appears to be a significant improvement over the bivariate model. It explains 28.08 % percent of the variation in the outcome as opposed to 3.45% percent of the variation in the first model - this shows a substantial improvement comparing the two models.
Next adjustment is made by adding age to the model:
lm(salary ~ male + edyrs + age, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ male + edyrs + age, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73296 -16620 -3505 14411 111499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50822.71 1882.34 -27.00 <2e-16 ***
## maleMale 5234.69 512.16 10.22 <2e-16 ***
## edyrs 5927.22 101.31 58.51 <2e-16 ***
## age 641.22 23.96 26.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23940 on 9056 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.3334
## F-statistic: 1511 on 3 and 9056 DF, p-value: < 2.2e-16
The coefficent on age is small and not too significant. The improvement in the Adjusted R-squared is minor: 33.34 - 28.08 = 5.26 percentage point and adding the variable doesn’t significantly improve the model. Therefore, age will be dropped from the model.
Adding yos to the model yields the following coefficients:
lm(salary ~ male + edyrs + yos, data = opm2008) %>% summary()
##
## Call:
## lm(formula = salary ~ male + edyrs + yos, data = opm2008)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79104 -15166 -3248 12682 128264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43377.57 1481.95 -29.27 <2e-16 ***
## maleMale 6048.69 477.55 12.67 <2e-16 ***
## edyrs 6348.08 95.19 66.69 <2e-16 ***
## yos 1007.68 21.77 46.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22360 on 9056 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.4185, Adjusted R-squared: 0.4183
## F-statistic: 2172 on 3 and 9056 DF, p-value: < 2.2e-16
The coefficient on yos appears to be stattistically significant and adding the variable improves the Adjusted R-squred.
Adjusting the model by adding grade and age does not significantly improve it.
The three predictors to this model are Gender (Male), education years (edyrs) and years of service (yos). According to the data in this sample, men are expected to make $6048.69 more than women who have the same education years and years of service. The difference is statistically significant so this data can be extended to larger population of people. Men receive more salary than Females do in equal terms of education years and years of service. These data provide convincing evidence that women in the federal government are paid less than men.
opm2008 %>% ggplot(mapping = aes(x = male, y = salary)) + geom_boxplot() #boxplot sal between male or female
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).
ggplot(data = opm2008, aes(x = salary)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite values (stat_bin).
ggplot(data = opm2008, aes(x = grade)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = opm2008, aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = opm2008, aes(x = edyrs)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = opm2008, aes(x = yos)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = opm2008, aes(x = grade, y = salary)) + geom_point()
## Warning: Removed 8 rows containing missing values (geom_point).
ggplot(data = opm2008, aes(x = age, y = salary)) + geom_point()
## Warning: Removed 8 rows containing missing values (geom_point).
ggplot(data = opm2008, aes(x = edyrs, y = salary)) + geom_point()
## Warning: Removed 8 rows containing missing values (geom_point).
ggplot(data = opm2008, aes(x = yos, y = salary)) + geom_point()
## Warning: Removed 8 rows containing missing values (geom_point).
plot(x = opm2008$male, y = opm2008$salary)