This assignment uses only synthetic test data from the Linked-Employer-Employee-Data of the IAB (LIAB) provided by the Research Data Center (FDZ) of the German Fedral Employment Agency (BA) at the Institute for Employment Research (IAB). These data are random and should not be used to draw an inferences. This exercise is for educational purposes only. These data are used to provide an example of the structure of administrative employment records and offer students to develop data analysis skills using data with an authentic structure.
This guide contains examples of plots you may find useful in the Pay Equity Analysis. Keep in mind that you can substitute any variable you like for those pictured in the plots provided it is the same level of measurement. Try to avoid blindly replicating everything here. Go in with purpose and look at what you produce to guide your analysis.
Code to create is graph can be revealed by clicking on the Code tabs. More help with ggplot2 is available in this cheat sheet. Color names can be found here if you want to choose your own.
To run any of these codes you must have your firmcl dataframe loaded into RStudio, selected just your business unit, and run library(ggplot2) and library(dplyr).
bu %>% ggplot(aes(x=tentgelt, fill = frau)) + geom_histogram(binwidth = 10) +
labs(x = "Daily Wage in Euors",
y = "Count of Workers",
title = "Distribution of Wages",
subtitle = "Business Unit 1",
caption = "Data source: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic() +
scale_fill_manual(values=c("skyblue", "tomato1")) +
theme(legend.title = element_blank(), legend.position = "top")bu %>% ggplot(aes(x=tentgelt, color = frau, fill = frau)) + geom_density(kernel = "gaussian", alpha = .5) +
labs(x = "Daily Wage in Euors",
y = "Density",
title = "Distribution of Wages",
subtitle = "Business Unit 1",
caption = "Data source: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic() +
scale_fill_manual(values=c("skyblue", "tomato1")) +
scale_color_manual(values = c("skyblue", "tomato1")) +
theme(legend.title = element_blank(), legend.position = "top")Unlike the prior graphs, this one requires you to first summarize the data.
facetbar <- bu %>% group_by(frau, niveau) %>% summarize(avewage = mean(tentgelt))
facetbar %>% ggplot(aes(x=niveau, y=avewage, fill = frau)) + geom_col() +
facet_wrap(~ frau, nrow = 2) +
coord_flip() +
labs(x = "Skill Level",
y = "Average Daily Wage in Euros",
title = "Comparison of Average Wages by Skill and Gender",
caption = "Data souce: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic() +
scale_fill_manual(values=c("skyblue", "tomato1")) +
theme(legend.position = "none")bu %>% ggplot(aes(x=tage_job, y=tentgelt, color=frau)) + geom_point() +
labs(x = "Days in Current Job",
y = "Average Daily Wage in Euros",
title = "Relationship between Wages and Experience for Men and Women",
caption = "Data Source: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic() +
scale_color_manual(values=c("skyblue", "tomato1")) +
theme(legend.title = element_blank(), legend.position = "top")bu %>% ggplot(aes(x=tage_job, y=tentgelt, color=frau)) + geom_point(alpha = .25) +
labs(x = "Days in Current Job",
y = "Average Daily Wage in Euros",
title = "Relationship between Wages and Experience for Men and Women",
subtitle = "Displaying Linear Regression Fit and 95% Confidence Intervals",
caption = "Data Source: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic() +
scale_color_manual(values=c("skyblue", "tomato1")) +
theme(legend.title = element_blank(), legend.position = "top") +
geom_smooth(method = 'lm')bu %>% ggplot(aes(x=tage_job, y=tentgelt, color=frau)) + geom_point(alpha = .25) +
labs(x = "Days in Current Job",
y = "Average Daily Wage in Euros",
title = "Relationship between Wages and Experience for Men and Women",
subtitle = "Displaying Loess Smoothing and 95% Confidence Intervals",
caption = "Data Source: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic() +
scale_color_manual(values=c("skyblue", "tomato1")) +
theme(legend.title = element_blank(), legend.position = "top") +
geom_smooth(method = 'loess')bu %>% ggplot(aes(x=befrist, y=tentgelt, color = frau)) + geom_boxplot() +
labs(x = "Contract Type",
y = "Average Daily Wage in Euros",
title = "Distribution of Wages by Contract Type and Gender",
caption = "Data source: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic() +
scale_color_manual(values=c("skyblue", "tomato1")) +
theme(legend.position = "none") +
facet_wrap(~frau)The font size is handled within the theme_classic() call by setting the base font size (the smallest text will be). All other fonts adjust up relative to the base size you set. The zoom is handled by telling R what ranges of the x and y axis you want to include in the coord_cartesian() call added to the ggplot.
bu %>% ggplot(aes(x=tage_job, y=tentgelt, color=frau)) + geom_point(alpha = .25) +
labs(x = "Days in Current Job",
y = "Average Daily Wage in Euros",
title = "Relationship between Wages and Experience",
subtitle = "Displaying Loess Smoothing and 95% Confidence Intervals",
caption = "Data Source: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic(base_size = 16) +
scale_color_manual(values=c("skyblue", "tomato1")) +
theme(legend.title = element_blank(), legend.position = "top") +
geom_smooth(method = 'loess') +
coord_cartesian(xlim =c(3000, 7000), ylim = c(50, 150))Because of the coord_flip for this graph, the zoom is handled differently. Your limits are provided right inside the coord_flip call and you apply the limits to the y axis since that’s where avewage is assigned in aes() even though it appears on the x axis eventually.
facetbar <- bu %>% group_by(frau, niveau) %>% summarize(avewage = mean(tentgelt))
facetbar %>% ggplot(aes(x=niveau, y=avewage, fill = frau)) + geom_col() +
facet_wrap(~ frau, nrow = 2) +
coord_flip(ylim=c(40,100)) +
labs(x = "Skill Level",
y = "Average Daily Wage in Euros",
title = "Comparison of Average Wages by Skill",
caption = "Data souce: 2014 Administrative Employment Records, Business Unit 1") +
theme_classic(base_size = 16) +
scale_fill_manual(values=c("skyblue", "tomato1")) +
theme(legend.position = "none") The Glassdoor article mentions a more advanced method for analyzing pay equity called an Oaxaca Decomposition. This method can be implemented easily in R with a visual summary of results. However, as cautioned in the reading, it does require lots of data to obtain precise estimates and does impose some strong assumptions. Nonetheless, it can help you to determine whether your descriptive analysis is focusing in the right places.
To use this method you must install package oaxaca. You should also create non-ordered versions of any ordered factors in your analysis to ensure R will display clear variable labels in the output. It would also be a good idea to scale the experience variables to years rather than days if you have not done so already. Finally, you need to create a logical version (true/false) of your gender variable. The code to do each of these things as well as run an Oaxaca Decomposition is below. The output is summarized visually in a plot. Be patient. The model takes some time to fit. For more help understanding what the model does and interpreting results, please see the article Blinder-Oaxaca Decomposition in R on D2L.
#Create non-ordered versions of any ordered factors
firmcl$niveau_no <- factor(firmcl$niveau, ordered = FALSE)
firmcl$schule_no <- factor(firmcl$schule, ordered = FALSE)
#Create experience in years
firmcl$years_erw <- firmcl$tage_erw/365
firmcl$years_job <- firmcl$tage_job/365
#Create a logical variable based on your gender variable equal to 1 for females.
firmcl$frau_tf <- firmcl$frau == "Female"
#Run Oaxaca
results <- oaxaca(formula = tentgelt ~ german + niveau_no + tage_erw + tage_job + schule_no | frau_tf , data = firmcl, R = 30)## oaxaca: oaxaca() performing analysis. Please wait.
##
## Bootstrapping standard errors:
## 1 / 30 (3.33%)
## 3 / 30 (10%)
## 6 / 30 (20%)
## 9 / 30 (30%)
## 12 / 30 (40%)
## 15 / 30 (50%)
## 18 / 30 (60%)
## 21 / 30 (70%)
## 24 / 30 (80%)
## 27 / 30 (90%)
## 30 / 30 (100%)
plot(results, components = c("endowments", "coefficients"))