Final Project: Independent Data Intensive Research

ETR537

Author

Dr. Cansu Tatar

Published

December 14, 2023

Overview

As a final project, you’ll apply the knowledge and skills gained throughout this semester to conduct an independent analysis. Similar to the case studies provided in this course, your analysis should demonstrate your ability to wrangle, analyze, and communicate findings in response to a research question of interest. You will also be expected to use Quarto to create a reproducible data product that contains all the code used during each step of your analysis, so it can be reproduced by others.

Grading Criteria

Grading criteria for your final project are guided by the Data-Intensive Research Workflow from Learning analytics goes to school: A collaborative approach to improving education (Krumm, Means, & Bienkowski, 2018).

Prepare

The persistent wage gap between genders is a subject of significant socio-economic research and policy debate. This analysis seeks to contribute to this body of work by quantifying and visualizing the differences in median hourly wages between men and women across two education levels in the United States over a 50-year span, from 1973 to 2022. By examining these trends over time, the aim is to uncover patterns and shifts that could inform the understanding of the progress made towards wage equality and the challenges that remain.

The primary research questions focus on the evolution of the wage gap:

  • How has the wage gap between men and women who have completed high school as their highest level of education changed over the years?

  • In comparison, how has the wage gap between men and women holding bachelor’s degrees developed in the same timeframe?

The dataset used for this analysis comes from the Economic Policy Institute’s State of Working America Data Library. This data, standardized to 2022 dollars, presents a clear view of the real wage disparities.

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.3.2
Warning: package 'dplyr' was built under R version 4.3.2
Warning: package 'lubridate' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Wrangle

#Load the wage data
wage_data <- read.csv("C:/Users/joeje/OneDrive/Desktop/college_wage_premium.csv")

#Checking data structures
str(wage_data)
'data.frame':   50 obs. of  7 variables:
 $ year                  : int  2022 2021 2020 2019 2018 2017 2016 2015 2014 2013 ...
 $ high_school           : num  21.9 22.3 22.7 21.6 21.5 ...
 $ bachelors_degree      : num  41.6 41.3 41.6 39.6 38.9 ...
 $ men_high_school       : num  24.1 24.4 25.1 24 23.7 ...
 $ men_bachelors_degree  : num  49 47.8 48.1 45.7 45 ...
 $ women_high_school     : num  18.9 19.4 19.4 18.5 18.5 ...
 $ women_bachelors_degree: num  34.4 35.1 35.4 33.8 33 ...
head(wage_data)
  year high_school bachelors_degree men_high_school men_bachelors_degree
1 2022       21.94            41.60           24.08                49.01
2 2021       22.28            41.32           24.36                47.83
3 2020       22.70            41.65           25.09                48.15
4 2019       21.64            39.61           23.99                45.74
5 2018       21.50            38.87           23.72                44.97
6 2017       21.26            38.65           23.47                44.50
  women_high_school women_bachelors_degree
1             18.93                  34.39
2             19.36                  35.08
3             19.35                  35.41
4             18.48                  33.80
5             18.49                  33.03
6             18.31                  33.01
#Create new variables for the gender wage gap
wage_data <- wage_data %>%
  mutate(
    gap_high_school = (men_high_school - women_high_school)/men_high_school*100,
    gap_bachelors = (men_bachelors_degree - women_bachelors_degree)/men_bachelors_degree*100
  )

#Check new variable
head(wage_data)
  year high_school bachelors_degree men_high_school men_bachelors_degree
1 2022       21.94            41.60           24.08                49.01
2 2021       22.28            41.32           24.36                47.83
3 2020       22.70            41.65           25.09                48.15
4 2019       21.64            39.61           23.99                45.74
5 2018       21.50            38.87           23.72                44.97
6 2017       21.26            38.65           23.47                44.50
  women_high_school women_bachelors_degree gap_high_school gap_bachelors
1             18.93                  34.39        21.38704      29.83065
2             19.36                  35.08        20.52545      26.65691
3             19.35                  35.41        22.87764      26.45898
4             18.48                  33.80        22.96790      26.10407
5             18.49                  33.03        22.04890      26.55103
6             18.31                  33.01        21.98551      25.82022
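
As a quick sanity check on the new variables, the gap formula can be applied by hand to the 2022 row printed above. This is a minimal sketch in base R; the helper name `pct_gap` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis):
# women's shortfall expressed as a percentage of men's median wage
pct_gap <- function(men, women) (men - women) / men * 100

# 2022 values copied from the head(wage_data) output above
gap_hs_2022   <- pct_gap(men = 24.08, women = 18.93)
gap_bach_2022 <- pct_gap(men = 49.01, women = 34.39)

# Rounded to two decimals: high_school 21.39, bachelors 29.83
round(c(high_school = gap_hs_2022, bachelors = gap_bach_2022), 2)
```

Read this way, a value of 21.39 means that in 2022 the median hourly wage for women with a high school diploma was about 21% below the median for men with the same education.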

Analyze

#Descriptive Statistics
summary_statistics <- wage_data %>%
  summarize(
    mean_gap_high_school = mean(gap_high_school, na.rm = TRUE),
    mean_gap_bachelors = mean(gap_bachelors, na.rm = TRUE),
    sd_gap_high_school = sd(gap_high_school, na.rm = TRUE),
    sd_gap_bachelors = sd(gap_bachelors, na.rm = TRUE)
  )

#Print
print(summary_statistics)
  mean_gap_high_school mean_gap_bachelors sd_gap_high_school sd_gap_bachelors
1             26.05043           27.85043           5.529081         4.246928
#Trend Analysis; plot for high school education level
hs_gap_plot <- ggplot(wage_data, aes(x = year, y = gap_high_school)) + geom_line()+
  labs(title = "Trend of Wage Gap Over Time For High School Education Level",
       x = "Year",
       y = "Wage Gap (%)") +
  theme_minimal()

#Plot for bachelor's degree
bach_gap_plot <- ggplot(wage_data, aes(x=year, y = gap_bachelors)) + geom_line() +
  labs(title = "Trend of Wage Gap Over Time for Bachelor's Degree Holders",
       x = "Year",
       y = "Wage Gap (%)") +
  theme_minimal()

#Display
hs_gap_plot

bach_gap_plot

In the first graph, Trend of Wage Gap Over Time For High School Education Level, there is a downward trend from the 1970s to the early 2000s, indicating that the wage gap between men and women with a high school education was decreasing. Over these decades, women’s earnings relative to men’s improved. After the early 2000s, the wage gap stabilizes and fluctuates without a clear direction. This implies that while the situation improved significantly from the 1970s through the early 2000s, progress has since slowed and the wage gap has remained relatively constant. In the most recent years on the graph, there is a slight uptick that could be influenced by economic or social factors, such as economic downturns, shifts in the labor market, or changes in social policies. Overall, the gender wage gap among high school graduates has narrowed over the last 50 years, but the rate of improvement has slowed in recent years.

In the second graph, Trend of Wage Gap Over Time for Bachelor’s Degree Holders, the early years show some fluctuations, followed by a sharp decline in the wage gap from the late 1970s into the early 2000s. This suggests significant progress in closing the wage gap among individuals with a bachelor’s degree during this period. The gap then appears more stable, though with year-to-year variability in the progress towards closing it. In the most recent years, there is a sharp increase in the wage gap. The possible causes of this rise warrant further investigation, such as changes in labor market dynamics, policy shifts, or socio-economic factors that could influence wage disparity.
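
To put a number on the recent increase for bachelor’s degree holders, the year-over-year change in the gap can be computed directly. The sketch below is in base R and uses only the six rows shown in the head(wage_data) output (2017 to 2022), not the full 50-year series; the data frame name `recent` and the column `gap_change` are assumptions introduced for illustration.

```r
# Rows copied from the head(wage_data) output (2017-2022 only, not the full series)
recent <- data.frame(
  year                   = c(2022, 2021, 2020, 2019, 2018, 2017),
  men_bachelors_degree   = c(49.01, 47.83, 48.15, 45.74, 44.97, 44.50),
  women_bachelors_degree = c(34.39, 35.08, 35.41, 33.80, 33.03, 33.01)
)

# Sort oldest to newest so diff() measures change from the previous year
recent <- recent[order(recent$year), ]

# Same gap definition as in the Wrangle step
recent$gap_bachelors <- with(
  recent,
  (men_bachelors_degree - women_bachelors_degree) / men_bachelors_degree * 100
)

# Year-over-year change in percentage points; NA for the first year in the excerpt
recent$gap_change <- c(NA, diff(recent$gap_bachelors))

print(recent[, c("year", "gap_bachelors", "gap_change")], row.names = FALSE)
```

Within this excerpt, the change from 2021 to 2022 is about 3.2 percentage points, larger than any other year-over-year movement in these six rows. Running the same steps on the full data would show whether a jump of that size has precedents earlier in the series.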

Comparing both graphs, there were noticeable improvements in wage equality from the 1970s to the early 2000s, but the group with bachelor’s degrees followed a more volatile path with sharper fluctuations. The wage gap for high school graduates appears to have stabilized in recent decades, while the gap for bachelor’s degree holders has shown a troubling increase in recent years. This increase could indicate that higher education does not insulate against gender-based wage disparities and that these disparities might be growing in certain sectors or roles that require higher education.
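
A comparison like this is easier to read when both series share one set of axes. The following sketch reshapes the two gap columns into long format and draws them on a single plot. It assumes the tidyr and ggplot2 packages (both part of the tidyverse loaded earlier) and, to stay self-contained, uses only the six rows printed in the head(wage_data) output; the object names `gaps`, `gaps_long`, and `combined_gap_plot` are hypothetical.

```r
library(tidyr)
library(ggplot2)

# Gap values copied from the printed head(wage_data) output (2017-2022 only);
# in the full analysis, wage_data itself would take the place of this excerpt
gaps <- data.frame(
  year            = c(2022, 2021, 2020, 2019, 2018, 2017),
  gap_high_school = c(21.38704, 20.52545, 22.87764, 22.96790, 22.04890, 21.98551),
  gap_bachelors   = c(29.83065, 26.65691, 26.45898, 26.10407, 26.55103, 25.82022)
)

# Reshape to one row per year and education level
gaps_long <- pivot_longer(
  gaps,
  cols = c(gap_high_school, gap_bachelors),
  names_to = "education",
  values_to = "gap"
)

# One line per education level on a shared y-axis
combined_gap_plot <- ggplot(gaps_long, aes(x = year, y = gap, colour = education)) +
  geom_line() +
  labs(title = "Gender Wage Gap by Education Level",
       x = "Year",
       y = "Wage Gap (%)",
       colour = "Education") +
  theme_minimal()

combined_gap_plot
```

Putting both lines on the same scale makes differences in level and volatility visible at a glance, which two separate plots with independent y-axes can obscure.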

Communication

Between 1973 and 2022, median hourly wages reveal a clear narrative about the gender wage gap across educational levels in the United States. For individuals with high school diplomas, the wage gap narrowed from the 1970s through the early 2000s. However, the rate of improvement has decelerated from the early 2000s to the present. In contrast, the wage gap for those holding bachelor’s degrees presents a more complex trajectory. There was a period of decline in the wage gap for bachelor’s degree holders, but this trend also had significant fluctuations. The analysis shows a recent uptick in the wage gap for this group, suggesting a regression that undermines the progress previously made towards wage equalization. The findings show that higher educational attainment does not, by itself, insulate against systemic gender wage disparities: averaged over the full period, the gap was slightly larger for bachelor’s degree holders (about 27.9%) than for high school graduates (about 26.1%). The reversal in progress, especially pronounced among bachelor’s degree holders, underscores the persistent nature of the wage equality challenge.

This situation demands proactive measures to ensure that the pursuit of wage equality continues to be a focal point of social and economic policy. Policymakers and advocacy groups could push for wage equity legislation and hold corporations accountable for fostering equitable compensation. Employers and organizations could critically assess their wage practices through comprehensive internal audits. Additionally, educational institutions could institute and bolster programs designed to facilitate women’s entry into higher-paying fields where they have historically been underrepresented, helping to reshape the employment landscape. These strategies collectively represent a multifaceted approach to bridging the wage gap.

This wage gap analysis has certain limitations. It does not factor in occupational segregation, where women and men may be concentrated in different jobs or industries with different pay scales. The focus on median hourly wages means that broader compensation (bonuses, overtime pay, and other benefits) falls outside the scope of the analysis. Although the wages have been adjusted for inflation to reflect 2022 dollars, cost-of-living differences were not included in the analysis, and these can bear significantly on real income disparities. The data should therefore be interpreted and used with care. The findings should be used to encourage more work towards making wages fair for everyone, especially since the data indicate that progress has slowed. It is also important to handle the data responsibly, ensuring privacy and compliance with data protection laws, given the sensitivity of wage-related information.

Economic Policy Institute, & Asaniczka. (2023). USA Wage Comparison for College vs. High School [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DS/3836693

Assessment

This assignment is worth 18 points total. You will receive 1.5 points (15 points total) for each of the project criteria adequately addressed by your data product, and 3 points for rendering and publishing your work in a GitHub/QuartoPubs/RPubs environment.