Final Data Analysis Project

Ruplal Lama

2023-12-11

Issue Description

  Exploring Wage Disparities in the U.S.

This project delves into a comprehensive analysis of hourly wages across the United States, with a special emphasis on understanding how these wages vary among various demographic groups. We’ll be exploring key differences in median and average earnings based on gender, racial background, and educational attainment.

Questions

  1. How has income inequality between men and women changed over the past 50 years? Has educational attainment made a difference, and where does the gap stand today?

  2. Does the U.S. exhibit differences in wages across racial groups, and how have these variations in median hourly earnings evolved over the past 50 years for diverse racial demographics?

Data Source

The data is sourced from the Economic Policy Institute’s (EPI) State of Working America Data Library. It is a trusted and reliable source for economic data. EPI provides researchers, media, and the public with easily accessible, up-to-date, and comprehensive historical data on the American labor force.

Economic Policy Institute https://www.epi.org/data/

Documentation

Data on the labor force in the “Employment” section are compiled from EPI analysis of basic monthly Current Population Survey microdata. Data reflect 12-month moving averages as of the latest month of data.

Demographics: Data represent people ages 16 and older unless otherwise noted.

Race/ethnicity

Race/ethnicity categories are mutually exclusive.

Black: Black non-Hispanic

White: White non-Hispanic

Hispanic: Hispanic any race

Education

Educational categories are mutually exclusive and represent the highest education level attained for all individuals ages 16 and older.

Additional documentation and the data dictionary can be accessed at the following URL: https://www.epi.org/data

Setup and Loading Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
wages_edu_gender <- read_csv("wages_edu_gender.csv")
## Rows: 50 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (30): Less than HS, High school, Some college, Bachelor's degree, Advanc...
## dbl  (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
wages_race <- read_csv("wages_race.csv")
## Rows: 50 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Median, Average, White Median, White Average, Black Median, Black A...
## dbl (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Description of the Data

str(wages_edu_gender)
## spc_tbl_ [50 × 31] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Date                         : num [1:50] 1973 1974 1975 1976 1977 ...
##  $ Less than HS                 : chr [1:50] "$18.06" "$17.68" "$17.30" "$17.52" ...
##  $ High school                  : chr [1:50] "$22.22" "$21.60" "$21.55" "$21.76" ...
##  $ Some college                 : chr [1:50] "$24.08" "$23.32" "$23.30" "$23.49" ...
##  $ Bachelor's degree            : chr [1:50] "$32.80" "$31.69" "$31.45" "$31.46" ...
##  $ Advanced degree              : chr [1:50] "$38.16" "$38.37" "$38.41" "$37.50" ...
##  $ Less than HS Share           : chr [1:50] "31.6%" "30.4%" "28.2%" "27.1%" ...
##  $ High school Share            : chr [1:50] "36.3%" "36.0%" "36.3%" "36.2%" ...
##  $ Some college Share           : chr [1:50] "17.6%" "18.5%" "19.3%" "20.0%" ...
##  $ Bachelor's degree Share      : chr [1:50] "10.3%" "10.6%" "11.5%" "11.6%" ...
##  $ Advanced degree Share        : chr [1:50] "4.2%" "4.6%" "4.7%" "5.1%" ...
##  $ Men Less than HS             : chr [1:50] "$21.18" "$20.63" "$20.00" "$20.36" ...
##  $ Men High school              : chr [1:50] "$26.90" "$26.15" "$26.02" "$26.14" ...
##  $ Men Some college             : chr [1:50] "$27.67" "$26.79" "$26.93" "$27.10" ...
##  $ Men Bachelor's degree        : chr [1:50] "$37.69" "$36.62" "$36.21" "$36.42" ...
##  $ Men Advanced degree          : chr [1:50] "$40.09" "$41.03" "$40.86" "$40.31" ...
##  $ Men Less than HS Share       : chr [1:50] "33.4%" "32.0%" "29.9%" "29.0%" ...
##  $ Men High school Share        : chr [1:50] "32.6%" "32.4%" "32.8%" "32.8%" ...
##  $ Men Some college Share       : chr [1:50] "18.3%" "19.3%" "19.7%" "20.4%" ...
##  $ Men Bachelor's degree Share  : chr [1:50] "10.5%" "10.6%" "11.7%" "11.7%" ...
##  $ Men Advanced degree Share    : chr [1:50] "5.2%" "5.8%" "5.8%" "6.1%" ...
##  $ Women Less than HS           : chr [1:50] "$12.89" "$12.87" "$12.91" "$12.96" ...
##  $ Women High school            : chr [1:50] "$16.97" "$16.49" "$16.54" "$17.01" ...
##  $ Women Some college           : chr [1:50] "$18.41" "$17.91" "$17.91" "$18.37" ...
##  $ Women Bachelor's degree      : chr [1:50] "$25.50" "$24.70" "$24.44" "$24.52" ...
##  $ Women Advanced degree        : chr [1:50] "$32.73" "$30.78" "$32.14" "$31.05" ...
##  $ Women Less than HS Share     : chr [1:50] "28.9%" "28.0%" "25.9%" "24.5%" ...
##  $ Women High school Share      : chr [1:50] "41.7%" "41.1%" "41.1%" "41.0%" ...
##  $ Women Some college Share     : chr [1:50] "16.6%" "17.5%" "18.7%" "19.5%" ...
##  $ Women Bachelor's degree Share: chr [1:50] "10.1%" "10.6%" "11.1%" "11.4%" ...
##  $ Women Advanced degree Share  : chr [1:50] "2.7%" "2.9%" "3.2%" "3.6%" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date = col_double(),
##   ..   `Less than HS` = col_character(),
##   ..   `High school` = col_character(),
##   ..   `Some college` = col_character(),
##   ..   `Bachelor's degree` = col_character(),
##   ..   `Advanced degree` = col_character(),
##   ..   `Less than HS Share` = col_character(),
##   ..   `High school Share` = col_character(),
##   ..   `Some college Share` = col_character(),
##   ..   `Bachelor's degree Share` = col_character(),
##   ..   `Advanced degree Share` = col_character(),
##   ..   `Men Less than HS` = col_character(),
##   ..   `Men High school` = col_character(),
##   ..   `Men Some college` = col_character(),
##   ..   `Men Bachelor's degree` = col_character(),
##   ..   `Men Advanced degree` = col_character(),
##   ..   `Men Less than HS Share` = col_character(),
##   ..   `Men High school Share` = col_character(),
##   ..   `Men Some college Share` = col_character(),
##   ..   `Men Bachelor's degree Share` = col_character(),
##   ..   `Men Advanced degree Share` = col_character(),
##   ..   `Women Less than HS` = col_character(),
##   ..   `Women High school` = col_character(),
##   ..   `Women Some college` = col_character(),
##   ..   `Women Bachelor's degree` = col_character(),
##   ..   `Women Advanced degree` = col_character(),
##   ..   `Women Less than HS Share` = col_character(),
##   ..   `Women High school Share` = col_character(),
##   ..   `Women Some college Share` = col_character(),
##   ..   `Women Bachelor's degree Share` = col_character(),
##   ..   `Women Advanced degree Share` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(wages_race)
## spc_tbl_ [50 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Date            : num [1:50] 2022 2021 2020 2019 2018 ...
##  $ Median          : chr [1:50] "$22.88" "$23.05" "$23.64" "$22.12" ...
##  $ Average         : chr [1:50] "$32.00" "$32.08" "$32.54" "$30.36" ...
##  $ White Median    : chr [1:50] "$24.96" "$25.40" "$25.98" "$24.39" ...
##  $ White Average   : chr [1:50] "$34.49" "$34.50" "$34.86" "$32.79" ...
##  $ Black Median    : chr [1:50] "$19.60" "$19.45" "$19.85" "$18.45" ...
##  $ Black Average   : chr [1:50] "$25.61" "$25.40" "$26.03" "$24.09" ...
##  $ Hispanic Median : chr [1:50] "$18.93" "$19.14" "$19.21" "$18.19" ...
##  $ Hispanic Average: chr [1:50] "$24.84" "$24.90" "$25.29" "$23.49" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Date = col_double(),
##   ..   Median = col_character(),
##   ..   Average = col_character(),
##   ..   `White Median` = col_character(),
##   ..   `White Average` = col_character(),
##   ..   `Black Median` = col_character(),
##   ..   `Black Average` = col_character(),
##   ..   `Hispanic Median` = col_character(),
##   ..   `Hispanic Average` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data Cleaning

Deleting unnecessary columns from the data

# Remove columns whose names contain "share", "less than", or "some"
# (contains() ignores case by default, so "Share" and "Some" also match)
wages_edu_gender <- wages_edu_gender %>%
  select(
    -contains("share"),
    -contains("less than"),
    -contains("some"))
# Function to clean wage columns (remove $ and commas, convert to numeric)
clean_wage_column <- function(column) {
  as.numeric(gsub("\\$", "", gsub(",", "", column)))
}

# Apply this function to all character columns that are actually numeric 
wages_edu_gender <- wages_edu_gender %>%
  mutate(across(where(is.character), clean_wage_column))
str(wages_edu_gender)
## tibble [50 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Date                   : num [1:50] 1973 1974 1975 1976 1977 ...
##  $ High school            : num [1:50] 22.2 21.6 21.6 21.8 21.5 ...
##  $ Bachelor's degree      : num [1:50] 32.8 31.7 31.4 31.5 31.1 ...
##  $ Advanced degree        : num [1:50] 38.2 38.4 38.4 37.5 37.4 ...
##  $ Men High school        : num [1:50] 26.9 26.1 26 26.1 26 ...
##  $ Men Bachelor's degree  : num [1:50] 37.7 36.6 36.2 36.4 36.1 ...
##  $ Men Advanced degree    : num [1:50] 40.1 41 40.9 40.3 40.6 ...
##  $ Women High school      : num [1:50] 17 16.5 16.5 17 16.7 ...
##  $ Women Bachelor's degree: num [1:50] 25.5 24.7 24.4 24.5 23.9 ...
##  $ Women Advanced degree  : num [1:50] 32.7 30.8 32.1 31.1 30.3 ...
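
The cleaning helper can be checked in isolation on a few strings. The sketch below uses base R only; `sample_wages` and `share_result` are hypothetical names, and the value with a thousands separator is made up to exercise the comma handling. It also shows why the share columns have to be dropped before the helper is applied to every character column: a percent string is not handled and turns into `NA`.

```r
# Same cleaning helper as in the report: strip "$" and ",", then convert.
clean_wage_column <- function(column) {
  as.numeric(gsub("\\$", "", gsub(",", "", column)))
}

# Two values in the style of the wage columns, plus a made-up value
# with a thousands separator (not taken from the dataset).
sample_wages <- c("$18.06", "$22.22", "$1,234.50")
cleaned <- clean_wage_column(sample_wages)
print(cleaned)

# A percent string is not handled by the helper and becomes NA.
# as.numeric() also emits a coercion warning, which is silenced here.
share_result <- suppressWarnings(clean_wage_column("31.6%"))
print(share_result)

# Sanity check: no wage value was lost during conversion.
stopifnot(!anyNA(cleaned))
```

Running a check like `anyNA()` on the cleaned data frame is a cheap way to confirm that every string was converted as intended.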

Data Analysis and Answering the Questions

Has there been income inequality between men and women over the past 50 years? Has educational attainment made a difference?

# Reshaping the data for easier plotting
wages_long <- wages_edu_gender %>%
  pivot_longer(-Date, names_to = "category", values_to = "wage") %>%
  # Columns without a "Men"/"Women" prefix cover all workers
  mutate(Gender = coalesce(str_extract(category, "^(Men|Women)"), "All"),
         Education = str_remove(category, "^(Men|Women)\\s")) %>%
  select(-category) %>%
  unite("Gender_Education", Gender, Education, sep = " - ")

Plotting

ggplot(wages_long, aes(x = Date, y = wage, color = Gender_Education, group = Gender_Education)) +
  geom_line() +
  labs(title = "Gender Wage Gap Over the Years Across Different Education Levels",
       x = "Year",
       y = "Wage",
       color = "Category") +
  theme_minimal() +
  theme(legend.position = "right")

Observations from the graph:

Across education levels and for both men and women, wages have generally increased over time. The gap between high school and advanced degree wages is notable: in 1973, for example, workers with an advanced degree earned $38.16 per hour, compared with $22.22 for workers with a high school diploma. Wages also differ between men and women at each education level, with men generally earning more than women in the corresponding category. This is clear from comparing ‘Men High school’ with ‘Women High school’, ‘Men Bachelor’s degree’ with ‘Women Bachelor’s degree’, and ‘Men Advanced degree’ with ‘Women Advanced degree’.
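
To put a number on the gap described above, one option is to express women's hourly wage as a share of men's at each education level. The sketch below is a minimal illustration in base R, using only the 1973 values visible in the `str()` output earlier in the report; the data frame `wages_1973` and the columns `ratio` and `gap` are hypothetical names, not part of the original analysis.

```r
# 1973 hourly wages copied from the first row of the str() output above.
wages_1973 <- data.frame(
  education = c("High school", "Bachelor's degree", "Advanced degree"),
  men       = c(26.90, 37.69, 40.09),
  women     = c(16.97, 25.50, 32.73)
)

# Women's wage as a share of men's wage at the same education level.
wages_1973$ratio <- wages_1973$women / wages_1973$men

# Dollar difference per hour between men and women.
wages_1973$gap <- wages_1973$men - wages_1973$women

print(wages_1973)
```

For 1973 this gives ratios of roughly 63, 68, and 82 percent for high school, bachelor's, and advanced degrees respectively. Applying the same calculation to every year in `wages_edu_gender` would show whether the ratio has moved toward 100 percent over time.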

Cleaning the Second Dataset

summary(wages_race)
##       Date         Median            Average          White Median      
##  Min.   :1973   Length:50          Length:50          Length:50         
##  1st Qu.:1985   Class :character   Class :character   Class :character  
##  Median :1998   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1998                                                           
##  3rd Qu.:2010                                                           
##  Max.   :2022                                                           
##  White Average      Black Median       Black Average      Hispanic Median   
##  Length:50          Length:50          Length:50          Length:50         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Hispanic Average  
##  Length:50         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Function to remove dollar signs and convert columns to numeric
clean_wage <- function(column) {
  as.numeric(gsub("\\$", "", column))
}

# Applying the function to all wage columns
wage_columns <- c("Median", "Average", "White Median", "White Average", 
                  "Black Median", "Black Average", "Hispanic Median", "Hispanic Average")

wages_race <- wages_race %>%
  mutate(across(all_of(wage_columns), clean_wage))
str(wages_race)
## tibble [50 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Date            : num [1:50] 2022 2021 2020 2019 2018 ...
##  $ Median          : num [1:50] 22.9 23.1 23.6 22.1 21.9 ...
##  $ Average         : num [1:50] 32 32.1 32.5 30.4 29.8 ...
##  $ White Median    : num [1:50] 25 25.4 26 24.4 24 ...
##  $ White Average   : num [1:50] 34.5 34.5 34.9 32.8 32.4 ...
##  $ Black Median    : num [1:50] 19.6 19.4 19.9 18.4 17.6 ...
##  $ Black Average   : num [1:50] 25.6 25.4 26 24.1 23.5 ...
##  $ Hispanic Median : num [1:50] 18.9 19.1 19.2 18.2 17.5 ...
##  $ Hispanic Average: num [1:50] 24.8 24.9 25.3 23.5 22.8 ...
summary(wages_race)
##       Date          Median         Average       White Median   White Average  
##  Min.   :1973   Min.   :18.78   Min.   :22.42   Min.   :19.43   Min.   :23.08  
##  1st Qu.:1985   1st Qu.:19.28   1st Qu.:23.05   1st Qu.:20.13   1st Qu.:23.93  
##  Median :1998   Median :19.81   Median :24.55   Median :21.11   Median :26.00  
##  Mean   :1998   Mean   :20.26   Mean   :25.56   Mean   :21.69   Mean   :27.03  
##  3rd Qu.:2010   3rd Qu.:21.11   3rd Qu.:27.34   3rd Qu.:22.93   3rd Qu.:29.43  
##  Max.   :2022   Max.   :23.64   Max.   :32.54   Max.   :25.98   Max.   :34.86  
##   Black Median   Black Average   Hispanic Median Hispanic Average
##  Min.   :15.87   Min.   :18.36   Min.   :14.27   Min.   :18.01   
##  1st Qu.:16.16   1st Qu.:19.22   1st Qu.:15.50   1st Qu.:18.76   
##  Median :16.90   Median :20.32   Median :15.89   Median :19.25   
##  Mean   :17.10   Mean   :20.98   Mean   :16.06   Mean   :19.98   
##  3rd Qu.:17.80   3rd Qu.:22.34   3rd Qu.:16.35   3rd Qu.:20.68   
##  Max.   :19.85   Max.   :26.03   Max.   :19.21   Max.   :25.29

Answering Question Two

Do racial wage disparities exist in the U.S.? And how have median and average hourly wages trended over the past 50 years for different racial groups?

ggplot(wages_race, aes(x = Date)) +
  geom_line(aes(y = `White Median`, color = "White")) +
  geom_line(aes(y = `Black Median`, color = "Black")) +
  geom_line(aes(y = `Hispanic Median`, color = "Hispanic")) +
  labs(
    title = "Racial Wage Disparities Over Time",
    x = "Year",
    y = "Median Wage",
    color = "Race"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
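
The same chart can also be drawn from long-format data, so that one `geom_line()` call covers every group and the legend comes from a column rather than hand-typed labels. The sketch below is one possible way to do this, assuming the tidyr and ggplot2 packages are installed; `race_wide`, `race_long`, and `Median_Wage` are hypothetical names, and the four rows are the most recent values visible in the `str()` output rather than the full dataset.

```r
library(tidyr)
library(ggplot2)

# Four most recent years of median wages, copied from the str() output above.
race_wide <- data.frame(
  Date = c(2022, 2021, 2020, 2019),
  `White Median`    = c(24.96, 25.40, 25.98, 24.39),
  `Black Median`    = c(19.60, 19.45, 19.85, 18.45),
  `Hispanic Median` = c(18.93, 19.14, 19.21, 18.19),
  check.names = FALSE  # keep the spaces in the column names
)

# One row per year and group instead of one column per group.
race_long <- pivot_longer(
  race_wide,
  cols = -Date,
  names_to = "Race",
  values_to = "Median_Wage"
)

# Drop the " Median" suffix so the legend shows only the group name.
race_long$Race <- sub(" Median$", "", race_long$Race)

# A single geom_line() now draws one line per group.
p <- ggplot(race_long, aes(x = Date, y = Median_Wage, color = Race)) +
  geom_line() +
  labs(title = "Racial Wage Disparities Over Time",
       x = "Year", y = "Median Wage", color = "Race") +
  theme_minimal() +
  theme(legend.position = "bottom")
```

Printing `p` renders the plot. With the full `wages_race` data, the median columns would need to be selected before pivoting, for example with `ends_with("Median")`.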

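Observations from the graph: Wages have generally increased across all groups over the years. Significant wage disparities exist between racial groups: White workers consistently have higher median wages than Black and Hispanic workers. The summary statistics show the same ordering for both measures. Averaged over 1973 to 2022, the median hourly wage was $21.69 for White workers, $17.10 for Black workers, and $16.06 for Hispanic workers, and the average hourly wage was $27.03, $20.98, and $19.98 respectively.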
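
The size of the disparity can be expressed as each group's median wage relative to the White median. The sketch below is a minimal base R illustration using only the four most recent years visible in the `str()` output; `race_recent` and the two ratio columns are hypothetical names introduced for this example.

```r
# Median hourly wages for 2019 to 2022, copied from the str() output above.
race_recent <- data.frame(
  year     = c(2022, 2021, 2020, 2019),
  white    = c(24.96, 25.40, 25.98, 24.39),
  black    = c(19.60, 19.45, 19.85, 18.45),
  hispanic = c(18.93, 19.14, 19.21, 18.19)
)

# Each group's median wage as a share of the White median in the same year.
race_recent$black_to_white    <- race_recent$black / race_recent$white
race_recent$hispanic_to_white <- race_recent$hispanic / race_recent$white

print(round(race_recent, 3))
```

In 2022 the Black median was about 79 percent of the White median and the Hispanic median about 76 percent. Running the same calculation over all 50 years would show whether these ratios have narrowed or widened.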

Conclusion

Median wages have increased for Americans over the years, and higher levels of education are associated with substantially higher wages. However, gender and racial wage disparities existed in the past and continue to exist today.

Any suggestions or questions?

Thank You