1. Introduction

  1. Preface: This report presents an analysis of gender equality in the European labor market, focusing on the gender pay gap across various countries and industries from 2010 to 2021. The dataset, “pay_gap_Europe.csv,” provides a detailed examination of the unadjusted gender pay gap (GPG), which measures the difference between the average gross hourly earnings of male and female paid employees as a percentage of male earnings. The data also includes supplementary information such as GDP per capita, urbanization rates, and economic sector breakdowns, offering a holistic perspective on the economic and social factors that may influence gender-based pay disparities. This analysis seeks to enhance understanding of gender equity issues in the workplace, promoting efforts toward building more inclusive and equitable societies.

  2. Objective: The primary objective of this data analysis is to explore and understand the factors contributing to the gender pay gap in Europe between 2010 and 2021. By analyzing the unadjusted gender pay gap alongside other socioeconomic indicators like GDP per capita and urbanization rates, the report aims to identify patterns and correlations that may explain the persistent disparities in pay between male and female employees. This analysis will serve as a basis for informing policies and initiatives aimed at reducing the gender pay gap and fostering greater gender equality in the labor market across Europe.

  3. Data dictionary:

Variable Class Description
Country character Countries in Europe
Year double Years from 2010 to 2021
GDP double GDP per capita (euros)
Urban_population double Urban population (%)
Industry double Pay gap in Industry, construction and services (%)
Business double Pay gap in Business economy (%)
Mining double Pay gap in Mining and quarrying (%)
Manufacturing double Pay gap in Manufacturing (%)
Electricity_supply double Pay gap in Electricity, gas, steam and air conditioning supply (%)
Water_supply double Pay gap in Water supply; sewerage, waste management and remediation activities (%)
Construction double Pay gap in Construction (%)
Retail trade double Pay gap in Wholesale and retail trade; repair of motor vehicles and motorcycles (%)
Transportation double Pay gap in Transportation and storage (%)
Accommodation double Pay gap in Accommodation and food service activities (%)
Information double Pay gap in Information and communication (%)
Financial double Pay gap in Financial and insurance activities (%)
Real estate double Pay gap in Real estate activities (%)
Professional_scientific double Pay gap in Professional, scientific and technical activities (%)
Administrative double Pay gap in Administrative and support service activities (%)
Public_administration double Pay gap in Public administration and defence; compulsory social security (%)
Education double Pay gap in Education (%)
Human_health double Pay gap in Human health and social work activities (%)
Arts double Pay gap in Arts, entertainment and recreation (%)
Other double Pay gap in Other service activities (%)
  4. Load the data and take a look at the first 5 rows:
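library(tidyverse)  # read_csv(), dplyr, tidyr, ggplot2
library(corrplot)   # corrplot(), used for the correlation plot
pay_gap_Europe <- read_csv("~/Documents/SIMMONS/Micro-Internship/2. Jul - Sep 2024/pay_gap_Europe.csv")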
## Rows: 324 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Country
## dbl (23): Year, GDP, Urban_population, Industry, Business, Mining, Manufactu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(pay_gap_Europe,5)
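
As a minimal sketch of the definition given in the Preface, the unadjusted GPG is the difference between male and female average gross hourly earnings, expressed as a percentage of male earnings. The function name and the earnings figures below are hypothetical and are not taken from the dataset.

```r
# Unadjusted gender pay gap (%), following the definition in the Preface.
# `unadjusted_gpg` is an illustrative helper, not part of the dataset or any package.
unadjusted_gpg <- function(male_hourly, female_hourly) {
  (male_hourly - female_hourly) / male_hourly * 100
}

# Hypothetical average gross hourly earnings in euros
gap_positive <- unadjusted_gpg(male_hourly = 20, female_hourly = 17)  # men earn more
gap_negative <- unadjusted_gpg(male_hourly = 18, female_hourly = 19)  # women earn more

print(c(gap_positive = gap_positive, gap_negative = gap_negative))
```

A positive value means men earn more per hour on average, and a negative value means women do, which is how negative entries in the sector columns can be read.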

2. Preliminary Exploratory Data Analysis

glimpse(pay_gap_Europe)
## Rows: 324
## Columns: 24
## $ Country                 <chr> "Austria", "Austria", "Austria", "Austria", "A…
## $ Year                    <dbl> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017…
## $ GDP                     <dbl> 35390, 36300, 36390, 36180, 36130, 36140, 3639…
## $ Urban_population        <dbl> 57.40, 57.12, 57.15, 57.34, 57.53, 57.72, 57.9…
## $ Industry                <dbl> 24.0, 23.5, 22.9, 22.3, 22.2, 21.8, 20.8, 20.7…
## $ Business                <dbl> 25.2, 24.7, 24.3, 23.8, 23.8, 23.4, 22.3, 22.3…
## $ Mining                  <dbl> 18.3, NA, NA, NA, 15.9, 13.7, 14.4, 10.9, 7.9,…
## $ Manufacturing           <dbl> 24.4, NA, NA, NA, 23.0, 22.7, 21.9, 21.7, 21.4…
## $ Electricity_supply      <dbl> 23.6, NA, NA, NA, 19.8, 17.6, 13.2, 13.0, 14.4…
## $ Water_supply            <dbl> 12.2, NA, NA, NA, 10.0, 9.3, 8.2, 8.4, 8.1, 8.…
## $ Construction            <dbl> 9.9, NA, NA, NA, 8.2, 8.2, 8.3, 8.3, 8.3, 8.2,…
## $ `Retail trade`          <dbl> 27.5, NA, NA, NA, 23.4, 23.3, 23.3, 23.2, 23.2…
## $ Transportation          <dbl> 7.3, NA, NA, NA, 10.6, 11.6, 14.5, 12.4, 11.7,…
## $ Accommodation           <dbl> 9.9, NA, NA, NA, 7.4, 6.4, 5.9, 5.7, 5.4, 4.9,…
## $ Information             <dbl> 21.2, NA, NA, NA, 22.9, 22.4, 20.9, 20.6, 20.7…
## $ Financial               <dbl> 30.3, NA, NA, NA, 30.4, 30.3, 27.1, 28.4, 28.2…
## $ `Real estate`           <dbl> 27.0, NA, NA, NA, 27.8, 28.0, 28.7, 29.0, 29.2…
## $ Professional_scientific <dbl> 34.0, NA, NA, NA, 31.5, 31.3, 30.4, 29.4, 28.3…
## $ Administrative          <dbl> 22.5, NA, NA, NA, 19.5, 20.0, 17.8, 17.4, 17.1…
## $ Public_administration   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Education               <dbl> 27.8, NA, NA, NA, 24.3, 24.2, 24.3, 23.7, 23.6…
## $ Human_health            <dbl> 12.0, NA, NA, NA, 12.8, 12.9, 14.5, 15.0, 15.3…
## $ Arts                    <dbl> 34.0, NA, NA, NA, 26.6, 26.2, 20.8, 19.1, 18.3…
## $ Other                   <dbl> 32.0, NA, NA, NA, 28.8, 28.3, 27.8, 26.9, 26.4…

A glimpse at the dataset shows that Country is the only character column; every other column is numeric (double). This makes sense, since the dataset compares gender pay gap percentages across economic sectors. The output also shows a substantial number of NA values, so the next step is to count them and clean the data.

# check for NA value in the dataset
colSums(is.na(pay_gap_Europe))
##                 Country                    Year                     GDP 
##                       0                       0                       0 
##        Urban_population                Industry                Business 
##                       0                       3                       4 
##                  Mining           Manufacturing      Electricity_supply 
##                      21                       6                      23 
##            Water_supply            Construction            Retail trade 
##                      12                      12                       7 
##          Transportation           Accommodation             Information 
##                       6                       9                       9 
##               Financial             Real estate Professional_scientific 
##                       6                      13                       6 
##          Administrative   Public_administration               Education 
##                       9                      66                       9 
##            Human_health                    Arts                   Other 
##                       6                      13                      11
# Replace NA value by mean of the column's value in the dataset
# Function to replace NA with column mean
replace_na_with_mean <- function(x) {
  if(is.numeric(x)) {
    return(ifelse(is.na(x), mean(x, na.rm = TRUE), x))
  } else {
    return(x)
  }
}

# Apply the function to all columns in the dataset
pay_gap_Europe <- pay_gap_Europe %>%
  mutate(across(everything(), replace_na_with_mean))

# Check if there are any remaining NA values
colSums(is.na(pay_gap_Europe))
##                 Country                    Year                     GDP 
##                       0                       0                       0 
##        Urban_population                Industry                Business 
##                       0                       0                       0 
##                  Mining           Manufacturing      Electricity_supply 
##                       0                       0                       0 
##            Water_supply            Construction            Retail trade 
##                       0                       0                       0 
##          Transportation           Accommodation             Information 
##                       0                       0                       0 
##               Financial             Real estate Professional_scientific 
##                       0                       0                       0 
##          Administrative   Public_administration               Education 
##                       0                       0                       0 
##            Human_health                    Arts                   Other 
##                       0                       0                       0
# View the first few rows of the cleaned dataset
head(pay_gap_Europe)

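Most of the numeric columns contain NA values, ranging from 3 (Industry) to 66 (Public_administration) per column; only Year, GDP, and Urban_population are complete. The code above applies replace_na_with_mean to every column: for each numeric column, it replaces every NA with the mean of that column's observed values, and it leaves non-numeric columns unchanged. Note that this is the overall column mean across all countries and years, not a country-specific value.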
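
The imputation above uses one overall mean per column. A possible alternative, sketched here on hypothetical toy data, is to fill each missing value with the mean of the same country, which keeps imputed values closer to that country's own level. The toy tibble and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data: two countries, one sector column with missing gaps
toy <- tibble(
  Country = c("A", "A", "A", "B", "B", "B"),
  Year    = c(2010, 2011, 2012, 2010, 2011, 2012),
  Mining  = c(10, NA, 14, 30, 34, NA)
)

# Global imputation, as in the report: every NA gets the overall column mean
global_imputed <- toy %>%
  mutate(Mining = ifelse(is.na(Mining), mean(Mining, na.rm = TRUE), Mining))

# Country-level imputation: every NA gets the mean of its own country
by_country_imputed <- toy %>%
  group_by(Country) %>%
  mutate(Mining = ifelse(is.na(Mining), mean(Mining, na.rm = TRUE), Mining)) %>%
  ungroup()

print(global_imputed)
print(by_country_imputed)
```

In this toy case the global mean is 22 for both missing cells, while the country-level means are 12 for country A and 32 for country B.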

# Summary Statistics
summary(pay_gap_Europe)
##    Country               Year           GDP        Urban_population
##  Length:324         Min.   :2010   Min.   : 5080   Min.   :52.66   
##  Class :character   1st Qu.:2013   1st Qu.:13045   1st Qu.:65.62   
##  Mode  :character   Median :2016   Median :22330   Median :73.28   
##                     Mean   :2016   Mean   :28012   Mean   :73.46   
##                     3rd Qu.:2018   3rd Qu.:36382   3rd Qu.:84.89   
##                     Max.   :2021   Max.   :84750   Max.   :98.12   
##     Industry         Business         Mining        Manufacturing  
##  Min.   :-0.200   Min.   : 5.40   Min.   :-26.600   Min.   : 1.70  
##  1st Qu.: 9.675   1st Qu.:13.80   1st Qu.:  3.875   1st Qu.:14.28  
##  Median :14.500   Median :16.10   Median :  9.600   Median :20.05  
##  Mean   :13.862   Mean   :16.61   Mean   :  9.529   Mean   :19.26  
##  3rd Qu.:17.625   3rd Qu.:19.90   3rd Qu.: 16.500   3rd Qu.:24.00  
##  Max.   :29.900   Max.   :30.20   Max.   : 43.700   Max.   :33.60  
##  Electricity_supply  Water_supply      Construction       Retail trade  
##  Min.   :-2.000     Min.   :-33.200   Min.   :-28.3000   Min.   : 7.00  
##  1st Qu.: 7.375     1st Qu.: -2.600   1st Qu.: -8.4000   1st Qu.:16.57  
##  Median :11.512     Median :  2.500   Median :  0.5500   Median :20.66  
##  Mean   :11.512     Mean   :  2.211   Mean   : -0.6875   Mean   :20.66  
##  3rd Qu.:16.200     3rd Qu.:  8.000   3rd Qu.:  8.0000   3rd Qu.:24.60  
##  Max.   :49.200     Max.   : 20.900   Max.   : 23.5000   Max.   :38.50  
##  Transportation    Accommodation    Information      Financial    
##  Min.   :-25.100   Min.   : 0.40   Min.   : 7.30   Min.   : 4.90  
##  1st Qu.:  0.675   1st Qu.: 7.60   1st Qu.:14.40   1st Qu.:23.18  
##  Median :  5.400   Median :10.00   Median :18.40   Median :28.10  
##  Mean   :  4.345   Mean   :10.89   Mean   :19.22   Mean   :27.94  
##  3rd Qu.: 10.400   3rd Qu.:13.43   3rd Qu.:24.43   3rd Qu.:32.02  
##  Max.   : 36.800   Max.   :29.70   Max.   :33.40   Max.   :45.10  
##   Real estate     Professional_scientific Administrative   
##  Min.   :-47.90   Min.   :-1.80           Min.   :-33.200  
##  1st Qu.:  9.50   1st Qu.:15.00           1st Qu.:  6.500  
##  Median : 14.20   Median :19.30           Median :  9.600  
##  Mean   : 13.26   Mean   :19.13           Mean   :  8.116  
##  3rd Qu.: 18.82   3rd Qu.:23.90           3rd Qu.: 14.100  
##  Max.   : 44.80   Max.   :36.20           Max.   : 27.700  
##  Public_administration   Education      Human_health        Arts       
##  Min.   :-5.500        Min.   :-3.00   Min.   :-6.80   Min.   :-16.80  
##  1st Qu.: 6.275        1st Qu.: 7.30   1st Qu.:13.30   1st Qu.: 10.20  
##  Median : 9.401        Median :11.27   Median :18.95   Median : 15.25  
##  Mean   : 9.401        Mean   :11.27   Mean   :18.50   Mean   : 17.50  
##  3rd Qu.:13.400        3rd Qu.:14.65   3rd Qu.:24.60   3rd Qu.: 20.30  
##  Max.   :23.300        Max.   :36.00   Max.   :37.60   Max.   : 68.60  
##      Other       
##  Min.   :-11.90  
##  1st Qu.: 11.18  
##  Median : 17.11  
##  Mean   : 17.11  
##  3rd Qu.: 23.12  
##  Max.   : 48.10

Based on the summary statistics, the dataset shows substantial gender pay gaps across most sectors in Europe from 2010 to 2021. The Financial sector has the highest median gap (28.10%), while Construction has the lowest mean (-0.6875%); a negative value means that women's average hourly earnings exceed men's. Variability is considerable: Real estate ranges from -47.90% to 44.80% and Water supply from -33.20% to 20.90%, suggesting complex dynamics in these sectors. Among public sector activities, Public administration (mean 9.40%) and Education (mean 11.27%) show comparatively small gaps, but Human health (mean 18.50%) does not. The Arts sector has the highest maximum gap (68.60%), indicating extreme disparities in some contexts. GDP per capita ranges from 5,080 to 84,750 euros, reflecting the economic diversity of the countries studied, which may influence pay gap trends. Overall, gender pay gaps vary substantially by sector, country, and time, which calls for a nuanced analysis of each dimension.
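
In the summary above, several columns (for example Electricity_supply, Public_administration, and Education) report a median that equals the mean. One plausible explanation is the mean imputation in the previous step: filling many cells with the same value can pull the median onto the mean, and it also shrinks the spread. The vector below is hypothetical and only illustrates the effect.

```r
# Hypothetical gaps with three missing values
gap <- c(1, 2, NA, NA, NA, 9)

# Mean imputation, same idea as replace_na_with_mean in the report
imputed <- ifelse(is.na(gap), mean(gap, na.rm = TRUE), gap)

# Compare location and spread before and after imputation
comparison <- data.frame(
  version = c("observed only", "mean-imputed"),
  mean    = c(mean(gap, na.rm = TRUE), mean(imputed)),
  median  = c(median(gap, na.rm = TRUE), median(imputed)),
  sd      = c(sd(gap, na.rm = TRUE), sd(imputed))
)
print(comparison)
```

The mean is unchanged by construction, the median moves onto the mean, and the standard deviation drops, so spread-based comparisons between heavily and lightly imputed columns should be read with some caution.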

# Select only numeric columns
numeric_columns <- pay_gap_Europe %>% 
  select(where(is.numeric))

# Calculate the correlation matrix
cor_matrix <- cor(numeric_columns, use = "complete.obs")

# Create a correlation plot
corrplot(cor_matrix, method = "circle", 
         type = "upper", 
         tl.col = "black", 
         tl.srt = 45, 
         tl.cex = 0.45,  # Adjust text size
         mar = c(0,0,2,0))

#print(cor_matrix)

The correlation plot suggests several patterns. Countries with higher GDP per capita and more urban populations tend to have higher pay gaps. The Industry and Business columns are closely correlated with many other sectors, so countries with high gaps in these broad categories tend to have high gaps elsewhere as well. In contrast, Public Administration, Education, and Health are more weakly correlated with the rest, which may indicate that they are driven by different factors. Some sectors, like Mining and Manufacturing, are negatively correlated with GDP, meaning that their pay gaps tend to be smaller in wealthier countries. These are associations rather than causal effects, but they suggest that the gender pay gap is shaped by different factors across sectors.
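
To read such a matrix more systematically, the correlations of each sector with GDP can be pulled out and sorted. This is a minimal sketch on a hypothetical toy data frame; the column names are reused for readability, but the values are invented.

```r
# Hypothetical toy data: GDP per capita (thousand euros) and three sector gaps (%)
toy <- data.frame(
  GDP       = c(10, 20, 30, 40, 50),
  Financial = c(12, 15, 19, 22, 27),  # rises with GDP
  Mining    = c(18, 15, 13, 9, 6),    # falls with GDP
  Arts      = c(10, 14, 9, 15, 11)    # no clear pattern
)

cor_matrix_toy <- cor(toy)

# Keep the GDP row, drop GDP's correlation with itself, sort from high to low
gdp_cor <- sort(
  cor_matrix_toy["GDP", colnames(cor_matrix_toy) != "GDP"],
  decreasing = TRUE
)
print(round(gdp_cor, 2))
```

A sorted vector like this makes it easier to check statements such as "sector X is negatively correlated with GDP" than a dense circle plot with small labels.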

3. Data Analysis

# Calculate average pay gaps by sector
sector_gaps <- pay_gap_Europe %>%
  select(-Country, -Year, -GDP, -Urban_population) %>%
  summarise(across(everything(), \(x) mean(x, na.rm = TRUE))) %>%
  pivot_longer(cols = everything(), names_to = "Sector", values_to = "AvgGap") %>%
  arrange(desc(AvgGap))
# Visualize top and bottom 5 sectors
ggplot(sector_gaps %>% slice(c(1:5, (n()-4):n())), aes(x = reorder(Sector, AvgGap), y = AvgGap)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 5 and Bottom 5 Sectors by Average Pay Gap", x = "Sector", y = "Average Pay Gap (%)")

The graph shows the five sectors with the largest and the five with the smallest average pay gaps. The Financial sector has the largest average gap (about 28%), followed by Retail Trade, Manufacturing, Information, and Professional and Scientific Services (each between 19% and 21%), indicating significant gender disparities in these industries. At the other end, Construction, Water Supply, and Transportation have much smaller average gaps, with Construction averaging close to zero (-0.69%). An average near zero does not mean there is no disparity: Construction values range from -28.3% to 23.5%, so positive and negative gaps partly cancel out. The results suggest that the financial and retail industries may need more targeted interventions, while the lower-gap sectors are worth studying to see whether their pay practices can be transferred.
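
An average on its own can hide how widely a sector's gaps vary. A minimal sketch, on hypothetical toy values, of reporting the spread next to the mean:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy gaps (%): one sector centred on zero with a wide spread,
# one sector with a high but stable gap
toy <- tibble(
  Construction = c(-20, -5, 0, 5, 20),
  Financial    = c(26, 27, 28, 29, 30)
)

sector_spread <- toy %>%
  pivot_longer(everything(), names_to = "Sector", values_to = "Gap") %>%
  group_by(Sector) %>%
  summarise(
    AvgGap = mean(Gap),
    SdGap  = sd(Gap),
    MinGap = min(Gap),
    MaxGap = max(Gap),
    .groups = "drop"
  )
print(sector_spread)
```

Here the toy Construction column averages exactly zero even though individual values reach 20% in either direction, which is the kind of pattern a bar chart of means cannot show.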

# Categorize sectors 
public_sectors <- c("Public_administration", "Education", "Human_health")
private_sectors <- setdiff(names(pay_gap_Europe), c("Country", "Year", "GDP", "Urban_population", public_sectors))

# Compare public vs private
pay_gap_Europe %>%
  pivot_longer(cols = c(all_of(public_sectors), all_of(private_sectors)), names_to = "Sector", values_to = "Gap") %>%
  mutate(SectorType = ifelse(Sector %in% public_sectors, "Public", "Private")) %>%
  group_by(Year, SectorType) %>%
  summarise(AvgGap = mean(Gap, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = Year, y = AvgGap, color = SectorType)) +
  geom_line() +
  labs(title = "Public vs Private Sector Pay Gap Over Time", y = "Average Pay Gap (%)")

The graph compares the average gender pay gap in public and private sectors over time. Both have declined gradually since 2010: the private sector average falls from around 15% to just above 12% by 2020, while the public sector average falls from about 14% to about 13%. The private sector therefore starts with the higher gap but narrows faster, and the two averages stay within roughly a percentage point of each other. Because the difference is small, and because 66 of the 324 Public_administration values were imputed with the column mean, the comparison gives only tentative support for the view that the public sector has more equitable pay practices.
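
One caveat about the grouping above: private_sectors is built with setdiff(), so it also contains Industry and Business. Judging by their descriptions in the data dictionary, these look like cross-sector aggregates rather than individual sectors, which is an assumption here. If so, a variant of the grouping that leaves them out could look like the following sketch, which works on the column names only.

```r
# Column names as listed in the data dictionary
all_columns <- c(
  "Country", "Year", "GDP", "Urban_population", "Industry", "Business",
  "Mining", "Manufacturing", "Electricity_supply", "Water_supply",
  "Construction", "Retail trade", "Transportation", "Accommodation",
  "Information", "Financial", "Real estate", "Professional_scientific",
  "Administrative", "Public_administration", "Education", "Human_health",
  "Arts", "Other"
)

id_columns        <- c("Country", "Year", "GDP", "Urban_population")
aggregate_columns <- c("Industry", "Business")  # assumed to be aggregates
public_sectors    <- c("Public_administration", "Education", "Human_health")

# Private sectors without the two aggregate columns
private_sectors_only <- setdiff(
  all_columns,
  c(id_columns, aggregate_columns, public_sectors)
)
print(private_sectors_only)
```

Using private_sectors_only in place of private_sectors would keep each individual sector from being counted again through an aggregate, at the cost of changing the averages reported in this section.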

# Correlation analysis
cor_matrix <- cor(pay_gap_Europe[, c("GDP", "Urban_population", private_sectors, public_sectors)], use = "complete.obs")

# Scatter plot for GDP vs overall pay gap
pay_gap_Europe %>%
  mutate(OverallGap = rowMeans(select(., all_of(c(private_sectors, public_sectors))), na.rm = TRUE)) %>%
  ggplot(aes(x = GDP, y = OverallGap)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "GDP vs Overall Pay Gap", x = "GDP", y = "Overall Pay Gap (%)")
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot of GDP per capita against the overall pay gap (the row-wise mean of all sector columns) shows a slight positive correlation. As GDP per capita increases, the pay gap tends to widen, although the points are widely scattered around the fitted line. This suggests that economic development alone doesn’t necessarily lead to greater pay equality, and that more targeted interventions may be necessary to address gender pay disparities in wealthier nations.
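
The strength of such a relationship can be quantified with a linear model and a correlation coefficient instead of being read off the plot. The sketch below uses a hypothetical five-row data frame, not the report's data.

```r
# Hypothetical toy data: GDP per capita (euros) and overall pay gap (%)
toy <- data.frame(
  GDP        = c(10000, 20000, 30000, 40000, 50000),
  OverallGap = c(11, 13, 12, 15, 14)
)

# Same model that geom_smooth(method = "lm") fits: OverallGap ~ GDP
fit <- lm(OverallGap ~ GDP, data = toy)

# Slope expressed per 10,000 euros of GDP per capita, plus Pearson's r
slope_per_10k <- coef(fit)[["GDP"]] * 10000
pearson_r     <- cor(toy$GDP, toy$OverallGap)

print(c(slope_per_10k = slope_per_10k, pearson_r = pearson_r))
```

Reporting the slope and r alongside the scatter plot would make "slight positive correlation" concrete; on the real data, summary(fit) would also show how much of the variation the model leaves unexplained.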

# Overall trend
pay_gap_Europe %>%
  pivot_longer(cols = c(all_of(private_sectors), all_of(public_sectors)), names_to = "Sector", values_to = "Gap") %>%
  group_by(Year) %>%
  summarise(AvgGap = mean(Gap, na.rm = TRUE)) %>%
  ggplot(aes(x = Year, y = AvgGap)) +
  geom_line() +
  labs(title = "Overall Pay Gap Trend Over Time", y = "Average Pay Gap (%)")

The graph depicting the Overall Pay Gap Trend Over Time reveals a gradual decrease in the average gender pay gap across Europe from 2010 to 2020. The gap has reduced from about 15% to approximately 12% over this period. This trend indicates slow but steady progress in reducing gender pay disparities, possibly reflecting the impact of policy measures and societal changes aimed at promoting gender equality in the workplace.

# Boxplot to visualize outliers
pay_gap_Europe %>%
  pivot_longer(cols = c(all_of(private_sectors), all_of(public_sectors)), names_to = "Sector", values_to = "Gap") %>%
  ggplot(aes(x = Sector, y = Gap)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Pay Gap Distribution and Outliers by Sector", y = "Pay Gap (%)")

The boxplot displaying Pay Gap Distribution and Outliers by Sector shows significant variation in pay gaps across different sectors. Financial and Professional_scientific sectors tend to have higher median pay gaps, while sectors like Construction and Transportation show lower gaps. Several sectors, particularly Water Supply and Construction, exhibit notable outliers on both ends of the spectrum, suggesting complex dynamics within these industries that may require further investigation to understand the extreme cases of both high and low (or negative) pay gaps.
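
The points drawn as outliers in a boxplot follow a simple rule: by default, geom_boxplot() marks values lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule on a hypothetical vector of gaps:

```r
# Hypothetical sector gaps (%) with one extreme value at each end
gap <- c(-28, 2, 4, 5, 6, 7, 8, 9, 10, 40)

# Quartiles and interquartile range (default quantile type, as in base R)
q1  <- quantile(gap, 0.25, names = FALSE)
q3  <- quantile(gap, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- gap[gap < lower_fence | gap > upper_fence]
print(outliers)
```

Applied per sector to the real data, the same rule would list which country and year combinations produce the extreme values, which is a natural starting point for the further investigation suggested above.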

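# Sectors with negative correlation to GDP (sector columns only,
# so that Urban_population can never be picked up as a "sector")
sector_columns <- c(private_sectors, public_sectors)
negative_cor_sectors <- names(which(cor_matrix["GDP", sector_columns] < 0))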

# Trend analysis for these sectors
pay_gap_Europe %>%
  pivot_longer(cols = all_of(negative_cor_sectors), names_to = "Sector", values_to = "Gap") %>%
  ggplot(aes(x = Year, y = Gap, color = Sector, group = Country)) +  # one line per country
  geom_line() +
  facet_wrap(~Sector, scales = "free_y") +
  labs(title = "Trends in Sectors with Negative GDP Correlation", y = "Pay Gap (%)")

The faceted line graph plots the pay gap over time for each sector whose gap is negatively correlated with GDP, such as Manufacturing, Retail Trade, and Water Supply. In these sectors, higher GDP is associated with smaller pay gaps, which contrasts with the slight positive relationship found for the overall gap and suggests that they may become more equitable as economies develop. However, the trends are not uniform across all negatively correlated sectors, indicating that sector-specific factors play a significant role in pay gap dynamics.

4. Conclusion