Gender Statistics - World Bank 2020

Part I – Introduction

According to Wikipedia: “Gender equality is the state of equal ease of access to resources and opportunities regardless of gender, including economic participation and decision-making; and the state of valuing different behaviors, aspirations and needs equally, regardless of gender.”

Although the fight for gender equality dates back centuries, it has recently gained a “cohesive” momentum, empowering the voices of those, primarily females, who have long been put at a disadvantage in many aspects of life, from housing and economic opportunities to essential needs and rights such as voting and education that some of us males take for granted.

The Women’s March of 2017 signaled a turning point in this fight, when millions around the world united in protest against the denial of equal rights. The March was primarily fueled by continued public remarks undermining women at all levels of society by the then candidate for the presidency of the United States, Donald J. Trump, and its start was marked by his win in the 2016 election.

Although many voices have been heard, and women around the world appear to be taking back their right to excel, the response of governments across the world to this movement shows there is still much work to do. This document examines the most recent Gender Statistics data from the World Bank and reviews some of the factors that continue to limit the opportunities for females to contribute to the progress of our society.

Definition of Terms:

Agency: an individual’s capacity to make decisions about their own life and act on them to achieve a desired outcome, free of violence, retribution, or fear.

Education: a group of qualitative and quantitative variables that represent the degree of schooling of males and females and their respective opportunities in a given country.

Per Capita Gross National Income - GNI: the total amount of money earned by a nation’s people and businesses, including investment income, regardless of where it was earned, as well as money received from abroad such as foreign investment and economic development aid, divided by the country’s population.

Human Capital Index - HCI: an indicator of a country’s ability to mobilize the economic and professional potential of its citizens. It measures how much capital a given country loses through the lack of education and health. The HCI ranges between 0 and 1, where 1 indicates that the maximum potential for a given country has been reached.

Part II – The Dataset

The data for this study was obtained from the World Bank. It contains quantitative and qualitative variables pertaining to Agency, Economic and Social Context, Economic Opportunities, Education, Health, and Public Life and Decision Making for females and males across 217 nations for the years 1960-2020 (last update: March 2021).

The data is published annually, and it was first released in July 2021. It is collected through API harvesting on a quarterly basis by the World Bank’s Gender Group and the Development Economics Data Group. Multiple sources of data are combined for this particular report, including World Bank datasets and international gender data portals.

Although the dataset was compiled by a reputable institution, it may carry a certain degree of bias because the World Bank relies on nations to report their own figures. As a result, some nations may report statistics in a way that favors them, hides national conflict, or avoids portraying a negative image of the nation.

Data Upload and Wrangling

First off, we load the packages we’ll use with our data.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
## 
##     rename
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
## 
## Attaching package: 'rio'
## The following object is masked from 'package:plotly':
## 
##     export
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## Attaching package: 'raster'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:tidyr':
## 
##     extract
## rgdal: version: 1.5-23, (SVN revision 1121)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.1.4, released 2020/10/20
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
## Path to PROJ shared files: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rgdal/proj
## Linking to sp version:1.4-5
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
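The chunk that loads the packages is not shown in the rendered output. Judging only from the startup messages above, it attached roughly the packages listed below; the exact list and order are an assumption. A minimal sketch that attaches whichever of them are installed and reports the rest:

```r
# Packages inferred from the startup messages above (assumption: the original
# chunk may also have loaded packages that print nothing on attach).
pkgs <- c("tidyverse", "reshape", "plotly", "sf", "rio",
          "scales", "patchwork", "raster", "rgdal")

# Check which packages are installed; this loads namespaces but does not attach.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Attach the installed ones in order, silencing the startup banners.
suppressPackageStartupMessages(
  invisible(lapply(pkgs[installed], library, character.only = TRUE))
)

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```

Since rgdal has been retired from CRAN, it may be reported as not installed on a current setup.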
Then we load our CSV data file, which we’ll be using in our analysis.
Gender_Data_Raw<- read.csv("/Users/cruz-diazgroup/Desktop/1. William/1. School/DATA SCIENCE/2. DATA 110 Vis. & Com./3. Projects/Will_Lopez_Project2_Subject_Data_approval/Gender_StatsData.csv", check.names = FALSE, header = TRUE, sep = ",") 
We then create a copy of the loaded data, leaving "_Data_Raw" as a backup df, and inspect it, review its structure, and take a look at its contents.
Gender_Data <- Gender_Data_Raw
#View(Gender_Data)
#str(Gender_Data)
#head(Gender_Data)
#tail(Gender_Data)
It appears that there is an unnamed column in the CSV file that contains no information and needs to be taken out. We get the column number here.
which(colnames(Gender_Data) == '')
## [1] 66
and then remove it.
Gender_Data<- Gender_Data[-c(66)] 
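Removing the column by position works for this file, but it would drop the wrong column if the layout ever changed. A sketch of the same step keyed on the empty column name, shown on a small made-up data frame (`toy` and its values are illustrative, not taken from the World Bank file):

```r
# Toy stand-in for the raw file: the last column has an empty name and no data.
toy <- data.frame(a = 1:2, b = c("x", "y"), c = NA, check.names = FALSE)
colnames(toy)[3] <- ""

# Keep every column whose name is non-empty.
toy_clean <- toy[, colnames(toy) != "", drop = FALSE]

colnames(toy_clean)
```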
Because our loaded data is in a less than ideal working format, we will need to clean it up a bit. First we’ll remove the column “Indicator Name” and keep “Indicator Code”, as these columns carry essentially the same information. Then we’ll change the data to a “LONG” format.
which(colnames(Gender_Data) == "Indicator Name")
## [1] 3
Gender_Data<- Gender_Data[-c(3)] 
I then validate that the column has been removed.
#head(Gender_Data)
Because my dataset contains year variables starting in 1960, and the focus of my analysis will be the last 30 years, I will trim the years 1960-1989 from the df.
which(colnames(Gender_Data) == "1960")
## [1] 4
which(colnames(Gender_Data) == "1989")
## [1] 33
Gender_Data<- Gender_Data[-c(4:33)] 
#head(Gender_Data)
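The same trimming can be written against the column names rather than positions 4 to 33, which keeps working if columns are reordered. A sketch on a made-up data frame (the values are placeholders):

```r
# Toy wide table with a few year columns (values are placeholders).
toy <- data.frame(`Country Name` = c("A", "B"),
                  `1988` = c(1, 2), `1989` = c(3, 4), `1990` = c(5, 6),
                  check.names = FALSE)

# Names of the years to drop, as character strings.
drop_years <- as.character(1960:1989)

# Keep every column that is not one of those years.
toy_trimmed <- toy[, !(colnames(toy) %in% drop_years), drop = FALSE]

colnames(toy_trimmed)
```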
Our data contains unwanted “Country Code” entries, which group countries into regions and other aggregates rather than identifying a single country. Since the study will group countries according to their income levels, we will take these country groupings out.
Gender_Data <-Gender_Data %>%
 filter(!`Country Code` %in% c("ARB","CSS","CEB","EAR","EAS","EAP","TEA","EMU","ECS","ECA","TEC","EUU","FCS","HPC","HIC","IBD","IBT","IDB","IDX","IDA","LTE","LCN","LAC","TLA","LDC","LMY","LIC","LMC","MEA","MNA","TMN","MIC","NAC","OED","OSS","PSS","PST","PRE","SST","SAS","TSA","SSF","SSA","TSS","UMC","WLD"))
#head(Gender_Data)
During the initial data scoping I also noticed that, although the report has the most recent information for 2020, GDP and GNI per capita values are missing for that year. Therefore I will use 2019 values in their place, filling missing 2020 values with the 2019 values. Note that the fill below is applied to every indicator row, not only to GDP and GNI.
Gender_Data<-Gender_Data %>%
mutate(`2020` = coalesce(`2020`,`2019`))
# I now check that my data was updated with 2019 GNI per capita information for the year 2020
#Gender_Data%>%
  #filter(`Indicator Code`=="NY.GNP.PCAP.CD")
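`coalesce()` takes the first non-missing value row by row, so a reported 2020 value is left unchanged and only a missing one falls back to 2019. A small sketch of that behaviour, assuming the dplyr package is available and using made-up indicator codes and numbers:

```r
library(dplyr)

# Made-up rows: one indicator reports 2020, the other does not.
toy <- tibble(
  `Indicator Code` = c("IND.A", "IND.B"),
  `2019` = c(10, 20),
  `2020` = c(11, NA)
)

# Use 2020 when present, otherwise fall back to 2019.
toy_filled <- toy %>%
  mutate(`2020` = coalesce(`2020`, `2019`))

toy_filled$`2020`
```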
My data is now in wide format. Therefore, I will now transform it into “Long” format and inspect the resulting data frame.
Gender_Data_Long <- Gender_Data %>%
pivot_longer('1990':'2020', names_to = "year", values_to = "value")
#head(Gender_Data_Long)
#tail(Gender_Data_Long)
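For readers unfamiliar with the reshape, here is a sketch of what `pivot_longer()` does to one wide row, assuming the tidyr package and made-up values. The new `year` column is character, not numeric; comparisons such as `year == 2019` still work because R coerces the number to a string.

```r
library(tidyr)

# One wide row per country/indicator, one column per year (made-up values).
toy_wide <- data.frame(`Country Name` = "A", `Indicator Code` = "IND.A",
                       `1990` = 1.5, `1991` = 2.5, check.names = FALSE)

# Stack the year columns into (year, value) pairs.
toy_long <- pivot_longer(toy_wide, cols = c(`1990`, `1991`),
                         names_to = "year", values_to = "value")

toy_long
```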
The World Bank tiers countries into 4 income groups based on each country’s GNI per capita. For 2019 these levels are as follows:

Low Income: countries with annual income of $1,025 or less

Lower-Middle Income: countries with annual income between $1,026 - $3,995

Upper-Middle Income: countries with annual income between $3,996 - $12,375

High Income: countries with annual income of $12,376 or more

Since we will be using the latest GNI data available (2019), we will create a column for these 4 income categories and note the income group for each country in our dataset. We’ll first create the categories based on income levels and then we’ll do a left join to add the column categorizing all countries into their respective income levels.
Country_GNI_Level <-Gender_Data_Long %>%
  filter(year == 2019,`Indicator Code`== "NY.GNP.PCAP.CD") %>%
  # Contiguous bands: values between the whole-dollar thresholds no longer fall through to 'High'; NA stays NA
  mutate(`Income Level` = case_when(value < 1026 ~ 'Low', value <= 3995 ~ 'Lower-Middle', value <= 12375 ~ 'Upper-Middle', value > 12375 ~ 'High'))
#head(Country_GNI_Level)
which(colnames(Country_GNI_Level) == "Country Name")
## [1] 1
which(colnames(Country_GNI_Level) == "Income Level")
## [1] 6
Here we keep only columns pertinent to countries and income classifications and inspect the resulting table
Country_GNI_Level <- Country_GNI_Level[c(1,6)] 
#head(Country_GNI_Level)
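An alternative to nested conditionals is to bin the values with base R's `cut()`, which guarantees that the four bands cover every value with no gaps. This sketch treats the bands as whole-dollar ranges (an assumption) and uses a few made-up GNI values; it is an illustration, not the code used to produce the results in this report.

```r
# Made-up GNI per capita values, including ones at the band edges.
gni <- c(500, 1025, 1026, 3995, 3996, 12375, 12376, NA)

# Upper bound of each band; intervals are closed on the right,
# so 1025 is "Low" and 1026 is "Lower-Middle".
income_level <- cut(
  gni,
  breaks = c(-Inf, 1025, 3995, 12375, Inf),
  labels = c("Low", "Lower-Middle", "Upper-Middle", "High"),
  right = TRUE
)

as.character(income_level)
```

Missing values stay missing, which makes countries without a 2019 GNI figure easy to spot after the join.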
We append income levels for each country by joining the two tables
Gender_Data_Long <-left_join(Gender_Data_Long, Country_GNI_Level, by = "Country Name")
#head(Gender_Data_Long)
I now create a df that I’ll later use for data summaries and EDA.
Gender_Data_Wide <-Gender_Data_Long %>%
 pivot_wider(names_from = `Indicator Code`, values_from = value)
#head(Gender_Data_Wide)
I noticed that this dataset does not have an “Indicator Class” (i.e. “Agency”, “Education”, “Health”, etc.) as noted in the introduction, or an indicator ID for all the unique indicators. Therefore I created a “reference” dataframe with the corresponding indicator classes for the indicators being examined. Unfortunately, the list of indicator classes is not available as a data file, so I had to copy and paste it from the World Bank’s website into a file, append a column for “levels” of measurement categories, and save it as a CSV file. I will now load the file so I can join it to my dataset.
Gender_Ref <- read.csv("/Users/cruz-diazgroup/Desktop/1. William/1. School/DATA SCIENCE/2. DATA 110 Vis. & Com./3. Projects/Will_Lopez_Project2_Subject_Data_approval/Gender_Stats_csv/Gender_Ref.csv", check.names = FALSE, header = TRUE, sep = ",") 
I inspect the file here
#head(Gender_Ref)
I now join my dataframe with the reference table I just loaded and inspect the resulting joined table.
Gender_Data_Long1 <-left_join(Gender_Data_Long, Gender_Ref, by = "Indicator Code")
#head(Gender_Data_Long1)
#tail(Gender_Data_Long1)
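A left join keeps every row of the left table and fills unmatched rows with NA, so indicators that are absent from the reference file can go unnoticed. A sketch of a quick check with `anti_join()`, assuming dplyr and made-up tables (`Indicator Class` is a hypothetical name for the class column in the reference file):

```r
library(dplyr)

# Made-up long data and reference table.
toy_long <- tibble(`Indicator Code` = c("IND.A", "IND.B", "IND.C"),
                   value = c(1, 2, 3))
toy_ref <- tibble(`Indicator Code` = c("IND.A", "IND.B"),
                  `Indicator Class` = c("Agency", "Education"))

# Rows of the data with no match in the reference table.
unmatched <- anti_join(toy_long, toy_ref, by = "Indicator Code")

# The join itself: unmatched rows get NA in the reference columns.
joined <- left_join(toy_long, toy_ref, by = "Indicator Code")

unmatched$`Indicator Code`
```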
Per capita GDP distribution of nations by Income Level (Low, Lower-Middle, Upper-Middle & High)

A general look at the data reveals that the per capita GDP of nations is strongly right-skewed, and that there are very large differences among the grouped nations, with many outliers, particularly among High income nations. We can see here that the vast majority of nations lie within the Low to Upper-Middle income groups.

h1<-Gender_Data_Wide %>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019,!is.na(`Income Level`))%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_histogram(bins = 20, fill = "slategray1") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("Freq.")
b1<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019,!is.na(`Income Level`))%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_boxplot(fill = "slategray1") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("")
q1<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019,!is.na(`Income Level`))%>%
  ggplot(aes(sample = NY.GDP.PCAP.CD)) +
  geom_line(stat = "qq") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("Theoretical Quantiles") + ylab("Sample Quantiles")
(h1 + b1 + q1 +  
    plot_layout(ncol = 3, nrow = 1) +
    plot_annotation(title = "GDP per capita all income countries", 
                    subtitle = "Year 2019", 
                    caption = "Global Gender Statistics 2020 - World Bank"))
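The right skew described above can also be checked numerically: in a right-skewed sample the mean sits above the median, and taking logs pulls the long tail in. A base R sketch on a small made-up vector (the numbers are illustrative, not World Bank values):

```r
# Made-up per capita values with a long right tail.
x <- c(800, 1200, 2500, 4000, 6000, 9000, 15000, 45000, 110000)

mean_x <- mean(x)
median_x <- median(x)

# A mean well above the median is a sign of right skew.
is_right_skewed <- mean_x > median_x

# Gap between mean and median on the log10 scale.
log_gap <- abs(mean(log10(x)) - median(log10(x)))

c(mean = mean_x, median = median_x, log_gap = log_gap)
```

In ggplot2, adding `scale_x_log10()` to a histogram applies the same transformation to the axis.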

hist1<-Gender_Data_Wide %>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level` == "Low")%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_histogram(bins = 10, fill = "aquamarine3") +
  theme(axis.text = element_text(angle = 45)) +
    xlab("USD") + ylab("Freq.")
hist2<-Gender_Data_Wide %>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level` == "Lower-Middle")%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_histogram(bins = 10, fill = "darkseagreen") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("Freq.")
hist3<-Gender_Data_Wide %>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level` == "Upper-Middle")%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_histogram(bins = 10, fill = "deepskyblue3") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("Freq.")
hist4<-Gender_Data_Wide %>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level` == "High")%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_histogram(bins = 10, fill = "deepskyblue4") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("Freq.")
box1<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level`== "Low")%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_boxplot(fill = "aquamarine3") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("")
box2<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level`== "Lower-Middle")%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_boxplot(fill = "darkseagreen") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("")
box3<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level`== "Upper-Middle")%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_boxplot(fill = "deepskyblue3") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("")
box4<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level`== "High")%>% ggplot(
    mapping = aes(x = NY.GDP.PCAP.CD)) +
        geom_boxplot(fill = "deepskyblue4") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("USD") + ylab("")
# QQ panels use the same 2019, non-missing filter as the histograms and boxplots
qq1<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level`== "Low") %>%
  ggplot(aes(sample = NY.GDP.PCAP.CD)) +
  geom_line(stat = "qq") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("Theoretical Quantiles") + ylab("Sample Quantiles")
qq2<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level`== "Lower-Middle") %>%
  ggplot(aes(sample = NY.GDP.PCAP.CD)) +
  geom_line(stat = "qq") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("Theoretical Quantiles") + ylab("Sample Quantiles")
qq3<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level`== "Upper-Middle") %>%
  ggplot(aes(sample = NY.GDP.PCAP.CD)) +
  geom_line(stat = "qq") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("Theoretical Quantiles") + ylab("Sample Quantiles")
qq4<-Gender_Data_Wide%>%
  filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level`== "High") %>%
  ggplot(aes(sample = NY.GDP.PCAP.CD)) +
  geom_line(stat = "qq") +
  theme(axis.text = element_text(angle = 45)) +
  xlab("Theoretical Quantiles") + ylab("Sample Quantiles")
(hist1 + hist2 + hist3 + hist4 + 
   box1 + box2 + box3 + box4 +
   qq1 + qq2 + qq3 + qq4 +
    plot_layout(ncol = 4, nrow = 3) +
    plot_annotation(title = "GDP per capita for Low, Lower-Middle, Upper-Middle, and High income countries", 
                    subtitle = "Year 2019", 
                    caption = "Global Gender Statistics 2020 - World Bank"))
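The twelve panels above repeat the same filter and theme, with only the income level, geometry, and colour changing. A sketch of how the histograms could come from one helper, assuming dplyr and ggplot2; `gdp_2019()` and `gdp_hist()` are hypothetical helper names and the demo data frame is made up:

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: rows with a 2019 GDP-per-capita value for one income level.
gdp_2019 <- function(data, level) {
  data %>%
    filter(!is.na(NY.GDP.PCAP.CD), year == 2019, `Income Level` == level)
}

# Hypothetical helper: histogram of those rows, styled like the panels above.
gdp_hist <- function(data, level, fill) {
  ggplot(gdp_2019(data, level), aes(x = NY.GDP.PCAP.CD)) +
    geom_histogram(bins = 10, fill = fill) +
    theme(axis.text = element_text(angle = 45)) +
    xlab("USD") + ylab("Freq.")
}

# Made-up data with the same column names as Gender_Data_Wide.
demo <- data.frame(
  year = c("2019", "2019", "2018", "2019", "2019"),
  NY.GDP.PCAP.CD = c(700, 900, 650, NA, 50000),
  `Income Level` = c("Low", "Low", "Low", "Low", "High"),
  check.names = FALSE
)

low_rows <- gdp_2019(demo, "Low")
hist_low <- gdp_hist(demo, "Low", "aquamarine3")
nrow(low_rows)
```

Keeping the filter in one place also makes it harder for one family of panels to end up with a different year or missing-value rule than the others.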

# Keep only the rows for the year 2020
Gender_Data_Wide1<-Gender_Data_Wide %>%
  filter(year == 2020)

Agency Insights

As defined earlier, “Agency” refers to an individual’s capacity to make decisions about their own life and act on them to achieve a desired outcome, free of violence, retribution, or fear. We looked at some variables pertaining to travel equality between females and males, and found that there are still nations where females do not have the same rights as males to obtain a passport, to travel outside the home, or to travel outside the country.

# Add labels under the tops of bars
Gender_Data_Long1 %>%
  filter(!is.na(`Income Level`), `Indicator Code` %in% c("SG.APL.PSPT.EQ", "SG.HME.TRVL.EQ","SG.CTR.TRVL.EQ"), year == 2020, !is.na(value)) %>%
  group_by(`Indicator Code`)%>%
  ggplot(aes(x=`Indicator Code`, y=`value`, fill=`Income Level`)) + 
  geom_bar(stat="identity") +
  geom_text(aes(label=`value`), vjust=3.5, colour="black", size=3.5) +
  labs(title = "Female Travel Equality", 
       subtitle = "All Income Nations - 2020", 
       x = "Obtaining a Passport  -  Out of Home Travel  -  Out of Country Travel", y = "Frequencies", 
       caption = "data: world bank",
       fill = "") +
  scale_fill_brewer(palette = "Pastel2") 

Here we look more closely at the numbers reported for “Travel” equality in each income group, as a percentage of the total nations reporting.

Nations in favor of Travel equality, as a % of Total Reported per group

Gender_Data_Wide1%>%
  dplyr::select(`Income Level`, `SG.APL.PSPT.EQ`,`SG.HME.TRVL.EQ`,`SG.CTR.TRVL.EQ`,`Country Name`)%>%
  filter(!is.na(`Income Level`))%>%
  group_by(`Income Level`) %>%
  summarise(
    "Passport Equality %Y" = round(sum(`SG.APL.PSPT.EQ` == "1", na.rm = TRUE) / n_distinct(`Country Name`) * 100, digits = 0),
    "Out-of-home Equality %Y" = round(sum(`SG.HME.TRVL.EQ` == "1", na.rm = TRUE) / n_distinct(`Country Name`) * 100, digits = 0),
    "Out-of-Country Equality %Y" = round(sum(`SG.CTR.TRVL.EQ` == "1", na.rm = TRUE) / n_distinct(`Country Name`) * 100, digits = 0),
    "TTL Reported (each group)" = n_distinct(`Country Name`)
  ) %>%
  arrange(`TTL Reported (each group)`)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 5
##   `Income Level` `Passport Equality… `Out-of-home Equalit… `Out-of-Country Equa…
##   <chr>                        <dbl>                 <dbl>                 <dbl>
## 1 Low                             81                    90                    95
## 2 Lower-Middle                    76                    98                   100
## 3 Upper-Middle                    78                    93                    94
## 4 High                            87                    88                    92
## # … with 1 more variable: TTL Reported (each group) <int>
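The percentages above divide the number of countries coded 1 by all distinct countries in the income group, so countries with no reported value still count in the denominator. A sketch of the two possible denominators on made-up data, assuming dplyr (the column `equal` and the country names are illustrative):

```r
library(dplyr)

# Made-up indicator: 1 = equal treatment, 0 = not, NA = not reported.
toy <- tibble(
  `Income Level` = c("Low", "Low", "Low", "Low", "High", "High"),
  `Country Name` = c("A", "B", "C", "D", "E", "F"),
  equal = c(1, 1, 0, NA, 1, NA)
)

shares <- toy %>%
  group_by(`Income Level`) %>%
  summarise(
    # Share of all countries in the group (NA counts in the denominator).
    pct_of_all = round(sum(equal == 1, na.rm = TRUE) / n_distinct(`Country Name`) * 100),
    # Share of the countries that actually reported a value.
    pct_of_reported = round(mean(equal == 1, na.rm = TRUE) * 100),
    .groups = "drop"
  )

shares
```

With the first denominator, the "yes" and "no" shares of a group need not add up to 100 when some countries did not report.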

Nations against Travel equality, as a % of Total Reported per group

Gender_Data_Wide1%>%
  dplyr::select(`Income Level`, `SG.APL.PSPT.EQ`,`SG.HME.TRVL.EQ`,`SG.CTR.TRVL.EQ`,`Country Name`)%>%
  filter(!is.na(`Income Level`))%>%
  group_by(`Income Level`) %>%
  summarise(
    "Passport Equality %N" = round(sum(`SG.APL.PSPT.EQ` == "0", na.rm = TRUE) / n_distinct(`Country Name`) * 100, digits = 0),
    "Out-of-home Equality %N" = round(sum(`SG.HME.TRVL.EQ` == "0", na.rm = TRUE) / n_distinct(`Country Name`) * 100, digits = 0),
    "Out-of-Country Equality %N" = round(sum(`SG.CTR.TRVL.EQ` == "0", na.rm = TRUE) / n_distinct(`Country Name`) * 100, digits = 0),
    "TTL Reported (each group)" = n_distinct(`Country Name`)
  ) %>%
  arrange(`TTL Reported (each group)`)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 5
##   `Income Level` `Passport Equality… `Out-of-home Equalit… `Out-of-Country Equa…
##   <chr>                        <dbl>                 <dbl>                 <dbl>
## 1 Low                             19                    10                     5
## 2 Lower-Middle                    24                     2                     0
## 3 Upper-Middle                    20                     6                     4
## 4 High                             8                     7                     3
## # … with 1 more variable: TTL Reported (each group) <int>

Insights on Education

Harmonized Test Scores for males and females vs the Human Capital Index & the wealth of nations.

The Human Capital Index is an indicator of a country’s ability to mobilize the economic and professional potential of its citizens. It measures how much capital a given country loses through the lack of education and health. The HCI ranges between 0 and 1, where 1 indicates that the maximum potential for a given country has been reached. In general, citizens of wealthy nations tend to fare better in test scores than citizens of poorer countries. The graph below shows that test scores go hand in hand with a nation’s investment in its citizens’ education and wellbeing. A few outliers, though, show that a country’s GDP per capita does not necessarily result in high test scores in the absence of a high HCI. Qatar, United Arab Emirates, Kuwait, and Luxembourg, for instance, have much higher per capita GDP than Spain and Italy; nevertheless, Spain and Italy have higher HCI and test results. The same holds true for the United States vs Estonia and Poland.

fig<-plot_ly(
  Gender_Data_Wide1, x = ~HD.HCI.OVRL, y = ~HD.HCI.HLOS, 
  text = ~paste("Country: ",`Country Name`,"<br>Female scores:", HD.HCI.HLOS.FE, "<br>Male scores:", HD.HCI.HLOS.MA, "<br>Expected years of schooling (f):",HD.HCI.EYRS.FE, "<br>Expected years of schooling (m):",HD.HCI.EYRS.MA),
  color = ~HD.HCI.HLOS, size = ~NY.GDP.PCAP.CD 
)
fig <- fig %>% layout(
    title = '2020 Harmonized Test Scores vs The Human Capital Index', 
    xaxis = list(
      title = 'Human Capital Index - HCI'
    ),
    yaxis = list(
      title = 'Harmonized Test Scores'
    )
)
fig
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
## Warning: Ignoring 43 observations
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## Warning: `line.width` does not currently support multiple values.
Low expectations for education do appear to influence gender performance on test scores across most income levels.

Another important insight from this review concerns education expectations and test results by gender. It appears that gender expectations in terms of years of education influence the performance of males and females on test scores. In every income group except low income countries, females are expected to complete more years of education than males and, correspondingly, in those same groups the median and mean female test scores are higher than those for males. The boxplots also show that female test scores generally have higher “max” values than those for males.

# Distribution of test scores for females by country income level
Gender_Data_Wide1$`Income Level` <- factor(Gender_Data_Wide1$`Income Level`, levels = c("Low", "Lower-Middle", "Upper-Middle","High"))

fig2 <- plot_ly(
    data=Gender_Data_Wide1,
    x = ~`Income Level`,
    y = ~HD.HCI.HLOS.FE,
    color = ~`Income Level`,
    type = "box",
    showlegend = FALSE,
    text = ~paste("Country: ",`Country Name`,"<br>Female scores:", HD.HCI.HLOS.FE, 
                  "<br>Male scores:", HD.HCI.HLOS.MA, 
                  "<br>Expected years of schooling (f):", HD.HCI.EYRS.FE, 
                  "<br>Expected years of schooling (m):", HD.HCI.EYRS.MA)
)

fig2 <- fig2 %>% layout(
    title = '2020 Harmonized Female Test Scores Distribution', 
    xaxis = list(
      title = '"Per Income" Country Categories'
    ),
    yaxis = list(
      title = 'Harmonized Test Scores'
    )
)
# Distribution of test scores for males by country income level
Gender_Data_Wide1$`Income Level` <- factor(Gender_Data_Wide1$`Income Level`, levels = c("Low", "Lower-Middle", "Upper-Middle","High"))

fig3 <- plot_ly(
    data=Gender_Data_Wide1,
    x = ~`Income Level`,
    y = ~HD.HCI.HLOS.MA,
    color = ~`Income Level`,
    colors = "BrBG",
    type = "box",
    showlegend = FALSE,
    text = ~paste("Country: ",`Country Name`,"<br>Female scores:", HD.HCI.HLOS.FE, 
                  "<br>Male scores:", HD.HCI.HLOS.MA, 
                  "<br>Expected years of schooling (f):", HD.HCI.EYRS.FE, 
                  "<br>Expected years of schooling (m):", HD.HCI.EYRS.MA)
)

fig3 <- fig3 %>% layout(
    title = '2020 Harmonized Male Test Scores Distribution', 
    xaxis = list(
      title = '"Per Income" Country Categories'
    ),
    yaxis = list(
      title = 'Harmonized Test Scores'
    )
)
figA <- subplot(fig2, fig3, nrows = 1) 
## Warning: Ignoring 66 observations

## Warning: Ignoring 66 observations
figA <- figA %>% layout(title = "2020 Harmonized Test Scores Distribution by Income Categories 
(Females   &   Males)",
yaxis = list(title = 'Harmonized Test Scores'))

figA
# Mean test scores and expected years of education for both females & males
Gender_Data_Wide1%>%
  dplyr::select(`Income Level`,`HD.HCI.HLOS.FE`,`HD.HCI.HLOS.MA`,`HD.HCI.EYRS.FE`,`HD.HCI.EYRS.MA`)%>%
  filter(!is.na(`Income Level`))%>%
  group_by(`Income Level`) %>%
  summarise(
    Females = mean(`HD.HCI.HLOS.FE`, na.rm = TRUE),
    "Exp yrs (f)" = mean(`HD.HCI.EYRS.FE`, na.rm = TRUE),
    Males = mean(`HD.HCI.HLOS.MA`, na.rm = TRUE),
    "Exp yrs (m)" = mean(`HD.HCI.EYRS.MA`, na.rm = TRUE)
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 5
##   `Income Level` Females `Exp yrs (f)` Males `Exp yrs (m)`
##   <fct>            <dbl>         <dbl> <dbl>         <dbl>
## 1 Low               354.          7.10  355.          7.61
## 2 Lower-Middle      386.         10.3   381.         10.2 
## 3 Upper-Middle      418.         12.1   405.         11.8 
## 4 High              495.         13.3   481.         13.1
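The gender gap in mean scores is easier to read as a difference. This base R sketch re-enters the rounded means printed in the table above and subtracts them; because the inputs are rounded, the gaps are approximate.

```r
# Rounded group means copied from the summary table above.
scores <- data.frame(
  income_level = c("Low", "Lower-Middle", "Upper-Middle", "High"),
  females = c(354, 386, 418, 495),
  males = c(355, 381, 405, 481),
  exp_yrs_f = c(7.10, 10.3, 12.1, 13.3),
  exp_yrs_m = c(7.61, 10.2, 11.8, 13.1)
)

# Positive values mean females score higher / are expected to study longer.
scores$score_gap <- scores$females - scores$males
scores$years_gap <- round(scores$exp_yrs_f - scores$exp_yrs_m, 2)

scores[, c("income_level", "score_gap", "years_gap")]
```

In this table the sign of the score gap matches the sign of the expected-years gap in every income group.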

SOURCES

Gender Statistics sources of Data:
https://datacatalog.worldbank.org/dataset/gender-statistics
https://www.worldbank.org/en/data/datatopics/gender/data-resources

Gender Statistics indicators:
https://www.worldbank.org/en/data/datatopics/gender/indicators
https://www.investopedia.com/terms/g/gross-national-income-gni.asp
https://www.r-spatial.org/r/2018/10/25/ggplot2-sf.html