The research question this data analysis answers is: How does GDP per capita affect a country's gender inequality index? The dataset is a combination of 2 different datasets. Combined, with a total of 24 countries and 24 years per country, its dimensions are 576 x 4. The variables in the dataset are country_name, year, gdp_per_capita, and gender_inequality_index. The year range for this dataset is from 2000 up to 2023. The gdp_per_capita dataset is from https://ourworldindata.org/grapher/gdp-per-capita-worldbank?country=USA~DEU~GBR~BRA and the gender_inequality_index dataset is from https://ourworldindata.org/grapher/gender-inequality-index-from-the-human-development-report?country=JOR~IDN~DEU~SWE.
The data analysis done for this project is a descriptive analysis, and an EDA (exploratory data analysis) is also performed. The plot generated to answer my research question is a scatter plot with a linear regression line, which shows the relationship between GDP per capita and the gender inequality index and its overall trend. I also created 2 functions. The first retrieves the gdp_per_capita and gender_inequality_index of a particular country and year. The second summarizes a country's GDP per capita by computing its mean, median, sd, min, and max.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 4.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
gdp <- read.csv("gdp-per-capita-worldbank.csv")
gender <- read.csv("gender-inequality-index-from-the-human-development-report.csv")
colnames(gdp)
## [1] "Entity"
## [2] "Code"
## [3] "Year"
## [4] "GDP.per.capita..PPP..constant.2021.international..."
## [5] "World.regions.according.to.OWID"
colnames(gender)
## [1] "Entity" "Code"
## [3] "Year" "Gender.Inequality.Index"
# list of 24 countries
countries <- c("Switzerland", "Japan", "United States", "Canada", "Australia", "Sweden", "Germany", "United Kingdom", "New Zealand", "Denmark", "Norway", "France", "Netherlands", "Singapore", "Italy", "China", "United Arab Emirates", "South Korea", "Spain", "Finland", "Austria", "Iceland", "Belgium", "Ireland")
# rename the gdp and gender datasets columns
gdp <- gdp |>
rename("country_name" = "Entity",
"year" = "Year",
"gdp_per_capita" = "GDP.per.capita..PPP..constant.2021.international...")
gender <- gender |>
rename("country_name" = "Entity",
"year" = "Year",
"gender_inequality_index" = "Gender.Inequality.Index")
# filter the gdp and gender datasets to only have the countries from "countries" present and the years ranging from 2000 to 2023
gdp_val <- gdp |>
filter(
country_name %in% countries,
year >= 2000, year <= 2023)
head(gdp_val)
## country_name Code year gdp_per_capita World.regions.according.to.OWID
## 1 Australia AUS 2000 45013.01
## 2 Australia AUS 2001 45338.68
## 3 Australia AUS 2002 46609.75
## 4 Australia AUS 2003 47500.84
## 5 Australia AUS 2004 48980.89
## 6 Australia AUS 2005 49914.52
gender_inequality <- gender |>
filter(country_name %in% countries,
year >= 2000, year <= 2023)
head(gender_inequality)
## country_name Code year gender_inequality_index
## 1 Australia AUS 2000 0.153
## 2 Australia AUS 2001 0.152
## 3 Australia AUS 2002 0.146
## 4 Australia AUS 2003 0.141
## 5 Australia AUS 2004 0.137
## 6 Australia AUS 2005 0.137
# no NAs present
colSums(is.na(gdp_val))
## country_name Code
## 0 0
## year gdp_per_capita
## 0 0
## World.regions.according.to.OWID
## 0
colSums(is.na(gender_inequality))
## country_name Code year
## 0 0 0
## gender_inequality_index
## 0
# after checking to see every year and country has a data value for both gdp and gender, a data frame is created to have both the data from gdp and gender
combined <- data.frame(
country_name = gdp_val$country_name,
year = gdp_val$year,
gdp_per_capita = gdp_val$gdp_per_capita,
gender_inequality_index = gender_inequality$gender_inequality_index
)
head(combined)
## country_name year gdp_per_capita gender_inequality_index
## 1 Australia 2000 45013.01 0.153
## 2 Australia 2001 45338.68 0.152
## 3 Australia 2002 46609.75 0.146
## 4 Australia 2003 47500.84 0.141
## 5 Australia 2004 48980.89 0.137
## 6 Australia 2005 49914.52 0.137
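The combined data frame above pairs the two sources by row position, which relies on both filtered tables being sorted identically. As a hedged alternative, here is a minimal sketch of matching rows on country_name and year with base R's merge(). The toy data frames gdp_toy and gender_toy are made up for illustration and are not the project data.

```r
# Toy stand-ins for gdp_val and gender_inequality (hypothetical values),
# deliberately stored in different row orders.
gdp_toy <- data.frame(
  country_name = c("A", "A", "B"),
  year = c(2000, 2001, 2000),
  gdp_per_capita = c(100, 110, 200)
)
gender_toy <- data.frame(
  country_name = c("B", "A", "A"),
  year = c(2000, 2001, 2000),
  gender_inequality_index = c(0.30, 0.11, 0.10)
)

# Join on the shared keys so each GDP value is paired with the index
# for the same country and year, regardless of row order.
joined_toy <- merge(gdp_toy, gender_toy, by = c("country_name", "year"))
print(joined_toy)
```

With the real tables, the same call on gdp_val and gender_inequality would also drop any country-year that is missing from either source, so comparing the row count against 576 would be a quick sanity check.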
# Function 1: get the gdp and gender index based on country_name and year
search_country <- function(country, yyyy){
  # check the type first, then membership; the message prints the `country` argument
  if (!is.character(country) || !(country %in% combined$country_name)) {
    cat("Error: Variable", country, "is not a character or not found.\n")
    return(NULL)
  }
  if (!is.numeric(yyyy) || !(yyyy %in% combined$year)) {
    cat("Error: Variable", yyyy, "is not numeric or not found.\n")
    return(NULL)
  }
  # keep only the row for the requested country and year
  info <- combined |>
    filter(country_name == country, year == yyyy)
  return(info)
}
# Function 2: summarize the gdp of a country
summarize_gdp_of_country <- function(country){
  # check the type first, then membership; the message prints the `country` argument
  if (!is.character(country) || !(country %in% combined$country_name)) {
    cat("Error: Variable", country, "is not a character or not found.\n")
    return(NULL)
  }
  # summary statistics of gdp_per_capita across all years for this country
  info <- gdp_val |>
    filter(country_name == country) |>
    summarise(
      mean(gdp_per_capita),
      median(gdp_per_capita),
      sd(gdp_per_capita),
      min(gdp_per_capita),
      max(gdp_per_capita)
    )
  return(info)
}
# performing EDA
dim(combined)
## [1] 576 4
str(combined)
## 'data.frame': 576 obs. of 4 variables:
## $ country_name : chr "Australia" "Australia" "Australia" "Australia" ...
## $ year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
## $ gdp_per_capita : num 45013 45339 46610 47501 48981 ...
## $ gender_inequality_index: num 0.153 0.152 0.146 0.141 0.137 0.137 0.135 0.136 0.14 0.137 ...
head(combined)
## country_name year gdp_per_capita gender_inequality_index
## 1 Australia 2000 45013.01 0.153
## 2 Australia 2001 45338.68 0.152
## 3 Australia 2002 46609.75 0.146
## 4 Australia 2003 47500.84 0.141
## 5 Australia 2004 48980.89 0.137
## 6 Australia 2005 49914.52 0.137
# testing function 1 and 2
search_country("Australia", 2003)
## country_name year gdp_per_capita gender_inequality_index
## 1 Australia 2003 47500.84 0.141
summarize_gdp_of_country("Australia")
## mean(gdp_per_capita) median(gdp_per_capita) sd(gdp_per_capita)
## 1 53318.65 53675.06 4457.584
## min(gdp_per_capita) max(gdp_per_capita)
## 1 45013.01 60461.16
ggplot(combined, aes(x = gdp_per_capita, y = gender_inequality_index)) +
geom_point(size = 3, color ="#3B7FD9") +
geom_smooth(method = "lm", color = "#000000", se = FALSE) +
labs(title = "GDP per capita vs Gender Inequality Index",
x = "GDP per capita",
y = "Gender Inequality Index") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The final model used for this project is a multiple linear regression model, fit with lm().
multiple_model <- lm(gender_inequality_index ~ gdp_per_capita + year, data = combined)
summary(multiple_model)
##
## Call:
## lm(formula = gender_inequality_index ~ gdp_per_capita + year,
## data = combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11872 -0.03767 -0.01205 0.01961 0.58119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.177e+01 9.577e-01 12.286 <2e-16 ***
## gdp_per_capita -2.367e-07 1.795e-07 -1.318 0.188
## year -5.789e-03 4.773e-04 -12.129 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07708 on 573 degrees of freedom
## Multiple R-squared: 0.224, Adjusted R-squared: 0.2213
## F-statistic: 82.71 on 2 and 573 DF, p-value: < 2.2e-16
Interpretations:
The intercept is 11.77, the predicted index when all predictors are 0.
This is not practically meaningful, but mathematically it's the
y-intercept. For the coefficients, while holding year constant, the gdp
per capita coefficient is -2.367e-7, meaning that for each one-dollar
increase, the gender inequality index goes down by about 2.367e-7. The
p-value for gdp per capita is 0.188, which is > 0.05, so it is not
statistically significant. While holding gdp per capita constant, the
year coefficient is -5.789e-3, meaning that for each 1 year increase,
the inequality index goes down by about 5.789e-3. The p-value for year
is < 2e-16, which is < 0.05, so it is statistically significant. The
adjusted R-squared value is around 0.2213, meaning gdp per capita and
year together explain about 22.13% of the variance in the gender
inequality index.
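Because the GDP coefficient is expressed per single dollar, it can be easier to read on a larger scale. The following is a small worked calculation using the rounded estimates from the regression summary. The 10,000 dollar step and the variable names are illustrative choices, not part of the original analysis.

```r
# Rounded coefficient estimates copied from the regression summary
b_gdp <- -2.367e-07   # change in index per 1 dollar of GDP per capita
b_year <- -5.789e-03  # change in index per 1 year

# Predicted change in the index for a 10,000 dollar increase, year held constant
change_per_10k <- b_gdp * 10000
# Predicted change across the 23 years from 2000 to 2023, GDP held constant
change_2000_2023 <- b_year * (2023 - 2000)

print(c(change_per_10k = change_per_10k, change_2000_2023 = change_2000_2023))
```

Under these estimates, the time trend accounts for a far larger shift in the index than a 10,000 dollar difference in GDP per capita.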
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
# Linearity Check
crPlots(multiple_model)
# Independence of Observations
plot(resid(multiple_model), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
# Core diagnostics (covers: linearity, homoscedasticity, normality, influence)
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))
# Check Multicollinearity
cor(combined[, c("gdp_per_capita", "gender_inequality_index", "year")], use = "complete.obs")
## gdp_per_capita gender_inequality_index year
## gdp_per_capita 1.0000000 -0.1575206 0.2343928
## gender_inequality_index -0.1575206 1.0000000 -0.4708181
## year 0.2343928 -0.4708181 1.0000000
# Calculate residuals (normality itself is assessed with the Q-Q plot from the core diagnostics)
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 0.07688281
Results: In the component + residual plots, GDP per capita shows a downward trend, meaning a negative relationship with the gender inequality index. There appear to be a few outliers. The fitted curve bends away from the blue dashed line, which suggests a possible nonlinear relationship. For year, there is also a downward trend, meaning a negative relationship with the gender inequality index, again with a few outliers. The pink line is close to straight, so linearity is good.
For independence, the residuals are mostly centered around 0. There are a few bursts but no sustained patterns. For Residuals vs Fitted, the points form a flat cloud, so linearity is good. In terms of homoscedasticity (equal variance), the spread of residuals is roughly constant, which suggests no evidence of heteroscedasticity. For Scale-Location, the spread is also constant, so again there is no evidence of heteroscedasticity. For the Q-Q Residuals, there is a slight right-tail deviation, though it is not critical. For Residuals vs Leverage, there are no obvious influential points.
For multicollinearity, the correlations are quite weak: the two predictors, GDP per capita and year, have a correlation of only about 0.23, and no pair of variables exceeds 0.50 in absolute value.
The RMSE is 0.07688281, meaning predictions of the gender inequality index miss by about 0.077 on average.
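With only two predictors, the variance inflation factor (VIF) follows directly from their correlation as 1 / (1 - r^2). Here is a quick sketch using the correlation between gdp_per_capita and year printed in the matrix above. The variable names are illustrative.

```r
# Correlation between the two predictors, copied from the correlation matrix
r_gdp_year <- 0.2343928

# For a model with exactly two predictors, both share this VIF
vif_two_predictors <- 1 / (1 - r_gdp_year^2)
print(vif_two_predictors)
```

A value this close to 1 is consistent with little multicollinearity. Since the car package is already loaded, vif(multiple_model) should report the same quantity for the fitted model.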
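The RMSE is slightly smaller than the residual standard error of 0.07708 in the model summary. This is because the RMSE divides the squared residuals by the number of observations (576), while the residual standard error divides by the residual degrees of freedom (573). Here is a short check of that relationship, using the rounded value from the summary, so the result only matches to a few decimal places.

```r
rse <- 0.07708   # residual standard error from summary(), rounded
n <- 576         # number of observations
df_resid <- 573  # residual degrees of freedom

# RMSE = sqrt(RSS / n) and RSE = sqrt(RSS / df), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
print(rmse_from_rse)
```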
Overall, the estimated relationship is negative: as GDP per capita increases, the gender inequality index tends to decrease, so a country with a higher GDP per capita tends to have a lower amount of gender inequality. However, once year is held constant, the GDP per capita coefficient is not statistically significant (p = 0.188), so this conclusion should be read with caution. The linearity and homoscedasticity checks look good, as does residual normality, which supports the model assumptions. The RMSE of 0.07688281 means the prediction misses are small on the index's 0 to 1 scale. The adjusted R-squared of about 22.13% is on the lower end, so the model leaves most of the variance in the gender inequality index unexplained. In terms of limitations, the component + residual plot for GDP per capita shows a downward trend with a few outliers and a curve that bends. This raises the possibility that GDP per capita and the gender inequality index have a nonlinear relationship.
For future directions, we can add other factors, columns, and predictors to enhance our dataset and improve the multiple linear regression. We can also look deeper into other causes of gender inequality. Another option is a closer look at the trend within a specific country in order to predict that country's future direction in terms of gender inequality.
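One way to explore the possible nonlinearity noted above is to refit the model with GDP per capita on a log scale. The sketch below uses simulated data (sim, generated with made-up parameters) only to show the syntax. It is not the project data and its estimates carry no meaning. With the real data, the same formula would be applied to combined.

```r
set.seed(1)
# Simulated stand-in for `combined` (hypothetical values, for syntax only)
sim <- data.frame(
  year = rep(2000:2023, times = 3),
  gdp_per_capita = runif(72, min = 20000, max = 100000)
)
sim$gender_inequality_index <- 0.9 - 0.07 * log(sim$gdp_per_capita) +
  rnorm(72, sd = 0.01)

# Same model structure as before, but with log-transformed GDP per capita
log_model <- lm(gender_inequality_index ~ log(gdp_per_capita) + year, data = sim)
print(names(coef(log_model)))
```

Comparing the component + residual plot and adjusted R-squared of such a model against the original fit would indicate whether the log scale is a better description.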
“GDP per Capita.” Our World in Data, 2025, ourworldindata.org/grapher/gdp-per-capita-worldbank?country=USA~DEU~GBR~BRA.
“Gender Inequality Index.” Our World in Data, 2025, ourworldindata.org/grapher/gender-inequality-index-from-the-human-development-report?country=JOR~IDN~DEU~SWE.