library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(ggthemes)
options(scipen = 0)
Let's have a look at the data we will be working with.
df <- gapminder_unfiltered
head(df)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
The intended audience is a global health organization, such as the World Health Organization (WHO), that focuses on improving healthcare in developing nations.
The goal is to identify countries where healthcare initiatives by an organisation such as the WHO would have the greatest impact. Specifically, the aim is to find countries with low life expectancy relative to their GDP per capita, as this indicates poor healthcare facilities relative to economic capability.
The variables used for the analysis are country, year, life expectancy (lifeExp), population (pop), and GDP per capita (gdpPercap).
Compare life expectancy between countries for a given year.
Correlation analysis between GDP per capita and life expectancy.
Identify the countries and regions that exhibit the most significant healthcare underperformance relative to their economic capability. The results will help prioritize countries for healthcare funding or interventions. Success is achieved when actionable insights are provided to guide decision-making for healthcare initiatives.
Below we find the 50 countries with the lowest life expectancy for a given GDP per capita. We identify them by sorting on the columns lifeExp and gdpPercap in ascending order and taking the first 50 rows. Note that this ordering ranks by life expectancy first and uses GDP per capita only to break ties. The results are shown in the scatter plot.
bottom_50_countries <- df |>
  filter(year == 2007) |>
  arrange(lifeExp, gdpPercap) |>
  select(country) |>
  head(50)
df |>
  filter(year == 2007) |>
  # Flag the 50 countries selected above so they can be coloured separately
  mutate(colors = ifelse(country %in% bottom_50_countries$country, "bottom 50 countries", "others")) |>
  arrange(lifeExp, gdpPercap) |>
  ggplot() +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = colors)) +
  labs(
    title = "50 countries with the lowest life expectancy relative to GDP per capita",
    x = "GDP per capita",
    y = "Life expectancy"
  )
head(bottom_50_countries)
## # A tibble: 6 × 1
## country
## <fct>
## 1 Swaziland
## 2 Mozambique
## 3 Zambia
## 4 Sierra Leone
## 5 Lesotho
## 6 Angola
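To illustrate how a two-column sort behaves, here is a minimal sketch on a made-up four-row data frame (the `toy` data and its values are hypothetical, not taken from gapminder). It shows that `arrange()` sorts by the first column and consults the second only when the first is tied.

```r
library(dplyr)

# Hypothetical toy data, not taken from gapminder
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  lifeExp   = c(50, 45, 50, 60),
  gdpPercap = c(900, 5000, 300, 200)
)

# Sorted by lifeExp first; gdpPercap only decides the order of A and C,
# which share the same lifeExp
sorted <- toy |>
  arrange(lifeExp, gdpPercap)

print(sorted$country)
```

Country B comes first despite having the highest GDP per capita, because only its life expectancy matters for the ranking. A sort like this therefore finds low life expectancy, not low life expectancy relative to GDP.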
I'm using linear regression to find the countries with the lowest life expectancy with respect to their GDP per capita. The regression line provides a baseline for life expectancy at various levels of GDP per capita. The residuals, the differences between actual life expectancy and the predicted value, highlight deviations from that baseline.
A negative residual indicates a country performing worse than expected based on its GDP per capita. On top of this, I am only considering countries with a life expectancy below 67 years and a GDP per capita below 9000, since these thresholds are treated here as rough world averages.
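sample_year <- 2007
# Keep only countries below both thresholds for the chosen year
sample_df <- df |>
  filter(year == sample_year, lifeExp < 67, gdpPercap < 9000)
# Fit the baseline: expected life expectancy at a given GDP per capita
model <- lm(lifeExp ~ gdpPercap, data = sample_df)
sample_df <- sample_df |>
  mutate(residuals = resid(model))
# Most negative residuals first: furthest below the regression line
lowest_life_expectancy <- sample_df |>
  arrange(residuals) |>
  select(country) |>
  head(50)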
head(lowest_life_expectancy)
## # A tibble: 6 × 1
## country
## <fct>
## 1 Swaziland
## 2 Angola
## 3 Lesotho
## 4 Zambia
## 5 Mozambique
## 6 Sierra Leone
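The residual idea can be checked by hand on a small example. The sketch below uses a made-up five-row data frame (the `toy` data and its values are hypothetical, not taken from gapminder) and only base R. It computes each residual as observed minus predicted and picks the country furthest below the fitted line.

```r
# Hypothetical toy data, not taken from gapminder
toy <- data.frame(
  country   = c("A", "B", "C", "D", "E"),
  gdpPercap = c(500, 1000, 2000, 4000, 8000),
  lifeExp   = c(48, 52, 45, 60, 66)
)

model <- lm(lifeExp ~ gdpPercap, data = toy)

# The residual is the observed value minus the value predicted by the line
toy$predicted <- unname(predict(model, newdata = toy))
toy$residual  <- toy$lifeExp - toy$predicted

# resid() returns the same numbers as the manual calculation
same <- isTRUE(all.equal(unname(resid(model)), toy$residual))
print(same)

# The most negative residual marks the country furthest below the line
worst <- toy$country[which.min(toy$residual)]
print(worst)
```

In this toy data, country C has neither the lowest GDP per capita nor a tie for lowest life expectancy by chance alone. It is selected because its life expectancy falls furthest short of what the fitted line predicts at its income level.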
df |>
  filter(year == 2007) |>
  # Flag the countries selected by the regression residuals
  mutate(colors = ifelse(country %in% lowest_life_expectancy$country, "bottom 50 countries", "others")) |>
  ggplot() +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = colors)) +
  labs(
    title = "50 countries with the lowest life expectancy relative to GDP per capita",
    x = "GDP per capita",
    y = "Life expectancy"
  )
Here we visually compare life expectancy across 20 countries for a given year. The 20 countries are simply the first 20 rows returned for that year.
sample_year <- 2007
data_filtered <- df |>
  filter(year == sample_year) |>
  head(20)
data_filtered |>
  ggplot(aes(x = lifeExp, y = reorder(country, -lifeExp))) +
  # geom_col() draws bars from the values in the data, like geom_bar(stat = "identity")
  geom_col(fill = "skyblue") +
  labs(
    title = paste("Life Expectancy in", sample_year),
    x = "Life Expectancy (Years)",
    y = "Country"
  ) +
  theme_minimal()
Let's calculate the correlation coefficient between gdpPercap and lifeExp for the year 2007 to see whether it provides any new insights.
sample_df <- df |>
  filter(year == 2007)
correlation <- cor(sample_df$gdpPercap, sample_df$lifeExp)
print(paste("Correlation between GDP per capita and Life Expectancy for the year 2007: ", round(correlation, 2)))
## [1] "Correlation between GDP per capita and Life Expectancy for the year 2007: 0.63"
Since the correlation between GDP per capita and life expectancy is 0.63, a moderately strong positive linear relationship, using linear regression to find the countries with low life expectancy relative to GDP per capita is a reasonable choice. It should give better results than the simpler method of identifying countries by sorting on life expectancy and GDP per capita.
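One way to read the 0.63 figure: for a regression with a single predictor, the squared Pearson correlation equals the model's R-squared, so a correlation of 0.63 corresponds to roughly 0.40 of the variance in life expectancy being accounted for by a straight-line fit on GDP per capita. The sketch below checks that identity on made-up numbers (the `x` and `y` vectors are hypothetical, not gapminder values) using only base R.

```r
# Hypothetical toy data, not taken from gapminder
x <- c(500, 1000, 2000, 4000, 8000)
y <- c(48, 52, 45, 60, 66)

r  <- cor(x, y)                     # Pearson correlation
r2 <- summary(lm(y ~ x))$r.squared  # R-squared of the simple regression

# For a single predictor these agree up to floating-point error
agree <- isTRUE(all.equal(r^2, r2))
print(agree)
```

This is only a reading aid. The correlation measures how well a straight line describes the relationship, which is the same assumption the residual ranking relies on.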
Representation Bias: The dataset may disproportionately favor wealthier countries or regions with more reliable data reporting. Countries with limited resources often lack comprehensive or accurate data, leading to underrepresentation of their challenges.
Measurement Bias: Life expectancy and GDP per capita are proxies for health and economic well-being but may fail to account for cultural or regional differences in quality of life.
Ethical Risks:
Misinterpretation of Results: For example, framing entire regions as “failing” without considering structural or historical factors.
Policy Misuse: Policymakers or organizations might use the data to justify harmful interventions or prioritize regions based on economic interests rather than actual need.
Certain societal and ethical dimensions cannot be easily quantified. Cultural and historical contexts, such as colonial histories, governance structures, and conflict, may explain disparities but are not captured in GDP per capita or life expectancy.
To address this gap, we need to incorporate qualitative data and collaborate with experts in global health and social sciences to contextualize quantitative findings.
People living in countries flagged as underperforming might receive different treatment, access to healthcare, and other resources. Global health organisations such as the WHO might also set their priorities based on this data. These two groups will be directly affected.
Policy makers and researchers will be indirectly affected. Ultimately, it is the people who are affected the most.