Week 13 assignment

Importing data set and libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gapminder)
library(ggthemes)

options(scipen = 0)

Let's have a look at the data we will be working with.

df <- gapminder_unfiltered
head(df)

## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Goal 1: Business Scenario

Customer or Audience

A global health organization, such as the World Health Organization (WHO), that focuses on improving healthcare in developing nations.

Problem Statement

To identify the countries where health care initiatives by an organization such as the WHO are likely to have the highest impact. Specifically, the aim is to find countries whose life expectancy is low relative to their GDP per capita, since this suggests poor health care provision relative to economic capability.

Scope

Variables used

The analysis uses country, year, life expectancy (lifeExp), population (pop), and GDP per capita (gdpPercap).

Analysis to be done

Identify countries with the lowest life expectancy for a given GDP per capita.

Compare life expectancy between countries for a given year.

Correlation analysis between GDP per capita and life expectancy.

Objective

Identify the countries and regions that exhibit the most significant healthcare underperformance relative to their economic capability. The results will help prioritize countries for healthcare funding or interventions. Success is achieved when actionable insights are provided to guide decision-making for healthcare initiatives.

Goal 2: Model Critique

Analysis 1: Identify countries with the lowest life expectancy for a given GDP per capita

Usual way

Below we find the 50 countries with the lowest life expectancy in 2007. To do so, we sort the rows by lifeExp and then gdpPercap in ascending order and take the first 50. Note that gdpPercap only breaks ties in lifeExp here, so this ranking is driven almost entirely by life expectancy. The scatter plot highlights the selected countries.

# 50 countries with the lowest life expectancy in 2007
# (gdpPercap is only used to break ties in lifeExp)
bottom_50_countries <- df |>
  filter(year == 2007) |>
  arrange(lifeExp, gdpPercap) |>
  select(country) |>
  head(50)

# Highlight the selected countries on a GDP vs. life expectancy scatter plot
df |>
  filter(year == 2007) |>
  mutate(colors = ifelse(country %in% bottom_50_countries$country, "bottom 50 country", "others")) |>
  ggplot() +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = colors)) +
  labs(
    title = "Bottom 50 countries by life expectancy, 2007",
    x = "GDP per capita",
    y = "Life expectancy"
  )

head(bottom_50_countries)

## # A tibble: 6 × 1
##   country     
##   <fct>       
## 1 Swaziland   
## 2 Mozambique  
## 3 Zambia      
## 4 Sierra Leone
## 5 Lesotho     
## 6 Angola

Improvement

I use linear regression to find the countries with the lowest life expectancy relative to their GDP per capita. The regression line provides a baseline for life expectancy at each level of GDP per capita. The residuals, that is, the differences between actual and predicted life expectancy, highlight deviations from that baseline.

A negative residual indicates a country performing worse than expected for its GDP per capita. In addition, I only consider countries with a life expectancy below 67 years and a GDP per capita below 9000, since I take these values as approximate world averages.

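sample_year <- 2007

# Restrict to the sample year and to countries below the chosen thresholds
sample_df <- df |>
  filter(year == sample_year) |>
  filter(lifeExp < 67) |>
  filter(gdpPercap < 9000)

# Baseline: expected life expectancy at a given GDP per capita
model <- lm(lifeExp ~ gdpPercap, data = sample_df)

# Residual = actual - predicted; negative means worse than expected
sample_df <- sample_df |>
  mutate(residuals = resid(model))

# Up to 50 countries with the most negative residuals
lowest_life_expectancy <- sample_df |>
  arrange(residuals) |>
  select(country) |>
  head(50)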

head(lowest_life_expectancy)

## # A tibble: 6 × 1
##   country     
##   <fct>       
## 1 Swaziland   
## 2 Angola      
## 3 Lesotho     
## 4 Zambia      
## 5 Mozambique  
## 6 Sierra Leone

# Highlight the residual-based selection on the full scatter plot for the sample year
df |>
  filter(year == sample_year) |>
  mutate(colors = ifelse(country %in% lowest_life_expectancy$country, "bottom 50 country", "others")) |>
  ggplot() +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp, color = colors)) +
  labs(
    title = "Countries with low life expectancy relative to GDP per capita, 2007",
    x = "GDP per capita",
    y = "Life expectancy"
  )
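To make the residual idea concrete, here is a minimal base R sketch on a toy data set. The five countries and all their values are invented purely for illustration and are not taken from gapminder. It shows how ranking by residual can differ from ranking by life expectancy alone.

```r
# Hypothetical toy data: every value below is invented for illustration only.
toy <- data.frame(
  country   = c("A", "B", "C", "D", "E"),
  gdpPercap = c(1000, 2000, 3000, 4000, 5000),
  lifeExp   = c(50, 58, 52, 62, 63)
)

# Baseline: expected life expectancy at a given GDP per capita.
toy_model <- lm(lifeExp ~ gdpPercap, data = toy)

# Residual = actual - predicted; negative means worse than the baseline.
toy$predicted <- unname(fitted(toy_model))
toy$residual  <- toy$lifeExp - toy$predicted

# Ranking by life expectancy alone ignores GDP per capita.
ranked_by_lifeexp <- toy[order(toy$lifeExp), ]

# Ranking by residual puts the strongest underperformers first.
ranked_by_residual <- toy[order(toy$residual), ]

print(ranked_by_lifeexp$country)
print(ranked_by_residual$country)
```

In this toy example, country A has the lowest life expectancy, but country C falls furthest below the regression line, so the two rankings start with different countries.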

Analysis 2: Compare life expectancy between countries for a given year

Here we visually compare life expectancy across the first 20 countries in the dataset for a given year.

sample_year <- 2007

data_filtered <- df |>
  filter(year == sample_year) |>
  head(20)

data_filtered |>
  ggplot(aes(x = lifeExp, y = reorder(country, -lifeExp))) +
  geom_col(fill = "skyblue") +
  labs(title = paste("Life Expectancy in", sample_year),
       x = "Life Expectancy (Years)",
       y = "Country") +
  theme_minimal()
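The chart above compares whichever 20 rows come first in the filtered data. If the aim were instead to compare the countries with the lowest life expectancy, one option is dplyr's slice_min(). The sketch below is only an illustration of that alternative, not part of the original analysis, and the toy values are invented.

```r
library(dplyr)

# Hypothetical toy data: every value below is invented for illustration only.
toy <- data.frame(
  country = c("A", "B", "C", "D", "E", "F"),
  lifeExp = c(71, 48, 65, 52, 80, 59)
)

# head() keeps the first rows in their stored order, regardless of lifeExp.
first_3 <- toy |>
  head(3)

# slice_min() keeps the rows with the smallest lifeExp, sorted ascending.
lowest_3 <- toy |>
  slice_min(lifeExp, n = 3)

print(first_3$country)
print(lowest_3$country)
```

With the real data, the same pattern would replace head(20) with slice_min(lifeExp, n = 20) after the year filter.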

Analysis 3: Correlation analysis between GDP per capita and life expectancy

Let's calculate the correlation coefficient between gdpPercap and lifeExp for 2007 to see whether it provides any new insight.

sample_df <- df |>
  filter(year == 2007)
correlation <- cor(sample_df$gdpPercap, sample_df$lifeExp)

print(paste("Correlation between GDP per capita and Life Expectancy for the year 2007: ", round(correlation, 2)))

## [1] "Correlation between GDP per capita and Life Expectancy for the year 2007:  0.63"

A correlation of 0.63 between GDP per capita and life expectancy indicates a moderately strong positive linear relationship, which supports the choice of linear regression to find countries with low life expectancy relative to their GDP per capita. This should give better results than the conventional approach of simply sorting countries by life expectancy and GDP per capita.
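As a rough guide to how much a correlation of 0.63 supports a linear model: in a simple linear regression with one predictor, R-squared equals the squared Pearson correlation. The base R sketch below checks this identity on invented toy values and then squares the reported 0.63. Note that the correlation was computed on all 2007 rows, whereas the regression in Analysis 1 was fitted on a filtered subset, so the implied R-squared is indicative only.

```r
# Hypothetical toy data: every value below is invented for illustration only.
x <- c(1000, 2000, 3000, 4000, 5000)
y <- c(50, 58, 52, 62, 63)

# Pearson correlation and R-squared of the simple linear regression y ~ x.
r         <- cor(x, y)
r_squared <- summary(lm(y ~ x))$r.squared

# For a single predictor these agree up to floating-point error.
print(all.equal(r^2, r_squared))

# Share of variance implied by the correlation of 0.63 reported above.
implied_r2 <- 0.63^2
print(implied_r2)  # 0.3969
```

In other words, a linear fit on the full 2007 data would explain roughly 40% of the variation in life expectancy, leaving substantial variation for the residuals to capture.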

Goal 3: Ethical and Epistemological Concerns

Overcoming Biases

Representation Bias: The dataset may disproportionately favor wealthier countries or regions with more reliable data reporting. Countries with limited resources often lack comprehensive or accurate data, leading to underrepresentation of their challenges.

Measurement Bias: Life expectancy and GDP per capita are proxies for health and economic well-being but may fail to account for cultural or regional differences in quality of life.

Possible Risks or Societal Implications

Ethical Risks:

Misinterpretation of Results: The results could be misread, for example by framing entire regions as “failing” without considering structural or historical factors.

Policy Misuse: Policymakers or organizations might use the data to justify harmful interventions or prioritize regions based on economic interests rather than actual need.

Crucial Issues That Might Not Be Measurable

Certain societal and ethical dimensions cannot be easily quantified. Cultural and historical contexts, such as colonial histories, governance structures, and conflict, may explain disparities but are not captured by GDP per capita or life expectancy.

To address this gap, we need to incorporate qualitative data and collaborate with experts in global health and social sciences to contextualize quantitative findings.

Stakeholders Affected

People living in countries flagged as underperforming might see changes in their access to healthcare and other resources. Global health organizations such as the WHO might set their priorities based on this data. These two groups are directly affected.

Policymakers and researchers are indirectly affected. Ultimately, it is the people in these countries who are affected the most.

Vigneshwar Ravirao