final project

Global Development Project

Economic Development and Public Health Across Countries

Question:

Does access to clean water improve life expectancy across countries ?

Introduction

This project analyzes the relationship between public health infrastructure, economic development, and life expectancy across countries. The main goal is to understand whether countries with better access to clean drinking water, improved sanitation, higher healthcare spending, and stronger economic conditions tend to have higher life expectancy.
The dataset comes from the World Bank World Development Indicators database. The variables used in this project include country name, income group, region, life expectancy at birth, access to basic drinking water services, access to basic sanitation services, GDP per capita, current health expenditure per capita, and population.
I chose this topic because access to clean water and healthcare are important global issues. Many countries still face inequalities in public health and economic development, and this project helps show how those differences may affect people’s quality of life.

library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(scales)

Attaching package: 'scales'
The following object is masked from 'package:readr':

    col_factor
library(broom)
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout

Data Cleaning

The data was downloaded from the World Bank. I selected the year 2020, renamed the variables, merged the datasets by country.

life <- read_csv("~/Downloads/Life expectancy at birth, total .csv", skip = 4)
New names:
Rows: 266 Columns: 71
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (65): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (2): 2025,
...71
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...71`
water <- read_csv("~/Downloads/People using at least basic drinking water services .csv", skip = 4)
New names:
Rows: 266 Columns: 71
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (25): 2000,
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ... lgl (42): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...71`
sanitation <- read_csv("~/Downloads/People using at least basic sanitation services (% of population).csv", skip = 4)
New names:
Rows: 266 Columns: 71
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (25): 2000,
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ... lgl (42): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...71`
gdp <- read_csv("~/Downloads/GDP per capita (current US$).csv", skip = 4)
New names:
Rows: 266 Columns: 71
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (65): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (2): 2025,
...71
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...71`
health <- read_csv("~/Downloads/Current health expenditure per capita (current US$).csv", skip = 4)
New names:
Rows: 266 Columns: 71
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (25): 2000,
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ... lgl (42): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...71`
population <- read_csv("~/Downloads/ppulation.csv", skip = 4)
New names:
Rows: 266 Columns: 71
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (65): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (2): 2025,
...71
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...71`
metadata <- read_csv("~/Downloads/Metadata_Country_API_SP.DYN.LE00.IN_DS2_en_csv_v2_121663.csv")
New names:
Rows: 265 Columns: 6
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(5): Country Code, Region, IncomeGroup, SpecialNotes, TableName lgl (1): ...6
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...6`
names(life)
 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "2019"          
[65] "2020"           "2021"           "2022"           "2023"          
[69] "2024"           "2025"           "...71"         
 project_year <- "2020"
 
# Select variables needed from each dataset
 
life_clean <- life |>
  select(`Country Name`, `Country Code`,
         life_expectancy = all_of(project_year))

water_clean <- water |>
  select(`Country Name`, `Country Code`,
         clean_water_access = all_of(project_year))

sanitation_clean <- sanitation |>
  select(`Country Name`, `Country Code`,
         sanitation_access = all_of(project_year))

gdp_clean <- gdp |>
  select(`Country Name`, `Country Code`,
         gdp_per_capita = all_of(project_year))

health_clean <- health |>
  select(`Country Name`, `Country Code`,
         health_expenditure = all_of(project_year))

population_clean <- population |>
  select(`Country Name`, `Country Code`,
         population = all_of(project_year))

metadata_clean <- metadata |>
  select(`Country Code`, Region, IncomeGroup)

# Merge all datasets together

world_health <- life_clean |>
  left_join(water_clean,
            by = c("Country Name", "Country Code")) |>
  left_join(sanitation_clean,
            by = c("Country Name", "Country Code")) |>
  left_join(gdp_clean,
            by = c("Country Name", "Country Code")) |>
  left_join(health_clean,
            by = c("Country Name", "Country Code")) |>
  left_join(population_clean,
            by = c("Country Name", "Country Code")) |>
  left_join(metadata_clean,
            by = "Country Code")
# Check if the merged dataset exists

names(world_health)
 [1] "Country Name"       "Country Code"       "life_expectancy"   
 [4] "clean_water_access" "sanitation_access"  "gdp_per_capita"    
 [7] "health_expenditure" "population"         "Region"            
[10] "IncomeGroup"       
# Clean the final dataset

world_health_clean <- world_health |>
  filter(!is.na(life_expectancy),
         !is.na(clean_water_access),
         !is.na(sanitation_access),
         !is.na(gdp_per_capita),
         !is.na(health_expenditure),
         !is.na(population)) |>
  mutate(
    life_expectancy = as.numeric(life_expectancy),
    clean_water_access = as.numeric(clean_water_access),
    sanitation_access = as.numeric(sanitation_access),
    gdp_per_capita = as.numeric(gdp_per_capita),
    health_expenditure = as.numeric(health_expenditure),
    population = as.numeric(population)
  )
# Check cleaned dataset

glimpse(world_health_clean)
Rows: 229
Columns: 10
$ `Country Name`     <chr> "Africa Eastern and Southern", "Afghanistan", "Afri…
$ `Country Code`     <chr> "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB", "A…
$ life_expectancy    <dbl> 63.76648, 61.45400, 57.36474, 63.11600, 77.82400, 7…
$ clean_water_access <dbl> 58.22948, 72.18207, 72.24889, 64.94403, 95.06804, 1…
$ sanitation_access  <dbl> 32.87999, 48.90568, 37.05991, 49.80164, 99.29879, 1…
$ gdp_per_capita     <dbl> 1351.5917, 510.7871, 2030.8617, 1759.3562, 6027.913…
$ health_expenditure <dbl> 81.44614, 80.08923, 64.62419, 56.01642, 396.17477, …
$ population         <dbl> 694446100, 39068979, 474569351, 33451132, 2528480, …
$ Region             <chr> NA, "Middle East & North Africa", NA, "Sub-Saharan …
$ IncomeGroup        <chr> NA, "Low income", NA, "Lower middle income", "Upper…
summary(world_health_clean)
 Country Name       Country Code       life_expectancy clean_water_access
 Length:229         Length:229         Min.   :50.60   Min.   : 35.51    
 Class :character   Class :character   1st Qu.:66.36   1st Qu.: 83.70    
 Mode  :character   Mode  :character   Median :72.14   Median : 94.31    
                                       Mean   :71.61   Mean   : 88.42    
                                       3rd Qu.:77.16   3rd Qu.: 99.08    
                                       Max.   :86.09   Max.   :100.00    
 sanitation_access gdp_per_capita     health_expenditure   population       
 Min.   :  9.013   Min.   :   255.8   Min.   :   12.80   Min.   :1.040e+04  
 1st Qu.: 59.711   1st Qu.:  1993.6   1st Qu.:   79.81   1st Qu.:3.399e+06  
 Median : 88.927   Median :  5299.0   Median :  356.81   Median :1.553e+07  
 Mean   : 76.879   Mean   : 14388.6   Mean   : 1252.97   Mean   :3.682e+08  
 3rd Qu.: 98.034   3rd Qu.: 15369.6   3rd Qu.: 1182.54   3rd Qu.:1.012e+08  
 Max.   :100.000   Max.   :176891.9   Max.   :11648.50   Max.   :7.855e+09  
    Region          IncomeGroup       
 Length:229         Length:229        
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      

Multiple Linear Regression

# Multiple Linear Regression Analysis

model <- lm(
life_expectancy ~ clean_water_access +
sanitation_access +
gdp_per_capita +
health_expenditure,

data = world_health_clean
)

summary(model)

Call:
lm(formula = life_expectancy ~ clean_water_access + sanitation_access + 
    gdp_per_capita + health_expenditure, data = world_health_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.1699  -1.5153   0.5194   2.0465   9.0706 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        5.300e+01  1.957e+00  27.089  < 2e-16 ***
clean_water_access 4.648e-02  3.682e-02   1.262 0.208088    
sanitation_access  1.686e-01  2.047e-02   8.240 1.44e-14 ***
gdp_per_capita     6.640e-05  1.923e-05   3.453 0.000663 ***
health_expenditure 4.618e-04  2.127e-04   2.171 0.030961 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.185 on 224 degrees of freedom
Multiple R-squared:  0.8109,    Adjusted R-squared:  0.8075 
F-statistic: 240.1 on 4 and 224 DF,  p-value: < 2.2e-16

Visualization

The first visualization shows that countries with higher access to basic drinking water services usually have higher life expectancy. Most countries with very high clean water access are grouped near the top of the graph, meaning they tend to have longer life expectancy. This suggests that clean water is an important public health factor.

# Visualization 1: Clean Water Access and Life Expectancy

ggplot(world_health_clean,
       aes(x = clean_water_access,
           y = life_expectancy,
           color = IncomeGroup)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Access to Clean Water and Life Expectancy",
    x = "Access to Basic Drinking Water Services (%)",
    y = "Life Expectancy at Birth (Years)",
    color = "Income Group",
    caption = "Source: World Bank World Development Indicators"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_point()`).

The second visualization shows the relationship between GDP per capita and life expectancy. Countries with higher GDP per capita generally have higher life expectancy. This may happen because richer countries usually have more resources for healthcare, clean water, sanitation, and public services.

# Visualization 2: Interactive GDP per Capita and Life Expectancy

library(plotly)

final_plot <- ggplot(world_health_clean,
       aes(x = gdp_per_capita,
           y = life_expectancy,
           color = IncomeGroup,
           size = population,
           text = paste(
             "Country:", `Country Name`,
             "<br>Income Group:", IncomeGroup,
             "<br>Region:", Region,
             "<br>GDP per capita:", round(gdp_per_capita, 2),
             "<br>Life expectancy:", round(life_expectancy, 2),
             "<br>Population:", population
           ))) +
  geom_point(alpha = 0.7) +
  scale_x_log10(labels = dollar) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Interactive GDP per Capita and Life Expectancy",
    x = "GDP per Capita",
    y = "Life Expectancy",
    color = "Income Group",
    size = "Population",
    caption = "Source: World Bank World Development Indicators"
  ) +
  theme_light()

ggplotly(final_plot, tooltip = "text")
# Save cleaned dataset for Tableau

write_csv(world_health_clean, "world_health_tableau.csv")
## Tableau Visualization

For Tableau, I created a world map using the cleaned dataset called `world_health_tableau.csv`.

The Tableau visualization is titled **Global Health and Economic Development Dashboard**.

In Tableau, I used:

Country Name = Location  
Life Expectancy = Color  
Population = Size  
Region = Filter  
IncomeGroup = Filter  
Clean Water Access = Tooltip  
GDP per Capita = Tooltip  
Health Expenditure = Tooltip  

The Tableau map allows viewers to compare life expectancy across countries and explore how public health and economic development indicators differ by region and income group.

Tableau Public Link: paste your Tableau link here.

Final Analysis

This project shows that life expectancy is connected to public health infrastructure and economic development. Countries with better access to clean drinking water, stronger sanitation services, higher healthcare spending, and higher GDP per capita generally tend to have longer life expectancy.
The regression model also supports this pattern. The adjusted R-squared value is about 0.8075, which means the model explains about 80.75% of the variation in life expectancy. Sanitation access, GDP per capita, and health expenditure were statistically significant predictors because their p-values were below 0.05. Clean water access had a positive relationship with life expectancy, but it was not statistically significant in this model.
Overall, the results suggest that improving health infrastructure and economic conditions can help improve life expectancy across countries. One important takeaway is that life expectancy is not only about hospitals or medicine. It is also connected to clean water, sanitation, income, public investment, and global inequality.

Tableau link: https://public.tableau.com/app/profile/mohamed.dione/viz/GlobalHealthandeconomicdevelopment/globalhealthandeconomicdevelopmentdashboard?publish=yes