Dataset Description - Life Expectancy (WHO)

The Life Expectancy (WHO) dataset, collected from the World Health Organization (WHO) and United Nations, provides data on global health and socio-economic indicators influencing life expectancy across countries.

It consists of 2,938 rows and 22 columns, covering factors such as mortality rates, immunization levels, health expenditure, GDP, schooling, and population data for multiple countries over several years. Each record represents a specific country-year combination.

The main objective of this dataset is to analyze how variables like income, healthcare spending, education, and disease prevalence affect life expectancy. It is commonly used for regression, correlation, visualization, and clustering analyses to explore patterns and relationships among global health factors.

Load Required Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(class)

Load Dataset

library(readxl)
WHO <- read_excel("C:/Users/HP/Downloads/Life Expectancy Data Project.xlsx")
View(WHO)
df<-WHO

Performing Exploratory Data Analysis

#1. What is the structure of Dataset?
str(df)
## tibble [2,938 × 22] (S3: tbl_df/tbl/data.frame)
##  $ Country                        : chr [1:2938] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year                           : num [1:2938] 2015 2014 2013 2012 2011 ...
##  $ Status                         : chr [1:2938] "Developing" "Developing" "Developing" "Developing" ...
##  $ Life expectancy                : num [1:2938] 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult Mortality                : num [1:2938] 263 271 268 272 275 279 281 287 295 295 ...
##  $ infant deaths                  : num [1:2938] 62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num [1:2938] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage expenditure         : num [1:2938] 71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis B                    : num [1:2938] 65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : num [1:2938] 1154 492 430 2787 3013 ...
##  $ BMI                            : num [1:2938] 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under-five deaths              : num [1:2938] 83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : num [1:2938] 6 58 62 67 68 66 63 64 63 58 ...
##  $ Total expenditure              : num [1:2938] 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : num [1:2938] 65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV/AIDS                       : num [1:2938] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num [1:2938] 584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num [1:2938] 33736494 327582 31731688 3696958 2978599 ...
##  $ thinness  1-19 years           : num [1:2938] 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness 5-9 years             : num [1:2938] 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income composition of resources: num [1:2938] 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num [1:2938] 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

Interpretation:- The dataset structure reveals 2,938 observations and 22 variables: two character columns (Country and Status) and 20 numeric columns. It contains details like Country, Year, Status, Life Expectancy, GDP, Schooling, and several health indicators, confirming its multi-dimensional nature.
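
Several column names in the str() output contain spaces, hyphens, or slashes (for example `Life expectancy`, `thinness  1-19 years`, `HIV/AIDS`), which is why the later code wraps them in backticks. As an optional sketch, not a step this analysis performs, the names could be converted to syntactic ones. The replacement pattern and the resulting names are illustrative choices.

```r
# Three of the column names as they appear in the str() output above
raw_names <- c("Life expectancy", "thinness  1-19 years", "HIV/AIDS")

# Replace each run of non-alphanumeric characters with a single underscore
clean_names <- gsub("[^A-Za-z0-9]+", "_", trimws(raw_names))

print(clean_names)
```

The rest of this report keeps the original column names, so the backticks remain necessary.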

#2. Identify and remove any missing or duplicate values from the dataset
colSums(is.na(df))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life expectancy 
##                               0                              10 
##                 Adult Mortality                   infant deaths 
##                              10                               0 
##                         Alcohol          percentage expenditure 
##                             194                               0 
##                     Hepatitis B                         Measles 
##                             553                               0 
##                             BMI               under-five deaths 
##                              34                               0 
##                           Polio               Total expenditure 
##                              19                             226 
##                      Diphtheria                        HIV/AIDS 
##                              19                               0 
##                             GDP                      Population 
##                             448                             652 
##            thinness  1-19 years              thinness 5-9 years 
##                              34                              34 
## Income composition of resources                       Schooling 
##                             167                             163
# Drop every row that has a missing value in any column. This matches calling
# drop_na() once per column, because the remaining columns contain no NAs.
df <- df %>% drop_na()
colSums(is.na(df))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life expectancy 
##                               0                               0 
##                 Adult Mortality                   infant deaths 
##                               0                               0 
##                         Alcohol          percentage expenditure 
##                               0                               0 
##                     Hepatitis B                         Measles 
##                               0                               0 
##                             BMI               under-five deaths 
##                               0                               0 
##                           Polio               Total expenditure 
##                               0                               0 
##                      Diphtheria                        HIV/AIDS 
##                               0                               0 
##                             GDP                      Population 
##                               0                               0 
##            thinness  1-19 years              thinness 5-9 years 
##                               0                               0 
## Income composition of resources                       Schooling 
##                               0                               0
sum(duplicated(df)) 
## [1] 0

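Interpretation:- Missing values were detected in several variables, most heavily in Population (652), Hepatitis B (553), and GDP (448). Rows containing any missing value were removed, which leaves 1,649 of the original 2,938 rows (consistent with the 1,647 residual degrees of freedom reported for the simple regression in #13). No duplicate rows were found. Because close to half of the records are dropped, all later results describe this complete-case subset rather than the full dataset.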
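
The complete-case step above can be sketched on a small stand-in table to make the row loss explicit. The data frame `toy` and its values are hypothetical, and base R's complete.cases() is used here in place of drop_na() so the sketch needs no packages.

```r
# Hypothetical toy data standing in for the WHO table (values are made up)
toy <- data.frame(
  Country = c("A", "A", "B", "B", "C"),
  Year    = c(2014, 2015, 2014, 2015, 2015),
  LifeExp = c(70.1, NA, 65.3, 66.0, 80.2),
  GDP     = c(1200, 1250, NA, 900, 41000)
)

# Count missing values per column before removing anything
na_counts <- colSums(is.na(toy))

# Keep only rows with no missing value in any column
toy_complete <- toy[complete.cases(toy), ]

# Report how many rows the filter removed
rows_dropped <- nrow(toy) - nrow(toy_complete)
print(na_counts)
print(rows_dropped)
```

Comparing nrow() before and after the filter is a quick check on how much data a complete-case analysis discards.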

#3.Which country has the highest average life expectancy?
df %>%
  group_by(Country) %>%
  summarise(Average_Life_Expectancy = mean(`Life expectancy`)) %>%
  arrange(desc(Average_Life_Expectancy))%>%
  head(5)
## # A tibble: 5 × 2
##   Country Average_Life_Expectancy
##   <chr>                     <dbl>
## 1 Ireland                    83.4
## 2 Canada                     82.2
## 3 France                     82.2
## 4 Italy                      82.2
## 5 Spain                      82.0

Interpretation:- Ireland has the highest average life expectancy (83.4 years), followed by Canada, France, and Italy (about 82.2 years each) and Spain (82.0 years). These are all high-income countries, which is consistent with better healthcare systems, higher income levels, and strong living standards. The averages are computed only over the rows that remain after removing missing values.
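
Because rows with missing values were removed in #2, each country's average may rest on a different number of years. A minimal sketch of reporting the row count next to the mean follows. The data frame `toy` and the column name `LifeExp` are hypothetical stand-ins.

```r
# Hypothetical toy data: country B has a single remaining year
toy <- data.frame(
  Country = c("A", "A", "A", "B", "C", "C"),
  LifeExp = c(80, 81, 82, 85, 60, 62)
)

# Mean life expectancy and number of contributing rows per country
avg     <- aggregate(LifeExp ~ Country, data = toy, FUN = mean)
n_years <- aggregate(LifeExp ~ Country, data = toy, FUN = length)

# Combine both summaries and sort by the mean, highest first
summary_tbl <- merge(avg, n_years, by = "Country", suffixes = c("_mean", "_n"))
summary_tbl <- summary_tbl[order(-summary_tbl$LifeExp_mean), ]
print(summary_tbl)
```

In this toy example the top-ranked country is backed by one row only, which is the kind of case the extra column makes visible.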

#4.Which country has the highest average GDP?
df %>%
  group_by(Country) %>%
  summarise(Average_GDP = mean(GDP)) %>%
  arrange(desc(Average_GDP)) %>%
  head(5)
## # A tibble: 5 × 2
##   Country     Average_GDP
##   <chr>             <dbl>
## 1 Luxembourg       56727.
## 2 Netherlands      39640.
## 3 Australia        35391.
## 4 Austria          33172.
## 5 Sweden           32232.

Interpretation:- Luxembourg (about 56,727), the Netherlands (about 39,640), and Australia (about 35,391) show the highest average GDP per capita in the cleaned data. High GDP per capita reflects a high level of economic output, which often goes together with better-funded healthcare and longer lifespans.

#5.Which country has the highest average total health expenditure?
df %>%
  group_by(Country) %>%
  summarise(Average_Expenditure = mean(`Total expenditure`)) %>%
  arrange(desc(Average_Expenditure)) %>%
  head(5)
## # A tibble: 5 × 2
##   Country                Average_Expenditure
##   <chr>                                <dbl>
## 1 Sweden                               11.8 
## 2 Bosnia and Herzegovina                9.18
## 3 Greece                                9.04
## 4 Malta                                 8.93
## 5 Australia                             8.84

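Interpretation:- The top countries by average Total expenditure are Sweden (11.8), Bosnia and Herzegovina (9.18), and Greece (9.04). Total expenditure is a percentage rather than an absolute amount, so a high value indicates that health takes a large share of spending, not necessarily that the country spends the most money on health.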

#6.Create a new column – Life Expectancy Category
df <- df %>%
  mutate(LifeExp_Category = case_when(
    `Life expectancy` < 60 ~ "Low",
    `Life expectancy` >= 60 & `Life expectancy` < 75 ~ "Medium",
    `Life expectancy` >= 75 ~ "High"
  ))
View(df)

Interpretation:- The data was categorized into three groups: Low (below 60 years), Medium (60 to under 75 years), and High (75 years and above). Developed nations are expected to fall mostly in the High category, while developing ones tend to cluster in the Medium or Low categories; this can be confirmed with table(df$Status, df$LifeExp_Category).
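
The same boundaries can also be expressed with base R's cut(). This is an alternative sketch, not the method used above, and the test values are chosen only to exercise the boundaries at 60 and 75.

```r
# Values just below and exactly on each boundary
life_exp <- c(59.9, 60, 74.9, 75, 83.4)

# right = FALSE makes each interval closed on the left: [60, 75) is Medium
category <- cut(
  life_exp,
  breaks = c(-Inf, 60, 75, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE
)

print(data.frame(life_exp, category))
```

With right = FALSE, a value of exactly 60 is Medium and a value of exactly 75 is High, matching the case_when() conditions.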

Performing Visualisation

#7.Which countries have the highest average health expenditure?
avg_health_exp <- aggregate(df$`Total expenditure`, by = list(df$Country), FUN = mean)
avg_health_exp <- avg_health_exp[order(avg_health_exp$x, decreasing = TRUE), ]
barplot(avg_health_exp$x[1:10],
        names.arg = avg_health_exp$Group.1[1:10],
        col = "Purple",
        main = "Top 10 Countries by Average Health Expenditure",
        xlab = "Country",
        ylab = "Average Health Expenditure (%)",
        las = 2, # Rotate labels for readability 
        cex.names = 0.8)

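Interpretation:- The bar chart ranks the ten countries with the highest average health expenditure (Total expenditure, a percentage). On its own it does not show life expectancy, so it cannot establish an association between health spending and longevity. The multiple regression in #14 finds no significant effect of Total expenditure once GDP and Schooling are included (p = 0.486).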

#8.How has the global average life expectancy changed over time?
life_trend <- df %>%
  group_by(Year) %>%
  summarise(Average_Life = mean(`Life expectancy`))
plot(life_trend$Year, life_trend$Average_Life, 
     type = "l", 
     col = "Red", 
     main = "Global Trend of Life Expectancy Over Years", 
     xlab = "Year", 
     ylab = "Average Life Expectancy", 
     lwd = 2)

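Interpretation:- The average life expectancy shows an upward trend across years, which is consistent with improvements in global healthcare, disease prevention, and living conditions. This is an unweighted mean over the country-year records that remain after removing missing values, so the set of countries contributing to each year may differ.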

#9.How does the distribution of life expectancy vary between developed and developing countries?
boxplot(`Life expectancy` ~ Status, 
        data = df,
        main = "Life Expectancy Distribution by Status",
        xlab = "Status",
        ylab = "Life Expectancy",
        col = c("Orange", "Brown"),
        border = "Black")

Interpretation:- Developed countries exhibit higher median life expectancy and less variability than developing ones. This is consistent with better economic conditions and medical facilities being associated with longer life spans, although the boxplot alone does not establish causation.
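
The comparison read off the boxplot can be backed with numbers by computing the median and interquartile range per group. The sketch below uses a hypothetical data frame `toy` with made-up values.

```r
# Hypothetical toy data with the two Status levels
toy <- data.frame(
  Status  = c(rep("Developed", 3), rep("Developing", 4)),
  LifeExp = c(78, 80, 82, 55, 65, 70, 75)
)

# Median (centre of each box) and IQR (height of each box) per group
group_median <- tapply(toy$LifeExp, toy$Status, median)
group_iqr    <- tapply(toy$LifeExp, toy$Status, IQR)

print(group_median)
print(group_iqr)
```

A larger IQR for one group corresponds to the taller box in the plot.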

#10.What is the relationship between GDP and life expectancy?
ggplot(df, aes(x = GDP, y = `Life expectancy`)) +
  geom_point(alpha = 0.6, color = "Violet") +
  labs(title = "Relationship between GDP and Life Expectancy",
       x = "GDP per Capita", y = "Life Expectancy")

Interpretation:- The scatter plot shows a positive association between GDP and life expectancy: wealthier nations tend to have higher life expectancies. The simple regression in #13 puts R-squared at about 0.19, so GDP alone accounts for only a modest share of the variation.
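
The strength of such an association can be quantified with correlation coefficients. The sketch below uses made-up values in the hypothetical vectors `gdp` and `life`; trying log(GDP) is an illustrative choice, not something this analysis does.

```r
# Hypothetical values: life expectancy rises quickly at low GDP, then flattens
gdp  <- c(500, 1000, 5000, 20000, 50000)
life <- c(55, 62, 70, 78, 82)

# Pearson correlation on the raw and on the log scale
pearson_raw <- cor(gdp, life)
pearson_log <- cor(log(gdp), life)

# Spearman rank correlation only uses the ordering of the values
spearman <- cor(gdp, life, method = "spearman")

print(c(raw = pearson_raw, log = pearson_log, spearman = spearman))
```

When the relationship is monotone but curved, as in these toy values, the log-scale Pearson and the Spearman coefficients come out higher than the raw Pearson coefficient.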

#11. What proportion of records come from developed vs developing countries?
status_data <- df %>%
  group_by(Status) %>%
  summarise(Count = n())
percent <- round(100 * status_data$Count / sum(status_data$Count), 1)
labels <- paste(status_data$Status, " (", percent, "%)", sep = "")
pie(status_data$Count,
    labels = labels,
    main = "Distribution of Developed vs Developing Countries",
    col = c("skyblue", "lightgreen"),
    border = "Black")

Interpretation:- The dataset is dominated by records from developing countries. Because each row is a country-year combination, the pie chart shows the share of records rather than the share of distinct countries, and the two can differ after rows with missing values are dropped.
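
Since each record is a country-year combination, counting rows and counting distinct countries can give different proportions. A small sketch of the difference follows, using a hypothetical data frame `toy`.

```r
# Hypothetical toy data: country A contributes four rows, B and C one each
toy <- data.frame(
  Country = c("A", "A", "A", "A", "B", "C"),
  Status  = c(rep("Developed", 4), "Developing", "Developing")
)

# Share of records (rows) per status
record_share <- round(100 * prop.table(table(toy$Status)), 1)

# Share of distinct countries per status
country_status <- unique(toy[, c("Country", "Status")])
country_share  <- round(100 * prop.table(table(country_status$Status)), 1)

print(record_share)
print(country_share)
```

Here "Developed" makes up two thirds of the rows but only one third of the countries.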

#12.What is the distribution of Life Expectancy across all countries?
ggplot(df, aes(x = `Life expectancy`)) +
  geom_histogram(binwidth = 2, fill = "Yellow", color = "black") +
  ggtitle("Distribution of Life Expectancy") +
  xlab("Life Expectancy (Years)") +
  ylab("Number of Records (Country-Years)") +
  theme_classic()

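Interpretation:- The histogram shows that most records have life expectancies between 60 and 80 years, with fewer at the extreme ends. Values appear concentrated toward the upper end with a longer tail toward low life expectancy, which makes the distribution left-skewed (negatively skewed) rather than right-skewed. Each count is a country-year record, not a distinct country.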
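
The direction of skew can be checked numerically rather than by eye. Base R has no skewness function, so the sketch below computes the moment-based coefficient by hand on a hypothetical vector `life_exp`.

```r
# Hypothetical values with a long tail toward low life expectancy
life_exp <- c(45, 60, 68, 70, 72, 73, 74, 75, 76, 78)

# Moment-based skewness: third central moment divided by sd cubed
m <- mean(life_exp)
s <- sqrt(mean((life_exp - m)^2))
skewness <- mean((life_exp - m)^3) / s^3

# A mean below the median is another sign of left skew
print(c(skewness = skewness, mean = m, median = median(life_exp)))
```

A negative coefficient indicates left skew and a positive one indicates right skew.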

Performing Predictive Analysis

#13. Does GDP significantly affect Life Expectancy?
life.gdp.lm <- lm(`Life expectancy` ~ GDP, data = df)
summary(life.gdp.lm)
## 
## Call:
## lm(formula = `Life expectancy` ~ GDP, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.084  -3.924   1.675   5.314  21.440 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.742e+01  2.161e-01  311.94   <2e-16 ***
## GDP         3.383e-04  1.695e-05   19.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.896 on 1647 degrees of freedom
## Multiple R-squared:  0.1948, Adjusted R-squared:  0.1943 
## F-statistic: 398.4 on 1 and 1647 DF,  p-value: < 2.2e-16
ggplot(df, aes(x = GDP, y = `Life expectancy`)) +
  geom_point(color = "Green", alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "Red", lwd = 1.2) +
  labs(title = "Regression: Effect of GDP on Life Expectancy",
       x = "GDP per Capita",
       y = "Life Expectancy (Years)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

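Interpretation:- The plot shows a positive relationship between GDP and life expectancy: as GDP increases, life expectancy also tends to rise. The GDP coefficient (3.383e-04 years per unit of GDP, p < 2e-16) is statistically significant, but the R-squared of 0.1948 means GDP alone explains only about 19% of the variation in life expectancy. Economically stronger countries tend to have longer lifespans, though a single-predictor model cannot show that economic growth causes the difference.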
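
To make the size of the GDP coefficient concrete, the fitted line from the summary output above can be evaluated by hand. The intercept and slope are copied from that output. The GDP value of 10,000 is an arbitrary illustration point.

```r
# Coefficients copied from the summary(life.gdp.lm) output above
intercept <- 67.42
slope     <- 3.383e-04

# Change in predicted life expectancy for a 1,000-unit increase in GDP
effect_per_1000 <- slope * 1000

# Predicted life expectancy at an illustrative GDP per capita of 10,000
pred_at_10000 <- intercept + slope * 10000

print(c(effect_per_1000 = effect_per_1000, pred_at_10000 = pred_at_10000))
```

Under this model an extra 1,000 units of GDP per capita corresponds to roughly a third of a year of additional predicted life expectancy.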

#14.How do GDP, Schooling, and Total Expenditure together influence Life Expectancy?
life.multi.lm <- lm(`Life expectancy` ~ GDP + Schooling + `Total expenditure`, data = df)
summary(life.multi.lm)
## 
## Call:
## lm(formula = `Life expectancy` ~ GDP + Schooling + `Total expenditure`, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.2956  -2.9347   0.7472   4.0238  14.6109 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.348e+01  7.453e-01  58.342  < 2e-16 ***
## GDP                  9.974e-05  1.451e-05   6.875 8.75e-12 ***
## Schooling            2.107e+00  6.040e-02  34.890  < 2e-16 ***
## `Total expenditure` -4.593e-02  6.598e-02  -0.696    0.486    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.955 on 1645 degrees of freedom
## Multiple R-squared:  0.5426, Adjusted R-squared:  0.5418 
## F-statistic: 650.5 on 3 and 1645 DF,  p-value: < 2.2e-16
predicted <- predict(life.multi.lm, df)
plot(df$`Life expectancy`, predicted,
     main = "Actual vs Predicted Life Expectancy",
     xlab = "Actual Life Expectancy",
     ylab = "Predicted Life Expectancy",
     pch = 19, col = "Orange")
abline(a = 0, b = 1, col = "Purple", lwd = 2)

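Interpretation:- The scatter plot shows a positive relationship between actual and predicted life expectancy values, with points scattered around the purple reference line (y = x, where the prediction equals the actual value). The model explains about 54% of the variance (R-squared = 0.5426). GDP and Schooling are significant predictors, while Total expenditure is not (p = 0.486), so most of the predictive power comes from Schooling and GDP.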
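
An actual-versus-predicted plot can be complemented with two summary numbers: the squared correlation between actual and predicted values, and the root mean squared error. The sketch below uses R's built-in mtcars data as a stand-in, since the WHO file is not available here. The model `mpg ~ wt + hp` is purely illustrative.

```r
# Stand-in model on a built-in dataset (not the WHO data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

actual    <- mtcars$mpg
predicted <- predict(fit, mtcars)

# For a least-squares fit with an intercept, this equals the model R-squared
r_squared <- cor(actual, predicted)^2

# Typical size of a prediction error, in the units of the response
rmse <- sqrt(mean((actual - predicted)^2))

print(c(r_squared = r_squared, rmse = rmse))
```

Both numbers here are computed on the same data used for fitting, as in the plot above, so they describe in-sample fit rather than performance on unseen data.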

#15.Perform K-Means clustering on the Life Expectancy dataset to group countries into three distinct clusters based on their Life Expectancy, GDP, and Adult Mortality.
# Step 1: Select relevant features for clustering
data_subset <- df[, c("Life expectancy", "GDP", "Adult Mortality")]

# Step 2: Apply K-Means (3 clusters)
set.seed(1)
kmeans_result <- kmeans(data_subset, centers = 3, nstart = 20)

# Step 3: Add cluster information to dataset
df$Cluster <- as.factor(kmeans_result$cluster)

# Step 4: Extract final cluster centroids
centers <- as.data.frame(kmeans_result$centers)
centers$Cluster <- as.factor(1:3)

# Step 5: Compute convex hulls (cluster boundaries; not drawn in the base-R plot below)
hull <- df %>% 
  group_by(Cluster) %>% 
  slice(chull(`Life expectancy`, `GDP`))

# Step 6: Plot the clusters
plot(df$`Life expectancy`, df$`GDP`, 
     col = df$Cluster, 
     pch = 19, 
     xlab = "Life Expectancy", 
     ylab = "GDP", 
     main = "K-Means Clustering (3 Clusters)")

points(centers$`Life expectancy`, centers$`GDP`, 
       col = 1:3, pch = 8, cex = 2)

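Interpretation:- The K-Means clustering divides the records into three groups using Life expectancy, GDP, and Adult Mortality; the plot shows two of these dimensions, GDP against Life expectancy. The green cluster represents countries with low GDP and lower life expectancy, the black cluster shows moderate GDP and life expectancy, while the red cluster indicates high GDP nations with higher life expectancy. Because the variables were not standardized before clustering, GDP, which has by far the largest numeric range, dominates the distance calculation, so the clusters largely correspond to GDP bands. The result is consistent with a positive relationship between a country’s economic status and the average life expectancy of its population.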
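
K-Means relies on Euclidean distance, so a variable with a much larger numeric range (here GDP) outweighs the others unless the inputs are standardized first. The sketch below shows that variant on simulated data; it is an alternative to the clustering above, not what was run. The data frame `toy`, its column names, and all group means are hypothetical.

```r
set.seed(1)

# Simulated stand-in data: three groups of 20 records each (values are made up)
toy <- data.frame(
  LifeExp   = c(rnorm(20, 55, 2),     rnorm(20, 70, 2),      rnorm(20, 80, 2)),
  GDP       = c(rnorm(20, 1000, 200), rnorm(20, 8000, 1000), rnorm(20, 40000, 4000)),
  Mortality = c(rnorm(20, 300, 15),   rnorm(20, 180, 15),    rnorm(20, 80, 15))
)

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(toy)

# Cluster on the standardized values so every variable carries equal weight
km <- kmeans(scaled, centers = 3, nstart = 20)

cluster_sizes <- table(km$cluster)
print(cluster_sizes)
```

On the real data, the scaled result may group countries differently from the unscaled clustering shown above, since Life expectancy and Adult Mortality would then influence the assignment as much as GDP.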