1 About This Dataset

The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status and many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset on life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the critical factors that are most representative were chosen. It has been observed that in the past 15 years there has been huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, in this project we have considered data from 2000 to 2015 for 193 countries for further analysis.

The individual data files were merged into a single dataset. Initial visual inspection of the data showed some missing values. As the datasets were from WHO, we found no evident errors. Missing data was handled in R using the missmap command. The result indicated that most of the missing data was for Population, Hepatitis B and GDP. The missing data came from lesser-known countries such as Vanuatu, Tonga, Togo and Cabo Verde. Finding all data for these countries was difficult, so it was decided to exclude these countries from the final model dataset.

The final merged file (final dataset) consists of 22 columns and 2938 rows, which means 20 predicting variables. All predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.

Source : Kaggle https://www.kaggle.com/code/gauravks13/eda-life-expactancy

2 Read Data, Data Cleansing and Coercion

2.1 Read Data

First, we read the data and store it in a variable named life_expectancy.

life_expectancy <- read.csv("data_input/life_expectancy.csv")

2.2 Data Inspection

head(life_expectancy)
tail(life_expectancy)
dim(life_expectancy)
## [1] 2938   22

The data has 2938 rows and 22 columns.

2.3 Data Cleansing & Coercion

Check the data type of each column.

library(dplyr)

glimpse(life_expectancy)
## Rows: 2,938
## Columns: 22
## $ Country                         <chr> "Afghanistan", "Afghanistan", "Afghani~
## $ Year                            <int> 2015, 2014, 2013, 2012, 2011, 2010, 20~
## $ Status                          <chr> "Developing", "Developing", "Developin~
## $ Life.expectancy                 <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58~
## $ Adult.Mortality                 <int> 263, 271, 268, 272, 275, 279, 281, 287~
## $ infant.deaths                   <int> 62, 64, 66, 69, 71, 74, 77, 80, 82, 84~
## $ Alcohol                         <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.~
## $ percentage.expenditure          <dbl> 71.279624, 73.523582, 73.219243, 78.18~
## $ Hepatitis.B                     <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 64~
## $ Measles                         <int> 1154, 492, 430, 2787, 3013, 1989, 2861~
## $ BMI                             <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16~
## $ under.five.deaths               <int> 83, 86, 89, 93, 97, 102, 106, 110, 113~
## $ Polio                           <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, 58,~
## $ Total.expenditure               <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.~
## $ Diphtheria                      <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 58~
## $ HIV.AIDS                        <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1~
## $ GDP                             <dbl> 584.25921, 612.69651, 631.74498, 669.9~
## $ Population                      <dbl> 33736494, 327582, 31731688, 3696958, 2~
## $ thinness..1.19.years            <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4, 18~
## $ thinness.5.9.years              <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18~
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4~
## $ Schooling                       <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8~

Not every column has the correct type: Status is stored as character although it is categorical, so we convert it to a factor (data coercion).

life_expectancy <- life_expectancy %>% 
  mutate(Status = as.factor(Status))


glimpse(life_expectancy)
## Rows: 2,938
## Columns: 22
## $ Country                         <chr> "Afghanistan", "Afghanistan", "Afghani~
## $ Year                            <int> 2015, 2014, 2013, 2012, 2011, 2010, 20~
## $ Status                          <fct> Developing, Developing, Developing, De~
## $ Life.expectancy                 <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58~
## $ Adult.Mortality                 <int> 263, 271, 268, 272, 275, 279, 281, 287~
## $ infant.deaths                   <int> 62, 64, 66, 69, 71, 74, 77, 80, 82, 84~
## $ Alcohol                         <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.~
## $ percentage.expenditure          <dbl> 71.279624, 73.523582, 73.219243, 78.18~
## $ Hepatitis.B                     <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 64~
## $ Measles                         <int> 1154, 492, 430, 2787, 3013, 1989, 2861~
## $ BMI                             <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16~
## $ under.five.deaths               <int> 83, 86, 89, 93, 97, 102, 106, 110, 113~
## $ Polio                           <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, 58,~
## $ Total.expenditure               <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.~
## $ Diphtheria                      <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 58~
## $ HIV.AIDS                        <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1~
## $ GDP                             <dbl> 584.25921, 612.69651, 631.74498, 669.9~
## $ Population                      <dbl> 33736494, 327582, 31731688, 3696958, 2~
## $ thinness..1.19.years            <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4, 18~
## $ thinness.5.9.years              <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18~
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4~
## $ Schooling                       <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8~

Each column now has the desired data type.

Next, we check for missing values.

anyNA(life_expectancy)
## [1] TRUE
colSums(is.na(life_expectancy))
##                         Country                            Year 
##                               0                               0 
##                          Status                 Life.expectancy 
##                               0                              10 
##                 Adult.Mortality                   infant.deaths 
##                              10                               0 
##                         Alcohol          percentage.expenditure 
##                             194                               0 
##                     Hepatitis.B                         Measles 
##                             553                               0 
##                             BMI               under.five.deaths 
##                              34                               0 
##                           Polio               Total.expenditure 
##                              19                             226 
##                      Diphtheria                        HIV.AIDS 
##                              19                               0 
##                             GDP                      Population 
##                             448                             652 
##            thinness..1.19.years              thinness.5.9.years 
##                              34                              34 
## Income.composition.of.resources                       Schooling 
##                             167                             163
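Raw counts are easier to compare when expressed as a share of all rows. The following is a minimal sketch in base R on a small made-up data frame (toy_df and its values are hypothetical, not taken from the dataset); the same two lines could be pointed at life_expectancy instead.

```r
# Hypothetical toy data standing in for life_expectancy
toy_df <- data.frame(
  Country    = c("A", "A", "B", "B"),
  GDP        = c(100, NA, 250, NA),
  Population = c(NA, NA, NA, 5000),
  Schooling  = c(10, 11, 12, 13)
)

# Share of missing values per column, in percent, largest first
missing_pct <- sort(colMeans(is.na(toy_df)) * 100, decreasing = TRUE)
print(missing_pct)
```

Sorting puts the columns that would cost the most rows under na.omit() at the top.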

We found missing values in the life expectancy data. We remove every row that contains a missing value and store the result in a variable named life_expectancy_clean.

life_expectancy_clean <- na.omit(life_expectancy)
anyNA(life_expectancy_clean)
## [1] FALSE

The data is now clean: 1649 of the original 2938 rows remain.
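Since na.omit() drops a whole row when any single column is missing, it can remove entire countries. A rough way to see the cost, sketched here on hypothetical toy data (toy_df, its columns and values are assumptions for illustration only):

```r
# Hypothetical toy data: country B lacks GDP, country C lacks Life
toy_df <- data.frame(
  Country = c("A", "A", "B", "B", "C"),
  GDP     = c(100, 110, NA, NA, 300),
  Life    = c(70, 71, 60, 61, NA)
)

toy_clean <- na.omit(toy_df)

# Number of rows removed by na.omit()
rows_dropped <- nrow(toy_df) - nrow(toy_clean)

# Countries that no longer appear at all after cleaning
countries_lost <- setdiff(unique(toy_df$Country), unique(toy_clean$Country))

print(rows_dropped)
print(countries_lost)
```

Running the same comparison between life_expectancy and life_expectancy_clean would show which countries the later analysis no longer covers.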

3 Data Explanation

summary(life_expectancy_clean)
##    Country               Year             Status     Life.expectancy
##  Length:1649        Min.   :2000   Developed : 242   Min.   :44.0   
##  Class :character   1st Qu.:2005   Developing:1407   1st Qu.:64.4   
##  Mode  :character   Median :2008                     Median :71.7   
##                     Mean   :2008                     Mean   :69.3   
##                     3rd Qu.:2011                     3rd Qu.:75.0   
##                     Max.   :2015                     Max.   :89.0   
##  Adult.Mortality infant.deaths        Alcohol       percentage.expenditure
##  Min.   :  1.0   Min.   :   0.00   Min.   : 0.010   Min.   :    0.00      
##  1st Qu.: 77.0   1st Qu.:   1.00   1st Qu.: 0.810   1st Qu.:   37.44      
##  Median :148.0   Median :   3.00   Median : 3.790   Median :  145.10      
##  Mean   :168.2   Mean   :  32.55   Mean   : 4.533   Mean   :  698.97      
##  3rd Qu.:227.0   3rd Qu.:  22.00   3rd Qu.: 7.340   3rd Qu.:  509.39      
##  Max.   :723.0   Max.   :1600.00   Max.   :17.870   Max.   :18961.35      
##   Hepatitis.B       Measles            BMI        under.five.deaths
##  Min.   : 2.00   Min.   :     0   Min.   : 2.00   Min.   :   0.00  
##  1st Qu.:74.00   1st Qu.:     0   1st Qu.:19.50   1st Qu.:   1.00  
##  Median :89.00   Median :    15   Median :43.70   Median :   4.00  
##  Mean   :79.22   Mean   :  2224   Mean   :38.13   Mean   :  44.22  
##  3rd Qu.:96.00   3rd Qu.:   373   3rd Qu.:55.80   3rd Qu.:  29.00  
##  Max.   :99.00   Max.   :131441   Max.   :77.10   Max.   :2100.00  
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.740    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:81.00   1st Qu.: 4.410    1st Qu.:82.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.840    Median :92.00   Median : 0.100  
##  Mean   :83.56   Mean   : 5.956    Mean   :84.16   Mean   : 1.984  
##  3rd Qu.:97.00   3rd Qu.: 7.470    3rd Qu.:97.00   3rd Qu.: 0.700  
##  Max.   :99.00   Max.   :14.390    Max.   :99.00   Max.   :50.600  
##       GDP              Population         thinness..1.19.years
##  Min.   :     1.68   Min.   :        34   Min.   : 0.100      
##  1st Qu.:   462.15   1st Qu.:    191897   1st Qu.: 1.600      
##  Median :  1592.57   Median :   1419631   Median : 3.000      
##  Mean   :  5566.03   Mean   :  14653626   Mean   : 4.851      
##  3rd Qu.:  4718.51   3rd Qu.:   7658972   3rd Qu.: 7.100      
##  Max.   :119172.74   Max.   :1293859294   Max.   :27.200      
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.100     Min.   :0.0000                  Min.   : 4.20  
##  1st Qu.: 1.700     1st Qu.:0.5090                  1st Qu.:10.30  
##  Median : 3.200     Median :0.6730                  Median :12.30  
##  Mean   : 4.908     Mean   :0.6316                  Mean   :12.12  
##  3rd Qu.: 7.100     3rd Qu.:0.7510                  3rd Qu.:14.00  
##  Max.   :28.200     Max.   :0.9360                  Max.   :20.70

Summary

  • Average life expectancy: 69.3 years
  • Minimum life expectancy: 44 years
  • Lowest Gross Domestic Product (GDP) per capita: 1.68
  • 1407 of the 1649 rows (country-year observations, not countries) belong to developing countries
  • Largest population: 1,293,859,294

Check Outliers in life expectancy

# Mean, variance and standard deviation of life expectancy per country,
# each listed from largest to smallest
lf <- aggregate(Life.expectancy ~ Country, life_expectancy_clean, mean)
head(lf[order(lf$Life.expectancy, decreasing = TRUE), ])
lf1 <- aggregate(Life.expectancy ~ Country, life_expectancy_clean, var)
head(lf1[order(lf1$Life.expectancy, decreasing = TRUE), ])
lf2 <- aggregate(Life.expectancy ~ Country, life_expectancy_clean, sd)
head(lf2[order(lf2$Life.expectancy, decreasing = TRUE), ])
# Boxplot of all observations to spot outliers visually
boxplot(life_expectancy_clean$Life.expectancy)

The per-country standard deviations are small enough, so we continue with the analysis.
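A boxplot marks outliers visually; the points it flags can also be listed directly with boxplot.stats(), which applies the same 1.5 × IQR rule. A small sketch on a made-up vector (toy_le is hypothetical, not values from the dataset):

```r
# Hypothetical life expectancy values with one unusually low observation
toy_le <- c(62, 64, 65, 66, 67, 68, 70, 71, 72, 40)

# boxplot.stats() returns the five-number summary and the flagged points
stats <- boxplot.stats(toy_le)

# Values lying more than 1.5 * IQR beyond the hinges
outliers <- stats$out
print(outliers)
```

Applied to life_expectancy_clean$Life.expectancy, this would give the exact observations drawn as individual points in the boxplot.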

4 Data Manipulation & Transformation

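  1. Which 10 countries have the highest average life expectancy?
# life_expectancy still contains missing values, so na.rm = TRUE keeps
# a single NA from turning a country's mean into NA
top10 <- life_expectancy %>% 
  group_by(Country) %>% 
  summarise(Average.Life.Expectancy = mean(Life.expectancy, na.rm = TRUE)) %>% 
  ungroup() 

head(top10 %>%  arrange(desc(Average.Life.Expectancy)), 10)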
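This query reads the raw life_expectancy table rather than the cleaned one, so missing values matter: mean() returns NA for any group containing a missing value unless na.rm = TRUE is passed. A minimal base R sketch on hypothetical toy data (toy_df and its values are made up for illustration):

```r
# Hypothetical toy data: country B has one missing year
toy_df <- data.frame(
  Country         = c("A", "A", "B", "B"),
  Life.expectancy = c(80, 82, 70, NA)
)

# Without na.rm, one missing year makes the whole country mean NA
strict <- tapply(toy_df$Life.expectancy, toy_df$Country, mean)

# With na.rm = TRUE, the mean is taken over the available years only
lenient <- tapply(toy_df$Life.expectancy, toy_df$Country, mean, na.rm = TRUE)

print(strict)
print(lenient)
```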
  2. How does life expectancy move each year?
library(ggplot2)
library(plotly)

leByYear <- life_expectancy_clean %>% 
  group_by(Year) %>% 
  summarise(Average = mean(Life.expectancy)) %>% 
  ungroup()

ggplotly(ggplot(data = leByYear, aes(x = Year, y = Average))+
  geom_line())

From 2001 to 2003 average life expectancy fell sharply, and from 2003 to 2015 it rose again.
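The size of each yearly movement can be read off the line chart, or computed with diff(). A short sketch on made-up yearly averages (toy_by_year and its numbers are hypothetical, not the values behind the plot):

```r
# Hypothetical yearly averages, in the same shape as leByYear
toy_by_year <- data.frame(
  Year    = 2000:2004,
  Average = c(68.0, 68.5, 67.9, 67.2, 68.1)
)

# Year-over-year change; the first year has no predecessor, hence NA
toy_by_year$Change <- c(NA, diff(toy_by_year$Average))

# Row with the largest single-year drop (which.min ignores the NA)
largest_drop <- toy_by_year[which.min(toy_by_year$Change), ]
print(toy_by_year)
print(largest_drop)
```

The same Change column added to leByYear would put a number on how large the fall and the later recovery are.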

  3. How does GDP affect life expectancy?
cor(life_expectancy_clean$Life.expectancy, life_expectancy_clean$GDP)
## [1] 0.4413218
ggplotly((ggplot(data = life_expectancy_clean, aes(x = GDP, y = Life.expectancy)))+
   geom_point(shape=18, color="blue")+
        geom_smooth(method=lm, linetype="dashed",color="darkred", fill="blue")+
        labs(title = "Scatter Plot Relationship between GDP and Life Expectancy")+
        theme_light())

GDP is positively correlated with life expectancy (r ≈ 0.44), but the correlation is only moderate.
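Whether a correlation is statistically significant is a separate question from how strong it is, and cor() alone does not answer it. One way to check is cor.test(), sketched here on deterministic made-up data (x, y and the noise pattern are hypothetical and chosen only to give a moderate correlation):

```r
# Hypothetical data: a linear trend plus a repeating, fairly large disturbance
x <- 1:30
y <- 2 * x + rep(c(-40, 0, 40), times = 10)

# Pearson correlation with a test of the null hypothesis r = 0
res <- cor.test(x, y)

r <- unname(res$estimate)  # strength of the linear association
p <- res$p.value           # evidence against "no correlation"

print(r)
print(p)
```

A moderate r can still come with a very small p-value, especially with many observations; with 1649 rows in life_expectancy_clean, cor.test() on Life.expectancy and GDP would report both quantities.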

  4. How does schooling affect life expectancy?
cor(life_expectancy_clean$Life.expectancy, life_expectancy_clean$Schooling)
## [1] 0.72763
ggplotly((ggplot(data = life_expectancy_clean, aes(x = Schooling, y = Life.expectancy)))+
   geom_point(shape=18, color="blue")+
        geom_smooth(method=lm, linetype="dashed",color="darkred", fill="blue")+
        labs(title = "Scatter Plot Relationship between Schooling and Life Expectancy")+
        theme_light())

Schooling is positively correlated with life expectancy (r ≈ 0.73): more years of schooling go together with higher life expectancy.

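  5. Which 5 countries have the highest adult mortality?
# Adult.Mortality is summed over all years per country; na.rm = TRUE
# skips missing values in the raw data
top5am <- life_expectancy %>% 
  group_by(Country) %>% 
  summarise(Sum.of.Adult.Mortality = sum(Adult.Mortality, na.rm = TRUE)) %>% 
  ungroup()

head(top5am %>%  arrange(desc(Sum.of.Adult.Mortality)), 5)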
  6. What variables are strongly correlated with life expectancy?
library(GGally)

ggcorr(life_expectancy_clean, label = TRUE, size = 3)

The variables that are strongly correlated with life expectancy are Schooling, Income Composition and Diphtheria
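Reading a correlation heatmap by eye can miss strong negative relationships. As a complement, the correlations with the target can be ranked by absolute value. A base R sketch on hypothetical toy data (toy_df and all its values are made up for illustration):

```r
# Hypothetical toy data with one non-numeric column
toy_df <- data.frame(
  Life.expectancy = c(60, 65, 70, 75, 80),
  Schooling       = c(8, 10, 11, 13, 15),
  Adult.Mortality = c(300, 250, 210, 150, 100),
  Measles         = c(5, 1, 4, 2, 3),
  Status          = c("Developing", "Developing", "Developing",
                      "Developed", "Developed")
)

# Keep numeric columns only, then take the target's column of the matrix
num_df <- toy_df[sapply(toy_df, is.numeric)]
cors <- cor(num_df, use = "pairwise.complete.obs")[, "Life.expectancy"]

# Drop the target itself and rank by strength, ignoring the sign
cors <- cors[names(cors) != "Life.expectancy"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

In this toy example the strongest relationship is negative, which a list of positive correlates alone would not show.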

Conclusion

  • Japan is the country with the highest average life expectancy
  • Lesotho has the highest adult mortality, with a summed Adult.Mortality of 8801 across all years (the column is a rate, so this is not a death count)
  • GDP is positively correlated with life expectancy (r ≈ 0.44), but only moderately
  • From 2001 to 2003 average life expectancy fell sharply, and from 2003 to 2015 it rose again
  • Schooling is positively correlated with life expectancy (r ≈ 0.73): more years of schooling go together with higher life expectancy
  • The variables that are strongly correlated with life expectancy are Schooling, Income Composition and Diphtheria