The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status and many related factors for all countries. The datasets are made available to the public for health data analysis. The dataset on life expectancy and health factors for 193 countries was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. It has been observed that the past 15 years have seen major development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, this project considers data from 2000 to 2015 for 193 countries for further analysis.
The individual data files were merged into a single dataset. Initial visual inspection of the data showed some missing values. As the datasets were from WHO, we found no evident errors. Missing data was handled in R using the missmap command. The result indicated that most of the missing data was for Population, Hepatitis B and GDP. The missing data came from lesser-known countries such as Vanuatu, Tonga, Togo and Cabo Verde. Finding all data for these countries was difficult, so it was decided to exclude these countries from the final model dataset. The final merged file (final dataset) consists of 22 columns and 2938 rows, which meant 20 predicting variables. All predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.
Source : Kaggle https://www.kaggle.com/code/gauravks13/eda-life-expactancy
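The description mentions locating the missing data and excluding the countries responsible for it. The following is a minimal sketch of that idea in base R, not the original authors' code. The data frame `toy` and its values are invented for illustration; only the column names echo the real dataset.

```r
# Toy stand-in for the merged WHO data; all values are invented.
toy <- data.frame(
  Country    = c("A", "A", "B", "B", "C", "C"),
  Year       = c(2014, 2015, 2014, 2015, 2014, 2015),
  GDP        = c(500, 520, NA, NA, 900, 950),
  Population = c(1e6, NA, NA, NA, 3e6, 3.1e6)
)

# Missing values per column (a numeric counterpart to a missingness map).
na_per_column <- colSums(is.na(toy))

# Missing cells per country, to see which countries drive the gaps.
na_per_country <- tapply(rowSums(is.na(toy)), toy$Country, sum)

# Keep only countries that have no missing cells at all.
complete_countries <- names(na_per_country)[na_per_country == 0]
toy_complete <- toy[toy$Country %in% complete_countries, ]

print(na_per_column)
print(na_per_country)
print(toy_complete)
```

Dropping whole countries keeps each remaining country's time series intact, at the cost of discarding countries that are missing only a few values.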
First, we will read the data and store it in a variable named life_expectancy.
life_expectancy <- read.csv("data_input/life_expectancy.csv")
head(life_expectancy)
tail(life_expectancy)
dim(life_expectancy)
## [1] 2938 22
The data has 2938 rows and 22 columns.
Check the data type of each column:
library(dplyr)
glimpse(life_expectancy)
## Rows: 2,938
## Columns: 22
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghani~
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 20~
## $ Status <chr> "Developing", "Developing", "Developin~
## $ Life.expectancy <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58~
## $ Adult.Mortality <int> 263, 271, 268, 272, 275, 279, 281, 287~
## $ infant.deaths <int> 62, 64, 66, 69, 71, 74, 77, 80, 82, 84~
## $ Alcohol <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.~
## $ percentage.expenditure <dbl> 71.279624, 73.523582, 73.219243, 78.18~
## $ Hepatitis.B <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 64~
## $ Measles <int> 1154, 492, 430, 2787, 3013, 1989, 2861~
## $ BMI <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16~
## $ under.five.deaths <int> 83, 86, 89, 93, 97, 102, 106, 110, 113~
## $ Polio <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, 58,~
## $ Total.expenditure <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.~
## $ Diphtheria <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 58~
## $ HIV.AIDS <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1~
## $ GDP <dbl> 584.25921, 612.69651, 631.74498, 669.9~
## $ Population <dbl> 33736494, 327582, 31731688, 3696958, 2~
## $ thinness..1.19.years <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4, 18~
## $ thinness.5.9.years <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18~
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4~
## $ Schooling <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8~
Some columns are not stored in the correct data type, so we need to convert them (data coercion). Here, Status is converted from character to factor.
life_expectancy <- life_expectancy %>%
mutate(Status = as.factor(Status))
glimpse(life_expectancy)
## Rows: 2,938
## Columns: 22
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghani~
## $ Year <int> 2015, 2014, 2013, 2012, 2011, 2010, 20~
## $ Status <fct> Developing, Developing, Developing, De~
## $ Life.expectancy <dbl> 65.0, 59.9, 59.9, 59.5, 59.2, 58.8, 58~
## $ Adult.Mortality <int> 263, 271, 268, 272, 275, 279, 281, 287~
## $ infant.deaths <int> 62, 64, 66, 69, 71, 74, 77, 80, 82, 84~
## $ Alcohol <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.~
## $ percentage.expenditure <dbl> 71.279624, 73.523582, 73.219243, 78.18~
## $ Hepatitis.B <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 64~
## $ Measles <int> 1154, 492, 430, 2787, 3013, 1989, 2861~
## $ BMI <dbl> 19.1, 18.6, 18.1, 17.6, 17.2, 16.7, 16~
## $ under.five.deaths <int> 83, 86, 89, 93, 97, 102, 106, 110, 113~
## $ Polio <int> 6, 58, 62, 67, 68, 66, 63, 64, 63, 58,~
## $ Total.expenditure <dbl> 8.16, 8.18, 8.13, 8.52, 7.87, 9.20, 9.~
## $ Diphtheria <int> 65, 62, 64, 67, 68, 66, 63, 64, 63, 58~
## $ HIV.AIDS <dbl> 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1~
## $ GDP <dbl> 584.25921, 612.69651, 631.74498, 669.9~
## $ Population <dbl> 33736494, 327582, 31731688, 3696958, 2~
## $ thinness..1.19.years <dbl> 17.2, 17.5, 17.7, 17.9, 18.2, 18.4, 18~
## $ thinness.5.9.years <dbl> 17.3, 17.5, 17.7, 18.0, 18.2, 18.4, 18~
## $ Income.composition.of.resources <dbl> 0.479, 0.476, 0.470, 0.463, 0.454, 0.4~
## $ Schooling <dbl> 10.1, 10.0, 9.9, 9.8, 9.5, 9.2, 8.9, 8~
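To see what the coercion step does, here is a small self-contained sketch in base R. The vector `status` is a made-up example, not the real Status column.

```r
# Made-up character vector standing in for the Status column.
status <- c("Developing", "Developed", "Developing", "Developing")

# as.factor() stores the distinct values as levels (sorted alphabetically).
status_f <- as.factor(status)

print(is.factor(status_f))   # confirm the type changed
print(levels(status_f))      # the distinct categories
print(table(status_f))       # number of observations per category
```

Storing a categorical column as a factor lets functions such as summary() report counts per category instead of just "character".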
Each column now has the desired data type.
Next, we need to check for missing values.
anyNA(life_expectancy)
## [1] TRUE
colSums(is.na(life_expectancy))
## Country Year
## 0 0
## Status Life.expectancy
## 0 10
## Adult.Mortality infant.deaths
## 10 0
## Alcohol percentage.expenditure
## 194 0
## Hepatitis.B Measles
## 553 0
## BMI under.five.deaths
## 34 0
## Polio Total.expenditure
## 19 226
## Diphtheria HIV.AIDS
## 19 0
## GDP Population
## 448 652
## thinness..1.19.years thinness.5.9.years
## 34 34
## Income.composition.of.resources Schooling
## 167 163
We found missing values in the life expectancy data. We will remove every row that contains a missing value and store the result in a variable named life_expectancy_clean.
life_expectancy_clean <- na.omit(life_expectancy)
anyNA(life_expectancy_clean)
## [1] FALSE
The data is now clean: 1649 of the original 2938 rows remain.
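Because na.omit() drops every row with at least one missing value, it is worth checking how many rows are lost and from which countries. The sketch below shows one way to do that on an invented data frame `toy`; it is an illustration, not part of the original analysis.

```r
# Invented data; only the column names echo the real dataset.
toy <- data.frame(
  Country         = c("A", "A", "B", "B", "C"),
  Life.expectancy = c(70.1, 70.5, NA, 61.2, 80.3),
  GDP             = c(500, NA, 300, 310, 900)
)

# Row-wise deletion, as in the notebook.
toy_clean <- na.omit(toy)

# How many rows were removed in total?
rows_dropped <- nrow(toy) - nrow(toy_clean)

# na.omit() keeps exactly the rows that complete.cases() flags as complete.
same_rows <- identical(rownames(toy_clean),
                       rownames(toy)[complete.cases(toy)])

# Rows removed per country (fixed levels keep the two tables aligned).
countries <- unique(toy$Country)
dropped_by_country <- table(factor(toy$Country, levels = countries)) -
  table(factor(toy_clean$Country, levels = countries))

print(rows_dropped)
print(same_rows)
print(dropped_by_country)
```

The same accounting on the real data gives 2938 - 1649 = 1289 removed rows, so later results describe the complete-case subset rather than all countries and years.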
summary(life_expectancy_clean)
## Country Year Status Life.expectancy
## Length:1649 Min. :2000 Developed : 242 Min. :44.0
## Class :character 1st Qu.:2005 Developing:1407 1st Qu.:64.4
## Mode :character Median :2008 Median :71.7
## Mean :2008 Mean :69.3
## 3rd Qu.:2011 3rd Qu.:75.0
## Max. :2015 Max. :89.0
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.00 Min. : 0.010 Min. : 0.00
## 1st Qu.: 77.0 1st Qu.: 1.00 1st Qu.: 0.810 1st Qu.: 37.44
## Median :148.0 Median : 3.00 Median : 3.790 Median : 145.10
## Mean :168.2 Mean : 32.55 Mean : 4.533 Mean : 698.97
## 3rd Qu.:227.0 3rd Qu.: 22.00 3rd Qu.: 7.340 3rd Qu.: 509.39
## Max. :723.0 Max. :1600.00 Max. :17.870 Max. :18961.35
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 2.00 Min. : 0 Min. : 2.00 Min. : 0.00
## 1st Qu.:74.00 1st Qu.: 0 1st Qu.:19.50 1st Qu.: 1.00
## Median :89.00 Median : 15 Median :43.70 Median : 4.00
## Mean :79.22 Mean : 2224 Mean :38.13 Mean : 44.22
## 3rd Qu.:96.00 3rd Qu.: 373 3rd Qu.:55.80 3rd Qu.: 29.00
## Max. :99.00 Max. :131441 Max. :77.10 Max. :2100.00
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.740 Min. : 2.00 Min. : 0.100
## 1st Qu.:81.00 1st Qu.: 4.410 1st Qu.:82.00 1st Qu.: 0.100
## Median :93.00 Median : 5.840 Median :92.00 Median : 0.100
## Mean :83.56 Mean : 5.956 Mean :84.16 Mean : 1.984
## 3rd Qu.:97.00 3rd Qu.: 7.470 3rd Qu.:97.00 3rd Qu.: 0.700
## Max. :99.00 Max. :14.390 Max. :99.00 Max. :50.600
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. : 34 Min. : 0.100
## 1st Qu.: 462.15 1st Qu.: 191897 1st Qu.: 1.600
## Median : 1592.57 Median : 1419631 Median : 3.000
## Mean : 5566.03 Mean : 14653626 Mean : 4.851
## 3rd Qu.: 4718.51 3rd Qu.: 7658972 3rd Qu.: 7.100
## Max. :119172.74 Max. :1293859294 Max. :27.200
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.100 Min. :0.0000 Min. : 4.20
## 1st Qu.: 1.700 1st Qu.:0.5090 1st Qu.:10.30
## Median : 3.200 Median :0.6730 Median :12.30
## Mean : 4.908 Mean :0.6316 Mean :12.12
## 3rd Qu.: 7.100 3rd Qu.:0.7510 3rd Qu.:14.00
## Max. :28.200 Max. :0.9360 Max. :20.70
Summary
Check Outliers in life expectancy
lf <- aggregate(Life.expectancy ~ Country, life_expectancy_clean, mean)
head(lf[order(lf$Life.expectancy, decreasing = TRUE), ])
lf1 <- aggregate(Life.expectancy ~ Country, life_expectancy_clean, var)
head(lf1[order(lf1$Life.expectancy, decreasing = TRUE), ])
lf2 <- aggregate(Life.expectancy ~ Country, life_expectancy_clean, sd)
head(lf2[order(lf2$Life.expectancy, decreasing = TRUE), ])
boxplot(life_expectancy_clean$Life.expectancy)
The per-country standard deviations are small enough, so we continue with the analysis.
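A boxplot shows outliers visually. To list the outlying values explicitly, one option is boxplot.stats(), which applies the same 1.5 x IQR rule that boxplot() uses by default. The vector `x` below is invented for illustration.

```r
# Invented life expectancy values with one unusually low observation.
x <- c(60, 62, 64, 65, 66, 68, 70, 30)

# boxplot.stats() returns the five-number summary and the points that
# fall more than 1.5 times the hinge spread beyond the hinges.
stats <- boxplot.stats(x)
outliers <- stats$out

# Positions of the flagged values in the original vector.
outlier_idx <- which(x %in% outliers)

print(stats$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(outliers)
print(outlier_idx)
```

On the real data the same call could be applied to life_expectancy_clean$Life.expectancy. Whether a flagged value is an error or a genuine extreme still needs a manual look.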
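# Average life expectancy per country.
# Note: this uses the raw data, so mean() returns NA for any country
# with a missing Life.expectancy value; arrange(desc()) sorts NAs last.
top10 <- life_expectancy %>%
  group_by(Country) %>%
  summarise(Average.Life.Expectancy = mean(Life.expectancy)) %>%
  ungroup()
head(top10 %>% arrange(desc(Average.Life.Expectancy)), 10)
library(ggplot2)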
library(plotly)
leByYear <- life_expectancy_clean %>%
group_by(Year) %>%
summarise(Average = mean(Life.expectancy)) %>%
ungroup()
ggplotly(ggplot(data = leByYear, aes(x = Year, y = Average))+
geom_line())
From 2001 to 2003, average life expectancy fell sharply; from 2003 to 2015, it rose steadily.
cor(life_expectancy_clean$Life.expectancy, life_expectancy_clean$GDP)
## [1] 0.4413218
ggplotly(ggplot(data = life_expectancy_clean, aes(x = GDP, y = Life.expectancy)) +
  geom_point(shape = 18, color = "blue") +
  geom_smooth(method = "lm", linetype = "dashed", color = "darkred", fill = "blue") +
  labs(title = "Scatter Plot: Relationship between GDP and Life Expectancy") +
  theme_light())
GDP is positively correlated with life expectancy, but the correlation is only moderate (r ≈ 0.44).
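The strength of a correlation and its statistical significance are separate questions, and cor() alone reports only the former. The sketch below shows how significance could be checked with cor.test(). The vectors `gdp` and `life_exp` are invented. The second half reuses the r and n reported in this notebook and assumes the rows are independent observations, which is debatable for repeated yearly measurements of the same countries.

```r
# Toy values, invented for illustration (not taken from the WHO data).
gdp      <- c(400, 800, 1500, 3000, 6000, 12000, 25000, 50000)
life_exp <- c(55, 60, 62, 68, 70, 74, 78, 80)

# cor.test() reports the Pearson correlation together with a p-value.
res <- cor.test(gdp, life_exp)
print(res$estimate)
print(res$p.value)

# The same t statistic can be computed by hand from r and n.
# r and n are the values reported in this notebook
# (cor = 0.4413218, 1649 rows after na.omit).
r <- 0.4413218
n <- 1649
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
print(t_stat)
print(p_value)
```

With a sample this large, even a moderate correlation produces a very small p-value, so "moderate" describes the GDP relationship more accurately than "not significant".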
cor(life_expectancy_clean$Life.expectancy, life_expectancy_clean$Schooling)
## [1] 0.72763
ggplotly(ggplot(data = life_expectancy_clean, aes(x = Schooling, y = Life.expectancy)) +
  geom_point(shape = 18, color = "blue") +
  geom_smooth(method = "lm", linetype = "dashed", color = "darkred", fill = "blue") +
  labs(title = "Scatter Plot: Relationship between Schooling and Life Expectancy") +
  theme_light())
Schooling is positively correlated with life expectancy (r ≈ 0.73): the more years of schooling, the higher the life expectancy.
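# Total adult mortality per country.
# Note: this uses the raw data, so sum() returns NA for any country
# with a missing Adult.Mortality value; arrange(desc()) sorts NAs last.
top5am <- life_expectancy %>%
  group_by(Country) %>%
  summarise(Sum.of.Adult.Mortality = sum(Adult.Mortality)) %>%
  ungroup()
head(top5am %>% arrange(desc(Sum.of.Adult.Mortality)), 5)
library(GGally)
ggcorr(life_expectancy_clean, label = TRUE, size = 3)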
The variables that are strongly correlated with life expectancy are Schooling, Income Composition and Diphtheria.
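Reading the strongest correlations off a heatmap can be error-prone, so it may help to rank them numerically as well. The following is a minimal sketch on an invented data frame `toy`; applied to life_expectancy_clean, the same steps would rank every numeric column against Life.expectancy.

```r
# Invented data; only the column names echo the real dataset.
toy <- data.frame(
  Life.expectancy = c(55, 60, 62, 68, 70, 74, 78, 80),
  Schooling       = c(6, 8, 9, 11, 12, 13, 15, 16),
  Adult.Mortality = c(400, 350, 300, 250, 200, 150, 100, 80),
  Status          = c("Developing", "Developing", "Developing", "Developing",
                      "Developing", "Developed", "Developed", "Developed")
)

# cor() needs numeric input, so keep only the numeric columns.
num_cols <- sapply(toy, is.numeric)

# Correlation of every numeric column with the target.
cors <- cor(toy[, num_cols])[, "Life.expectancy"]
cors <- cors[names(cors) != "Life.expectancy"]

# Rank by absolute value so strong negative correlations are not overlooked.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(ranked)
```

Ranking by absolute value matters because a strongly negative predictor is as informative as a strongly positive one.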