EDA - Life Expectancy (WHO)
What is Life Expectancy?
Life expectancy is a statistical measure of the average age/time that an organism is expected to live based on several factors. Because life expectancy is an average measure, a particular person or organism may die many years before or many years after the life expectancy.
About the Dataset
For this report, we will use Life Expectancy Data.csv, which covers 193 countries from 2000 to 2015 and consists of 22 columns describing factors related to life expectancy. The 22 columns are:
- Country : Country
- Year : Year
- Status : Developed or Developing Status
- Life.expectancy : Life expectancy in years
- Adult.Mortality : Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
- infant.deaths : Number of Infant Deaths per 1000 population
- Alcohol : Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
- percentage.expenditure : Expenditure on health as a percentage of Gross Domestic Product per capita(%)
- Hepatitis.B : Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
- Measles : Measles - number of reported cases per 1000 population
- BMI : Average Body Mass Index of entire population
- under.five.deaths : Number of under-five deaths per 1000 population
- Polio : Polio (Pol3) immunization coverage among 1-year-olds (%)
- Total.expenditure : General government expenditure on health as a percentage of total government expenditure (%)
- Diphtheria : Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
- HIV.AIDS : Deaths per 1000 live births HIV/AIDS (0-4 years)
- GDP : Gross Domestic Product per capita (in USD)
- Population : Population of the country
- thinness..1.19.years : Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
- thinness.5.9.years : Prevalence of thinness among children for Age 5 to 9 (%)
- Income.composition.of.resources : Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
- Schooling : Number of years of Schooling (years)
#Load library
library(tidyr)
# Data input
data <- read.csv("data/Life Expectancy Data.csv")
#Checking data
str(data)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
From this result, we see that the Year and Status columns do not have an appropriate data type, so we will convert both to factors.
#Explicit coercion
data$Year <- as.factor(data$Year)
data$Status <- as.factor(data$Status)
str(data)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : Factor w/ 16 levels "2000","2001",..: 16 15 14 13 12 11 10 9 8 7 ...
## $ Status : Factor w/ 2 levels "Developed","Developing": 2 2 2 2 2 2 2 2 2 2 ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
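One caveat worth keeping in mind now that Year is a factor: calling as.numeric() directly on a factor returns the internal level codes (1, 2, 3, ...), not the years themselves. The sketch below uses a small made-up vector (`year_f` is a hypothetical name, not part of this report) to show the round trip through as.character() that recovers the original values.

```r
# Hypothetical toy vector, ordered like the dataset (newest year first)
year_f <- as.factor(c(2015, 2014, 2013, 2000))

# Direct conversion returns the level codes, not the years
codes <- as.numeric(year_f)

# Converting via character recovers the original years
years <- as.numeric(as.character(year_f))

print(codes)
print(years)
```

This matters whenever Year is needed as a number again, for example to sort rows chronologically or to fit a trend line.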
#Check missing Value
colSums(is.na(data))
## Country Year
## 0 0
## Status Life.expectancy
## 0 10
## Adult.Mortality infant.deaths
## 10 0
## Alcohol percentage.expenditure
## 194 0
## Hepatitis.B Measles
## 553 0
## BMI under.five.deaths
## 34 0
## Polio Total.expenditure
## 19 226
## Diphtheria HIV.AIDS
## 19 0
## GDP Population
## 448 652
## thinness..1.19.years thinness.5.9.years
## 34 34
## Income.composition.of.resources Schooling
## 167 163
As shown above, many columns have missing values. For this project, we will fill each missing value within a country using the next row's value first and, if none is available, the previous row's value (first up, then down). Because the data are ordered by year in descending order, the next row holds an earlier year and the previous row holds a later year.
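To make the fill direction concrete, here is a minimal sketch on made-up values for a single hypothetical country (the `toy` data frame and its `value` column are assumptions for illustration only). It relies on the `.direction = "updown"` option of tidyr's fill().

```r
library(tidyr)

# Made-up values for one hypothetical country, newest year first
toy <- data.frame(
  Year  = c(2015, 2014, 2013, 2012, 2011),
  value = c(NA, 5, NA, 7, NA)
)

# "updown": fill each NA from the row below it; any NA still left
# (here the last row) is then filled from the row above it
filled <- fill(toy, value, .direction = "updown")
print(filled)
```

A column that has no recorded value at all for a country stays NA after this step, because there is nothing to copy from.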
#Get country list
country.list <- unique(data$Country)
#Fill the missing values
for (country in country.list) {
  data[data$Country == country,] <- data[data$Country == country,] %>%
    fill(c(Life.expectancy,
           Adult.Mortality,
           Alcohol,
           Hepatitis.B,
           BMI,
           Polio,
           Total.expenditure,
           Diphtheria,
           GDP,
           Population,
           thinness..1.19.years,
           thinness.5.9.years,
           Income.composition.of.resources,
           Schooling),
         .direction = "updown")
}
colSums(is.na(data))
## Country Year
## 0 0
## Status Life.expectancy
## 0 10
## Adult.Mortality infant.deaths
## 10 0
## Alcohol percentage.expenditure
## 17 0
## Hepatitis.B Measles
## 144 0
## BMI under.five.deaths
## 34 0
## Polio Total.expenditure
## 0 32
## Diphtheria HIV.AIDS
## 0 0
## GDP Population
## 405 648
## thinness..1.19.years thinness.5.9.years
## 34 34
## Income.composition.of.resources Schooling
## 167 163
After filling missing values this way, several columns still contain missing values. This indicates that some countries have no recorded data at all for those columns. We will therefore fill the remaining gaps with 0, except in the Life.expectancy column: rows with a missing Life.expectancy are deleted, because life expectancy is the variable this analysis depends on.
#Delete data with missing Life.expectancy
data <- drop_na(data = data, Life.expectancy)
#Fill missing data with 0
data[is.na(data)] <- 0
anyNA(data)
## [1] FALSE
Great!! Now our data is clean!
summary(data)
## Country Year Status Life.expectancy
## Length:2928 2000 : 183 Developed : 512 Min. :36.30
## Class :character 2001 : 183 Developing:2416 1st Qu.:63.10
## Mode :character 2002 : 183 Median :72.10
## 2003 : 183 Mean :69.22
## 2004 : 183 3rd Qu.:75.70
## 2005 : 183 Max. :89.00
## (Other):1830
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.00 1st Qu.: 0.610 1st Qu.: 4.854
## Median :144.0 Median : 3.00 Median : 3.580 Median : 65.611
## Mean :164.8 Mean : 30.41 Mean : 4.503 Mean : 740.321
## 3rd Qu.:228.0 3rd Qu.: 22.00 3rd Qu.: 7.600 3rd Qu.: 442.614
## Max. :723.0 Max. :1800.00 Max. :17.870 Max. :19479.912
##
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:62.00 1st Qu.: 0.0 1st Qu.:19.00 1st Qu.: 0.00
## Median :88.00 Median : 17.0 Median :43.00 Median : 4.00
## Mean :72.09 Mean : 2427.9 Mean :37.82 Mean : 42.18
## 3rd Qu.:96.00 3rd Qu.: 362.2 3rd Qu.:56.10 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :77.60 Max. :2500.00
##
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.0 Min. : 0.000 Min. : 2.00 Min. : 0.100
## 1st Qu.:77.0 1st Qu.: 4.210 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.0 Median : 5.700 Median :93.00 Median : 0.100
## Mean :82.3 Mean : 5.867 Mean :82.07 Mean : 1.748
## 3rd Qu.:97.0 3rd Qu.: 7.470 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.0 Max. :17.600 Max. :99.00 Max. :50.600
##
## GDP Population thinness..1.19.years thinness.5.9.years
## Min. : 0.0 Min. :0.000e+00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 225.8 1st Qu.:7.336e+03 1st Qu.: 1.500 1st Qu.: 1.500
## Median : 1194.6 Median :5.459e+05 Median : 3.300 Median : 3.300
## Mean : 6370.1 Mean :9.958e+06 Mean : 4.798 Mean : 4.828
## 3rd Qu.: 4793.6 3rd Qu.:4.593e+06 3rd Qu.: 7.100 3rd Qu.: 7.200
## Max. :119172.7 Max. :1.294e+09 Max. :27.700 Max. :28.600
##
## Income.composition.of.resources Schooling
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.4660 1st Qu.: 9.575
## Median :0.6620 Median :12.100
## Mean :0.5931 Mean :11.344
## 3rd Qu.:0.7730 3rd Qu.:14.100
## Max. :0.9480 Max. :20.700
##
Based on the summary, the mean life expectancy across the 2,928 remaining records is 69.22 years, with a minimum of 36.30 years and a maximum of 89.00 years.
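Because the remaining missing values were replaced with 0, a minimum of 0 in the summary (for example for GDP or Schooling) is a placeholder rather than a measurement. The sketch below uses made-up numbers (the `schooling` vector is hypothetical) to show how such placeholders pull a mean down and how they can be excluded. It only suits columns where a true value of 0 is implausible.

```r
# Made-up schooling values; 0 stands in for "not recorded"
schooling <- c(12.1, 0, 9.5, 14.0, 0)

# Mean with the placeholder zeros included
mean_with_zeros <- mean(schooling)

# Mean over recorded values only
mean_recorded <- mean(schooling[schooling != 0])

print(c(with_zeros = mean_with_zeros, recorded = mean_recorded))
```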
Highest and Lowest Life Expectancy
Highest Life Expectancy Country
Now we want to find the countries with the highest average life expectancy.
#Average life expectancy per country
country_avg_le <- aggregate(Life.expectancy ~ Country,
                            data = data,
                            FUN = mean)
#Get the 10 countries with the highest life expectancy
country_avg_le_10_up <- tail(country_avg_le[order(country_avg_le$Life.expectancy),], 10)
country_avg_le_10_up
#Create barplot
a <- barplot(height = country_avg_le_10_up$Life.expectancy,
             names.arg = country_avg_le_10_up$Country,
             horiz = TRUE,
             las = 1,
             main = "Top 10 Countries with Highest Average Life Expectancy",
             xlab = "Life Expectancy",
             col = "#2685de")
#Print each average just inside the end of its bar
text(y = a, x = country_avg_le_10_up[,2]-5, labels = country_avg_le_10_up[,2])
Based on the result above, Japan has the highest average life expectancy, at 82.5375 years.
Lowest Life Expectancy Country
#Get the 10 countries with the lowest life expectancy
country_avg_le_10_down <- tail(country_avg_le[order(country_avg_le$Life.expectancy, decreasing = TRUE),], 10)
country_avg_le_10_down
#Create barplot
b <- barplot(height = country_avg_le_10_down$Life.expectancy,
             names.arg = country_avg_le_10_down$Country,
             horiz = TRUE,
             las = 1,
             main = "Top 10 Countries with Lowest Average Life Expectancy",
             xlab = "Life Expectancy",
             col = "#2685de", cex.names = 0.6)
text(y = b, x = country_avg_le_10_down[,2]-3, labels = country_avg_le_10_down[,2])
Based on the result above, Sierra Leone has the lowest average life expectancy, at 46.1125 years, which is well below the world average.
Country Status Comparison on Life Expectancy
In this data, country status is divided into Developed and Developing. We want to compare the average life expectancy between developed and developing countries.
#Average life expectancy per status
average_status <- aggregate(Life.expectancy ~ Status,
                            data = data,
                            FUN = mean)
average_status
c <- barplot(height = average_status$Life.expectancy,
             names.arg = average_status$Status,
             ylab = "Life Expectancy",
             col = "#2685de")
text(x = c, y = average_status[,2]-2, labels = average_status[,2])
Does a Country's GDP Correlate with Life Expectancy?
#Subset GDP and Life.expectancy columns
gdp_le <- data[, c("GDP", "Life.expectancy")]
#Delete 0 values (placeholders for missing GDP)
gdp_le <- gdp_le[gdp_le$GDP != 0,]
#Correlation value
cor(gdp_le$Life.expectancy, gdp_le$GDP)
## [1] 0.4615107
#Scatter plot
plot(gdp_le$Life.expectancy,
     gdp_le$GDP,
     xlab = "Life Expectancy",
     ylab = "GDP")
#Add the fitted regression line
abline(lm(gdp_le$GDP ~ gdp_le$Life.expectancy),
       col = "red")
Based on that result, the correlation between GDP and life expectancy is positive but only weak to moderate (about 0.46): life expectancy tends to be higher where GDP is higher, although a correlation alone does not show that one causes the other.
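By default, cor() returns Pearson's correlation coefficient. As a minimal sketch of what that number measures, the block below computes it from its definition on made-up values (`gdp` and `le` here are hypothetical vectors, not the report's data) and compares the result with cor().

```r
# Made-up numbers for illustration only
gdp <- c(500, 1500, 4000, 12000, 40000)
le  <- c(55, 62, 70, 74, 81)

# Pearson's r: summed co-deviations divided by the square root of the
# product of the summed squared deviations
r_manual <- sum((gdp - mean(gdp)) * (le - mean(le))) /
  sqrt(sum((gdp - mean(gdp))^2) * sum((le - mean(le))^2))

# Built-in equivalent (method = "pearson" is the default)
r_builtin <- cor(gdp, le)

print(c(manual = r_manual, builtin = r_builtin))
```

The coefficient ranges from -1 to 1; values near 0 indicate little linear relationship, and the sign gives the direction.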
Which Parameters Have the Most Impact on Life Expectancy?
#we need the GGally library to use the ggcorr() function
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(data, label = T, hjust = 1, layout.exp = 5)
## Warning in ggcorr(data, label = T, hjust = 1, layout.exp = 5): data in column(s)
## 'Country', 'Year', 'Status' are not numeric and were ignored
Based on the result, some parameters correlate positively with life expectancy and others negatively. Parameters with a strong positive correlation include Schooling, income composition, and measles. Parameters with a negative correlation include thinness..1.19.years, thinness.5.9.years, and HIV.AIDS.
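Reading values off a correlation plot can be error-prone, so it may help to list the correlations with Life.expectancy as numbers. The sketch below does this on a small made-up data frame (`toy` and its values are assumptions; only the column names mirror the dataset), using base R's cor().

```r
# Hypothetical toy data frame; values are made up for illustration
toy <- data.frame(
  Country         = c("A", "B", "C", "D", "E"),
  Life.expectancy = c(55, 62, 70, 74, 81),
  Schooling       = c(6, 9, 11, 13, 16),
  HIV.AIDS        = c(9.0, 4.0, 1.5, 0.5, 0.1)
)

# Keep numeric columns only, as ggcorr() does
num_cols <- toy[sapply(toy, is.numeric)]

# Correlation of every numeric column with Life.expectancy
le_cor <- cor(num_cols)[, "Life.expectancy"]

# Drop the trivial self-correlation, then sort from most negative to most positive
le_cor <- sort(le_cor[names(le_cor) != "Life.expectancy"])
print(round(le_cor, 2))
```

Applied to the cleaned data, the same steps would give one signed coefficient per numeric column, which makes the strongest positive and negative relationships easy to rank.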
How About Indonesia's Condition?
indonesia_data <- data[data$Country == "Indonesia",]
summary(indonesia_data)
## Country Year Status Life.expectancy
## Length:16 2000 : 1 Developed : 0 Min. :65.30
## Class :character 2001 : 1 Developing:16 1st Qu.:66.85
## Mode :character 2002 : 1 Median :67.60
## 2003 : 1 Mean :67.56
## 2004 : 1 3rd Qu.:68.35
## 2005 : 1 Max. :69.10
## (Other):10
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 19.0 Min. :114.0 Min. :0.050 Min. : 0.000
## 1st Qu.:180.5 1st Qu.:132.8 1st Qu.:0.060 1st Qu.: 9.813
## Median :187.5 Median :151.5 Median :0.065 Median : 49.270
## Mean :166.6 Mean :151.2 Mean :0.070 Mean : 83.768
## 3rd Qu.:189.0 3rd Qu.:173.2 3rd Qu.:0.080 3rd Qu.:141.985
## Max. :213.0 Max. :187.0 Max. :0.090 Max. :254.469
##
## Hepatitis.B Measles BMI under.five.deaths
## Min. :62.00 Min. : 3344 Min. : 2.50 Min. :136.0
## 1st Qu.:64.75 1st Qu.:14105 1st Qu.:17.30 1st Qu.:159.2
## Median :77.00 Median :15671 Median :20.50 Median :184.5
## Mean :73.56 Mean :16245 Mean :19.96 Mean :186.6
## 3rd Qu.:82.00 3rd Qu.:20521 3rd Qu.:24.02 3rd Qu.:216.8
## Max. :85.00 Max. :29171 Max. :27.40 Max. :237.0
##
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 8.00 Min. :1.980 Min. : 7.00 Min. :0.1
## 1st Qu.:56.00 1st Qu.:2.490 1st Qu.:72.00 1st Qu.:0.1
## Median :78.50 Median :2.800 Median :76.50 Median :0.2
## Mean :62.19 Mean :2.675 Mean :72.38 Mean :0.2
## 3rd Qu.:82.25 3rd Qu.:2.862 3rd Qu.:78.75 3rd Qu.:0.3
## Max. :86.00 Max. :3.100 Max. :85.00 Max. :0.3
##
## GDP Population thinness..1.19.years thinness.5.9.years
## Min. : 78.93 Min. : 2145652 Min. : 1.400 Min. : 1.200
## 1st Qu.: 326.13 1st Qu.: 22639758 1st Qu.: 1.575 1st Qu.: 1.475
## Median :1367.41 Median : 24904887 Median : 1.750 Median : 1.700
## Mean :1669.12 Mean :116555259 Mean : 3.419 Mean : 3.944
## 3rd Qu.:3169.16 3rd Qu.:237750488 3rd Qu.: 1.900 3rd Qu.: 4.175
## Max. :3687.95 Max. :258162113 Max. :11.000 Max. :11.200
##
## Income.composition.of.resources Schooling
## Min. :0.5970 Min. :10.60
## 1st Qu.:0.6212 1st Qu.:10.88
## Median :0.6395 Median :11.40
## Mean :0.6414 Mean :11.61
## 3rd Qu.:0.6637 3rd Qu.:12.38
## Max. :0.6860 Max. :12.90
##
d <- plot(x = indonesia_data$Year, y = indonesia_data$Life.expectancy,
          xlab = "Year",
          ylab = "Life Expectancy")
Based on the summary above, Indonesia still has Developing status, with an average life expectancy of 67.56 years, below the world average of 69.22 years. However, from 2000 to 2015, Indonesia's life expectancy increased every year.
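A claim that a series rises every year can be checked directly instead of being read off a plot. The sketch below shows one way to do that on made-up values (the `toy` data frame is hypothetical and keeps Year numeric; in this report Year is a factor, so it would first need converting with as.numeric(as.character(...))).

```r
# Made-up values for illustration, newest year first like the dataset
toy <- data.frame(
  Year            = c(2003, 2002, 2001, 2000),
  Life.expectancy = c(66.1, 65.8, 65.5, 65.3)
)

# Sort chronologically, then check that every year-to-year change is positive
toy <- toy[order(toy$Year), ]
always_increasing <- all(diff(toy$Life.expectancy) > 0)
print(always_increasing)
```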
Conclusion
Life expectancy is the average number of years an organism is expected to live. It is associated with several parameters, some positively and some negatively. A positive correlation means that life expectancy tends to increase as the parameter increases. A negative correlation means that life expectancy tends to decrease as the parameter increases.
The average life expectancy across the dataset is 69.22 years. Japan has the highest average life expectancy, at 82.5375 years, and Sierra Leone has the lowest, at just 46.1125 years.
Indonesia has an average life expectancy of 67.56 years, which is still below the global average. Countries with developing status tend to have life expectancy below the global average. Improving the parameters that correlate positively with life expectancy, and reducing those that correlate negatively, may help raise it, although correlation alone does not establish cause and effect.