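# Packages inferred from the functions used in this analysis
library(data.table)  # fread()
library(dplyr)       # %>%, group_by(), slice_min()
library(psych)       # descriptive statistics
library(epiDisplay)  # tab1()
library(ggplot2)     # plots
library(car)         # scatterplotMatrix()
mydata <- fread("./Life Expectancy Data.csv")
# Keep the earliest year recorded for each country
mydata1 <- mydata %>%
group_by(Country) %>%
slice_min(Year)
mydata2 <- subset(mydata1, select = c(1,3,4,22,11,14))
names(mydata2)[names(mydata2) == "Life expectancy"] <- "LifeExpectancy"
names(mydata2)[names(mydata2) == "Total expenditure"] <- "TotalExpenditure"
head(mydata2)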
## # A tibble: 6 × 6
## # Groups: Country [6]
## Country Status LifeExpectancy Schooling BMI TotalExpenditure
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Developing 54.8 5.5 12.2 8.2
## 2 Albania Developing 72.6 10.7 45 6.26
## 3 Algeria Developing 71.3 10.7 44.4 3.49
## 4 Angola Developing 45.3 4.6 15.4 2.79
## 5 Antigua and Barbuda Developing 73.6 0 38.2 4.13
## 6 Argentina Developing 74.1 15 54 9.21
Unit of Observation: Country
Sample Size: 193 before cleaning
Country: Name of a country.
Status: Whether a country is developed or developing.
Life Expectancy: Life expectancy at birth in years.
Schooling: Average number of years of schooling in the population.
BMI: Average Body Mass Index of the entire population.
Total Expenditure: General government expenditure on health as a percentage of total government expenditure (%).
Research Question: How can a country increase Life Expectancy?
Data Source: Kaggle Life Expectancy (WHO) https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
sum(is.na(mydata2))
## [1] 31
mydata3 <- na.omit(mydata2)
Removed every row containing any of the 31 missing values, leaving 170 countries.
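Because na.omit() drops whole rows, the number of missing cells and the number of rows removed need not match. A minimal sketch of how the two can be counted separately, using a small made-up data frame (the values and the name `toy` are hypothetical, not taken from the dataset):

```r
# Toy data frame (hypothetical values, not taken from the WHO data)
toy <- data.frame(
  Country   = c("A", "B", "C", "D"),
  Schooling = c(10.2, NA, 12.5, NA),
  BMI       = c(25.1, 30.4, NA, NA)
)

# Missing cells per column
na_per_column <- colSums(is.na(toy))

# Rows that contain at least one missing cell
rows_with_na <- sum(!complete.cases(toy))

# na.omit() drops whole rows, so cells missing and rows dropped can differ
toy_clean <- na.omit(toy)

print(na_per_column)
cat("missing cells:", sum(is.na(toy)), "\n")
cat("rows dropped: ", rows_with_na, "\n")
cat("rows kept:    ", nrow(toy_clean), "\n")
```

Here four cells are missing but only three rows are dropped, because one row is missing two values.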
sum(duplicated(mydata3))
## [1] 0
mydata4 <- mydata3[!duplicated(mydata3), ]
There are no duplicate rows
mydata4$Status <- as.factor(mydata4$Status)
Changed variable Status into a factor
describe(mydata4)
## vars n mean sd median trimmed mad min max range
## Country* 1 170 85.50 49.22 85.50 85.50 63.01 1.0 170.00 169.00
## Status* 2 170 1.83 0.38 2.00 1.91 0.00 1.0 2.00 1.00
## LifeExpectancy 3 170 67.04 10.17 71.10 68.09 9.12 39.0 81.10 42.10
## Schooling 4 170 10.62 4.03 11.40 10.90 3.41 0.0 20.40 20.40
## BMI 5 170 34.67 18.87 38.05 35.18 24.61 1.4 67.90 66.50
## TotalExpenditure 6 170 5.58 2.04 5.41 5.53 2.11 1.1 13.63 12.53
## skew kurtosis se
## Country* 0.00 -1.22 3.77
## Status* -1.74 1.02 0.03
## LifeExpectancy -0.79 -0.46 0.78
## Schooling -0.68 0.36 0.31
## BMI -0.17 -1.42 1.45
## TotalExpenditure 0.42 0.35 0.16
The mean life expectancy is 67.04 years, so we expect the life expectancy of a random country in this sample to be close to 67.04 years.
The median life expectancy is 71.10 years: 50% of the countries in this sample had a life expectancy below 71.10 years and 50% had a life expectancy above 71.10 years.
The mean years of schooling is 10.62, so we expect the average years of schooling of a random country in this sample to be close to 10.62 years.
The median of average years of schooling is 11.4: 50% of the countries in this sample had fewer than 11.4 average years of schooling and 50% had more than 11.4 years.
The mean BMI is 34.67, so we expect the population average BMI of a random country in this sample to be close to 34.67.
The median BMI is 38.05: 50% of the countries in this sample had a population average BMI below 38.05 and 50% had a BMI above 38.05.
The mean total expenditure on health is 5.58%, so we expect the total expenditure on health of a random country in this sample to be close to 5.58%.
The median total expenditure is 5.41%: 50% of the countries in this sample spent less than 5.41% of total government expenditure on health and 50% spent more than 5.41%.
tab1(mydata4$Status, sort.group = "decreasing", cum.percent = FALSE)
## mydata4$Status :
## Frequency Percent
## Developing 141 82.9
## Developed 29 17.1
## Total 170 100.0
17.1% of the countries in this sample are developed, whilst 82.9% are developing.
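The percentages can be checked directly from the counts in the frequency table. A small sketch in base R (the variable names are illustrative only):

```r
# Counts taken from the frequency table above
status_counts <- c(Developing = 141, Developed = 29)

# Share of each group as a percentage, rounded to one decimal place
status_percent <- round(100 * status_counts / sum(status_counts), 1)
print(status_percent)
```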
ggplot(mydata4, aes(x=Status, y=LifeExpectancy, fill=Status)) +
geom_boxplot(outlier.shape = 19)+
ggtitle("Boxplot of Life Expectancy by Country Status")
We can see that developed countries have a higher life expectancy. However, developing countries have a very large range of life expectancy, so we should look at the distribution of Life Expectancy.
ggplot(mydata4, aes(x=LifeExpectancy)) +
geom_density(alpha=.3, fill="green", color="black", linewidth=1.5)+
geom_vline(aes(xintercept=median(LifeExpectancy)))+
labs(y = "Density", x = "Life Expectancy")+
ggtitle("Distribution density of Life Expectancy")
Life expectancy is skewed to the left because some countries have very low life expectancy. We should therefore use the median life expectancy for further analysis, as the mean is pulled downwards by the skew and the large range.
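To illustrate why the median is preferred here, the sketch below uses ten made-up values with a long lower tail (they are hypothetical and not taken from the dataset). The two low values drag the mean down while the median barely moves:

```r
# Hypothetical left-skewed sample: most values are high, a few are very low
x <- c(40, 45, 68, 70, 71, 72, 73, 74, 75, 76)

sample_mean   <- mean(x)    # pulled towards the low tail
sample_median <- median(x)  # middle of the sorted values

cat("mean:  ", sample_mean, "\n")
cat("median:", sample_median, "\n")
```

In this toy sample the mean is 66.4 and the median is 71.5, the same pattern seen in the real data (mean 67.04, median 71.10).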
ggplot(mydata4, aes(x=BMI)) +
geom_density(alpha=.3, fill="orange", color="black", linewidth=1.5)+
labs(y = "Density", x = "BMI")+
ggtitle("Distribution density of BMI")
Looking at BMI we can see a bimodal distribution. This makes BMI unsuitable for analysis, as another variable is likely to be responsible for each peak. In this case gender differences (male and female) may be responsible for this distribution, but male and female categories are not available in this sample.
ggplot(mydata4, aes(y = Schooling)) +
geom_boxplot(outlier.shape = 19, outlier.colour = "red", outlier.size =5)
ggplot(mydata4, aes(x=Schooling)) +
geom_density(alpha=.3, fill="red", color="black", linewidth=1.5)+
geom_vline(aes(xintercept=median(Schooling))) +
xlim(c(1, 21))+
labs(y = "Density", x = "Schooling")+
ggtitle("Distribution density of Schooling")
## Warning: Removed 7 rows containing non-finite values (`stat_density()`).
Schooling has several outliers at value 0 (7 countries fall outside the plotted range). With these trimmed, the distribution is approximately normal with a slight skew to the left. Because of this skew, it may be best to use the median of schooling for further analysis.
ggplot(mydata4, aes(y = TotalExpenditure)) +
geom_boxplot(outlier.shape = 19, outlier.colour = "red", outlier.size =5)
ggplot(mydata4, aes(x=TotalExpenditure)) +
geom_histogram(binwidth = .8, alpha=.3, fill="purple", linewidth=1.5)+
xlim(c(0, 11))+
labs(y = "Frequency", x = "Total Expenditure")+
ggtitle("Distribution of Total Expenditure")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
We can see that there are two outliers in Total Expenditure, which may reduce the accuracy of the analysis. With these outliers trimmed, the histogram of Total Expenditure is approximately normal, so we can use the mean as the average Total Expenditure across countries.
# Keep rows strictly inside the 1.5 * IQR fences for Schooling
Q1 <- quantile(mydata4$Schooling, .25)
Q3 <- quantile(mydata4$Schooling, .75)
iqr <- IQR(mydata4$Schooling)  # lower-case name avoids masking the IQR() function
mydata5 <- subset(mydata4, Schooling > (Q1 - 1.5*iqr) & Schooling < (Q3 + 1.5*iqr))
dim(mydata5)
## [1] 163 6
# Keep rows strictly inside the 1.5 * IQR fences for TotalExpenditure
Q1 <- quantile(mydata5$TotalExpenditure, .25)
Q3 <- quantile(mydata5$TotalExpenditure, .75)
iqr <- IQR(mydata5$TotalExpenditure)  # lower-case name avoids masking the IQR() function
mydata6 <- subset(mydata5, TotalExpenditure > (Q1 - 1.5*iqr) & TotalExpenditure < (Q3 + 1.5*iqr))
dim(mydata6)
## [1] 162 6
Removed outliers from Schooling (170 to 163 countries) and Total Expenditure (163 to 162 countries) using the 1.5 × IQR rule.
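Since the same fence calculation is applied to two variables, it could be wrapped in a small function. The sketch below is one possible way to do this; `inside_fences` is a hypothetical helper name and the input values are made up for illustration:

```r
# Hypothetical helper: TRUE for values strictly inside the k * IQR fences
inside_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]  # interquartile range
  x > (q[1] - k * spread) & x < (q[2] + k * spread)
}

# Toy values: 0 and 30 sit far from the bulk of the data
schooling_toy <- c(0, 9, 10, 11, 12, 13, 14, 30)
keep <- inside_fences(schooling_toy)
print(schooling_toy[keep])
```

With these toy values the quartiles are 9.75 and 13.25, the fences are 4.5 and 18.5, and only 0 and 30 are dropped. The logical vector can be used to filter a data frame in the same way, for example `df[inside_fences(df$Schooling), ]`.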
scatterplotMatrix(mydata6[ , c(-1, -2)],
smooth = FALSE)
Life expectancy is positively correlated with all of our variables: higher values of each variable are associated with higher life expectancy. The strongest relationship is with average years of schooling. BMI is bimodal and would need to be split by a variable that is not available in the data, so we will not analyse it further.
ggplot(mydata6, aes(x = LifeExpectancy, y = Schooling)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)+
ggtitle("Relationship between Life Expectancy and Schooling")
## `geom_smooth()` using formula = 'y ~ x'
Taking a closer look at Schooling and Life Expectancy, we see a clear positive linear relationship: the higher the average years of schooling, the greater the life expectancy of a country.
cor(mydata6$LifeExpectancy, mydata6$Schooling)
## [1] 0.7829121
cor(mydata6$LifeExpectancy, mydata6$TotalExpenditure)
## [1] 0.2226708
Indeed, the correlation coefficient is 0.783, so life expectancy and average years of schooling are strongly positively correlated. The correlation coefficient for life expectancy and expenditure on health is positive but much weaker, at 0.223. This suggests that schooling may have a greater impact on life expectancy than health expenditure and should be analysed further.
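cor() returns only the coefficient. If a p-value and confidence interval are also wanted, base R's cor.test() provides them. A minimal sketch on made-up values for eight countries (the numbers are hypothetical, not taken from the dataset):

```r
# Hypothetical values for eight countries (not taken from the dataset)
schooling <- c(5, 7, 9, 10, 11, 12, 14, 15)
life_exp  <- c(52, 58, 61, 66, 70, 69, 75, 79)

# Pearson correlation with a significance test and 95% confidence interval
result <- cor.test(life_exp, schooling, method = "pearson")

cat("r =", round(result$estimate, 3), "\n")
cat("95% CI:", round(result$conf.int, 3), "\n")
cat("p-value:", signif(result$p.value, 3), "\n")
```

The same call could be applied to the cleaned data, for example `cor.test(mydata6$LifeExpectancy, mydata6$Schooling)`.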
ggplot(mydata6, aes(x = LifeExpectancy, y = Schooling, color = Status)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Life Expectancy and Schooling grouped by Country Status")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(mydata6, aes(x = LifeExpectancy, y = TotalExpenditure, color = Status)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Life Expectancy and Health Expenditure grouped by Country Status")
## `geom_smooth()` using formula = 'y ~ x'
The relationship with schooling is slightly stronger in developed countries. More interestingly, the relationship with expenditure on health is much stronger in developed countries than in developing ones. This suggests that developing countries could focus on schooling, while developed countries could focus on both schooling and spending on health, although correlation alone does not establish that either one causes higher life expectancy.
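The comparison between groups is read from the plots. To put numbers on it, the correlation can be computed separately within each Status group. A base R sketch on a made-up data frame (the name `toy` and all values are hypothetical):

```r
# Hypothetical data: two status groups with made-up values
toy <- data.frame(
  Status         = rep(c("Developed", "Developing"), each = 5),
  LifeExpectancy = c(76, 78, 79, 80, 81, 50, 55, 62, 60, 70),
  Schooling      = c(14, 15, 15.5, 16, 17, 5, 7, 9, 10, 12)
)

# Split by Status, then compute the Pearson correlation inside each group
cor_by_status <- sapply(
  split(toy, toy$Status),
  function(g) cor(g$LifeExpectancy, g$Schooling)
)
print(round(cor_by_status, 3))
```

Replacing `toy` with `mydata6`, and `Schooling` with `TotalExpenditure` where needed, would give the per-group coefficients behind the two plots above.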