Introduction

Life expectancy is a measure of how long people live on average; it can be estimated from the average age at death in a population.
Life expectancy depends on a number of factors, such as lifestyle, education, medical technology and quality of life.
In fact, most countries with good well-being and sufficient public utilities are developed countries.
Nevertheless, life expectancy has increased in countries around the world over the past several decades.
This assignment therefore presents a statistical analysis of life expectancy in developed and developing countries in 2015, in order to assess whether a country’s status (developed or developing) affects life expectancy.

Figure 1: Comparison of developed and developing countries

Problem Statement

For the statistical analysis, countries are separated into two groups: developed and developing.
The analysis aims to determine whether life expectancy depends on a country’s status.

The statistical analysis includes:

Testing homogeneity of variance with Levene’s test.
Comparing the group means with a two-sample t-test.

Data

The dataset was downloaded from the open data website kaggle.com (https://www.kaggle.com/kumarajarshi/life-expectancy-who).
The dataset contains a range of measurements from 2000 to 2015 for 193 countries around the world.
This assignment considers only life expectancy in 2015, with the data divided into two groups: developed and developing countries.
The developed group contains 32 countries and the developing group contains 151.

The important variables are described below:

Year : Year of observation
Status : Status of the country (developed/developing)
Life expectancy : Life expectancy of people (years)

Data Cont.

Before analysis, the data need to be preprocessed by:

Separating the countries by status using subsets.
Choosing the sample from the year 2015.
Visualising the data with a box plot.
Checking data normality with Q-Q plots.

Data Cont.

Subset the dataset to prepare it for analysis.

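library(dplyr)   # %>%, group_by(), summarise(), tibble()
library(car)     # qqPlot(), leveneTest()

# Read the WHO life expectancy data
life_exp <- read.csv("Life Expectancy Data.csv")

# Keep only the 2015 observations and the two columns of interest
exp_status <- subset(life_exp, Year == 2015, select = c(Status, Life.expectancy))

# Split the 2015 data into the two groups
developed <- subset(exp_status, Status == "Developed")
developing <- subset(exp_status, Status == "Developing")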

Data Cont.

Life expectancy in developed countries:

tibble(developed$Status, developed$Life.expectancy)

## # A tibble: 32 x 2
##    `developed$Status` `developed$Life.expectancy`
##    <chr>                                    <dbl>
##  1 Developed                                 82.8
##  2 Developed                                 81.5
##  3 Developed                                 81.1
##  4 Developed                                 74.5
##  5 Developed                                 78  
##  6 Developed                                 85  
##  7 Developed                                 78.8
##  8 Developed                                 86  
##  9 Developed                                 81  
## 10 Developed                                 75.8
## # … with 22 more rows

Data Cont.

Life expectancy in developing countries:

tibble(developing$Status, developing$Life.expectancy)

## # A tibble: 151 x 2
##    `developing$Status` `developing$Life.expectancy`
##    <chr>                                      <dbl>
##  1 Developing                                  65  
##  2 Developing                                  77.8
##  3 Developing                                  75.6
##  4 Developing                                  52.4
##  5 Developing                                  76.4
##  6 Developing                                  76.3
##  7 Developing                                  74.8
##  8 Developing                                  72.7
##  9 Developing                                  76.1
## 10 Developing                                  76.9
## # … with 141 more rows

Data Cont.

Check the variable types, then convert Status to a factor.

str(exp_status)

## 'data.frame':    183 obs. of  2 variables:
##  $ Status         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life.expectancy: num  65 77.8 75.6 52.4 76.4 76.3 74.8 82.8 81.5 72.7 ...

exp_status$Status <- as.factor(exp_status$Status)
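
The level order of this factor matters later, because formula-based tests treat the first level as group 1. A minimal sketch on a toy vector (not the WHO data) suggests that as.factor() orders the levels alphabetically, so "Developed" would come before "Developing":

```r
# Toy vector for illustration only, not the WHO data
status <- as.factor(c("Developing", "Developed", "Developing"))

# Levels are sorted alphabetically, regardless of order of appearance
levels(status)

# Underlying integer codes: 1 = first level, 2 = second level
as.integer(status)
```

Under this ordering, a difference in means reported for Life.expectancy ~ Status would be read as developed minus developing.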

Statistics and Visualisation

In this step, a box plot is used to compare life expectancy in developed and developing countries.

exp_status %>%  boxplot(Life.expectancy~Status, data = ., ylab = "Life Expectancy (year)")

Statistics and Visualisation cont.

Q-Q plots are used to check the normality of each group.

par(mfrow = c(1, 2))
qqPlot(developed$Life.expectancy, dist = "norm", ylab = "Life expectancy (developed)")

## [1] 27 16

qqPlot(developing$Life.expectancy, dist = "norm", ylab = "Life expectancy (developing)")

## [1] 120   4

Statistics and Visualisation cont.

The box plot shows that life expectancy in developed countries is higher than in developing countries.
Moreover, the box plot shows no outliers in either group of countries.
The Q-Q plots suggest that the data for developing countries follow a normal distribution more closely than the data for developed countries.
Nevertheless, both samples are large (n > 30), so by the central limit theorem the sampling distribution of each mean can be treated as approximately normal.
Therefore, a two-sample t-test can be used even though the normality assumption is not strictly met.

Statistics Cont.

Summary of life expectancy in the two groups:

exp_status %>% group_by(Status) %>% summarise(Min = min(Life.expectancy, na.rm = TRUE),
                                           Q1 = quantile(Life.expectancy, probs = .25, na.rm = TRUE),
                                           Median = median(Life.expectancy, na.rm = TRUE),
                                           Q3 = round(quantile(Life.expectancy, probs = .75, na.rm = TRUE),2),
                                           Max = max(Life.expectancy, na.rm = TRUE),
                                           Mean =round(mean(Life.expectancy, na.rm = TRUE),2),
                                           SD = round(sd(Life.expectancy, na.rm = TRUE),2),
                                           n = n(),
                                           Missing = sum(is.na(Life.expectancy)))

## # A tibble: 2 x 10
##   Status       Min    Q1 Median    Q3   Max  Mean    SD     n Missing
##   <fct>      <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <int>
## 1 Developed   73.6  78.6   81.6  82.7    88  80.7  3.46    32       0
## 2 Developing  51    64.6   71.6  75.5    85  69.7  7.5    151       0
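
The quartiles in this summary offer a rough check of the box plot's reading that neither group has outliers. The sketch below applies the usual 1.5 × IQR rule to the values as printed above; the data frame and column names are illustrative, and because boxplot() computes its hinges slightly differently from quantile(), the fences are only approximate.

```r
# Values copied from the summary table above (rounded as printed)
stats <- data.frame(
  Status = c("Developed", "Developing"),
  Min = c(73.6, 51),
  Q1  = c(78.6, 64.6),
  Q3  = c(82.7, 75.5),
  Max = c(88, 85)
)

# Fences of the 1.5 x IQR rule
iqr <- stats$Q3 - stats$Q1
stats$LowerFence <- stats$Q1 - 1.5 * iqr
stats$UpperFence <- stats$Q3 + 1.5 * iqr

# TRUE if the minimum or maximum falls outside the fences
stats$AnyOutlier <- stats$Min < stats$LowerFence | stats$Max > stats$UpperFence
stats
```

With these rounded inputs, the minimum and maximum of both groups lie inside the fences, which agrees with the box plot.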

Hypothesis Testing

The leveneTest() function is used to compare the variances of developed and developing countries.

\[H_0: \sigma_1^2 = \sigma_2^2 \]

\[H_A: \sigma_1^2 \ne \sigma_2^2 \]

exp_status %>%  leveneTest(Life.expectancy ~ Status, data =. )

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   1  19.469 1.753e-05 ***
##       181                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result shows \(p\) < 0.05. Thus, \(H_0\) is rejected and the variances are assumed to be unequal.
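
To illustrate what the median-centred Levene test computes, the sketch below runs a one-way ANOVA on the absolute deviations of each observation from its group median, using base R only. The data are simulated with assumed means and standard deviations and are not the WHO dataset, so the numbers will differ from the output above.

```r
set.seed(42)  # simulated data for illustration, NOT the WHO dataset
y <- c(rnorm(32, mean = 80, sd = 3.5), rnorm(151, mean = 70, sd = 7.5))
g <- factor(rep(c("Developed", "Developing"), times = c(32, 151)))

# Absolute deviation of each observation from its own group median
z <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations gives the Levene (center = median) F test
levene_tab <- anova(lm(z ~ g))
f_value <- levene_tab[["F value"]][1]
p_value <- levene_tab[["Pr(>F)"]][1]
c(F = f_value, p = p_value)
```

A large F value here means the typical spread around the median differs between the groups, which is the sense in which the test compares variances.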

Hypothesis Testing Cont.

Statistical hypotheses for two-sample t-test:

\[H_0: \mu_1 - \mu_2 = 0\]

\[H_A: \mu_1 - \mu_2 \ne 0\]

Because the variances are unequal, the two-sample t-test is run with the t.test() function and var.equal = FALSE; this is known as the Welch two-sample t-test.

exp_status %>% t.test(Life.expectancy ~ Status, data =., var.equal = FALSE, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  Life.expectancy by Status
## t = 12.753, df = 102.42, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   9.305573 12.733045
## sample estimates:
##  mean in group Developed mean in group Developing 
##                 80.70937                 69.69007
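
As a cross-check, the Welch statistic, its Welch-Satterthwaite degrees of freedom and the 95% confidence interval can be approximately reproduced from the summary statistics reported earlier. This is a sketch with illustrative variable names; because the standard deviations are rounded to two decimal places, small discrepancies from the t.test() output are expected.

```r
# Summary statistics copied from the outputs above (SDs rounded to 2 d.p.)
m1 <- 80.70937; s1 <- 3.46; n1 <- 32    # Developed
m2 <- 69.69007; s2 <- 7.5;  n2 <- 151   # Developing

# Squared standard error of each group mean
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch t statistic: difference in means over its standard error
se <- sqrt(v1 + v2)
t_stat <- (m1 - m2) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# 95% confidence interval for the difference in means
ci <- (m1 - m2) + c(-1, 1) * qt(0.975, df_welch) * se

c(t = t_stat, df = df_welch, lower = ci[1], upper = ci[2])
```

The results should land close to the reported t = 12.753, df = 102.42 and interval [9.306, 12.733].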

Interpretation

Levene’s test was statistically significant, so unequal variances were assumed.
Therefore, the Welch two-sample t-test was used to compare the two independent samples.
The hypothesis test showed a statistically significant difference between the mean life expectancy of developed and developing countries.
The two-sample t-test gave t = 12.753, \(p\) < 0.05, and the 95% CI for the difference in means, [9.306, 12.733], does not capture the null value of 0 under \(H_0\).

Discussion

The findings indicate that life expectancy differed significantly between developed and developing countries in 2015.
The mean life expectancy is 80.71 years in developed countries and 69.69 years in developing countries.
Therefore, it can be concluded that people in developed countries have a longer life expectancy.
The box plot shows no outliers in life expectancy for either group. However, if future data contain outliers, they will need to be handled before testing.
Nevertheless, the two groups in this investigation differ greatly in sample size. A future investigation could increase the number of developed countries sampled so that it is closer to the number of developing countries, to help ensure an accurate analysis.

References

Kaggle, Life Expectancy (WHO), Kaggle, viewed 26 September 2020, https://www.kaggle.com/kumarajarshi/life-expectancy-who
Surbhi, S 2015, Difference Between Developed Countries and Developing Countries, image, Key Differences, viewed 10 October 2020, https://keydifferences.com/difference-between-developed-countries-and-developing-countries.html

MATH1324 Assignment 2

Statistical analysis of life expectancy in developed and developing countries

RPubs link information
