Hridyansh Gulati s3893381, Shaikh Mohammad Rahil s3960736 and Tanya Thankachan s3909102
Last updated: 16 October, 2022
Data collection steps:
Data structure:
MaleLifeEx<-MaleLife%>%select(Country.Name,Country.Code,X2005..YR2005.)
MaleLifeExpec<-data.frame(MaleLifeEx)
colnames(MaleLifeExpec)<-c("CountryNames","CountryCode","MALE")
str(MaleLife)
## 'data.frame': 271 obs. of 5 variables:
## $ Series.Name : chr "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" ...
## $ Series.Code : chr "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" ...
## $ Country.Name : chr "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
## $ Country.Code : chr "AFG" "AFE" "AFW" "ALB" ...
## $ X2005..YR2005.: chr "57.044" "52.2217123832099" "50.3338297768142" "72.708" ...
Data structure:
FemaleLifeEx<-FemaleLife%>%select(Country.Name, Country.Code, X2005..YR2005.)
FemaleLifeExpec<-data.frame(FemaleLifeEx)
colnames(FemaleLifeExpec)<-c("CountryNames","CountryCode","FEMALE")
str(FemaleLife)
## 'data.frame': 271 obs. of 5 variables:
## $ Series.Name : chr "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" ...
## $ Series.Code : chr "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" ...
## $ Country.Name : chr "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
## $ Country.Code : chr "AFG" "AFE" "AFW" "ALB" ...
## $ X2005..YR2005.: chr "59.628" "55.754852591005" "52.2441270069714" "78.165" ...
LifeExp<-inner_join(MaleLifeExpec,FemaleLifeExpec, by=c("CountryNames","CountryCode"))
LifeExp<-LifeExp%>%slice(1:266)
## 'data.frame': 266 obs. of 4 variables:
## $ CountryNames: chr "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
## $ CountryCode : chr "AFG" "AFE" "AFW" "ALB" ...
## $ MALE : num 57 52.2 50.3 72.7 71.8 ...
## $ FEMALE : num 59.6 55.8 52.2 78.2 74.4 ...
## CountryNames CountryCode MALE FEMALE
## 0 0 22 22
There are missing values: 22 in each of the MALE and FEMALE columns.
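The raw columns are character (`chr`) while the joined data frame shows numeric (`num`) columns, so a conversion step must have taken place that is not shown here. The following is a minimal sketch of one way to do it, assuming missing entries in the export appear as a non-numeric placeholder such as `".."`. The data frame `toy` and its values are made up for illustration.

```r
# Toy stand-in for the joined data. Values and the ".." placeholder are
# illustrative assumptions, not rows from the report's dataset.
toy <- data.frame(
  CountryNames = c("A", "B", "C", "D"),
  CountryCode  = c("AAA", "BBB", "CCC", "DDD"),
  MALE         = c("57.044", "..", "72.708", "65.1"),
  FEMALE       = c("59.628", "..", "78.165", "70.2"),
  stringsAsFactors = FALSE
)

# as.numeric() turns non-numeric strings into NA and warns about it.
# The warning is suppressed because the coercion is intended here.
toy$MALE   <- suppressWarnings(as.numeric(toy$MALE))
toy$FEMALE <- suppressWarnings(as.numeric(toy$FEMALE))

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Counting with `colSums(is.na(...))` gives one total per column, which is the shape of the missing-value table shown above.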
## [1] 266
## [1] 266
## [1] 244
## [1] 244
Rows with NA values are removed, reducing the data from 266 to 244 rows.
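The removal code itself is not shown. As a hedged sketch, rows with any missing value can be dropped in base R with `complete.cases()`; `tidyr::drop_na()` or `na.omit()` would be alternatives. The data frame `toy` below is hypothetical.

```r
# Hypothetical data with one incomplete row.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  MALE    = c(57.0, NA, 72.7, 65.1),
  FEMALE  = c(59.6, NA, 78.2, 70.2)
)

n_before <- nrow(toy)

# Keep only rows where every column is non-missing.
toy_complete <- toy[complete.cases(toy), ]

n_after <- nrow(toy_complete)
print(c(before = n_before, after = n_after))
```

Comparing the row counts before and after is a quick check that only the incomplete rows were dropped.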
LifeExp<-LifeExp%>%pivot_longer(names_to = "Gender", values_to = "Life Expectancy", cols = 3:4 )
head(LifeExp)
Let's convert the Gender variable to a factor:
## tibble [488 × 4] (S3: tbl_df/tbl/data.frame)
## $ CountryNames : chr [1:488] "Afghanistan" "Afghanistan" "Africa Eastern and Southern" "Africa Eastern and Southern" ...
## $ CountryCode : chr [1:488] "AFG" "AFG" "AFE" "AFE" ...
## $ Gender : Factor w/ 2 levels "FEMALE","MALE": 2 1 2 1 2 1 2 1 2 1 ...
## $ Life Expectancy: num [1:488] 57 59.6 52.2 55.8 50.3 ...
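The conversion code is not shown above. A minimal sketch of how a character vector becomes a factor with the levels and integer codes seen in the `str()` output, using a made-up vector `gender`:

```r
# Illustrative values in the order pivot_longer() produces them:
# MALE then FEMALE for each country.
gender <- c("MALE", "FEMALE", "MALE", "FEMALE")

# Levels are given explicitly here. Left unspecified, factor() sorts
# them alphabetically, which gives the same order.
gender_f <- factor(gender, levels = c("FEMALE", "MALE"))

print(levels(gender_f))      # "FEMALE" "MALE"
print(as.integer(gender_f))  # 2 1 2 1
```

The integer codes index into the levels, which is why the `str()` output lists `2 1 2 1 ...` for rows that alternate MALE and FEMALE.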
#Life Expectancy summary
LifeExp%>%group_by(Gender)%>%summarise(Min = min(`Life Expectancy`,na.rm = TRUE),
  Q1 = quantile(`Life Expectancy`,probs = .25,na.rm = TRUE),
  Median = median(`Life Expectancy`, na.rm = TRUE),
  Q3 = quantile(`Life Expectancy`,probs = .75,na.rm = TRUE),
  Max = max(`Life Expectancy`,na.rm = TRUE),
  Mean = mean(`Life Expectancy`, na.rm = TRUE),
  SD = sd(`Life Expectancy`, na.rm = TRUE),
  n = n(),
  Missing = sum(is.na(`Life Expectancy`)))

| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
The sample data suggest that females have a higher mean life expectancy at birth than males worldwide.
The boxplot shows outliers in the Life Expectancy variable for both genders. Since there is only one outlier per gender, and each lies close to the lower whisker, we choose to keep them.
The boxplot also shows that, although life expectancy at birth is broadly similar for both genders, women appear to have the longer life expectancy.
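Boxplot outliers follow the usual 1.5 × IQR rule. The sketch below, on a made-up vector `x`, uses base R's `boxplot.stats()` to list the points a boxplot would flag; it is an illustration of the rule, not the code used for the report's plot.

```r
# Hypothetical life expectancy values with one unusually low observation.
x <- c(45, 62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 74)

# boxplot.stats() returns the five-number summary ($stats) and any points
# lying more than 1.5 times the hinge spread beyond the hinges ($out).
stats <- boxplot.stats(x)

print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$out)    # flagged outliers
```

Comparing `stats$out` with the lower whisker in `stats$stats[1]` shows how far a flagged point sits from the bulk of the data, which informs the decision to keep or drop it.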
We can determine whether this difference is statistically significant using a two-sample t-test. Let's start by considering the assumptions behind the two-sample t-test; the hypotheses are then stated formally before the test is run.
Let's test the two assumptions of the two-sample t-test, normality and homogeneity of variance, on the Life Expectancy variable for the male and female groups.
#normality test on male population
Life_Expectancy_male <- LifeExp %>% filter(Gender == "MALE")
Life_Expectancy_male$`Life Expectancy`%>% qqPlot(dist="norm")
## [1] 127 67
#normality test on female population
Life_Expectancy_female <- LifeExp %>% filter(Gender == "FEMALE")
Life_Expectancy_female$`Life Expectancy`%>% qqPlot(dist="norm")
## [1] 67 244
We notice that some of the data points fall outside the blue lines for both the male and female samples, indicating that the distributions are not normal. However, the summary statistics show a sample size of 244 for each group. By the Central Limit Theorem (CLT), when the sample size is large (i.e. n > 30), the sampling distribution of the mean is approximately normal regardless of the underlying population distribution. Thus, the normality condition for the two-sample t-test can be treated as satisfied.
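The CLT argument can be illustrated with a small simulation. This is a sketch under assumed conditions: samples of size 244 are drawn from a skewed exponential distribution, chosen only for illustration and not a model of the life expectancy data.

```r
set.seed(1)

n <- 244  # same group size as in the report

# Draw 5000 samples of size n from a right-skewed population
# (exponential with mean 1) and record each sample mean.
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

# The sample means should centre on the population mean (1) with a
# spread close to the theoretical standard error 1 / sqrt(n).
summary_clt <- c(
  mean_of_means = mean(sample_means),
  sd_of_means   = sd(sample_means),
  theory_se     = 1 / sqrt(n)
)
print(summary_clt)

# hist(sample_means) would show a roughly bell-shaped distribution.
```

The CLT gives an approximation rather than a guarantee, so with strongly skewed data or heavy outliers it is still worth inspecting the Q-Q plots as done above.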
We will use Levene’s test to test homogeneity of variance, that is, the assumption of equal variance. Levene’s test has the following statistical hypotheses:
\[H_0: \sigma_1^2 = \sigma_2^2 \]
\[H_A: \sigma_1^2 \ne \sigma_2^2\]
where \(\sigma_1^2\) and \(\sigma_2^2\) refer to the population variances of female and male life expectancy, respectively. Levene’s test reports a \(p\)-value that is compared to the standard 0.05 significance level (\(\alpha\)). The leveneTest() function in R can be used to compare the variances of male and female life expectancies.
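The call used in the report is not shown. As a sketch of what the test computes, the median-centred Levene statistic can be reproduced in base R as a one-way ANOVA on absolute deviations from the group medians. The data frame `toy` and the column name `LifeExpectancy` are made up for illustration.

```r
set.seed(42)

# Simulated two-group data, not the report's dataset.
toy <- data.frame(
  Gender = factor(rep(c("FEMALE", "MALE"), each = 50)),
  LifeExpectancy = c(rnorm(50, mean = 71, sd = 10),
                     rnorm(50, mean = 66, sd = 10))
)

# Absolute deviation of each observation from its own group median.
toy$AbsDev <- abs(toy$LifeExpectancy -
                    ave(toy$LifeExpectancy, toy$Gender, FUN = median))

# One-way ANOVA on the absolute deviations gives the Levene F statistic.
levene <- anova(lm(AbsDev ~ Gender, data = toy))
levene_p <- levene[["Pr(>F)"]][1]
print(levene)

# With the car package installed, the equivalent call would be:
# car::leveneTest(LifeExpectancy ~ Gender, data = toy)
```

A \(p\)-value above 0.05 from this table means there is no evidence against equal variances, which is the situation reported for the life expectancy data.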
Levene’s Test Result -
The \(p\)-value for Levene’s test of equal variance for life expectancy between males and females was \(p\) = 0.3199. Since \(p\) > 0.05, we fail to reject \(H_0\) (null hypothesis). In plain language, there is no evidence against equal variance, so we can assume it. The assumption of equal variance is important because it determines the type of two-sample \(t\)-test we will perform.
With the assumptions of equal variance and normality in place, we can now perform a \(t\)-test on the life expectancy at birth variable for the male and female populations.
We perform a two-sided hypothesis test because the hypotheses are non-directional: \(\mu_1 - \mu_2 = 0\) against \(\mu_1 - \mu_2 \ne 0\), with no directional alternative (\(\mu_1 - \mu_2 < 0\) or \(\mu_1 - \mu_2 > 0\)). We use the t.test() function.
The two-sample \(t\)-test has the following statistical hypotheses:
\[H_0: \mu_1 - \mu_2 = 0\]
\[H_A: \mu_1 - \mu_2 \ne 0\]
where,
\(H_0\) (null hypothesis) states that the difference between the two independent population means, that is, mean female life expectancy \(\mu_1\) and mean male life expectancy \(\mu_2\), is 0
and,
\(H_A\) (alternative hypothesis) states that the difference between the two independent population means, that is, mean female life expectancy \(\mu_1\) and mean male life expectancy \(\mu_2\), is not 0.
In other words, the null hypothesis is that males and females have equal mean life expectancies, and the alternative hypothesis is that they have different mean life expectancies.
Now, let's run the \(t\)-test -
##
## Two Sample t-test
##
## data: Life Expectancy by Gender
## t = 5.535, df = 486, p-value = 5.099e-08
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
## 3.111621 6.536644
## sample estimates:
## mean in group FEMALE mean in group MALE
## 70.72492 65.90078
We have used the var.equal = TRUE option to perform the equal-variance two-sample t-test and the alternative = "two.sided" option to specify a two-tailed test.
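The call itself is not shown above. A minimal sketch of a call with these options, run on simulated data, is given below; the data frame `toy` and the column name `LifeExpectancy` are hypothetical. In the report's data the response column is named `Life Expectancy` and would need backticks in the formula.

```r
set.seed(42)

# Simulated two-group data, not the report's dataset.
toy <- data.frame(
  Gender = factor(rep(c("FEMALE", "MALE"), each = 50)),
  LifeExpectancy = c(rnorm(50, mean = 71, sd = 10),
                     rnorm(50, mean = 66, sd = 10))
)

# Equal-variance (pooled) two-sample t-test with a two-sided alternative.
res <- t.test(LifeExpectancy ~ Gender, data = toy,
              var.equal = TRUE, alternative = "two.sided")

print(res$statistic)  # t statistic
print(res$parameter)  # degrees of freedom: n1 + n2 - 2
print(res$p.value)    # two-tailed p-value
print(res$conf.int)   # 95% CI for the difference in means
```

With the formula interface, the difference is taken as the first factor level minus the second, which is why the report's output compares group FEMALE with group MALE.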
\(T\)-Test Results -
The difference between males and females estimated by the sample was 70.72492 - 65.90078 = 4.82414.
The test statistic \(t\) = 5.535
The \(t\)-statistic is compared to a two-tailed critical value \(t^*\) with degrees of freedom \(df\): \[ df = n_1 + n_2 - 2 \] For two-tailed hypothesis testing, the rejection regions are split between above and below \(H_0\). We still need to maintain an overall significance level of 0.05. Since it is a two-sided hypothesis test, \(\alpha\) splits as \(\alpha/2\) for the upper and lower tail. We find the critical values associated with 0.05/2 = 0.025 in the upper and lower tail of the sampling distribution under \(H_0\) using qt() in R:
## [1] 1.964857
Thus, \(t^*\) is 1.964857.
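The qt() call is not shown above. A short sketch that should reproduce the critical value, using the group sizes of 244 reported earlier (the variable names are illustrative):

```r
n1 <- 244  # female group size
n2 <- 244  # male group size
alpha <- 0.05

# Degrees of freedom for the equal-variance two-sample t-test.
df <- n1 + n2 - 2

# Upper-tail critical value; the lower-tail value is its negative.
t_crit <- qt(p = 1 - alpha / 2, df = df)

print(df)
print(t_crit)
```

Because the t distribution is symmetric, `qt(alpha / 2, df)` returns the same magnitude with a negative sign.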
As the test statistic from the two-sample \(t\)-test assuming equal variance was \(t\) = 5.535, which is more extreme than 1.964857, we reject \(H_0\) (null hypothesis). Thus, according to the critical value method, there was a statistically significant difference between male and female life expectancy means.
The \(p\)-value of the two-sample \(t\)-test tells us the probability of observing a sample difference between the means of 4.82414 (the difference of means from the \(t\)-test result), or one more extreme, assuming the difference was 0 in the population (i.e. \(H_0\) is true). The two-tailed \(p\)-value was reported to be \(p\) = 5.099e-08. As \(p\) = 5.099e-08 < \(\alpha\) (0.05), we reject \(H_0\). Thus, according to the \(p\)-value method, there was a statistically significant difference between the means.
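As a cross-check, the two-tailed \(p\)-value can be recomputed from the reported test statistic and degrees of freedom. This sketch uses the rounded value \(t\) = 5.535, so the result will agree with the reported \(p\)-value only approximately.

```r
t_stat <- 5.535  # reported test statistic (rounded)
df <- 486        # reported degrees of freedom

# Two-tailed p-value: probability in both tails beyond |t|.
p_value <- 2 * pt(-abs(t_stat), df = df)

print(p_value)
print(p_value < 0.05)
```

Doubling the lower-tail probability is valid here because the t distribution is symmetric around 0.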
The 95% CI of the difference between the means (4.82414) is reported as 95% CI [3.111621, 6.536644] (from the \(t\)-test). As this interval does not capture the value stated in \(H_0\) (a mean difference of 0), we reject \(H_0\). Once again, according to the confidence interval method, there was a statistically significant difference between the means.
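The interval can be approximately reconstructed from the reported figures. In this sketch the standard error is backed out as the difference in means divided by the rounded \(t\) statistic, so small rounding differences from the reported interval are expected.

```r
diff_means <- 70.72492 - 65.90078  # reported group means
t_stat <- 5.535                    # reported test statistic (rounded)
df <- 486

# Standard error implied by t = difference / SE.
se <- diff_means / t_stat

# 95% CI: difference plus or minus critical value times standard error.
t_crit <- qt(0.975, df = df)
ci <- diff_means + c(-1, 1) * t_crit * se

print(se)
print(ci)
```

The three decision methods agree by construction: the interval excludes 0 exactly when the absolute test statistic exceeds the critical value, which is also when the two-tailed \(p\)-value falls below 0.05.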
A two-sample \(t\)-test was used to test for a significant difference between the mean life expectancy at birth of males and females. While the life expectancy for males and females exhibited evidence of non-normality upon inspection of the normal Q-Q plots, the central limit theorem justified applying the \(t\)-test because of the large sample size (244) in each group. Levene’s test of homogeneity of variance indicated that equal variance could be assumed. The two-sample \(t\)-test assuming equal variance found a statistically significant difference between the mean life expectancy at birth of males and females, \(t\) (\(df\) = 486) = 5.535, \(p\) = 5.099e-08, 95% CI for the difference in means [3.111621, 6.536644]. The results of the investigation suggest that females have a significantly higher life expectancy at birth than males. Thus, gender does play a role in average life expectancy at birth.
However, there are limitations associated with our investigation. The dataset had missing values for some of the countries and/or geographical locations. Thus, it is not a good representation of the entire world population.
University of Oxford (2021) What is AI? Here’s everything you need to know about artificial intelligence, ourworldindata.org website, accessed 15 October 2021. https://ourworldindata.org/grapher/life-expectancy-of-women-vs-life-expectancy-of-women
The World Bank, DataBank| Health Nutrition and Population Statistics, accessed 10 October 2021. https://databank.worldbank.org/source/health-nutrition-and-population-statistics#