MATH 1324 Applied Analytics

Assignment 2

Hridyansh Gulati s3893381, Shaikh Mohammad Rahil s3960736 and Tanya Thankachan s3909102

Last updated: 16 October, 2022

Introduction

We have often heard ‘women live longer than men’. How true is this?
According to an Oxford-affiliated website, the life expectancy of women is higher than that of men in most countries (Our World in Data 2021).
We intend to gather statistical evidence through hypothesis testing to answer the question: is gender a factor in life expectancy at birth?
Life expectancy is how long, on average, a newborn can be expected to live, if the current death rates do not change.

Problem Statement

Using sample data drawn from the total population, our objective is to test whether males and females have the same average life expectancy (null hypothesis) or whether the averages differ, in which case gender plays a role in defining average life expectancies (alternate hypothesis).
To answer this question, we will use hypothesis testing and conduct a two-sample t-test to compare the average life expectancy of males and females.

Data

We required worldwide male and female life expectancy data.
The sample datasets for male and female population were taken from The World Bank website at the following URL: https://databank.worldbank.org/source/health-nutrition-and-population-statistics#.
The sample data are for 2005 and hold life expectancy figures for countries and geographical regions all around the world.
Each dataset holds 5 variables:
- Series.Name, which is simply a title
- Series.Code, a data-specific code
- Country.Name
- Country.Code, an abbreviation
- X2005..YR2005., the life expectancy in years
X2005..YR2005. holds numeric values (two digits before the decimal point), although read.csv() initially reads it as character; the other 4 are character variables.

Data collection (Data Cont.)

Data collection steps:

1. Navigate to https://databank.worldbank.org/source/health-nutrition-and-population-statistics#
2. In the Variables sub-tab, under Database, select Health Nutrition and Population Statistics
3. Under Country, select everything
4. Under Series, search for life expectancy and filter for life expectancy at birth, female (years) and life expectancy at birth, male (years)
5. Under Year, select the relevant year
6. Apply the changes and download using the download options at the top right.

Reading the data (Data Cont.)

Read Male life expectancy data

Imported using read.csv() and displayed using head():

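library(dplyr)   # %>%, select(), inner_join(), group_by(), summarise()
library(tidyr)   # pivot_longer()
library(car)     # qqPlot(), leveneTest()

MaleLife <- read.csv("C:/Users/admin/Downloads/Males.csv")
head(MaleLife)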

Data structure:

The data set contains 271 observations of 5 variables.

MaleLifeEx<-MaleLife%>%select(Country.Name,Country.Code,X2005..YR2005.)
MaleLifeExpec<-data.frame(MaleLifeEx)
colnames(MaleLifeExpec)<-c("CountryNames","CountryCode","MALE")
str(MaleLife)

## 'data.frame':    271 obs. of  5 variables:
##  $ Series.Name   : chr  "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" ...
##  $ Series.Code   : chr  "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" ...
##  $ Country.Name  : chr  "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
##  $ Country.Code  : chr  "AFG" "AFE" "AFW" "ALB" ...
##  $ X2005..YR2005.: chr  "57.044" "52.2217123832099" "50.3338297768142" "72.708" ...

Read Female life expectancy data

Imported using read.csv() and displayed using head():

FemaleLife <-
  read.csv("C:/Users/admin/Downloads/Females.csv")
head(FemaleLife)

Data structure:

The data set contains 271 observations of 5 variables.

FemaleLifeEx<-FemaleLife%>%select(Country.Name, Country.Code, X2005..YR2005.)
FemaleLifeExpec<-data.frame(FemaleLifeEx)
colnames(FemaleLifeExpec)<-c("CountryNames","CountryCode","FEMALE")
str(FemaleLife)

## 'data.frame':    271 obs. of  5 variables:
##  $ Series.Name   : chr  "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" ...
##  $ Series.Code   : chr  "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" ...
##  $ Country.Name  : chr  "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
##  $ Country.Code  : chr  "AFG" "AFE" "AFW" "ALB" ...
##  $ X2005..YR2005.: chr  "59.628" "55.754852591005" "52.2441270069714" "78.165" ...

Combine Male and Female datasets

Join on CountryNames and CountryCode using an inner join (both datasets have the same number of records).
Save the result in the LifeExp data frame.

LifeExp<-inner_join(MaleLifeExpec,FemaleLifeExpec, by=c("CountryNames","CountryCode"))
LifeExp<-LifeExp%>%slice(1:266)
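
To see what an inner join keeps and what it drops, here is a minimal sketch on two made-up data frames. The names `male_toy` and `female_toy` and all values are illustrative assumptions, not the World Bank data. `anti_join()` lists the rows of one table that have no match in the other, which is a quick way to check that an inner join loses nothing.

```r
library(dplyr)

# Toy stand-ins for MaleLifeExpec and FemaleLifeExpec (hypothetical values)
male_toy <- data.frame(
  CountryNames = c("A-land", "B-land", "C-land"),
  CountryCode  = c("AAA", "BBB", "CCC"),
  MALE         = c(60.1, 70.2, 68.4)
)
female_toy <- data.frame(
  CountryNames = c("A-land", "B-land", "D-land"),
  CountryCode  = c("AAA", "BBB", "DDD"),
  FEMALE       = c(64.3, 75.0, 71.9)
)

# Inner join keeps only the countries present in both tables
joined_toy <- inner_join(male_toy, female_toy,
                         by = c("CountryNames", "CountryCode"))

# Rows of male_toy that found no partner in female_toy
unmatched_toy <- anti_join(male_toy, female_toy,
                           by = c("CountryNames", "CountryCode"))

nrow(joined_toy)            # number of matched countries
unmatched_toy$CountryCode   # codes that the inner join would drop
```

On the real data, an empty `anti_join()` result in both directions would confirm that the two files cover the same countries.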

Data type conversion

Convert numeric variables read as character to numeric

LifeExp$MALE<-as.numeric(LifeExp$MALE)
LifeExp$FEMALE<-as.numeric(LifeExp$FEMALE)
str(LifeExp)

## 'data.frame':    266 obs. of  4 variables:
##  $ CountryNames: chr  "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
##  $ CountryCode : chr  "AFG" "AFE" "AFW" "ALB" ...
##  $ MALE        : num  57 52.2 50.3 72.7 71.8 ...
##  $ FEMALE      : num  59.6 55.8 52.2 78.2 74.4 ...
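
The life expectancy column was read as character, presumably because the export marks missing values with a non-numeric placeholder. The sketch below assumes that placeholder is the string ".." (an assumption, since the raw files are not shown here). as.numeric() turns anything it cannot parse into NA, which would explain the NA values detected in the cleaning step.

```r
# Hypothetical raw values; ".." as the missing-value marker is an assumption
raw_values <- c("57.044", "52.2217123832099", "..", "72.708")

# suppressWarnings() hides the "NAs introduced by coercion" warning
parsed <- suppressWarnings(as.numeric(raw_values))

parsed               # numeric vector with NA in the third position
sum(is.na(parsed))   # count of values that could not be parsed
```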

Data Cleaning and Tidying (Data cont.)

Missing value detection and correction

Check NA values per variable

colSums(is.na(LifeExp))

## CountryNames  CountryCode         MALE       FEMALE 
##            0            0           22           22

There are 22 missing values in each of MALE and FEMALE.

Check data length

length(LifeExp$MALE)

## [1] 266

length(LifeExp$FEMALE)

## [1] 266

Remove the NA observations

LifeExp<-LifeExp[complete.cases(LifeExp),]

Check data length again

length(LifeExp$MALE)

## [1] 244

length(LifeExp$FEMALE)

## [1] 244

The 22 incomplete rows are removed, leaving 244 observations.

Tidying data

Format from wide to long

LifeExp<-LifeExp%>%pivot_longer(names_to = "Gender",  values_to = "Life Expectancy", cols = 3:4 )
head(LifeExp)

Let’s convert the Gender variable to a factor:

LifeExp$Gender<-as.factor(LifeExp$Gender)
str(LifeExp)

## tibble [488 × 4] (S3: tbl_df/tbl/data.frame)
##  $ CountryNames   : chr [1:488] "Afghanistan" "Afghanistan" "Africa Eastern and Southern" "Africa Eastern and Southern" ...
##  $ CountryCode    : chr [1:488] "AFG" "AFG" "AFE" "AFE" ...
##  $ Gender         : Factor w/ 2 levels "FEMALE","MALE": 2 1 2 1 2 1 2 1 2 1 ...
##  $ Life Expectancy: num [1:488] 57 59.6 52.2 55.8 50.3 ...
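
The wide-to-long reshape can be illustrated on a tiny made-up table. The data frame `wide_toy` and its values are hypothetical. Each country row becomes one row per gender, which is why 244 rows turn into 488.

```r
library(tidyr)

# Toy wide table: one row per country, one column per gender (made-up values)
wide_toy <- data.frame(
  CountryCode = c("AAA", "BBB"),
  MALE        = c(60.1, 70.2),
  FEMALE      = c(64.3, 75.0)
)

# Long format: one row per country-gender pair
long_toy <- pivot_longer(wide_toy,
                         cols      = c(MALE, FEMALE),
                         names_to  = "Gender",
                         values_to = "Life Expectancy")

dim(long_toy)   # twice as many rows as wide_toy, three columns
```

Selecting the columns by name rather than by position (`cols = 3:4`) is a design choice that keeps the call working if the column order changes.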

Descriptive Statistics and Visualisation

The LifeExp data frame now has four variables: country name, country code, Gender and Life Expectancy.

Summary statistics of Life Expectancy variable

Let’s look at the summary statistics of Life Expectancy for the male and female genders, using summarise():

#Life Expectancy summary
LifeExpSummary <- LifeExp %>%
  group_by(Gender) %>%
  summarise(Min = min(`Life Expectancy`, na.rm = TRUE),
            Q1 = quantile(`Life Expectancy`, probs = .25, na.rm = TRUE),
            Median = median(`Life Expectancy`, na.rm = TRUE),
            Q3 = quantile(`Life Expectancy`, probs = .75, na.rm = TRUE),
            Max = max(`Life Expectancy`, na.rm = TRUE),
            Mean = mean(`Life Expectancy`, na.rm = TRUE),
            SD = sd(`Life Expectancy`, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(`Life Expectancy`)))

# Print the summary table
knitr::kable(LifeExpSummary)

The sample means, 70.72 years for females and 65.90 years for males (the same group means that t.test() reports in the hypothesis testing section), suggest that females have a higher mean life expectancy at birth worldwide.

Outlier detection

Now, let’s check whether there are any outliers in the Life Expectancy variable for either gender.
We visualise them with box plots using boxplot():

LifeExp%>%boxplot(`Life Expectancy`~ Gender, data=., ylab = "Life Expectancy  at birth")

We notice one outlier in the Life Expectancy variable for each gender, both at the low end. Since each lies very close to the lower fence, we choose to keep them.
The boxplot also shows that, although the two distributions look broadly similar, female life expectancy at birth appears to be higher.
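
boxplot() draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the hinges. The sketch below shows that rule on a made-up vector (the values in `x` are illustrative, not the life expectancy data), first with boxplot.stats() and then by hand.

```r
# Made-up life expectancies with one unusually low value
x <- c(44, 62, 65, 66, 68, 70, 71, 72, 74, 76, 78)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses
stats <- boxplot.stats(x)
stats$out                      # values that would be drawn as separate points

# Equivalent fences computed by hand from the lower and upper hinges
hinges <- fivenum(x)[c(2, 4)]
fences <- c(hinges[1] - 1.5 * diff(hinges),
            hinges[2] + 1.5 * diff(hinges))
by_hand <- x[x < fences[1] | x > fences[2]]
by_hand
```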

We can determine whether this difference is statistically significant using a two-sample t-test. Let’s start by checking the assumptions behind the test; the hypotheses are stated formally just before the test is run.

Hypothesis Testing

Let’s test the two assumptions of the two-sample t-test, normality and homogeneity of variance, on the Life Expectancy variable for the male and female genders.

Testing the Assumption of Normality:

If the data are approximately normal, we can go ahead with the two-sample t-test.
The best approach is to check the data visually for gross departures from normality using Q-Q plots.
If the data points fall close to the diagonal line, the distribution is approximately normal.
Because of sampling error, the data points won’t fall exactly on the line.
Our LifeExp sample has 244 observations per gender (n > 30), so by the CLT the sampling distribution of the mean will be approximately normal even if the data themselves are not.
Still, let’s look at the Q-Q plots.

# Q-Q plot for the male sample
Life_Expectancy_male <- LifeExp %>% filter(Gender == "MALE")
Life_Expectancy_male$`Life Expectancy` %>% qqPlot(dist = "norm")

## [1] 127  67

# Q-Q plot for the female sample
Life_Expectancy_female <- LifeExp %>% filter(Gender == "FEMALE")
Life_Expectancy_female$`Life Expectancy` %>% qqPlot(dist = "norm")

## [1]  67 244

We notice that some of the data points fall outside the blue lines for both the male and female samples, indicating non-normality of the distributions. However, from the summary statistics, the sample sizes for the male and female populations are 244 each. By the CLT (Central Limit Theorem), when the sample size is large (i.e. n > 30) the sampling distribution of a mean will be approximately normal, regardless of the underlying population distribution. Thus, the normality condition for the two-sample t-test is satisfied.
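
A quick simulation can make the CLT argument concrete. The sketch below is illustrative only: the skewed population (a rescaled Beta distribution), the seed and the helper `skewness()` are our own assumptions and not part of the analysis. It compares the skewness of individual values with the skewness of means of samples of size 244.

```r
set.seed(1)

# Simple moment-based skewness (hypothetical helper, defined for this sketch)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Left-skewed population, chosen only for illustration
draw <- function(size) 50 + 35 * rbeta(size, shape1 = 5, shape2 = 2)

n <- 244
sample_means <- replicate(5000, mean(draw(n)))

pop_skew  <- skewness(draw(10000))   # individual values: clearly negative
mean_skew <- skewness(sample_means)  # sample means: close to zero

c(population = pop_skew, sample_means = mean_skew)

# Q-Q plot of the simulated sample means against the normal distribution
qqnorm(sample_means)
qqline(sample_means)
```

The point of the sketch is that the t-test relies on the distribution of the sample mean, not on the distribution of the individual observations.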

Hypothesis Testing Cont.

Testing Homogeneity of Variance:

We will use Levene’s test to check homogeneity of variance, that is, the assumption of equal variance. Levene’s test has the following statistical hypotheses:

\[H_0: \sigma_1^2 = \sigma_2^2 \]

\[H_A: \sigma_1^2 \ne \sigma_2^2\]

where $\sigma_1^2$ and $\sigma_2^2$ refer to the population variance of female and male life expectancies, respectively. Levene’s test reports a p-value that is compared to the standard 0.05 significance level ($\alpha$). We can use the leveneTest() function in R to compare the variances of male and female life expectancies:

#Homogeneity of Variance
leveneTest(`Life Expectancy`~Gender, data = LifeExp)

Levene’s Test Result -

The $p$-value for the Levene’s test of equal variance for Life expectancy between males and females was $p$ = 0.3199. Since $p$ > 0.05, we fail to reject $H_0$ (null hypothesis). In plain language, we are safe to assume equal variance. The assumption of equal variance is important because it will determine the type of two-sample $t$-test we will perform.
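
To show what the test computes, here is a minimal base-R sketch, assuming the default `center = median` setting of leveneTest(): the statistic is the F test from a one-way ANOVA on the absolute deviations of each value from its group median. The data frame `toy` is simulated and is not the life expectancy data.

```r
set.seed(42)

# Simulated two-group data with equal spread (illustrative only)
toy <- data.frame(
  value = c(rnorm(50, mean = 70, sd = 9), rnorm(50, mean = 66, sd = 9)),
  group = factor(rep(c("FEMALE", "MALE"), each = 50))
)

# Absolute deviation of each value from its own group's median
toy$abs_dev <- abs(toy$value - ave(toy$value, toy$group, FUN = median))

# One-way ANOVA on the deviations; its F test plays the role of Levene's test
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
levene_p <- levene_by_hand[["Pr(>F)"]][1]
levene_p
```

A large p-value here, as in the report, means the spread around the group medians is similar in the two groups.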

With the assumptions of equal variance and normality in place, we can now perform the $t$-test on the life expectancy at birth variable for the male and female populations.

Hypothesis Testing Cont.

We perform a two-sided hypothesis test because the hypotheses are non-directional ($\mu_1 - \mu_2 = 0$ versus $\mu_1 - \mu_2 \ne 0$); we do not test a directional alternative ($\mu_1 - \mu_2 < 0$ or $\mu_1 - \mu_2 > 0$). We use t.test().

The two-sample t-test has the following statistical hypotheses: \[H_0:\mu_1-\mu_2=0\] \[H_A:\mu_1-\mu_2\ne0\] where,

$H_0$ (null hypothesis) states that the difference between the two independent population means, that is, mean female life expectancy $\mu_1$ and mean male life expectancy $\mu_2$, is 0

and,

$H_A$ (alternate hypothesis) states that the difference between the two independent population means, that is, mean female life expectancy $\mu_1$ and mean male life expectancy $\mu_2$, is not 0.

In other words, the null hypothesis is that males and females have equal mean life expectancies, and the alternate hypothesis is that males and females have different mean life expectancies.

Now, let’s run the $t$-test:

t.test(
  `Life Expectancy`~Gender,
  data = LifeExp,
  var.equal = TRUE,
  alternative = "two.sided"
)

## 
##  Two Sample t-test
## 
## data:  Life Expectancy by Gender
## t = 5.535, df = 486, p-value = 5.099e-08
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
##  3.111621 6.536644
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             70.72492             65.90078

We have used the var.equal = TRUE option to perform the equal variance assumed two-sample t-test and the alternative = "two.sided" option to specify a two-tailed test.

$T$-Test Results -

The difference between the female and male means estimated from the sample was 70.72492 - 65.90078 = 4.82414 years.

The test statistic $t$ = 5.535
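
For reference, the equal-variance two-sample statistic has the standard pooled form below, where $\bar{x}_1$, $s_1^2$, $n_1$ and $\bar{x}_2$, $s_2^2$, $n_2$ denote the sample mean, sample variance and sample size for females and males respectively (notation introduced here for illustration):

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]

With a mean difference of 4.82414 and $t$ = 5.535, the denominator (the standard error of the difference) works out to roughly 4.82414 / 5.535, or about 0.87 years.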

The t-statistic is compared to a two-tailed critical value $t^*$ with degrees of freedom \[ df = n_1 + n_2 - 2 \] For two-tailed hypothesis testing, the rejection region is split between the upper and lower tails of the sampling distribution under $H_0$. We still need to maintain an overall significance level of 0.05, so $\alpha$ is split as $\alpha/2$ between the two tails. We find the critical value associated with 0.05/2 = 0.025 in each tail using qt() in R:

qt(p = 0.975, df = 244 + 244 - 2)

## [1] 1.964857

Thus, $t^*$ is 1.964857.

Reading the t-test result by using the critical value:

As the test statistic from the two-sample $t$-test assuming equal variance was $t$ = 5.535, which is more extreme than $t^*$ = 1.964857, we reject $H_0$ (null hypothesis). Thus, according to the critical value method, there was a statistically significant difference between male and female life expectancy means.

Reading the t-test result by using the $p$ - value:

The $p$-value of the two-sample $t$-test tells us the probability of observing a sample difference between the means of 4.82414 (the difference of means from the $t$-test result), or one more extreme, assuming the difference was 0 in the population (i.e. $H_0$ is true). The two-tailed $p$-value was reported to be $p$ = 5.099e-08. As $p$ = 5.099e-08 < $\alpha$ (0.05), we reject $H_0$. Thus, according to the $p$-value method, there was a statistically significant difference between the means.
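
The two-tailed $p$-value can be recovered from the test statistic and the degrees of freedom with pt(). This is a sketch using the rounded $t$ from the output, so the result should be close to, but not exactly, the reported 5.099e-08; the variable names are ours.

```r
t_stat <- 5.535             # rounded statistic from the t.test() output
df     <- 244 + 244 - 2     # 486 degrees of freedom

# Two-tailed p-value: probability in both tails beyond |t| under H0
p_two_sided <- 2 * pt(-abs(t_stat), df = df)

p_two_sided
p_two_sided < 0.05          # decision at the 0.05 significance level
```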

Reading the t-test result by using the Confidence Interval:

The 95% CI of the difference between the means (4.82414) is reported as 95% CI [3.111621, 6.536644] (from the $t$-test). As this interval does not capture the value stated by $H_0$ (a mean difference of 0), we reject $H_0$. Once again, according to the confidence interval method, there was a statistically significant difference between the means.
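
The three readings of the result are tied together by the standard error. As a consistency check, the sketch below recovers the standard error from the reported confidence interval and uses it to rebuild the test statistic. Only numbers printed in the t.test() output are used; the variable names are ours.

```r
mean_female <- 70.72492
mean_male   <- 65.90078
diff_means  <- mean_female - mean_male        # estimated difference in means

t_crit <- qt(p = 0.975, df = 244 + 244 - 2)   # two-tailed critical value

# Half the CI width equals t_crit times the standard error
ci <- c(3.111621, 6.536644)
se <- diff(ci) / (2 * t_crit)

t_rebuilt <- diff_means / se   # should land close to the reported t = 5.535
ci_centre <- mean(ci)          # midpoint of the CI, close to diff_means

c(se = se, t = t_rebuilt, centre = ci_centre)
```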

Discussion

A two-sample $t$-test was used to test for a significant difference between the mean life expectancy at birth of males and females. While the life expectancy for males and females exhibited evidence of non-normality upon inspection of the normal Q-Q plots, the central limit theorem meant that the t-test could be applied, given the large sample size (244) in each group. Levene’s test of homogeneity of variance indicated that equal variance could be assumed. The two-sample t-test assuming equal variance found a statistically significant difference between the mean life expectancy at birth of males and females, $t$($df$ = 486) = 5.535, $p$ = 5.099e-08, 95% CI for the difference in means [3.111621, 6.536644]. The results of the investigation suggest that females have significantly higher life expectancy at birth than males. Thus, gender does play a role in defining average life expectancies at birth.

However, there are limitations associated with our investigation. The dataset had missing values for 22 of the 266 countries and geographical regions, so the sample may not fully represent the entire world population.

References

Introduction statement taken from:

University of Oxford (2021) What is AI? Here’s everything you need to know about artificial intelligence, ourworldindata.org website, accessed 15 October 2021. https://ourworldindata.org/grapher/life-expectancy-of-women-vs-life-expectancy-of-women

Data set reference:

The World Bank, DataBank| Health Nutrition and Population Statistics, accessed 10 October 2021. https://databank.worldbank.org/source/health-nutrition-and-population-statistics#
