MATH 1324 Applied Analytics

Assignment 2

Hridyansh Gulati s3893381, Shaikh Mohammad Rahil s3960736 and Tanya Thankachan s3909102

Last updated: 16 October, 2022

Introduction

Problem Statement

Data

Data collection (Data Cont.)

Data collection steps:

  1. Navigate to https://databank.worldbank.org/source/health-nutrition-and-population-statistics#
  2. In Variables sub-tab, under Database, select Health Nutrition and Population Statistics
  3. Under Country, select everything
  4. Under Series, search life expectancy and filter for life expectancy at birth, female (years) and life expectancy at birth, male (years)
  5. Under Year, select relevant year
  6. Apply changes and download using download options on top right.

Reading the data (Data Cont.)

Read Male life expectancy data

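library(dplyr)   # %>%, select(), inner_join(), slice(), filter()
library(tidyr)   # pivot_longer()
library(car)     # qqPlot(), leveneTest()
MaleLife <-
  read.csv("C:/Users/admin/Downloads/Males.csv")
head(MaleLife)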

Data structure:

# Keep the country identifiers and the 2005 column, then rename the columns
MaleLifeEx <- MaleLife %>% select(Country.Name, Country.Code, X2005..YR2005.)
MaleLifeExpec <- data.frame(MaleLifeEx)
colnames(MaleLifeExpec) <- c("CountryNames", "CountryCode", "MALE")
str(MaleLife)
## 'data.frame':    271 obs. of  5 variables:
##  $ Series.Name   : chr  "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" "Life expectancy at birth, male (years)" ...
##  $ Series.Code   : chr  "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" "SP.DYN.LE00.MA.IN" ...
##  $ Country.Name  : chr  "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
##  $ Country.Code  : chr  "AFG" "AFE" "AFW" "ALB" ...
##  $ X2005..YR2005.: chr  "57.044" "52.2217123832099" "50.3338297768142" "72.708" ...

Read Female life expectancy data

FemaleLife <-
  read.csv("C:/Users/admin/Downloads/Females.csv")
head(FemaleLife)

Data structure:

# Keep the country identifiers and the 2005 column, then rename the columns
FemaleLifeEx <- FemaleLife %>% select(Country.Name, Country.Code, X2005..YR2005.)
FemaleLifeExpec <- data.frame(FemaleLifeEx)
colnames(FemaleLifeExpec) <- c("CountryNames", "CountryCode", "FEMALE")
str(FemaleLife)
## 'data.frame':    271 obs. of  5 variables:
##  $ Series.Name   : chr  "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" "Life expectancy at birth, female (years)" ...
##  $ Series.Code   : chr  "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" "SP.DYN.LE00.FE.IN" ...
##  $ Country.Name  : chr  "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
##  $ Country.Code  : chr  "AFG" "AFE" "AFW" "ALB" ...
##  $ X2005..YR2005.: chr  "59.628" "55.754852591005" "52.2441270069714" "78.165" ...

Combine Male and Female datasets

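# Match male and female records on country name and code
LifeExp <- inner_join(MaleLifeExpec, FemaleLifeExpec, by = c("CountryNames", "CountryCode"))
# Keep the first 266 rows (the raw files hold 271 rows each)
LifeExp <- LifeExp %>% slice(1:266)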
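
The hard-coded slice(1:266) relies on knowing in advance how many rows are real records. A minimal sketch of an alternative, assuming the extra rows at the end of the downloaded CSV are blank or footer lines with an empty Country.Code (an assumption about the file layout, not something shown above), filters on that column instead. The toy data frame and its footer text are hypothetical.

```r
# Toy stand-in for a raw download; the footer text is hypothetical.
raw <- data.frame(
  Series.Name    = c("Life expectancy at birth, male (years)",
                     "Life expectancy at birth, male (years)",
                     "",
                     "hypothetical footer line"),
  Country.Name   = c("Afghanistan", "Albania", "", ""),
  Country.Code   = c("AFG", "ALB", "", ""),
  X2005..YR2005. = c("57.044", "72.708", "", ""),
  stringsAsFactors = FALSE
)

# Keep only rows that carry a country code, instead of slicing by position.
records <- raw[nzchar(raw$Country.Code), ]
nrow(records)
```

Filtering each file this way before the join would also avoid matching blank rows against each other.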

Data type conversion

LifeExp$MALE<-as.numeric(LifeExp$MALE)
LifeExp$FEMALE<-as.numeric(LifeExp$FEMALE)
str(LifeExp)
## 'data.frame':    266 obs. of  4 variables:
##  $ CountryNames: chr  "Afghanistan" "Africa Eastern and Southern" "Africa Western and Central" "Albania" ...
##  $ CountryCode : chr  "AFG" "AFE" "AFW" "ALB" ...
##  $ MALE        : num  57 52.2 50.3 72.7 71.8 ...
##  $ FEMALE      : num  59.6 55.8 52.2 78.2 74.4 ...
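
The missing values counted in the next step most likely arise during this conversion: as.numeric() returns NA, with a warning, for any string that cannot be parsed as a number. A small sketch, assuming the export marks missing observations with a placeholder such as ".." (an assumption, since the raw cells are not shown above):

```r
# Hypothetical raw column: two valid values and one missing-value placeholder.
raw_values <- c("57.044", "..", "72.708")

# Non-numeric strings become NA; the coercion warning is silenced for the demo.
converted <- suppressWarnings(as.numeric(raw_values))

is.na(converted)       # flags the placeholder entry
sum(is.na(converted))  # number of values lost in conversion
```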

Data Cleaning and Tidying (Data cont.)

Missing value detection and correction

colSums(is.na(LifeExp))
## CountryNames  CountryCode         MALE       FEMALE 
##            0            0           22           22

There are 22 missing values in each of the MALE and FEMALE columns.

length(LifeExp$MALE)
## [1] 266
length(LifeExp$FEMALE)
## [1] 266
LifeExp<-LifeExp[complete.cases(LifeExp),]
length(LifeExp$MALE)
## [1] 244
length(LifeExp$FEMALE)
## [1] 244

Rows containing NA values are removed, leaving 244 complete observations.

Tidying data

LifeExp<-LifeExp%>%pivot_longer(names_to = "Gender",  values_to = "Life Expectancy", cols = 3:4 )
head(LifeExp)

Let's convert the Gender variable to a factor:

LifeExp$Gender<-as.factor(LifeExp$Gender)
str(LifeExp)
## tibble [488 × 4] (S3: tbl_df/tbl/data.frame)
##  $ CountryNames   : chr [1:488] "Afghanistan" "Afghanistan" "Africa Eastern and Southern" "Africa Eastern and Southern" ...
##  $ CountryCode    : chr [1:488] "AFG" "AFG" "AFE" "AFE" ...
##  $ Gender         : Factor w/ 2 levels "FEMALE","MALE": 2 1 2 1 2 1 2 1 2 1 ...
##  $ Life Expectancy: num [1:488] 57 59.6 52.2 55.8 50.3 ...

Descriptive Statistics and Visualisation

Summary statistics of Life Expectancy variable

# Life Expectancy summary by gender, rendered as a table
LifeExp %>%
  group_by(Gender) %>%
  summarise(Min = min(`Life Expectancy`, na.rm = TRUE),
            Q1 = quantile(`Life Expectancy`, probs = .25, na.rm = TRUE),
            Median = median(`Life Expectancy`, na.rm = TRUE),
            Q3 = quantile(`Life Expectancy`, probs = .75, na.rm = TRUE),
            Max = max(`Life Expectancy`, na.rm = TRUE),
            Mean = mean(`Life Expectancy`, na.rm = TRUE),
            SD = sd(`Life Expectancy`, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(`Life Expectancy`))) %>%
  knitr::kable()

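The sample suggests that females have a higher mean life expectancy at birth than males across the countries and regions in the data (70.72 versus 65.90 years, as reported in the sample estimates of the t-test output).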

Outlier detection

LifeExp %>% boxplot(`Life Expectancy` ~ Gender, data = ., ylab = "Life Expectancy at birth")
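
The boxplot flags as outliers any points lying more than 1.5 times the interquartile range beyond the hinges. A minimal sketch of that rule on hypothetical values (not the life expectancy data), using boxplot.stats(), the base R helper that boxplot() relies on:

```r
# Hypothetical values with one unusually low observation.
x <- c(20, 58, 61, 63, 66, 68, 70, 72, 74, 77, 79)

stats <- boxplot.stats(x, coef = 1.5)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
outliers <- stats$out  # points beyond 1.5 * IQR from the hinges
outliers
```

Applying the same call to each gender's values would list the flagged observations instead of reading them off the plot.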

We can determine whether the difference in mean life expectancy between females and males is statistically significant using a two-sample t-test. Before running the test, we check its assumptions and then state the hypotheses formally.

Hypothesis Testing

Let's test the two assumptions of the two-sample t-test, normality and homogeneity of variance, on the Life Expectancy variable for males and females.

Testing the Assumption of Normality:

# Normal Q-Q plot for the male sample
Life_Expectancy_male <- LifeExp %>% filter(Gender == "MALE")
Life_Expectancy_male$`Life Expectancy` %>% qqPlot(dist = "norm")

## [1] 127  67
# Normal Q-Q plot for the female sample
Life_Expectancy_female <- LifeExp %>% filter(Gender == "FEMALE")
Life_Expectancy_female$`Life Expectancy` %>% qqPlot(dist = "norm")

## [1]  67 244

We notice that some of the data points fall outside the blue lines for both the male and female samples, indicating that the distributions are not normal. However, from the summary statistics, the sample size is 244 for each group. By the Central Limit Theorem (CLT), when the sample size is large (i.e. n > 30) the sampling distribution of the mean is approximately normal, regardless of the underlying population distribution. Thus, the normality condition for the two-sample t-test can be treated as satisfied.
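
The CLT argument can be illustrated with a small simulation. This is a sketch on simulated data, not the life expectancy data: it assumes a strongly skewed population (exponential with mean 1, chosen only for illustration) and shows that means of samples of size 244 still cluster around the population mean with spread close to \(\sigma/\sqrt{n}\).

```r
set.seed(1324)  # arbitrary seed, for reproducibility only

n <- 244        # same group size as in the analysis
# Draw 2000 samples from a skewed population and record each sample mean.
sample_means <- replicate(2000, mean(rexp(n, rate = 1)))

mean(sample_means)  # close to the population mean, 1
sd(sample_means)    # close to 1 / sqrt(244)
1 / sqrt(n)

# A histogram of the sample means looks roughly bell-shaped.
hist(sample_means, main = "Sampling distribution of the mean (simulated)")
```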

Hypothesis Testing Cont.

Testing Homogeneity of Variance:

We will use Levene’s test to test Homogeneity of variance, or the assumption of equal variance. The Levene’s test has the following statistical hypotheses:

\[H_0: \sigma_1^2 = \sigma_2^2 \]

\[H_A: \sigma_1^2 \ne \sigma_2^2\]

where \(\sigma_1^2\) and \(\sigma_2^2\) refer to the population variance of female and male life expectancies, respectively. Levene's test reports a p-value that is compared to the standard 0.05 significance level (\(\alpha\)). We can use the leveneTest() function in R to compare the variances of male and female life expectancies:

# Homogeneity of variance
leveneTest(`Life Expectancy`~Gender, data = LifeExp)

Levene’s Test Result -

The \(p\)-value for the Levene’s test of equal variance for Life expectancy between males and females was \(p\) = 0.3199. Since \(p\) > 0.05, we fail to reject \(H_0\) (null hypothesis). In plain language, we are safe to assume equal variance. The assumption of equal variance is important because it will determine the type of two-sample \(t\)-test we will perform.
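
To make the test less of a black box, here is a sketch of what it computes, on hypothetical values rather than the assignment data. With its default setting (deviations taken from the group median), leveneTest() amounts to a one-way ANOVA on the absolute deviations of each observation from its group median, which base R can reproduce:

```r
# Hypothetical life expectancies for two small groups.
values <- c(70.1, 75.3, 68.4, 80.2, 77.5, 72.0,
            66.0, 70.2, 63.9, 74.8, 71.1, 60.5)
group  <- factor(rep(c("FEMALE", "MALE"), each = 6))

# Absolute deviation of each value from its own group median.
group_median <- ave(values, group, FUN = median)
abs_dev <- abs(values - group_median)

# One-way ANOVA on the deviations: the F test is the Levene statistic.
levene_fit <- anova(lm(abs_dev ~ group))
levene_F <- levene_fit[["F value"]][1]
levene_p <- levene_fit[["Pr(>F)"]][1]
c(F = levene_F, p = levene_p)
```

A large p-value here, as in the result above, means the spread around the median is similar in the two groups.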

With the assumptions of equal variance and normality addressed, we can now perform the \(t\)-test on the life expectancy at birth variable for the male and female populations.

Hypothesis Testing Cont.

We perform a two-sided hypothesis test because the hypotheses are non-directional: \(\mu_1 - \mu_2 = 0\) versus \(\mu_1 - \mu_2 \ne 0\), rather than a directional alternative such as \(\mu_1 - \mu_2 < 0\) or \(\mu_1 - \mu_2 > 0\). We use the t.test() function in R.

The two-sample \(t\)-test has the following statistical hypotheses: \[H_0: \mu_1 - \mu_2 = 0\] \[H_A: \mu_1 - \mu_2 \ne 0\] where,

\(H_0\) (null hypothesis) states that the difference between the two independent population means, that is, mean female life expectancy \(μ_1\) and mean male life expectancy \(μ_2\), is 0

and,

\(H_A\) (alternate hypothesis) states that the difference between the two independent population means, that is, mean female life expectancy \(μ_1\) and mean male life expectancy \(μ_2\), is not 0.

In other words, the null hypothesis is that males and females have equal mean life expectancies, and the alternate hypothesis is that their mean life expectancies differ.

Now, let's run the \(t\)-test:

t.test(
  `Life Expectancy`~Gender,
  data = LifeExp,
  var.equal = TRUE,
  alternative = "two.sided"
)
## 
##  Two Sample t-test
## 
## data:  Life Expectancy by Gender
## t = 5.535, df = 486, p-value = 5.099e-08
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
##  3.111621 6.536644
## sample estimates:
## mean in group FEMALE   mean in group MALE 
##             70.72492             65.90078

We have used the var.equal = TRUE option to perform the two-sample t-test with equal variance assumed, and the alternative = "two.sided" option to specify a two-tailed test.
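
With var.equal = TRUE, the test statistic is built from a pooled variance: \(s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\) and \(t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}\). The sketch below checks this by hand against t.test() on hypothetical values (not the assignment data):

```r
# Hypothetical life expectancies for two small groups.
female <- c(70.1, 75.3, 68.4, 80.2, 77.5)
male   <- c(66.0, 70.2, 63.9, 74.8, 71.1)

n1 <- length(female)
n2 <- length(male)

# Pooled variance and the standard error of the difference in means.
sp2 <- ((n1 - 1) * var(female) + (n2 - 1) * var(male)) / (n1 + n2 - 2)
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))

t_manual  <- (mean(female) - mean(male)) / se
t_builtin <- unname(t.test(female, male, var.equal = TRUE)$statistic)

all.equal(t_manual, t_builtin)  # the two computations agree
```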

\(T\)-Test Results -

The difference between the female and male means estimated from the sample was 70.72492 - 65.90078 = 4.82414 years.

The test statistic \(t\) = 5.535

The t-statistic is compared to a two-tailed critical value \(t*\) with degrees of freedom \[ df = n_1 + n_2 - 2 \] For a two-tailed test, the rejection region is split between the upper and lower tails of the sampling distribution under \(H_0\). To maintain an overall significance level of \(α\) = 0.05, each tail receives \(α/2\) = 0.05/2 = 0.025. We find the corresponding \(t\)-critical value using qt() in R:

qt(p = 0.975, df = 244 + 244 - 2)
## [1] 1.964857

Thus, \(t*\) is 1.964857.

Reading the t-test result by using the critical value:

As the test statistic from the two-sample \(t\)-test assuming equal variance, \(t\) = 5.535, was more extreme than the critical value 1.964857, we reject \(H_0\) (null hypothesis). Thus, according to the critical value method, there was a statistically significant difference between male and female life expectancy means.

Reading the t-test result by using the \(p\) - value:

The \(p\)-value of the two-sample \(t\)-test is the probability of observing a sample difference between the means of 4.82414 (the difference of means from the \(t\)-test result), or one more extreme, assuming the difference is 0 in the population (i.e. \(H_0\) is true). The two-tailed \(p\)-value was reported to be \(p\) = 5.099e-08. According to the \(p\)-value method, as \(p\) = 5.099e-08 < \(α\) (0.05), we reject \(H_0\). Thus, according to the \(p\)-value method, there was a statistically significant difference between the means.
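
Both decision rules can be reproduced from the reported test statistic and degrees of freedom. A short sketch, using the rounded \(t\) = 5.535 from the output above (so the recomputed \(p\)-value may differ slightly from the reported one in the last digits):

```r
t_stat <- 5.535           # test statistic reported by t.test()
df     <- 244 + 244 - 2   # 486 degrees of freedom

# Critical value method: compare |t| with the upper 2.5% point.
t_crit <- qt(p = 0.975, df = df)
abs(t_stat) > t_crit

# p-value method: area in both tails beyond |t| under H0.
p_value <- 2 * pt(-abs(t_stat), df = df)
p_value
p_value < 0.05
```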

Reading the t-test result by using the Confidence Interval:

The 95% CI of the difference between the means (4.82414) is reported as 95% CI [3.111621, 6.536644] (from the \(t\)-test). As this interval does not capture the value stated by \(H_0\) (a mean difference of 0), we reject \(H_0\). Once again, according to the confidence interval method, there was a statistically significant difference between the means.
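
The interval has the form (difference in means) ± \(t*\) × (standard error). As a rough check, the standard error can be recovered from the reported values as difference / \(t\); because the reported \(t\) is rounded, the sketch below reproduces the interval only approximately:

```r
diff_means <- 70.72492 - 65.90078   # female mean minus male mean
t_stat     <- 5.535                 # reported test statistic (rounded)

# Standard error implied by the reported difference and t.
se <- diff_means / t_stat

# 95% CI: estimate plus or minus the critical value times the standard error.
t_crit <- qt(p = 0.975, df = 486)
ci <- diff_means + c(-1, 1) * t_crit * se
round(ci, 3)
```

Both limits are positive, which matches the conclusion that 0 lies outside the interval.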

Discussion

A two-sample \(t\)-test was used to test for a significant difference between the mean life expectancy at birth of males and females. While the life expectancy for males and females exhibited evidence of non-normality upon inspection of the normal Q-Q plots, the central limit theorem allowed the t-test to be applied because of the large sample size (244) in each group. Levene's test of homogeneity of variance indicated that equal variance could be assumed. The two-sample t-test assuming equal variance found a statistically significant difference between the mean life expectancy at birth of males and females, \(t\) (\(df\)=486) = 5.535, \(p\) = 5.099e-08, 95% CI for the difference in means [3.111621, 6.536644]. The results of the investigation suggest that females have a significantly higher life expectancy at birth than males. Thus, gender does play a role in defining average life expectancies at birth.

However, there are limitations associated with our investigation. The dataset had missing values for some of the countries and/or geographical locations. Thus, it is not a good representation of the entire world population.

References

University of Oxford (2021) What is AI? Here’s everything you need to know about artificial intelligence, ourworldindata.org website, accessed 15 October 2021. https://ourworldindata.org/grapher/life-expectancy-of-women-vs-life-expectancy-of-women

The World Bank, DataBank | Health Nutrition and Population Statistics, accessed 10 October 2021. https://databank.worldbank.org/source/health-nutrition-and-population-statistics#