Assignment 3

Male Vs Female population in Urban Area

SUYASHA GUPTA (3796079), SHIVAM SETHI (3810464) AND RISHABH JAIN (3806429)

Introduction

(1.) We observe that the male population has dominated throughout history.

(2.) We believe that gender inequality and stereotyping have decreased over time, and that this should be visible numerically in the population of each gender.

(3.) We want to see whether the trend has really changed in the urban areas of the world, where the standard of living has increased, equality in employment is promoted and, most importantly, general awareness is spread through better education.

Problem Statement

The female population has long been held back by outdated customs and cultural norms. We want to investigate whether the overall female share of the urban population of the world is higher than the male share. Is there a statistically significant difference between the urban population of males and females? We convert this question into two hypotheses:

  1. A hypothesis that represents no difference between the urban population of females and males and is called H0.

  2. An alternative hypothesis stating that there is a significant difference between the urban population of females and males and is called Ha.

Data

(1.) We work with two separate datasets, urban_females and urban_males, sourced from https://databank.worldbank.org/reports.aspx?source=283&series=SP.URB.TOTL.FE.ZS and

https://databank.worldbank.org/reports.aspx?source=283&series=SP.URB.TOTL.MA.ZS.

(2.) The urban_females dataset gives the female urban population of 263 countries and unions for various years from 1990 to 2018. The urban_males dataset gives the male urban population of the same 263 countries and unions for the same years.

(3.)The two datasets were joined to work with the above stated problem and to reach an adequate conclusion.

The data comprise a series name and code identifying the category (female or male), and a country name and code identifying the countries and regions. The final two variables give the year and the urban female/male population as a percentage of the total population for that year.
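The join and reshape described above can be sketched in base R. This is a minimal illustration on made-up numbers, not the code used for the report: the toy country names, the two year columns and the object name urban_pop_toy are assumptions, and the real World Bank export has more columns and years.

```r
# Toy stand-ins for the two World Bank exports (all values are made up).
urban_males <- data.frame(
  `Series Name` = "male",
  `Country Name` = c("A", "B"),
  `1990` = c(20.1, 30.2),
  `2018` = c(25.3, 35.4),
  check.names = FALSE
)
urban_females <- data.frame(
  `Series Name` = "female",
  `Country Name` = c("A", "B"),
  `1990` = c(20.5, 30.0),
  `2018` = c(26.0, 35.1),
  check.names = FALSE
)

# Stack the two tables; they share the same columns.
stacked <- rbind(urban_males, urban_females)

# Reshape from one column per year to one row per series, country and year.
year_cols <- c("1990", "2018")
urban_pop_toy <- do.call(rbind, lapply(year_cols, function(y) {
  data.frame(
    `Series Name` = stacked$`Series Name`,
    `Country Name` = stacked$`Country Name`,
    year = as.integer(y),
    `popultaion percentage` = stacked[[y]],  # column name spelled as in the report
    check.names = FALSE
  )
}))
```

The result has one row per series, country and year, which is the shape the later boxplot and t-test formulas expect.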

Descriptive Statistics and Visualisation

(1.) We scanned the data for missing values, inconsistencies, outliers and obvious errors, and counted the missing values in every variable.

(2.) In the columns with missing values, we replaced each missing value with the mean of the respective variable.

(3.) We plotted a boxplot to search for outliers and found the following outlier values: 62.13389, 62.73187, 58.72326, 74.70017, 59.50012, 58.73326, 75.87234, 59.63667.

(4.) All the outliers are in the data for males. We do not remove them, as they carry information that is relevant to our investigation and outcomes.
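The two cleaning steps above (mean imputation and the boxplot outlier rule) can be sketched as follows. This is a minimal base-R illustration on made-up vectors; the names pct and x are assumptions, not objects from the report.

```r
# Hypothetical column with gaps (values are made up).
pct <- c(10, NA, 30, NA, 50)

# Count the missing entries before imputing.
n_missing <- sum(is.na(pct))

# Replace each NA with the mean of the observed values.
pct_imputed <- ifelse(is.na(pct), mean(pct, na.rm = TRUE), pct)

# Made-up sample in which 100 lies far above the rest.
x <- c(10, 12, 14, 15, 16, 18, 20, 100)

# boxplot.stats() flags points more than 1.5 times the hinge spread
# beyond the hinges; boxplot() stores the same points in its $out element.
flagged <- boxplot.stats(x)$out
```

Mean imputation leaves the column mean unchanged but shrinks its spread, which is worth keeping in mind when reading the standard deviations reported later.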

Descriptive Statistics and Visualisation Cont.

outliers<-boxplot(`popultaion percentage`~`Series Name`, data= urban_pop)

Descriptive Statistics and Visualisation Cont.

outliers
## $stats
##           [,1]      [,2]
## [1,]  2.952183  2.467082
## [2,] 18.025466 17.776091
## [3,] 26.625049 26.625049
## [4,] 34.231366 35.516384
## [5,] 57.817419 53.185671
## 
## $n
## [1] 1052 1052
## 
## $conf
##         [,1]     [,2]
## [1,] 25.8356 25.76086
## [2,] 27.4145 27.48924
## 
## $out
## [1] 62.13389 62.73187 58.72326 74.70017 59.50012 58.73326 75.87234 59.63667
## 
## $group
## [1] 1 1 1 1 1 1 1 1
## 
## $names
## [1] "male"   "female"

Descriptive Statistics Cont.

(1.) We computed a summary of the data grouped by series name, which gives separate summaries for females and males.

(2.) The minimum and maximum values are 2.47 and 53.19 for females versus 2.95 and 75.87 for males. The maxima differ markedly.

(3.) However, the mean values for females and males are 26.67 and 26.58 respectively, so the mean urban population of females is slightly higher than that of males.

(4.) The median is the same for both groups.

urban_pop %>%
  group_by(`Series Name`) %>%
  summarise(
    Min = min(`popultaion percentage`, na.rm = TRUE),
    Q1 = quantile(`popultaion percentage`, probs = .25, na.rm = TRUE),
    Median = median(`popultaion percentage`, na.rm = TRUE),
    Q3 = quantile(`popultaion percentage`, probs = .75, na.rm = TRUE),
    Max = max(`popultaion percentage`, na.rm = TRUE),
    Mean = mean(`popultaion percentage`, na.rm = TRUE),
    SD = sd(`popultaion percentage`, na.rm = TRUE),
    n = n(),
    Missing = sum(is.na(`popultaion percentage`))
  ) -> series
knitr::kable(series)
Series Name  Min       Q1        Median    Q3        Max       Mean     SD        n     Missing
male         2.952183  18.04061  26.62505  34.23122  75.87234  26.5793  11.12257  1052  0
female       2.467083  17.78563  26.62505  35.51000  53.18567  26.6708  11.01545  1052  0

Hypothesis Testing

We will now run a two-sample t-test on this dataset to test for a statistically significant difference between the urban population of females and males. We treat the data as randomly collected and the two groups as independent samples.

Before running the hypothesis test, we need to check two assumptions:

  1. population data are normally distributed

  2. population homogeneity of variance.

Normal Distribution

To check for normality, we draw a Q-Q plot for each population. Some values fall outside the 95% normality bands in both populations. However, because both samples are large (n > 30), the Central Limit Theorem implies that the sampling distribution of the mean is approximately normal, so we can proceed.

Homogeneity of Variance

Next, we check homogeneity of variance using Levene’s test to compare the variances of the male and female urban population percentages. The hypotheses for Levene’s test are:

H0 : (σ1)^2 = (σ2)^2

HA : (σ1)^2 ≠ (σ2)^2

According to the test, Pr(>F) = 0.5445, which is > 0.05. So we fail to reject H0, which lets us assume equal variances.

urban_male <- urban_pop %>% filter(`Series Name` == "male")
urban_male$`popultaion percentage` %>% qqPlot(dist="norm")

## [1] 995 732

Homogeneity of Variance Cont.

Q-Q Plot

urban_female <- urban_pop %>% filter(`Series Name` == "female")
urban_female$`popultaion percentage` %>% qqPlot(dist="norm")

## [1] 922 659

Levene's Test

leveneTest(`popultaion percentage`~`Series Name`, data= urban_pop)
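The idea behind Levene's test can be sketched in base R as a one-way ANOVA on absolute deviations from each group's median. This assumes the median-centred variant, which is the default in car::leveneTest(); the data below are simulated for illustration and the names toy, abs_dev and levene_p are assumptions.

```r
set.seed(1)  # simulated data, for illustration only
toy <- data.frame(
  group = rep(c("male", "female"), each = 50),
  value = c(rnorm(50, mean = 26, sd = 11), rnorm(50, mean = 26, sd = 11))
)

# Absolute deviation of each value from its own group's median.
toy$abs_dev <- abs(toy$value - ave(toy$value, toy$group, FUN = median))

# One-way ANOVA on those deviations; the F test's p-value plays the role
# of the Pr(>F) value reported by Levene's test.
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_p <- levene_fit[["Pr(>F)"]][1]

# Decision rule used in the text: keep H0 (equal variances) when p > 0.05.
assume_equal_var <- levene_p > 0.05
```

If the spreads of the two groups were very different, the absolute deviations would differ in mean and the ANOVA would return a small p-value.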

Two-Sample t-test

t.test(`popultaion percentage`~`Series Name`, data= urban_pop,
  var.equal = TRUE,
  alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  popultaion percentage by Series Name
## t = -0.18959, df = 2102, p-value = 0.8496
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.038002  0.854992
## sample estimates:
##   mean in group male mean in group female 
##              26.5793              26.6708

Two Sample t-test

We ran a two-sample t-test to identify any statistically significant difference between the urban population of females and males. Confidence level: 95%. Significance level: α = 0.05.

The following assumptions were checked:
(1) We compare two independent population means with unknown population variances.
(2) Both groups are large (n > 30), so, by the Central Limit Theorem, the data need not be exactly normal.
(3) The homogeneity of variance assumption is not violated.

Hypotheses for the two-sample t-test:
\[H_0: \mu_1 = \mu_2\]
\[H_A: \mu_1 \ne \mu_2\]
where μ1 and μ2 are the mean urban population percentages of males and females respectively (the group order R uses in its output). The null hypothesis is that the difference between the two population means is 0. The difference estimated from the sample is 26.5793 - 26.6708 = -0.0915. We will reject H0 if the p-value < 0.05 or if the 95% CI of the mean difference does not capture H0: μ1 − μ2 = 0; otherwise, we fail to reject H0.

Interpretation of the t-test:

A two-sample t-test was used to test for a significant difference between the urban population of females and males. By the p-value method, as p = 0.8496 > α = 0.05, we fail to reject H0: there is no statistically significant difference between the male and female urban population.

By the confidence interval method, the estimated difference between means (male minus female, the order R uses) is 26.5793 - 26.6708 = -0.0915, with 95% CI [-1.038002, 0.854992]. This interval captures H0: μ1 − μ2 = 0, so we again fail to reject H0. There is no statistically significant difference between the means.

The sample mean for females (26.6708) is only slightly higher than that for males (26.5793), and the gap is well within sampling error. So, according to the t-test, females and males have almost equal population percentages.
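As a cross-check, the reported t statistic, p-value and confidence interval can be recomputed from the summary table alone using the pooled-variance formulas. This is a sketch with values copied from the table; the variable names are assumptions, and small discrepancies are expected because the table values are rounded.

```r
# Summary statistics copied from the descriptive table.
mean_m <- 26.5793; sd_m <- 11.12257; n_m <- 1052
mean_f <- 26.6708; sd_f <- 11.01545; n_f <- 1052

# Pooled variance and standard error of the difference in means.
df <- n_m + n_f - 2
pooled_var <- ((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / df
se_diff <- sqrt(pooled_var * (1 / n_m + 1 / n_f))

# Male minus female, matching the group order in the t.test() output.
diff_means <- mean_m - mean_f
t_stat <- diff_means / se_diff

# Two-sided p-value and 95% confidence interval for the difference.
p_value <- 2 * pt(-abs(t_stat), df)
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se_diff
```

The values of t_stat, p_value and ci should agree with the t.test() output (t = -0.18959, p = 0.8496, CI from -1.038002 to 0.854992) up to rounding.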

Discussion

Findings:

The two-sample t-test assuming equal variances did not find a statistically significant difference between the mean urban population of females and males, t(df=2102) = -0.18959, p = 0.8496, 95% CI for the difference in means [-1.038002, 0.854992]. According to the t-test, females and males have almost equal population percentages.

Strength & Limitation/ Directions for future investigation:

We are limited to the urban population only, and it is unclear what is counted as an urban area in most regions, so this result should not be generalised. We can only draw inferences about populations similar to the areas represented in the sample. That is, the female and male populations we have analysed are relevant only to those regions and countries, not to all.

Conclusion:

We conclude that the mean urban populations of females and males do not differ significantly, and that the sample mean for females in urban regions is only slightly higher. This suggests a change from the male dominance of past centuries and millennia: in urban areas, we have reached a point of near equality.

References

The datasets were obtained from www.databank.worldbank.org, at the following addresses:

https://databank.worldbank.org/reports.aspx?source=283&series=SP.URB.TOTL.FE.ZS and

https://databank.worldbank.org/reports.aspx?source=283&series=SP.URB.TOTL.MA.ZS.