AYUSH RANJAN, S3802541
Last updated: 27 October, 2019
The Human Development Report 2015 contains many datasets with different parameters (gender development, poverty, human development index) which summarize measures of achievement in key dimensions of human development in different countries. For the purpose of this particular analysis, we use only a subset of the provided information.
Selected Datasets: 1- Gender Development 2- Gender Inequalities
Introduction Cont.
The focus of the analysis is to compare the two gender groups (M and F) on two parameters available in the datasets: "Life Expectancy" and "Population with secondary education".
An investigation on the relationship between “Life Expectancy” and “Population with secondary education” will be conducted.
The data is taken from the Human Development Report 2015. Two datasets, gender development and gender inequality index, are used to answer the following questions:
-Does life expectancy at birth differ between the gender groups?
-Is there statistical evidence of a difference between the male and female population with secondary education?
-Is it possible to predict the life expectancy of a country if we know the percentage of its population with secondary education?
The data used for this report is open source and available at:- https://www.kaggle.com/undp/human-development
Datasets imported:- 1-gender_development.csv 2-gender_inequality.csv
How measures are calculated
Shows which factors are taken into account while calculating GDI and GII in the data to be imported for analysis.
Minimum and maximum values (goalposts) are set in order to transform the indicators expressed in different units into indices between 0 and 1.
Dimension index = (actual value - minimum value)/(maximum value – minimum value)
example:-
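As a minimal sketch of the formula above in R: the function name `dimension_index` and the goalposts of 20 and 85 years are illustrative assumptions, not values taken from the imported datasets.

```r
# Dimension index: rescale an indicator to the 0-1 range using goalposts.
# The function name and the goalposts used below are illustrative assumptions.
dimension_index <- function(actual, minimum, maximum) {
  (actual - minimum) / (maximum - minimum)
}

# Hypothetical example: life expectancy of 72 years, goalposts 20 and 85.
life_index <- dimension_index(72, minimum = 20, maximum = 85)
print(life_index)  # (72 - 20) / (85 - 20) = 0.8
```

An indicator equal to the minimum goalpost maps to 0 and one equal to the maximum maps to 1, so indicators in different units become comparable.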
Description of Columns
1-gender_development.csv
GDI Rank, Country, Gender Development Index (GDI), Human Development Index (Female), Human Development Index (Male), Life Expectancy at Birth (Female), Life Expectancy at Birth (Male), Expected Years of Education (Female), Expected Years of Education (Male), Mean Years of Education (Female), Mean Years of Education (Male), Estimated Gross National Income per Capita (Female), Estimated Gross National Income per Capita (Male)
2-gender_inequality.csv
GII Rank, Country, Gender Inequality Index (GII), Maternal Mortality Ratio, Adolescent Birth Rate, Percent Representation in Parliament, Population with Secondary Education (Female), Population with Secondary Education (Male), Labour Force Participation Rate (Female), Labour Force Participation Rate (Male)
The two datasets are joined on the Country column.
## [1] 195 5
## Observations: 195
## Variables: 5
## $ Country <chr> "Afghanistan", "A…
## $ `Life Expectancy at Birth (Female)` <chr> "61.6", "80.4", "…
## $ `Life Expectancy at Birth (Male)` <chr> "59.2", "75.4", "…
## $ `Population with Secondary Education (Female)` <chr> "5.9", "81.8", "2…
## $ `Population with Secondary Education (Male)` <chr> "29.8", "87.9", "…
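A join of this kind can be sketched in base R with `merge`. The toy data frames `gd` and `gi` and their values are hypothetical stand-ins for the two CSV files, and the choice of a full join (`all = TRUE`) is an assumption, since the original join call is not shown.

```r
# Toy stand-ins for the two imported CSV files. The values are made up;
# only the column names follow the column descriptions in this report.
gd <- data.frame(
  Country = c("A", "B", "C"),
  `Life Expectancy at Birth (Female)` = c("70.1", "81.0", "65.3"),
  `Life Expectancy at Birth (Male)`   = c("66.4", "76.2", "61.9"),
  check.names = FALSE, stringsAsFactors = FALSE
)
gi <- data.frame(
  Country = c("A", "B", "D"),
  `Population with Secondary Education (Female)` = c("40.2", "90.5", "12.0"),
  `Population with Secondary Education (Male)`   = c("55.0", "92.1", "30.4"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Join on Country. all = TRUE keeps countries found in either file
# (a full join); this choice is an assumption.
m_data <- merge(gd, gi, by = "Country", all = TRUE)
dim(m_data)
```

Countries present in only one file get `NA` in the columns from the other file, which is one way missing values can enter the joined table.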
The data types are not appropriate for analysis: every measure is stored as character (<chr>) rather than numeric. The data also does not comply with the tidy format, because the male and female values of each attribute sit in separate columns.
To achieve a tidy format, the data is reshaped to long form, and a column recording the gender of each observation is introduced.
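library(tidyr)
# Gather the two life expectancy columns (positions 2 and 3) into a key/value pair.
tidy_data <- gather(m_data, key = Life_Expetency, value = value, -c(1, 4, 5))
# After the first gather the education columns sit at positions 2 and 3,
# so the same exclusion gathers them. Each country now has 4 rows (2 x 2).
tidy_data <- gather(tidy_data, key = PWSE, value = PWSEVALUE, -c(1, 4, 5))
# Recode the long column names to the gender flags "F" and "M".
tidy_data$Life_Expetency[tidy_data$Life_Expetency == "Life Expectancy at Birth (Female)"] <- "F"
tidy_data$Life_Expetency[tidy_data$Life_Expetency == "Life Expectancy at Birth (Male)"] <- "M"
tidy_data$PWSE[tidy_data$PWSE == "Population with Secondary Education (Male)"] <- "M"
tidy_data$PWSE[tidy_data$PWSE == "Population with Secondary Education (Female)"] <- "F"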
head(tidy_data)
-Life_Expetency = gender flag for the life expectancy value in the value column for a particular country.
-PWSE = gender flag for the population with secondary education value in the PWSEVALUE column for a particular country.
For analysis 1:
-value = life expectancy at birth in a country, in years (numeric)
-Life_Expetency = gender of the corresponding life expectancy value (factor)
## [1] "F" "M"
For analysis 2:
-PWSEVALUE = population with secondary education in a country, in percent (numeric)
-PWSE = gender of the corresponding population with secondary education value (factor)
## [1] "F" "M"
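The numeric and factor types listed above can be obtained with `as.numeric` and `factor`. This is a minimal sketch on a toy data frame; the data frame `toy`, its values, and the use of ".." as a missing-value marker are assumptions for illustration.

```r
# Toy long-format data; the values are hypothetical. ".." stands in for a
# missing entry (an assumption about how gaps are coded in the raw files).
toy <- data.frame(
  Life_Expetency = c("F", "M", "F", "M"),
  value = c("61.6", "59.2", "80.4", ".."),
  stringsAsFactors = FALSE
)

# Character -> numeric; entries that cannot be parsed become NA.
toy$value <- suppressWarnings(as.numeric(toy$value))
# Character -> factor with explicit levels.
toy$Life_Expetency <- factor(toy$Life_Expetency, levels = c("F", "M"))

levels(toy$Life_Expetency)
str(toy)
```

Entries that become `NA` here are the kind of rows that R later reports as deleted due to missingness when fitting a model.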
A conclusion cannot be reached by comparing means alone.
A boxplot comparison is provided to compare the spread of the groups.
Comparing the boxplots visually, we can see that the quartiles and spread differ between the groups. But is this difference significant?
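A grouped boxplot of this kind can be sketched with base R's `boxplot`. The data frame `sim` is simulated for illustration and is not the report's data; only the column names mirror `tidy_data`.

```r
set.seed(1)
# Simulated long-format data (not the report's data).
sim <- data.frame(
  Life_Expetency = factor(rep(c("F", "M"), each = 50)),
  value = c(rnorm(50, mean = 73, sd = 8), rnorm(50, mean = 69, sd = 8))
)

# Side-by-side boxplots of life expectancy by gender.
bp <- boxplot(value ~ Life_Expetency, data = sim,
              xlab = "Gender", ylab = "Life expectancy at birth (years)")

# Group means and standard deviations to read alongside the plot.
aggregate(value ~ Life_Expetency, data = sim,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))
```

The plot shows medians, quartiles and outliers per group, while the summary table gives the means that the t-test compares.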
What kind of distribution does each group follow?
The distribution of life expectancy values in the male group is investigated first.
# Distribution 1: male group
library(dplyr)
Life_ExpM <- tidy_data %>% filter(Life_Expetency == "M")
hist(Life_ExpM$value)  # heavier towards the tail
The histogram of the distribution is heavier towards the tail.
For a clearer comparison with a normal distribution, a Q-Q plot is used.
## [1] 34 229
## [1] 167 362
## [1] 28 223
## [1] 28 223
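It can be observed that points fall outside the expected range in the tails of the distribution. This suggests the tails are heavier than we would expect under a normal distribution.
We should therefore be cautious about assuming normality, for females as well as for males. Fortunately, the sample is large (the joined data covers 195 countries, well above 30 per group), so the Central Limit Theorem (CLT) lets us treat the sampling distribution of the mean as approximately normal.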
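A normal Q-Q plot can be sketched in base R with `qqnorm` and `qqline`. These are used here as a stand-in, since the original plotting call is not shown, and the vector `x` is simulated rather than taken from the report's data.

```r
set.seed(2)
# Simulated stand-in for Life_ExpM$value (not the report's data).
x <- rnorm(190, mean = 69, sd = 8)

# Sample quantiles against theoretical normal quantiles.
qq <- qqnorm(x, main = "Normal Q-Q plot (simulated data)")
# Reference line through the first and third quartiles.
qqline(x)
```

Points that bend away from the reference line at either end indicate tails that are heavier or lighter than those of a normal distribution.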
Homogeneity of Variance
A two-tailed two-sample t-test compares the group means and tells us whether the difference between them is statistically significant. Before running it, an assumption must be made about homogeneity of variance.
Homogeneity of variance, or the assumption of equal variance, is tested using Levene's test at the 0.05 significance level.
The p-value for Levene's test of equal variance for life expectancy in males and females was p=0.1519. Since p>.05, we fail to reject H0. In plain language, we are safe to assume equal variance.
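The idea behind Levene's test can be sketched in base R as a one-way ANOVA on absolute deviations from the group centre. The sketch uses the group median as the centre, which is an assumption about the variant (it is the default of `leveneTest` in the `car` package), and the data are simulated rather than the report's.

```r
set.seed(3)
# Simulated two-group data (not the report's data).
g <- factor(rep(c("F", "M"), each = 100))
y <- c(rnorm(100, mean = 73, sd = 8), rnorm(100, mean = 69, sd = 8))

# Absolute deviation of each observation from its group median.
z <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations; the F test for g is the Levene statistic.
levene <- anova(lm(z ~ g))
p_value <- levene[["Pr(>F)"]][1]
print(levene)

# Decision at the 0.05 significance level.
if (p_value > 0.05) {
  message("Fail to reject H0: equal variances can be assumed")
} else {
  message("Reject H0: variances differ")
}
```

If the groups have similar spread, their average absolute deviations are similar and the F test is not significant, which is the outcome reported above.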
##
## Two Sample t-test
##
## data: tidy_data$value by tidy_data$Life_Expetency
## t = 7.7727, df = 758, p-value = 2.505e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.537336 5.927927
## sample estimates:
## mean in group F mean in group M
## 73.29053 68.55789
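Output of this form comes from R's `t.test`. The sketch below runs a two-tailed, equal-variance test on simulated data; the data frame `sim` and its values are assumptions for illustration, so its numbers will not match the output above.

```r
set.seed(4)
# Simulated long-format data (not the report's data).
sim <- data.frame(
  Life_Expetency = factor(rep(c("F", "M"), each = 100)),
  value = c(rnorm(100, mean = 73, sd = 8), rnorm(100, mean = 69, sd = 8))
)

# Two-tailed two-sample t-test assuming equal variances.
res <- t.test(value ~ Life_Expetency, data = sim,
              var.equal = TRUE, alternative = "two.sided")
print(res)

# Setting var.equal = FALSE (the default) gives the Welch test instead.
res$p.value < 0.05
```

With `var.equal = TRUE` the degrees of freedom are n1 + n2 - 2; the Welch version adjusts them downward when the group variances differ.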
The null hypothesis for the test: males and females have the same mean life expectancy at birth.
\[H_0: \mu_1 = \mu_2 \]
\[H_A: \mu_1 \ne \mu_2\]
The p-value of the test is below 0.05 (p = 2.505e-14). Hence, we reject H0.
To conclude, the result is statistically significant: the mean life expectancy at birth differs between the male and female groups.
For the second t-test, the null hypothesis is that males and females have the same mean percentage of population with secondary education.
\[H_0: \mu_1 = \mu_2 \]
\[H_A: \mu_1 \ne \mu_2\]
##
## Welch Two Sample t-test
##
## data: tidy_data$PWSEVALUE by tidy_data$PWSE
## t = -2.4825, df = 668.17, p-value = 0.01329
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.822563 -1.146668
## sample estimates:
## mean in group F mean in group M
## 54.80473 60.28935
The p-value of the test is below 0.05 (p = 0.01329). Hence, we reject H0.
To conclude, the result is statistically significant: the mean percentage of population with secondary education differs between males and females.
Further Analysis
An attempt is made to look for a relationship between life expectancy (value) and population with secondary education (PWSEVALUE).
It may be possible to fit a linear regression model.
##
## Call:
## lm(formula = tidy_data$value ~ tidy_data$PWSEVALUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.092 -3.789 0.603 4.438 14.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.259489 0.556739 108.24 <2e-16 ***
## tidy_data$PWSEVALUE 0.198056 0.008619 22.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.432 on 666 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.4422, Adjusted R-squared: 0.4414
## F-statistic: 528 on 1 and 666 DF, p-value: < 2.2e-16
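The fitted line can be used for prediction by plugging a value into the estimated coefficients. In this sketch the helper `predict_life_expectancy` is hypothetical and the input of 5.9 percent is an assumption (it is the first female secondary-education value in the data preview); with that input the result agrees with the 61.42802 printed after the model summary.

```r
# Coefficients copied from the model summary above.
intercept <- 60.259489
slope <- 0.198056

# Hypothetical helper: predicted life expectancy (years) for a given
# percentage of the population with secondary education.
predict_life_expectancy <- function(pwse) {
  intercept + slope * pwse
}

# 5.9 is the first female secondary-education value in the data preview;
# using it here is an assumption.
pred <- predict_life_expectancy(5.9)
print(round(pred, 5))
```

The slope says that each additional percentage point of population with secondary education is associated with about 0.198 more years of predicted life expectancy.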
## [1] 61.42802
CONCLUSION
The findings from the analysis:-
-Females tend to have a higher life expectancy at birth than males.
-Males tend to have a higher percentage of population with secondary education than females.
As the data was collected from a reliable source, the findings can be considered accurate for this sample.
LIMITATIONS
-It is possible that the recorded sample data does not give a clear representation of the population.
FURTHER INVESTIGATION
A further investigation into how life expectancy is affected by the other available factors can be made. It is possible that life expectancy is not actually affected by the population with secondary education; secondary education may simply be strongly correlated with the factor that does affect it. A linear regression with multiple predictors would help reveal such relationships.