The chosen dataset describes the life expectancy of people in different countries. It consists of 22 columns and 2938 rows of data collected from the World Health Organization (WHO) and the United Nations Development Programme (UNDP) for the years 2000 to 2015, and it covers factors related to life expectancy such as infant mortality rate, adult mortality rate, GDP per capita, education, and healthcare expenditure. The data can also be used to predict life expectancy in different countries from the factors that influence it. Researchers can use the dataset to study the impact of healthcare and education on life expectancy and to suggest policy changes that can improve people’s quality of life.
The dataset is relevant because life expectancy is a crucial indicator of a country’s development and well-being. It can be used to explore trends in life expectancy over the years and, through visual analysis and data analytics, to identify the countries that need more attention in healthcare and education.
Variables: The dataset consists of 22 variables, including the outcome of interest (life expectancy).
Country: 193 distinct values, one for each of 193 countries around the world.
Year: The year of observation, from 2000 to 2015.
Status: Countries are classified as either developing or developed.
Life expectancy: The average life expectancy in each country (years).
Adult mortality: Adult mortality rates of both sexes (probability of dying between 15 and 60 years per 1000 population).
Infant deaths: Number of infant deaths per 1000 population.
Alcohol: Alcohol, recorded per capita (15+) consumption (in liters of pure alcohol).
Percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%).
Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%).
Measles: Measles - number of reported cases per 1000 population.
BMI: The average Body Mass Index of the entire population.
Under-five deaths: Number of under-five deaths per 1000 population.
Polio: The Polio (Pol3) vaccination rate for 1-year-old children (%).
Total expenditure: The percentage a country pays for health services as part of total government expenditure (%).
Diphtheria: The Diphtheria and Pertussis vaccination rate for 1-year-old children (%).
HIV/AIDS: Deaths from HIV/AIDS per 1000 live births.
GDP: The Gross Domestic Product per capita in a country (USD).
Population: The number of people living in a country (people).
Thinness 10-19 years: The percentage of thinness among children and adolescents for 10 - 19 year-olds (%).
Thinness 5-9 years: The percentage of thinness among children for 5 - 9 year-olds (%).
Income composition: Human Development Index in terms of income composition of resources (index ranging from 0 to 1).
Schooling: The number of years that people in each country go to school (years).
I. T-TEST TECHNIQUE AND HYPOTHESIS 1
The t-test is a statistical technique used to compare means between two groups and determine if the observed difference is statistically significant. We can assess the significance of the observed difference by calculating a t-value and comparing it to a critical value.
- Null Hypothesis (H₀): The average life expectancy of countries classified as developed is equal to that of countries classified as developing in 2015.
- Alternative Hypothesis (H₁): The average life expectancy of countries classified as developed is significantly different from that of countries classified as developing in 2015.
In this hypothesis, we examine whether life expectancy varies significantly with the country status variable in the year 2015. The t-test can determine whether the mean life expectancy differs significantly between the two categories of country status (developed and developing countries). Conducting the t-test yields a t-statistic and an associated p-value.
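The output below comes from R's Welch two-sample t-test. As a rough illustration of what that test computes, here is a minimal Python sketch using only the standard library. The function name `welch_t` and the sample values are hypothetical and are not taken from the dataset. The sketch returns the t statistic and the Welch-Satterthwaite degrees of freedom only, since a p-value additionally needs the t-distribution.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test (unequal variances)."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical life-expectancy values, NOT taken from the dataset
developed = [80.0, 82.0, 81.0, 79.0]
developing = [70.0, 68.0, 72.0, 66.0, 69.0]
t, df = welch_t(developed, developing)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The non-integer degrees of freedom in the R output (102.42) arise from the same approximation.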
Possible Results Interpretation:
## [1] 7.093327e-23
## === t-Test Results ===
##
## Welch Two Sample t-test
##
## data: Life by Status
## t = 12.753, df = 102.42, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Developed and group Developing is not equal to 0
## 95 percent confidence interval:
## 9.305573 12.733045
## sample estimates:
## mean in group Developed mean in group Developing
## 80.70937 69.69007
## The t-test is statistically significant (p < 0.05 ).
## There is evidence of a significant difference in life expectancy between Developed and Developing countries in 2015.
## The t-test is statistically significant (p < 0.05).
## There is evidence of a significant difference in life expectancy between Developed and Developing countries.
t-value: 12.753
Degrees of freedom (df): 102.42
p-value: < 2.2e-16
The 95% confidence interval: 9.305573 to 12.733045
From the output, the difference in sample means is 80.70937 - 69.69007 = 11.0193 years, and the 95% confidence interval for the true difference in means runs from 9.305573 to 12.733045. Because this interval excludes 0, the data are inconsistent with equal means at the 5% level. The p-value (< 2.2e-16) is much smaller than 0.05, providing strong evidence to reject the null hypothesis. There is a significant difference in average life expectancy between developing and developed countries in the year 2015.
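The reported figures can be cross-checked against each other. The following Python sketch uses only numbers copied from the t-test output above; the variable names are illustrative. It relies on two facts about a t-interval: it is centred on the observed difference, and its half-width equals the critical t value times the standard error.

```python
# Values copied from the Welch t-test output above
mean_developed, mean_developing = 80.70937, 69.69007
t_value = 12.753
ci_low, ci_high = 9.305573, 12.733045

diff = mean_developed - mean_developing            # observed difference in means
midpoint = (ci_low + ci_high) / 2                  # centre of the confidence interval
std_error = diff / t_value                         # because t = diff / standard error
t_critical = (ci_high - ci_low) / 2 / std_error    # multiplier implied by the half-width

print(f"difference = {diff:.4f}, CI midpoint = {midpoint:.4f}")
print(f"standard error = {std_error:.4f}, implied critical t = {t_critical:.3f}")
```

The midpoint matches the difference in means, and the implied multiplier is close to 2, which is what one would expect for a 95% interval with roughly 100 degrees of freedom.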
II. PEARSON CORRELATION COEFFICIENT TECHNIQUE AND HYPOTHESIS 2
The Pearson correlation coefficient is a statistical measure used to quantify the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, indicating the degree of correlation between the variables. This technique is used to analyze continuous data and identify patterns and dependencies.
- H₀ (Null Hypothesis): No significant linear relationship exists between GDP and Schooling in low-income countries.
- H₁ (Alternative Hypothesis): A significant linear relationship exists between GDP and Schooling for low-income countries.
The hypothesis examines the potential linear relationship between GDP (Gross Domestic Product) and Schooling in low-income countries.
The null hypothesis (H0) states that GDP and the level of schooling do not vary together linearly in low-income countries. The alternative hypothesis (H1) states that they do: differences in GDP are associated with systematic differences in the level of schooling. A correlation alone does not show that GDP causes changes in schooling, only that the two tend to move together.
By testing these two hypotheses, we aim to learn whether GDP and Schooling are related in low-income countries, and therefore whether economic conditions and educational opportunities go hand in hand.
Possible Results Interpretation
If the correlation coefficient is close to 0, it suggests a weak or no linear relationship between GDP and Schooling.
If the correlation coefficient is positive, it indicates a positive linear relationship, implying that Schooling also tends to increase as GDP increases.
If the correlation coefficient is negative, it indicates a negative linear relationship, implying that as GDP increases, Schooling tends to decrease.
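To make the interpretation rules above concrete, here is a minimal Python sketch that computes the Pearson coefficient directly from its definition, together with the t statistic commonly used to test it against zero. The function names and the (GDP, Schooling) pairs are hypothetical and are not taken from the dataset.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def t_statistic(r, n):
    """Test statistic for H0: no linear relationship (n - 2 degrees of freedom)."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Hypothetical (GDP, Schooling) pairs, NOT taken from the dataset
gdp = [400.0, 650.0, 900.0, 1200.0, 1500.0]
schooling = [6.0, 7.5, 7.0, 9.0, 8.5]
r = pearson_r(gdp, schooling)
print(f"r = {r:.3f}, t = {t_statistic(r, len(gdp)):.3f}")
```

Because the t statistic grows with the sample size, even a small coefficient can be statistically significant when many observations are available.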
## [1] "Pearson correlation coefficient: 0.108196033259806"
## [1] "p-value: 0.000666236091774794"
## [1] "The correlation is statistically significant."
## [1] "Reject the null hypothesis (H₀): There is no significant correlation."
## [1] "Accept the alternative hypothesis (H₁): There is a significant correlation."
P-value: The p-value from the correlation test is 0.000666236091774794, which is less than the significance level of 0.05. This indicates strong evidence against the null hypothesis (H0) that there is no significant linear relationship between GDP and Schooling in low-income countries.
Coefficient: The Pearson correlation coefficient between GDP and Schooling in low-income countries is 0.108196033259806. The correlation coefficient measures the strength and direction of the linear relationship. In this case, the small positive value indicates a weak positive linear relationship between GDP and Schooling in low-income countries. Squaring the coefficient gives r² ≈ 0.012, so the linear relationship accounts for only about 1.2% of the variance.
In conclusion, the statistical analysis supports the alternative hypothesis (H1) that a significant linear relationship exists between GDP and Schooling in low-income countries, although the relationship is weak.
III. LINEAR REGRESSION TECHNIQUE AND HYPOTHESIS 3
The Linear Regression Technique allows the assessment of the relationship between two variables. It focuses on predicting the value of a dependent variable based on an independent variable. We’re employing this technique to understand the relationship between BMI (Body Mass Index) and adult mortality rates in the top 5 wealthiest countries.
The process begins with data reading and cleaning, selecting necessary columns, and removing missing values. Next, we calculate GDP per capita and select the top 5 wealthiest countries based on this index.
We utilize the Linear Regression technique to examine the relationship between BMI and mortality rates. This model helps us assess the variation in mortality rates based on BMI. The model results provide coefficients, especially the p-value, aiding in evaluating the statistical significance of this relationship.
Finally, we use the model’s output to create a scatter plot, a simple yet effective visualization tool, to depict the relationship between BMI and mortality rates in the top 5 wealthiest countries. This helps us comprehend the correlation between these factors visually and straightforwardly.
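The workflow described above (clean, rank by GDP, select, fit) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are hypothetical, countries are ranked by their mean GDP (the report does not say how GDP is aggregated per country), and the helper `ols` is a plain least-squares fit written for this example rather than the R `lm()` call used in the report.

```python
from collections import defaultdict
from statistics import mean

def ols(x, y):
    """Least-squares slope and intercept for y ~ x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# Hypothetical rows (country, GDP, BMI, adult mortality); None marks a missing value
rows = [
    ("A", 900.0, 20.0, 150.0), ("A", 1100.0, 21.0, 140.0),
    ("B", 5000.0, 25.0, 90.0), ("B", 5200.0, None, 88.0),
    ("C", 300.0, 18.0, 200.0), ("C", 350.0, 18.5, 190.0),
]

# 1. Cleaning: drop rows with any missing value
clean = [r for r in rows if None not in r]

# 2. Rank countries by mean GDP (an assumption) and keep the top k (5 in the report, 2 here)
gdp_by_country = defaultdict(list)
for country, gdp, _, _ in clean:
    gdp_by_country[country].append(gdp)
top = sorted(gdp_by_country, key=lambda c: mean(gdp_by_country[c]), reverse=True)[:2]

# 3. Fit adult mortality ~ BMI on the selected countries
bmi = [r[2] for r in clean if r[0] in top]
mortality = [r[3] for r in clean if r[0] in top]
slope, intercept = ols(bmi, mortality)
print(top, f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```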
- Null Hypothesis (H₀): "There is no relationship between BMI and adult mortality rates in the top 5 wealthiest countries."
- Alternative Hypothesis (H₁): "There is a significant relationship between BMI and adult mortality rates in the top 5 wealthiest countries."
In this hypothesis, we investigate whether BMI and adult mortality rates are related in the top 5 wealthiest countries. The selected countries and the data used for the model are listed below.
| Country |
|---|
| Maldives |
| Georgia |
| Israel |
| Croatia |
| Tonga |
| Country | BMI (bmi) | Adult mortality (Adult) |
|---|---|---|
| Croatia | 63.7 | 95 |
| Croatia | 63.1 | 97 |
| Croatia | 62.5 | 97 |
| Croatia | 61.9 | 14 |
| Croatia | 61.3 | 14 |
| Croatia | 6.6 | 16 |
| Croatia | 6.0 | 19 |
| Croatia | 59.4 | 116 |
| Croatia | 58.7 | 114 |
| Croatia | 58.1 | 113 |
| Croatia | 57.5 | 116 |
| Croatia | 56.9 | 114 |
| Croatia | 56.3 | 122 |
| Croatia | 55.8 | 124 |
| Croatia | 55.2 | 126 |
| Croatia | 54.7 | 127 |
| Georgia | 56.2 | 129 |
| Georgia | 55.3 | 125 |
| Georgia | 54.4 | 128 |
| Georgia | 53.6 | 13 |
| Georgia | 52.8 | 127 |
| Georgia | 52.0 | 132 |
| Georgia | 51.3 | 133 |
| Georgia | 5.5 | 128 |
| Georgia | 49.9 | 12 |
| Georgia | 49.2 | 126 |
| Georgia | 48.6 | 128 |
| Georgia | 48.1 | 134 |
| Georgia | 47.5 | 132 |
| Georgia | 47.0 | 142 |
| Georgia | 46.5 | 121 |
| Georgia | 46.0 | 129 |
| Israel | 64.9 | 58 |
| Israel | 64.6 | 6 |
| Israel | 64.2 | 61 |
| Israel | 63.8 | 6 |
| Israel | 63.4 | 61 |
| Israel | 63.0 | 61 |
| Israel | 62.6 | 63 |
| Israel | 62.1 | 65 |
| Israel | 61.6 | 68 |
| Israel | 61.1 | 68 |
| Israel | 6.6 | 71 |
| Israel | 6.1 | 69 |
| Israel | 59.6 | 71 |
| Israel | 59.2 | 74 |
| Israel | 58.7 | 74 |
| Israel | 58.3 | 76 |
| Maldives | 27.4 | 61 |
| Maldives | 26.2 | 62 |
| Maldives | 25.1 | 64 |
| Maldives | 24.1 | 65 |
| Maldives | 23.1 | 67 |
| Maldives | 22.1 | 73 |
| Maldives | 21.2 | 75 |
| Maldives | 2.3 | 81 |
| Maldives | 19.5 | 82 |
| Maldives | 18.7 | 88 |
| Maldives | 18.0 | 93 |
| Maldives | 17.3 | 16 |
| Maldives | 16.7 | 112 |
| Maldives | 16.2 | 124 |
| Maldives | 15.6 | 129 |
| Maldives | 15.2 | 139 |
| Tonga | 75.2 | 133 |
| Tonga | 74.8 | 135 |
| Tonga | 74.3 | 137 |
| Tonga | 73.8 | 138 |
| Tonga | 73.3 | 14 |
| Tonga | 72.7 | 142 |
| Tonga | 72.1 | 147 |
| Tonga | 71.5 | 145 |
| Tonga | 7.8 | 146 |
| Tonga | 7.1 | 148 |
| Tonga | 69.4 | 15 |
| Tonga | 68.6 | 151 |
| Tonga | 67.8 | 153 |
| Tonga | 67.0 | 155 |
| Tonga | 66.2 | 157 |
| Tonga | 65.5 | 158 |
##
## Call:
## lm(formula = Adult ~ bmi, data = top_5_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.58 -28.20 15.49 35.13 62.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.2580 11.8600 7.020 7.21e-10 ***
## bmi 0.2526 0.2270 1.113 0.269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.29 on 78 degrees of freedom
## Multiple R-squared: 0.01563, Adjusted R-squared: 0.003006
## F-statistic: 1.238 on 1 and 78 DF, p-value: 0.2692
## [1] "p-value: 0.269230538606373"
## [1] "The relationship between BMI and Adult Mortality is not statistically significant."
## [1] "Fail to reject the null hypothesis (H₀): There is no relationship between BMI and adult mortality rates in the top 5 wealthiest countries."
## [1] "Do not accept the alternative hypothesis (H₁): There is a significant relationship between BMI and adult mortality rates in the world's five wealthiest countries."
The results from the linear regression model allow us to assess the relationship between BMI and adult mortality rates in the top 5 wealthiest countries.
Coefficients: The estimated Body Mass Index (BMI) coefficient is 0.2526, with a standard error of 0.2270, meaning the fitted line predicts adult mortality to rise by about 0.25 for each one-unit increase in BMI. However, the estimate is small relative to its standard error (t = 1.113), and its p-value (0.269) does not reach statistical significance at the 0.05 level, so the slope cannot be distinguished from zero.
R-squared: The R-squared value is 0.01563, suggesting that only about 1.56% of the variation in mortality rates can be explained by the linear model using BMI. This implies that the model does not explain much of the data variance.
P-value: The p-value for the BMI coefficient is 0.269 (> 0.05), which is insufficient to reject the null hypothesis of no significant relationship between BMI and adult mortality rates in the top 5 wealthiest countries.
Based on these results and the initial hypothesis, insufficient statistical evidence exists to assert a significant relationship between BMI and adult mortality rates in the top 5 wealthiest countries. This might suggest no clear relationship between BMI and mortality rates within this dataset.
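The quantities in the regression summary are tied together by simple identities, which gives a quick way to sanity-check the reported numbers. The Python sketch below uses only values copied from the `lm()` output above; the variable names are illustrative, and the interval at the end uses a rough multiplier of 2 rather than the exact critical t value.

```python
# Values copied from the lm() summary above
estimate, std_error = 0.2526, 0.2270
r_squared = 0.01563
residual_df = 78                      # n - 2 for one predictor
n = residual_df + 2                   # 80 observations

t_value = estimate / std_error        # coefficient t statistic
f_value = t_value ** 2                # with a single predictor, F equals t squared
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / residual_df

# Rough 95% interval for the slope, using 2 as an approximate critical value
rough_ci = (estimate - 2 * std_error, estimate + 2 * std_error)

print(f"t = {t_value:.3f}, F = {f_value:.3f}, adjusted R^2 = {adj_r_squared:.4f}")
print(f"rough 95% interval for the slope: {rough_ci[0]:.3f} to {rough_ci[1]:.3f}")
```

The recomputed t, F, and adjusted R-squared agree with the summary up to rounding, and the rough interval for the slope straddles zero, which matches the non-significant p-value.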
IV. ANOVA TECHNIQUE - TUKEY’S METHOD AND HYPOTHESIS 4
ANOVA, or Analysis of Variance, is a statistical technique used to compare the means of two or more groups or treatments. It determines whether the groups differ significantly based on the variation observed in the data. ANOVA is commonly used when there are more than two groups to compare and is particularly useful for experimental designs with categorical independent variables. By comparing the variance between groups with the variance within groups, ANOVA helps determine whether significant differences exist in the means. Tukey’s method is used after ANOVA to create confidence intervals for all pairwise differences between factor level means while controlling the family-wise error rate, here at 5%, which corresponds to a 95% family-wise confidence level.
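As a rough illustration of the between-versus-within comparison described above, the following Python sketch computes the one-way ANOVA F statistic from scratch. The function name `one_way_anova` and the three groups of values are hypothetical and are not taken from the dataset.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical life-expectancy values for three periods, NOT taken from the dataset
periods = [[66.0, 68.0, 67.0], [69.0, 70.0, 71.0], [71.0, 72.0, 73.0]]
f_value, df_b, df_w = one_way_anova(periods)
print(f"F = {f_value:.2f} on {df_b} and {df_w} degrees of freedom")
```

A large F means the group means are spread out more than the scatter inside each group would explain.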
- Null Hypothesis (H₀): Life expectancy globally has gradually improved from 2000 to 2015 without attenuation.
- Alternative Hypothesis (H₁): Life expectancy globally has gradually improved from 2000 to 2015 with attenuation.
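With these hypotheses, we evaluate whether global life expectancy kept improving at a steady pace from 2000 to 2015 or whether the improvement slowed (attenuated) at some point. In practice, the one-way ANOVA tests whether mean life expectancy is equal across the three periods 2000-2005, 2005-2010, and 2010-2015, and Tukey’s method then shows which pairs of periods differ.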
In this example, hypothesis testing was taken a step further into the realm of post-hoc analysis. Post-hoc analysis often provides much greater insight into the differences or similarities between specific groups and is an important step in this case.
## Df Sum Sq Mean Sq F value Pr(>F)
## Year 2 2282 1141.2 12.85 3.27e-06 ***
## Residuals 729 64726 88.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 3.26574e-06
## [1] "The correlation is statistically significant."
## [1] "Reject the null hypothesis (H₀): There is no significant correlation."
## [1] "Accept the alternative hypothesis (H₁): There is a significant correlation."
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Life_Expectancy ~ Year, data = anova_data)
##
## $Year
## diff lwr upr p adj
## 2005-2010-2000-2005 2.568852 0.5653591 4.572346 0.0075779
## 2010-2015-2000-2005 4.137158 2.1336651 6.140652 0.0000045
## 2010-2015-2005-2010 1.568306 -0.7451288 3.881741 0.2496781
When there are three or more independent groups, we apply a one-way ANOVA to see if there is a significant difference among their means. Here the ANOVA compares the “Life_Expectancy” variable across the three levels of the “Year” factor (2000-2005, 2005-2010, and 2010-2015) and reveals a highly significant difference in mean life expectancy between periods (p = 3.27e-06). Because this p-value is far below 0.05, at least one period differs from the others.
F-value: The F-value obtained from this ANOVA is 12.85 on 2 and 729 degrees of freedom. The mean square for Year is 1141.2 against a residual mean square of 88.8, so the variation between periods is about 13 times the variation within periods.
P-value: The obtained p-value (3.26574e-06) is smaller than both the 0.01 and the 0.05 significance levels, so the hypothesis of equal mean life expectancy across the three periods is rejected at either level.
In conclusion, the results provide clear evidence that mean life expectancy differs between periods. One-way ANOVA tells us only whether the group means differ, not which ones, so we still need a post hoc multiple comparison test to dig further.
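The entries of the ANOVA table printed above are linked by simple arithmetic, which the following Python sketch reproduces from the reported sums of squares and degrees of freedom. The variable names are illustrative, and eta squared (the share of total variation attributed to the factor) is an extra summary not shown in the R output.

```python
# Values copied from the ANOVA table above
ss_year, df_year = 2282, 2
ss_resid, df_resid = 64726, 729

ms_year = ss_year / df_year          # mean square between periods
ms_resid = ss_resid / df_resid       # mean square within periods
f_value = ms_year / ms_resid         # F statistic
eta_squared = ss_year / (ss_year + ss_resid)   # share of variation explained by Year

print(f"F = {f_value:.2f}, eta^2 = {eta_squared:.3f}")
```

The recomputed F agrees with the table up to rounding. The small eta squared suggests that, although the period effect is statistically significant, most of the variation in life expectancy lies within periods, that is, between countries.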
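According to the Tukey HSD test, 2005-2010 vs 2000-2005 and 2010-2015 vs 2000-2005 differ significantly at the 95 percent family-wise confidence level, while 2010-2015 vs 2005-2010 does not.
In particular:
Adjusted p-value for the difference in means between 2005-2010 and 2000-2005: 0.0075779 (difference of 2.57 years)
Adjusted p-value for the difference in means between 2010-2015 and 2000-2005: 0.0000045 (difference of 4.14 years)
Adjusted p-value for the difference in means between 2010-2015 and 2005-2010: 0.2496781 (difference of 1.57 years; the interval from -0.75 to 3.88 includes 0)
Mean life expectancy therefore rose over the study period, but the gain from 2005-2010 to 2010-2015 (1.57 years) is smaller than the gain from 2000-2005 to 2005-2010 (2.57 years) and is not statistically significant, which is consistent with an attenuated improvement.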
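A pairwise comparison is significant at the 95% family-wise level exactly when its Tukey interval excludes zero. The Python sketch below applies that rule to the three rows copied from the Tukey output above; the list layout and variable names are illustrative.

```python
# (comparison, diff, lower, upper, adjusted p) copied from the Tukey output above
tukey = [
    ("2005-2010 vs 2000-2005", 2.568852, 0.5653591, 4.572346, 0.0075779),
    ("2010-2015 vs 2000-2005", 4.137158, 2.1336651, 6.140652, 0.0000045),
    ("2010-2015 vs 2005-2010", 1.568306, -0.7451288, 3.881741, 0.2496781),
]

alpha = 0.05
for name, diff, lwr, upr, p_adj in tukey:
    excludes_zero = lwr > 0 or upr < 0     # interval entirely on one side of zero
    significant = p_adj < alpha            # should agree with the interval check
    assert excludes_zero == significant
    print(f"{name}: diff = {diff:.2f} years, significant = {significant}")

significant_pairs = [name for name, _, lwr, upr, _ in tukey if lwr > 0 or upr < 0]
```

Only the two comparisons against 2000-2005 pass the check; the interval for 2010-2015 vs 2005-2010 includes zero.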