An observational study was conducted on schooling and life expectancy data for 193 countries from the World Health Organization (WHO). Life expectancy and the number of years spent in school show a strong, statistically significant positive correlation and share a linear trend. A linear regression model was then fit to the data because the conditions for such a model were satisfied. The slope of this model has a P-value much less than 0.05, indicating that life expectancy is associated with the number of years spent in school. This analysis is limited in that only one explanatory variable was used, and many other factors contribute to life expectancy.
Does spending more years in school have any correspondence with increased or decreased life expectancy?
The cases are countries across the world. The dataset includes 15 years of school retention data for each case. WHO tracks the data, which was then made available on Kaggle. There are 193 countries in this dataset (193 countries and 15 years of school retention data for each country is 2893 cases). In this report, we focus on the cases for the most recent year (2015).
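As a minimal sketch of that subsetting step: the toy data frame below stands in for the imported file, and its column names (`Country`, `Year`, `Schooling`) are assumptions for illustration, not taken from the report's own code.

```r
# Toy stand-in for the imported data; the real file and its column names
# (Country, Year, Schooling) are assumptions for illustration only.
toy_life_expectancy_data <- data.frame(
  Country   = rep(c("A", "B", "C"), each = 3),
  Year      = rep(2013:2015, times = 3),
  Schooling = c(10.1, 10.3, 10.5, 12.0, 12.2, NA, 8.4, 8.6, 8.9)
)

# Keep only the most recent year so that each country appears once.
toy_2015 <- subset(toy_life_expectancy_data, Year == 2015)

nrow(toy_2015)                  # one row per country
sum(is.na(toy_2015$Schooling))  # countries with no schooling value
```

Restricting to a single year is also what makes each case a distinct country rather than a repeated measurement.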
The dependent variable is the life expectancy and it is quantitative.
The independent variable is the number of years of schooling and it is quantitative.
describe(parsed_life_expectancy_data$Life.expectancy)
## parsed_life_expectancy_data$Life.expectancy
## n missing distinct Info Mean Gmd .05 .10
## 183 0 132 1 71.62 9.182 57.33 59.82
## .25 .50 .75 .90 .95
## 65.75 73.90 76.95 81.78 82.70
##
## lowest : 51.0 52.4 52.5 53.1 53.3, highest: 83.4 83.7 85.0 86.0 88.0
describe(parsed_life_expectancy_data$Schooling)
## parsed_life_expectancy_data$Schooling
## n missing distinct Info Mean Gmd .05 .10
## 173 10 89 1 12.93 3.286 8.28 9.10
## .25 .50 .75 .90 .95
## 10.80 13.10 15.00 16.38 17.30
##
## lowest : 4.9 5.0 5.4 6.3 7.1, highest: 18.1 18.6 19.0 19.2 20.4
parsed_life_expectancy_data %>%
  pivot_longer(cols = c(Life.expectancy, Schooling)) %>%
  ggplot(mapping = aes(x = value)) +
  geom_histogram(bins = 15) +
  facet_wrap(~name, scales = "free_x")
## Warning: Removed 10 rows containing non-finite values (stat_bin).
parsed_life_expectancy_data %>%
  ggplot(mapping = aes(x = Schooling, y = Life.expectancy)) +
  geom_point()
## Warning: Removed 10 rows containing missing values (geom_point).
The scatterplot above shows a linear trend between schooling and life expectancy in 2015 for the 173 countries with schooling data (10 of the 183 countries in the 2015 subset have no schooling value and are dropped). Based on this, we can use linear regression and fit a least squares line to the data. The conditions for fitting a least squares line are met for these two variables; they are checked after the model has been fitted to the data.
Since the slope of the regression line is an estimate of the true parameter, is there convincing evidence that the slope is different from zero? Does the data provide strong evidence that countries where people spend more time in school will have a higher life expectancy?
\(H_0: \beta_1 = 0\). The true linear model has slope zero.
\(H_A: \beta_1 \neq 0\). The true linear model has a slope different from zero; that is, the average number of years in school for a country’s population is predictive of the country’s life expectancy.
The linear model coefficients and other parameters that will help us decide whether to reject the null hypothesis are shown below.
model <- lm(Life.expectancy ~ Schooling, data = parsed_life_expectancy_data)
summary(model)
##
## Call:
## lm(formula = Life.expectancy ~ Schooling, data = parsed_life_expectancy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.909 -2.547 0.317 3.170 10.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.9016 1.5870 27.03 <2e-16 ***
## Schooling 2.2287 0.1198 18.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.575 on 171 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.6694, Adjusted R-squared: 0.6675
## F-statistic: 346.2 on 1 and 171 DF, p-value: < 2.2e-16
parsed_life_expectancy_data %>%
  summarise(cor(Schooling, Life.expectancy, use = "complete.obs"))
## cor(Schooling, Life.expectancy, use = "complete.obs")
## 1 0.8181594
With this output, the linear regression model is:
\(\widehat{\text{Life expectancy}} = 42.9016 + 2.2287 \times \text{Schooling}\)
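To make the fitted line concrete, here is a small sketch that evaluates it for a chosen number of years of schooling. The helper `predict_life_expectancy()` is a hypothetical name introduced for illustration; the coefficients are copied from the `summary(model)` output.

```r
# Coefficients copied from the summary(model) output above.
intercept <- 42.9016
slope     <- 2.2287

# Hypothetical helper: predicted life expectancy (years) for a given
# number of years of schooling.
predict_life_expectancy <- function(schooling) {
  intercept + slope * schooling
}

predict_life_expectancy(12)

# Each additional year of schooling raises the prediction by the slope.
predict_life_expectancy(13) - predict_life_expectancy(12)
```

Predictions are only meaningful within the observed range of schooling (roughly 4.9 to 20.4 years in this data); the line should not be extrapolated far beyond it.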
The R-squared value from the output shows that 67% of the variance in life expectancy can be explained by schooling using a linear model.
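In a regression with a single explanatory variable, R-squared is the square of the correlation coefficient, so the two outputs above can be cross-checked against each other. A quick sketch using the reported correlation:

```r
# Correlation reported by cor(Schooling, Life.expectancy) above.
r <- 0.8181594

# In simple linear regression, R-squared equals the squared correlation.
r_squared <- r^2
round(r_squared, 4)
```

The result agrees with the Multiple R-squared of 0.6694 in the model summary.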
Note that the P-value for the Schooling coefficient, which is our estimate of \(\beta_1\), is less than 2e-16.
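As a sketch of where that P-value comes from, the t statistic and two-sided P-value can be recomputed from the rounded estimate and standard error in the summary output. Because the inputs are rounded, the t statistic differs slightly from the 18.61 that R reports.

```r
# Values copied from the Schooling row of the summary(model) output.
estimate  <- 2.2287
std_error <- 0.1198
df_resid  <- 171   # residual degrees of freedom

# Test statistic for H0: beta_1 = 0.
t_stat <- estimate / std_error

# Two-sided P-value from the t distribution.
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

t_stat
p_value
```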
The 95% confidence interval for the Schooling coefficient is: \(2.2287 \pm 0.236\). Note that 0 is not in the 95% CI.
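The margin of error can be reproduced from the summary output as the critical t value times the standard error. This is a sketch using the rounded values printed by `summary(model)`, so the last digit may differ slightly from an interval computed on the fitted object.

```r
# Values copied from the Schooling row of the summary(model) output.
estimate  <- 2.2287
std_error <- 0.1198
df_resid  <- 171

# Critical value for a 95% interval from the t distribution.
t_crit <- qt(0.975, df = df_resid)

margin   <- t_crit * std_error
ci_lower <- estimate - margin
ci_upper <- estimate + margin

c(lower = ci_lower, upper = ci_upper)
```

With the fitted object available, `confint(model, "Schooling", level = 0.95)` should give the same interval up to rounding.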
ggplot(data = parsed_life_expectancy_data, aes(x = Schooling, y = Life.expectancy)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).
ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
ggplot(data = model, aes(sample = .resid)) +
  stat_qq()
These three plots allow us to check that the conditions for the least squares line have been met.
Linearity: The data does indeed show a linear trend. Therefore, this condition is met.
Nearly normal residuals: The residuals are nearly normal since the points on the QQ plot almost line up in a straight diagonal line.
Constant variability: The residual plot shows no obvious pattern in the residuals. Therefore, this condition is also met.
Independent observations: Because we use a single year (2015) rather than time series data, each case is a different country. Therefore, this condition is also met.
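The visual checks above can be complemented with simple numeric summaries. The sketch below is not part of the original analysis: it uses simulated data (not the WHO data) generated to loosely resemble the fitted model, and the two summaries are one possible choice among many.

```r
set.seed(1)

# Simulated stand-in data (NOT the WHO data): a linear trend plus normal
# noise, loosely mimicking the scale of the fitted model above.
n         <- 173
schooling <- runif(n, min = 5, max = 20)
life_exp  <- 42.9 + 2.23 * schooling + rnorm(n, sd = 4.6)

sim_model <- lm(life_exp ~ schooling)
res       <- resid(sim_model)

# Nearly normal residuals: correlation between ordered residuals and
# theoretical normal quantiles (values close to 1 suggest normality).
qq_cor <- cor(sort(res), qnorm(ppoints(n)))

# Constant variability: compare residual spread in the lower and upper
# halves of the fitted values (a ratio near 1 suggests constant spread).
upper_half   <- fitted(sim_model) > median(fitted(sim_model))
spread_ratio <- sd(res[upper_half]) / sd(res[!upper_half])

qq_cor
spread_ratio
```

On the real data, the same two summaries could be computed from `resid(model)` and `fitted(model)` and read alongside the residual and QQ plots.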
Since the P-value for the Schooling coefficient is much less than 0.05 (<2e-16), we reject the null hypothesis: the data provides convincing evidence that the number of years spent in school is associated with life expectancy.
This analysis is important because it shows the strength of the relationship between life expectancy and time spent in school. The fitted line on the scatterplot shows that, on average, countries where the average person spends more time in school have longer life expectancies. For countries where education is less developed or where there is less incentive for people to stay in school, this research could give more incentive to keep people in school longer. In this analysis, we measured the correspondence between life expectancy and just one other variable. With additional variables, such as household income or health sector effectiveness (which the Kaggle data provides) or the number of degree holders (which it does not), we could construct a more robust analysis of the influence of additional factors on life expectancy.