The who.csv dataset contains real-world data from 2008. The variables are as follows:
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.
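# Load the packages used below: knitr for kable() and ggplot2 for the scatter plot.
library(knitr)
library(ggplot2)
# Load the who dataset into R.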
who_dataset <- read.csv('https://raw.githubusercontent.com/stephen-haslett/data605/data605-week-12/who.csv')
# Attach the dataset so we can access the variables easily.
attach(who_dataset)
# View the first 10 observations in the dataset to get a sense of how the data is structured.
kable(head(who_dataset, 10), format = 'markdown')
Country | LifeExp | InfantSurvival | Under5Survival | TBFree | PropMD | PropRN | PersExp | GovtExp | TotExp |
---|---|---|---|---|---|---|---|---|---|
Afghanistan | 42 | 0.835 | 0.743 | 0.99769 | 0.0002288 | 0.0005723 | 20 | 92 | 112 |
Albania | 71 | 0.985 | 0.983 | 0.99974 | 0.0011431 | 0.0046144 | 169 | 3128 | 3297 |
Algeria | 71 | 0.967 | 0.962 | 0.99944 | 0.0010605 | 0.0020914 | 108 | 5184 | 5292 |
Andorra | 82 | 0.997 | 0.996 | 0.99983 | 0.0032973 | 0.0035000 | 2589 | 169725 | 172314 |
Angola | 41 | 0.846 | 0.740 | 0.99656 | 0.0000704 | 0.0011462 | 36 | 1620 | 1656 |
Antigua and Barbuda | 73 | 0.990 | 0.989 | 0.99991 | 0.0001429 | 0.0027738 | 503 | 12543 | 13046 |
Argentina | 75 | 0.986 | 0.983 | 0.99952 | 0.0027802 | 0.0007410 | 484 | 19170 | 19654 |
Armenia | 69 | 0.979 | 0.976 | 0.99920 | 0.0036987 | 0.0049189 | 88 | 1856 | 1944 |
Australia | 82 | 0.995 | 0.994 | 0.99993 | 0.0023320 | 0.0091494 | 3181 | 187616 | 190797 |
Austria | 80 | 0.996 | 0.996 | 0.99990 | 0.0036109 | 0.0064587 | 3788 | 189354 | 193142 |
# Create a scatter plot of life expectancy by total (personal plus government) expenditures on healthcare.
life_expectancy_plot <- ggplot(who_dataset, aes(x = TotExp, y = LifeExp)) + geom_point(color = "salmon") +
  ylab("Life Expectancy") + xlab("Total Expenditures on Healthcare")
life_expectancy_plot
# Simple linear regression.
who_simple_regression <- lm(LifeExp ~ TotExp, who_dataset)
# Run the summary function on the model so we can interpret the F statistics,
# R-squared, standard error, and p-values.
summary(who_simple_regression)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
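F-statistic: The F-statistic (65.26 on 1 and 188 degrees of freedom, p-value 7.714e-14) is large enough for us to reject the null hypothesis that total healthcare expenditure has no linear relationship with a country’s life expectancy. We can therefore say that the "TotExp" variable is associated with the "LifeExp" variable.
R-squared: The R-squared value is low (0.2577), meaning that the model explains only 25.77% of the variation in life expectancy.
Standard error: The residual standard error is 9.371 years, so predictions from this model are typically off by roughly nine years, which suggests that the model does not fit well. The standard error of the TotExp coefficient (7.795e-06) is about one eighth of the estimate (6.297e-05), so the slope itself is estimated fairly precisely.
P-values: The low p-values (< 2e-16 for the intercept and 7.71e-14 for TotExp) suggest that total expenditure is related to a country's life expectancy. Therefore the 'TotExp' variable does contribute to the model.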
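In a regression with a single predictor, the F-statistic is the square of the slope's t value, which is why the F-test and the TotExp t-test report the same p-value. The sketch below checks this using the rounded numbers from the summary output above, so small rounding differences are expected. The variable names are illustrative only.

```r
# Values copied (rounded) from the summary() output of who_simple_regression.
slope_estimate  <- 6.297e-05
slope_std_error <- 7.795e-06

# t value of the slope, and the F-statistic it implies for a one-predictor model.
t_value <- slope_estimate / slope_std_error
f_value <- t_value^2

# Upper-tail probability of the F distribution on 1 and 188 degrees of freedom.
p_value <- pf(f_value, df1 = 1, df2 = 188, lower.tail = FALSE)

round(c(t = t_value, F = f_value), 2)
p_value
```

The results should land close to the reported t value of 8.079 and F-statistic of 65.26.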
# Raise the life expectancy variable to the power of 4.6.
life_expectancy_46 = who_dataset$LifeExp^(4.6)
# Raise the total expenditures variable to the power of 0.06.
total_expenditures_06 = who_dataset$TotExp^(0.06)
# Create a scatter plot for LifeExp^4.6 as a function of TotExp^.06.
plot(total_expenditures_06, life_expectancy_46, main='LifeExp^4.6 as a function of TotExp^.06',
     ylab = 'LifeExp^4.6',
     xlab = 'TotExp^.06',
     col = 2)
# Re-run the simple regression model on the adjusted variables.
adjusted_model = lm(life_expectancy_46 ~ total_expenditures_06)
abline(adjusted_model, col = 1)
# Run the summary function on the adjusted model so we can interpret
# the F statistics, R-squared, standard error, and p-values.
summary(adjusted_model)
##
## Call:
## lm(formula = life_expectancy_46 ~ total_expenditures_06)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## total_expenditures_06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
F-statistic: The F-statistic for this model (507.7) is much higher than that of the unadjusted model in question 1 (65.26). The p-value (< 2.2e-16) is also lower, so the relationship between the transformed variables is much stronger than the relationship in the unadjusted model.
R-squared: The R-squared value (0.7298) is also much higher than that of the unadjusted model. The unadjusted model explained only 25.77% of the variation in LifeExp, whereas this model explains 72.98% of the variation in LifeExp^4.6.
Standard error: The residual standard error (90490000) looks very high, but it is measured in units of LifeExp^4.6 rather than years, so it cannot be compared directly with the 9.371 from the unadjusted model. The large value is a consequence of raising LifeExp to the power of 4.6, not a sign of a poor fit.
P-values: The p-value for the slope (< 2e-16) is lower than that of the unadjusted model (7.71e-14), giving us more confidence that the transformed 'TotExp' variable contributes to the model.
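Because the residual standard error of the adjusted model is on the LifeExp^4.6 scale, one way to judge its error in years is to back-transform the fitted values before measuring the error. The sketch below does this on only the ten rows printed in the table earlier, so the number it produces is illustrative and is not the full-data result. The names sample_rows, sample_model, and rmse_years are hypothetical.

```r
# First ten rows of who.csv as printed in the table above (illustration only).
sample_rows <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73, 75, 69, 82, 80),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190797, 193142)
)

# Fit the transformed model on the sample.
sample_model <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = sample_rows)

# Back-transform the fitted values to years, then compute the root mean squared error.
fitted_years <- fitted(sample_model)^(1 / 4.6)
rmse_years <- sqrt(mean((sample_rows$LifeExp - fitted_years)^2))
rmse_years
```

Running the same steps on who_dataset would give an error in years that can be set beside the 9.371 from question 1.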
We can use the formula from the adjusted model in question 2 to answer this question: \[LifeExp^{4.6} = -736527910 + 620060216 \times TotExp^{0.06}\] Taking the 4.6th root of the result gives the life expectancy in years.
# Life expectancy when TotExp^.06 = 1.5.
life_expectancy <- -736527910 + 620060216 * 1.5
round(life_expectancy ^ (1/4.6), 2)
## [1] 63.31
Answer: 63.31 years
# Life expectancy when TotExp^.06 = 2.5.
life_expectancy <- -736527910 + 620060216 * 2.5
round(life_expectancy ^ (1/4.6), 2)
## [1] 86.51
Answer: 86.51 years
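The two calculations above apply the same back-transformation, which can be wrapped in a small helper. This is a minimal sketch: the function name predict_life_expectancy and its argument names are hypothetical, and the default coefficients are the rounded values from the summary output of the adjusted model.

```r
# Back-transform a prediction from the adjusted model to years.
# tot_exp_06 is the value of TotExp^0.06, not TotExp itself.
predict_life_expectancy <- function(tot_exp_06,
                                    intercept = -736527910,
                                    slope = 620060216) {
  (intercept + slope * tot_exp_06)^(1 / 4.6)
}

forecasts <- round(predict_life_expectancy(c(1.5, 2.5)), 2)
forecasts
```

With the fitted model object available, calling predict() on adjusted_model and raising the result to the power of 1/4.6 should give the same forecasts without hard-coding the coefficients.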
# Create the multiple regression model.
who_multiple_regression <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, who_dataset)
# Run the summary function on the model so we can interpret the F statistics,
# R-squared, standard error, and p-values.
summary(who_multiple_regression)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
F-statistic: The F-statistic (34.49 on 3 and 186 degrees of freedom) is significant and the p-value is low (< 2.2e-16), so we can reject the null hypothesis that none of the predictors is related to life expectancy and conclude that at least one of the explanatory variables contributes to the model.
R-squared: The R-squared value is low (0.3574), meaning that the model explains only 35.74% of the variation in life expectancy.
Standard error: The residual standard error (8.765 years) is only slightly lower than that of the simple model in question 1 (9.371 years), which suggests that the model still does not fit well.
P-values: The low p-values for PropMD, TotExp, and their interaction suggest that each term contributes to the model.
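The interaction term means that the effect of one predictor depends on the level of the other. As a worked reading of the coefficients above (rounded as printed in the summary output), the fitted equation and the implied change in life expectancy per unit of TotExp are:

\[\widehat{LifeExp} = 62.77 + 1497 \times PropMD + 7.233 \times 10^{-5} \times TotExp - 6.026 \times 10^{-3} \times PropMD \times TotExp\]

\[\frac{\partial \widehat{LifeExp}}{\partial TotExp} = 7.233 \times 10^{-5} - 6.026 \times 10^{-3} \times PropMD\]

Under these rounded values, the TotExp slope is positive only while PropMD is below roughly 0.012 (7.233e-05 / 6.026e-03), and it shrinks as PropMD rises.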
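# Forecast life expectancy when PropMD = 0.03 and TotExp = 14.
PropMD = 0.03
TotExp = 14
variable_values <- data.frame(PropMD, TotExp)
# Request a prediction interval around the forecast.
life_expectancy <- predict(who_multiple_regression, newdata = variable_values, interval = 'prediction')
life_expectancy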
## fit lwr upr
## 1 107.696 84.24791 131.1441
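Answer: The forecast is not realistic. A predicted average life expectancy of 107.7 years is far above every value in the rows printed above (the highest is 82), and very few individuals live to that age, let alone a national average. The inputs are also an unusual combination: a PropMD of 0.03 is roughly eight times the largest PropMD among those rows (about 0.0037), paired with total expenditures of only 14, so the model is extrapolating well outside that range.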
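To see where the forecast comes from, the prediction can be split into the contribution of each term. The sketch below uses the rounded coefficients from the summary output of the multiple regression, so its total differs slightly from the 107.696 returned by predict(). The variable names are illustrative only.

```r
# Rounded coefficients copied from the summary() output of who_multiple_regression.
intercept  <- 6.277e+01
b_prop_md  <- 1.497e+03
b_tot_exp  <- 7.233e-05
b_interact <- -6.026e-03

# Inputs used in the forecast.
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted life expectancy.
contributions <- c(
  intercept   = intercept,
  PropMD      = b_prop_md * prop_md,
  TotExp      = b_tot_exp * tot_exp,
  interaction = b_interact * prop_md * tot_exp
)
contributions
forecast_total <- sum(contributions)
forecast_total
```

The PropMD term adds about 45 years on top of the intercept, while the TotExp and interaction terms are close to zero, so the implausible forecast is driven almost entirely by the large PropMD value.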