The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
file_path <- "/Users/mollysiebecker/DATA 605/homework/who.csv"
who_data <- read.csv(file_path)
ggplot(who_data, mapping = aes(x = TotExp, y = LifeExp)) +
geom_point() +
labs(title = "Life Expectancy vs. Total Expenditures", x = "Total Expenditures", y = "Average Life Expectancy")
m_simple <- lm(LifeExp ~ TotExp, data = who_data)
summary(m_simple)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Right away, the assumptions of simple linear regression are not met, because we can see from the scatter plot that a linear model does not appear to be a good fit for the data. The F-statistic of 65.26 indicates that this model is better than one with only the y-intercept and not the TotExp variable. Since there is also a very low p-value, we can see that this relationship between life expectancy and total expenditure is highly unlikely to be explained only by chance. The r-squared of ~.25 indicates a low correlation, as only 25% of the variation in the data is explained by the model. The standard error is ~9.3, which can be used to show that the residuals are not normally distributed since they are not equal to approximately 1.5 times the SE. Overall, a simple linear model is not appropriate for this data.
who_data_transformed <- who_data
who_data_transformed$LifeExp <- who_data_transformed$LifeExp^4.6
who_data_transformed$TotExp <- who_data_transformed$TotExp^.06
ggplot(who_data_transformed, mapping = aes(x = TotExp, y = LifeExp)) +
geom_point() +
labs(title = "Life Expectancy vs. Total Expenditures (Transformed)", x = "(Total Expenditures)^.06", y = "(Average Life Expectancy)^4.6")
m_transformed <- lm(LifeExp ~ TotExp, data = who_data_transformed)
summary(m_transformed)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
This model is better than the previous simple linear model. The F-statistic of 507.7 indicates that this model is better than one with only the y-intercept and not the TotExp^.06 variable. Again, the very low p-value indicates that this relationship between life expectancy and total expenditure is highly unlikely to be explained only by chance. The r-squared of ~.73 indicates a moderate correlation, as 73% of the variation in the data can be explained by the model. The standard error is 90490000, which can be used to show that the residuals are not normally distributed since they are not equal to approximately 1.5 times the SE.
(-736527910+620060216*1.5)^(1/4.6)
## [1] 63.31153
(-736527910+620060216*2.5)^(1/4.6)
## [1] 86.50645
Inputting the given values and then taking the inverse of raising to the 4.6th power, we find that the life expectancies are approximately 63.3 and 86.5 years, respectively.
quadratic_data <- who_data %>%
mutate(quadratic_term = PropMD * TotExp)
m_quad <- lm(LifeExp ~ PropMD + TotExp + quadratic_term, data = quadratic_data)
summary(m_quad)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + quadratic_term, data = quadratic_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## quadratic_term -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
This model is better than the previous simple linear model, but not as good as the model with the transformed data. The F-statistic of 34.5 indicates that this model is better than one with only the y-intercept and not the other variables. Again, the very low p-value indicates that this relationship between life expectancy and total expenditure is highly unlikely to be explained only by chance. The r-squared of ~.35 indicates a low correlation, as only 35% of the variation in the data can be explained by the model. The standard error is 8.765, which can be used to show that the residuals are not normally distributed since they are not equal to approximately 1.5 times the SE.
6.277e+01+1.497e+03*.03+7.233e-05*14+-6.026e-03*.03*14
## [1] 107.6785
No, this forecast does not seem realistic since ~108 is a very old age. It also seems unrealistic for a country to have one of the highest proportions of doctors at .03, but one of the lowest total expenditures at 14, which is likely contributing to the unrealistic prediction.