The attached who.csv dataset contains real-world data from 2008. The variables included are as follows:
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
# Load required packages: ggplot2 for plotting, dplyr for mutate() and %>%
library(ggplot2)
library(dplyr)

# Load the who.csv dataset
who.csv <- read.csv("who.csv")

# Simple linear regression of LifeExp on TotExp
simple.lm <- lm(LifeExp ~ TotExp, data = who.csv)
summary(simple.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
# Scatterplot of LifeExp vs TotExp with the fitted regression line
ggplot(data = who.csv, aes(x = TotExp, y = LifeExp)) +
  geom_point() +
  geom_abline(slope = coef(simple.lm)[[2]], intercept = coef(simple.lm)[[1]])
The \(r^2\) value of 0.2577 means that the model explains roughly 26% of the variability in life expectancy.
The F statistic is 65.26 on 1 and 188 degrees of freedom (one predictor and 188 residual degrees of freedom). Although this is statistically significant, the low \(r^2\) shows that this single predictor still leaves most of the variation in life expectancy unexplained.
The p-value of 7.714e-14 is far below 0.05, so we reject the null hypothesis that the slope on TotExp is zero; the association is statistically significant even though the fit is weak.
The residual standard error of 9.371 means that the observed life expectancies deviate from the fitted values by about 9.4 years on average, which is large relative to the spread of life expectancies in the data.
Based on the graph above, we can conclude that the assumptions of simple linear regression are not met: the relationship between the dependent and independent variables is clearly not a straight line, and the spread of the points is not constant.
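As an additional check (a minimal sketch, not part of the original output), the standard diagnostic plots for the fitted model can be inspected with base R's plot() method for lm objects:

# Residuals vs fitted, normal Q-Q, scale-location, and leverage plots for the
# untransformed model; curvature in the residuals-vs-fitted panel and departures
# from the line in the Q-Q plot would confirm that linearity and normality are violated
par(mfrow = c(2, 2))
plot(simple.lm)
par(mfrow = c(1, 1))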
# Raise LifeExp to the 4.6 power and TotExp to the 0.06 power
who.csv <- who.csv %>%
  mutate(LifeExpto4.6 = LifeExp^4.6, TotExpto0.06 = TotExp^0.06)
simple.lm2 <- lm(LifeExpto4.6 ~ TotExpto0.06, data = who.csv)
summary(simple.lm2)
##
## Call:
## lm(formula = LifeExpto4.6 ~ TotExpto0.06, data = who.csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExpto0.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Below is a plot of the transformed variables, LifeExp^4.6 against TotExp^0.06, with the fitted regression line, to check whether the relationship is now linear.
# Scatterplot of the transformed variables with the fitted regression line
ggplot(data = who.csv, aes(x = TotExpto0.06, y = LifeExpto4.6)) +
  geom_point() +
  geom_abline(slope = coef(simple.lm2)[[2]], intercept = coef(simple.lm2)[[1]])
The \(r^2\) value of 0.7298 means that the transformed model explains roughly 73% of the variability in LifeExp^4.6, a large improvement over the untransformed model.
The F statistic of 507.7 on 1 and 188 degrees of freedom is much larger than before, indicating that the transformed predictor explains a substantial share of the variation in the response.
The p-value of < 2.2e-16 is essentially zero, so we reject the null hypothesis that the slope on TotExp^0.06 is zero.
Based on the graph above, we can conclude that the new model simple.lm2 fits the data well: the relationship between the transformed dependent and independent variables is close to a straight line.
Since we want our residuals to be approximately normally distributed, the first and third quartiles should be roughly symmetric about zero and on the order of two-thirds of the residual standard error; here the quartiles (-53,978,977 and 59,139,231) are nearly symmetric and consistent with the residual standard error of 90,490,000.
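As a quick numeric check (a sketch, not in the original output), the residual quartiles can be compared against 0.674 times the residual standard error, which is roughly what we would expect for normally distributed residuals:

# Observed residual quartiles for the transformed model
quantile(resid(simple.lm2), probs = c(0.25, 0.75))
# Expected magnitude of the quartiles if the residuals were exactly normal
0.674 * sigma(simple.lm2)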
# Invert the transformed model: compute the predicted LifeExp^4.6 from the
# fitted coefficients, then raise it to the 1/4.6 power to recover LifeExp
forecast_lifeExpectancy <- function(TotExpto0.06) {
  (620060216 * TotExpto0.06 - 736527910)^(1 / 4.6)
}
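The forecasts quoted below come from evaluating this function at the requested values; a predict() call on the fitted model, back-transformed by raising to the 1/4.6 power, gives essentially the same numbers (this cross-check is a sketch and was not in the original output):

forecast_lifeExpectancy(1.5)  # forecast at TotExp^0.06 = 1.5
forecast_lifeExpectancy(2.5)  # forecast at TotExp^0.06 = 2.5

# Equivalent cross-check using predict() and back-transforming the response
predict(simple.lm2, newdata = data.frame(TotExpto0.06 = c(1.5, 2.5)))^(1 / 4.6)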
When TotExp^0.06 = 1.5, the forecasted life expectancy is 63.3115334 years.
When TotExp^0.06 = 2.5, the forecasted life expectancy is 86.5064485 years.
\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
# Multiple regression with an interaction between PropMD and TotExp
multiple_regression <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who.csv)
summary(multiple_regression)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who.csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Below are the diagnostic plots for the multiple regression model \(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\).
# Diagnostic plots (residuals vs fitted, Q-Q, scale-location, leverage)
plot(multiple_regression)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
The \(R^2\) value of 0.3574 states that the model accounts for about 36% of the variability in life expectancy.
The F statistic of 34.49 on 3 and 186 degrees of freedom, together with its very small p-value, indicates that the predictors jointly contribute to explaining life expectancy.
The individual p-values are all quite small, so we can reject the null hypotheses that PropMD, TotExp, and their interaction have no effect on life expectancy.
The residual standard error is 8.765; for roughly normal residuals the first and third quartiles should be approximately symmetric about zero and around two-thirds of this value, but here they are noticeably asymmetric (-4.132 versus 6.540), suggesting skewed residuals.
As can be seen from the diagnostic plots above, the model does not really meet the conditions for linear regression; in particular, the residuals vs fitted plot does not display constant variance.
# Forecast LifeExp from the fitted multiple regression coefficients
forecast_lifeExpectancy <- function(PropMD, TotExp) {
  62.7727 + 1497.494 * PropMD +
    7.233324e-05 * TotExp - 0.006025686 * PropMD * TotExp
}
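The forecast of 107.6960019 quoted below corresponds, judging from the coefficients and the arithmetic, to evaluating the function at PropMD = 0.03 and TotExp = 14; those inputs are an assumption recovered from the reported result rather than something shown in the output:

# Assumed inputs (PropMD = 0.03, TotExp = 14) recovered from the reported forecast
forecast_lifeExpectancy(PropMD = 0.03, TotExp = 14)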
The model predicts that the life expectancy would be 107.6960019 years.
This forecast doesn’t seem realistic: it implies that even with an unusually high proportion of doctors and very low total spending, we would still expect an extremely high life expectancy.
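One way to see why this forecast is an unrealistic extrapolation (a quick sketch, not in the original analysis) is to compare the assumed inputs with the observed ranges of the predictors and to ask predict() for an interval, which widens for inputs far from the data:

# Observed ranges of the predictors in the dataset
range(who.csv$PropMD)
range(who.csv$TotExp)

# Prediction with a confidence interval at the same assumed inputs as above
predict(multiple_regression,
        newdata = data.frame(PropMD = 0.03, TotExp = 14),
        interval = "confidence")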