Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.
library(knitr)
library(ggplot2)
data<-read.csv("https://raw.githubusercontent.com/hovig/MSDS_CUNY/master/DATA605/who.csv")
Provide a scatterplot of \(LifeExp \sim TotExp\), and run a simple linear regression. Do not transform the variables. Provide and interpret the F statistic, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
ggplot(data, aes(x = TotExp, y = LifeExp)) +
geom_point(size = 3, alpha = .4) +
labs(x = "Total Expenditures", y = "Life Expectancy")
(lregression <- lm(data$LifeExp~data$TotExp, data = data))
##
## Call:
## lm(formula = data$LifeExp ~ data$TotExp, data = data)
##
## Coefficients:
## (Intercept) data$TotExp
## 6.475e+01 6.297e-05
(s<-summary(lregression))
##
## Call:
## lm(formula = data$LifeExp ~ data$TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## data$TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
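# The overall p-value is not stored in the summary object; compute it from the F statistic
f <- s$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
cat(sprintf("%s = %s\n",c(" Residual standard error","R-squared","F-statistic","p-value"),c(sprintf("%f",c(s$sigma,s$r.squared,f[["value"]])),format.pval(p_value,digits=4))))
## Residual standard error = 9.371033
## R-squared = 0.257692
## F-statistic = 65.264198
## p-value = 7.714e-14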
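A `summary.lm` object stores the F statistic but not the overall model p-value, so the p-value has to be computed from the F distribution. The sketch below illustrates this on the built-in `mtcars` data (a stand-in, not the WHO data); `demo_fit` and `demo_s` are names chosen only for this example.

```r
# Demonstration on a built-in dataset (not the WHO data)
demo_fit <- lm(mpg ~ wt, data = mtcars)
demo_s <- summary(demo_fit)

# fstatistic is a named vector with elements value, numdf, dendf
f <- demo_s$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Element 11 of a summary.lm object is the unscaled covariance matrix,
# so indexing it positionally does not return a p-value
print(names(demo_s)[11])

cat("p-value =", format.pval(p_value, digits = 4), "\n")
```

Accessing components by name (`demo_s$sigma`, `demo_s$r.squared`, `demo_s$fstatistic`) avoids depending on their position in the list.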
hist(lregression$resid,main="Histogram of Residuals")
qqnorm(lregression$resid)
qqline(lregression$resid)
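The histogram and Q-Q plot can be complemented with numeric checks. The following is a minimal sketch, assuming a fitted `lm` object; `check_residuals` is a hypothetical helper, and the demonstration uses the built-in `mtcars` data rather than the WHO data.

```r
# Hypothetical helper: two rough numeric checks on the residuals of a fitted lm
check_residuals <- function(model) {
  res <- residuals(model)
  fit <- fitted(model)
  list(
    # Shapiro-Wilk test: a small p-value suggests non-normal residuals
    shapiro_p = shapiro.test(res)$p.value,
    # Correlation of |residuals| with fitted values: values far from 0
    # hint at non-constant variance
    spread_cor = cor(abs(res), fit)
  )
}

# Demonstration on a built-in dataset (not the WHO data)
demo_fit <- lm(mpg ~ wt, data = mtcars)
checks <- check_residuals(demo_fit)
print(checks)
```

Calling `check_residuals(lregression)` would apply the same checks to the model fitted above; a plot of `fitted(lregression)` against `residuals(lregression)` remains the more informative view of curvature.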
Raise life expectancy to the 4.6 power (i.e., \(LifeExp^{4.6}\)). Raise total expenditures to the 0.06 power (nearly a log transform, \(TotExp^{0.06}\)). Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{0.06}\), and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, \(R^2\), standard error, and p-values. Which model is “better”?
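One way to see why a small power behaves almost like a logarithm is the Box-Cox limit, written here with \(x\) standing for \(TotExp\) and \(\lambda\) for the exponent (both symbols introduced only for this note):

\[
\frac{x^{\lambda} - 1}{\lambda} \;\longrightarrow\; \ln x \quad \text{as } \lambda \to 0,
\qquad\text{so}\qquad
x^{0.06} = 1 + 0.06\,\frac{x^{0.06} - 1}{0.06} \approx 1 + 0.06 \ln x .
\]

The approximation is loose for very large expenditures, but it suggests that regressing on \(TotExp^{0.06}\) should behave much like regressing on \(\ln TotExp\).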
LifeExp_new <- data$LifeExp**4.6
TotExp_new <- data$TotExp**0.06
ggplot(data, aes(x = TotExp_new, y = LifeExp_new)) +
geom_point(size = 3, alpha = .4) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Total Expenditures to the 0.06 power", y = "Life Expectancy to the 4.6 power")
reg<-lm(LifeExp_new~TotExp_new, data = data)
(s<-summary(reg))
##
## Call:
## lm(formula = LifeExp_new ~ TotExp_new, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_new 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
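# The overall p-value is not stored in the summary object; compute it from the F statistic
f <- s$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
cat(sprintf("%s = %s\n",c(" Residual standard error","R-squared","F-statistic","p-value"),c(sprintf("%f",c(s$sigma,s$r.squared,f[["value"]])),format.pval(p_value,digits=4))))
## Residual standard error = 90492392.574165
## R-squared = 0.729767
## F-statistic = 507.696705
## p-value = < 2.2e-16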
hist(reg$resid,main="Histogram of Residuals")
qqnorm(reg$resid)
qqline(reg$resid)
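Using the results from the transformed model, forecast life expectancy when \(TotExp^{0.06} = 1.5\). Then forecast life expectancy when \(TotExp^{0.06} = 2.5\).
forecast <- function(a) {
# Predict LifeExp^4.6 from the fitted coefficients, then undo the 4.6 power to get years
b <- unname(coef(reg))
return((b[1] + b[2] * a)^(1/4.6))
}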
cat(sprintf("%s = %f years\n",c(" If TotExp^.06 =1.5 then LifeExp","If TotExp^.06 =2.5 then LifeExp"),c(forecast(1.5),forecast(2.5))))
## If TotExp^.06 =1.5 then LifeExp = 63.311533 years
## If TotExp^.06 =2.5 then LifeExp = 86.506448 years
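The forecasts above can be checked by hand from the printed coefficients. This sketch hard-codes the intercept and slope as they appear in the summary output (rounded to whole numbers there), and `forecast_years` is a hypothetical helper name.

```r
# Coefficients copied from the printed summary of the transformed model
intercept <- -736527910
slope <- 620060216

# Hypothetical helper: predict LifeExp^4.6, then undo the power transform
forecast_years <- function(tot_exp_06) {
  (intercept + slope * tot_exp_06)^(1 / 4.6)
}

f15 <- forecast_years(1.5)
f25 <- forecast_years(2.5)
cat(sprintf("%.2f years, %.2f years\n", f15, f25))
```

The back-transformation step matters: the regression predicts \(LifeExp^{4.6}\), so the fitted value must be raised to the power \(1/4.6\) before it can be read in years.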
Build the following multiple regression model and interpret the F statistic, \(R^2\), standard error, and p-values. How good is the model? \(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
LifeExp_lm <- lm(LifeExp~PropMD+TotExp+PropMD*TotExp, data = data)
(s<-summary(LifeExp_lm))
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
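# The overall p-value is not stored in the summary object; compute it from the F statistic
f <- s$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
cat(sprintf("%s = %s\n",c(" Residual standard error","R-squared","F-statistic","p-value"),c(sprintf("%f",c(s$sigma,s$r.squared,f[["value"]])),format.pval(p_value,digits=4))))
## Residual standard error = 8.765493
## R-squared = 0.357435
## F-statistic = 34.488327
## p-value = < 2.2e-16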
hist(LifeExp_lm$resid,main="Histogram of Residuals")
qqnorm(LifeExp_lm$resid)
qqline(LifeExp_lm$resid)
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
PropMD <- 0.03
TotExp <- 14
# Coefficient estimates in model order: intercept, PropMD, TotExp, PropMD:TotExp
b <- unname(coef(LifeExp_lm))
b0 <- b[1]
b1 <- b[2]
b2 <- b[3]
b3 <- b[4]
(LifeExp <- b0 + b1 * PropMD + b2 * TotExp +b3 * PropMD * TotExp)
## [1] 107.696
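To judge whether the forecast is realistic, it helps to see how much each term contributes. The sketch below uses the rounded coefficients from the printed summary, so its total differs slightly from the value above; `contrib`, `prop_md`, `tot_exp`, and `total_years` are names introduced only for this illustration.

```r
# Coefficients copied (rounded) from the printed summary above
b0 <- 6.277e+01   # intercept
b1 <- 1.497e+03   # PropMD
b2 <- 7.233e-05   # TotExp
b3 <- -6.026e-03  # PropMD:TotExp

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contrib <- c(
  intercept   = b0,
  PropMD      = b1 * prop_md,
  TotExp      = b2 * tot_exp,
  interaction = b3 * prop_md * tot_exp
)
print(round(contrib, 3))

total_years <- sum(contrib)
cat(sprintf("Forecast: %.1f years\n", total_years))
```

Almost 45 of the forecast years come from the PropMD term alone, while the expenditure terms contribute close to nothing. A national life expectancy of about 108 years is well beyond anything observed, which suggests the combination of a very high physician share with very low spending is an extrapolation the model cannot support.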