Assignment 12 - The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country:  name of the country
LifeExp:  average life expectancy for the country in years
InfantSurvival:  proportion of those surviving to one year or more
Under5Survival:  proportion of those surviving to five years or more 
TBFree:  proportion of the population without TB.
PropMD:  proportion of the population who are MDs
PropRN:  proportion of the population who are RNs
PersExp:  mean personal expenditures on healthcare in US dollars at average exchange rate 
GovtExp:  mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp:  sum of personal and government expenditures. 
library(knitr)
library(ggplot2)
data<-read.csv("https://raw.githubusercontent.com/hovig/MSDS_CUNY/master/DATA605/who.csv")

Question 1

Provide a scatterplot of \(LifeExp \sim TotExp\), and run simple linear regression. Do not transform the variables. Provide and interpret the F-statistic, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

ggplot(data, aes(x = TotExp, y = LifeExp)) + 
  geom_point(size = 3, alpha = .4) +
  labs(x = "Total Expenditures", y = "Life Expectancy") 

(lregression <- lm(data$LifeExp~data$TotExp, data = data))
## 
## Call:
## lm(formula = data$LifeExp ~ data$TotExp, data = data)
## 
## Coefficients:
## (Intercept)  data$TotExp  
##   6.475e+01    6.297e-05
(s<-summary(lregression))
## 
## Call:
## lm(formula = data$LifeExp ~ data$TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## data$TotExp 6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
# Index the summary by name: position 11 of a summary.lm object is cov.unscaled, not a p-value.
# The model p-value is the one on the F-statistic line above (7.714e-14).
cat(sprintf("%s = %f\n",c(" Residual standard error","R-squared","F-statistic"),c(s$sigma,s$r.squared,s$fstatistic[["value"]])))
##  Residual standard error = 9.371033
##  R-squared = 0.257692
##  F-statistic = 65.264198
hist(lregression$resid,main="Histogram of Residuals")

qqnorm(lregression$resid)
qqline(lregression$resid)

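  • The model is statistically significant (F = 65.26 on 1 and 188 degrees of freedom, p-value = 7.714e-14), but it fits poorly: \(R^2\) = 0.2577, so total expenditures explain only about 26% of the variation in life expectancy.
  • The residual standard error is 9.371, so observed life expectancy typically differs from the fitted line by roughly 9 years, which is large for this variable. The standard error of the slope (7.795e-06) is small relative to its estimate (6.297e-05), giving a t value of 8.079.
  • The histogram of residuals looks left-skewed and bimodal, so the normality assumption is not met. The residual summary points the same way (minimum -24.764, median 3.154, maximum 13.292). The relationship also does not look linear.
  • The points on the Q-Q plot do not follow the reference line closely. A formal normality test would give a more accurate check of how far the skewed residuals depart from a normal distribution. Overall, the assumptions of simple linear regression are not met.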
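The overall p-value reported by `summary()` can be recovered from the F-statistic and its degrees of freedom. The sketch below is a minimal check that copies the rounded values printed in the output above (F = 65.26 on 1 and 188 DF, slope t value = 8.079) instead of refitting the model, so the result agrees with R's full-precision value only approximately. The variable names are illustrative.

```r
# Values copied (rounded) from the summary() output above
f_value <- 65.26
num_df  <- 1
den_df  <- 188

# The model p-value is the upper-tail probability of the F distribution
p_value <- pf(f_value, num_df, den_df, lower.tail = FALSE)
print(p_value)

# With a single predictor, F is the square of the slope's t value
t_value <- 8.079
print(all.equal(f_value, t_value^2, tolerance = 1e-3))
```

On a fitted model, the same three numbers are available by name as `s$fstatistic`.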

Question 2

Raise life expectancy to the 4.6 power (i.e., \(LifeExp^{4.6}\)). Raise total expenditures to the 0.06 power (nearly a log transform, \(TotExp^{0.06}\)). Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{0.06}\), and re-run the simple regression model using the transformed variables. Provide and interpret the F-statistic, \(R^2\), standard error, and p-values. Which model is “better?”

LifeExp_new <- data$LifeExp^4.6
TotExp_new <- data$TotExp^0.06
ggplot(data, aes(x = TotExp_new, y = LifeExp_new)) + 
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Total Expenditures to the 0.06 power", y = "Life Expectancy to the 4.6 power") 

reg<-lm(LifeExp_new~TotExp_new, data = data)
(s<-summary(reg))
## 
## Call:
## lm(formula = LifeExp_new ~ TotExp_new, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_new   620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
# Index the summary by name: position 11 of a summary.lm object is cov.unscaled, not a p-value.
# The model p-value is the one on the F-statistic line above (< 2.2e-16).
cat(sprintf("%s = %f\n",c(" Residual standard error","R-squared","F-statistic"),c(s$sigma,s$r.squared,s$fstatistic[["value"]])))
##  Residual standard error = 90492392.574165
##  R-squared = 0.729767
##  F-statistic = 507.696705
hist(reg$resid,main="Histogram of Residuals")

qqnorm(reg$resid)
qqline(reg$resid)

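  • The model is statistically significant: F = 507.7 on 1 and 188 degrees of freedom, with a p-value below 2.2e-16.
  • \(R^2\) = 0.7298, so the transformed model explains about 73% of the variation in \(LifeExp^{4.6}\), compared with about 26% for the untransformed model. The adjusted \(R^2\) (0.7283) is nearly identical because there is only one predictor.
  • The residual standard error (90490000) looks very high, but it is measured in units of \(LifeExp^{4.6}\) (for scale, \(70^{4.6} \approx 3.1 \times 10^8\)), so it cannot be compared directly with the 9.371 years of the untransformed model.
  • The histogram of residuals looks much closer to a normal distribution. This is still a simple regression with one predictor, and it is the better of the two models.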

Question 3

Using the results from Question 2, forecast life expectancy when \(TotExp^{0.06} = 1.5\). Then forecast life expectancy when \(TotExp^{0.06} = 2.5\).

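# Predict LifeExp^4.6 from the Question 2 model, then undo the 4.6 power
forecast <- function(a) {
  b <- coef(reg)  # b[1] = intercept, b[2] = slope on TotExp^0.06
  return(unname((b[1] + b[2] * a)^(1/4.6)))
}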
cat(sprintf("%s = %f years\n",c(" If TotExp^.06 =1.5 then LifeExp","If TotExp^.06 =2.5 then LifeExp"),c(forecast(1.5),forecast(2.5))))
##  If TotExp^.06 =1.5 then LifeExp = 63.311533 years
##  If TotExp^.06 =2.5 then LifeExp = 86.506448 years
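
Both forecasts can be checked by hand. The worked version below uses the coefficients as printed in the Question 2 summary, so the last digits may differ slightly from R's full-precision result:

\[ \widehat{LifeExp^{4.6}} = -736527910 + 620060216 \times TotExp^{0.06} \]

\[ TotExp^{0.06} = 1.5: \quad -736527910 + 620060216 \times 1.5 = 193562414, \qquad 193562414^{1/4.6} \approx 63.31 \]

\[ TotExp^{0.06} = 2.5: \quad -736527910 + 620060216 \times 2.5 = 813622630, \qquad 813622630^{1/4.6} \approx 86.51 \]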

Question 4

Build the following multiple regression model and interpret the F-statistic, \(R^2\), standard error, and p-values. How good is the model? \(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)

LifeExp_lm <- lm(LifeExp~PropMD+TotExp+PropMD*TotExp, data = data)
(s<-summary(LifeExp_lm))
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
# Index the summary by name: position 11 of a summary.lm object is cov.unscaled, not a p-value.
# The model p-value is the one on the F-statistic line above (< 2.2e-16).
cat(sprintf("%s = %f\n",c(" Residual standard error","R-squared","F-statistic"),c(s$sigma,s$r.squared,s$fstatistic[["value"]])))
##  Residual standard error = 8.765493
##  R-squared = 0.357435
##  F-statistic = 34.488327
hist(LifeExp_lm$resid,main="Histogram of Residuals")

qqnorm(LifeExp_lm$resid)
qqline(LifeExp_lm$resid)

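  • The model is statistically significant (F = 34.49 on 3 and 186 degrees of freedom, p-value below 2.2e-16), and every coefficient, including the interaction term, has a p-value below 0.001.
  • The fit is still weak: \(R^2\) = 0.3574 (adjusted 0.3471), so the model explains only about 36% of the variation in life expectancy. The residual standard error is 8.765 years, only slightly lower than the 9.371 years of the simple model in Question 1.
  • The histogram of residuals looks left-skewed and bimodal, so the normality assumption is not met.
  • The points on the Q-Q plot do not follow the reference line closely. A formal normality test would give a more accurate check of how far the skewed residuals depart from a normal distribution.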
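One common choice for the statistical test mentioned above is the Shapiro-Wilk test, available in base R as `shapiro.test()`. The sketch below runs it on synthetic left-skewed values, not on the who.csv residuals, purely to show the call and how to read the result. The sample size, seed, and variable names are assumptions made for illustration.

```r
set.seed(605)  # arbitrary seed so the sketch is reproducible

# Synthetic left-skewed "residuals" (NOT the model's residuals):
# a centred exponential sample with its sign flipped
fake_resid <- -(rexp(190, rate = 1 / 9) - 9)

# Null hypothesis: the sample comes from a normal distribution
sw <- shapiro.test(fake_resid)
print(sw$p.value)

# A small p-value (for example below 0.05) rejects normality
print(sw$p.value < 0.05)
```

On the fitted model the equivalent call would be `shapiro.test(resid(LifeExp_lm))`.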

Question 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

PropMD <- 0.03  
TotExp <- 14

# Estimated coefficients, extracted by name from the fitted model
b <- coef(LifeExp_lm)
b0 <- b[["(Intercept)"]]
b1 <- b[["PropMD"]]
b2 <- b[["TotExp"]]
b3 <- b[["PropMD:TotExp"]]
  
(LifeExp <- b0 + b1 * PropMD + b2 * TotExp  +b3 * PropMD * TotExp)
## [1] 107.696
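  • A forecast of about 107.7 years is not realistic. It is far above the 86.5 years forecast in Question 3 for high spending, and no country has an average life expectancy that high. Almost all of the excess comes from the PropMD term, which adds \(1497 \times 0.03 \approx 44.9\) years to the intercept of 62.77. A PropMD of 0.03 combined with a TotExp of only 14 probably lies outside the bulk of the observed data, so the model is extrapolating and should not be trusted for this forecast.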
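
To see where the 107.7 years come from, the forecast can be split into the contribution of each term. This is a minimal sketch that uses the coefficients as rounded in the printed summary, so the total differs slightly from the full-precision 107.696. The names `b`, `prop_md`, `tot_exp`, and `contrib` are illustrative.

```r
# Coefficients rounded as printed in the summary() output above
b <- c(intercept = 62.77, PropMD = 1497,
       TotExp = 7.233e-05, interaction = -6.026e-03)
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
contrib <- c(
  intercept   = b[["intercept"]],
  PropMD      = b[["PropMD"]] * prop_md,
  TotExp      = b[["TotExp"]] * tot_exp,
  interaction = b[["interaction"]] * prop_md * tot_exp
)
print(round(contrib, 3))
print(sum(contrib))
```

The TotExp and interaction terms are negligible at this point, so the forecast is essentially the intercept plus the PropMD term.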