Data

real-world data from 2008. The variables included follow:

  • Country: name of the country
  • LifeExp: average life expectancy for the country in years
  • InfantSurvival: proportion of those surviving to one year or more
  • Under5Survival: proportion of those surviving to five years or more
  • TBFree: proportion of the population without TB.
  • PropMD: proportion of the population who are MDs
  • PropRN: proportion of the population who are RNs
  • PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
  • GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
  • TotExp: sum of personal and government expenditures.
who <- read_csv("who.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   LifeExp = col_double(),
##   InfantSurvival = col_double(),
##   Under5Survival = col_double(),
##   TBFree = col_double(),
##   PropMD = col_double(),
##   PropRN = col_double(),
##   PersExp = col_double(),
##   GovtExp = col_double(),
##   TotExp = col_double()
## )
head(who)

Question 1

  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
who %>% ggplot(mapping = aes(x=TotExp, y = jitter(LifeExp))) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) +  
  scale_y_continuous( labels = scales::number)+ ggtitle("LifeExp~TotExp") +
  ylab("LifeExp")

#simple linear regression

lm_slt<- lm( LifeExp~TotExp,who)

summary(lm_slt)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
par(mfrow=c(2,2))
plot(lm_slt)

par(mfrow=c(2,2))
hist(lm_slt$residuals, main = "Histogram of Residuals", xlab= "")
plot(lm_slt$residuals, fitted(lm_slt))

The Plot of Adjust vs Residuals are seems to be following some pattern and making this not meeting the conditon of normal distrubited Residuals. Adjusted R^2 is just 25% , which is not so much encourging as we are only able to explain 25% of the variation within the data The p-value is less than the significance level (usually 0.05) i.e. 7.714e-14, which indicate that our model fit the data well.

Question 2

  1. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
head(who)
who$LifeExp_p <- who$LifeExp^4.6
who$TotExp_p <- who$TotExp^.06

who %>% ggplot(mapping = aes(x=TotExp_p, y = (LifeExp_p))) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) +  
  scale_y_continuous( labels = scales::number)+ ggtitle("LifeExp~TotExp") +
  ylab("LifeExp")

#simple linear regression

lm_sltp<- lm( LifeExp_p~TotExp_p,who)

summary(lm_sltp)
## 
## Call:
## lm(formula = LifeExp_p ~ TotExp_p, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_p     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_sltp)

par(mfrow=c(2,2))
hist(lm_sltp$residuals, main = "Histogram of Residuals", xlab= "")
plot(lm_sltp$residuals, fitted(lm_sltp))

Surprised! the new transformation with Power gave us the very good result almost 72% of variance can be exmaplined in the data, and P-vlaue is still les than .05, and hence very good fit for our dataset. Since LifeExp_p is positive powered and very large number, we have very high standard error with 188 degree of freedom.

Question 3

  1. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
test_dt= data.frame(TotExp_p = 1.5)
test_result <- predict(lm_sltp,test_dt)
# logb(29305338,4.6)*10

print(paste('life expectancy when TotExp^.06 =1.5. is :->',logb(test_result,4.6)))
## [1] "life expectancy when TotExp^.06 =1.5. is :-> 12.5035429738939"

Question 4

  1. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

lm_slt_m <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp),who)

summary(lm_slt_m)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_slt_m)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

+ All the Coefficients are looking significant + Adjusted R-squared is just 34% which says that we are only able to explain 34 % variations of the data. + P-value is well below .05 and indicating a well suited for the data.

Overall this model may not explain all the data points , as we can see from the “Residuals vs Leverage” plot that we have few important data points which are pulling the model to some extreme result and they are impacting the overall model output.

Question 5

  1. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
test_dt2= data.frame(PropMD=.03 ,TotExp = 14)
test_result <- predict(lm_slt_m,test_dt2)
# logb(29305338,4.6)*10

print(paste('LifeExp when PropMD=.03 and TotExp = 14. is :->',test_result))
## [1] "LifeExp when PropMD=.03 and TotExp = 14. is :-> 107.696003708062"
(who[which(who$PropMD>.029 | who$TotExp < 15),])

Based on data points , with who\(PropMD>.029 | who\)TotExp < 15 , it looks like 107 years of Life expectancy may be too much here, since sum of personal and government expenditures is less and number of MDs are less we don’t expect good life expectancy.