Elina Azrilyan

November 15th, 2019

The data

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country

LifeExp: average life expectancy for the country in years

InfantSurvival: proportion of those surviving to one year or more

Under5Survival: proportion of those surviving to five years or more

TBFree: proportion of the population without TB.

PropMD: proportion of the population who are MDs

PropRN: proportion of the population who are RNs

PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate

GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate

TotExp: sum of personal and government expenditures.

library(knitr)
whodf <- read.csv(file="who.csv", header=TRUE, sep=",")
kable(head(whodf), digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4)))
Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp GovtExp TotExp
Afghanistan 42 0.84 0.74 1 0 0 20 92 112
Albania 71 0.98 0.98 1 0 0 169 3128 3297
Algeria 71 0.97 0.96 1 0 0 108 5184 5292
Andorra 82 1.00 1.00 1 0 0 2589 169725 172314
Angola 41 0.85 0.74 1 0 0 36 1620 1656
Antigua and Barbuda 73 0.99 0.99 1 0 0 503 12543 13046
1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

There are 22 columns in our dataset and there are 463 rows of data.

Let’s examine the relationship between LifeExp and TotExp variables - let’s also add a regression line.

plot(whodf$LifeExp ~ whodf$TotExp, main = "LifeExp vs TotExp", xlab = "Pers and gov expenditures", ylab = "Average life expectancy")
abline(lm(whodf$LifeExp ~ whodf$TotExp), col="red") # regression line (y~x) 

Running simple linear regression

m1 <- lm(LifeExp ~ TotExp, data = whodf)
summary(m1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = whodf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

F-statistic is 65.26 and p-value is close to 0 so there is high likelihood that the model is explaining the data failrly well, however due to the R^2 value - we can conclude that only 25% of the variation can be explained by our data. Standard error is very low. The assumptions of of simple linear regression are met.

qqnorm(m1$residuals)
qqline(m1$residuals)

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
whodf2<-whodf
whodf2$LifeExp<-whodf2$LifeExp^4.6
whodf2$TotExp<-whodf2$TotExp^0.6

kable(head(whodf2), digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4)))
Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp GovtExp TotExp
Afghanistan 29305338 0.84 0.74 1 0 0 20 92 16.96
Albania 327935478 0.98 0.98 1 0 0 169 3128 129.08
Algeria 327935478 0.97 0.96 1 0 0 108 5184 171.46
Andorra 636126841 1.00 1.00 1 0 0 2589 169725 1386.09
Angola 26230450 0.85 0.74 1 0 0 36 1620 85.40
Antigua and Barbuda 372636298 0.99 0.99 1 0 0 503 12543 294.64

Plotting transformed variables:

plot(whodf2$LifeExp ~ whodf2$TotExp, main = "LifeExpTransformed vs TotExpTransformed", xlab = "Pers and gov expenditures", ylab = "Average life expectancy")
abline(lm(whodf2$LifeExp ~ whodf2$TotExp), col="red") # regression line (y~x) 

Re-running regression model with transformed variables

m2 <- lm(LifeExp ~ TotExp, data = whodf2)
summary(m2)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = whodf2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -257351739  -82599957   14030425   93896945  237720335 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 211907647   10234512   20.70   <2e-16 ***
## TotExp         238461      15021   15.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared:  0.5728, Adjusted R-squared:  0.5705 
## F-statistic:   252 on 1 and 188 DF,  p-value: < 2.2e-16

F-statistic is 252 and p-value is 0 so there is high likelihood that the model is explaining the data well, the R^2 value has improved greatly - we can conclude that 57% of the variation can be explained by our data. Standatd error is very high but t-values are pretty high as well. The assumptions of of simple linear regression are met. This model is better that the previous one.

qqnorm(m2$residuals)
qqline(m2$residuals)

3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
#TotExp^.06 =1.5
TExp <- 1.5
LExp <- 238461*TExp + 211907647
round(LExp ^ (1/4.6),1)
## [1] 64.6
#TotExp^.06 =2.5
TExp <- 2.5
LExp <- 238461*TExp + 211907647
round(LExp ^ (1/4.6),1)
## [1] 64.6
4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

m3 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = whodf)
summary(m3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = whodf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

F-statistic is 34.5 and p-value is 0 so there is likelihood that the model is explaining the data fairly well, the R^2 value is telling us that 35% of the variation can be explained by our model. Standatd error is pretty low and t-values are pretty high This model seems to be pretty decent - I would say it is better than the 1st one but not as good as the 2nd one.

qqnorm(m3$residuals)
qqline(m3$residuals)

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
#TotExp^.06 =1.5
TExp <- 14
PrMD <- 0.03
LExp <- 6.277e+01 + 1.497e+03*PrMD + 7.233e-05*TExp -6.026e-03*PrMD*TExp
round(LExp,1)
## [1] 107.7

The forecast doesn’t seem very realistic since humans don’t tend to live that long.