Elina Azrilyan

November 15th, 2019

The data

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country

LifeExp: average life expectancy for the country in years

InfantSurvival: proportion of those surviving to one year or more

Under5Survival: proportion of those surviving to five years or more

TBFree: proportion of the population without TB.

PropMD: proportion of the population who are MDs

PropRN: proportion of the population who are RNs

PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate

GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate

TotExp: sum of personal and government expenditures.

library(knitr)
whodf <- read.csv(file="who.csv", header=TRUE, sep=",")
kable(head(whodf), digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4)))

Country	LifeExp	InfantSurvival	Under5Survival	TBFree	PersExp	GovtExp	TotExp
Afghanistan	42	0.84	0.74	1	20	92	112
Albania	71	0.98	0.98	1	169	3128	3297
Algeria	71	0.97	0.96	1	108	5184	5292
Andorra	82	1.00	1.00	1	2589	169725	172314
Angola	41	0.85	0.74	1	36	1620	1656
Antigua and Barbuda	73	0.99	0.99	1	503	12543	13046

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

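The dataset has 10 columns (the variables listed above). The regression below reports 188 residual degrees of freedom, which with 2 estimated coefficients corresponds to 190 rows of data.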
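The row and column counts can be confirmed directly in R rather than counted by hand. A minimal sketch, using a small hypothetical data frame written to a temporary file in place of who.csv (the name `toy` and its values are assumptions for illustration only):

```r
# Hypothetical stand-in for who.csv so the example runs on its own
toy <- data.frame(Country = c("A", "B", "C"),
                  LifeExp = c(42, 71, 82),
                  TotExp  = c(112, 3297, 172314))
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

dat <- read.csv(path, header = TRUE)
dims <- dim(dat)                 # number of rows, then number of columns
missing <- colSums(is.na(dat))   # count of missing values in each column
dims
missing
```

Running `dim(whodf)` after the read.csv call above would give the same information for the real data.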

Let’s examine the relationship between the LifeExp and TotExp variables and add a regression line.

plot(whodf$LifeExp ~ whodf$TotExp, main = "LifeExp vs TotExp", xlab = "Pers and gov expenditures", ylab = "Average life expectancy")
abline(lm(whodf$LifeExp ~ whodf$TotExp), col="red") # regression line (y~x)

Running simple linear regression

m1 <- lm(LifeExp ~ TotExp, data = whodf)
summary(m1)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = whodf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

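The F-statistic is 65.26 with a p-value of 7.71e-14, so the relationship between TotExp and LifeExp is statistically significant. However, R^2 is 0.2577, so the model explains only about 26% of the variation in life expectancy. The residual standard error is 9.371, meaning a typical prediction error of about 9 years, which is not small. The standard error of the TotExp coefficient (7.795e-06) is small relative to its estimate (6.297e-05), giving a t-value of 8.079. The assumptions of simple linear regression do not appear to be met: the residuals are skewed (median 3.154, minimum -24.764, maximum 13.292), which suggests they are not normally distributed and that the relationship is not linear.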

qqnorm(m1$residuals)
qqline(m1$residuals)
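A Q-Q plot mainly addresses the normality assumption. A minimal sketch of two complementary numeric checks, a Shapiro-Wilk test on the residuals and the correlation between absolute residuals and fitted values, is shown here on simulated data so that it runs on its own. The helper name `check_residuals` and the simulated variables are assumptions, not part of the original analysis.

```r
# Hypothetical helper: numeric residual checks for a fitted lm object
check_residuals <- function(model) {
  res <- residuals(model)
  list(
    shapiro_p  = shapiro.test(res)$p.value,     # a small p-value suggests non-normal residuals
    spread_cor = cor(abs(res), fitted(model))   # far from 0 suggests non-constant variance
  )
}

set.seed(42)
sim <- data.frame(x = runif(190, 0, 10))
sim$y <- 60 + 2 * sim$x + rnorm(190, sd = 3)    # simulated, well-behaved data
fit <- lm(y ~ x, data = sim)

checks <- check_residuals(fit)
checks
```

Calling `check_residuals(m1)` would apply the same checks to the WHO model, and `plot(fitted(m1), residuals(m1))` would show the residual pattern visually.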

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

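whodf2 <- whodf
whodf2$LifeExp <- whodf2$LifeExp^4.6
# Note: the task specifies TotExp^0.06; the exponent used here is 0.6,
# and the table and regression output below were produced with 0.6
whodf2$TotExp <- whodf2$TotExp^0.6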

kable(head(whodf2), digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4)))

Country	LifeExp	InfantSurvival	Under5Survival	TBFree	PersExp	GovtExp	TotExp
Afghanistan	29305338	0.84	0.74	1	20	92	16.96
Albania	327935478	0.98	0.98	1	169	3128	129.08
Algeria	327935478	0.97	0.96	1	108	5184	171.46
Andorra	636126841	1.00	1.00	1	2589	169725	1386.09
Angola	26230450	0.85	0.74	1	36	1620	85.40
Antigua and Barbuda	372636298	0.99	0.99	1	503	12543	294.64

Plotting transformed variables:

plot(whodf2$LifeExp ~ whodf2$TotExp, main = "LifeExpTransformed vs TotExpTransformed", xlab = "Pers and gov expenditures", ylab = "Average life expectancy")
abline(lm(whodf2$LifeExp ~ whodf2$TotExp), col="red") # regression line (y~x)

Re-running regression model with transformed variables

m2 <- lm(LifeExp ~ TotExp, data = whodf2)
summary(m2)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = whodf2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -257351739  -82599957   14030425   93896945  237720335 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 211907647   10234512   20.70   <2e-16 ***
## TotExp         238461      15021   15.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared:  0.5728, Adjusted R-squared:  0.5705 
## F-statistic:   252 on 1 and 188 DF,  p-value: < 2.2e-16

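The F-statistic is 252 with a p-value below 2.2e-16, so the model is statistically significant. R^2 has improved from 0.2577 to 0.5728: about 57% of the variation in the transformed response is explained by the model. The residual standard error (113,800,000) looks very high, but it is measured on the LifeExp^4.6 scale and can't be compared directly with the 9.371 of the first model. Both coefficients have small standard errors relative to their estimates (t-values of 20.70 and 15.88). The residuals are also more symmetric than before (median 14,030,425 against a range of -257,351,739 to 237,720,335), so the assumptions of simple linear regression look closer to being met. By R^2 and F-statistic, this model is better than the previous one.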

qqnorm(m2$residuals)
qqline(m2$residuals)
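One way to compare a raw and a transformed model in the same units is to back-transform the fitted values of the transformed model and compute the error in the units of the original response for both. A minimal sketch on simulated data follows. The data-generating values and the `rmse` helper are assumptions for illustration, and the data are built to be linear on the transformed scale.

```r
set.seed(1)
# Simulated data that is linear on the transformed scale (illustrative values only)
x <- exp(runif(200, log(10), log(1e5)))
y <- (1e8 + 3e8 * x^0.06 + rnorm(200, sd = 1e7))^(1 / 4.6)
sim <- data.frame(x = x, y = y)

raw_fit <- lm(y ~ x, data = sim)                  # untransformed model
tr_fit  <- lm(I(y^4.6) ~ I(x^0.06), data = sim)   # transformed model

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Back-transform the fitted values so both errors are in the units of y
errs <- c(raw         = rmse(sim$y, fitted(raw_fit)),
          transformed = rmse(sim$y, fitted(tr_fit)^(1 / 4.6)))
errs
```

The same comparison on the WHO data would use `fitted(m2)^(1/4.6)` against `whodf$LifeExp`.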

3. Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

#TotExp^.06 =1.5
TExp <- 1.5
LExp <- 238461*TExp + 211907647
round(LExp ^ (1/4.6),1)

## [1] 64.6

#TotExp^.06 =2.5
TExp <- 2.5
LExp <- 238461*TExp + 211907647
round(LExp ^ (1/4.6),1)

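## [1] 64.6

The two forecasts round to the same value because the inputs 1.5 and 2.5 are on the TotExp^.06 scale, while the predictor in m2 was computed as TotExp^0.6 and takes values from tens to over a thousand. A one-unit change in that predictor moves LifeExp^4.6 by only 238,461 against an intercept of 211,907,647.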
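A forecast of this kind can also be obtained with predict() instead of copying coefficients by hand, which avoids rounding and keeps the transformation in one place. A minimal sketch on simulated data (the variable names `tot_t` and `life_t` and the generating values are hypothetical):

```r
set.seed(7)
sim <- data.frame(tot = exp(runif(100, log(10), log(1e5))))
sim$tot_t  <- sim$tot^0.06                                   # transformed predictor
sim$life_t <- 1e8 + 3e8 * sim$tot_t + rnorm(100, sd = 1e7)   # transformed response

fit <- lm(life_t ~ tot_t, data = sim)

new <- data.frame(tot_t = c(1.5, 2.5))
pred_t <- predict(fit, newdata = new)   # forecasts on the transformed scale
pred   <- pred_t^(1 / 4.6)              # back-transform to the original scale

# Hand calculation from the coefficients, for comparison
by_hand <- (coef(fit)[1] + coef(fit)[2] * new$tot_t)^(1 / 4.6)
round(cbind(predict = pred, by_hand = by_hand), 1)
```

The values passed in `newdata` must be on the same scale as the predictor the model was fitted with.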

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

m3 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = whodf)
summary(m3)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = whodf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The F-statistic is 34.49 with a p-value below 2.2e-16, so the model as a whole is statistically significant, and all three terms have p-values well below 0.05. R^2 is 0.3574 (adjusted 0.3471), so the model explains about 36% of the variation in life expectancy. The residual standard error is 8.765 years, only slightly lower than the 9.371 years of the 1st model, and the coefficient t-values are fairly high. This model seems to be pretty decent: by R^2, I would say it is better than the 1st one but not as good as the 2nd one.

qqnorm(m3$residuals)
qqline(m3$residuals)
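The F-statistic and R^2 reported by summary() are two views of the same quantity: F = (R^2 / k) / ((1 - R^2) / df), where k is the number of predictors and df is the residual degrees of freedom. As a quick arithmetic check, the sketch below recomputes F for the three models from the R^2 values printed in the outputs above (the helper name `f_from_r2` is hypothetical):

```r
# Hypothetical helper: overall F-statistic implied by R^2
f_from_r2 <- function(r2, k, df_resid) (r2 / k) / ((1 - r2) / df_resid)

f_vals <- c(m1 = f_from_r2(0.2577, 1, 188),
            m2 = f_from_r2(0.5728, 1, 188),
            m3 = f_from_r2(0.3574, 3, 186))
round(f_vals, 1)
```

The results should agree with the reported F-statistics (65.26, 252 and 34.49) up to the rounding of R^2.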

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# PropMD = 0.03, TotExp = 14
TExp <- 14
PrMD <- 0.03
LExp <- 6.277e+01 + 1.497e+03*PrMD + 7.233e-05*TExp -6.026e-03*PrMD*TExp
round(LExp,1)

## [1] 107.7

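The forecast of 107.7 years doesn’t seem realistic: humans don’t tend to live that long, and no country has an average life expectancy anywhere near that value. The inputs are also an unusual combination, with 3% of the population being MDs alongside total healthcare expenditures of only $14, so the model is most likely extrapolating far outside the data it was fitted on.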
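A forecast is only as trustworthy as the data near the query point. A minimal sketch of a range check that flags extrapolation before predicting, shown on a small hypothetical data frame (the helper `outside_range` and the `toy` values are assumptions, not the WHO data):

```r
# Hypothetical helper: TRUE for each predictor in `new` that lies
# outside the range of values seen in `data`
outside_range <- function(data, new) {
  sapply(names(new), function(v) {
    new[[v]] < min(data[[v]]) || new[[v]] > max(data[[v]])
  })
}

# Toy values for illustration only
toy <- data.frame(PropMD = c(0.0002, 0.001, 0.003, 0.004),
                  TotExp = c(100, 3000, 50000, 170000))
new <- data.frame(PropMD = 0.03, TotExp = 14)

flags <- outside_range(toy, new)
flags
```

With the real data, `outside_range(whodf[c("PropMD", "TotExp")], new)` would show whether either input falls outside the observed values.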

Homework 12