The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.
library(knitr)
whodf <- read.csv(file="who.csv", header=TRUE, sep=",")
kable(head(whodf), digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4)))
Country | LifeExp | InfantSurvival | Under5Survival | TBFree | PropMD | PropRN | PersExp | GovtExp | TotExp |
---|---|---|---|---|---|---|---|---|---|
Afghanistan | 42 | 0.84 | 0.74 | 1 | 0 | 0 | 20 | 92 | 112 |
Albania | 71 | 0.98 | 0.98 | 1 | 0 | 0 | 169 | 3128 | 3297 |
Algeria | 71 | 0.97 | 0.96 | 1 | 0 | 0 | 108 | 5184 | 5292 |
Andorra | 82 | 1.00 | 1.00 | 1 | 0 | 0 | 2589 | 169725 | 172314 |
Angola | 41 | 0.85 | 0.74 | 1 | 0 | 0 | 36 | 1620 | 1656 |
Antigua and Barbuda | 73 | 0.99 | 0.99 | 1 | 0 | 0 | 503 | 12543 | 13046 |
There are 10 columns in our dataset and 190 rows of data.
Let’s examine the relationship between the LifeExp and TotExp variables and add a regression line.
plot(whodf$LifeExp ~ whodf$TotExp, main = "LifeExp vs TotExp", xlab = "Pers and gov expenditures", ylab = "Average life expectancy")
abline(lm(whodf$LifeExp ~ whodf$TotExp), col="red") # regression line (y~x)
m1 <- lm(LifeExp ~ TotExp, data = whodf)
summary(m1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = whodf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
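The F-statistic is 65.26 and the p-value (7.714e-14) is close to 0, so the relationship between TotExp and LifeExp is statistically significant. However, R^2 is 0.2577, so only about 26% of the variation in LifeExp is explained by the model. The standard error of the slope (7.795e-06) is small relative to its estimate (6.297e-05). The residuals are skewed, though (minimum -24.764, median 3.154, maximum 13.292), so the assumptions of simple linear regression are doubtful here and should be judged from the Q-Q plot below.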
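As a quick consistency check on the summary above, the F-statistic of a regression can be recovered from R-squared and the degrees of freedom, and with a single predictor it equals the square of the slope's t value. The sketch below uses only numbers printed in the summary; the helper name `f_from_r2` is made up for illustration.

```r
# Recover the overall F-statistic from R-squared.
# df1 = number of predictors, df2 = residual degrees of freedom.
f_from_r2 <- function(r2, df1, df2) {
  (r2 / (1 - r2)) * (df2 / df1)
}

# Values copied from summary(m1)
f_m1    <- f_from_r2(r2 = 0.2577, df1 = 1, df2 = 188)
t_slope <- 8.079

print(round(f_m1, 2))       # close to the reported 65.26 (inputs are rounded)
print(round(t_slope^2, 2))  # with one predictor, F equals t squared
```

Small differences in the last digit are expected because the inputs are the rounded values from the printed summary.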
qqnorm(m1$residuals)
qqline(m1$residuals)
whodf2<-whodf
whodf2$LifeExp<-whodf2$LifeExp^4.6
whodf2$TotExp<-whodf2$TotExp^0.6
kable(head(whodf2), digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4)))
Country | LifeExp | InfantSurvival | Under5Survival | TBFree | PropMD | PropRN | PersExp | GovtExp | TotExp |
---|---|---|---|---|---|---|---|---|---|
Afghanistan | 29305338 | 0.84 | 0.74 | 1 | 0 | 0 | 20 | 92 | 16.96 |
Albania | 327935478 | 0.98 | 0.98 | 1 | 0 | 0 | 169 | 3128 | 129.08 |
Algeria | 327935478 | 0.97 | 0.96 | 1 | 0 | 0 | 108 | 5184 | 171.46 |
Andorra | 636126841 | 1.00 | 1.00 | 1 | 0 | 0 | 2589 | 169725 | 1386.09 |
Angola | 26230450 | 0.85 | 0.74 | 1 | 0 | 0 | 36 | 1620 | 85.40 |
Antigua and Barbuda | 372636298 | 0.99 | 0.99 | 1 | 0 | 0 | 503 | 12543 | 294.64 |
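The transformed values in the table above can be reproduced by hand, which also shows how to undo the transformation later when forecasting. This is a minimal sketch that uses only the six rows printed earlier, not the full who.csv file; the vector names are made up for illustration.

```r
# First six countries, values copied from the untransformed table
life_exp <- c(42, 71, 71, 82, 41, 73)
tot_exp  <- c(112, 3297, 5292, 172314, 1656, 13046)

life_t <- life_exp^4.6   # same power as applied to whodf2$LifeExp
tot_t  <- tot_exp^0.6    # same power as applied to whodf2$TotExp

print(round(tot_t, 2))

# Undo the power transform to get back to years
print(life_t^(1 / 4.6))
```

Raising the transformed value to the power 1/4.6 returns the original life expectancy in years.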
Plotting transformed variables:
plot(whodf2$LifeExp ~ whodf2$TotExp, main = "LifeExpTransformed vs TotExpTransformed", xlab = "Pers and gov expenditures ^ 0.6", ylab = "Average life expectancy ^ 4.6")
abline(lm(whodf2$LifeExp ~ whodf2$TotExp), col="red") # regression line (y~x)
Re-running regression model with transformed variables
m2 <- lm(LifeExp ~ TotExp, data = whodf2)
summary(m2)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = whodf2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -257351739 -82599957 14030425 93896945 237720335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211907647 10234512 20.70 <2e-16 ***
## TotExp 238461 15021 15.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared: 0.5728, Adjusted R-squared: 0.5705
## F-statistic: 252 on 1 and 188 DF, p-value: < 2.2e-16
The F-statistic is 252 and the p-value is below 2.2e-16, so the relationship is statistically significant. R^2 has improved greatly: about 57% of the variation in the transformed LifeExp is explained by the model. The residual standard error (113800000) looks very high only because LifeExp^4.6 is on a much larger scale than LifeExp; the t-values (20.70 and 15.88) are high. The residuals are roughly symmetric around zero, so the assumptions of simple linear regression look reasonable, subject to the Q-Q plot below. This model is better than the previous one.
qqnorm(m2$residuals)
qqline(m2$residuals)
# TotExp^0.6 = 1.5 (0.6 is the power used to build whodf2)
TExp <- 1.5
LExp <- 238461*TExp + 211907647
round(LExp ^ (1/4.6),1)
## [1] 64.6
# TotExp^0.6 = 2.5 (0.6 is the power used to build whodf2)
TExp <- 2.5
LExp <- 238461*TExp + 211907647
round(LExp ^ (1/4.6),1)
## [1] 64.6
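Both forecasts round to the same value because a one-unit change in the transformed expenditure moves the prediction by 238461, which is tiny next to the intercept of 211907647. A minimal sketch, using only the coefficients printed in summary(m2), wraps the calculation in a function and shows the unrounded difference; the name `forecast_m2` is hypothetical.

```r
# Coefficients copied from summary(m2)
b0 <- 211907647
b1 <- 238461

# x is the already-transformed expenditure value; the result is in years
forecast_m2 <- function(x) {
  life_t <- b0 + b1 * x   # prediction on the LifeExp^4.6 scale
  life_t^(1 / 4.6)        # back-transform to years
}

print(round(forecast_m2(c(1.5, 2.5)), 1))   # both round to 64.6
print(forecast_m2(2.5) - forecast_m2(1.5))  # small positive difference
```

With a fitted model object available, `predict(m2, newdata = ...)` followed by the same back-transform would be the more usual route.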
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
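In R formula notation, `*` already expands to both main effects plus their interaction, so the equation above can be written in a longer or a shorter form. The sketch below only inspects the formulas with `terms()` and needs no data; the variable names `f_long` and `f_short` are made up for illustration.

```r
# Formula spelled out term by term
f_long  <- LifeExp ~ PropMD + TotExp + PropMD * TotExp
# Shorter spelling: `*` expands to main effects plus interaction
f_short <- LifeExp ~ PropMD * TotExp

labels_long  <- attr(terms(f_long), "term.labels")
labels_short <- attr(terms(f_short), "term.labels")

print(labels_long)
print(identical(labels_long, labels_short))
```

Both spellings produce the same three terms, which correspond to b1, b2 and b3 in the equation.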
m3 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = whodf)
summary(m3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = whodf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The F-statistic is 34.49 and the p-value is below 2.2e-16, so the model is statistically significant. R^2 tells us that about 36% of the variation is explained by the model (adjusted R^2 is 0.3471). The residual standard error (8.765) is slightly lower than in the first model (9.371), and all t-values are large in absolute value. This model seems decent: I would say it is better than the first one but not as good as the second one.
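Because the third model has more predictors than the other two, adjusted R-squared is the fairer number to compare. The sketch below recomputes it from the printed R-squared values. It assumes n = 190 observations, inferred from the residual degrees of freedom plus the number of estimated coefficients, and the helper `adj_r2` is hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 190  # assumption: residual df + estimated coefficients (188 + 2)

models <- data.frame(
  model = c("m1", "m2", "m3"),
  r2    = c(0.2577, 0.5728, 0.3574),  # copied from the three summaries
  p     = c(1, 1, 3)
)
models$adj_r2 <- round(adj_r2(models$r2, n, models$p), 4)
print(models)
```

Note that m2 predicts LifeExp^4.6 rather than LifeExp, so its R-squared is measured on a different response scale and the comparison with m1 and m3 is only indicative.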
qqnorm(m3$residuals)
qqline(m3$residuals)
# Forecast LifeExp when PropMD = 0.03 and TotExp = 14
TExp <- 14
PrMD <- 0.03
LExp <- 6.277e+01 + 1.497e+03*PrMD + 7.233e-05*TExp -6.026e-03*PrMD*TExp
round(LExp,1)
## [1] 107.7
The forecast doesn’t seem very realistic since humans don’t tend to live that long.
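Splitting the forecast into its four terms shows where the unrealistic value comes from. The sketch below uses only the coefficients printed in summary(m3) and the two input values from the forecast; the vector names are made up for illustration.

```r
# Coefficients copied from summary(m3)
b <- c(intercept = 6.277e+01, prop_md = 1.497e+03,
       tot_exp = 7.233e-05, interaction = -6.026e-03)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years
parts <- c(
  intercept   = unname(b["intercept"]),
  prop_md     = unname(b["prop_md"]) * prop_md,
  tot_exp     = unname(b["tot_exp"]) * tot_exp,
  interaction = unname(b["interaction"]) * prop_md * tot_exp
)

print(round(parts, 3))
print(round(sum(parts), 1))  # matches the 107.7 above
```

The PropMD term alone adds about 44.9 years on top of the intercept. In the six rows printed earlier, PropMD rounds to 0 at two decimals, so 0.03 is much larger than any of those values, which suggests the model is being extrapolated.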