Who.CSV - Real World Data from 2008
Question 1
- LifeExp: average life expectancy for the country in years
- InfantSurvival: proportion of those surviving to one year or more
- Under5Survival: proportion of those surviving to five years or more
- TBFree: proportion of the population without TB.
- PropMD: proportion of the population who are MDs
- PropRN: proportion of the population who are RNs
- PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
- GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
- TotExp: sum of personal and government expenditures.
Provide a scatterplot of LifeExp~TotExp, and run a simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
library(RCurl)
library(knitr)
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Week%2012/who.csv'
who <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
kable(head(who))
Country | LifeExp | InfantSurvival | Under5Survival | TBFree | PropMD | PropRN | PersExp | GovtExp | TotExp |
---|---|---|---|---|---|---|---|---|---|
Afghanistan | 42 | 0.835 | 0.743 | 0.99769 | 0.0002288 | 0.0005723 | 20 | 92 | 112 |
Albania | 71 | 0.985 | 0.983 | 0.99974 | 0.0011431 | 0.0046144 | 169 | 3128 | 3297 |
Algeria | 71 | 0.967 | 0.962 | 0.99944 | 0.0010605 | 0.0020914 | 108 | 5184 | 5292 |
Andorra | 82 | 0.997 | 0.996 | 0.99983 | 0.0032973 | 0.0035000 | 2589 | 169725 | 172314 |
Angola | 41 | 0.846 | 0.740 | 0.99656 | 0.0000704 | 0.0011462 | 36 | 1620 | 1656 |
Antigua and Barbuda | 73 | 0.990 | 0.989 | 0.99991 | 0.0001429 | 0.0027738 | 503 | 12543 | 13046 |
attach(who)
who.lm <- lm(LifeExp ~ TotExp)
who.lm
##
## Call:
## lm(formula = LifeExp ~ TotExp)
##
## Coefficients:
## (Intercept) TotExp
## 6.475e+01 6.297e-05
summary(who.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
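Before interpreting the output, here is the requested scatterplot of LifeExp against TotExp with the fitted regression line overlaid (a minimal sketch; the same untransformed scatterplot reappears in the assumption checks below):
# Scatterplot of life expectancy versus total expenditures, with the fitted line
plot(TotExp, LifeExp)
abline(who.lm, col = "red")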
An F-test in regression compares the fits of different linear models. Unlike t-tests that can assess only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously. The F-test of the overall significance is a specific form of the F-test. It compares a model with no predictors to the model that you specify. A regression model that contains no predictors is also known as an intercept-only model.
The hypotheses for the F-test of overall significance are as follows:
- Null hypothesis: The fit of the intercept-only model and your model are equal.
- Alternative hypothesis: The fit of the intercept-only model is significantly reduced compared to your model.
If the P value for the F-test of overall significance is less than your significance level, you can reject the null hypothesis and conclude that your model provides a better fit than the intercept-only model. Typically, if you don’t have any significant P values for the individual coefficients in your model, the overall F-test won’t be significant either.
While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The overall F-test determines whether this relationship is statistically significant. If the P value for the overall F-test is less than your significance level, you can conclude that the R-squared value is significantly different from zero. (Reference: http://blog.minitab.com/blog/adventures-in-statistics-2/what-is-the-f-test-of-overall-significance-in-regression-analysis)
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-analysis-of-variance-anova-and-the-f-test does a great job of describing how an F-statistic produces a p-value.
\[F = \frac{\text{variation between sample means}}{\text{variation within the samples}}\]
When we look up the F value (remembering to check the degrees of freedom) in an F table (or graph), we can read off the corresponding p-value. In this particular example, the F value is 65.26 on 1 and 188 DF, corresponding to a p-value of 7.714 x 10^-14, which is statistically significant.
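As a quick sketch, this correspondence can be reproduced in R, either directly from the F value and its degrees of freedom (values taken from the summary above) or by comparing the fitted model against an intercept-only model:
# Upper-tail probability of an F(1, 188) distribution at the observed F value
pf(65.26, df1 = 1, df2 = 188, lower.tail = FALSE)
# Equivalently, the F-test of overall significance compares against an intercept-only model
anova(lm(LifeExp ~ 1), who.lm)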
As stated above, the R-squared value is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. For example, 0% indicates that the model explains none of the variability of the response data around its mean. Here, we have an adjusted R-squared value of 25.37%.
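As a sketch, R-squared can be recomputed from its definition: one minus the residual sum of squares divided by the total sum of squares.
# Coefficient of determination: 1 - SSE/SST
ss_res <- sum(resid(who.lm)^2)
ss_tot <- sum((LifeExp - mean(LifeExp))^2)
1 - ss_res / ss_tot # should match the Multiple R-squared of 0.2577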
The standard error of the regression is the average distance that the observed values fall from the regression line. Conveniently, it tells you how wrong the regression model is on average, in the units of the response variable. Smaller values are better because they indicate that the observations are closer to the fitted line. (Reference: http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-to-interpret-s-the-standard-error-of-the-regression). Here, the standard error (S) is 9.371, which tells us that the data points fall, on average, about 9.371 years from the fitted line. Approximately 95% of the observations should fall within +/- 2 standard errors of the regression line.
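The residual standard error reported above can likewise be recovered by hand (a sketch; the denominator is the residual degrees of freedom, n - 2):
# Residual standard error: square root of SSE over the residual degrees of freedom
sqrt(sum(resid(who.lm)^2) / df.residual(who.lm)) # approximately 9.371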
There are five key assumptions for linear regression:
- Linear relationship
- Multivariate normality
- No or little multicollinearity
- No auto-correlation
- Homoscedasticity
First, linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects. The linearity assumption can best be tested with scatter plots.
attach(who)
plot(TotExp, LifeExp)
There does NOT appear to be a linear relationship, so it violates the first assumption. Let’s continue with the rest of the assumptions for the exercise.
Secondly, the linear regression analysis requires all variables to be multivariate normal. This assumption can best be checked with a histogram or a Q-Q plot. Normality can be checked with a goodness of fit test. When the data is not normally distributed, a non-linear transformation (e.g., log-transformation) might fix this issue.
hist(resid(who.lm))
qqnorm(resid(who.lm))
qqline(resid(who.lm))
As you can see, there appears to be a violation of the second assumption. The histogram does NOT demonstrate a Gaussian distribution, and the residuals of the linear regression model do not follow the Q-Q line. In fact, there appears to be a pattern, which suggests that another variable may contribute to the dependent variable.
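Since a goodness-of-fit test was mentioned above, one common formal check is a Shapiro-Wilk test on the residuals (a sketch; a small p-value would indicate a departure from normality):
# Shapiro-Wilk normality test on the model residuals
shapiro.test(resid(who.lm))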
Third, linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other. In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. (However, in this case, we have one variable, so we will not be testing for multicollinearity.)
Fourth, linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent from each other. For instance, this typically occurs in stock prices, where the price is not independent of the previous price. (Again, we will disregard this as we only have one independent variable.)
The last assumption of linear regression is homoscedasticity. A scatter plot of the residuals against the fitted values is a good way to check whether the data are homoscedastic (meaning the residuals have equal variance across the regression line).
plot(fitted(who.lm), resid(who.lm))
Again, the spread of the residuals is not constant across the fitted values, so this violates the last assumption.
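A Breusch-Pagan test offers a more formal check of homoscedasticity than eyeballing the plot (a sketch, assuming the lmtest package is installed; a small p-value indicates heteroscedasticity):
# Breusch-Pagan test for non-constant variance of the residuals (requires the lmtest package)
library(lmtest)
bptest(who.lm)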
Question 2
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
attach(who)
LifeExp_4.6 <- LifeExp ** (4.6)
TotExp_.06 <- TotExp ** (.06)
transformed.lm <- lm(LifeExp_4.6 ~ TotExp_.06)
transformed.lm
##
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_.06)
##
## Coefficients:
## (Intercept) TotExp_.06
## -736527909 620060216
summary(transformed.lm)
##
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_.06)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
plot(TotExp_.06, LifeExp_4.6)
abline(transformed.lm, col="red")
From the scatterplot alone, the revised linear regression appears to fit the data much better. The F-statistic is much larger (507.7 on 1 and 188 DF) and produces a highly statistically significant p-value < 2.2e-16. The adjusted R-squared value is 0.7283, which is a significant improvement. The standard error is 90,490,000, which is far larger than in the previous model.
However, according to https://stats.stackexchange.com/questions/172897/huge-difference-in-regression-standard-error-after-log-transformation-of-depende, transforming the dependent variable changes the scale of the standard error, so a direct comparison of the residual standard errors between the two models is meaningless.
Overall, it appears that the transformed model is better.
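As a quick sketch, the same residual diagnostics used in Question 1 can be re-run on the transformed model to confirm the improvement:
# Re-check normality and constant variance of the residuals for the transformed model
hist(resid(transformed.lm))
qqnorm(resid(transformed.lm))
qqline(resid(transformed.lm))
plot(fitted(transformed.lm), resid(transformed.lm))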
Question 3
Using the results from Question 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.
forecasted_TotExp <- function(x){
  ans <- -736527909 + 620060216 * x
  return(ans ^ (1/4.6)) # Must back-transform from the LifeExp^4.6 scale into years
}
print(paste0("What is the forecasted life expectancy when TotExp^.06 = 1.5? ", round(forecasted_TotExp(1.5),2), " years."))
## [1] "What is the forecasted life expectancy when TotExp^.06 = 1.5? 63.31 years."
print(paste0("What is the forecasted life expectancy when TotExp^.06 = 2.5? ", round(forecasted_TotExp(2.5),2), " years."))
## [1] "What is the forecasted life expectancy when TotExp^.06 = 2.5? 86.51 years."
Question 4
Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model?
\[LifeExp = \beta_0 + \beta_1 \cdot PropMD + \beta_2 \cdot TotExp + \beta_3 \cdot PropMD \cdot TotExp\]
attach(who)
PropMD_TotExp <- PropMD * TotExp
revised2.lm <- lm(LifeExp ~ PropMD + TotExp + PropMD_TotExp)
revised2.lm
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD_TotExp)
##
## Coefficients:
## (Intercept) PropMD TotExp PropMD_TotExp
## 6.277e+01 1.497e+03 7.233e-05 -6.026e-03
summary(revised2.lm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD_TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD_TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
This model has an F-statistic of 34.49 with a statistically significant p-value < 2.2e-16, a residual standard error of 8.765, and an adjusted R-squared of 0.3471. This model is not quite as good as the transformed model, but it is better than the original model in the first question. The only improvement over the first model is slightly better precision, with the residual standard error at 8.765. The adjusted R-squared suggests that 34.71% of the variation in the data is explained by this model.
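As an aside, the same model can be specified without constructing the interaction term by hand, since R's formula interface expands PropMD * TotExp into both main effects plus their interaction (a sketch that should reproduce the coefficients above):
# Equivalent specification using R's interaction operator
revised2b.lm <- lm(LifeExp ~ PropMD * TotExp, data = who)
summary(revised2b.lm)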
Question 5
Forecast LifeExp when PropMD = 0.3 and TotExp = 14. Does this forecast seem realistic? Why or why not?
forecast_revised.lm <- function(propmd, totexp){
  # Estimates from the fitted model: [1] intercept, [2] PropMD, [3] TotExp, [4] PropMD_TotExp
  coefs <- summary(revised2.lm)$coefficients
  ans <- coefs[1] + coefs[2] * propmd + coefs[3] * totexp + coefs[4] * propmd * totexp
  return(ans)
}
print(paste0("What is the forecasted life expectancy when PropMD = 0.3 and TotExp = 14? ", round(forecast_revised.lm(0.3,14),2), " years."))
## [1] "What is the forecasted life expectancy when PropMD = 0.3 and TotExp = 14? 512 years."
This value does not make sense, as no one can (yet) live to 512 years old, so this does not seem to be a realistic model. The problem with this model is that we are using the proportion of MDs in a country as well as the total expenditure as independent variables for life expectancy, and we really do not know whether these actually make a true difference in real life. If we consulted an epidemiologist, we could benefit from some domain expertise to see whether including these independent variables truly makes any sense, and if so, whether these transformations make any sense. Just because we have statistical significance for the data does not imply that this is a good model.
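One way to see why the forecast breaks down is to compare the requested inputs with the observed distributions of the predictors; a PropMD of 0.3 (30% of the population being physicians) is far beyond anything observed in the data, so this forecast is an extreme extrapolation (a quick sketch):
# Observed distributions of the predictors, for comparison with PropMD = 0.3 and TotExp = 14
summary(who$PropMD)
summary(who$TotExp)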