CUNY SPS DATA 605 - Assignment 12

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.

Problem 1

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

data <- read.csv("who.csv")
head(data)

attach(data)
plot(TotExp, LifeExp)

data.lm <- lm(LifeExp~TotExp)
summary(data.lm)

## 
## Call:
## lm(formula = LifeExp ~ TotExp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.76  -4.78   3.15   7.12  13.29 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) 64.75337453  0.75353661   85.93 < 0.0000000000000002 ***
## TotExp       0.00006297  0.00000779    8.08    0.000000000000077 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.4 on 188 degrees of freedom
## Multiple R-squared:  0.258,  Adjusted R-squared:  0.254 
## F-statistic: 65.3 on 1 and 188 DF,  p-value: 0.0000000000000771

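The F statistic of 65.3 on 1 and 188 degrees of freedom compares the model against an intercept-only model, and its p-value (about 7.7e-14) lets us reject the hypothesis that the slope is zero. With only one predictor this test is equivalent to the t-test on TotExp (\(F = t^2\)), so it adds nothing beyond the p-value of the slope.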
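The relationship between the F statistic, the slope's t value, and \(R^2\) can be checked by hand. The sketch below uses only numbers copied from the printed summary; the variable names are illustrative.

```r
# Values copied from the summary(data.lm) output above.
t_slope   <- 8.08    # t value for TotExp
r_squared <- 0.258   # Multiple R-squared
df_resid  <- 188     # residual degrees of freedom

# With one predictor, the overall F statistic is the squared slope t value.
f_from_t <- t_slope^2

# Equivalent form: F = (R^2 / 1) / ((1 - R^2) / df_resid).
f_from_r2 <- r_squared / ((1 - r_squared) / df_resid)

round(c(f_from_t = f_from_t, f_from_r2 = f_from_r2), 1)
```

Both values agree with the printed F statistic of 65.3 up to the rounding of the inputs.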

The \(R^2\) value of 0.258 means that this model explains about 26% of the variability in life expectancy, which is not too bad for a single predictor.

According to our textbook, we would typically want the standard error to be “at least five to ten times smaller than the corresponding coefficient”. In this case the standard error for TotExp, 0.00000779, is about 8.08 times smaller than the coefficient, 0.00006297, so the slope is estimated with reasonable precision. The standard error for the intercept, 0.75, is 85.93 times smaller than the coefficient, 64.75.
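The ratios quoted above are the t values from the coefficient table. As a minimal sketch, they can be reproduced from the printed estimates and standard errors (the vector names here are illustrative):

```r
# Estimates and standard errors copied from the coefficient table above.
estimate  <- c(Intercept = 64.75337453, TotExp = 0.00006297)
std_error <- c(Intercept = 0.75353661,  TotExp = 0.00000779)

# Ratio of each coefficient to its standard error; this is the t value.
ratio <- estimate / std_error
round(ratio, 2)

# Which coefficients clear the stricter "ten times smaller" threshold?
ratio >= 10
```

The intercept clears the stricter threshold easily, while the TotExp slope falls inside the five-to-ten band.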

The p-values for the TotExp coefficient and the intercept are so small that they are essentially zero, which is strong evidence that both the slope on TotExp and the intercept differ from zero.

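Although the model has some success at predicting life expectancy, the scatter plot alone makes it clear that the relationship is not linear, so the linearity assumption of simple linear regression is not met. The residual summary is also asymmetric (minimum -24.76, median 3.15, maximum 13.29), which points away from normally distributed residuals. The relationship appears to be curved, so a transformation of the variables may straighten it.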

Problem 2

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

# Transform the columns of the data frame. attach() made copies of the columns,
# so the bare names LifeExp and TotExp still refer to the untransformed values.
data$LifeExp <- data$LifeExp^4.6
data$TotExp <- data$TotExp^.06

plot(data$TotExp, data$LifeExp)

# Fit on the transformed columns by passing the data frame explicitly
data.lm <- lm(LifeExp ~ TotExp, data = data)
summary(data.lm)
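One detail of this workflow is easy to miss: attach() places a copy of the data frame's columns on the search path, so later assignments to data$LifeExp or data$TotExp do not change what the bare names LifeExp and TotExp refer to. The sketch below illustrates this with a hypothetical toy data frame (toy and x are illustrative names, not part of who.csv).

```r
# Hypothetical toy data frame, used only to illustrate the behaviour.
toy <- data.frame(x = c(1, 2, 3))
attach(toy)

# Modify the column in the data frame after attaching it.
toy$x <- toy$x^2

attached_x <- x      # bare name: resolved from the attached copy
frame_x    <- toy$x  # explicit reference: the modified column

detach(toy)

attached_x  # 1 2 3
frame_x     # 1 4 9
```

If the bare names are used after a transformation, plot() and lm() silently see the original values, and the fitted summary is identical to the untransformed one.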

Again, with only one predictor the F-test is equivalent to the t-test on the slope, so it adds little beyond the p-value of the slope.

The \(R^2\) value of 0.73 means that this model explains about 73% of the variability in the transformed response, \(\text{LifeExp}^{4.6}\), which is a much better result than we achieved in our first model and a very strong result for a single predictor.

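In this model the standard error for TotExp should be roughly 22 times smaller than its coefficient: with one predictor \(F = t^2\), and \(R^2 = 0.73\) on 188 residual degrees of freedom gives \(F \approx 508\) and \(|t| \approx 22.5\). This is well beyond the five-to-ten rule of thumb from the textbook, so it also indicates a very good fit. Because the response is now \(\text{LifeExp}^{4.6}\), the coefficients and standard errors are very large numbers and are best reported in scientific notation rather than rounded to two decimals.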

The scatter plot in this case shows a clear linear relationship that is also evident in the linear model. This is clearly a much better model than our first attempt. It explains nearly three times as much of the variability as our first model did (\(R^2\) of 0.73 versus 0.258).

Problem 3

Using the results from Problem 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

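Given a linear model of the form below, where \(b_0\) and \(b_1\) are the intercept and slope of the transformed fit from Problem 2:

\[ \text{LifeExp}^{4.6} = b_0 + b_1 \times \text{TotExp}^{0.06} \]

# forecast life expectancy when TotExp^.06 =1.5
# (back-transform the fitted value with the 1/4.6 power to get years)
le_1.5 <- unname((coef(data.lm)[1] + coef(data.lm)[2] * 1.5)^(1/4.6))

Life expectancy when \(\text{TotExp}^{0.06} = 1.5\) is the back-transformed value le_1.5, in years.

# forecast life expectancy when TotExp^.06=2.5
le_2.5 <- unname((coef(data.lm)[1] + coef(data.lm)[2] * 2.5)^(1/4.6))

Life expectancy when \(\text{TotExp}^{0.06} = 2.5\) is the back-transformed value le_2.5, in years.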
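The back-transformation can be wrapped in a small helper, which also makes it easy to sanity-check the inputs. This is a sketch only: forecast_life_exp is a hypothetical name, and the coefficients passed to it below are the untransformed ones printed in Problem 1, chosen deliberately to show what a wrong input looks like.

```r
# Hypothetical helper: back-transform a fitted value of LifeExp^4.6 to years.
forecast_life_exp <- function(b0, b1, tot_exp_06, power = 4.6) {
  fitted_value <- b0 + b1 * tot_exp_06
  fitted_value^(1 / power)
}

# Sanity check with the UNTRANSFORMED Problem 1 coefficients.
# These are the wrong inputs for this formula, and the result shows it.
wrong_1.5 <- forecast_life_exp(64.75337453, 0.00006297, 1.5)
round(wrong_1.5, 2)
```

A result near 2.48 years is therefore a sign that the Problem 1 coefficients were used. With the transformed fit, the call would read forecast_life_exp(coef(data.lm)[1], coef(data.lm)[2], 1.5).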

Problem 4

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

#Reload the data to reset to original values
detach(data)
data2 <- read.csv("who.csv")
attach(data2)
mult.lm <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp))
summary(mult.lm)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.32  -4.13   2.10   6.54  13.07 
## 
## Coefficients:
##                    Estimate    Std. Error t value             Pr(>|t|)    
## (Intercept)     62.77270326    0.79560524   78.90 < 0.0000000000000002 ***
## PropMD        1497.49395252  278.81687965    5.37    0.000000232060277 ***
## TotExp           0.00007233    0.00000898    8.05    0.000000000000094 ***
## PropMD:TotExp   -0.00602569    0.00147236   -4.09    0.000063527329494 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.8 on 186 degrees of freedom
## Multiple R-squared:  0.357,  Adjusted R-squared:  0.347 
## F-statistic: 34.5 on 3 and 186 DF,  p-value: <0.0000000000000002
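In R's formula syntax, a * b already expands to a + b + a:b, so writing the main effects alongside the product is redundant but harmless. The sketch below checks this on synthetic data (toy, a, b, and y are illustrative names, not the WHO variables).

```r
# Synthetic data, for illustration only (not the WHO data).
set.seed(1)
toy <- data.frame(a = runif(50), b = runif(50))
toy$y <- 1 + 2 * toy$a - 3 * toy$b + 4 * toy$a * toy$b + rnorm(50, sd = 0.1)

# Formula in the style used above: main effects plus (a * b).
fit_long <- lm(y ~ a + b + (a * b), data = toy)

# Shorthand: a * b already expands to a + b + a:b.
fit_short <- lm(y ~ a * b, data = toy)

# Same terms and same estimates in both fits.
all.equal(coef(fit_long), coef(fit_short))
```

This is why the coefficient table above has exactly three slope rows: PropMD, TotExp, and PropMD:TotExp.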

hist(mult.lm$residuals)

mean(mult.lm$residuals)

## [1] -0.00000000000000079

The residuals have a mean of essentially zero, but they are not normally distributed: the histogram shows a strong left skew.

plot(fitted(mult.lm), residuals(mult.lm), xlab="Fitted", ylab="Residuals")
abline(h=0)

There is a strong apparent pattern in the plotted residuals, indicating that the linear model is not a good fit.

qqnorm(resid(mult.lm))
qqline(resid(mult.lm))

The Q-Q plot shows a strong curve in the sample quantiles against the theoretical quantiles, which again indicates that the residuals are not normally distributed.

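This model is not a good fit. Its F statistic of 34.5 on 3 and 186 degrees of freedom has a p-value below 2e-16, and all four coefficients have p-values well below 0.001, so each term is statistically significant. However, the \(R^2\) value of 0.357 is only about half that of the transformed model in Problem 2, the residual standard error is 8.8 years, and the standard error of the PropMD:TotExp interaction is only about 4 times smaller than its coefficient.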

Problem 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
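As a sketch of how this forecast can be computed, the block below plugs the two values into the fitted equation using the coefficients printed in the Problem 4 summary. The vector and variable names are illustrative; with the fitted object available, predict(mult.lm, data.frame(PropMD = .03, TotExp = 14)) should give the same value up to rounding.

```r
# Coefficients copied from the summary(mult.lm) output in Problem 4.
b <- c(intercept = 62.77270326, prop_md = 1497.49395252,
       tot_exp = 0.00007233, interaction = -0.00602569)

prop_md <- 0.03
tot_exp <- 14

# LifeExp = b0 + b1 * PropMD + b2 * TotExp + b3 * PropMD * TotExp
forecast <- unname(b["intercept"] + b["prop_md"] * prop_md +
                   b["tot_exp"] * tot_exp +
                   b["interaction"] * prop_md * tot_exp)
round(forecast, 1)
```

The result is about 107.7 years, which is not a realistic average life expectancy for a country. Almost all of the increase over the intercept comes from the PropMD term (about 44.9 years), which suggests that PropMD = .03 combined with TotExp = 14 lies outside the range where this linear fit can be trusted.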