DATA 605 HW12

1. Provide a scatterplot of LifeExp~TotExp,

and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

First we can load the data from who.csv into a data frame:

who <- read.csv("who.csv", stringsAsFactors = FALSE) # read.csv already returns a data frame
head(who) # look at the top of the df

##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294      20      92    112
## 2 0.001143127 0.004614439     169    3128   3297
## 3 0.001060478 0.002091362     108    5184   5292
## 4 0.003297297 0.003500000    2589  169725 172314
## 5 0.000070400 0.001146162      36    1620   1656
## 6 0.000142857 0.002773810     503   12543  13046

str(who) #look at the data stored in the df

## 'data.frame':    190 obs. of  10 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ LifeExp       : int  42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num  0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num  0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num  0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num  2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num  0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : int  20 169 108 2589 36 503 484 88 3181 3788 ...
##  $ GovtExp       : int  92 3128 5184 169725 1620 12543 19170 1856 187616 189354 ...
##  $ TotExp        : int  112 3297 5292 172314 1656 13046 19654 1944 190797 193142 ...

pairs(who[,-1], gap = 0.5, col = "orangered") # [,-1] to remove the country name column

We can now run the linear regression:

fit1 <- lm(LifeExp ~ TotExp, data = who)
summary(fit1)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

plot(who$TotExp, who$LifeExp, xlab = "Total Expenditures ($)", ylab = "Life Expectancy (yrs)", col = "steelblue")
abline(fit1, col="yellow3")

hist(resid(fit1), main = "Histogram of Residuals", xlab = "residuals")

plot(fitted(fit1), resid(fit1))

The slope p-value (7.71e-14) is far below 0.05, so there is a statistically significant linear association between total expenditures and life expectancy. The R$^2$ of 0.2577 means that about 25.77% of the variability of life expectancy about its mean is explained by the model, which is a fairly weak fit. The F-statistic of 65.26 on 1 and 188 degrees of freedom tells us that adding the variable ‘total expenditures’ improves the model compared to one with only an intercept; with a single predictor, its p-value is the same as the slope's. The residual standard error tells us that, if the residuals are normally distributed, about 68% of them fall within $\pm 9.371$ years. Taken alone, these statistics suggest the model has some explanatory value, but they do not show that its assumptions hold.
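As a cross-check on the summary output, the F-statistic of a regression with an intercept can be recovered from R$^2$ alone. The sketch below uses only the rounded values printed by `summary(fit1)`; the function name `f_from_r2` is a hypothetical helper, not part of base R.

```r
# Recover the overall F-statistic from R^2 for a model with an intercept.
# n = number of observations, k = number of predictors (excluding the intercept).
f_from_r2 <- function(r2, n, k) {
  (r2 / k) / ((1 - r2) / (n - k - 1))
}

# Values from summary(fit1): R^2 = 0.2577, 190 observations, 1 predictor.
f1 <- f_from_r2(r2 = 0.2577, n = 190, k = 1)
f1  # close to the reported 65.26; the small gap comes from the rounded R^2

# Overall p-value from the F distribution with 1 and 188 degrees of freedom.
p1 <- pf(f1, df1 = 1, df2 = 188, lower.tail = FALSE)
p1
```

Because R$^2$ and F are tied together this way for a fixed sample size and number of predictors, they are two views of the same information rather than independent pieces of evidence.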

The linear model, when plotted over the data, does not match the data very closely. Furthermore, the residual analysis shows that the residuals have a strong left skew (median 3.154, minimum -24.764, maximum 13.292) and do not show constant variance. Therefore, the assumptions of simple linear regression are not met and the linear model is not valid in this case.
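The direction of the skew can be checked without the plots, using the residual quartiles printed by `summary(fit1)`. The sketch below computes Bowley's quartile skewness coefficient, a measure chosen here for illustration because it needs only those three numbers; the variable names are hypothetical.

```r
# Quartiles of the residuals, copied from summary(fit1).
q1 <- -4.778
med <- 3.154
q3 <- 7.116

# Bowley (quartile) skewness: negative means a longer left tail,
# positive means a longer right tail, and 0 means symmetric quartiles.
bowley_fit1 <- (q3 + q1 - 2 * med) / (q3 - q1)
bowley_fit1
```

With the fitted object available, the same formula could be applied to `quantile(resid(fit1), c(0.25, 0.5, 0.75))` instead of the rounded printed values.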

2. Raise life expectancy to the 4.6 power

(i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

le_4.6 <- who$LifeExp^4.6
te_0.06 <- who$TotExp^0.06
fit2 <- lm(le_4.6 ~ te_0.06)
summary(fit2)

## 
## Call:
## lm(formula = le_4.6 ~ te_0.06)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## te_0.06      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

plot(who$TotExp^0.06, who$LifeExp^4.6, xlab = "Total Expenditures^0.06 ($^0.06)", ylab = "Life Expectancy^4.6 (yrs^4.6)", col = "steelblue")
abline(fit2, col="yellow3")

hist(resid(fit2), main = "Histogram of Residuals", xlab = "residuals")

plot(fitted(fit2), resid(fit2))

The slope p-value (< 2e-16) is far below 0.05, so there is a statistically significant linear association between total expenditures^0.06 and life expectancy^4.6. The R$^2$ of 0.7298 means that about 72.98% of the variability of life expectancy^4.6 about its mean is explained by the model, which is a moderately strong fit. The F-statistic of 507.7 on 1 and 188 degrees of freedom tells us that adding the variable ‘total expenditures^0.06’ improves the model compared to one with only an intercept. It is much larger than the 65.26 from part 1 on the same degrees of freedom, which is consistent with a better fit. The residual standard error tells us that, if the residuals are normally distributed, about 68% of them fall within $\pm 90490000$ years^4.6. That value is on the transformed scale, so it cannot be compared directly with the 9.371 years from part 1. These statistics suggest we have a useful model.

The linear model, when plotted over the data, matches the data more closely. Furthermore, the residual analysis shows that the residuals are approximately normally distributed and show roughly constant variance; there is no noticeable trend. Therefore, the linear model is valid in this case.

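This model is better than the one in part 1: it explains far more of the variability (R$^2$ of 0.7298 versus 0.2577) and its residuals are much closer to satisfying the regression assumptions.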
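The prompt describes TotExp^.06 as nearly a log transform. A small sketch can illustrate why: for a small exponent, a power of x is close to a linear function of log(x). The example uses only the six TotExp values from the `head(who)` preview in part 1 rather than the full data set, so it is an illustration and not a result about all 190 countries; the variable names are hypothetical.

```r
# TotExp values for the first six countries shown by head(who).
tot_exp <- c(112, 3297, 5292, 172314, 1656, 13046)

# Compare the power transform used in the model with a true log transform.
power_t <- tot_exp^0.06
log_t <- log(tot_exp)

# A correlation near 1 means the two transforms are almost linearly related
# over this range, so they would be expected to behave similarly as predictors.
cor_power_log <- cor(power_t, log_t)
cor_power_log
```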

3. Using the results from 2

forecast life expectancy when TotExp^.06 = 1.5.

Then forecast life expectancy when TotExp^.06 = 2.5.

\[
\begin{aligned}
y &= -736527910 + 620060216x \\
y &= -736527910 + 620060216(1.5) \\
y &= 193562414 \\
le &= y^{1/4.6} \\
le &= 193562414^{1/4.6} \\
le &= 63.31153 \text{ years}
\end{aligned}
\]

Life expectancy is about 63.3 years when TotExp^0.06 = 1.5.

\[
\begin{aligned}
y &= -736527910 + 620060216x \\
y &= -736527910 + 620060216(2.5) \\
y &= 813622630 \\
le &= y^{1/4.6} \\
le &= 813622630^{1/4.6} \\
le &= 86.50645 \text{ years}
\end{aligned}
\]

Life expectancy is about 86.5 years when TotExp^0.06 = 2.5.
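The two hand calculations above can be reproduced in a few lines of R. This is a minimal sketch that hard-codes the coefficients as printed by `summary(fit2)`; the function name `forecast_le` is hypothetical.

```r
# Coefficients as printed by summary(fit2).
b0 <- -736527910
b1 <- 620060216

# Predict LifeExp^4.6 from TotExp^0.06, then undo the 4.6 power
# to get life expectancy back in years.
forecast_le <- function(te_0.06) {
  le_4.6 <- b0 + b1 * te_0.06
  le_4.6^(1 / 4.6)
}

forecast_le(c(1.5, 2.5))
```

If the fitted object is available, `predict(fit2, newdata = data.frame(te_0.06 = c(1.5, 2.5)))^(1 / 4.6)` should give essentially the same forecasts from the unrounded coefficients.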

4. Build the following multiple regression model

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

fit3 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who)
summary(fit3)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

hist(resid(fit3), main = "Histogram of Residuals", xlab = "residuals")

plot(fitted(fit3), resid(fit3))

The overall p-value is < 2.2e-16, far below 0.05, so the model is statistically significant, and each of the three coefficients is individually significant ($p < 0.001$). The F-statistic of 34.49 on 3 and 186 degrees of freedom tells us that the three terms together perform significantly better than an intercept-only model. R$^2$ of 0.3574 means that 35.74% of the variability of life expectancy about its mean is explained by the two predictors and their interaction; the adjusted R$^2$, which penalizes the extra terms, is 0.3471. This is still a fairly weak fit. The residual standard error of 8.765 means that, if the residuals are normally distributed, about 68% of them fall within $\pm 8.765$ years.

The residual analysis shows that the residuals have a strong left skew (median 2.098, minimum -27.320, maximum 13.074) and do not show constant variance. Therefore, the linear model is not valid in this case.
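Because the model contains the interaction PropMD x TotExp, the fitted effect of PropMD on life expectancy is not a single number: differentiating the fitted equation with respect to PropMD gives a slope of b1 + b3 x TotExp. The sketch below evaluates that slope from the rounded coefficients printed by `summary(fit3)`; the function name `prop_md_slope` is hypothetical.

```r
# Coefficients from summary(fit3).
b1 <- 1.497e+03   # PropMD
b3 <- -6.026e-03  # PropMD:TotExp interaction

# Change in predicted LifeExp per unit change in PropMD at a given TotExp.
prop_md_slope <- function(tot_exp) b1 + b3 * tot_exp

# TotExp for Afghanistan (112) and Andorra (172314), from head(who).
prop_md_slope(c(112, 172314))

# TotExp at which the fitted PropMD slope changes sign.
sign_change <- -b1 / b3
sign_change
```

The negative interaction coefficient means the fitted benefit of a higher proportion of doctors shrinks as total expenditures rise.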

5. Forecast LifeExp

when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

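\[
\begin{aligned}
LifeExp &= 6.277 \times 10^{1} + 1.497 \times 10^{3} \times PropMD + 7.233 \times 10^{-5} \times TotExp - 6.026 \times 10^{-3} \times PropMD \times TotExp \\
LifeExp &= 6.277 \times 10^{1} + 1.497 \times 10^{3} \times 0.03 + 7.233 \times 10^{-5} \times 14 - 6.026 \times 10^{-3} \times 0.03 \times 14 \\
LifeExp &\approx 107.6785
\end{aligned}
\]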

A forecast of 107.7 years seems very unrealistic. First, that age is such an outlier for a human being that someone reaching it will be featured in national or even international news. Life expectancy is the average length of time a person can expect to live, and I don’t see any population averaging anywhere near 107 years with modern technology. Furthermore, the inputs themselves are implausible: 3% of the population being MDs seems unreasonably high, since that would be about 9.5 million doctors in the US, and total expenditures of $14 seems unreasonably low, as that figure includes both personal and government expenditures. Both inputs also lie well outside the values in the data preview in part 1 (PropMD at most about 0.0033, TotExp at least 112), so the model is extrapolating.
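The forecast can also be reproduced in R. This is a minimal sketch that hard-codes the rounded coefficients printed by `summary(fit3)`; the names `coefs` and `forecast_fit3` are hypothetical.

```r
# Rounded coefficients from summary(fit3).
coefs <- c(intercept = 6.277e+01, prop_md = 1.497e+03,
           tot_exp = 7.233e-05, interaction = -6.026e-03)

# Evaluate the fitted equation, including the interaction term.
forecast_fit3 <- function(prop_md, tot_exp) {
  unname(coefs["intercept"] +
           coefs["prop_md"] * prop_md +
           coefs["tot_exp"] * tot_exp +
           coefs["interaction"] * prop_md * tot_exp)
}

forecast_fit3(prop_md = 0.03, tot_exp = 14)  # roughly 107.7 years
```

With the fitted object available, `predict(fit3, newdata = data.frame(PropMD = 0.03, TotExp = 14))` uses the unrounded coefficients and may differ slightly in the later decimal places.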