Dataset: who.csv

Country: name of the country

LifeExp: average life expectancy for the country in years

InfantSurvival: proportion of those surviving to one year or more

Under5Survival: proportion of those surviving to five years or more

TBFree: proportion of the population without TB.

PropMD: proportion of the population who are MDs

PropRN: proportion of the population who are RNs

PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate

GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate

TotExp: sum of personal and government expenditures.

# Read the data
who <- read.csv("https://raw.githubusercontent.com/L-Velasco/DATA605_SP19/master/HW/who.csv", stringsAsFactors = FALSE)

str(who)
## 'data.frame':    190 obs. of  10 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ LifeExp       : int  42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num  0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num  0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num  0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num  2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num  0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : int  20 169 108 2589 36 503 484 88 3181 3788 ...
##  $ GovtExp       : int  92 3128 5184 169725 1620 12543 19170 1856 187616 189354 ...
##  $ TotExp        : int  112 3297 5292 172314 1656 13046 19654 1944 190797 193142 ...
dim(who)
## [1] 190  10
head(who)
##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294      20      92    112
## 2 0.001143127 0.004614439     169    3128   3297
## 3 0.001060478 0.002091362     108    5184   5292
## 4 0.003297297 0.003500000    2589  169725 172314
## 5 0.000070400 0.001146162      36    1620   1656
## 6 0.000142857 0.002773810     503   12543  13046
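
As a quick sanity check on the variable definitions, the six rows printed by head(who) can be used to confirm that TotExp is the sum of PersExp and GovtExp. The sketch below retypes those six rows by hand into a small data frame (called preview here, a hypothetical name); it checks only the preview, not the full dataset.

```r
# First six rows as printed by head(who) above, retyped by hand.
# `preview` is a hypothetical name used only for this illustration.
preview <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria",
              "Andorra", "Angola", "Antigua and Barbuda"),
  PersExp = c(20, 169, 108, 2589, 36, 503),
  GovtExp = c(92, 3128, 5184, 169725, 1620, 12543),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# TRUE for each row where TotExp = PersExp + GovtExp holds
preview$sum_matches <- preview$PersExp + preview$GovtExp == preview$TotExp
all(preview$sum_matches)
```

With the full data loaded, the same comparison could be run as all(who$PersExp + who$GovtExp == who$TotExp).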

1. LifeExp~TotExp

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

fit <- lm(LifeExp ~ TotExp, data = who)
plot(who$TotExp, who$LifeExp)
abline(fit)

# correlation
cor(who$LifeExp, who$TotExp)
## [1] 0.5076339
summary(fit)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
plot(fit,which=1)

hist(resid(fit), main = "Histogram of Residuals", xlab = "residuals")

qqnorm(resid(fit))
qqline(resid(fit))

The scatterplot and the diagnostic plots suggest a non-linear relationship. The histogram of the residuals is not normal: it is left-skewed, consistent with the residual quantiles above (minimum -24.76, median 3.15, maximum 13.29), and the points in the Q-Q plot fall away from the theoretical line. Overall, the model appears to violate the linearity and normality-of-errors assumptions.

The correlation is only moderate at r = 0.5076.

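The F-statistic is 65.26 on 1 and 188 degrees of freedom with a very low p-value of 7.714e-14, so the model is statistically significant overall. The R-squared is low: the model explains only 25.77% of the variation in life expectancy. The residual standard error is 9.371, meaning a typical prediction misses by roughly 9.4 years. The coefficient on TotExp is significant (p = 7.71e-14) with a standard error of 7.795e-06; with a single predictor this is the same test as the overall F-test.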
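
For a simple linear regression with one predictor, R-squared equals the squared correlation, and the F-statistic follows from R-squared and the residual degrees of freedom. A minimal sketch, using only numbers printed above (the variable names are illustrative), shows how the reported values fit together:

```r
# Values copied from the output above
r  <- 0.5076339   # cor(who$LifeExp, who$TotExp)
df <- 188         # residual degrees of freedom

# With a single predictor, R^2 is the squared correlation
r_squared <- r^2

# F = (R^2 / 1) / ((1 - R^2) / df) for one predictor
f_stat <- r_squared / ((1 - r_squared) / df)

# Upper-tail p-value of the F distribution with 1 and df degrees of freedom
p_value <- pf(f_stat, df1 = 1, df2 = df, lower.tail = FALSE)

round(r_squared, 4)  # agrees with Multiple R-squared: 0.2577
round(f_stat, 2)     # agrees with F-statistic: 65.26
p_value              # expected to be on the order of the reported 7.714e-14
```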

2. LifeExp~TotExp (transformed)

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better”?

who$LifeExp2 <- who$LifeExp^4.6
who$TotExp2 <- who$TotExp^0.06

head(who)
##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp  LifeExp2  TotExp2
## 1 0.000228841 0.000572294      20      92    112  29305338 1.327251
## 2 0.001143127 0.004614439     169    3128   3297 327935478 1.625875
## 3 0.001060478 0.002091362     108    5184   5292 327935478 1.672697
## 4 0.003297297 0.003500000    2589  169725 172314 636126841 2.061481
## 5 0.000070400 0.001146162      36    1620   1656  26230450 1.560068
## 6 0.000142857 0.002773810     503   12543  13046 372636298 1.765748
fit2 <- lm(LifeExp2 ~ TotExp2, data = who)
plot(who$TotExp2, who$LifeExp2)
abline(fit2)

#correlation
cor(who$LifeExp2, who$TotExp2)
## [1] 0.8542642
summary(fit2)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
plot(fit2,which=1)

hist(resid(fit2), main = "Histogram of Residuals", xlab = "residuals")

qqnorm(resid(fit2))
qqline(resid(fit2))

The transformed model is better: it resolves the issues seen in the first model. The scatterplot and the diagnostic plots now suggest that the linearity and normality assumptions are reasonably met.

The correlation is stronger at r = 0.8543.

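The F-statistic is 507.7 on 1 and 188 degrees of freedom with a p-value below 2.2e-16, so the model is significant. The R-squared is higher: the model now explains about 73% of the variation in the transformed response. The coefficient on TotExp2 is significant with p < 2e-16. The residual standard error is 90,490,000, but it is measured in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 years of the first model.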

3. Forecast

Forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

new.df <- data.frame(TotExp2=c(1.5, 2.5))
round(predict(fit2, new.df)^(1/4.6))
##  1  2 
## 63 87
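
The forecast can also be reproduced by hand from the coefficients printed in summary(fit2): plug TotExp2 into the fitted line to get a prediction for LifeExp^4.6, then undo the transformation by raising the result to the power 1/4.6. A sketch under that reading (b0, b1 and the other names are illustrative):

```r
b0 <- -736527910   # (Intercept) from summary(fit2)
b1 <-  620060216   # TotExp2 slope from summary(fit2)

totexp2 <- c(1.5, 2.5)

# Prediction on the transformed scale, i.e. for LifeExp^4.6
lifeexp_pow <- b0 + b1 * totexp2

# Back-transform to years
lifeexp_years <- lifeexp_pow^(1 / 4.6)
round(lifeexp_years)   # 63 and 87, matching predict() above
```

The back-transform is only defined while the prediction on the transformed scale is positive; with these coefficients that requires TotExp2 above roughly 1.19, otherwise R returns NaN.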

4. Multiple Regression Model

Build the following multiple regression model and interpret the F statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

#correlation check against multicollinearity
#cor(who$PropMD, who$TotExp)
m <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who)
summary(m)  
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

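Although all three terms are individually significant (p < 0.001) and the overall F-statistic of 34.49 on 3 and 186 degrees of freedom has a p-value below 2.2e-16, the model explains only 35.74% of the variation in life expectancy (adjusted R-squared 0.3471). The residual standard error is 8.765 years, only slightly better than the 9.371 years of the first model, so the model is statistically significant but not a good fit.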
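
Because the model includes an interaction, the effect of TotExp on predicted life expectancy is not a single number: it depends on PropMD. As an illustration only, using the rounded coefficients printed above (the function and variable names are hypothetical), the implied slope with respect to TotExp is b2 + b3 * PropMD:

```r
b2 <-  7.233e-05   # TotExp coefficient from summary(m)
b3 <- -6.026e-03   # PropMD:TotExp coefficient from summary(m)

# Change in predicted LifeExp per extra unit of TotExp at a given PropMD
totexp_slope <- function(prop_md) b2 + b3 * prop_md

totexp_slope(0)       # equals b2 when PropMD is zero
totexp_slope(0.001)   # smaller, but still positive

# PropMD value at which the implied TotExp slope reaches zero
zero_slope_at <- -b2 / b3
zero_slope_at
```

The implied slope only turns negative for PropMD above roughly 0.012, well beyond the PropMD values visible in the data preview (at most about 0.0033).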

5. Forecast

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

new.df2 <- data.frame(PropMD=0.03, TotExp=14)
round(predict(m, new.df2))
##   1 
## 108

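The predicted value seems unrealistic. A life expectancy of 108 years is well above the average life expectancy of any country. The inputs are also an implausible combination: PropMD = 0.03 means 3% of the population are MDs, roughly nine times the largest value shown in the data preview (0.0033 for Andorra), while TotExp = 14 is far below the smallest value shown there (112). Since the model itself explains only about 36% of the variation, a prediction at such an extreme point should not be trusted.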
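
Breaking the prediction into its terms shows where the 108 years comes from. The sketch below uses the rounded coefficients printed by summary(m), so it agrees with predict() only up to rounding (the names are illustrative):

```r
b0 <-  6.277e+01   # (Intercept)
b1 <-  1.497e+03   # PropMD
b2 <-  7.233e-05   # TotExp
b3 <- -6.026e-03   # PropMD:TotExp

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted LifeExp, in years
contributions <- c(
  intercept   = b0,
  PropMD      = b1 * prop_md,
  TotExp      = b2 * tot_exp,
  interaction = b3 * prop_md * tot_exp
)

contributions
sum(contributions)   # about 107.7, which rounds to 108
```

Almost 45 of the predicted years come from the PropMD term alone, while the TotExp and interaction terms contribute almost nothing at TotExp = 14.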