1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
who.df<-read.csv(file="who.csv")
#head(who.df)
#nrow(who.df)
#ncol(who.df)
#summary(who.df)
sum(is.na(who.df))
## [1] 0
colnames(who.df)
##  [1] "Country"        "LifeExp"        "InfantSurvival" "Under5Survival"
##  [5] "TBFree"         "PropMD"         "PropRN"         "PersExp"       
##  [9] "GovtExp"        "TotExp"
glimpse(who.df)
## Rows: 190
## Columns: 10
## $ Country        <fct> Afghanistan, Albania, Algeria, Andorra, Angola,...
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74,...
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986...
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983...
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0....
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297...
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500...
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 378...
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 185...
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 19...
plot(who.df$LifeExp~who.df$TotExp)

While there seems to be a positive relationship between Total Expenditures and Life Expectancy, the plot does not indicate a smooth linear relationship between the 2 variables. It seems to resemble a reverse “L” shape.

lm1<-lm(who.df$LifeExp~who.df$TotExp, who.df)
plot(who.df$LifeExp~who.df$TotExp)
abline(lm1, col="red")

summary(lm1)
## 
## Call:
## lm(formula = who.df$LifeExp ~ who.df$TotExp, data = who.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.475e+01  7.535e-01  85.933  < 2e-16 ***
## who.df$TotExp 6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
hist(lm1$residuals)

The histogram of residuals indicates a left-skewed distribution i.e. a non-normal distribution which is also corraborated by the Q-Q plot below, as well as the information in the summary showing the median, quantiles, min and max values for the residuals.

plot(fitted(lm1),resid(lm1))

The residuals don’t seem to have a constant variance. Most of the values are bunched at the lower end of the predicted life expectancy values.

par(mfrow=c(2,2))
plot(lm1)

The F-statistic is the ratio of the mean regression sum of squares divided by the mean error sum of squares. Its value is 65.26 in this case. Given the tiny p-value associated with the F-statistic, we can reject the the null hypothesis that all of the regression coefficients are zero i.e. that they are statistically insignificant.

The standard errors associated with the co-efficients for intercept and the predictor variable (Total Expenditure) are also very small, indicating that these co-efficients are statistically significant.

However the adjusted R-squared value of about 25% is on the lower side indicating that this model explains only about 25% of the variability in the response variable.

Overall, the assumptions of a linear regression model are not met because of the following reasons: 1) Non-linear trend (scatter plot between predictor and response variable shows reverse L shape). 2) Residuals do not seem to be homoscedastic i.e. their variance is not constant. 3) Residuals are non-normal as shown by the q-q plot.

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

# Separate the predictor and response variable into a separate dataframe
cols<-c('LifeExp', 'TotExp')
who.df2<-who.df[,cols]
dim(who.df2)
## [1] 190   2
colnames(who.df2)
## [1] "LifeExp" "TotExp"
head(who.df2)
##   LifeExp TotExp
## 1      42    112
## 2      71   3297
## 3      71   5292
## 4      82 172314
## 5      41   1656
## 6      73  13046

We now do a power transformation of the predictor and response variables.

#Scale the predictor and response variables by transforming their powers
who.df2$LifeExp<-who.df2$LifeExp^4.6
who.df2$TotExp<-who.df2$TotExp^0.06
head(who.df2)
##     LifeExp   TotExp
## 1  29305338 1.327251
## 2 327935478 1.625875
## 3 327935478 1.672697
## 4 636126841 2.061481
## 5  26230450 1.560068
## 6 372636298 1.765748
# Run the regression on the modified variables
lm2<-lm(LifeExp~TotExp, who.df2)
summary(lm2)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.df2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
plot(who.df2$LifeExp~who.df2$TotExp)
abline(lm2, col="red")

The above plot shows a much more linear relationship between the 2 variables.

hist(lm2$residuals)

The above histogram of residuals is still left-skewed, but less than in the previous model.

qqnorm(resid(lm2))
qqline(resid(lm2))

The q-q plot shows that the residuals are not normal, but they are more so than in the previous model.

plot(fitted(lm2),resid(lm2))

The variability in residuals is more even in the new model, unlike the previous model.

The F-statistic is 507.7 and its p-value is close to 0. This suggests that the TotExp is a statistically significant predictor of LifeExp in this model. The adjusted R-squared of about 73% indicates that this model can explain about 73% of the variability in life expectancy.

The p-values associated with the coefficients of intercept and predictor variable are close to 0 as well which implies that both of these values are statistically significant.

Overall, this model meets the conditions of linear regression better than the previous model, because of: 1) Linear trend between the response and predictor variable (as shown in scatter plot). 2) Nearly even variance in residuals (as shown in scatter plot of residuals). 3) Closer to normality in distribution of residuals (as shown in histogram of residuals).

So the transformation of the variables makes this model a better candidate for linear regression.

3. Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

test.df<-data.frame(c(1.5,2.5))
colnames(test.df)<-c('TotExp')
#test.df
predval<-predict(lm2,newdata = test.df)
predval<-predval^(1/4.6)
test.df$TotExp<-test.df$TotExp^(1/0.06)
(combined<-cbind(test.df,predval))
##        TotExp  predval
## 1     860.705 63.31153
## 2 4288777.125 86.50645
knitr::kable(combined, digit = 1)
TotExp predval
860.7 63.3
4288777.1 86.5

The predicted values are 63.3 years and 86.5 years when the modified total expenditures (changed power) are 1.5 and 2.5 respectively.

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

# Add new derived predictor variable based on product of 2 existing variables
cols<-c('LifeExp', 'PropMD', 'TotExp')
who.df3<-who.df[,cols]
who.df3$MDExp<-who.df3$PropMD * who.df3$TotExp
dim(who.df3)
## [1] 190   4
colnames(who.df3)
## [1] "LifeExp" "PropMD"  "TotExp"  "MDExp"
head(who.df3)
##   LifeExp      PropMD TotExp        MDExp
## 1      42 0.000228841    112   0.02563019
## 2      71 0.001143127   3297   3.76888972
## 3      71 0.001060478   5292   5.61204958
## 4      82 0.003297297 172314 568.17043526
## 5      41 0.000070400   1656   0.11658240
## 6      73 0.000142857  13046   1.86371242
# Fit the new model
lm3<-lm(LifeExp~TotExp+PropMD+MDExp, who.df3)
summary(lm3)
## 
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + MDExp, data = who.df3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.277e+01  7.956e-01  78.899  < 2e-16 ***
## TotExp       7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD       1.497e+03  2.788e+02   5.371 2.32e-07 ***
## MDExp       -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

We now run the same diagnostic plots as before.

hist(lm3$residuals)

The residuals continue to be left-skewed.

plot(fitted(lm3),resid(lm3))

The residuals indicate changing variance, and there are a few outliers.

par(mfrow=c(2,2))
plot(lm3)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Based on a review of the summary data and diagnostic plots, it’s clear that this is not a good model, based on the reasons below:

  1. The co-efficient for the non-linear predictor variable (proportion of doctors times total expenditure) is negative, indicating a negative relationship between this combined variable and life expectancy. This seems counter-intuitive.

  2. While the p-values associated with the intercept and the other co-efficients, as well as the F-statistic are tiny indicating statistical significance, the residuals are highly heteroscedastic.

  3. The q-q plot shows deviation from normality in the residuals.

The adjusted R-squared value shows that this model explains about 35% of the variability in the life expectancy, which is better than the first model, but worse than the second model.

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

test.df<-data.frame(0.03,14, 0.03*14)
colnames(test.df)<-c("PropMD","TotExp","MDExp" )
#test.df
predval<-predict(lm3,newdata = test.df)
predval
##       1 
## 107.696
(combined<-cbind(test.df,predval))
##   PropMD TotExp MDExp predval
## 1   0.03     14  0.42 107.696

The predicted values based on those inputs is about 108 years which seems unrealistic. Even if the proportion of doctors and total expenditure for any country are very high, having an average life expectancy of 108 years is unlikely given human biological constraints.