Libraries

library(ggplot2)
library(ggResidpanel)

Dataset Summary

The attached who.csv dataset contains real-world data from 2008. The variables included follow -

Country: name of the country

LifeExp: average life expectancy for the country in years

InfantSurvival: proportion of those surviving to one year or more

Under5Survival: proportion of those surviving to five years or more

TBFree: proportion of the population without TB.

PropMD: proportion of the population who are MDs

PropRN: proportion of the population who are RNs

PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate

GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate

TotExp: sum of personal and government expenditures.

Below is a preview if the data set -

df <- read.csv("https://raw.githubusercontent.com/soumya2g/R-CUNY-MSDS/master/DATA-605/Multi%20Linear%20Regression/Data/who.csv")
head(df)
Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp GovtExp TotExp
Afghanistan 42 0.835 0.743 0.99769 0.0002288 0.0005723 20 92 112
Albania 71 0.985 0.983 0.99974 0.0011431 0.0046144 169 3128 3297
Algeria 71 0.967 0.962 0.99944 0.0010605 0.0020914 108 5184 5292
Andorra 82 0.997 0.996 0.99983 0.0032973 0.0035000 2589 169725 172314
Angola 41 0.846 0.740 0.99656 0.0000704 0.0011462 36 1620 1656
Antigua and Barbuda 73 0.990 0.989 0.99991 0.0001429 0.0027738 503 12543 13046
dim(df)
## [1] 190  10

Here is the statistical summary of the data -

summary(df)
##                 Country       LifeExp      InfantSurvival  
##  Afghanistan        :  1   Min.   :40.00   Min.   :0.8350  
##  Albania            :  1   1st Qu.:61.25   1st Qu.:0.9433  
##  Algeria            :  1   Median :70.00   Median :0.9785  
##  Andorra            :  1   Mean   :67.38   Mean   :0.9624  
##  Angola             :  1   3rd Qu.:75.00   3rd Qu.:0.9910  
##  Antigua and Barbuda:  1   Max.   :83.00   Max.   :0.9980  
##  (Other)            :184                                   
##  Under5Survival       TBFree           PropMD              PropRN         
##  Min.   :0.7310   Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883  
##  1st Qu.:0.9253   1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455  
##  Median :0.9745   Median :0.9992   Median :0.0010474   Median :0.0027584  
##  Mean   :0.9459   Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336  
##  3rd Qu.:0.9900   3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164  
##  Max.   :0.9970   Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387  
##                                                                           
##     PersExp           GovtExp             TotExp      
##  Min.   :   3.00   Min.   :    10.0   Min.   :    13  
##  1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584  
##  Median : 199.50   Median :  5385.0   Median :  5541  
##  Mean   : 742.00   Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :6350.00   Max.   :476420.0   Max.   :482750  
## 

Objective

In this exercise we will build a Multi-Factor linear regression model. Below are the objectives -

  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

Visualize the Data

The first step in the simple linear regression modeling process is to determine whether or not it looks as though a linear relationship exists between the predictor and the output value. Let’s inspect through a scatter plot if there is any apparent linear relationship between the Preditor (TotExp) and Response variable (LifeExp) -

ggplot(df, aes(x=TotExp, y=LifeExp)) + 
  geom_point(size=2) +
  geom_smooth(method=lm) +
  ggtitle("Life Expectancy Vs. Total Expenditure Scatter Plot") +
  theme(plot.title = element_text(hjust = 0.5))

Linear Model Function

I am going to use R’s lm() function to generate a linear model. For the one factor model, R computes the values of \({ \beta }_{ 0 }\) and \({ \beta }_{ 1 }\) using the method of least squares which finds the line that most closely fits the measured data by minimizing the distance between the line and the data points.

lm_model <- lm(LifeExp ~ TotExp, df)

lm_model
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Coefficients:
## (Intercept)       TotExp  
##   6.475e+01    6.297e-05

Model Coefficients Summary

summary(lm_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Model Interpretation

# Residuals plots
resid_panel(lm_model, plots='all', smoother = TRUE)

Comments

The p-value suggests a statistically significant correlation between total expenditures and life expectancy, since p<<0.05. The \({ R }^{ 2 }\) of 0.2577 means that about 25.77% of the variability of life expectancy about the mean is explained by the model. This is a moderately weak correlation. The F-statistic tells us that adding the variable ‘total expenditures’ to the model improves the model compared to only having an intercept. The residual standard error tells us that, if the residuals are normally distributed, about 64% of the residuals are between ±9.371 years. These statistics suggest we have a useful model.

The linear model, when plotted over the data, does not match the data very closely. Furthermore, the residual analysis shows that the residuals have a strong right skew and do not show constant variance. Therefore, the linear model is not valid in this case.

  1. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Transforming the variables as noted in Question 2.

# Transformations
LifeExpNew <- df$LifeExp**4.6
TotExpNew <- df$TotExp**0.06


# Linear regression model build
lmNew_model <- lm(LifeExpNew ~ TotExpNew)

# Scatterplot of dependent and independent variables
ggplot(df, aes(x=TotExpNew, y=LifeExpNew)) +
  geom_point(size=2) +
  geom_smooth(method=lm) +
  ggtitle("Life Expectancy vs Total Expenditures (Modified)") +
  xlab("Total Expenditures") +
  ylab("Life Expectancy") +
  theme(plot.title = element_text(hjust = 0.5))

# Linear regression model summary
summary(lmNew_model)
FALSE 
FALSE Call:
FALSE lm(formula = LifeExpNew ~ TotExpNew)
FALSE 
FALSE Residuals:
FALSE        Min         1Q     Median         3Q        Max 
FALSE -308616089  -53978977   13697187   59139231  211951764 
FALSE 
FALSE Coefficients:
FALSE               Estimate Std. Error t value Pr(>|t|)    
FALSE (Intercept) -736527910   46817945  -15.73   <2e-16 ***
FALSE TotExpNew    620060216   27518940   22.53   <2e-16 ***
FALSE ---
FALSE Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE 
FALSE Residual standard error: 90490000 on 188 degrees of freedom
FALSE Multiple R-squared:  0.7298,  Adjusted R-squared:  0.7283 
FALSE F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
# Residuals variability plot
ggplot(lmNew_model, aes(.fitted, .resid))+
    geom_point(size =2) +
    stat_smooth(method="loess")+
    geom_hline(yintercept=0, col="red", linetype="dashed") +
    xlab("Fitted values")+ylab("Residuals") +
    ggtitle("Residual vs Fitted Plot (New)")+theme_bw() +
    theme(plot.title = element_text(hjust = 0.5))

# Residuals Q-Q plot
resid_interact(lmNew_model, plots='qq', smoother = TRUE)
# Residuals plot
resid_interact(lmNew_model, plots='resid', smoother = TRUE)
# Residuals plot
resid_interact(lmNew_model, plots='yvp', smoother = TRUE)
# Residuals Standardized Vs. Leverage plot
resid_interact(lmNew_model, plots='lev', smoother = TRUE)
# Residuals Box plot
resid_interact(lmNew_model, plots='boxplot', smoother = TRUE)
# Residuals Locations-Scale plot
resid_interact(lmNew_model, plots='ls', smoother = TRUE)

Model Comparisons

Below is comparison between the two models using ggResidPanel compare function -

resid_compare(list(lm_model,lmNew_model),plots=c('resid','qq','yvp','lev','boxplot','ls'), smoother = TRUE)

Comments

QQ plot line and residual vs fitted graph suggests that at least model 2 is closer to normal distribution than model 1.

The p-value suggests a statistically significant correlation between total expenditures^0.06 and life expectancy^4.6, since p<<0.05. The \({ R }^{ 2 }\) of 0.7298 means that about 72.98% of the variability of life expectancy about the mean is explained by the model. This is a moderately strong correlation. The F-statistic tells us that adding the variable ‘total expenditures’ to the model improves the model compared to only having an intercept 507.7 is much larger than 65.26, so it is a better fit than before. Note that The residual standard error tells us that, if the residuals are normally distributed, about 64% of the residuals are between ±90490000 years^4.6. These statistics suggest we have a useful model.

The linear model, when plotted over the data, matches the data more closely. Furthermore, the residual analysis shows that the residuals are normally distributed and show constant variance; there is no noticeable trend. Therefore, the linear model is valid in this case.

Overall, model 2 is much better than model 1.

  1. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
predictdata <- data.frame(TotExpNew=c(1.5,2.5))
predict(lmNew_model, predictdata,interval="predict")^(1/4.6)
##        fit      lwr      upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414

Predicting the values at 1.5 adn 2.5 provides the following results.

The prediction at 1.5 is 63 years with a CI(35.93545, 73.00793).

The prediction at 2.5 is 87 year with a CI(81.80643, 90.43414)

  1. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
# Multiple linear regression model build
lm4 <- lm(LifeExp ~ PropMD + TotExp + TotExp:PropMD, data=df)

# Linear regression model summary
summary(lm4)
FALSE 
FALSE Call:
FALSE lm(formula = LifeExp ~ PropMD + TotExp + TotExp:PropMD, data = df)
FALSE 
FALSE Residuals:
FALSE     Min      1Q  Median      3Q     Max 
FALSE -27.320  -4.132   2.098   6.540  13.074 
FALSE 
FALSE Coefficients:
FALSE                 Estimate Std. Error t value Pr(>|t|)    
FALSE (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
FALSE PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
FALSE TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
FALSE PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
FALSE ---
FALSE Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE 
FALSE Residual standard error: 8.765 on 186 degrees of freedom
FALSE Multiple R-squared:  0.3574,  Adjusted R-squared:  0.3471 
FALSE F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
# Residuals variability plot
ggplot(lm4, aes(.fitted, .resid))+
    geom_point(size =2) +
    stat_smooth(method="loess")+
    geom_hline(yintercept=0, col="red", linetype="dashed") +
    xlab("Fitted values")+ylab("Residuals") +
    ggtitle("Residual vs Fitted Plot")+theme_bw() +
    theme(plot.title = element_text(hjust = 0.5))

# Residuals Q-Q plot
resid_interact(lm4, plots='qq', smoother = TRUE)
# Residuals plot
resid_interact(lm4, plots='resid', smoother = TRUE)
# Residuals plot
resid_interact(lm4, plots='yvp', smoother = TRUE)
# Residuals Standardized Vs. Leverage plot
resid_interact(lm4, plots='lev', smoother = TRUE)
# Residuals Box plot
resid_interact(lm4, plots='boxplot', smoother = TRUE)
# Residuals Locations-Scale plot
resid_interact(lm4, plots='ls', smoother = TRUE)

Comments

P-values for each variable and F-statistics are all below 0.05 so the results are statistically significant. We reject null hypothesis. Adjusted R^2 is around 0.3471 and it is substantially higher than model 1 but less than model 2. Residual standard error is smaller than model 1 but not model 2, with respect to residual standard error divided by variable coefficients. The model is not normally distributed according to QQ plot and residual vs fitted graph test, similar to model 1. The residual analysis shows that the residuals have a strong right skew and do not show constant variance. Therefore, the linear model is not valid in this case.

  1. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
newdata <- data.frame(PropMD=0.03, TotExp=14)
predict(lm4, newdata,interval="predict")
##       fit      lwr      upr
## 1 107.696 84.24791 131.1441

Predicting the values at PropMD=0.03, TotExp=14 provides the following results.

The prediction is 107.7 years with a CI(84.24791, 131.1441).

A prediction of Life Expectancy of 107.7 years seems unrealistic since this is way above the the commonly known human life expectancy, also the CI also shows 132 years which is abnormal.

Furthermore, 3% of the population being MDs seems unreasonably high and Total Expenditures of $14 seems unreasonably low as that includes both personal and government expenditures.