library(ggplot2)
library(ggResidpanel)The attached who.csv dataset contains real-world data from 2008. The variables included follow -
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.
Below is a preview if the data set -
df <- read.csv("https://raw.githubusercontent.com/soumya2g/R-CUNY-MSDS/master/DATA-605/Multi%20Linear%20Regression/Data/who.csv")
head(df)| Country | LifeExp | InfantSurvival | Under5Survival | TBFree | PropMD | PropRN | PersExp | GovtExp | TotExp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 42 | 0.835 | 0.743 | 0.99769 | 0.0002288 | 0.0005723 | 20 | 92 | 112 |
| Albania | 71 | 0.985 | 0.983 | 0.99974 | 0.0011431 | 0.0046144 | 169 | 3128 | 3297 |
| Algeria | 71 | 0.967 | 0.962 | 0.99944 | 0.0010605 | 0.0020914 | 108 | 5184 | 5292 |
| Andorra | 82 | 0.997 | 0.996 | 0.99983 | 0.0032973 | 0.0035000 | 2589 | 169725 | 172314 |
| Angola | 41 | 0.846 | 0.740 | 0.99656 | 0.0000704 | 0.0011462 | 36 | 1620 | 1656 |
| Antigua and Barbuda | 73 | 0.990 | 0.989 | 0.99991 | 0.0001429 | 0.0027738 | 503 | 12543 | 13046 |
dim(df)## [1] 190 10
Here is the statistical summary of the data -
summary(df)## Country LifeExp InfantSurvival
## Afghanistan : 1 Min. :40.00 Min. :0.8350
## Albania : 1 1st Qu.:61.25 1st Qu.:0.9433
## Algeria : 1 Median :70.00 Median :0.9785
## Andorra : 1 Mean :67.38 Mean :0.9624
## Angola : 1 3rd Qu.:75.00 3rd Qu.:0.9910
## Antigua and Barbuda: 1 Max. :83.00 Max. :0.9980
## (Other) :184
## Under5Survival TBFree PropMD PropRN
## Min. :0.7310 Min. :0.9870 Min. :0.0000196 Min. :0.0000883
## 1st Qu.:0.9253 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455
## Median :0.9745 Median :0.9992 Median :0.0010474 Median :0.0027584
## Mean :0.9459 Mean :0.9980 Mean :0.0017954 Mean :0.0041336
## 3rd Qu.:0.9900 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164
## Max. :0.9970 Max. :1.0000 Max. :0.0351290 Max. :0.0708387
##
## PersExp GovtExp TotExp
## Min. : 3.00 Min. : 10.0 Min. : 13
## 1st Qu.: 36.25 1st Qu.: 559.5 1st Qu.: 584
## Median : 199.50 Median : 5385.0 Median : 5541
## Mean : 742.00 Mean : 40953.5 Mean : 41696
## 3rd Qu.: 515.25 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :6350.00 Max. :476420.0 Max. :482750
##
In this exercise we will build a Multi-Factor linear regression model. Below are the objectives -
The first step in the simple linear regression modeling process is to determine whether or not it looks as though a linear relationship exists between the predictor and the output value. Let’s inspect through a scatter plot if there is any apparent linear relationship between the Preditor (TotExp) and Response variable (LifeExp) -
ggplot(df, aes(x=TotExp, y=LifeExp)) +
geom_point(size=2) +
geom_smooth(method=lm) +
ggtitle("Life Expectancy Vs. Total Expenditure Scatter Plot") +
theme(plot.title = element_text(hjust = 0.5))I am going to use R’s lm() function to generate a linear model. For the one factor model, R computes the values of \({ \beta }_{ 0 }\) and \({ \beta }_{ 1 }\) using the method of least squares which finds the line that most closely fits the measured data by minimizing the distance between the line and the data points.
lm_model <- lm(LifeExp ~ TotExp, df)
lm_model##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Coefficients:
## (Intercept) TotExp
## 6.475e+01 6.297e-05
summary(lm_model)##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
# Residuals plots
resid_panel(lm_model, plots='all', smoother = TRUE)Below is comparison between the two models using ggResidPanel compare function -
resid_compare(list(lm_model,lmNew_model),plots=c('resid','qq','yvp','lev','boxplot','ls'), smoother = TRUE)QQ plot line and residual vs fitted graph suggests that at least model 2 is closer to normal distribution than model 1.
The p-value suggests a statistically significant correlation between total expenditures^0.06 and life expectancy^4.6, since p<<0.05. The \({ R }^{ 2 }\) of 0.7298 means that about 72.98% of the variability of life expectancy about the mean is explained by the model. This is a moderately strong correlation. The F-statistic tells us that adding the variable ‘total expenditures’ to the model improves the model compared to only having an intercept 507.7 is much larger than 65.26, so it is a better fit than before. Note that The residual standard error tells us that, if the residuals are normally distributed, about 64% of the residuals are between ±90490000 years^4.6. These statistics suggest we have a useful model.
The linear model, when plotted over the data, matches the data more closely. Furthermore, the residual analysis shows that the residuals are normally distributed and show constant variance; there is no noticeable trend. Therefore, the linear model is valid in this case.
Overall, model 2 is much better than model 1.
predictdata <- data.frame(TotExpNew=c(1.5,2.5))
predict(lmNew_model, predictdata,interval="predict")^(1/4.6)## fit lwr upr
## 1 63.31153 35.93545 73.00793
## 2 86.50645 81.80643 90.43414
Predicting the values at 1.5 adn 2.5 provides the following results.
The prediction at 1.5 is 63 years with a CI(35.93545, 73.00793).
The prediction at 2.5 is 87 year with a CI(81.80643, 90.43414)
# Multiple linear regression model build
lm4 <- lm(LifeExp ~ PropMD + TotExp + TotExp:PropMD, data=df)
# Linear regression model summary
summary(lm4)FALSE
FALSE Call:
FALSE lm(formula = LifeExp ~ PropMD + TotExp + TotExp:PropMD, data = df)
FALSE
FALSE Residuals:
FALSE Min 1Q Median 3Q Max
FALSE -27.320 -4.132 2.098 6.540 13.074
FALSE
FALSE Coefficients:
FALSE Estimate Std. Error t value Pr(>|t|)
FALSE (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
FALSE PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
FALSE TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
FALSE PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
FALSE ---
FALSE Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
FALSE
FALSE Residual standard error: 8.765 on 186 degrees of freedom
FALSE Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
FALSE F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
# Residuals variability plot
ggplot(lm4, aes(.fitted, .resid))+
geom_point(size =2) +
stat_smooth(method="loess")+
geom_hline(yintercept=0, col="red", linetype="dashed") +
xlab("Fitted values")+ylab("Residuals") +
ggtitle("Residual vs Fitted Plot")+theme_bw() +
theme(plot.title = element_text(hjust = 0.5))# Residuals Q-Q plot
resid_interact(lm4, plots='qq', smoother = TRUE)# Residuals plot
resid_interact(lm4, plots='resid', smoother = TRUE)# Residuals plot
resid_interact(lm4, plots='yvp', smoother = TRUE)# Residuals Standardized Vs. Leverage plot
resid_interact(lm4, plots='lev', smoother = TRUE)# Residuals Box plot
resid_interact(lm4, plots='boxplot', smoother = TRUE)# Residuals Locations-Scale plot
resid_interact(lm4, plots='ls', smoother = TRUE)P-values for each variable and F-statistics are all below 0.05 so the results are statistically significant. We reject null hypothesis. Adjusted R^2 is around 0.3471 and it is substantially higher than model 1 but less than model 2. Residual standard error is smaller than model 1 but not model 2, with respect to residual standard error divided by variable coefficients. The model is not normally distributed according to QQ plot and residual vs fitted graph test, similar to model 1. The residual analysis shows that the residuals have a strong right skew and do not show constant variance. Therefore, the linear model is not valid in this case.
newdata <- data.frame(PropMD=0.03, TotExp=14)
predict(lm4, newdata,interval="predict")## fit lwr upr
## 1 107.696 84.24791 131.1441
Predicting the values at PropMD=0.03, TotExp=14 provides the following results.
The prediction is 107.7 years with a CI(84.24791, 131.1441).
A prediction of Life Expectancy of 107.7 years seems unrealistic since this is way above the the commonly known human life expectancy, also the CI also shows 132 years which is abnormal.
Furthermore, 3% of the population being MDs seems unreasonably high and Total Expenditures of $14 seems unreasonably low as that includes both personal and government expenditures.
Comments
The p-value suggests a statistically significant correlation between total expenditures and life expectancy, since p<<0.05. The \({ R }^{ 2 }\) of 0.2577 means that about 25.77% of the variability of life expectancy about the mean is explained by the model. This is a moderately weak correlation. The F-statistic tells us that adding the variable ‘total expenditures’ to the model improves the model compared to only having an intercept. The residual standard error tells us that, if the residuals are normally distributed, about 64% of the residuals are between ±9.371 years. These statistics suggest we have a useful model.
The linear model, when plotted over the data, does not match the data very closely. Furthermore, the residual analysis shows that the residuals have a strong right skew and do not show constant variance. Therefore, the linear model is not valid in this case.
Transforming the variables as noted in Question 2.