The attached who.csv dataset contains real-world data from 2008. The variables included follow.
data <- read.csv("who.csv")
# Show Entire dataset
DT::datatable(data)
str(data)
## 'data.frame': 190 obs. of 10 variables:
## $ Country : Factor w/ 190 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ LifeExp : int 42 71 71 82 41 73 75 69 82 80 ...
## $ InfantSurvival: num 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
## $ Under5Survival: num 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
## $ TBFree : num 0.998 1 0.999 1 0.997 ...
## $ PropMD : num 0.0002288 0.0011431 0.0010605 0.0032973 0.0000704 ...
## $ PropRN : num 0.000572 0.004614 0.002091 0.0035 0.001146 ...
## $ PersExp : int 20 169 108 2589 36 503 484 88 3181 3788 ...
## $ GovtExp : int 92 3128 5184 169725 1620 12543 19170 1856 187616 189354 ...
## $ TotExp : int 112 3297 5292 172314 1656 13046 19654 1944 190797 193142 ...
summary(data)
## Country LifeExp InfantSurvival Under5Survival
## Afghanistan : 1 Min. :40.00 Min. :0.8350 Min. :0.7310
## Albania : 1 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Algeria : 1 Median :70.00 Median :0.9785 Median :0.9745
## Andorra : 1 Mean :67.38 Mean :0.9624 Mean :0.9459
## Angola : 1 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Antigua and Barbuda: 1 Max. :83.00 Max. :0.9980 Max. :0.9970
## (Other) :184
## TBFree PropMD PropRN PersExp
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883 Min. : 3.00
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455 1st Qu.: 36.25
## Median :0.9992 Median :0.0010474 Median :0.0027584 Median : 199.50
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336 Mean : 742.00
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164 3rd Qu.: 515.25
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387 Max. :6350.00
##
## GovtExp TotExp
## Min. : 10.0 Min. : 13
## 1st Qu.: 559.5 1st Qu.: 584
## Median : 5385.0 Median : 5541
## Mean : 40953.5 Mean : 41696
## 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :476420.0 Max. :482750
##
Scatterplot of Life Expectancy and Total Expenditure
# Linear regression model build
lm <- lm(LifeExp ~ TotExp, data=data)
plot(data$TotExp, data$LifeExp, xlab = "Total Expenditures", ylab = "Life Expectancy", main="Life Expectancy vs Total Expenditures")
abline(lm)
Run summary function to provide the F statistics, R-squared, standard error, and p-values
LifeExp_simple.lm <- lm(LifeExp ~ TotExp, data = data)
summary(LifeExp_simple.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.753374534 0.753536611 85.933 < 2e-16 ***
## TotExp 0.000062970 0.000007795 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The F-statistic is 65.26 with a p-value of 7.714e-14 (test for overall regression model). P-value for F-test and TotExp is less than 0.05 but Adjusted R-squared is way too low, 0.2537, which suggests that we can reject the null hypothesis (regression model is same as intercept-only model) and accept the alternative that this simple linear regression model provides a better fit than the intercept-only model. The \(R^2\)=0.2577 value is not strong, which tells us that the model only accounts for 25.77% of the variation in the data. Standard error, 9.371, is higher than what we want it to be.
About F-statistic in linear regression model: http://blog.minitab.com/blog/adventures-in-statistics-2/what-is-the-f-test-of-overall-significance-in-regression-analysis
# Residuals variability plot
plot(LifeExp_simple.lm$fitted.values, LifeExp_simple.lm$residuals,
xlab="Fitted Values", ylab="Residuals",
main="Residuals Plot 1 for Linear Model")
abline(h=0,lm)
hist(resid(LifeExp_simple.lm), main = "Histogram of Residuals", xlab = "Residuals")
qqnorm(LifeExp_simple.lm$residuals)
qqline(LifeExp_simple.lm$residuals)
# Residual Analysis
ols_plot_resid_qq(LifeExp_simple.lm)
ols_plot_resid_hist(LifeExp_simple.lm)
ols_plot_resid_fit(LifeExp_simple.lm)
crPlots(LifeExp_simple.lm)
# Homoscedasticity
ncvTest(LifeExp_simple.lm)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.599177, Df = 1, p = 0.10692
# Independence
durbinWatsonTest(LifeExp_simple.lm)
## lag Autocorrelation D-W Statistic p-value
## 1 0.06872515 1.802451 0.172
## Alternative hypothesis: rho != 0
gvlma(LifeExp_simple.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Coefficients:
## (Intercept) TotExp
## 64.75337453 0.00006297
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = LifeExp_simple.lm)
##
## Value p-value Decision
## Global Stat 56.737011 0.00000000001405 Assumptions NOT satisfied!
## Skewness 30.532757 0.00000003282766 Assumptions NOT satisfied!
## Kurtosis 0.002804 0.95777263030755 Assumptions acceptable.
## Link Function 26.074703 0.00000032845930 Assumptions NOT satisfied!
## Heteroscedasticity 0.126747 0.72182921484679 Assumptions acceptable.
Looking at residuals plots it is clear that there is no constant variability and that residuals are not normally distributed. This model is not a good model.
Create new column with \(LifeExp^{4.6}\) values
data$LifeExp_4.6 <- (data$LifeExp)^4.6
Create new column with \(TotExp^{4.6}\) values
data$TotExp_0.06 <- (data$TotExp)^0.06
Plot total expenditures vs life expectancy for who data set
lm <- lm(LifeExp_4.6 ~ TotExp_0.06, data = data)
plot(data$TotExp_0.06, data$LifeExp_4.6, xlab = "Total Expenditures ^ 0.06", ylab = "Life Expectancy ^ 4.6", main='Average Life Expectancy vs Expenditure')
abline(lm)
Run simple linear regression on transformed LifeExp and TotExp
data_transformed.lm <- lm(LifeExp_4.6 ~ TotExp_0.06, data = data)
Run summary function and get the F statistics, R-squared, standard error, and p-values
summary(data_transformed.lm)
##
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_0.06, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_0.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
# Residuals variability plot
plot(data_transformed.lm$fitted.values, data_transformed.lm$residuals,
xlab="Fitted Values", ylab="Residuals",
main="Residuals Plot 2 for Linear Model")
abline(h=0)
hist(resid(data_transformed.lm), main = "Histogram of Residuals", xlab = "Residuals")
qqnorm(data_transformed.lm$residuals)
qqline(data_transformed.lm$residuals)
# Residual Analysis
ols_plot_resid_qq(data_transformed.lm)
ols_plot_resid_hist(data_transformed.lm)
ols_plot_resid_fit(data_transformed.lm)
crPlots(data_transformed.lm)
# Homoscedasticity
ncvTest(data_transformed.lm)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.4278017, Df = 1, p = 0.51307
# Independence
durbinWatsonTest(data_transformed.lm)
## lag Autocorrelation D-W Statistic p-value
## 1 0.04038156 1.909075 0.536
## Alternative hypothesis: rho != 0
gvlma(data_transformed.lm)
##
## Call:
## lm(formula = LifeExp_4.6 ~ TotExp_0.06, data = data)
##
## Coefficients:
## (Intercept) TotExp_0.06
## -736527909 620060216
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = data_transformed.lm)
##
## Value p-value Decision
## Global Stat 27.8117 0.00001362 Assumptions NOT satisfied!
## Skewness 17.1372 0.00003478 Assumptions NOT satisfied!
## Kurtosis 7.4581 0.00631520 Assumptions NOT satisfied!
## Link Function 2.9866 0.08395884 Assumptions acceptable.
## Heteroscedasticity 0.2299 0.63158532 Assumptions acceptable.
As per QQ plot line and residual vs fitted graph suggests that at least model 2 is closer to normal distribution than model 1. However, Since mean < median, we can say that the model is still negatively skewed. Residual standard error is 90490000 but since variables are rescaled, we cannot really say residual standard error for model 2 is higher than model 1. Relative to the size of TotExp_power coefficient, residual standard error rather decreased in model 2.
The F-statistic for the transformed model is 507.7 (same degrees of freedom as model from 1.) is much better vs. the the previous model. The p-value is even better. In the transformed model the R2=0.7298, which is a lot better than the previous model (25.77%), as it means the explained variability is 72.9% compared to 25% for the 1st model. As can be inferred, the second model is better than the first one.
Transformed model from (2) using equation:
\[ LifeExp_{4.6} = -736527910 + 620060216*TotExp^{0.06} \]
forecast_LifeExp <- function(TotExp_0.06_value){
y <- -736527910 + 620060216 *(TotExp_0.06_value)
y <- y^(1/4.6)
y
#return (-736527910 + 620060216 * (TotExp_0.06_value))^(1/4.6)
}
Forecast life expectancy when \(TotExp^{.06}\) =1.5
forecast_LifeExp(1.5)
## [1] 63.31153
Then forecast life expectancy when \(TotExp^{.06}\)=2.5
forecast_LifeExp(2.5)
## [1] 86.50645
Predicting the values at 1.5 and 2.5 provides the following results.
The prediction at 1.5 is 63 years.
The prediction at 2.5 is 87 years.
\[ LifeExp = b_0 + b_1 * PropMd + b_2 * TotExp + b_3 * PropMD * TotExp \]
#LifeExp_multiple.lm <- lm(data$LifeExp ~ data$PropMD + data$TotExp + (data$PropMD * data$TotExp))
data$PropMD_TotExp_0.06 <- data$PropMD * data$TotExp_0.06
#LifeExp_multiple.lm <- lm(data$LifeExp_4.6 ~ data$PropMD + data$TotExp_0.06 + data$PropMD:data$TotExp_0.06)
LifeExp_multiple.lm <- lm(data$LifeExp_4.6 ~ data$PropMD + data$TotExp_0.06 + data$PropMD_TotExp_0.06)
summary(LifeExp_multiple.lm)
##
## Call:
## lm(formula = data$LifeExp_4.6 ~ data$PropMD + data$TotExp_0.06 +
## data$PropMD_TotExp_0.06)
##
## Residuals:
## Min 1Q Median 3Q Max
## -296470018 -47729263 12183210 60285515 212311883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -724418697 50825046 -14.253 <2e-16 ***
## data$PropMD 47273338389 22577050673 2.094 0.0376 *
## data$TotExp_0.06 604795792 30232703 20.005 <2e-16 ***
## data$PropMD_TotExp_0.06 -21214671638 11308191249 -1.876 0.0622 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88520000 on 186 degrees of freedom
## Multiple R-squared: 0.7441, Adjusted R-squared: 0.74
## F-statistic: 180.3 on 3 and 186 DF, p-value: < 2.2e-16
hist(resid(LifeExp_multiple.lm), xlab = "Residuals")
plot(fitted(LifeExp_multiple.lm), resid(LifeExp_multiple.lm))
The F-statistic from our multiple regression model is 180.3. The p-value is strong for the model, and strong for all of the factors except PropMD x TotExp0.06 (0.0622). Finally, the \(R^2\)=0.7441, which means that the model accounts for 74.41% of variability in the data.
LifeExp_multiple.lm$coefficients
## (Intercept) data$PropMD data$TotExp_0.06
## -724418697 47273338389 604795792
## data$PropMD_TotExp_0.06
## -21214671638
\[ LifeExp_{4.6} = 724418697 + 47273338389 * PropMD + 604795792 * TotExp_{0.06} + -21214671638 * PropMD * TotExp_{0.06} \]
forecast_LifeExp_2 <- function(PropMD, TotExp_0.06){
LifeExp_4.6 <- 724418697 + 47273338389 * PropMD + 604795792 * TotExp_0.06 + -21214671638 * PropMD * TotExp_0.06
return(LifeExp_4.6^(1/4.6))
}
forecast_LifeExp_2(.03, 14^0.06)
## [1] 106.3698
This seems like an outlier as life expectancy of 106 does not look realistic. The problem with this model is that we are using the proportion of MDs living in a country as well as the total expenditure as independent variables for life expectancy, and we really do not know if these actually make a true difference in real life.
Life expentancy of propMD around 3%
LifeExp_propMD_.03 <- subset(data, data$PropMD >.02 & data$PropMD < .04, select = c("LifeExp", "PropMD", "TotExp"))
head(LifeExp_propMD_.03)
## LifeExp PropMD TotExp
## 45 80 0.03322813 40749
## 146 82 0.03512903 281653
The forecast for proportion of population that are medical doctors is 3% and total expenditure of $14 is about 82 years. Looking at the data below for propMD that falls around 3%, I do see that 80 to 82 years of life expectancy is present; however, the total expenditure is a lot higher than 14. So, I do not think that this is a realistic forecast.
Life expentancy of total expenditure around 14
The minimum total expenditure in the dataset is $13.
min(data$TotExp)
## [1] 13
To get an idea of the life expectancy of countries with total expenditure close to $14, below is a list of entries with total expenditures under $100. As you can see, only 1 entry has a total expenditure that is closest to $14, and the life expectancy of this country is 49 years. Also, for all the countries in this list, the propMD is very low (less than 1%).
LifeExp_TotExp_.04 <- subset(data, data$TotExp >= 13 & data$TotExp < 100, select = c("LifeExp", "PropMD", "TotExp"))
LifeExp_TotExp_.04
## LifeExp PropMD TotExp
## 14 63 0.000274894 87
## 28 49 0.000024500 13
## 47 47 0.000096100 71
## 56 63 0.000045800 88
## 58 56 0.000023900 70
## 70 53 0.000107505 87
## 118 62 0.000194783 80
## 122 42 0.000021500 94
Based on actual data, I would say that a life expectancy of 82 years for a country with propMD of 3% and total expenditure of 14 is not realistic. Countries with propMD of about 3% tend to have total expenditure that is way above $14.