## installed_and_loaded.packages.
## prettydoc TRUE
## dplyr TRUE
## ggplot2 TRUE
## gvlma TRUE
## MASS TRUE
## lmtest TRUE
## car TRUE
The Data
The attached who.csv dataset contains real-world data from 2008. The variables included following:
- Country: name of the country
- LifeExp: average life expectancy for the country in years
- InfantSurvival: proportion of those surviving to one year or more
- Under5Survival: proportion of those surviving to five years or more
- TBFree: proportion of the population without TB.
- PropMD: proportion of the population who are MDs
- PropRN: proportion of the population who are RNs
- PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
- GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
- TotExp: sum of personal and government expenditures.
url <- "https://raw.githubusercontent.com/kylegilde/D605-Math/master/who.csv"
who_df <- read.csv(url)
glimpse(who_df)## Observations: 190
## Variables: 10
## $ Country <fctr> Afghanistan, Albania, Algeria, Andorra, Angola...
## $ LifeExp <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74,...
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986...
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983...
## $ TBFree <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0....
## $ PropMD <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297...
## $ PropRN <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500...
## $ PersExp <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 378...
## $ GovtExp <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 185...
## $ TotExp <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 19...
Questions
1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
Model Summary
options(scipen = 6)
summary(who_df[c("LifeExp", "TotExp")])## LifeExp TotExp
## Min. :40.00 Min. : 13
## 1st Qu.:61.25 1st Qu.: 584
## Median :70.00 Median : 5541
## Mean :67.38 Mean : 41696
## 3rd Qu.:75.00 3rd Qu.: 26331
## Max. :83.00 Max. :482750
who_model <- lm(LifeExp ~ TotExp, who_df)
summary(who_model)##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.753374534 0.753536611 85.933 < 2e-16 ***
## TotExp 0.000062970 0.000007795 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
intercept <- coef(who_model)[1]
slope <- coef(who_model)[2]a <- ggplot(who_model, aes(TotExp, LifeExp))
a + geom_point() + geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)Model Interpretation
The residual summary statistics indicate a possibly skewed distribution, but we will confirm with further diagnostics.
Using the point estimates of the coefficients, the linear model is expressed as \(\widehat{LifeExp} = 64.475 + .00006*{TotExp}\)
For each additional dollar increase in TotExp, the model expects on average an increase of 0.00006 years in LifeExp.
The y-intercept’s standard error is more than 60 times smaller than its coefficient value, which means there is relatively little variability. The TotExp’s standard error is about 8 times smaller than its coefficient value. It is within the 5 to 10 times smaller range that typically characterizes a good model.
The coefficient t-values are quite large, which indicates that our point estimates are far from 0 and supports rejecting the null hypothesis that there is no relationhip between the variables.
Both the y-intercept & TotExp’s p-values are near zero, which means that the probability of observing this relationship due to chance is very small.
At 9.3 years, the residual standard error is relatively large. This is the average amount that our response variable will deviate from the regression line.
However, in this model, multiple \(R^2\) is 0.2537, which means that only \(25\%\) of the variance in the response variable can be explained by the independent variable.
The F-statistic at 65.26 is quite large, and the model’s p-value is near zero. If the model’s diagnostics are sufficient, these values indicate that we would reject the null hypothesis that there is no relationship between the variables.
Model Diagnostics
Let’s assess if this linear model is reliable.
Linearity: Do the variables have a linear relationship?
The Component+Residual plot shows that this is NOT a linear relationship.
crPlots(who_model)Normality: Are the model’s residuals LifeExpributed nearly normally?
No, per the histogram and Q-Q plot, the residuals are not anywhere near normally distributed.
sresid <- studres(who_model)
hist(sresid, freq = FALSE, breaks = 10, main = "LifeExpribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)plot(who_model, which = 2)Homoscedasticity: Is there constant variability among the residuals?
Based on the scatter plot, the residuals are not indistinctly distributed.
The Non-constant Variance Score Test has a p-value of >.05, which indicates that we would fail to reject the null hypothesis of homoscedasticity.
plot(who_model, which = 1)ncvTest(who_model)## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.599177 Df = 1 p = 0.1069193
Independence: Are the data from a random sample and not from a time series?
The Durbin Watson test’s p-value is >.05. Therefore, we fail to reject the null hypothesis of independence (no autocorrelation).
durbinWatsonTest(who_model)## lag Autocorrelation D-W Statistic p-value
## 1 0.06872515 1.802451 0.166
## Alternative hypothesis: rho != 0
Conclusion
While the coefficient values are likely relevant to modeling LifeExp, based upon my diagnostics and the gvlma function, the conditions for linear regression have not been met.
gvlma(who_model)##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
##
## Coefficients:
## (Intercept) TotExp
## 64.75337453 0.00006297
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = who_model)
##
## Value p-value Decision
## Global Stat 56.737011 1.405e-11 Assumptions NOT satisfied!
## Skewness 30.532757 3.283e-08 Assumptions NOT satisfied!
## Kurtosis 0.002804 9.578e-01 Assumptions acceptable.
## Link Function 26.074703 3.285e-07 Assumptions NOT satisfied!
## Heteroscedasticity 0.126747 7.218e-01 Assumptions acceptable.
plot(gvlma(who_model))2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
Model Summary
options(scipen = 0)
options(digits = 6)
who_df$LifeExp4.6 <- who_df$LifeExp^4.6
who_df$TotExp.06 <- who_df$TotExp^0.06
who_model <- lm(LifeExp4.6 ~ TotExp.06, who_df)
summary(who_model)##
## Call:
## lm(formula = LifeExp4.6 ~ TotExp.06, data = who_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.09e+08 -5.40e+07 1.37e+07 5.91e+07 2.12e+08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.37e+08 4.68e+07 -15.7 <2e-16 ***
## TotExp.06 6.20e+08 2.75e+07 22.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90500000 on 188 degrees of freedom
## Multiple R-squared: 0.73, Adjusted R-squared: 0.728
## F-statistic: 508 on 1 and 188 DF, p-value: <2e-16
intercept <- coef(who_model)[1]
slope <- coef(who_model)[2]a <- ggplot(who_model, aes(TotExp.06, LifeExp4.6))
a + geom_point() + geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)Model Interpretation
The residual summary statistics indicate a possibly skewed distribution not centered around zero, but we will confirm with further diagnostics.
Using the point estimates of the coefficients, the linear model is expressed as \(\widehat{LifeExp4.6} = 64.475 + 620060216*{TotExp.06} - 736527910\)
For each additional unit increase in TotExp.06, the model expects on average a unit increase of 620060216 in LifeExp4.6.
The y-intercept’s standard error is more than 15 times smaller than its coefficient value. The TotExp.06’s standard error is about 22 times smaller than its coefficient value. Both of these ratios are indicative of a good model with relatively little variance in its coefficients.
The coefficient t-values are large, which indicates that our point estimates are far from 0 and supports rejecting the null hypothesis that there is no relationhip between the variables.
Both the y-intercept & TotExp.06’s p-values are near zero, which means that the probability of observing this relationship due to chance is very small.
The Residual Standard Error seems some what large given that 90500000^(1/4.6) = 53.6669 years. This is the average amount that our response variable will deviate from the regression line.
90500000^(1/4.6)## [1] 53.6669
In this model, multiple \(R^2\) is 0.728, which means that \(72\%\) of the variance in the response variable can be explained by the independent variable.
The F-statistic at 508 is quite large, and the model’s p-value is near zero. If the model’s diagnostics are sufficient, these values indicate that we would reject the null hypothesis that there is no relationship between the variables.
Model Diagnostics
Let’s assess if this linear model is reliable.
Linearity: Do the variables have a linear relationship?
The Component+Residual plot shows that this is now a linear relationship.
crPlots(who_model)Normality: Are the model’s residuals LifeExp4.6ributed nearly normally?
Per the histogram and Q-Q plot, the residuals are more normally distributed. However, they are left skewed.
sresid <- studres(who_model)
hist(sresid, freq = FALSE, breaks = 10, main = "LifeExp4.6ribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)plot(who_model, which = 2)Homoscedasticity: Is there constant variability among the residuals?
Based on the scatter plot, the residuals are much more indistinctly distributed vis-a-vis the fitted values, but not completely so.
The Non-constant Variance Score Test has a p-value of >.05, which indicates that we would fail to reject the null hypothesis of homoscedasticity.
plot(who_model, which = 1)ncvTest(who_model)## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.427802 Df = 1 p = 0.51307
Independence: Are the data from a random sample and not from a time series?
The Durbin Watson test’s p-value is >.05. Therefore, we fail to reject the null hypothesis of independence of no autocorrelation).
durbinWatsonTest(who_model)## lag Autocorrelation D-W Statistic p-value
## 1 0.0403816 1.90907 0.524
## Alternative hypothesis: rho != 0
Conclusion
With Adjusted R-squared at 0.74, a large F-statistic and tiny p-value, the second model is better than the first. Based upon my diagnostics and the gvlma function, the second model is much closer to meeting the conditions for linear regression, but not sufficiently still.
gvlma(who_model)##
## Call:
## lm(formula = LifeExp4.6 ~ TotExp.06, data = who_df)
##
## Coefficients:
## (Intercept) TotExp.06
## -7.37e+08 6.20e+08
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = who_model)
##
## Value p-value Decision
## Global Stat 27.81 1.36e-05 Assumptions NOT satisfied!
## Skewness 17.14 3.48e-05 Assumptions NOT satisfied!
## Kurtosis 7.46 6.32e-03 Assumptions NOT satisfied!
## Link Function 2.99 8.40e-02 Assumptions acceptable.
## Heteroscedasticity 0.23 6.32e-01 Assumptions acceptable.
plot(gvlma(who_model))3. Using the results from 3, forecast life expectancy when TotExp.06^.06 =1.5. Then forecast life expectancy when TotExp.06^.06=2.5.
as.numeric(intercept + slope * 2.5)## [1] 813622630
4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp4.6 = b0 + b1 x PropMd + b2 x TotExp.06 + b3 x PropMD x TotExp.06
Model Summary
who_model <- lm(LifeExp4.6 ~ PropMD + TotExp.06 + PropMD * TotExp.06, who_df)
summary(who_model)##
## Call:
## lm(formula = LifeExp4.6 ~ PropMD + TotExp.06 + PropMD * TotExp.06,
## data = who_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.96e+08 -4.77e+07 1.22e+07 6.03e+07 2.12e+08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.24e+08 5.08e+07 -14.25 <2e-16 ***
## PropMD 4.73e+10 2.26e+10 2.09 0.038 *
## TotExp.06 6.05e+08 3.02e+07 20.00 <2e-16 ***
## PropMD:TotExp.06 -2.12e+10 1.13e+10 -1.88 0.062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 88500000 on 186 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.74
## F-statistic: 180 on 3 and 186 DF, p-value: <2e-16
Model Interpretation
From the residual summary statistics, there may be a possibly skewed distribution not centered around zero, but we will confirm with further diagnostics.
Using the point estimates of the coefficients, the linear model is expressed as $ = - 7.24e+08 + 4.73e+10PropMD + 6.05e+08TotExp.06 - 2.12e+10PropMD * TotExp.06 $
For each additional unit increase in PropMD, the model expects on average an increase of 4.73e+10 in LifeExp4.6, holding all other variables constant. For each additional unit increase in TotExp.06, the model expects on average an increase of 6.05e+08 in LifeExp4.6, holding all other variables constant. For each additional unit increase in PropMD:TotExp.06, the model expects on average a decrease of 2.12e+10 in LifeExp4.6, holding all other variables constant.
The ratio of the coefficents to the standard errors is large for the y-intercept and TotExp.06, but is small for PropMD and PropMD:TotExp.06. Consequently, we can expect little variability in the coefficient estimates of the former 2 and greater variability in the later 2.
coef(summary(who_model))[, 1]/coef(summary(who_model))[, 2]## (Intercept) PropMD TotExp.06 PropMD:TotExp.06
## -14.25318 2.09387 20.00469 -1.87604
The coefficient’s t-values are not very large for PropMD and PropMD:TotExp.06 and their p-values are not near zero like the other 2. PropMD:TotExp.06’s p-value is larger than our significance thresdold.
The Residual Standard Error seems rather large since 88500000^(1/4.6) = 53.4068 years. This is the average amount that our response variable will deviate from the regression line.
88500000^(1/4.6)## [1] 53.4068
In this model, multiple \(R^2\) is 0.74, which means that \(74\%\) of the variance in the response variable can be explained by the independent variable.
The F-statistic at 180 is quite large, and the model’s p-value is near zero. If the model’s diagnostics are sufficient, these values indicate that we would reject the null hypothesis that there is no relationship between the variables.
Model Diagnostics
Let’s assess if this linear model is reliable.
Linearity: Do the variables have a linear relationship?
The relationships between the independent variables and LifeExp4.6 all appear to be linear.
who_df$PropMDTotExp.06 <- who_df$PropMD * who_df$TotExp.06
pairs(subset(who_df, select = c(LifeExp4.6, PropMD, TotExp.06, PropMDTotExp.06)))Normality: Are the model’s residuals LifeExp4.6ributed nearly normally?
Per the histogram and Q-Q plot, the residuals are left skewed.
sresid <- studres(who_model)
hist(sresid, freq = FALSE, breaks = 10, main = "LifeExp4.6ribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)plot(who_model, which = 2)Homoscedasticity: Is there constant variability among the residuals?
Based on the scatter plot, the residuals are somewhat curvilinear.
The Non-constant Variance Score Test has a p-value of >.05, which indicates that we would fail to reject the null hypothesis of homoscedasticity.
plot(who_model, which = 1)ncvTest(who_model)## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.401179 Df = 1 p = 0.526481
Independence: Are the data from a random sample and not from a time series?
The Durbin Watson test’s p-value is >.05. Therefore, we fail to reject the null hypothesis of independence, i.e. no autocorrelation).
durbinWatsonTest(who_model)## lag Autocorrelation D-W Statistic p-value
## 1 -0.0037112 1.99799 0.966
## Alternative hypothesis: rho != 0
Conclusion
While the Adjusted R-squared at .74 is slightly higher than the 2nd model, I would say that this more complex model is weaker because one of the coefficient p-values is greater than the significance threshold, and 2 of them have large standard errors.
gvlma(who_model)##
## Call:
## lm(formula = LifeExp4.6 ~ PropMD + TotExp.06 + PropMD * TotExp.06,
## data = who_df)
##
## Coefficients:
## (Intercept) PropMD TotExp.06 PropMD:TotExp.06
## -7.24e+08 4.73e+10 6.05e+08 -2.12e+10
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = who_model)
##
## Value p-value Decision
## Global Stat 26.556 2.44e-05 Assumptions NOT satisfied!
## Skewness 15.231 9.51e-05 Assumptions NOT satisfied!
## Kurtosis 6.681 9.74e-03 Assumptions NOT satisfied!
## Link Function 4.345 3.71e-02 Assumptions NOT satisfied!
## Heteroscedasticity 0.298 5.85e-01 Assumptions acceptable.
plot(gvlma(who_model))5. Forecast LifeExp4.6 when PropMD=.03 and TotExp.06= 14. Does this forecast seem realistic? Why or why not?
No, Given that PropMD = .03 is near the max value and TotExp.06 = 14 is much greater than the variable’s known range, a predicted value of 66.977, which is less than the mean and median does not seem realistic.
options(scipen = 999)
summary(subset(who_df, select = c(PropMD, TotExp.06, LifeExp)))## PropMD TotExp.06 LifeExp
## Min. :0.0000196 Min. :1.166 Min. :40.00
## 1st Qu.:0.0002444 1st Qu.:1.465 1st Qu.:61.25
## Median :0.0010474 Median :1.677 Median :70.00
## Mean :0.0017954 Mean :1.684 Mean :67.38
## 3rd Qu.:0.0024584 3rd Qu.:1.842 3rd Qu.:75.00
## Max. :0.0351290 Max. :2.193 Max. :83.00
PropMD <- 0.03
TotExp.06 <- 14
coefs <- as.numeric(coef(who_model))
predicted_value <- coefs[1] + coefs[2] * PropMD + coefs[3] * TotExp.06 + coefs[4] *
PropMD * TotExp.06
predicted_value^(1/4.6)## [1] 66.97703
References
https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
Interpreting Model Output In R
Global Validation of Linear Models Assumptions
http://r-statistics.co/Assumptions-of-Linear-Regression.html
https://www.statmethods.net/stats/regression.html
http://ademos.people.uic.edu/Chapter12.html#121_example_2:_a_thanksgiving_not_to_remember,_part_2
http://www.statisticshowto.com/durbin-watson-test-coefficient/
http://www.ianruginski.com/regressionassumptionswithR_tutorial.html