DATA 605 FUNDAMENTALS OF COMPUTATIONAL MATHEMATICS

Homework Week 12: Multiple Linear Regression

Kyle Gilde

11/16/2017

##           installed_and_loaded.packages.
## prettydoc                           TRUE
## dplyr                               TRUE
## ggplot2                             TRUE
## gvlma                               TRUE
## MASS                                TRUE
## lmtest                              TRUE
## car                                 TRUE

The Data

The attached who.csv dataset contains real-world data from 2008. The variables included following:

  • Country: name of the country
  • LifeExp: average life expectancy for the country in years
  • InfantSurvival: proportion of those surviving to one year or more
  • Under5Survival: proportion of those surviving to five years or more
  • TBFree: proportion of the population without TB.
  • PropMD: proportion of the population who are MDs
  • PropRN: proportion of the population who are RNs
  • PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
  • GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
  • TotExp: sum of personal and government expenditures.
url <- "https://raw.githubusercontent.com/kylegilde/D605-Math/master/who.csv"

who_df <- read.csv(url)

glimpse(who_df)
## Observations: 190
## Variables: 10
## $ Country        <fctr> Afghanistan, Albania, Algeria, Andorra, Angola...
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74,...
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986...
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983...
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0....
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297...
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500...
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 378...
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 185...
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 19...

Questions

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

Model Summary

options(scipen = 6)
summary(who_df[c("LifeExp", "TotExp")])
##     LifeExp          TotExp      
##  Min.   :40.00   Min.   :    13  
##  1st Qu.:61.25   1st Qu.:   584  
##  Median :70.00   Median :  5541  
##  Mean   :67.38   Mean   : 41696  
##  3rd Qu.:75.00   3rd Qu.: 26331  
##  Max.   :83.00   Max.   :482750
who_model <- lm(LifeExp ~ TotExp, who_df)
summary(who_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)    
## (Intercept) 64.753374534  0.753536611  85.933  < 2e-16 ***
## TotExp       0.000062970  0.000007795   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
intercept <- coef(who_model)[1]
slope <- coef(who_model)[2]
a <- ggplot(who_model, aes(TotExp, LifeExp))
a + geom_point() + geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)

Model Interpretation

  • The residual summary statistics indicate a possibly skewed distribution, but we will confirm with further diagnostics.

  • Using the point estimates of the coefficients, the linear model is expressed as \(\widehat{LifeExp} = 64.475 + .00006*{TotExp}\)

  • For each additional dollar increase in TotExp, the model expects on average an increase of 0.00006 years in LifeExp.

  • The y-intercept’s standard error is more than 60 times smaller than its coefficient value, which means there is relatively little variability. The TotExp’s standard error is about 8 times smaller than its coefficient value. It is within the 5 to 10 times smaller range that typically characterizes a good model.

  • The coefficient t-values are quite large, which indicates that our point estimates are far from 0 and supports rejecting the null hypothesis that there is no relationhip between the variables.

  • Both the y-intercept & TotExp’s p-values are near zero, which means that the probability of observing this relationship due to chance is very small.

  • At 9.3 years, the residual standard error is relatively large. This is the average amount that our response variable will deviate from the regression line.

  • However, in this model, multiple \(R^2\) is 0.2537, which means that only \(25\%\) of the variance in the response variable can be explained by the independent variable.

  • The F-statistic at 65.26 is quite large, and the model’s p-value is near zero. If the model’s diagnostics are sufficient, these values indicate that we would reject the null hypothesis that there is no relationship between the variables.

Model Diagnostics

Let’s assess if this linear model is reliable.

Linearity: Do the variables have a linear relationship?

The Component+Residual plot shows that this is NOT a linear relationship.

crPlots(who_model)

Normality: Are the model’s residuals LifeExpributed nearly normally?

No, per the histogram and Q-Q plot, the residuals are not anywhere near normally distributed.

sresid <- studres(who_model)
hist(sresid, freq = FALSE, breaks = 10, main = "LifeExpribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)

plot(who_model, which = 2)

Homoscedasticity: Is there constant variability among the residuals?

  • Based on the scatter plot, the residuals are not indistinctly distributed.

  • The Non-constant Variance Score Test has a p-value of >.05, which indicates that we would fail to reject the null hypothesis of homoscedasticity.

plot(who_model, which = 1)

ncvTest(who_model)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.599177    Df = 1     p = 0.1069193

Independence: Are the data from a random sample and not from a time series?

The Durbin Watson test’s p-value is >.05. Therefore, we fail to reject the null hypothesis of independence (no autocorrelation).

durbinWatsonTest(who_model)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.06872515      1.802451   0.166
##  Alternative hypothesis: rho != 0

Conclusion

While the coefficient values are likely relevant to modeling LifeExp, based upon my diagnostics and the gvlma function, the conditions for linear regression have not been met.

gvlma(who_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
## 
## Coefficients:
## (Intercept)       TotExp  
## 64.75337453   0.00006297  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = who_model) 
## 
##                        Value   p-value                   Decision
## Global Stat        56.737011 1.405e-11 Assumptions NOT satisfied!
## Skewness           30.532757 3.283e-08 Assumptions NOT satisfied!
## Kurtosis            0.002804 9.578e-01    Assumptions acceptable.
## Link Function      26.074703 3.285e-07 Assumptions NOT satisfied!
## Heteroscedasticity  0.126747 7.218e-01    Assumptions acceptable.
plot(gvlma(who_model))

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Model Summary

options(scipen = 0)
options(digits = 6)

who_df$LifeExp4.6 <- who_df$LifeExp^4.6
who_df$TotExp.06 <- who_df$TotExp^0.06

who_model <- lm(LifeExp4.6 ~ TotExp.06, who_df)
summary(who_model)
## 
## Call:
## lm(formula = LifeExp4.6 ~ TotExp.06, data = who_df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.09e+08 -5.40e+07  1.37e+07  5.91e+07  2.12e+08 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.37e+08   4.68e+07   -15.7   <2e-16 ***
## TotExp.06    6.20e+08   2.75e+07    22.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90500000 on 188 degrees of freedom
## Multiple R-squared:  0.73,   Adjusted R-squared:  0.728 
## F-statistic:  508 on 1 and 188 DF,  p-value: <2e-16
intercept <- coef(who_model)[1]
slope <- coef(who_model)[2]
a <- ggplot(who_model, aes(TotExp.06, LifeExp4.6))
a + geom_point() + geom_abline(slope = slope, intercept = intercept, show.legend = TRUE)

Model Interpretation

  • The residual summary statistics indicate a possibly skewed distribution not centered around zero, but we will confirm with further diagnostics.

  • Using the point estimates of the coefficients, the linear model is expressed as \(\widehat{LifeExp4.6} = 64.475 + 620060216*{TotExp.06} - 736527910\)

  • For each additional unit increase in TotExp.06, the model expects on average a unit increase of 620060216 in LifeExp4.6.

  • The y-intercept’s standard error is more than 15 times smaller than its coefficient value. The TotExp.06’s standard error is about 22 times smaller than its coefficient value. Both of these ratios are indicative of a good model with relatively little variance in its coefficients.

  • The coefficient t-values are large, which indicates that our point estimates are far from 0 and supports rejecting the null hypothesis that there is no relationhip between the variables.

  • Both the y-intercept & TotExp.06’s p-values are near zero, which means that the probability of observing this relationship due to chance is very small.

  • The Residual Standard Error seems some what large given that 90500000^(1/4.6) = 53.6669 years. This is the average amount that our response variable will deviate from the regression line.

90500000^(1/4.6)
## [1] 53.6669
  • In this model, multiple \(R^2\) is 0.728, which means that \(72\%\) of the variance in the response variable can be explained by the independent variable.

  • The F-statistic at 508 is quite large, and the model’s p-value is near zero. If the model’s diagnostics are sufficient, these values indicate that we would reject the null hypothesis that there is no relationship between the variables.

Model Diagnostics

Let’s assess if this linear model is reliable.

Linearity: Do the variables have a linear relationship?

The Component+Residual plot shows that this is now a linear relationship.

crPlots(who_model)

Normality: Are the model’s residuals LifeExp4.6ributed nearly normally?

Per the histogram and Q-Q plot, the residuals are more normally distributed. However, they are left skewed.

sresid <- studres(who_model)
hist(sresid, freq = FALSE, breaks = 10, main = "LifeExp4.6ribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)

plot(who_model, which = 2)

Homoscedasticity: Is there constant variability among the residuals?

  • Based on the scatter plot, the residuals are much more indistinctly distributed vis-a-vis the fitted values, but not completely so.

  • The Non-constant Variance Score Test has a p-value of >.05, which indicates that we would fail to reject the null hypothesis of homoscedasticity.

plot(who_model, which = 1)

ncvTest(who_model)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.427802    Df = 1     p = 0.51307

Independence: Are the data from a random sample and not from a time series?

The Durbin Watson test’s p-value is >.05. Therefore, we fail to reject the null hypothesis of independence of no autocorrelation).

durbinWatsonTest(who_model)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.0403816       1.90907   0.524
##  Alternative hypothesis: rho != 0

Conclusion

With Adjusted R-squared at 0.74, a large F-statistic and tiny p-value, the second model is better than the first. Based upon my diagnostics and the gvlma function, the second model is much closer to meeting the conditions for linear regression, but not sufficiently still.

gvlma(who_model)
## 
## Call:
## lm(formula = LifeExp4.6 ~ TotExp.06, data = who_df)
## 
## Coefficients:
## (Intercept)    TotExp.06  
##   -7.37e+08     6.20e+08  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = who_model) 
## 
##                    Value  p-value                   Decision
## Global Stat        27.81 1.36e-05 Assumptions NOT satisfied!
## Skewness           17.14 3.48e-05 Assumptions NOT satisfied!
## Kurtosis            7.46 6.32e-03 Assumptions NOT satisfied!
## Link Function       2.99 8.40e-02    Assumptions acceptable.
## Heteroscedasticity  0.23 6.32e-01    Assumptions acceptable.
plot(gvlma(who_model))

3. Using the results from 3, forecast life expectancy when TotExp.06^.06 =1.5. Then forecast life expectancy when TotExp.06^.06=2.5.

as.numeric(intercept + slope * 2.5)
## [1] 813622630

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp4.6 = b0 + b1 x PropMd + b2 x TotExp.06 + b3 x PropMD x TotExp.06

Model Summary

who_model <- lm(LifeExp4.6 ~ PropMD + TotExp.06 + PropMD * TotExp.06, who_df)

summary(who_model)
## 
## Call:
## lm(formula = LifeExp4.6 ~ PropMD + TotExp.06 + PropMD * TotExp.06, 
##     data = who_df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -2.96e+08 -4.77e+07  1.22e+07  6.03e+07  2.12e+08 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.24e+08   5.08e+07  -14.25   <2e-16 ***
## PropMD            4.73e+10   2.26e+10    2.09    0.038 *  
## TotExp.06         6.05e+08   3.02e+07   20.00   <2e-16 ***
## PropMD:TotExp.06 -2.12e+10   1.13e+10   -1.88    0.062 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 88500000 on 186 degrees of freedom
## Multiple R-squared:  0.744,  Adjusted R-squared:  0.74 
## F-statistic:  180 on 3 and 186 DF,  p-value: <2e-16

Model Interpretation

  • From the residual summary statistics, there may be a possibly skewed distribution not centered around zero, but we will confirm with further diagnostics.

  • Using the point estimates of the coefficients, the linear model is expressed as $ = - 7.24e+08 + 4.73e+10PropMD + 6.05e+08TotExp.06 - 2.12e+10PropMD * TotExp.06 $

  • For each additional unit increase in PropMD, the model expects on average an increase of 4.73e+10 in LifeExp4.6, holding all other variables constant. For each additional unit increase in TotExp.06, the model expects on average an increase of 6.05e+08 in LifeExp4.6, holding all other variables constant. For each additional unit increase in PropMD:TotExp.06, the model expects on average a decrease of 2.12e+10 in LifeExp4.6, holding all other variables constant.

  • The ratio of the coefficents to the standard errors is large for the y-intercept and TotExp.06, but is small for PropMD and PropMD:TotExp.06. Consequently, we can expect little variability in the coefficient estimates of the former 2 and greater variability in the later 2.

coef(summary(who_model))[, 1]/coef(summary(who_model))[, 2]
##      (Intercept)           PropMD        TotExp.06 PropMD:TotExp.06 
##        -14.25318          2.09387         20.00469         -1.87604
  • The coefficient’s t-values are not very large for PropMD and PropMD:TotExp.06 and their p-values are not near zero like the other 2. PropMD:TotExp.06’s p-value is larger than our significance thresdold.

  • The Residual Standard Error seems rather large since 88500000^(1/4.6) = 53.4068 years. This is the average amount that our response variable will deviate from the regression line.

88500000^(1/4.6)
## [1] 53.4068
  • In this model, multiple \(R^2\) is 0.74, which means that \(74\%\) of the variance in the response variable can be explained by the independent variable.

  • The F-statistic at 180 is quite large, and the model’s p-value is near zero. If the model’s diagnostics are sufficient, these values indicate that we would reject the null hypothesis that there is no relationship between the variables.

Model Diagnostics

Let’s assess if this linear model is reliable.

Linearity: Do the variables have a linear relationship?

The relationships between the independent variables and LifeExp4.6 all appear to be linear.

who_df$PropMDTotExp.06 <- who_df$PropMD * who_df$TotExp.06
pairs(subset(who_df, select = c(LifeExp4.6, PropMD, TotExp.06, PropMDTotExp.06)))

Normality: Are the model’s residuals LifeExp4.6ributed nearly normally?

Per the histogram and Q-Q plot, the residuals are left skewed.

sresid <- studres(who_model)
hist(sresid, freq = FALSE, breaks = 10, main = "LifeExp4.6ribution of Studentized Residuals")
xfit <- seq(min(sresid), max(sresid), length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)

plot(who_model, which = 2)

Homoscedasticity: Is there constant variability among the residuals?

  • Based on the scatter plot, the residuals are somewhat curvilinear.

  • The Non-constant Variance Score Test has a p-value of >.05, which indicates that we would fail to reject the null hypothesis of homoscedasticity.

plot(who_model, which = 1)

ncvTest(who_model)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.401179    Df = 1     p = 0.526481

Independence: Are the data from a random sample and not from a time series?

The Durbin Watson test’s p-value is >.05. Therefore, we fail to reject the null hypothesis of independence, i.e. no autocorrelation).

durbinWatsonTest(who_model)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.0037112       1.99799   0.966
##  Alternative hypothesis: rho != 0

Conclusion

While the Adjusted R-squared at .74 is slightly higher than the 2nd model, I would say that this more complex model is weaker because one of the coefficient p-values is greater than the significance threshold, and 2 of them have large standard errors.

gvlma(who_model)
## 
## Call:
## lm(formula = LifeExp4.6 ~ PropMD + TotExp.06 + PropMD * TotExp.06, 
##     data = who_df)
## 
## Coefficients:
##      (Intercept)            PropMD         TotExp.06  PropMD:TotExp.06  
##        -7.24e+08          4.73e+10          6.05e+08         -2.12e+10  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = who_model) 
## 
##                     Value  p-value                   Decision
## Global Stat        26.556 2.44e-05 Assumptions NOT satisfied!
## Skewness           15.231 9.51e-05 Assumptions NOT satisfied!
## Kurtosis            6.681 9.74e-03 Assumptions NOT satisfied!
## Link Function       4.345 3.71e-02 Assumptions NOT satisfied!
## Heteroscedasticity  0.298 5.85e-01    Assumptions acceptable.
plot(gvlma(who_model))

5. Forecast LifeExp4.6 when PropMD=.03 and TotExp.06= 14. Does this forecast seem realistic? Why or why not?

No, Given that PropMD = .03 is near the max value and TotExp.06 = 14 is much greater than the variable’s known range, a predicted value of 66.977, which is less than the mean and median does not seem realistic.

options(scipen = 999)
summary(subset(who_df, select = c(PropMD, TotExp.06, LifeExp)))
##      PropMD            TotExp.06        LifeExp     
##  Min.   :0.0000196   Min.   :1.166   Min.   :40.00  
##  1st Qu.:0.0002444   1st Qu.:1.465   1st Qu.:61.25  
##  Median :0.0010474   Median :1.677   Median :70.00  
##  Mean   :0.0017954   Mean   :1.684   Mean   :67.38  
##  3rd Qu.:0.0024584   3rd Qu.:1.842   3rd Qu.:75.00  
##  Max.   :0.0351290   Max.   :2.193   Max.   :83.00
PropMD <- 0.03
TotExp.06 <- 14
coefs <- as.numeric(coef(who_model))

predicted_value <- coefs[1] + coefs[2] * PropMD + coefs[3] * TotExp.06 + coefs[4] * 
    PropMD * TotExp.06

predicted_value^(1/4.6)
## [1] 66.97703