MSDS Spring 2018

DATA 605 Fundamentals of Computational Mathematics

Jiadi Li

HW #12 - Multiple Regression

The who.csv dataset contains real-world data from 2008. The variables are described below.

Variable Name: Description
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures


0. Load the Data

who <- read.csv('https://raw.githubusercontent.com/xiaoxiaogao-DD/store/master/who.csv')

head(who)
##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294      20      92    112
## 2 0.001143127 0.004614439     169    3128   3297
## 3 0.001060478 0.002091362     108    5184   5292
## 4 0.003297297 0.003500000    2589  169725 172314
## 5 0.000070400 0.001146162      36    1620   1656
## 6 0.000142857 0.002773810     503   12543  13046
summary(who)
##                 Country       LifeExp      InfantSurvival  
##  Afghanistan        :  1   Min.   :40.00   Min.   :0.8350  
##  Albania            :  1   1st Qu.:61.25   1st Qu.:0.9433  
##  Algeria            :  1   Median :70.00   Median :0.9785  
##  Andorra            :  1   Mean   :67.38   Mean   :0.9624  
##  Angola             :  1   3rd Qu.:75.00   3rd Qu.:0.9910  
##  Antigua and Barbuda:  1   Max.   :83.00   Max.   :0.9980  
##  (Other)            :184                                   
##  Under5Survival       TBFree           PropMD              PropRN         
##  Min.   :0.7310   Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883  
##  1st Qu.:0.9253   1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455  
##  Median :0.9745   Median :0.9992   Median :0.0010474   Median :0.0027584  
##  Mean   :0.9459   Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336  
##  3rd Qu.:0.9900   3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164  
##  Max.   :0.9970   Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387  
##                                                                           
##     PersExp           GovtExp             TotExp      
##  Min.   :   3.00   Min.   :    10.0   Min.   :    13  
##  1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584  
##  Median : 199.50   Median :  5385.0   Median :  5541  
##  Mean   : 742.00   Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :6350.00   Max.   :476420.0   Max.   :482750  
## 
hist(who$LifeExp,main = 'Histogram: Life Expectancy in each country',xlab = 'Life Expectancy')




1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

cor(who$LifeExp,who$TotExp)
## [1] 0.5076339
m1 <- lm(LifeExp ~ TotExp, data = who)
summary(m1)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
plot(who$TotExp,who$LifeExp,main = 'Scatterplot & Regression Model: Life Expectancy ~ Expenditures',xlab = 'Predictor: personal and government expenditures',ylab = 'avg life expectancy (yr)')
abline(m1)

p-values: The p-values of the intercept and of TotExp are both far below 0.05, so the null hypothesis that the corresponding coefficient is 0 is rejected and TotExp is a statistically significant predictor of LifeExp (the three stars next to each p-value indicate the same).

F-statistic and standard error: Both are related to goodness of fit. The F-statistic tests whether the model explains more variation than an intercept-only model; it is not a percentage. Here F = 65.26 on 1 and 188 degrees of freedom with a p-value of 7.714e-14, so the model as a whole is statistically significant. The residual standard error is the typical amount by which an observed response deviates from the fitted regression line. Since this model measures life expectancy in years and LifeExp only ranges from 40 to 83, a residual standard error of 9.371 years is relatively high, especially compared with the small contribution of TotExp.
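As a rough check on how the F-statistic relates to the other numbers in the summary, the sketch below recomputes it from the printed R\(^2\) and from the printed t value of TotExp. The variable names are made up for illustration, and the inputs are the rounded values from the output above, so the results should only agree with the printed F-statistic approximately.

```r
# Values copied from summary(m1) above (rounded as printed).
r_squared <- 0.2577   # Multiple R-squared
t_totexp  <- 8.079    # t value of the TotExp coefficient
n_obs     <- 190      # 188 residual df + 2 estimated coefficients
k_pred    <- 1        # number of predictors

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
df_resid  <- n_obs - k_pred - 1
f_from_r2 <- (r_squared / k_pred) / ((1 - r_squared) / df_resid)

# With a single predictor, F equals the squared t statistic of the slope.
f_from_t <- t_totexp^2

# The upper tail of the F distribution gives the p-value of the whole model.
p_value <- pf(f_from_r2, df1 = k_pred, df2 = df_resid, lower.tail = FALSE)

print(c(f_from_r2 = f_from_r2, f_from_t = f_from_t, p_value = p_value))
```

Both routes should land close to the printed 65.26, which shows that in simple regression the F test and the t test of the slope carry the same information.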

R\(^2\): R\(^2\) is the proportion of variation in the dependent (response) variable that is explained by the model. Adjusted R\(^2\) penalizes that value for the number of terms in the model. Both R\(^2\) and adjusted R\(^2\) are relatively low in this model, only approximately 25%, but other measures should be evaluated before we discard a model because of a low R\(^2\) value.
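To make the two definitions concrete, here is a small sketch that rebuilds R\(^2\) and adjusted R\(^2\) from the correlation printed earlier. It assumes n = 190 countries (188 residual degrees of freedom plus two coefficients); the variable names are illustrative only.

```r
r_xy   <- 0.5076339  # cor(who$LifeExp, who$TotExp) as printed above
n_obs  <- 190        # number of countries (188 residual df + 2 coefficients)
k_pred <- 1          # number of predictors in m1

# In simple linear regression, R^2 is the squared correlation.
r_squared <- r_xy^2

# Adjusted R^2 shrinks R^2 according to the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k_pred - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```

With only one predictor the penalty is small, which is why the two values in the summary are so close.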

Assumptions of simple linear regression: To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
(1) linearity: based on the scatterplot above, the relationship between TotExp and LifeExp is not linear.
(2) nearly normal residuals: based on the QQ-plot below, there are many discrepancies between the reference line and the points formed by the residuals.

qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

(3) constant variability: most of the observations are concentrated in the low range of expenditure, so the residuals are not spread evenly across the fitted values.

plot(fitted(m1),resid(m1))
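The three checks can also be backed by numbers. The sketch below is an illustration on synthetic data only (not who.csv): it generates a skewed predictor and a response that depends on its logarithm, then compares a straight-line fit with a fit on the log scale. The data-generating values, the seed, and all object names are assumptions chosen for the example.

```r
set.seed(605)  # arbitrary seed so the sketch is reproducible

# Synthetic data: a right-skewed predictor and a response that follows log(x).
n <- 200
x <- rexp(n, rate = 1 / 20000)
y <- 40 + 4 * log(x) + rnorm(n, sd = 3)

fit_raw <- lm(y ~ x)        # straight line on the raw scale (misspecified)
fit_log <- lm(y ~ log(x))   # matches how the data were generated

# Shapiro-Wilk test: a small p-value suggests non-normal residuals.
sw_raw <- shapiro.test(resid(fit_raw))$p.value
sw_log <- shapiro.test(resid(fit_log))$p.value

# Leftover curvature: how much of the residuals a quadratic in the
# fitted values can still explain (near 0 when the linearity check passes).
curv_raw <- summary(lm(resid(fit_raw) ~ poly(fitted(fit_raw), 2)))$r.squared
curv_log <- summary(lm(resid(fit_log) ~ poly(fitted(fit_log), 2)))$r.squared

print(c(sw_raw = sw_raw, sw_log = sw_log, curv_raw = curv_raw, curv_log = curv_log))
```

The same two calculations could be applied to `m1` in place of `fit_raw`; the plots used in this report remain the primary check, and the numbers are only a supplement.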




2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better”?

who$LifeExp4.6 <- who$LifeExp^4.6
who$TotExp0.06 <- who$TotExp^0.06

cor(who$LifeExp4.6,who$TotExp0.06)
## [1] 0.8542642
m2 <- lm(LifeExp4.6 ~ TotExp0.06, data = who)
summary(m2)
## 
## Call:
## lm(formula = LifeExp4.6 ~ TotExp0.06, data = who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp0.06   620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
plot(who$TotExp0.06,who$LifeExp4.6,main = 'Regression Model: Life Expectancy^4.6 ~ Expenditures^0.06',xlab = 'Predictor: personal and government expenditures raised to 0.06 power',ylab = 'avg life expectancy (yr) raised to 4.6 power')
abline(m2)

p-values: The p-values of the intercept and of TotExp0.06 are both far below 0.05, so the null hypothesis that the corresponding coefficient is 0 is rejected and TotExp0.06 is a statistically significant predictor of LifeExp4.6 (the three stars next to each p-value indicate the same).

F-statistic and standard error: Both are related to goodness of fit. The F-statistic is 507.7 on 1 and 188 degrees of freedom with a p-value below 2.2e-16, much larger than the 65.26 of the first model, so the evidence against an intercept-only model is even stronger this time. The residual standard error is the typical amount by which an observed response deviates from the fitted regression line. Even though 90,490,000 seems to be a large number, it is measured in units of LifeExp^4.6 and cannot be compared directly with the 9.371 years of the first model; relative to the coefficient estimated for TotExp0.06, it is smaller than before.

R\(^2\): R\(^2\) is the proportion of variation in the dependent (response) variable that is explained by the model. Adjusted R\(^2\) penalizes that value for the number of terms in the model. Both increase to roughly 73%, which is high, especially compared with the roughly 25% of the previous model.

Assumptions of simple linear regression: To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
(1) linearity: based on the scatterplot above, the relationship is roughly linear, consistent with the 0.85 correlation coefficient.
(2) nearly normal residuals: based on the QQ-plot below, the points mostly follow the reference line, with some discrepancies towards both ends, especially the lower end.

qqnorm(m2$residuals)
qqline(m2$residuals)  # adds diagonal line to the normal prob plot

(3) constant variability: the points in the residual plot below are scattered randomly, with some empty areas.

plot(fitted(m2),resid(m2))

Overall, this model is much better than the first one: R\(^2\) rises from about 0.26 to about 0.73 and the residuals are closer to normal.
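Because the two residual standard errors are on different scales, one way to compare the models is to back-transform the second model's predictions into years. The sketch below does this for the six countries shown in the head(who) output only, using the coefficients as printed in the two summaries; it is an illustration on six rows, not an evaluation of the full dataset, and the helper name `rmse` is made up.

```r
# First six rows of who.csv, copied from the head(who) output above.
six <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria",
              "Andorra", "Angola", "Antigua and Barbuda"),
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Model 1: coefficients as printed by summary(m1), prediction in years.
pred_m1 <- 64.75 + 6.297e-05 * six$TotExp

# Model 2: coefficients as printed by summary(m2), prediction on the
# LifeExp^4.6 scale, then raised to 1/4.6 to return to years.
pred_m2_pow <- -736527910 + 620060216 * six$TotExp^0.06
pred_m2     <- pred_m2_pow^(1 / 4.6)

# Root mean squared error in years (hypothetical helper).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_m1 <- rmse(six$LifeExp, pred_m1)
rmse_m2 <- rmse(six$LifeExp, pred_m2)

six$pred_m1 <- round(pred_m1, 1)
six$pred_m2 <- round(pred_m2, 1)
print(six)
print(c(rmse_m1 = rmse_m1, rmse_m2 = rmse_m2))
```

Note that the back-transform only works when the prediction on the LifeExp^4.6 scale is positive, which may not hold for the very lowest expenditures in the data.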



3. Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
The equation given by the result of the last question is: LifeExp4.6 = -736527910 + 620060216\(\times\)TotExp0.06

#when TotExp^.06 =1.5
TE0.06 <- 1.5
LE4.6 <- -736527910 + 620060216*TE0.06 
LE4.6^(1/4.6)
## [1] 63.31153
#when TotExp^.06 =2.5
TE0.06 <- 2.5
LE4.6 <- -736527910 + 620060216*TE0.06 
LE4.6^(1/4.6)
## [1] 86.50645
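The two calculations above can be wrapped in one function, and the inputs can be converted back to raw TotExp to see where they fall relative to the data. This is a sketch under stated assumptions: `forecast_life_exp` is a hypothetical name, the coefficients are the ones printed by summary(m2), and the maximum TotExp is taken from the summary(who) output.

```r
# Coefficients as printed by summary(m2); the function name is hypothetical.
forecast_life_exp <- function(totexp_pow, b0 = -736527910, b1 = 620060216) {
  life_exp_pow <- b0 + b1 * totexp_pow  # prediction on the LifeExp^4.6 scale
  life_exp_pow^(1 / 4.6)                # back-transform to years
}

inputs     <- c(1.5, 2.5)
forecasts  <- forecast_life_exp(inputs)
raw_totexp <- inputs^(1 / 0.06)  # undo the 0.06 power to recover TotExp

print(data.frame(TotExp0.06 = inputs, TotExp = raw_totexp, LifeExp = forecasts))

# Is each input beyond the largest TotExp in the data (Max. in summary(who))?
max_totexp_observed <- 482750
print(raw_totexp > max_totexp_observed)
```

If the second input turns out to lie beyond the observed maximum, the corresponding forecast is an extrapolation and should be read with caution.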




4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
\(\text{LifeExp} = b_0 + b_1 \times \text{PropMD} + b_2 \times \text{TotExp} + b_3 \times \text{PropMD} \times \text{TotExp}\)

m3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
summary(m3)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

p-values: The p-values of all coefficients and of the whole model are far below 0.05, so the null hypothesis that the corresponding coefficient is 0 is rejected in each case and the linear model is statistically significant (the three stars next to each p-value indicate the same).
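Because of the interaction term, the effect of one predictor depends on the value of the other. The sketch below works this out from the coefficients as printed in summary(m3); the function names are hypothetical and the PropMD values are the quartiles and maximum from summary(who).

```r
# Coefficients as printed by summary(m3).
b_propmd <- 1.497e+03
b_totexp <- 7.233e-05
b_inter  <- -6.026e-03

# Change in LifeExp per extra dollar of TotExp, at a given PropMD.
slope_totexp <- function(prop_md) b_totexp + b_inter * prop_md

# Change in LifeExp per unit of PropMD, at a given TotExp.
slope_propmd <- function(tot_exp) b_propmd + b_inter * tot_exp

# PropMD quartiles and maximum, copied from summary(who).
prop_md_values <- c(q1 = 0.0002444, median = 0.0010474,
                    q3 = 0.0024584, max = 0.0351290)
print(slope_totexp(prop_md_values))

# Values at which each slope changes sign.
flip_propmd <- -b_totexp / b_inter  # PropMD where the TotExp slope is 0
flip_totexp <- -b_propmd / b_inter  # TotExp where the PropMD slope is 0
print(c(flip_propmd = flip_propmd, flip_totexp = flip_totexp))
```

The negative interaction coefficient means the estimated benefit of extra spending shrinks as PropMD grows, and according to the fitted equation it becomes negative for the largest PropMD values in the data.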

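F-statistic and standard error: Both are related to goodness of fit. The F-statistic is 34.49 on 3 and 186 degrees of freedom with a p-value below 2.2e-16, so the model as a whole is statistically significant. The value is lower than the 65.26 of the first model, but the two are not directly comparable because this model has three predictor terms instead of one. The residual standard error is the typical amount by which an observed response deviates from the fitted regression line. At 8.765 years it is slightly lower than in the first model but still high relative to the 40 to 83 year range of LifeExp.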

R\(^2\): R\(^2\) is the proportion of variation in the dependent (response) variable that is explained by the model. Adjusted R\(^2\) penalizes that value for the number of terms in the model. R\(^2\) is approximately 35.7% and adjusted R\(^2\) approximately 34.7%. Both are higher than in model 1 but still relatively low, so other measures should be evaluated before we discard the model because of a low R\(^2\) value.

Assumptions of linear regression: To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
(1) linearity: the condition is doubtful, since LifeExp and TotExp are still untransformed and their relationship was already found to be non-linear in question 1.
(2) nearly normal residuals: based on the QQ-plot below, there are many discrepancies between the reference line and the points formed by the residuals.

qqnorm(m3$residuals)
qqline(m3$residuals)  # adds diagonal line to the normal prob plot

(3) constant variability: most of the observations are concentrated in the low range of expenditure, so the residuals are not spread evenly across the fitted values.

plot(fitted(m3),resid(m3))




5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

PropMD <- 0.03
TotExp <- 14

m3$coefficients[1] + m3$coefficients[2]*PropMD + m3$coefficients[3]*TotExp + m3$coefficients[4]*PropMD*TotExp
## (Intercept) 
##     107.696

Life expectancy is a statistical measure of the average time an organism is expected to live. In this scenario, life expectancy is the average for each country. A forecast of 107.70 years is too high and unrealistic; the largest value in the data is 83.
Moreover, a PropMD of 0.03 is among the highest within the dataset while a TotExp of 14 is among the lowest. This pair itself doesn’t seem to be realistic.
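To see where the unrealistic forecast comes from, the sketch below splits it into the contribution of each term and checks the inputs against the ranges printed by summary(who). It uses the coefficients as rounded in summary(m3), so the total may differ slightly from the value computed from `m3$coefficients`; the object names are illustrative.

```r
# Coefficients as printed (rounded) by summary(m3).
b0 <- 6.277e+01; b1 <- 1.497e+03; b2 <- 7.233e-05; b3 <- -6.026e-03

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years.
contributions <- c(
  intercept   = b0,
  PropMD      = b1 * prop_md,
  TotExp      = b2 * tot_exp,
  interaction = b3 * prop_md * tot_exp
)
forecast <- sum(contributions)
print(round(contributions, 3))
print(forecast)

# Compare the inputs with the ranges printed by summary(who).
propmd_q3 <- 0.0024584; propmd_max <- 0.0351290
totexp_min <- 13;       totexp_q1  <- 584
print(c(propmd_above_q3 = prop_md > propmd_q3,
        totexp_below_q1 = tot_exp < totexp_q1))
```

Most of the amount above the intercept should come from the PropMD term, because 0.03 is more than ten times the third quartile of PropMD, so the forecast leans on a region where the model has very little data.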