Multiple Regression - Data Format

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures

#Read file

myFile <- file.choose()
myData  <- read.csv(file=myFile, header=TRUE)

1. LifeExp~TotExp

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

#plot LifeExp vs TotExp
plot(LifeExp~TotExp,data=myData)

lifeExp_TotExp <- lm(myData$LifeExp~myData$TotExp)

summary(lifeExp_TotExp)
## 
## Call:
## lm(formula = myData$LifeExp ~ myData$TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.475e+01  7.535e-01  85.933  < 2e-16 ***
## myData$TotExp 6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The F-statistic is 65.26 on 1 and 188 degrees of freedom. Because the model has a single predictor, the F-test repeats the t-test on the slope and adds nothing beyond that p-value.

This value compares the current model to a model that has one fewer parameters. Because the one-factor model already has only a single parameter, this test is not particularly useful in this case. [LinearRegression_fulltext.pdf, page 33.]
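With one predictor, the overall F-statistic equals the square of the slope's t value; in the summary above, 8.079^2 is about 65.27, which matches the reported 65.26 up to rounding. The sketch below checks this identity on simulated data. The variables x and y are hypothetical stand-ins, not the WHO columns.

```r
# Simulated stand-in data; x and y are hypothetical, not the WHO variables.
set.seed(42)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)
fit_summary <- summary(fit)

# t value of the slope and the overall F-statistic
t_slope <- coef(fit_summary)["x", "t value"]
f_stat  <- unname(fit_summary$fstatistic["value"])

# With one predictor the two agree: F = t^2
print(c(t_squared = t_slope^2, F = f_stat))
```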

R^2 is 0.2577. The model explains 25.77% of the variation in LifeExp.

Multiplying this value by 100 gives a value that we can interpret as a percentage between 0 and 100. Consequently, you should not ever expect an R2 value of exactly one. In general, values of R2 that are closer to one indicate a better-fitting model. However, a good model does not necessarily require a large R2 value. It may still accurately predict future observations, even with a small R2 value. [LinearRegression_fulltext.pdf, page 32.]
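R^2 can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch on simulated data (x and y are hypothetical) that compares the hand calculation with the value reported by summary().

```r
# Simulated stand-in data; x and y are hypothetical.
set.seed(1)
x <- runif(80, min = 0, max = 10)
y <- 1 + 0.8 * x + rnorm(80, sd = 2)
fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)     # variation left unexplained by the model
sst <- sum((y - mean(y))^2)      # total variation around the mean of y
r_squared <- 1 - sse / sst

print(c(by_hand = r_squared, from_summary = summary(fit)$r.squared))
```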

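The standard error of the TotExp coefficient is 7.795e-06. The ratio of the coefficient to its standard error (6.297e-05/7.795e-06) is 8.08, which is the t value in the summary. By the rule of thumb quoted below the slope is estimated precisely, although the residual standard error of 9.371 years shows that individual predictions can still be far off.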

The Std. Error column shows the statistical standard error for each of the coefficients. For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient. [LinearRegression_fulltext.pdf, page 31.] [http://onlinestatbook.com/2/regression/accuracy.html]

The p-value is 7.714e-14. Since it is far below 0.05, we reject the null hypothesis that the slope on TotExp is zero.
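On the regression assumptions: the residual quartiles printed above (minimum -24.764, median 3.154, maximum 13.292) are lopsided, which hints that the residuals may not be symmetric around zero. The usual way to check is to plot the residuals. The sketch below shows the idea on simulated data in which a straight line is fitted to a curved relationship; x and y are hypothetical, and the same calls could be applied to lifeExp_TotExp.

```r
# Simulated stand-in for a curved relationship; x and y are hypothetical.
set.seed(7)
x <- runif(150, min = 1, max = 100)
y <- 40 + 8 * log(x) + rnorm(150, sd = 2)

fit <- lm(y ~ x)   # straight line fitted to curved data

# Residuals vs fitted: a curve or funnel suggests non-linearity or
# non-constant variance. Q-Q plot: points far from the line suggest
# non-normal residuals. Plots are drawn only in an interactive session.
if (interactive()) {
  plot(fitted(fit), residuals(fit), xlab = "Fitted", ylab = "Residuals")
  abline(h = 0, lty = 2)
  qqnorm(residuals(fit))
  qqline(residuals(fit))
}

# Numeric companion to the Q-Q plot: Shapiro-Wilk test of normality
normality <- shapiro.test(residuals(fit))
print(normality$p.value)
```

A small Shapiro-Wilk p-value, or a visible pattern in the residual plot, would be evidence against the normality or linearity assumption.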

Transform the variables with power transformations.

  2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
#transform data by power
myData.LifeExp_46 <- myData$LifeExp^4.6
myData.TotExp_006 <- myData$TotExp^0.06

#plot transformed data
plot(myData.LifeExp_46~myData.TotExp_006)

transform_lifeExp_TotExp_lm <- lm(myData.LifeExp_46~myData.TotExp_006)

summary(transform_lifeExp_TotExp_lm)
## 
## Call:
## lm(formula = myData.LifeExp_46 ~ myData.TotExp_006)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -736527910   46817945  -15.73   <2e-16 ***
## myData.TotExp_006  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The F-statistic is 507.7 on 1 and 188 degrees of freedom. With a single predictor it duplicates the t-test on the slope, so it is not separately informative here.

This value compares the current model to a model that has one fewer parameters. Because the one-factor model already has only a single parameter, this test is not particularly useful in this case. [LinearRegression_fulltext.pdf, page 33.]

R^2 is 0.7298. The model explains 72.98% of the variation in LifeExp^4.6.

Multiplying this value by 100 gives a value that we can interpret as a percentage between 0 and 100. Consequently, you should not ever expect an R2 value of exactly one. In general, values of R2 that are closer to one indicate a better-fitting model. However, a good model does not necessarily require a large R2 value. It may still accurately predict future observations, even with a small R2 value. [LinearRegression_fulltext.pdf, page 32.]

The standard error of the TotExp^0.06 coefficient is 27518940. The ratio of the coefficient to its standard error (620060216/27518940) is 22.53, well above the five-to-ten guideline, so the slope is estimated precisely.

The Std. Error column shows the statistical standard error for each of the coefficients. For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient. [LinearRegression_fulltext.pdf, page 31.]

The p-value is < 2.2e-16. It is far below 0.05, so we reject the null hypothesis that the slope is zero.

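Judging by the scatterplot, the higher R^2 (0.7298 versus 0.2577), and the larger ratio of coefficient to standard error (22.53 versus 8.08), the transformed variables give the better model. The residual standard errors (90490000 versus 9.371) are measured on different scales and cannot be compared directly.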
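Because the two models have different response scales, one way to compare them on equal footing is to back-transform the fitted values of the transformed model and measure the error in years for both. The sketch below does this on simulated data; the generating values and the names tot_exp, life_exp and rmse are hypothetical and chosen only to mimic a curved relationship.

```r
# Simulated stand-in data; the generating values below are hypothetical.
set.seed(123)
n        <- 200
tot_exp  <- exp(runif(n, log(50), log(50000)))   # strongly skewed spending
z        <- tot_exp^0.06
life_46  <- pmax(-7e8 + 6e8 * z + rnorm(n, sd = 1e7), 1e6)
life_exp <- life_46^(1 / 4.6)                    # life expectancy in years

fit_raw   <- lm(life_exp ~ tot_exp)
fit_trans <- lm(I(life_exp^4.6) ~ I(tot_exp^0.06))

# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

rmse_raw   <- rmse(life_exp, fitted(fit_raw))
# Undo the 4.6 power so both errors are expressed in years
rmse_trans <- rmse(life_exp, fitted(fit_trans)^(1 / 4.6))

print(c(raw = rmse_raw, transformed = rmse_trans))
```

With the real data, the same comparison could be made by replacing the simulated vectors with myData$TotExp and myData$LifeExp.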

Forecast

  3. Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

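TotExp is the sum of personal and government expenditures. The fitted model gives LifeExp^4.6 = (-736527910) + 620060216 * TotExp^0.06, so the forecast life expectancy in years is that value raised to the power 1/4.6. The forecasts for TotExp^0.06 = 1.5 and TotExp^0.06 = 2.5 follow.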

# for 1.5: the model predicts LifeExp^4.6, so take the 1/4.6 power to get years
TotExp_006_value <- 1.5
forecast_life_exp_46 <- (-736527910) + 620060216 * (TotExp_006_value)
print(paste("For TotExp^0.06 =", TotExp_006_value, ", the forecast LifeExp^4.6 is", forecast_life_exp_46, ", which is a life expectancy of", forecast_life_exp_46^(1/4.6), "years", sep=" "))
## [1] "For TotExp^0.06 = 1.5 , the forecast LifeExp^4.6 is 193562414 , which is a life expectancy of 63.3115334478635 years"
# for 2.5
TotExp_006_value <- 2.5
forecast_life_exp_46 <- (-736527910) + 620060216 * (TotExp_006_value)
print(paste("For TotExp^0.06 =", TotExp_006_value, ", the forecast LifeExp^4.6 is", forecast_life_exp_46, ", which is a life expectancy of", forecast_life_exp_46^(1/4.6), "years", sep=" "))
## [1] "For TotExp^0.06 = 2.5 , the forecast LifeExp^4.6 is 813622630 , which is a life expectancy of 86.5064484928337 years"
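The two forecasts can also be wrapped in a small vectorised helper so that the back-transformation is never forgotten. This is a sketch only: forecast_life_exp is a hypothetical name, and the coefficients are copied from the summary of the transformed model above.

```r
# Coefficients copied from the summary of the transformed model above
b0 <- -736527910
b1 <-  620060216

# Hypothetical helper: takes TotExp^0.06 and returns life expectancy in years
forecast_life_exp <- function(tot_exp_006) {
  life_exp_46 <- b0 + b1 * tot_exp_006   # prediction on the LifeExp^4.6 scale
  life_exp_46^(1 / 4.6)                  # undo the 4.6 power
}

print(round(forecast_life_exp(c(1.5, 2.5)), 2))
```

The results agree with the forecasts of roughly 63.31 and 86.51 years printed above.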

Build multiple regression model

  4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

where PropMD: proportion of the population who are MDs

multiple_reg_model_lm <- lm(formula = LifeExp~TotExp+PropMD+(PropMD*TotExp),data=myData)

summary(multiple_reg_model_lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + (PropMD * TotExp), data = myData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp:PropMD -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
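In an R formula, PropMD*TotExp already expands to both main effects plus their interaction, which is why the summary lists exactly three terms even though TotExp and PropMD are also written out separately. A short sketch that inspects the expanded terms without fitting anything (f_long and f_short are hypothetical names):

```r
# The formula used above and its shorter equivalent
f_long  <- LifeExp ~ TotExp + PropMD + (PropMD * TotExp)
f_short <- LifeExp ~ PropMD * TotExp

# terms() expands a formula; "term.labels" lists the resulting model terms
labels_long  <- attr(terms(f_long),  "term.labels")
labels_short <- attr(terms(f_short), "term.labels")

print(labels_long)
print(labels_short)
```

Both formulas yield two main effects and one interaction, so they describe the same model; only the order of the terms differs.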

The F-statistic is 34.49 on 3 and 186 degrees of freedom.

The F value in regression is the result of a test where the null hypothesis is that all of the regression coefficients are equal to zero. In other words, the model has no predictive capability. Basically, the f-test compares your model with zero predictor variables (the intercept only model), and decides whether your added coefficients improved the model. If you get a significant result, then whatever coefficients you included in your model improved the model’s fit. [http://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/]

Read your p-value first. If the p-value is small (less than your alpha level), you can reject the null hypothesis. Only then should you consider the f-value. If you fail to reject the null, discard the f-value result.
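The p-value attached to an F-statistic is the upper-tail probability of the F distribution with the model's degrees of freedom. As a check, the sketch below recomputes it with pf() from the F values and degrees of freedom printed in the two untransformed summaries; because the inputs are rounded, the results match the reported p-values only approximately.

```r
# F-statistics and degrees of freedom copied from the summary() outputs above
p_simple   <- pf(65.26, df1 = 1, df2 = 188, lower.tail = FALSE)  # LifeExp ~ TotExp
p_multiple <- pf(34.49, df1 = 3, df2 = 186, lower.tail = FALSE)  # interaction model

print(c(simple = p_simple, multiple = p_multiple))
```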

R^2 is 0.3574 (adjusted R^2 is 0.3471). The model explains 35.74% of the variation in LifeExp.

Standard error for TotExp is 8.982e-06. The ratio of coefficient to standard error is (7.233e-05 / 8.982e-06) = 8.053.
Standard error for PropMD is 2.788e+02. The ratio of coefficient to standard error is (1.497e+03 / 2.788e+02) = 5.371.
Standard error for TotExp:PropMD is 1.472e-03. The ratio of coefficient to standard error is (-6.026e-03 / 1.472e-03) = -4.093.
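The same ratios can be tabulated and compared against the "five to ten times smaller" guideline quoted earlier. The sketch below uses the estimates and standard errors as printed in the summary; the data frame coefs and the column meets_rule are hypothetical names, and the threshold of five is the lower end of that guideline.

```r
# Estimates and standard errors copied from the summary() output above
coefs <- data.frame(
  term      = c("TotExp", "PropMD", "TotExp:PropMD"),
  estimate  = c(7.233e-05, 1.497e+03, -6.026e-03),
  std_error = c(8.982e-06, 2.788e+02, 1.472e-03)
)

# Ratio of coefficient to standard error (the t value, up to rounding)
coefs$ratio <- coefs$estimate / coefs$std_error

# TRUE when the standard error is at least five times smaller than the estimate
coefs$meets_rule <- abs(coefs$ratio) >= 5

print(coefs)
```

By this guideline the TotExp and PropMD coefficients pass, while the interaction term, with a ratio of about -4.09, falls just short.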

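The p-value of the overall F-test is < 2.2e-16. It is far below 0.05, so we reject the null hypothesis that all of the coefficients are zero. Each individual coefficient is also significant at the 0.001 level.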

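The model is only a modest improvement over the untransformed LifeExp~TotExp model (adjusted R^2 of 0.3471 versus 0.2537, residual standard error of 8.765 versus 9.371 years). It still leaves about 64% of the variation in life expectancy unexplained, so it is not a good model overall, and it explains far less than the transformed model from question 2 (R^2 of 0.7298, measured on the LifeExp^4.6 scale).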

Forecast multiple-regression

  5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

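LifeExp = (6.277e+01) + (1.497e+03) x PropMD + (7.233e-05) x TotExp + (-6.026e-03) x PropMD x TotExp

PropMD_value <- 0.03
TotExp_value <- 14
PropMD_TotExp_value <- PropMD_value * TotExp_value
# pair each coefficient with its own variable: 1.497e+03 goes with PropMD, 7.233e-05 with TotExp
forecast_LifeExp2 <- (6.277e+01) + (1.497e+03) * PropMD_value + (7.233e-05) * TotExp_value + (-6.026e-03) * PropMD_TotExp_value

print(paste("For PropMD=", PropMD_value, " and TotExp =", TotExp_value, ", the forecast life expectancy is", round(forecast_LifeExp2, 2), sep=" "))
## [1] "For PropMD= 0.03  and TotExp = 14 , the forecast life expectancy is 107.68"

This forecast does not seem realistic. An average life expectancy of about 107.68 years is far above any plausible national value. Nearly all of the excess comes from the PropMD term (1.497e+03 x 0.03 = 44.91 years): a country where 3% of the population are MDs while total health spending is only 14 US dollars is an implausible combination, so the model is most likely being extrapolated well beyond the data it was fitted to.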
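The forecast depends on pairing each coefficient with the right variable: in the summary above, 7.233e-05 belongs to TotExp and 1.497e+03 belongs to PropMD. A minimal sketch that keeps the coefficients in a named vector, so that each one is looked up by name rather than by position, is given below. The names b and forecast_life_exp_multi are hypothetical, and the values are copied from the summary.

```r
# Coefficients copied from the multiple regression summary above
b <- c("(Intercept)"   = 6.277e+01,
       "TotExp"        = 7.233e-05,
       "PropMD"        = 1.497e+03,
       "TotExp:PropMD" = -6.026e-03)

# Hypothetical helper: each coefficient is matched to its variable by name
forecast_life_exp_multi <- function(prop_md, tot_exp) {
  unname(b["(Intercept)"] +
         b["PropMD"] * prop_md +
         b["TotExp"] * tot_exp +
         b["TotExp:PropMD"] * prop_md * tot_exp)
}

print(forecast_life_exp_multi(prop_md = 0.03, tot_exp = 14))
```

When the fitted object is available, predict(multiple_reg_model_lm, newdata = data.frame(PropMD = 0.03, TotExp = 14)) avoids copying coefficients by hand altogether.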