library(tidyverse)

The attached who.csv dataset contains real-world data from 2008. The variables included are:

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.

First I save the WHO data to a local variable: who_data.

who_data <- read.csv("who.csv", stringsAsFactors = FALSE)

Review the WHO Data:

pairs(who_data[2:10], gap = 0.3)

1:

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

1: Answer:

First the plot of LifeExp~TotExp:

# LifeExp is the response (y-axis) variable (left of ~)
# TotExp is the explanatory (x-axis) variable (right of ~).
# plot(data = who_data, LifeExp~TotExp)
ggplot(who_data, aes(x = TotExp, y = LifeExp)) + 
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", se = FALSE, alpha = .2) +
  labs(title = "Average Life Expectancy (Yrs) per Country vs\nPersonal & Gov Spending", 
       x = "Personal & Gov Spending", 
       y = "Avg Life Expectancy (Years)")

Next, I fit a simple linear regression, save it as spend_v_lifeExp, and print its summary:

spend_v_lifeExp <- lm(LifeExp~TotExp, data = who_data)
summary(spend_v_lifeExp)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

# F-statistic: (for models including non-intercept terms)
# a 3-vector with the value of the F-statistic with its 
# numerator and denominator degrees of freedom.

To check that this F-statistic is significant, I’ll get the critical value, which is the upper-tail quantile of the F distribution:

alpha <- 0.05
round(qf(1 - alpha, 1, 188), 2)

## [1] 3.89

Since the F-statistic of 65.26 is greater than the critical value of 3.89, I reject the null hypothesis that the slope is zero: TotExp has a statistically significant linear association with LifeExp, which the small p-value (7.714e-14) confirms. However, the \(R^2\) of 0.2577 means the model explains only about 26% of the variation in life expectancy, and the residual standard error of 9.371 means the observed values typically sit about 9 years away from the fitted line. Since the \(R^2\) value isn’t big and the data doesn’t appear to be linear, I’m not satisfied with the model.
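The decision rule for the overall F test can be sketched directly from the printed summary. This is a minimal illustration, not part of the original analysis: the names f_stat, f_crit and p_value are my own, and the inputs are the rounded values from the summary output rather than the fitted object.

```r
f_stat <- 65.26   # F-statistic from summary(spend_v_lifeExp)
df1    <- 1       # numerator degrees of freedom
df2    <- 188     # denominator degrees of freedom
alpha  <- 0.05

# Upper-tail critical value: reject the null when f_stat exceeds it.
f_crit <- qf(alpha, df1, df2, lower.tail = FALSE)

# p-value: probability of an F at least this large if the slope were zero.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

f_stat > f_crit   # TRUE means the F test rejects at the 5% level
p_value < alpha   # the same decision, expressed through the p-value
```

Both comparisons always agree, because they are two ways of reading the same upper tail of the F distribution.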

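Further, the residuals, below, do not look like random scatter about zero, and the residual quantiles in the summary (median 3.154, minimum -24.764, maximum 13.292) are skewed rather than symmetric, so they do not appear normal at all.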

plot(data=who_data, spend_v_lifeExp$residuals~TotExp)

Therefore, the assumptions of simple linear regression are not met.
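The residual checks can be packaged into a small helper. This is a sketch under stated assumptions: check_residuals is a hypothetical function of my own, and it is demonstrated on the built-in cars data set as a stand-in so that it runs without who.csv.

```r
# Hypothetical helper (not from the original analysis): draws the two
# standard residual diagnostics for any fitted lm object.
check_residuals <- function(model) {
  res <- resid(model)
  fit <- fitted(model)
  op <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(op))             # restore plotting settings on exit
  plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)       # reference line at zero
  qqnorm(res)                  # normal Q-Q plot of the residuals
  qqline(res)
  invisible(data.frame(fitted = fit, residual = res))
}

# Stand-in data: `cars` ships with base R.
demo_model <- lm(dist ~ speed, data = cars)
diag_df <- check_residuals(demo_model)
```

Calling check_residuals(spend_v_lifeExp) would give the same two panels for the WHO model: curvature or a funnel shape in the left panel points to non-linearity or non-constant variance, and points bending away from the line in the right panel point to non-normal residuals.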

2:

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values. Which model is “better?”

2: Answer:

First I’ll add those columns to the existing data set, fit a linear model, and then plot:

who_data <- who_data %>% 
  mutate(LifeExpEXP4.6 = LifeExp^(4.6),
         TotExpEXP.06 = TotExp^(0.06))

spend_v_lifeExpNEW <- lm(LifeExpEXP4.6~TotExpEXP.06, data = who_data)

ggplot(who_data, aes(x = TotExpEXP.06, y = LifeExpEXP4.6)) + 
  geom_point(size = 3, alpha = .4) +
  geom_smooth(method = "lm", se = FALSE, alpha = .2) +
  labs(title = "Life Expectancy^4.6 vs Personal & Gov Spending^0.06", 
       x = "Personal & Gov Spending ^ 0.06", 
       y = "Avg Life Expectancy (Years) ^ 4.6")

And now we’ll examine the results of the linear model:

summary(spend_v_lifeExpNEW)

## 
## Call:
## lm(formula = LifeExpEXP4.6 ~ TotExpEXP.06, data = who_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -736527910   46817945  -15.73   <2e-16 ***
## TotExpEXP.06  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

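As shown above, there’s major improvement in the model by using LifeExpEXP4.6 and TotExpEXP.06. The F-statistic rises from 65.26 to 507.7 and its p-value (< 2.2e-16) is even smaller, so the evidence against a zero slope is stronger. The \(R^2\) value improves from 0.2577 to 0.7298, meaning the transformed model explains about 73% of the variation in its response. The residual standard error of 90490000 is in units of years raised to the 4.6 power, so it cannot be compared directly with the 9.371 years of the first model.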

plot(data=who_data, spend_v_lifeExpNEW$residuals~TotExpEXP.06)

Above, the plot of the residuals vs TotExpEXP.06 shows improved uniformity about 0.

Given this information, the assumptions of simple linear regression are met and I’m happy to confirm that this is the superior model.
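A quick consistency check, not part of the original assignment: in a regression with a single predictor, the overall F-statistic equals the square of the slope’s t value, so the F test and the slope’s t test carry the same information. Using the rounded values printed in the two summaries:

\[t_{\text{TotExp}}^2 = 8.079^2 \approx 65.27 \approx F = 65.26\]

\[t_{\text{TotExpEXP.06}}^2 = 22.53^2 \approx 507.6 \approx F = 507.7\]

The small differences come from rounding in the printed t values.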

3:

Using the results from 2, forecast life expectancy when \(TotExp^{0.06} = 1.5\). Then forecast life expectancy when \(TotExp^{0.06} = 2.5\).

3 Answer:

The model from question 2 is:

\[\text{Life Expectancy}^{4.6}= -736527910 + 620060216 \times \text{Total Health Expenses}^{0.06}\]

For \(\text{Total Health Expenses}^{0.06} = 1.5\) I get:

\[\text{Life Expectancy}^{4.6}= -736527910 + 620060216 \times 1.5\]

calculating: \(\implies \text{Life Expectancy}^{4.6}=193562414 \implies \text{Life Expectancy}=193562414^{1/4.6}\approx\underline{63.31}\).

For \(\text{Total Health Expenses}^{0.06} = 2.5\) I get:

\[\text{Life Expectancy}^{4.6}= -736527910 + 620060216 \times 2.5\]

calculating: \(\implies \text{Life Expectancy}^{4.6}=813622630 \implies \text{Life Expectancy}=813622630^{1/4.6}\approx\underline{86.51}\).
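The same back-transformation can be sketched in R. This is an illustration under stated assumptions: forecast_life_exp is a hypothetical helper of my own, and b0 and b1 are the coefficients copied from the printed summary rather than read from the fitted object.

```r
# Coefficients copied from the summary of spend_v_lifeExpNEW.
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: predicts LifeExp^4.6 from TotExp^0.06, then
# undoes the 4.6 power to return life expectancy in years.
forecast_life_exp <- function(tot_exp_06) {
  (b0 + b1 * tot_exp_06)^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
round(forecasts, 2)
```

With the fitted object available, predict(spend_v_lifeExpNEW, newdata = data.frame(TotExpEXP.06 = c(1.5, 2.5)))^(1 / 4.6) should give the same forecasts from the unrounded coefficients.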

4:

Build the following multiple regression model and interpret the F Statistics, \(R^2\), standard error, and p-values. How good is the model?

\[LifeExp = b_0+b_1 \times PropMD + b_2 \times TotExp +b_3 \times PropMD \times TotExp\]

4 Answer:

First I’ll add the column PropMD * TotExp as PropMDXTotExp:

who_data <- who_data %>% 
  mutate(PropMDXTotExp = PropMD * TotExp)

And then I’ll create the given linear model over the data:

q3_model <- lm(LifeExp ~ PropMD + TotExp + PropMDXTotExp,
               data = who_data)
# Question 5 forecast by hand, using the rounded coefficients:
# 62.77 + 1497 * .03 + 0.00007233 * 14 - 0.006026 * (.03 * 14)
summary(q3_model)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMDXTotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMDXTotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

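As shown in the summary, the overall F-statistic of 34.49 on 3 and 186 degrees of freedom has a p-value below 2.2e-16, and every coefficient, including the interaction, is individually significant. However, the \(R^2\) of 0.3574 means the model explains only about 36% of the variation in life expectancy, and the residual standard error of 8.765 years is only slightly below the 9.371 years of the original model. So this model doesn’t appear to be better than the transformed model from question 2 (\(R^2\) of 0.7298) and is only marginally better than the original model (\(R^2\) of 0.2577).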

\(\text{LifeExp}=62.77 + 1497 \times \text{PropMD} + 0.00007233 \times \text{TotExp}-0.006026 \times \text{PropMD}\times\text{TotExp}\)
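R’s formula syntax can build the interaction term without a hand-made product column. The sketch below is not from the original analysis: it uses the built-in mtcars data as a stand-in, with wt and hp playing the roles of PropMD and TotExp, and the names by_hand, by_formula and same_fit are my own.

```r
# Stand-in data: `mtcars` ships with base R.
cars_df <- transform(mtcars, wtXhp = wt * hp)

# Hand-built product column, as done for PropMDXTotExp.
by_hand <- lm(mpg ~ wt + hp + wtXhp, data = cars_df)

# Formula shorthand: wt * hp expands to wt + hp + wt:hp.
by_formula <- lm(mpg ~ wt * hp, data = cars_df)

# Same fit; only the name of the last coefficient differs.
same_fit <- all.equal(unname(coef(by_hand)), unname(coef(by_formula)))
same_fit
```

By the same reasoning, lm(LifeExp ~ PropMD * TotExp, data = who_data) should reproduce the fit above, with the interaction coefficient labelled PropMD:TotExp.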

5:

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

5 Answer:

For \(\text{PropMD} = 0.03\) and \(\text{TotExp} = 14\), I get:

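\(\text{LifeExp}=62.77 + 1497 \times .03 + 0.00007233 \times 14-0.006026\times(.03 \times 14)\approx\underline{107.68}\). This forecast is not realistic: a national average life expectancy of nearly 108 years is far above anything observed. The PropMD term alone adds \(1497 \times 0.03 = 44.91\) years, and it is implausible that a country spending only 14 US dollars on healthcare would have 3% of its population working as MDs. This model is not great.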
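Breaking the forecast into its four terms shows where the extra years come from. This is a sketch using the rounded coefficients from the printed summary; the names b, contrib and forecast are my own, not objects from the original analysis.

```r
# Rounded coefficients copied from the summary of q3_model.
b <- c(intercept = 62.77, prop_md = 1497, tot_exp = 0.00007233,
       interaction = -0.006026)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years.
contrib <- c(intercept   = b[["intercept"]],
             prop_md     = b[["prop_md"]] * prop_md,
             tot_exp     = b[["tot_exp"]] * tot_exp,
             interaction = b[["interaction"]] * prop_md * tot_exp)
round(contrib, 4)

forecast <- sum(contrib)
round(forecast, 2)
```

The TotExp and interaction terms are both tiny at this spending level, so the forecast is essentially the intercept plus the PropMD term.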

Notes, References:

What should I do with F statistic in Regression model?

If my F-statistic is significant, that gives me extra confidence in the R² value that I have got. If I get an insignificant F-statistic, or if the p-value for F is greater than the level of significance (say 0.05 or 0.01), then personally I would stay away from that model, since I will not be able to comment confidently on the R² value.

F-Stat Value Interpretation:

What is the F Statistic:

An F statistic is a value you get when you run an ANOVA test or a regression analysis to find out if the means between two populations are significantly different. It’s similar to a T statistic from a T-test; a T-test will tell you if a single variable is statistically significant and an F test will tell you if a group of variables are jointly significant.

On Statistical Significance:

Simply put, if you have a significant result, it means that your results likely did not happen by chance. If you don’t have statistically significant results, you can’t reject the null hypothesis: the data don’t provide enough evidence of an effect.

On the F Statistic and p-value:

The F statistic must be used in combination with the p value when you are deciding if your overall results are significant. Why? If you have a significant result, it doesn’t mean that all your variables are significant. The statistic is just comparing the joint effect of all the variables together.

For example, if you are using the F Statistic in regression analysis (perhaps for a change in R Squared, the Coefficient of Determination), you would use the p value to get the “big picture.”

Step 1: If the p value is less than the alpha level, go to Step 2 (otherwise your results are not significant and you cannot reject the null hypothesis). A common alpha level for tests is 0.05.

Step 2: Study the individual p values to find out which of the individual variables are statistically significant.

The F value in Regression:

The F value in regression is the result of a test where the null hypothesis is that all of the regression coefficients are equal to zero. In other words, the model has no predictive capability. Basically, the f-test compares your model with zero predictor variables (the intercept only model), and decides whether your added coefficients improved the model. If you get a significant result, then whatever coefficients you included in your model improved the model’s fit.
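To make the comparison with the intercept-only model concrete, the overall F statistic can be written in terms of \(R^2\) for a model with \(k\) predictors and \(n\) observations. As a worked check (my own addition, using the rounded values printed for the interaction model in question 4, where \(k = 3\) and \(n - k - 1 = 186\)):

\[F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} = \frac{0.3574 / 3}{(1 - 0.3574)/186} \approx 34.48\]

This matches the reported F-statistic of 34.49 up to rounding of \(R^2\).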

Read your p-value first. If the p-value is small (less than your alpha level), you can reject the null hypothesis. Only then should you consider the f-value. If you fail to reject the null, discard the f-value result.

If you want to know whether your regression F-value is significant, you’ll need to find the critical value in the f-table. For example, let’s say you had 3 regression degrees of freedom (df1) and 120 residual degrees of freedom (df2). An F statistic of at least 3.95 is needed to reject the null hypothesis at an alpha level of 0.01. At this level, you stand a 1% chance of rejecting a null hypothesis that is actually true (Archdeacon, 1994, p.168). F values will range from 0 to an arbitrarily large number.
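The table lookup can be reproduced with qf(). This is a sketch using the 3 and 120 degrees of freedom from the example and the 1% level that corresponds to the quoted value of 3.95; the names crit_01 and crit_model are my own.

```r
# Upper-tail critical value for 3 and 120 degrees of freedom at the 1% level.
crit_01 <- qf(0.01, df1 = 3, df2 = 120, lower.tail = FALSE)
round(crit_01, 2)

# The same lookup at the degrees of freedom of the interaction model,
# compared against its reported F-statistic of 34.49.
crit_model <- qf(0.01, df1 = 3, df2 = 186, lower.tail = FALSE)
34.49 > crit_model
```

Setting lower.tail = FALSE asks for the value that leaves the stated probability in the upper tail, which is what an f-table reports.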

HW12_605

jbrnbrg

November 14, 2017