The attached who.csv dataset contains real-world data from 2008. The variables included are as follows:
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
# Load required packages: ggplot2 for plotting, dplyr for mutate() and %>%
library(ggplot2)
library(dplyr)

# Load the who.csv dataset
who.csv <- read.csv("who.csv")

# Simple linear regression of LifeExp on TotExp
simple.lm <- lm(LifeExp ~ TotExp, data = who.csv)
summary(simple.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
# Scatterplot of LifeExp vs TotExp with the fitted regression line
ggplot(data = who.csv, aes(x = TotExp, y = LifeExp)) +
  geom_point() +
  geom_abline(slope = coef(simple.lm)[[2]], intercept = coef(simple.lm)[[1]])
The \(r^2\) value of 0.2577 means that the model explains roughly 26% of the variability in life expectancy.
The F statistic is 65.26 on 1 and 188 degrees of freedom (one predictor and 188 residual degrees of freedom). Although this is statistically significant, the low \(r^2\) shows that this single predictor still leaves most of the variation in life expectancy unexplained.
The p-value of 7.714e-14 is far below 0.05, so we reject the null hypothesis that the slope on TotExp is zero; the association is statistically significant even though the fit is weak.
The residual standard error of 9.371 means that the observed life expectancies deviate from the fitted values by about 9.4 years on average, which is large relative to the spread of life expectancies in the data.
Based on the graph above, we can conclude that the assumptions of simple linear regression are not met: the relationship between the dependent and independent variables is clearly not a straight line, and the spread of the points is not constant.
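As an additional check (a minimal sketch, not part of the original output), the standard diagnostic plots for the fitted model can be inspected with base R's plot() method for lm objects:

# Residuals vs fitted, normal Q-Q, scale-location, and leverage plots for the
# untransformed model; curvature in the residuals-vs-fitted panel and departures
# from the line in the Q-Q plot would confirm that linearity and normality are violated
par(mfrow = c(2, 2))
plot(simple.lm)
par(mfrow = c(1, 1))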
# Raise LifeExp to the 4.6 power and TotExp to the 0.06 power
who.csv <- who.csv %>%
  mutate(LifeExpto4.6 = LifeExp^4.6, TotExpto0.06 = TotExp^0.06)
simple.lm2 <- lm(LifeExpto4.6 ~ TotExpto0.06, data = who.csv)
summary(simple.lm2)
##
## Call:
## lm(formula = LifeExpto4.6 ~ TotExpto0.06, data = who.csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExpto0.06 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Below is a plot of the transformed variables, LifeExp^4.6 against TotExp^0.06, with the fitted regression line, to check whether the relationship is now linear.
# Scatterplot of the transformed variables with the fitted regression line
ggplot(data = who.csv, aes(x = TotExpto0.06, y = LifeExpto4.6)) +
  geom_point() +
  geom_abline(slope = coef(simple.lm2)[[2]], intercept = coef(simple.lm2)[[1]])
The \(r^2\) value of 0.7298 means that the transformed model explains roughly 73% of the variability in LifeExp^4.6, a large improvement over the untransformed model.
The F statistic of 507.7 on 1 and 188 degrees of freedom is much larger than before, indicating that the transformed predictor explains a substantial share of the variation in the response.
The p-value of < 2.2e-16 is essentially zero, so we reject the null hypothesis that the slope on TotExp^0.06 is zero.
Based on the graph above, we can conclude that the new model simple.lm2 fits the data well: the relationship between the transformed dependent and independent variables is close to a straight line.
Since we want our residuals to be approximately normally distributed, the first and third quartiles should be roughly symmetric about zero and on the order of two-thirds of the residual standard error; here the quartiles (-53,978,977 and 59,139,231) are nearly symmetric and consistent with the residual standard error of 90,490,000.
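As a quick numeric check (a sketch, not in the original output), the residual quartiles can be compared against 0.674 times the residual standard error, which is roughly what we would expect for normally distributed residuals:

# Observed residual quartiles for the transformed model
quantile(resid(simple.lm2), probs = c(0.25, 0.75))
# Expected magnitude of the quartiles if the residuals were exactly normal
0.674 * sigma(simple.lm2)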
# Invert the transformed model: compute the predicted LifeExp^4.6 from the
# fitted coefficients, then raise it to the 1/4.6 power to recover LifeExp
forecast_lifeExpectancy <- function(TotExpto0.06) {
  (620060216 * TotExpto0.06 - 736527910)^(1 / 4.6)
}
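The forecasts quoted below come from evaluating this function at the requested values; a predict() call on the fitted model, back-transformed by raising to the 1/4.6 power, gives essentially the same numbers (this cross-check is a sketch and was not in the original output):

forecast_lifeExpectancy(1.5)  # forecast at TotExp^0.06 = 1.5
forecast_lifeExpectancy(2.5)  # forecast at TotExp^0.06 = 2.5

# Equivalent cross-check using predict() and back-transforming the response
predict(simple.lm2, newdata = data.frame(TotExpto0.06 = c(1.5, 2.5)))^(1 / 4.6)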
When TotExp^0.06 = 1.5, the forecasted life expectancy is 63.3115334 years.
When TotExp^0.06 = 2.5, the forecasted life expectancy is 86.5064485 years.
\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
# Multiple regression with an interaction between PropMD and TotExp
multiple_regression <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who.csv)
summary(multiple_regression)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who.csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Below are the diagnostic plots for the multiple regression model \(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\).
# Diagnostic plots (residuals vs fitted, Q-Q, scale-location, leverage)
plot(multiple_regression)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
The \(R^2\) value of 0.3574 states that the model accounts for about 36% of the variability in life expectancy.
The F statistic of 34.49 on 3 and 186 degrees of freedom, together with its very small p-value, indicates that the predictors jointly contribute to explaining life expectancy.
The individual p-values are all quite small, so we can reject the null hypotheses that PropMD, TotExp, and their interaction have no effect on life expectancy.
The residual standard error is 8.765; for roughly normal residuals the first and third quartiles should be approximately symmetric about zero and around two-thirds of this value, but here they are noticeably asymmetric (-4.132 versus 6.540), suggesting skewed residuals.
As can be seen from the diagnostic plots above, the model does not really meet the conditions for linear regression; in particular, the residuals vs fitted plot does not display constant variance.
# Forecast LifeExp from the fitted multiple regression coefficients
forecast_lifeExpectancy <- function(PropMD, TotExp) {
  62.7727 + 1497.494 * PropMD +
    7.233324e-05 * TotExp - 0.006025686 * PropMD * TotExp
}
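The forecast of 107.6960019 quoted below corresponds, judging from the coefficients and the arithmetic, to evaluating the function at PropMD = 0.03 and TotExp = 14; those inputs are an assumption recovered from the reported result rather than something shown in the output:

# Assumed inputs (PropMD = 0.03, TotExp = 14) recovered from the reported forecast
forecast_lifeExpectancy(PropMD = 0.03, TotExp = 14)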
The model predicts that the life expectancy would be 107.6960019 years.
This forecast doesn’t seem realistic: it implies that even with an unusually high proportion of doctors and very low total spending, we would still expect an extremely high life expectancy.
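One way to see why this forecast is an unrealistic extrapolation (a quick sketch, not in the original analysis) is to compare the assumed inputs with the observed ranges of the predictors and to ask predict() for an interval, which widens for inputs far from the data:

# Observed ranges of the predictors in the dataset
range(who.csv$PropMD)
range(who.csv$TotExp)

# Prediction with a confidence interval at the same assumed inputs as above
predict(multiple_regression,
        newdata = data.frame(PropMD = 0.03, TotExp = 14),
        interval = "confidence")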