The variables included the following:

Country: name of the country

LifeExp: average life expectancy for the country in years

InfantSurvival: proportion of those surviving to one year or more

Under5Survival: proportion of those surviving to five years or more

TBFree: proportion of the population without TB.

PropMD: proportion of the population who are MDs

PropRN: proportion of the population who are RNs

PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate

GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate

TotExp: sum of personal and government expenditures.

data = read.csv("/Users/Michele/Desktop/who - Sheet1.csv")
head(data)
##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294      20      92    112
## 2 0.001143127 0.004614439     169    3128   3297
## 3 0.001060478 0.002091362     108    5184   5292
## 4 0.003297297 0.003500000    2589  169725 172314
## 5 0.000070400 0.001146162      36    1620   1656
## 6 0.000142857 0.002773810     503   12543  13046
  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

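The model is a poor fit even though both coefficients are significant. The slope for TotExp has a p-value of 7.71e-14, and the F-statistic of 65.26 on 1 and 188 degrees of freedom (p-value 7.714e-14) lets us reject the null hypothesis of no linear relationship. However, the R^2 is only 0.2577, so the model explains about 26% of the variance in life expectancy. The residual standard error is 9.371, meaning a typical prediction misses by roughly 9 years. We would like a higher F-statistic, and it is a useful metric for comparing this model with the ones that follow.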

lif_exp_tot_exp = lm(data$LifeExp ~ data$TotExp)
summary(lif_exp_tot_exp)
## 
## Call:
## lm(formula = data$LifeExp ~ data$TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## data$TotExp 6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The plot shows that a straight line does not describe these data well. Expenditure rises very rapidly once life expectancy reaches about 80 years, and the linear model does not capture that curvature.

In addition, the assumptions of simple linear regression are not met: the data are heavily skewed, and the residuals are not normally distributed. Because this is a population dataset, we do not need to worry about sampling issues.

# Predictor on the x-axis and response on the y-axis, so the fitted line matches the model
plot(data$TotExp, data$LifeExp)
abline(lif_exp_tot_exp, col = "red")

Looking at the diagnostic plots below, we can see that some residuals are very large, especially where the fitted values are around 65. The Normal Q-Q plot shows that the residuals follow a slightly skewed distribution. In the Scale-Location plot, the residuals are not spread evenly around the line, which suggests that their variance is not constant. In the Residuals vs Leverage plot, a few observations have higher leverage than the others, and some have very high influence.

plot(lif_exp_tot_exp)

  2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
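# Work on a copy so the original columns stay untransformed
data_manipulation <- data
data_manipulation$LifeExp <- data_manipulation$LifeExp^4.6
# NOTE: the prompt specifies TotExp^0.06; this line uses an exponent of 0.6
data_manipulation$TotExp <- data_manipulation$TotExp^0.6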
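
The prompt specifies TotExp^0.06. As a minimal sketch of one way to apply both exponents without keeping a modified copy of the data frame, the transformations can be written directly in the model formula with I(). The data frame who_head is a hypothetical stand-in built from the six rows printed by head(data) above, so its coefficients will not match any fit on the full dataset.

```r
# First six rows of the WHO data, copied from the head(data) output above.
# who_head is an illustrative stand-in, not the full dataset.
who_head <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# I() protects the arithmetic so that ^ means exponentiation inside the formula.
fit_head <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = who_head)

# Coefficients from six rows only; they will differ from a full-data fit.
print(coef(fit_head))
```

Keeping the exponents in the formula makes them visible in the Call line of summary(), which makes a mistyped exponent easier to spot.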

The model has improved substantially. The p-values (< 2e-16) show that the coefficients are still statistically significant. The R^2 has risen to 0.5728, so the model now explains about 57% of the variance, although that still leaves much unexplained. The F-statistic is 252 on 1 and 188 degrees of freedom, much higher than the 65.26 of the first model, which gives stronger grounds for rejecting the null hypothesis that there is no relationship between the transformed LifeExp and TotExp. The residual standard error of 113800000 is on the scale of LifeExp^4.6, so it cannot be compared directly with the 9.371 of the untransformed model. Note that these figures come from the exponent of 0.6 used in the code above, not the 0.06 given in the prompt.

lif_exp_tot_exp_mani = lm(data_manipulation$LifeExp ~ data_manipulation$TotExp)
summary(lif_exp_tot_exp_mani)
## 
## Call:
## lm(formula = data_manipulation$LifeExp ~ data_manipulation$TotExp)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -257351739  -82599957   14030425   93896945  237720335 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              211907647   10234512   20.70   <2e-16 ***
## data_manipulation$TotExp    238461      15021   15.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared:  0.5728, Adjusted R-squared:  0.5705 
## F-statistic:   252 on 1 and 188 DF,  p-value: < 2.2e-16

After transforming the variables, the relationship between the two shows less of an abrupt shift between life expectancies of 70 and 80 years.

# Transformed TotExp on the x-axis, transformed LifeExp on the y-axis, as the prompt asks
plot(data_manipulation$TotExp, data_manipulation$LifeExp)

Looking at the diagnostic plots below, we can see that the residuals are closer to normal than in the original model. The Normal Q-Q plot shows that the residuals follow a less skewed distribution. In the Scale-Location plot, the residuals appear to be more evenly distributed. In the Residuals vs Leverage plot, no observations have very high leverage or influence.

plot(lif_exp_tot_exp_mani)

From this analysis, we can conclude that transforming the variables improved the model. The R^2 value has increased, the F-statistic has increased, and the residuals have become closer to normal.

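  3. Using the results from the transformed model, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.
# Coefficients of the transformed model (the response is LifeExp^4.6)
intercept <- coef(lif_exp_tot_exp_mani)[1]
slope <- coef(lif_exp_tot_exp_mani)[2]

# The forecasts below are on the LifeExp^4.6 scale, not in years
print(intercept + (1.5 * slope))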
## (Intercept) 
##   212265338
print(intercept + (2.5 * slope))
## (Intercept) 
##   212503799
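
The two values printed above are forecasts of LifeExp^4.6. A minimal sketch of converting them back to years, assuming the coefficients are taken as printed in the summary of the transformed model (the names intercept_t, slope_t, pred_t and pred_years are introduced here for illustration):

```r
# Coefficients as printed in the summary of the transformed model above.
intercept_t <- 211907647
slope_t     <- 238461

# Forecasts on the model's response scale, LifeExp^4.6.
pred_t <- intercept_t + c(1.5, 2.5) * slope_t

# Invert the 4.6 power to return to life expectancy in years.
pred_years <- pred_t^(1 / 4.6)
print(round(pred_years, 1))
```

With these coefficients both forecasts come out at roughly 64.6 years and are almost identical, because the slope is very small relative to the intercept.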
  4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

All three terms in this model are statistically significant, with p-values well below 0.001. The standard errors are small for TotExp (8.982e-06) and the interaction term (1.472e-03) and much larger for PropMD (2.788e+02), although this mostly reflects the very different scales of the variables, and every estimate is at least four standard errors from zero. The R^2 is still low at 0.3574, and the residual standard error is 8.765 years on 186 degrees of freedom. The F-statistic is 34.49. We desire a higher F-statistic, and this is the lowest one so far. This is likely because the F-statistic divides the explained variation by the number of predictors, and this model has two variables along with an interaction term.

data$PropMD_TotExp <- data$PropMD * data$TotExp
lm_prop_md = lm(data$LifeExp ~ data$PropMD + data$TotExp + data$PropMD_TotExp, data = data)
summary(lm_prop_md)
## 
## Call:
## lm(formula = data$LifeExp ~ data$PropMD + data$TotExp + data$PropMD_TotExp, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.277e+01  7.956e-01  78.899  < 2e-16 ***
## data$PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## data$TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## data$PropMD_TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
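
To see how the number of predictors enters the F-statistic, the sketch below recomputes it from the printed R^2 values using the standard identity F = (R^2 / k) / ((1 - R^2) / (n - k - 1)). The helper f_from_r2 is a hypothetical name introduced for illustration, and n = 190 is inferred from the residual degrees of freedom in the summaries.

```r
# Overall F-statistic implied by R^2, for k predictors and n observations.
f_from_r2 <- function(r2, k, n) {
  (r2 / k) / ((1 - r2) / (n - k - 1))
}

# 188 residual df with one predictor plus an intercept implies 190 rows.
n <- 190

f_simple      <- f_from_r2(0.2577, k = 1, n = n)  # untransformed model
f_transformed <- f_from_r2(0.5728, k = 1, n = n)  # transformed model
f_interaction <- f_from_r2(0.3574, k = 3, n = n)  # model with interaction

print(round(c(f_simple, f_transformed, f_interaction), 2))
```

The interaction model has a higher R^2 than the first model, yet its F-statistic is lower because the explained share is divided across three predictors instead of one.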
  5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

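The forecast of about 107.7 years is not realistic. It is well above any life expectancy shown in the data preview (at most 82). A PropMD of 0.03 is also far larger than any of the values in the first rows of the data (at most about 0.0033), so the model is extrapolating well beyond the inputs it was fitted on. This suggests the model is not reliable for forecasting life expectancy at such inputs.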

intercept = lm_prop_md$coefficients[1]
slope_md = lm_prop_md$coefficients[2]
slope_tot_exp = lm_prop_md$coefficients[3]
slope_md_tot_exp = lm_prop_md$coefficients[4]

propmd = .03
totexp = 14
propmd_totexp = propmd * totexp

print(intercept + (propmd * slope_md) + (totexp * slope_tot_exp) + (propmd_totexp * slope_md_tot_exp))
## (Intercept) 
##     107.696
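
To see where the 107.7 years comes from, the sketch below splits the forecast into the contribution of each term, assuming the rounded coefficients printed in the summary of the interaction model. The names b0, b_md, b_tot, b_int and contrib are introduced here for illustration, and the total differs slightly from the value above because of that rounding.

```r
# Coefficients as printed (rounded) in the summary of the interaction model.
b0    <- 6.277e+01
b_md  <- 1.497e+03
b_tot <- 7.233e-05
b_int <- -6.026e-03

propmd <- 0.03
totexp <- 14

# Contribution of each term to the forecast, in years.
contrib <- c(
  intercept   = b0,
  PropMD      = b_md * propmd,
  TotExp      = b_tot * totexp,
  interaction = b_int * propmd * totexp
)

print(round(contrib, 3))
print(sum(contrib))
```

Nearly all of the increase over the intercept comes from the PropMD term, which adds about 45 years, while the TotExp and interaction terms are negligible at these inputs.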