The variables included the following:
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.
data = read.csv("/Users/Michele/Desktop/who - Sheet1.csv")
head(data)
## Country LifeExp InfantSurvival Under5Survival TBFree
## 1 Afghanistan 42 0.835 0.743 0.99769
## 2 Albania 71 0.985 0.983 0.99974
## 3 Algeria 71 0.967 0.962 0.99944
## 4 Andorra 82 0.997 0.996 0.99983
## 5 Angola 41 0.846 0.740 0.99656
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991
## PropMD PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294 20 92 112
## 2 0.001143127 0.004614439 169 3128 3297
## 3 0.001060478 0.002091362 108 5184 5292
## 4 0.003297297 0.003500000 2589 169725 172314
## 5 0.000070400 0.001146162 36 1620 1656
## 6 0.000142857 0.002773810 503 12543 13046
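As a quick sanity check on the variable definitions, the sketch below re-enters the six rows printed by head(data) and confirms that TotExp is the sum of PersExp and GovtExp. The data frame name who_head is an assumption made for this illustration, and it covers only the rows shown above, not the full file.

```r
# First six rows as printed by head(data) above (expenditure columns only)
who_head <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  PersExp = c(20, 169, 108, 2589, 36, 503),
  GovtExp = c(92, 3128, 5184, 169725, 1620, 12543),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# TotExp should equal PersExp + GovtExp in every row
totexp_matches <- all(who_head$PersExp + who_head$GovtExp == who_head$TotExp)
print(totexp_matches)
```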
Although the predictor is statistically significant (p-value of 7.71e-14), this model is not a very good one. The R-squared is low: only 0.2577 of the variance in life expectancy is explained by total expenditure. The F-statistic is 65.26. We desire a higher F-statistic, and it is a useful metric for comparing different models to each other.
lif_exp_tot_exp = lm(data$LifeExp ~ data$TotExp)
summary(lif_exp_tot_exp)
##
## Call:
## lm(formula = data$LifeExp ~ data$TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## data$TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
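To make the coefficient table easier to read, here is a small sketch that recomputes the t value and two-sided p-value for the TotExp slope from the rounded estimate and standard error printed above. The variable names are chosen for illustration, and because the inputs are rounded the results will only approximately match the summary output.

```r
# Rounded values copied from the coefficient table above
estimate  <- 6.297e-05  # slope for TotExp
std_error <- 7.795e-06  # its standard error
df_resid  <- 188        # residual degrees of freedom

# t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(t_value)
print(p_value)
```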
Notice from the plot that a straight line is not the best description of our data. Among countries with a life expectancy near 80, total expenditure increases at a very rapid rate, which our model does not appear to capture.
In addition, the assumptions of simple linear regression are not met: the relationship is visibly non-linear, and the data (and, as the diagnostic plots show, the residuals) are not normally distributed. Because we are using a population dataset, we do not need to worry about sampling issues; however, our data are extremely skewed.
# Predictor on the x-axis so the line from lm(LifeExp ~ TotExp) matches the points
plot(data$TotExp, data$LifeExp)
abline(lif_exp_tot_exp, col = "red")
Looking at the diagnostic plots below, we can see that some residuals are very large, especially where the fitted values are around 65. The Normal Q-Q plot shows that our residuals follow a slightly skewed distribution. In the Scale-Location plot, the residuals do not appear to have constant spread, since many points fall far from the line. In the Residuals vs Leverage plot, a few observations have higher leverage than the others, and some have very high influence.
plot(lif_exp_tot_exp)
data_manipulation <- data
data_manipulation$LifeExp <- data_manipulation$LifeExp^4.6
data_manipulation$TotExp <- data_manipulation$TotExp^0.6
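The same transformation can also be written inside the model formula with I(), which avoids keeping a second copy of the data. The sketch below is only an illustration of that equivalence: it uses the six rows printed by head(data) earlier (the name who_head is an assumption), so its coefficients will not match the full-data fit.

```r
# Six rows re-entered from the head(data) output (illustration only)
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Approach 1: transform a copy of the data, as in the text
who_copy <- who_head
who_copy$LifeExp <- who_copy$LifeExp^4.6
who_copy$TotExp  <- who_copy$TotExp^0.6
fit_copy <- lm(LifeExp ~ TotExp, data = who_copy)

# Approach 2: transform inside the formula with I()
fit_inline <- lm(I(LifeExp^4.6) ~ I(TotExp^0.6), data = who_head)

# Both approaches give the same intercept and slope
same_fit <- isTRUE(all.equal(unname(coef(fit_copy)), unname(coef(fit_inline))))
print(same_fit)
```

Writing the powers in the formula keeps the original columns untouched, which makes it harder to mix up raw and transformed values later on.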
Notice that our model has improved considerably. The p-values show that the coefficients are still statistically significant. The R-squared is still modest, with 0.5728 of the variance explained, but that is more than double the value for the original model. The F-statistic is 252. We desire a higher F-statistic, and this is a useful metric for comparing different models to each other. It is much higher than the 65.26 of the other model, which gives stronger grounds for rejecting the null hypothesis that there is no relationship between the transformed LifeExp and TotExp.
lif_exp_tot_exp_mani = lm(data_manipulation$LifeExp ~ data_manipulation$TotExp)
summary(lif_exp_tot_exp_mani)
##
## Call:
## lm(formula = data_manipulation$LifeExp ~ data_manipulation$TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -257351739 -82599957 14030425 93896945 237720335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211907647 10234512 20.70 <2e-16 ***
## data_manipulation$TotExp 238461 15021 15.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113800000 on 188 degrees of freedom
## Multiple R-squared: 0.5728, Adjusted R-squared: 0.5705
## F-statistic: 252 on 1 and 188 DF, p-value: < 2.2e-16
Notice that after transforming both variables, the relationship between them shows less of an abrupt shift between life expectancies of about 70 and 80 years.
plot(data_manipulation$LifeExp, data_manipulation$TotExp)
Looking at the diagnostic plots below, we can see that the residuals are closer to normal than in the original model. The Normal Q-Q plot shows that the residuals follow a less skewed distribution. In the Scale-Location plot, the residuals appear to be more evenly spread. In the Residuals vs Leverage plot, no observations have very high leverage or influence.
plot(lif_exp_tot_exp_mani)
From this analysis, we can conclude that transforming the variables has improved our model: the R-squared has increased, the F-statistic has increased, and the residuals are closer to normal.
# Predictions are on the transformed scale (LifeExp^4.6); 1.5 and 2.5 are values of TotExp^0.6
intercept <- lif_exp_tot_exp_mani$coefficients[1]
slope <- lif_exp_tot_exp_mani$coefficients[2]
print(intercept + (1.5 * slope))
## (Intercept)
## 212265338
print(intercept + (2.5 * slope))
## (Intercept)
## 212503799
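The two values printed above are on the transformed scale, LifeExp^4.6, so they are not yet in years. A minimal sketch of the back-transformation, assuming the printed values are used as-is, raises each prediction to the power 1/4.6; both work out to about 64.6 years. The variable names are chosen for illustration.

```r
# Predictions copied from the output above (scale: LifeExp^4.6)
pred_transformed <- c(212265338, 212503799)

# Undo the 4.6 power to express the predictions in years
pred_years <- pred_transformed^(1 / 4.6)

print(round(pred_years, 1))
```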
LifeExp = b0 + b1 × PropMD + b2 × TotExp + b3 × PropMD × TotExp
In the following model every term is statistically significant, as the p-values show. The standard errors are small for TotExp and the interaction term and much larger for PropMD, although the predictors are on very different scales, so the t values are the better guide. However, the R-squared is low, with only 0.3574 of the variance explained. The F-statistic is 34.49, the lowest we have seen so far. This is likely because the F-statistic penalizes additional terms, and this model has two variables along with an interaction term.
data$PropMD_TotExp <- data$PropMD * data$TotExp
lm_prop_md = lm(data$LifeExp ~ data$PropMD + data$TotExp + data$PropMD_TotExp, data = data)
summary(lm_prop_md)
##
## Call:
## lm(formula = data$LifeExp ~ data$PropMD + data$TotExp + data$PropMD_TotExp,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## data$PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## data$TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## data$PropMD_TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
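The relationship between R-squared and the F-statistic can be made concrete. The sketch below uses the standard identity F = (R²/k) / ((1 − R²)/df), where k is the number of predictor terms and df is the residual degrees of freedom, and plugs in the rounded values reported by the three summaries. The function name f_from_r2 is hypothetical, and small differences from the printed F values come from rounding.

```r
# Hypothetical helper: F-statistic implied by R-squared, number of
# predictor terms (k), and residual degrees of freedom
f_from_r2 <- function(r2, k, df_resid) {
  (r2 / k) / ((1 - r2) / df_resid)
}

f_linear      <- f_from_r2(0.2577, k = 1, df_resid = 188)  # LifeExp ~ TotExp
f_transformed <- f_from_r2(0.5728, k = 1, df_resid = 188)  # transformed model
f_interaction <- f_from_r2(0.3574, k = 3, df_resid = 186)  # PropMD, TotExp, interaction

print(c(f_linear, f_transformed, f_interaction))
```

Because R-squared is divided by k, the three-term model gets a lower F-statistic than the first model even though its R-squared is higher.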
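Notice that the following predicted life expectancy does not make much sense. The prediction is about 107.7 years, which is higher than the life expectancy of any country. The input PropMD = 0.03 is roughly ten times the largest PropMD among the rows printed earlier, so the model is likely extrapolating well beyond the data. This suggests the model is not a good one for predicting life expectancy, at least for inputs like these.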
intercept = lm_prop_md$coefficients[1]
slope_md = lm_prop_md$coefficients[2]
slope_tot_exp = lm_prop_md$coefficients[3]
slope_md_tot_exp = lm_prop_md$coefficients[4]
propmd = .03
totexp = 14
propmd_totexp = propmd * totexp
print(intercept + (propmd * slope_md) + (totexp * slope_tot_exp) + (propmd_totexp * slope_md_tot_exp))
## (Intercept)
## 107.696
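To see how much of this result comes from the unusual inputs, the sketch below evaluates the fitted equation with the rounded coefficients from the summary output, once for the inputs above and once for Albania's values from the head(data) output. The function name predict_life_exp is hypothetical, and because the coefficients are rounded the first result will differ slightly from 107.696.

```r
# Coefficients copied (rounded) from the summary(lm_prop_md) output
b0 <- 6.277e+01    # (Intercept)
b1 <- 1.497e+03    # PropMD
b2 <- 7.233e-05    # TotExp
b3 <- -6.026e-03   # PropMD x TotExp interaction

# Hypothetical helper: evaluates the fitted equation for given inputs
predict_life_exp <- function(propmd, totexp) {
  b0 + b1 * propmd + b2 * totexp + b3 * propmd * totexp
}

# Inputs used in the text: PropMD = 0.03, TotExp = 14
pred_question <- predict_life_exp(0.03, 14)

# Albania's values from head(data); its observed LifeExp is 71
pred_albania <- predict_life_exp(0.001143127, 3297)

print(c(question = pred_question, albania = pred_albania))
```

With inputs inside the range of the printed rows, the equation returns a value in the 60s, so the implausible prediction appears to be driven mainly by the PropMD term at 0.03.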