The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
file_path <- "/Users/mollysiebecker/DATA 605/homework/who.csv"
who_data <- read.csv(file_path)
  1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
ggplot(who_data, mapping = aes(x = TotExp, y = LifeExp)) + 
  geom_point() +
  labs(title = "Life Expectancy vs. Total Expenditures", x = "Total Expenditures", y = "Average Life Expectancy")

m_simple <- lm(LifeExp ~ TotExp, data = who_data)
summary(m_simple)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Right away, the assumptions of simple linear regression are not met, because we can see from the scatter plot that a linear model does not appear to be a good fit for the data. The F-statistic of 65.26 indicates that this model is better than one with only the y-intercept and not the TotExp variable. Since there is also a very low p-value, we can see that this relationship between life expectancy and total expenditure is highly unlikely to be explained only by chance. The r-squared of ~.25 indicates a low correlation, as only 25% of the variation in the data is explained by the model. The standard error is ~9.3, which can be used to show that the residuals are not normally distributed since they are not equal to approximately 1.5 times the SE. Overall, a simple linear model is not appropriate for this data.

  1. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
who_data_transformed <- who_data
who_data_transformed$LifeExp <- who_data_transformed$LifeExp^4.6
who_data_transformed$TotExp <- who_data_transformed$TotExp^.06

ggplot(who_data_transformed, mapping = aes(x = TotExp, y = LifeExp)) + 
  geom_point() +
  labs(title = "Life Expectancy vs. Total Expenditures (Transformed)", x = "(Total Expenditures)^.06", y = "(Average Life Expectancy)^4.6")

m_transformed <- lm(LifeExp ~ TotExp, data = who_data_transformed)
summary(m_transformed)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data_transformed)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

This model is better than the previous simple linear model. The F-statistic of 507.7 indicates that this model is better than one with only the y-intercept and not the TotExp^.06 variable. Again, the very low p-value indicates that this relationship between life expectancy and total expenditure is highly unlikely to be explained only by chance. The r-squared of ~.73 indicates a moderate correlation, as 73% of the variation in the data can be explained by the model. The standard error is 90490000, which can be used to show that the residuals are not normally distributed since they are not equal to approximately 1.5 times the SE.

  1. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
(-736527910+620060216*1.5)^(1/4.6)
## [1] 63.31153
(-736527910+620060216*2.5)^(1/4.6)
## [1] 86.50645

Inputting the given values and then taking the inverse of raising to the 4.6th power, we find that the life expectancies are approximately 63.3 and 86.5 years, respectively.

  1. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
quadratic_data <- who_data %>% 
  mutate(quadratic_term = PropMD * TotExp)
m_quad <- lm(LifeExp ~ PropMD + TotExp + quadratic_term, data = quadratic_data)
summary(m_quad)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + quadratic_term, data = quadratic_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD          1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp          7.233e-05  8.982e-06   8.053 9.39e-14 ***
## quadratic_term -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

This model is better than the previous simple linear model, but not as good as the model with the transformed data. The F-statistic of 34.5 indicates that this model is better than one with only the y-intercept and not the other variables. Again, the very low p-value indicates that this relationship between life expectancy and total expenditure is highly unlikely to be explained only by chance. The r-squared of ~.35 indicates a low correlation, as only 35% of the variation in the data can be explained by the model. The standard error is 8.765, which can be used to show that the residuals are not normally distributed since they are not equal to approximately 1.5 times the SE.

  1. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
6.277e+01+1.497e+03*.03+7.233e-05*14+-6.026e-03*.03*14
## [1] 107.6785

No, this forecast does not seem realistic since ~108 is a very old age. It also seems unrealistic for a country to have one of the highest proportions of doctors at .03, but one of the lowest total expenditures at 14, which is likely contributing to the unrealistic prediction.