Packages.

library(dplyr)
library(ggplot2)

Data.

# full_path is assumed to hold the path to the CSV file
data <- read.csv(full_path)

head(data)
##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp
## 1 0.000572294      20      92    112
## 2 0.004614439     169    3128   3297
## 3 0.002091362     108    5184   5292
## 4 0.003500000    2589  169725 172314
## 5 0.001146162      36    1620   1656
## 6 0.002773810     503   12543  13046
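
The interpretation in Item 1 treats TotExp as personal plus government spending. As a quick sanity check, the sketch below re-enters the six rows printed by `head(data)` and confirms that PersExp + GovtExp equals TotExp for each of them. The data frame name `head_rows` is made up for this illustration, and the check covers only the rows shown, not the full data set.

```r
# First six rows of the expenditure columns, copied from the head(data) output.
head_rows <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  PersExp = c(20, 169, 108, 2589, 36, 503),
  GovtExp = c(92, 3128, 5184, 169725, 1620, 12543),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# TRUE for a row when TotExp is the sum of the two components.
sums_match <- head_rows$PersExp + head_rows$GovtExp == head_rows$TotExp
all(sums_match)
```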

Item 1.

Provide a scatterplot of LifeExp~TotExp and run a simple linear regression. Provide and interpret the F statistic, R^2, standard error, and p-values. Discuss whether the assumptions of simple linear regression are met.

ggplot(data, aes(x = TotExp, y = LifeExp)) +
  geom_point() + 
  ggtitle("Scatterplot of Life Expectancy vs. Total Healthcare Expenditure") +
  xlab("Total Expenditures (US$)") +
  ylab("Life Expectancy (years)")

model <- lm(LifeExp ~ TotExp, data = data)

summary(model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
par(mfrow=c(2,2))
plot(model)

ggplot(data, aes(x = TotExp, y = LifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  ggtitle("Scatterplot with Regression Line: Life Expectancy vs. Total Healthcare Expenditure") +
  xlab("Total Expenditures (US$)") +
  ylab("Life Expectancy (years)")
## `geom_smooth()` using formula = 'y ~ x'

INTERPRETATION.

  • Coefficient for TotExp: 6.297e-05. For every additional dollar of total healthcare spending (personal plus government), predicted life expectancy increases by about 0.00006297 years (about 0.023 days, or roughly 33 minutes). The slope's standard error is 7.795e-06 and its p-value is 7.71e-14, so the slope is significantly different from zero.
  • Intercept (estimate of LifeExp when TotExp is 0): 6.475e+01, or 64.75 years. This is the model's predicted life expectancy at zero healthcare expenditure. No country in the data spends nothing, so it is an extrapolation rather than an observed baseline.
  • R-squared: 0.2577. About 25.77% of the variance in life expectancy is explained by total healthcare expenditure, which is a weak-to-moderate linear relationship.
  • Residual standard error: 9.371 on 188 degrees of freedom, so a typical prediction misses the observed life expectancy by roughly 9.4 years.
  • The residuals range from -24.764 to 13.292 with a median of 3.154, so they are skewed rather than symmetric around zero, and the largest misses are over-predictions.
  • From the plots, the relationship does not look perfectly linear and there seems to be heteroscedasticity. There may be a few moderate outliers too. The assumptions of simple linear regression therefore do not appear to be fully met.
  • The F statistic is 65.26 on 1 and 188 degrees of freedom (p-value 7.714e-14). It is the ratio of the mean squared regression (MSR) to the mean squared error (MSE), and it compares the fit of the intercept-only model (a model with no predictors) with the model that includes TotExp. A large F statistic indicates that the model fits the data better than the intercept-only model. Here the regression model fits significantly better than an intercept-only model, even though it explains only about a quarter of the variance.
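
The F statistic can be cross-checked from other numbers in the `summary(model)` output. The sketch below is a minimal illustration using the rounded values printed above, so it reproduces 65.26 only approximately. It relies on two standard identities: F equals (R^2 / df_model) / ((1 - R^2) / df_resid), and with a single predictor F equals the squared t statistic of the slope. The variable names are chosen for this example.

```r
# Values copied from the summary(model) output.
r_squared <- 0.2577
t_slope   <- 8.079
df_model  <- 1     # one predictor
df_resid  <- 188   # residual degrees of freedom

# F as the ratio of explained to unexplained variance, each per degree of freedom.
f_from_r2 <- (r_squared / df_model) / ((1 - r_squared) / df_resid)

# With a single predictor, F equals the squared t statistic of the slope.
f_from_t <- t_slope^2

# Upper-tail p-value of the F distribution for the reported statistic.
p_value <- pf(65.26, df_model, df_resid, lower.tail = FALSE)

c(f_from_r2 = f_from_r2, f_from_t = f_from_t, p_value = p_value)
```

Both routes should land within rounding error of the reported 65.26, and the p-value should be of the same order as the reported 7.714e-14.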

Item 2.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, R^2, standard error, and p-values. Which model is “better”?

data$LifeExp_trans <- data$LifeExp^4.6
data$TotExp_trans <- data$TotExp^0.06
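
The prompt describes TotExp^.06 as nearly a log transform. One rough way to see this is to compare the two scales on a few values. The sketch below uses only the six TotExp values printed by `head(data)`, so it is an illustration rather than a check on the full data set.

```r
# TotExp values from the six rows shown in head(data).
tot_exp <- c(112, 3297, 5292, 172314, 1656, 13046)

power_scale <- tot_exp^0.06   # transformation used in Item 2
log_scale   <- log(tot_exp)   # natural log, for comparison

# A correlation close to 1 means the two scales order and space the
# countries almost identically, up to a linear rescaling.
cor(power_scale, log_scale)
```

Because a linear regression is unaffected by a linear rescaling of the predictor, a correlation near 1 suggests that fitting on TotExp^0.06 should behave much like fitting on log(TotExp).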

ggplot(data, aes(x = TotExp_trans, y = LifeExp_trans)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Transformed Life Expectancy vs. Transformed Total Expenditure",
       x = "Total Expenditures^0.06",
       y = "Life Expectancy^4.6")
## `geom_smooth()` using formula = 'y ~ x'

model_trans <- lm(LifeExp_trans ~ TotExp_trans, data = data)
summary_trans <- summary(model_trans)
print(summary_trans)
## 
## Call:
## lm(formula = LifeExp_trans ~ TotExp_trans, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -736527910   46817945  -15.73   <2e-16 ***
## TotExp_trans  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_trans)

INTERPRETATION.

  • There is a clear positive linear relationship between Life Expectancy^4.6 and Total Expenditures^0.06. The points are clustered around the regression line, and the line goes through the center of the points, indicating a good fit.
  • The residuals vs. fitted plot shows a random pattern, suggesting that the relationship between the transformed variables is linear, with no obvious non-linearity.
  • The normal Q-Q plot shows that the residuals mostly follow the reference line, suggesting that the residuals are approximately normally distributed.
  • The scale-location plot shows a random scatter of points, indicating that the variance of the errors is roughly constant across levels of the independent variable.
  • There don’t seem to be individual points that are both high-leverage and outliers.
  • The R-squared increased to 0.7298, so about 73% of the variability in the transformed Life Expectancy can be explained by the transformed Total Expenditure.
  • The residual standard error is 90,490,000 on 188 degrees of freedom. It is in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 years of the first model.
  • F-statistic: 507.7 on 1 and 188 degrees of freedom, with a p-value < 2.2e-16, so the model fit is statistically significant. The slope's p-value is likewise < 2e-16. This is a stronger result than the original model (F = 65.26).

So this model is better: it explains far more of the variance (R-squared of 0.7298 vs. 0.2577) and its diagnostic plots are consistent with the regression assumptions, which was not the case for the untransformed model. One caveat is that the two R-squared values are measured on different response scales.

Item 3.

Using the results from Item 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

We can do this by plugging in the values of \(TotExp^{0.06}\) into the regression equation:

\[ \text{LifeExp}^{4.6} = \text{Intercept} + \text{Coefficient} \times (TotExp^{0.06}) \]

The estimated intercept is -736,527,910 and the estimated coefficient for \(TotExp^{0.06}\) is 620,060,216.

intercept <- -736527910
coefficient <- 620060216

predicted_lifeexp_1_5 <- intercept + coefficient * (1.5)

predicted_lifeexp_2_5 <- intercept + coefficient * (2.5)

forecast_lifeexp_1_5 <- predicted_lifeexp_1_5^(1/4.6)
forecast_lifeexp_2_5 <- predicted_lifeexp_2_5^(1/4.6)

forecast_lifeexp_1_5
## [1] 63.31153
forecast_lifeexp_2_5
## [1] 86.50645
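
The two forecasts can also be produced in one vectorized step, together with the raw spending level that each value of TotExp^0.06 implies. This is a minimal sketch using the rounded coefficients printed by `summary(model_trans)`; the helper `forecast_life_exp` and the column names are hypothetical, introduced only for this illustration.

```r
# Coefficients copied from the summary(model_trans) output.
intercept_trans <- -736527910
slope_trans     <- 620060216

# Hypothetical helper: forecast LifeExp from a value of TotExp^0.06 by
# evaluating the fitted line and undoing the 4.6 power on the response.
forecast_life_exp <- function(tot_exp_trans) {
  (intercept_trans + slope_trans * tot_exp_trans)^(1 / 4.6)
}

tot_exp_trans <- c(1.5, 2.5)

data.frame(
  TotExp_trans = tot_exp_trans,
  TotExp_raw   = tot_exp_trans^(1 / 0.06),   # implied spending in US$
  LifeExp      = forecast_life_exp(tot_exp_trans)
)
```

The `TotExp_raw` column undoes the 0.06 power, so it can be compared against the range of TotExp in the data to judge whether either forecast is an extrapolation.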

Item 4.

model4 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = data)

summary(model4)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model4)

INTERPRETATION.

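  • The model is statistically significant overall, and each term (PropMD, TotExp, and their interaction) is individually significant at the 0.001 level.
  • The diagnostic plots still show a degree of heteroscedasticity and non-linearity, and potentially some outliers, so the regression assumptions are only partly met.
  • At an R-squared of 0.3574 (adjusted 0.3471), about 35.74% of the variability in life expectancy is explained by the model. That is better than the 0.2577 of the simple untransformed model but well below the 0.7298 of the transformed model.
  • The residual standard error is 8.765 on 186 degrees of freedom, only slightly smaller than the 9.371 of the simple model.
  • The F-statistic is 34.49 on 3 and 186 degrees of freedom with a very low p-value (< 2.2e-16), so the model as a whole is statistically significant.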
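
Because this model has three predictors instead of one, adjusted R-squared is the fairer basis for comparing it with the simple model from Item 1. The sketch below recomputes adjusted R-squared from the printed R-squared values using the standard formula 1 - (1 - R^2)(n - 1) / df_resid. The helper `adjusted_r2` is hypothetical, and since the inputs are rounded the results match the printed values only to about three decimals.

```r
# Hypothetical helper: adjusted R-squared from R-squared, the residual
# degrees of freedom, and the number of predictors.
adjusted_r2 <- function(r2, df_resid, n_predictors) {
  n <- df_resid + n_predictors + 1   # observations = residual df + parameters
  1 - (1 - r2) * (n - 1) / df_resid
}

# Values copied from the two summary() outputs on the untransformed response.
adj_simple      <- adjusted_r2(0.2577, df_resid = 188, n_predictors = 1)
adj_interaction <- adjusted_r2(0.3574, df_resid = 186, n_predictors = 3)

round(c(simple = adj_simple, interaction = adj_interaction), 4)
```

The penalty for the two extra terms is small here, so the interaction model keeps most of its advantage over the simple model after adjustment.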

Item 5.

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

intercept <- 6.277e+01
coeff_PropMD <- 1.497e+03
coeff_TotExp <- 7.233e-05
coeff_Interaction <- -6.026e-03

PropMD_value <- 0.03
TotExp_value <- 14

LifeExp_forecast <- intercept + 
                    (coeff_PropMD * PropMD_value) + 
                    (coeff_TotExp * TotExp_value) + 
                    (coeff_Interaction * PropMD_value * TotExp_value)

LifeExp_forecast
## [1] 107.6785

INTERPRETATION.

  • A life expectancy of 107.7 years is implausibly high. A PropMD of .03 is also very high and not realistic to start with: looking at the data, the vast majority of countries had a PropMD < .01. The exceptions with such a high PropMD were Cyprus and San Marino.
  • A TotExp of 14, on the other hand, is extremely small. The lowest TotExp in the data set is 13 and the second lowest is 70.
  • The negative interaction term means that the effect of PropMD on life expectancy shrinks as TotExp rises. At TotExp = 14 the interaction is negligible, so the forecast is driven almost entirely by the PropMD term, which adds 1497 × 0.03 ≈ 44.9 years to the intercept of 62.77.

All in all, no, this is not a realistic forecast. The inputs combine a very high PropMD with a very low TotExp, a combination that is barely represented in the data, so the model is extrapolating and is very likely over-estimating life expectancy.
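
To see where the 107.7 years come from, the forecast can be split into the contribution of each term. This is a minimal sketch using the rounded coefficients printed by `summary(model4)`; the short variable names are chosen for this example.

```r
# Rounded coefficients copied from the summary(model4) output.
b0    <- 6.277e+01
b_md  <- 1.497e+03
b_exp <- 7.233e-05
b_int <- -6.026e-03

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast, in years.
contributions <- c(
  intercept   = b0,
  PropMD      = b_md * prop_md,
  TotExp      = b_exp * tot_exp,
  interaction = b_int * prop_md * tot_exp
)
contributions
sum(contributions)

# Slope of LifeExp with respect to PropMD at this level of TotExp.
marginal_prop_md <- b_md + b_int * tot_exp
marginal_prop_md
```

The PropMD term accounts for nearly all of the increase over the intercept, while the TotExp and interaction terms each contribute well under 0.01 years at this spending level.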