data <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data605/main/who.csv", header = TRUE, sep = ',', na.strings="", fill = TRUE)
head(data)
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD
## 1 Afghanistan 42 0.835 0.743 0.99769 0.000228841
## 2 Albania 71 0.985 0.983 0.99974 0.001143127
## 3 Algeria 71 0.967 0.962 0.99944 0.001060478
## 4 Andorra 82 0.997 0.996 0.99983 0.003297297
## 5 Angola 41 0.846 0.740 0.99656 0.000070400
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991 0.000142857
## PropRN PersExp GovtExp TotExp
## 1 0.000572294 20 92 112
## 2 0.004614439 169 3128 3297
## 3 0.002091362 108 5184 5292
## 4 0.003500000 2589 169725 172314
## 5 0.001146162 36 1620 1656
## 6 0.002773810 503 12543 13046
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables.
# Simple Linear Regression
model <- lm(LifeExp ~ TotExp, data = data)
#Plot
plot(data$TotExp, data$LifeExp,
main = "Life Expectancy vs. Total Expenditure",
xlab = "Total Expenditure",
ylab = "Life Expectancy")
abline(model, col = "red")
# Summary
summary(model)
Provide and interpret the F statistics, R^2, standard error, and p-values only.
F-statistic and its p-value: The F-statistic is 65.26 on 1 and 188 degrees of freedom. It tests the null hypothesis that the slope on TotExp is zero, i.e., that TotExp does not linearly predict LifeExp. The associated p-value is approximately 7.714×10^−14, far below the alpha = 0.05 threshold, so we reject the null hypothesis; this suggests that TotExp is a statistically significant predictor of LifeExp.
R-squared: The R-squared value is 0.2577. This indicates that approximately 25.77% of the variability in LifeExp can be explained by TotExp. While this shows some level of predictive power, it also implies that other factors affect LifeExp.
Residual Standard Error: The residual standard error is 9.371 on 188 degrees of freedom. This suggests that observed LifeExp values typically deviate from the fitted values by about 9.4 years.
Coefficients and their p-values: The estimate for the intercept is approximately 64.75, with a p-value below 2e-16, indicating that it is statistically significant at the alpha = 0.05 threshold. We can interpret this to mean that when TotExp is zero, the average predicted LifeExp is about 64.75 years.
TotExp: The coefficient for TotExp is 6.297×10^−5 with a standard error of 7.795×10^−6 and a p-value of 7.71×10^−14. This tells us that for each one-unit increase in TotExp, LifeExp is expected to increase by approximately 0.00006297 years (about 0.063 years per 1,000 units). This effect is statistically significant.
Conclusion: The analysis suggests that TotExp has a positive but very small per-unit effect on LifeExp. The low R-squared value suggests limited overall explanatory power. The simple linear model suggests that while expenditure contributes to life expectancy, it is not the sole determinant, and that the overall effect of one additional unit of expenditure on life expectancy is very small.
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
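The headline numbers in the summary above are tied together arithmetically, which allows a quick sanity check. The sketch below is an illustration rather than part of the original analysis: it hard-codes the values as printed (so it is accurate only to the displayed rounding), uses base R's `pf()`, and the variable names are invented for this example.

```r
# Values copied from the printed summary of lm(LifeExp ~ TotExp), rounded as displayed
f_stat  <- 65.26   # F-statistic
df_num  <- 1       # numerator degrees of freedom
df_den  <- 188     # residual degrees of freedom
t_slope <- 8.079   # t value for the TotExp slope

# With a single predictor, the overall F-statistic equals the squared slope t value
t_slope^2

# The reported p-value is the upper-tail probability of the F distribution
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
p_value

# R-squared can be recovered from F and its degrees of freedom
r_squared <- (f_stat * df_num) / (f_stat * df_num + df_den)
r_squared
```

Because the F-test and the slope t-test coincide in a one-predictor model, the two p-values reported in the summary (7.714e-14 and 7.71e-14) are the same quantity printed at different precision.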
Discuss whether the assumptions of simple linear regression are met.
Residuals vs Fitted: The residuals follow a curve rather than scattering evenly around zero, suggesting non-linearity. Their spread also fans out slightly at higher fitted values, suggesting potential heteroscedasticity. This plot suggests the model does not meet the assumptions of linearity and homoscedasticity.
Q-Q Plot: Many points lie along the 45-degree reference line, but there is deviation at both ends, indicating some non-normality, potentially due to outliers or skewness in the residuals.
Scale-Location: The points are not evenly spread around a horizontal line. They are clustered at one end, and the spread varies considerably across fitted values. This plot points to heteroscedasticity, so the model does not meet the constant-variance assumption.
Residuals vs Leverage: A few points lie outside the Cook’s distance lines, suggesting the presence of influential observations that may affect the model.
Conclusion: The linearity assumption is violated, and the homoscedasticity assumption appears to be violated as well. The normality-of-residuals assumption is approximately met in the middle of the distribution but fails at the tails. Influential observations are a concern because several points lie outside the Cook’s distance lines. Overall, the simple linear regression model does not fit the data well and should not be used for predictions.
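The code that produced the four diagnostic plots is not shown in this write-up. A minimal sketch of how such panels are typically drawn in base R follows. It uses simulated toy data (not the WHO data) with a deliberately curved relationship, and `toy_fit` is a hypothetical name; substituting the `model` object fitted earlier should give the panels discussed above.

```r
# Toy data only (NOT the WHO data): a right-skewed predictor and a
# response that depends on log(x), so a straight-line fit is misspecified
set.seed(1)
x <- exp(runif(100, 3, 12))
y <- 40 + 4 * log(x) + rnorm(100, sd = 3)
toy_fit <- lm(y ~ x)

# plot() on an lm object draws, by default: Residuals vs Fitted,
# Normal Q-Q, Scale-Location, and Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(toy_fit)
par(op)

# Numeric companion to the leverage panel: Cook's distance per observation
cd <- cooks.distance(toy_fit)
head(sort(cd, decreasing = TRUE), 3)

# Observations beyond 0.5, the inner contour that plot.lm draws by default
which(cd > 0.5)
```

Listing the largest Cook's distances alongside the plot makes it easier to name the specific observations behind an "influential points" remark.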
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06).
# Create the new data frame with transformed variables
data_transformed <- data.frame(
Country = data$Country,
LifeExp_transformed = data$LifeExp^4.6,
TotExp_transformed = data$TotExp^0.06
)
# Display the first few rows of the new data frame to verify
head(data_transformed)
## Country LifeExp_transformed TotExp_transformed
## 1 Afghanistan 29305338 1.327251
## 2 Albania 327935478 1.625875
## 3 Algeria 327935478 1.672697
## 4 Andorra 636126841 2.061481
## 5 Angola 26230450 1.560068
## 6 Antigua and Barbuda 372636298 1.765748
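To see why the 0.06 power is described as nearly a log transform, the sketch below compares the two on the six TotExp values printed by `head(data)` earlier. It is an illustration only: the values are typed in by hand, the variable names are invented here, and six rows say nothing definitive about the full data set.

```r
# TotExp values from the six rows printed by head(data)
tot_exp <- c(112, 3297, 5292, 172314, 1656, 13046)

# Power transform used in the text
tot_exp_pow <- tot_exp^0.06
tot_exp_pow

# x^0.06 = exp(0.06 * log(x)); for a small exponent this is roughly
# linear in log(x), so the two transforms are very highly correlated
cor(tot_exp_pow, log(tot_exp))

# Response transform for the first row (Afghanistan, LifeExp = 42)
42^4.6
```

The first element of `tot_exp_pow` and the value of `42^4.6` can be checked against the Afghanistan row in the output above.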
Results & Conclusion
F-statistic and its p-value: The F-statistic is 507.7 on 1 and 188 degrees of freedom, far larger than the 65.26 of the untransformed model. It indicates that the model with TotExp_transformed predicts LifeExp_transformed significantly better than an intercept-only model. The p-value is less than 2.2×10^−16, so the model is highly significant.
R-squared: The R-squared value is 0.7298. This tells us that 72.98% of the variance in the transformed life expectancy variable can be explained by the transformed total expenditure variable. This is a large portion, which suggests a strong relationship.
Standard Error: The residual standard error, 90,490,000, looks very high. However, because LifeExp has been raised to the power of 4.6, the scale of the response variable has changed dramatically (the transformed values run into the hundreds of millions), which makes real-world interpretation difficult.
Coefficients and their p-values: Interpreting the intercept on a transformed scale is also challenging. The intercept is the expected value of the transformed response variable when the transformed predictor is zero, and its value of -736,527,910 is not meaningful.
TotExp_transformed: The coefficient for TotExp_transformed is 620,060,216 with a p-value of less than 2×10^−16, indicating a highly significant positive relationship between the transformed predictor and the transformed response. A one-unit increase in the transformed total expenditure is associated with an increase of 620,060,216 units in the transformed life expectancy. However, because of the transformation, this number has no direct interpretation in years or currency.
Conclusion: The transformed model indicates a strong and statistically significant relationship between total expenditure and life expectancy, with a large share of the variance in transformed life expectancy explained by transformed expenditure. However, the coefficients and the standard error are hard to interpret directly because of the transformation.
# Simple Linear Regression
model_transformed <- lm(LifeExp_transformed ~ TotExp_transformed, data = data_transformed)
# Plot
plot(data_transformed$TotExp_transformed, data_transformed$LifeExp_transformed,
main = "Transformed Scatterplot of LifeExp^4.6 vs. TotExp^.06",
xlab = "Total Expenditure^0.06",
ylab = "Life Expectancy^4.6")
abline(model_transformed, col = "red")
# Summary
summary(model_transformed)
##
## Call:
## lm(formula = LifeExp_transformed ~ TotExp_transformed, data = data_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_transformed 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
# Predict transformed life expectancy at TotExp^0.06 = 1.5 and 2.5
new_data <- data.frame(TotExp_transformed = c(1.5, 2.5))
predictions_transformed <- predict(model_transformed, newdata = new_data)
# To make the result interpretable, take the 4.6th root (raise to the power of 1/4.6)
predictions <- predictions_transformed^(1/4.6)
# Use sprintf() to build a formatted string for each prediction, rounded to 2 decimal places
prediction1 <- sprintf("The forecasted life expectancy for a total expenditure transformed value of 1.5 is: %.2f", predictions[1])
prediction2 <- sprintf("The forecasted life expectancy for a total expenditure transformed value of 2.5 is: %.2f", predictions[2])
# Display the predictions
prediction1
prediction2
## [1] "The forecasted life expectancy for a total expenditure transformed value of 1.5 is: 63.31"
## [1] "The forecasted life expectancy for a total expenditure transformed value of 2.5 is: 86.51"
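The two forecasts can be reproduced by hand from the coefficient table of the transformed model, which makes the back-transformation step explicit. The sketch below is an illustration under that assumption: the coefficients are copied from the printed summary, and the variable names are invented for this example.

```r
# Coefficients copied from the printed summary of the transformed model
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^0.06

x_transformed <- c(1.5, 2.5)

# Prediction on the LifeExp^4.6 scale
pred_transformed <- b0 + b1 * x_transformed

# Undo the response transform: the 4.6th root returns life expectancy in years
pred_years <- pred_transformed^(1 / 4.6)
round(pred_years, 2)

# Raw TotExp values implied by the transformed inputs
x_transformed^(1 / 0.06)
```

Converting 1.5 and 2.5 back to the raw TotExp scale, as in the last line, shows how far apart the two scenarios are in terms of actual expenditure.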
F-statistic and its p-value: The F-statistic is 39.99 on 2 and 187 degrees of freedom, with a p-value of approximately 3.479e-15. A p-value this small (far below the 0.05 threshold) indicates that we can reject the null hypothesis that both slope coefficients are zero.
Adjusted R-squared: Approximately 29.21% of the variance in Life Expectancy is explained by the model after adjusting for the number of predictors. The model captures some of the variability, but most of it remains unexplained.
Standard Error: The typical prediction error is about 9.127 years.
Coefficients and their p-values:
Intercept: The estimate for the intercept is approximately 63.97, with a standard error of 0.7706. The p-value of less than 2e-16 suggests that the intercept is significantly different from zero.
PropMD: The coefficient for PropMD is 650.8, with a standard error of 194.6. The associated p-value is 0.000998, indicating that PropMD is a significant predictor of LifeExp.
TotExp: The coefficient for TotExp is 0.00005378 with a standard error of 0.000008074. The p-value is approximately 2.95e-10, indicating that TotExp is also a significant predictor of LifeExp.
Conclusion: The model suggests that both PropMD and TotExp are significant predictors of LifeExp. The positive coefficients indicate that increases in these variables are associated with increases in LifeExp. However, the low R-squared values indicate that while the model is statistically significant, there are other factors not included in the model that may affect Life Expectancy as well.
# Create Multiple Linear Regression model
model_multiple <- lm(LifeExp ~ PropMD + TotExp, data = data)
# Summary
summary(model_multiple)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.996 -4.880 3.042 6.958 13.415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.397e+01 7.706e-01 83.012 < 2e-16 ***
## PropMD 6.508e+02 1.946e+02 3.344 0.000998 ***
## TotExp 5.378e-05 8.074e-06 6.661 2.95e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.127 on 187 degrees of freedom
## Multiple R-squared: 0.2996, Adjusted R-squared: 0.2921
## F-statistic: 39.99 on 2 and 187 DF, p-value: 3.479e-15
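The adjusted R-squared quoted above differs from the multiple R-squared only by a penalty for the number of predictors. The sketch below recomputes it from the printed output as an illustration. The inputs are rounded as displayed, the sample size is inferred from the residual degrees of freedom, and the variable names are invented for this example.

```r
# Values copied from the printed summary of lm(LifeExp ~ PropMD + TotExp)
r_squared <- 0.2996   # Multiple R-squared
p         <- 2        # predictors: PropMD and TotExp
df_resid  <- 187      # residual degrees of freedom
n         <- df_resid + p + 1   # implied number of observations (190)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
adj_r_squared

# p-value of the overall F-test
p_value <- pf(39.99, p, df_resid, lower.tail = FALSE)
p_value
```

With only two predictors and 190 observations the penalty is small, which is why the adjusted and unadjusted values differ by less than one percentage point.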
It’s unclear whether the forecast is realistic, for several reasons. First, as noted above, while the model is statistically significant, its low R-squared means it leaves most of the variability in life expectancy unexplained, suggesting that more factors are needed to improve predictions. Second, there are no analogs in the data. Countries with MD proportions as high as 0.03 typically have much higher total expenditures. This scenario combines a high proportion of MDs (near the top of the data) with very low expenditures (at the bottom), a combination that does not appear to be represented in the data, so the forecast is an extrapolation. That said, it is plausible that a higher proportion of doctors in a population would help increase life expectancy, but a country would need to achieve that proportion without any significant spending to do so.
# Create a new data frame
new_data <- data.frame(PropMD = 0.03, TotExp = 14)
# Predict LifeExp for the given PropMD and TotExp
predicted_LifeExp <- predict(model_multiple, newdata = new_data)
# Prediction
prediction3 <- sprintf("The forecasted life expectancy for PropMD = 0.03 and TotExp = 14 is: %.2f", predicted_LifeExp)
print(prediction3)
## [1] "The forecasted life expectancy for PropMD = 0.03 and TotExp = 14 is: 83.49"
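Splitting the forecast into its additive parts shows where the 83.49 comes from. The sketch below is an illustration that uses the coefficients as printed in the summary of the multiple regression (rounded), with variable names invented for this example.

```r
# Coefficients copied from the printed summary of lm(LifeExp ~ PropMD + TotExp)
b0       <- 63.97       # intercept
b_propmd <- 650.8       # slope on PropMD
b_totexp <- 5.378e-05   # slope on TotExp

# Scenario from the text
prop_md <- 0.03
tot_exp <- 14

# Additive contribution of each predictor, in years of life expectancy
contrib_propmd <- b_propmd * prop_md
contrib_totexp <- b_totexp * tot_exp
c(PropMD = contrib_propmd, TotExp = contrib_totexp)

# Forecast rebuilt from its parts
predicted <- b0 + contrib_propmd + contrib_totexp
round(predicted, 2)
```

Nearly all of the gain over the intercept comes from the PropMD term, so the forecast depends on that slope holding at PropMD = 0.03. For comparison, none of the six rows shown by `head(data)` has a PropMD above about 0.0033.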