who <- read.csv('/Users/haigbedros/Desktop/MSDS/Spring 24/605/HW/HW12/who.csv')
head(who)
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD
## 1 Afghanistan 42 0.835 0.743 0.99769 0.000228841
## 2 Albania 71 0.985 0.983 0.99974 0.001143127
## 3 Algeria 71 0.967 0.962 0.99944 0.001060478
## 4 Andorra 82 0.997 0.996 0.99983 0.003297297
## 5 Angola 41 0.846 0.740 0.99656 0.000070400
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991 0.000142857
## PropRN PersExp GovtExp TotExp
## 1 0.000572294 20 92 112
## 2 0.004614439 169 3128 3297
## 3 0.002091362 108 5184 5292
## 4 0.003500000 2589 169725 172314
## 5 0.001146162 36 1620 1656
## 6 0.002773810 503 12543 13046
plot(who$TotExp, who$LifeExp)
model <- lm(LifeExp ~ TotExp, data = who)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
par(mfrow = c(2, 2))
plot(model)
Interpretation:
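F-statistic: The F-statistic is 65.26 on 1 and 188 degrees of freedom (p-value 7.714e-14), so the model with TotExp fits significantly better than a model with no predictors. This shows that the slope is not zero; it does not by itself show that the fit is good.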
\(R^2\): The \(R^2\) value is 0.2577, meaning about 25.77% of the variability in life expectancy is explained by total expenditure.
Standard Error: The residual standard error is 9.371, so actual life expectancy typically differs from the fitted value by about 9.4 years.
p-values: The p-value for TotExp is extremely low (7.71e-14), indicating that total expenditure is a highly significant predictor of life expectancy.
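The statistics listed above are tied together in simple regression, and this can be checked from the reported numbers alone. The sketch below is a minimal check and not part of the original analysis; the variable names are illustrative. With one predictor, F equals the squared t value of the slope, and F can also be recovered from \(R^2\) and the residual degrees of freedom.

```r
# Values copied from summary(model)
r2 <- 0.2577       # Multiple R-squared
df_resid <- 188    # residual degrees of freedom
t_slope <- 8.079   # t value for TotExp
k <- 1             # number of predictors

# F = (R^2 / k) / ((1 - R^2) / df_resid)
f_from_r2 <- (r2 / k) / ((1 - r2) / df_resid)

# With a single predictor, F is the square of the slope's t value
f_from_t <- t_slope^2

# Upper-tail p-value of the reported F statistic
p_value <- pf(65.26, df1 = k, df2 = df_resid, lower.tail = FALSE)

c(f_from_r2 = f_from_r2, f_from_t = f_from_t, p_value = p_value)
```

Both F values should land close to the reported 65.26, with small differences due to rounding in the printed summary.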
Based on the scatterplot, the assumptions of simple linear regression are not fully met. The points follow a curved pattern, not a straight line, and the spread of life expectancy is not constant across the range of total expenditure, which violates the linearity and homoscedasticity assumptions. The lopsided residual quantiles in the summary (minimum -24.764, median 3.154, maximum 13.292) point the same way.
Which model is “better?”
who$LifeExp_raised <- who$LifeExp^4.6
who$TotExp_raised <- who$TotExp^.06
plot(who$TotExp_raised, who$LifeExp_raised)
transformed_model <- lm(who$LifeExp_raised ~ who$TotExp_raised, data = who)
summary(transformed_model)
##
## Call:
## lm(formula = who$LifeExp_raised ~ who$TotExp_raised, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## who$TotExp_raised 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(transformed_model)
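From the summary we got:
The transformed model looks better. Its R-squared is 0.7298 versus 0.2577 for the original model, and its F-statistic is 507.7 versus 65.26, indicating stronger explanatory power. The residual standard errors (90490000 versus 9.371) cannot be compared directly, because one response is LifeExp^4.6 and the other is LifeExp in years.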
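Because the two responses are on different scales, one way to put the models on equal footing is to convert the transformed model's fitted values back to years and compare root mean squared errors. The sketch below illustrates the idea on simulated data only: `sim` is a made-up stand-in for `who`, and `rmse`, `fit_raw`, and `fit_pow` are hypothetical names. It says nothing about the actual WHO results.

```r
set.seed(1)

# Simulated stand-in for the WHO data (an assumption, not the real file)
n <- 190
sim <- data.frame(TotExp = exp(runif(n, log(10), log(400000))))
sim$LifeExp <- pmin(85, 35 + 4 * log(sim$TotExp) + rnorm(n, sd = 5))

# Same power transformations as in the report
sim$LifeExp_raised <- sim$LifeExp^4.6
sim$TotExp_raised <- sim$TotExp^0.06

fit_raw <- lm(LifeExp ~ TotExp, data = sim)
fit_pow <- lm(LifeExp_raised ~ TotExp_raised, data = sim)

# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Back-transform the fitted values to years; pmax guards against
# negative fitted values, which have no real 4.6th root
pred_pow <- pmax(fitted(fit_pow), 0)^(1 / 4.6)

c(rmse_raw = rmse(sim$LifeExp, fitted(fit_raw)),
  rmse_pow = rmse(sim$LifeExp, pred_pow))
```

On the real data, the same steps would apply with `who` in place of `sim`.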
intercept <- -736527910
coefficient <- 620060216
# when TotExp^.06 =1.5
predicted_LifeExp_raised_1_5 <- intercept + coefficient * (1.5)
# when TotExp^.06=2.5
predicted_LifeExp_raised_2_5 <- intercept + coefficient * (2.5)
life_exp_original_1_5 <- predicted_LifeExp_raised_1_5^(1/4.6)
life_exp_original_2_5 <- predicted_LifeExp_raised_2_5^(1/4.6)
life_exp_original_1_5
## [1] 63.31153
life_exp_original_2_5
## [1] 86.50645
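The two forecasts above follow the same steps: predict LifeExp^4.6, then take the 1/4.6 power to return to years. A minimal sketch that wraps these steps in a helper is shown below; `forecast_life_exp` is a hypothetical name, and the coefficients are the rounded values printed in the summary. It also converts TotExp^.06 back to the original expenditure scale to show what the two inputs correspond to.

```r
# Coefficients copied from summary(transformed_model)
intercept <- -736527910
coefficient <- 620060216

# Hypothetical helper: x is TotExp^0.06, the result is life expectancy in years
forecast_life_exp <- function(x) {
  life_exp_raised <- intercept + coefficient * x  # prediction of LifeExp^4.6
  life_exp_raised^(1 / 4.6)                       # undo the power transform
}

x <- c(1.5, 2.5)
data.frame(
  TotExp_raised = x,
  TotExp = x^(1 / 0.06),          # expenditure on the original scale
  LifeExp = forecast_life_exp(x)
)
```

On the original scale the two inputs are roughly 861 and 4.3 million, so the second forecast sits far above the expenditures visible in the `head(who)` output.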
\(\text{LifeExp} = b_0 + b_1 \times \text{PropMD} + b_2 \times \text{TotExp} + b_3 \times \text{PropMD} \times \text{TotExp}\)
new_model <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = who)
summary(new_model)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The model is statistically significant as indicated by the F-statistic (34.49) with a p-value less than 2.2e-16. However, it only explains about 35.74% of the variation in life expectancy (R-squared = 0.3574), which is moderate.
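The residual standard error of 8.765 means actual life expectancy typically differs from the fitted value by about 8.8 years, a small improvement over the 9.371 of the single-predictor model.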
All predictors, including the interaction term, are significant (p < 0.05), meaning both the proportion of medical doctors and total expenditure, along with their interaction, are important for predicting life expectancy. Despite its statistical significance, the model leaves room for improvement in explaining life expectancy’s variability.
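With an interaction term, the fitted effect of each predictor depends on the value of the other: the slope for PropMD is \(b_1 + b_3 \times \text{TotExp}\), and the slope for TotExp is \(b_2 + b_3 \times \text{PropMD}\). The sketch below works this out from the rounded coefficients in the summary; the helper names are hypothetical.

```r
# Coefficients copied from summary(new_model)
b1 <- 1.497e+03    # PropMD
b2 <- 7.233e-05    # TotExp
b3 <- -6.026e-03   # PropMD:TotExp

# Change in fitted LifeExp per unit of PropMD at a given TotExp
slope_propmd <- function(tot_exp) b1 + b3 * tot_exp
# Change in fitted LifeExp per unit of TotExp at a given PropMD
slope_totexp <- function(prop_md) b2 + b3 * prop_md

# Predictor values at which each fitted slope crosses zero
totexp_zero <- -b1 / b3
propmd_zero <- -b2 / b3

c(totexp_zero = totexp_zero, propmd_zero = propmd_zero)
```

Because b3 is negative, each fitted slope shrinks as the other predictor grows. Past the crossover points the fitted slope turns negative, which is a property of this fitted equation and not evidence that doctors or spending reduce life expectancy.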
# Coefficients from the regression model
b0 <- 62.77
b1 <- 1497
b2 <- 7.233e-05
b3 <- -0.006026
# Values for PropMD and TotExp
PropMD <- 0.03
TotExp <- 14
# Forecasting LifeExp using the regression equation
LifeExp_forecast <- b0 + (b1 * PropMD) + (b2 * TotExp) + (b3 * PropMD * TotExp)
LifeExp_forecast
## [1] 107.6785
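The forecast of 107.6785 years is not realistic: it is well above every life expectancy in the head(who) output, where the highest value is 82. The inputs are also far from those rows. PropMD = 0.03 is roughly nine times the largest value shown (0.003297297), and TotExp = 14 is below the smallest (112). The forecast is therefore likely an extrapolation that the fitted model cannot support, and the model may not capture the complex factors that determine life expectancy.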
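One quick way to judge a forecast like this is to check whether its inputs fall inside the range of the data. The sketch below does that against the six rows printed by head(who) only, since those are the values visible in this report; `head_rows` and `in_range` are hypothetical names.

```r
# The six rows printed by head(who); the full dataset has more rows
head_rows <- data.frame(
  PropMD = c(0.000228841, 0.001143127, 0.001060478,
             0.003297297, 0.000070400, 0.000142857),
  TotExp = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Hypothetical helper: TRUE when value lies inside the observed range of x
in_range <- function(value, x) value >= min(x) & value <= max(x)

c(
  PropMD_in_range = in_range(0.03, head_rows$PropMD),
  TotExp_in_range = in_range(14, head_rows$TotExp)
)
```

On the full data, `range(who$PropMD)` and `range(who$TotExp)` would give the definitive check.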