who_data <- read_csv(url, show_col_types = FALSE)
head(who_data)
## # A tibble: 6 × 10
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanis… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36
## 6 Antigua … 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>
plot(who_data$TotExp, who_data$LifeExp, main="Life Expectancy vs Total Expenditure",
xlab="Total Expenditure", ylab="Life Expectancy")
lm1 <- lm(LifeExp~TotExp, data = who_data)
summary(lm1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The F statistic is 65.26. The F statistic tests whether at least one predictor variable has a non-zero coefficient. The p-value is 7.714e-14, which is extremely small: if the true coefficient of total expenditure were zero, an F statistic this large would be highly unlikely to arise by chance. The large F statistic together with the small p-value indicates that the model as a whole is statistically significant.
The multiple \(R^2\) is 0.2577. This means that 25.77% of the variability in life expectancy can be explained by total expenditure using this model. The remaining variability suggests that there are likely other predictors that help explain life expectancy.
The residual standard error is 9.371. A lower RSE suggests that the observed values are closer to the predicted values. An RSE of 9.371 suggests that the observed life expectancy values typically deviate from the predicted values by about 9.371 years.
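For a model with a single predictor, the overall F statistic is the square of the slope's t value, and the model p-value is the upper tail of an F distribution. A minimal sketch using only the numbers printed in the summary above (the variable names are illustrative and not part of the original analysis):

```r
# Values copied from the printed summary of lm1
t_slope <- 8.079   # t value for TotExp
f_stat  <- 65.26   # F-statistic
df1 <- 1           # numerator degrees of freedom
df2 <- 188         # denominator degrees of freedom

# With one predictor, F equals t^2 (up to rounding in the printout)
t_slope^2

# Upper-tail probability of the F distribution gives the model p-value
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value
```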
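The residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. A minimal sketch on simulated data (a hypothetical stand-in, not the WHO data set) shows the calculation and checks it against the value that `summary()` reports:

```r
# Simulated stand-in data (hypothetical; not the WHO data set)
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100, sd = 1.5)
fit <- lm(y ~ x)

# RSE = sqrt(residual sum of squares / residual degrees of freedom)
rss <- sum(resid(fit)^2)
rse_manual <- sqrt(rss / df.residual(fit))

# Should agree with the value that summary() reports
rse_manual
summary(fit)$sigma
```

Applied to the model above, the same steps would use `resid(lm1)` and `df.residual(lm1)`.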
From the below Q-Q plot, we can see that the residuals do not fall along the normal line. We are not able to confirm normality.
qqnorm(resid(lm1))
qqline(resid(lm1))
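As a complement to the visual check, a Shapiro-Wilk test can be run on the residuals. This test is not part of the original analysis; the sketch below uses simulated, deliberately skewed data as a hypothetical stand-in for the WHO data:

```r
# Simulated stand-in data with skewed errors (hypothetical; not the WHO data)
set.seed(1)
x <- runif(200, 0, 10)
y <- 5 + 0.5 * x + rexp(200, rate = 1)   # exponential errors are right-skewed
fit <- lm(y ~ x)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(resid(fit))
sw$p.value   # a small p-value is evidence against normality
```

For the model above, the equivalent call would be `shapiro.test(resid(lm1))`.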
In order to confirm linearity, we will plot the residuals vs fitted values. To assume linearity, there needs to be no discernible trend in the points. In the below plot, we can see that the points are clustered on the left side of the graph. We cannot confirm linearity based on this plot.
plot(fitted(lm1),resid(lm1))
In order to confirm homoscedasticity, we will plot the residuals vs fitted values again with a horizontal line at 0. Ideally, the points should be randomly dispersed about 0, with no discernible pattern. From the below plot, we cannot confirm homoscedasticity because there is a discernible pattern and the points aren’t scattered evenly about 0.
plot(fitted(lm1),resid(lm1))
abline(h = 0, col = "red", lty = 2)
Using the above residuals vs fitted values plot, we cannot assume independence as there is a discernible trend in the points.
In conclusion, the assumptions for linear regression are not satisfied.
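The individual checks above can also be viewed side by side with R's built-in diagnostic plots for `lm` objects. A minimal sketch on simulated data (a hypothetical stand-in for the WHO data, with a deliberately misspecified straight-line fit):

```r
# Simulated stand-in data (hypothetical; not the WHO data)
set.seed(7)
x <- runif(150, 1, 100)
y <- 50 + 5 * log(x) + rnorm(150, sd = 2)   # curved relationship
fit <- lm(y ~ x)                            # straight-line fit, deliberately misspecified

# Arrange the four standard diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(old_par)
```

For the model above, the equivalent call would be `plot(lm1)`.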
plot(who_data$TotExp^0.06, who_data$LifeExp^4.6, main="Transformed Life Expectancy vs Transformed Total Expenditure",
xlab="Total Expenditure^0.06", ylab="Life Expectancy^4.6")
lm2 <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = who_data)
summary(lm2)
##
## Call:
## lm(formula = I(LifeExp^4.6) ~ I(TotExp^0.06), data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## I(TotExp^0.06) 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
The F statistic is 507.7. The p-value is less than 2.2e-16, which is extremely small. This p-value suggests that it is highly unlikely that the observed results are due to random chance. Since the F statistic is large and the p-value is small, this indicates that total expenditure is a significant predictor of life expectancy. The F statistic is much higher than in the previous model and the p-value is much lower than in the previous model.
The multiple \(R^2\) is 0.7298. This means that 72.98% of the variability in transformed life expectancy (LifeExp^4.6) can be explained by transformed total expenditure (TotExp^0.06) using this model. This is about 47 percentage points higher than in the previous model.
The residual standard error is 90,490,000. Because the response is now LifeExp^4.6, this value is measured on the transformed scale: the observed values of LifeExp^4.6 typically deviate from the predicted values by about 90,490,000. It cannot be compared directly with the RSE of 9.371 from the first model, because its size reflects the scale of the transformed response. The \(R^2\) of 0.7298 still leaves some unexplained variability in life expectancy.
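The RSE is expressed in the units of the response, so its magnitude depends on the response's scale as well as on the quality of the fit. A minimal sketch on simulated data (a hypothetical stand-in, using a simple rescaling rather than the power transformation applied here) illustrates this:

```r
# Simulated stand-in data (hypothetical; not the WHO data)
set.seed(3)
x <- runif(100, 0, 10)
y <- 60 + 2 * x + rnorm(100, sd = 3)

fit_small <- lm(y ~ x)                # response in original units
fit_big   <- lm(I(1000 * y) ~ x)      # same response, multiplied by 1000

# The RSE grows by the same factor as the response...
sigma(fit_small)
sigma(fit_big)

# ...while R-squared is unchanged
summary(fit_small)$r.squared
summary(fit_big)$r.squared
```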
Overall, while the transformed model represents a significant improvement over the original model, further analysis may be necessary to better understand and account for the remaining variability in Life Expectancy.
From the below Q-Q plot, we can see that the majority of the points fall along the normal line, but at both ends the points trail off. This is an improvement over the previous model's Q-Q plot. We can assume normality based on this plot.
qqnorm(resid(lm2))
qqline(resid(lm2))
In the below plot, we can see that the points are scattered about the plot with no discernible pattern. We can assume linearity based on this plot.
plot(fitted(lm2),resid(lm2))
Based on the below plot, the points are scattered about 0 with no discernible pattern. We can assume homoscedasticity.
plot(fitted(lm2),resid(lm2))
abline(h = 0, col = "red", lty = 2)
Using the above residuals vs fitted values plot, we can assume independence as there is no discernible trend in the plotted residuals.
In conclusion, the assumptions for linear regression are satisfied.
The second linear model is better than the first. The second model has a significantly larger F statistic and lower p-value compared to the first model. This indicates a stronger overall relationship between the predictor and response variables and a higher level of confidence in the observed results. The \(R^2\) has improved significantly in the second model vs the first.
# Given coefficients from the linear model
beta0 <- coef(lm2)[1]
beta1 <- coef(lm2)[2]
# Given TotExp^0.06 values
TotExp1 <- 1.5
TotExp2 <- 2.5
# Forecast LifeExp^4.6 for TotExp^0.06 = 1.5
LifeExp1 <- beta0 + beta1 * TotExp1
# Forecast LifeExp^4.6 for TotExp^0.06 = 2.5
LifeExp2 <- beta0 + beta1 * TotExp2
# Taking the 4.6th root to obtain forecasted life expectancy
LifeExpForecast1 <- LifeExp1^(1/4.6)
LifeExpForecast2 <- LifeExp2^(1/4.6)
print(paste("Forecasted Life Expectancy when TotExp^0.06 = 1.5:", round(LifeExpForecast1, 2)))
## [1] "Forecasted Life Expectancy when TotExp^0.06 = 1.5: 63.31"
print(paste("Forecasted Life Expectancy when TotExp^0.06 = 2.5:", round(LifeExpForecast2, 2)))
## [1] "Forecasted Life Expectancy when TotExp^0.06 = 2.5: 86.51"
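The same forecasts can be reproduced from the coefficients printed in the summary of lm2. A minimal sketch (the helper `forecast_life_exp` is hypothetical and not part of the original analysis):

```r
# Coefficients copied from the printed summary of lm2
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on TotExp^0.06

# Hypothetical helper: forecast LifeExp from a value of TotExp^0.06
forecast_life_exp <- function(tot_exp_pow) {
  life_exp_pow <- b0 + b1 * tot_exp_pow   # forecast of LifeExp^4.6
  life_exp_pow^(1 / 4.6)                  # back-transform to years
}

round(forecast_life_exp(c(1.5, 2.5)), 2)

# Raw TotExp values that correspond to TotExp^0.06 = 1.5 and 2.5
c(1.5, 2.5)^(1 / 0.06)
```

Wrapping the two steps in one function keeps the back-transformation (the 1/4.6 power) next to the linear forecast, so it is harder to forget.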
\(\text{LifeExp} = b_0 + b_1 \times \text{PropMD} + b_2 \times \text{TotExp} + b_3 \times \text{PropMD} \times \text{TotExp}\)
lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
summary(lm3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The F statistic is 34.49. The p-value is less than 2.2e-16, which is extremely small. This p-value suggests that it is highly unlikely that the observed results are due to random chance. Since the F statistic is fairly large and the p-value is small, this indicates that at least one of the predictors in the model is significant.
The adjusted \(R^2\) is 0.3471. This means that, after adjusting for the number of predictors, about 34.71% of the variability in life expectancy can be explained by PropMD, total expenditure, and their interaction using this model. This suggests that there may be other variables that are not in this model that can further explain the variability.
The residual standard error is 8.765. An RSE of 8.765 suggests that the observed life expectancy values typically deviate from the predicted values by about 8.765 years. This is only slightly lower than the 9.371 of the first model, so the observed values are only marginally closer to the predicted values.
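Because the model contains an interaction term, the effect of total expenditure on life expectancy is not a single number: it equals b2 + b3 × PropMD. A minimal sketch using the rounded coefficients printed in the summary above (the helper `tot_exp_effect` is hypothetical):

```r
# Rounded coefficients copied from the printed summary of lm3
b2 <-  7.233e-05   # TotExp
b3 <- -6.026e-03   # PropMD:TotExp

# Hypothetical helper: change in LifeExp per unit of TotExp at a given PropMD
tot_exp_effect <- function(prop_md) b2 + b3 * prop_md

tot_exp_effect(0)      # equals b2 when PropMD is 0
tot_exp_effect(0.03)   # effect at PropMD = 0.03
```

Since b3 is negative, the estimated effect of total expenditure shrinks as PropMD increases.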
From the below Q-Q plot, we can see that the residuals do not fall along the normal line perfectly. It is hard to confirm normality based on this plot.
qqnorm(resid(lm3))
qqline(resid(lm3))
In the below residuals vs fitted values plot, we can see that the points are populated towards the top left-hand corner of the plot. In order to confirm linearity, there would need to be no discernible pattern in the points. We cannot confirm linearity based on this plot.
plot(fitted(lm3),resid(lm3))
In order to confirm homoscedasticity, we would need to see the points scattered about 0 in no discernible pattern. In this case, we cannot confirm homoscedasticity.
plot(fitted(lm3),resid(lm3))
abline(h = 0, col = "red", lty = 2)
Using the above residuals vs fitted values plot, we cannot assume independence as there is a discernible trend in the plotted residuals.
Overall, the assumptions for linear regression are not satisfied. This is not the best model.
# Extract coefficients from the linear model lm3
beta0 <- coef(lm3)[1]
beta1 <- coef(lm3)[2]
beta2 <- coef(lm3)[3]
beta3 <- coef(lm3)[4]
# Define the values of PropMD and TotExp
PropMD_value <- 0.03
TotExp_value <- 14
# Compute the predicted LifeExp
LifeExp_prediction <- beta0 + beta1 * PropMD_value + beta2 * TotExp_value + beta3 * PropMD_value * TotExp_value
print(LifeExp_prediction)
## (Intercept)
## 107.696
A life expectancy of 107.696 years does not seem realistic, as the average life expectancy worldwide is around 71.33. A value of 107.696 would be considered an outlier.
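To see where the large forecast comes from, the prediction can be split into the contribution of each term. A minimal sketch using the rounded coefficients printed in the summary of lm3, so the total differs slightly from the full-precision value (the variable names are illustrative):

```r
# Rounded coefficients copied from the printed summary of lm3
b0 <-  6.277e+01   # intercept
b1 <-  1.497e+03   # PropMD
b2 <-  7.233e-05   # TotExp
b3 <- -6.026e-03   # PropMD:TotExp

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted life expectancy
terms <- c(
  intercept   = b0,
  PropMD      = b1 * prop_md,
  TotExp      = b2 * tot_exp,
  interaction = b3 * prop_md * tot_exp
)
terms
sum(terms)   # close to 107.696; the small gap comes from coefficient rounding
```

The PropMD term alone adds roughly 45 years to the intercept, which is what pushes the forecast past a realistic life expectancy.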