Question 1: Simple Linear Regression (Fitting and Interpretation)

AllCountries <- read.csv("AllCountries.csv")

# Fit simple linear regression: LifeExpectancy ~ GDP
model_slr <- lm(LifeExpectancy ~ GDP, data = AllCountries)

summary(model_slr)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

According to the model’s intercept, a country with a GDP per capita of 0 would have a predicted life expectancy of about 68.42 years. No country has a GDP of 0, so the intercept is mostly a mathematical anchor for the regression line rather than a realistic value.

The slope is approximately 0.000248. For each one-dollar increase in GDP per capita, the model predicts that average life expectancy increases by about 0.000248 years, which is roughly 0.25 years per additional $1,000.
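Because a one-dollar change is too small to be meaningful, it can help to rescale the slope. The sketch below only redoes the arithmetic with the rounded coefficients printed in the summary output above. The $1,000 step and the example GDP of $10,000 are illustrative choices, not values from the assignment.

```r
# Rounded coefficients copied from summary(model_slr) above
intercept <- 68.42      # 6.842e+01
slope     <- 2.476e-04  # years of life expectancy per dollar of GDP

# Change in predicted life expectancy for a $1,000 increase in GDP
per_1000 <- slope * 1000

# Predicted life expectancy at an illustrative GDP of $10,000
pred_10k <- intercept + slope * 10000

cat("Per $1,000:", round(per_1000, 3), "years\n")
cat("Prediction at GDP = 10,000:", round(pred_10k, 2), "years\n")
```

With the fitted model in memory, `predict(model_slr, newdata = data.frame(GDP = 10000))` should give the same prediction without the rounding of the printed coefficients.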

With an R² of approximately 0.43, GDP per capita accounts for about 43% of the variance in life expectancy across countries. Wealth and longevity are clearly related, but over half of the variance comes from other factors.


Question 2: Multiple Linear Regression (Fitting and Interpretation)

# Fit multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
model_mlr <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

summary(model_mlr)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

The coefficient for Health is about 0.248. Holding GDP and Internet constant, each one-unit (one percentage point) increase in the Health variable is associated with an average increase of about 0.248 years in life expectancy. This suggests that countries devoting more spending to health care tend to have slightly longer life expectancy than those that do not, after accounting for economic status and Internet access.
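To show how precisely the Health coefficient is estimated, a 95% confidence interval can be rebuilt from the printed estimate and standard error. This is a minimal sketch that assumes the rounded values in the summary output above; the variable names are illustrative.

```r
# Estimate and standard error for Health, copied from summary(model_mlr) above
est <- 0.2479
se  <- 0.06619
df  <- 169  # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 3))
```

With the fitted model in memory, `confint(model_mlr, "Health")` should return the same interval up to rounding. The interval excludes zero, consistent with the small p-value reported for Health.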

The adjusted R² for the multiple regression model is approximately 0.72, compared with approximately 0.43 for the simple regression model. This difference indicates that including Health and Internet as predictors substantially increases the model’s ability to explain variation in life expectancy.
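The adjusted R² values can be reproduced from the printed R² values and degrees of freedom, which also makes the penalty for extra predictors explicit. The sketch below assumes the rounded figures from the two summary outputs; `adj_r2` is a hypothetical helper, not a base R function. Note that the two models were fit on slightly different samples (179 and 173 countries), so the comparison is approximate.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual df + number of estimated coefficients (from the output above)
adj_slr <- adj_r2(r2 = 0.4304, n = 177 + 2, p = 1)
adj_mlr <- adj_r2(r2 = 0.7213, n = 169 + 4, p = 3)

print(round(c(simple = adj_slr, multiple = adj_mlr), 4))
```

Because both samples are large relative to the number of predictors, the adjustment is small and the gain in explained variance is almost entirely real.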


Question 3: Checking Assumptions (Homoscedasticity and Normality)

We can assess homoscedasticity by plotting the residuals against the fitted values. Ideally, the points scatter randomly about the horizontal line at zero, with no funnel shapes or other patterns. If the constant-variance assumption is violated, the residuals spread out or fan as the fitted values increase (or decrease). This means the error variance is not constant across the range of predictions, which makes the standard errors and hypothesis tests unreliable.

We can assess the normality of residuals using a Q-Q plot. In a Q-Q plot, we compare the distribution of residuals to a theoretical normal distribution. We expect the points on a Q-Q plot to be aligned closely along the diagonal reference line. If we see a violation of normality, for example, if the tails are heavy or the curve is S-shaped, we can conclude that the residuals are skewed or have an excessive number of outliers, both of which will undermine the validity of confidence intervals and p-values.

par(mfrow = c(1, 2))

# Residuals vs Fitted (homoscedasticity)
plot(model_slr$fitted.values, model_slr$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)

# Q-Q plot (normality)
qqnorm(model_slr$residuals, main = "Normal Q-Q Plot")
qqline(model_slr$residuals, col = "red")

Looking at the Residuals vs Fitted plot, the spread of residuals is not perfectly constant. There is some evidence of heteroscedasticity, particularly at lower fitted values, where the spread appears wider. This does not match the ideal outcome of uniform scatter and suggests the model’s predictions may be less reliable at lower life expectancy levels.

For the Q-Q plot, the residuals follow the diagonal reasonably well in the middle range but deviate at the tails, indicating some departure from normality, likely due to a few outlier countries. This mild violation means that while the model is broadly valid, confidence intervals and p-values should be interpreted with some caution.
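The visual checks above can be complemented with formal tests. The following is a minimal sketch on simulated data (not the AllCountries data) in which the error spread grows with the predictor by construction. It uses a Breusch-Pagan-style auxiliary regression in its studentized form, computed by hand with base R, together with `shapiro.test()` for normality. All variable names are illustrative.

```r
set.seed(1)

# Simulated data (hypothetical): error spread grows with x, so the
# constant-variance assumption is violated by construction
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Breusch-Pagan-style check: regress the squared residuals on the
# predictor; n * R^2 is compared with a chi-squared distribution (df = 1)
e2 <- residuals(fit)^2
aux <- lm(e2 ~ x)
lm_stat <- n * summary(aux)$r.squared
p_bp <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

# Shapiro-Wilk test of normality on the residuals
sw <- shapiro.test(residuals(fit))

cat("Breusch-Pagan p-value:", signif(p_bp, 3), "\n")
cat("Shapiro-Wilk p-value: ", signif(sw$p.value, 3), "\n")
```

A small p-value in the first test points to non-constant variance, and a small p-value in the second points to non-normal residuals. For the model in this report, the same steps would use `residuals(model_slr)` and `fitted(model_slr)` in place of the simulated quantities.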


Question 4: Diagnosing Model Fit (RMSE and Residuals)

# Calculate RMSE for the multiple regression model
residuals_mlr <- model_mlr$residuals
rmse <- sqrt(mean(residuals_mlr^2, na.rm = TRUE))
cat("RMSE:", round(rmse, 4), "years\n")
## RMSE: 4.0564 years

# Identify countries with the largest residuals
AllCountries_complete <- AllCountries[complete.cases(AllCountries[, c("LifeExpectancy", "GDP", "Health", "Internet")]), ]
AllCountries_complete$residuals <- residuals(model_mlr)

# dplyr provides %>%, arrange(), and select() used below
library(dplyr)

# Top 5 largest absolute residuals
top_residuals <- AllCountries_complete %>%
  arrange(desc(abs(residuals))) %>%
  select(Country, LifeExpectancy, GDP, Health, Internet, residuals) %>%
  head(5)

print(top_residuals)
##         Country LifeExpectancy  GDP Health Internet residuals
## 1 Cote d'Ivoire           54.1 1716   4.88     43.8 -14.56617
## 2       Lesotho           54.6 1324  10.90     29.8 -12.88475
## 3       Nigeria           53.9 2028   5.01     27.7 -11.74177
## 4  Sierra Leone           52.2  523   7.91     13.2 -11.36546
## 5      Eswatini           58.3 4140  15.23     30.3 -10.41987

The multiple regression model’s RMSE of about 4.06 years is the typical size of its prediction errors: predicted life expectancy misses the observed value by roughly four years. This is reasonable but not especially precise, which is unsurprising given how much social, historical, and health conditions differ across the 217 countries in the data set (173 of which had complete data and entered the fit).
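The RMSE printed above (4.0564) is slightly smaller than the residual standard error in the summary output (4.104) because the RMSE divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below checks that relationship using the rounded values from the output; the variable names are illustrative.

```r
# Values copied from summary(model_mlr) above
rse      <- 4.104    # residual standard error
df_resid <- 169      # residual degrees of freedom
n        <- 169 + 4  # observations used: residual df + 4 coefficients

# RSE = sqrt(SSE / df) and RMSE = sqrt(SSE / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
print(round(rmse_from_rse, 3))
```

The result agrees with the reported RMSE up to the rounding of the printed residual standard error.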

Some countries have large residuals, meaning the model predicts them poorly. The five largest absolute residuals are all negative (Cote d'Ivoire, Lesotho, Nigeria, Sierra Leone, and Eswatini), so in each case observed life expectancy is roughly 10 to 15 years below what GDP, health spending, and Internet access would predict. Such countries likely face structural conditions that the model does not capture, and they warrant further investigation.
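Attaching residuals to the right rows is the step most likely to go wrong when some rows have missing values. One alternative to subsetting with `complete.cases()` is to fit with `na.action = na.exclude`, so that `residuals()` returns one value per original row. The sketch below uses a small made-up data frame (`toy`, with hypothetical columns `id`, `y`, and `x`) rather than the AllCountries data.

```r
# Toy data (hypothetical) with missing values in a predictor and the response
toy <- data.frame(
  id = c("A", "B", "C", "D", "E", "F"),
  y  = c(1.0, 2.1, NA, 4.2, 4.9, 6.1),
  x  = c(1, 2, 3, NA, 5, 6)
)

# na.exclude drops incomplete rows for fitting but remembers where they were
fit <- lm(y ~ x, data = toy, na.action = na.exclude)

# residuals() now returns one value per original row, NA where data were missing
toy$resid <- residuals(fit)
print(toy)

# Rows ordered by absolute residual, largest first (NAs are placed last)
print(toy[order(-abs(toy$resid)), c("id", "resid")])
```

Note that this padding applies to the `residuals()` extractor. The `fit$residuals` component still contains only the rows used in the fit.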


Question 5: Hypothetical Example (Multicollinearity in Multiple Regression)

When Electricity and Energy are highly correlated with each other and both are included as predictors, the regression model has difficulty separating how much each contributes to carbon dioxide (CO₂) emissions. Their coefficient estimates become unstable: small changes in the data can produce large swings in the estimated slopes. The standard errors are inflated as a result, which makes it difficult to determine what effect each variable actually has on CO₂ emissions.

Consequently, although the overall fit may be good (high R²), the individual coefficients for Electricity and Energy may not be statistically significant and cannot be interpreted reliably. One of them can even take a negative sign, contrary to what theory would predict.

A common method of detecting multicollinearity is to calculate the Variance Inflation Factor (VIF) for each predictor. A VIF greater than 5 or 10 is typically taken as a signal that multicollinearity is a problem. Remedies include removing one of the correlated predictors, combining them into a single composite variable (such as total energy consumption), or applying a dimensionality reduction technique such as principal component analysis before fitting the regression model.
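The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The following sketch computes it by hand on simulated data in which Energy is constructed to track Electricity closely. The data, the coefficients, and the helper `manual_vif` are all hypothetical and stand in for the scenario described above.

```r
set.seed(42)

# Simulated data (hypothetical): Energy is Electricity plus a little noise
n <- 100
Electricity <- rnorm(n)
Energy <- Electricity + rnorm(n, sd = 0.2)
CO2 <- 1 + 0.5 * Electricity + 0.5 * Energy + rnorm(n)
sim <- data.frame(CO2, Electricity, Energy)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on the remaining predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- manual_vif(sim, c("Electricity", "Energy"))
print(round(vifs, 1))

# Coefficient table for the full model: compare the standard errors
# of the two correlated predictors with their estimates
print(round(coef(summary(lm(CO2 ~ Electricity + Energy, data = sim))), 3))
```

With only two predictors the two VIFs are identical, since each is regressed on the other. If the car package is installed, its `vif()` function applied to the fitted model should report the same values.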