knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
AllCountries <- read.csv("AllCountries.csv")
# Quick check
str(AllCountries)
## 'data.frame': 217 obs. of 26 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : int 521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
## $ Rural : num 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num 3.72 4.08 13.81 NA NA ...
## $ Health : num 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : int 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num 67.4 123.7 111 NA 104.4 ...
## $ HIV : num NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : int NA 808 1328 NA NA 545 NA 2030 1016 NA ...
## $ Electricity : int NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
## $ Developed : int NA 1 1 NA NA 1 NA 2 1 NA ...
summary(AllCountries$LifeExpectancy)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 52.20 66.90 74.30 72.46 77.70 84.70 18
summary(AllCountries$GDP)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 275 2032 5950 14733 17298 114340 30
# Simple linear regression
mod_simple <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(mod_simple)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
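The intercept is 68.42, the predicted life expectancy for a country with a GDP per capita of $0. Because the smallest GDP in the data is 275, this value is an extrapolation: it anchors the regression line rather than describing any real country.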
The slope for GDP is estimated at 0.0002476 years per dollar. For each additional 1,000 US dollars of GDP per capita, the model predicts life expectancy to increase by about 0.25 years, or roughly 3 months.
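The unit conversion behind this interpretation can be checked with a few lines of arithmetic. This is a minimal sketch that copies the slope from the regression output above; the variable names are illustrative only.

```r
# Slope copied from summary(mod_simple): years of life expectancy per 1 USD of GDP
slope_per_dollar <- 2.476e-04

# Rescale to a more readable unit: years per 1,000 USD
slope_per_1000 <- slope_per_dollar * 1000

# Express the same change in months
months_per_1000 <- slope_per_1000 * 12

round(c(years = slope_per_1000, months = months_per_1000), 2)
```

Fitting the model with `I(GDP / 1000)` as the predictor would report the per-1,000-dollar slope directly without changing the fit.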
The R² value is 0.4304, which means that about 43% of the variation in life expectancy across countries is explained by GDP per capita in this model. The remaining 57% is unexplained, so GDP alone leaves much of the variation in life expectancy unaccounted for.
mod_multi <- lm(LifeExpectancy ~ GDP + Health + Internet,
data = AllCountries)
summary(mod_multi)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Health Estimate = 0.2479
Holding GDP and Internet constant, a one-percentage-point increase in healthcare spending is associated with an increase in life expectancy of about 0.248 years on average. This coefficient is positive and statistically significant (p = 0.000247), so Health remains a useful predictor of life expectancy after accounting for GDP and Internet usage.
Internet Estimate = 0.1903
Holding GDP and Health constant, a one-percentage-point increase in the share of the population using the Internet is associated with an increase of about 0.19 years in life expectancy. This relationship is also strongly statistically significant (p < 2e-16), suggesting that countries with wider Internet access tend to have higher life expectancies, even after adjusting for income and healthcare spending.
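To make "holding the other predictors constant" concrete, the sketch below plugs the reported coefficients into the fitted equation for two countries that differ only in Internet use. The helper `predict_life()` and the input values are hypothetical, chosen purely for illustration.

```r
# Coefficients copied from summary(mod_multi)
coefs <- c(intercept = 59.08, GDP = 2.367e-05, Health = 0.2479, Internet = 0.1903)

# Hypothetical helper: predicted life expectancy from the fitted equation
predict_life <- function(gdp, health, internet) {
  unname(coefs["intercept"] + coefs["GDP"] * gdp +
    coefs["Health"] * health + coefs["Internet"] * internet)
}

# Two hypothetical countries that differ only in Internet use (50% vs 60%)
country_a <- predict_life(gdp = 10000, health = 10, internet = 50)
country_b <- predict_life(gdp = 10000, health = 10, internet = 60)

# The gap equals 10 times the Internet coefficient (up to floating point)
country_b - country_a
```

Because GDP and Health are identical for both countries, the entire difference in predicted life expectancy comes from the Internet term.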
GDP Estimate = 0.00002367
The GDP coefficient is positive but not statistically significant (p = 0.302). After controlling for Health and Internet access, GDP does not appear to explain additional variation in life expectancy. A plausible explanation is that GDP is strongly correlated with Internet access and Health spending, so once those are included, GDP has little unique explanatory power left.
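The way a predictor can lose significance once a correlated predictor enters the model can be illustrated with simulated data. The sketch below does not use AllCountries: every variable is generated, and the response is deliberately built so that `gdp` has no direct effect once `internet` is known.

```r
set.seed(42)  # simulated data, not AllCountries
n <- 200
internet <- runif(n, 0, 100)
gdp <- 200 * internet + rnorm(n, sd = 3000)     # gdp built to track internet
life <- 60 + 0.2 * internet + rnorm(n, sd = 4)  # gdp has no direct effect here

cor_gdp_internet <- cor(gdp, internet)

fit_gdp_only <- lm(life ~ gdp)
fit_both <- lm(life ~ gdp + internet)

# p-value for gdp when it is alone and when internet is also in the model
p_alone <- summary(fit_gdp_only)$coefficients["gdp", "Pr(>|t|)"]
p_joint <- summary(fit_both)$coefficients["gdp", "Pr(>|t|)"]
c(correlation = cor_gdp_internet, p_alone = p_alone, p_joint = p_joint)
```

On its own, `gdp` picks up the effect of `internet` because the two move together. With both in the model, the shared signal is attributed to `internet`. Running `cor()` on the complete cases of the real data would show whether the same mechanism is at work there.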
Simple Model (GDP only): R² = 0.4304, adjusted R² = 0.4272
Multiple Model (GDP + Health + Internet): R² = 0.7213, adjusted R² = 0.7164
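The adjusted R² rises from 0.4272 to 0.7164, meaning the multiple regression explains about 71.6% of the variation in life expectancy after adjusting for the number of predictors, compared to only 42.7% in the simple regression. Adjusted R² penalizes each added predictor, so an increase this large indicates that the additional variables provide real explanatory power. The two models were fit on slightly different samples (179 versus 173 complete cases), so the comparison is approximate.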
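The adjusted R² values can be reproduced from the reported R² and degrees of freedom using the standard formula. In this minimal sketch the function name `adjusted_r2()` is illustrative, and the sample sizes are derived from the residual degrees of freedom in the two summaries.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
n_simple <- 177 + 2  # intercept + GDP
n_multi <- 169 + 4   # intercept + GDP + Health + Internet

adj_simple <- adjusted_r2(r2 = 0.4304, n = n_simple, p = 1)
adj_multi <- adjusted_r2(r2 = 0.7213, n = n_multi, p = 3)

round(c(simple = adj_simple, multiple = adj_multi), 4)
```

Both results agree with the summary output up to rounding of the inputs. The penalty for the two extra predictors is small compared with the gain in R².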
# Residuals vs Fitted (homoscedasticity)
plot(mod_simple, which = 1)
# Normal Q-Q plot (normality)
plot(mod_simple, which = 2)
Homoscedasticity
The residuals-versus-fitted plot shows two problems. The residuals follow a curved, non-random pattern, which suggests that the relationship between GDP and life expectancy is not linear. Their spread is also uneven across the fitted values, which violates the homoscedasticity assumption. As a result, the simple linear model does not fully capture the pattern in the data, and predictions may be less reliable at very low or very high GDP levels.
Normality of residuals
The normality assumption is moderately violated, especially in the tails. While the center of the distribution looks reasonably normal, the residuals show heavy tails, meaning extreme residuals are more common than a normal distribution would predict. This can affect the accuracy of hypothesis tests and confidence intervals in the simple model.
Reflection
Based on the diagnostic plots, the simple regression model does not fully meet the assumptions of linearity, homoscedasticity, and normality. The residuals show clear curvature and unequal variance, suggesting the relationship between GDP and life expectancy is more complex than the simple straight-line model captures. These issues indicate that the simple model may not be the best predictor, which aligns with the much stronger performance of the multiple regression model in Question 2.
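One common response to curvature of this kind is to transform the predictor. The sketch below uses simulated data, not AllCountries, in which the response is linear in log(gdp) by construction. The log transform is an assumption made for illustration and is not something the analysis above tested.

```r
set.seed(1)  # simulated data, not AllCountries
n <- 180
gdp <- exp(runif(n, log(300), log(100000)))   # right-skewed, like income data
life <- 40 + 4 * log(gdp) + rnorm(n, sd = 3)  # curved in gdp, linear in log(gdp)

fit_linear <- lm(life ~ gdp)
fit_log <- lm(life ~ log(gdp))

# Compare how much variation each specification explains
r2_linear <- summary(fit_linear)$r.squared
r2_log <- summary(fit_log)$r.squared
c(linear = r2_linear, log = r2_log)
```

Calling `plot(fit_linear, which = 1)` and `plot(fit_log, which = 1)` on these fits shows the curved residual pattern in the first model and a flatter band in the second. Whether the same transformation helps on the real data would need to be checked with the same diagnostic plots.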
# Predicted values
pred_multi <- predict(mod_multi)
# Residuals
res_multi <- resid(mod_multi)
# RMSE
rmse_multi <- sqrt(mean(res_multi^2))
rmse_multi
## [1] 4.056417
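An RMSE of 4.056 means that the multiple regression model's predictions of life expectancy typically differ from the observed values by about 4.06 years. RMSE is the square root of the mean squared residual, so large errors count more heavily than they would in an average of absolute errors. Because it is computed on the same data used to fit the model, it is an in-sample measure.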
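The RMSE is closely related to the residual standard error in the model summary: both are based on the residual sum of squares, but RMSE divides by the number of observations and the residual standard error divides by the residual degrees of freedom. A minimal check using the reported values (variable names are illustrative):

```r
# Values copied from summary(mod_multi)
rse <- 4.104           # residual standard error
df_resid <- 169        # residual degrees of freedom
n_obs <- df_resid + 4  # 4 estimated coefficients, so 173 rows were used

# RSE = sqrt(RSS / df_resid) and RMSE = sqrt(RSS / n_obs), hence:
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
rmse_from_rse
```

The result matches the RMSE computed from the residuals up to rounding of the reported standard error. This is why RMSE is always slightly smaller than the residual standard error for the same model.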
This RMSE indicates a reasonably good fit. A typical error of about 4 years means the model is far from perfect, but it is smaller than the simple regression's residual standard error of about 5.9 years, and the model accounts for about 72% of the variation in life expectancy.
In this scenario, Energy and Electricity are highly correlated measures of national energy use. When two predictors are strongly related, the model experiences multicollinearity, which makes it difficult for the regression to determine the unique contribution of each variable to CO₂ emissions.
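As a result, the estimated coefficients for Energy and Electricity can become unstable and unreliable. Even if CO₂ emissions are strongly related to overall energy consumption, the model may produce inflated standard errors, individually non-significant coefficients, or estimates whose sign or size shifts noticeably when the data or the set of predictors changes.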
The overall model might still have a high R² and predict CO₂ emissions well, but the individual coefficients cannot be interpreted confidently because the model cannot separate the effects of Energy and Electricity.
To diagnose multicollinearity, we could calculate Variance Inflation Factors (VIFs) to measure how severe it is. To address it, we could remove one of the two variables or combine them into a single energy-use index.
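A VIF can be computed in base R by regressing each predictor on the others and taking 1 / (1 - R²). The sketch below uses simulated data, not AllCountries: the lowercase variables are generated so that `electricity` closely tracks `energy`, and `vif_manual()` is a hypothetical helper written for this illustration.

```r
set.seed(7)  # simulated data, not AllCountries
n <- 150
energy <- rnorm(n, mean = 2000, sd = 800)
electricity <- 1.5 * energy + rnorm(n, sd = 300)  # built to track energy closely
co2 <- 0.002 * energy + rnorm(n, sd = 1)
sim <- data.frame(co2, energy, electricity)

# VIF for one predictor: regress it on the other predictors, then 1 / (1 - R^2)
vif_manual <- function(data, predictors) {
  sapply(predictors, function(target) {
    others <- setdiff(predictors, target)
    r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(sim, c("energy", "electricity"))
vifs
```

With only two predictors, both VIFs are identical because each regression has the same R². A common rule of thumb treats values above 5 or 10 as a sign of problematic collinearity. The `vif()` function in the `car` package computes the same quantity directly from a fitted model.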