Overview
In this activity, I am using the AllCountries dataset, which contains 217 observations from countries around the world and includes demographic, economic, and health-related variables. This dataset is useful for regression analysis because it allows me to examine how key predictors such as GDP, healthcare spending, and internet access relate to important outcomes like life expectancy.
In my analysis, I upload the AllCountries dataset as a CSV file, prepare it for modeling, fit a simple linear regression model and a multiple linear regression model, interpret the coefficients, check model assumptions, and diagnose overall model fit using residuals and RMSE. I also walk through a hypothetical example to understand how multicollinearity can affect regression results.
Step 1: Upload and Prepare the Data
To begin, I upload the dataset from a CSV file. Then I check the structure of the data so I can understand the types of variables I will work with, especially the predictors that will be used for the regression models.
# Load required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Set the working directory to the folder that contains the CSV file
setwd("~/Data 101/HW9")
# 1. Upload the data set as a csv
allcountries <- read.csv("AllCountries.csv")
# Take a quick look
head(allcountries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
# Check the structure
str(allcountries)
## 'data.frame': 217 obs. of 26 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : int 521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
## $ Rural : num 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num 3.72 4.08 13.81 NA NA ...
## $ Health : num 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : int 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num 67.4 123.7 111 NA 104.4 ...
## $ HIV : num NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : int NA 808 1328 NA NA 545 NA 2030 1016 NA ...
## $ Electricity : int NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
## $ Developed : int NA 1 1 NA NA 1 NA 2 1 NA ...
# Summary of key variables of interest
summary(allcountries[, c("LifeExpectancy", "GDP", "Health", "Internet",
"CO2", "Energy", "Electricity")])
## LifeExpectancy GDP Health Internet
## Min. :52.20 Min. : 275 Min. : 0.000 Min. : 1.30
## 1st Qu.:66.90 1st Qu.: 2032 1st Qu.: 6.157 1st Qu.:29.18
## Median :74.30 Median : 5950 Median : 9.605 Median :58.35
## Mean :72.46 Mean : 14733 Mean :10.597 Mean :54.47
## 3rd Qu.:77.70 3rd Qu.: 17298 3rd Qu.:13.713 3rd Qu.:78.92
## Max. :84.70 Max. :114340 Max. :39.460 Max. :98.90
## NA's :18 NA's :30 NA's :29 NA's :13
## CO2 Energy Electricity
## Min. : 0.0400 Min. : 66 Min. : 39
## 1st Qu.: 0.8575 1st Qu.: 738 1st Qu.: 904
## Median : 2.7550 Median : 1574 Median : 2620
## Mean : 4.9780 Mean : 2664 Mean : 4270
## 3rd Qu.: 6.2525 3rd Qu.: 3060 3rd Qu.: 5600
## Max. :43.8600 Max. :17923 Max. :53832
## NA's :13 NA's :82 NA's :76
I look specifically at LifeExpectancy, GDP, Health, and Internet, since these will be used for the simple and multiple regression models.
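Because the summary shows missing values in every one of these variables (18 for LifeExpectancy and 30 for GDP, for example), it can help to count how many rows are complete before modeling. The following is a minimal sketch on a small made-up data frame; `toy` and its values are invented for illustration and are not taken from the AllCountries file.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  LifeExpectancy = c(64.0, 78.5, NA, 61.8, 76.5),
  GDP            = c(521, 5254, 4279, NA, 16864),
  Health         = c(2.01, 9.51, 10.73, 5.43, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of rows with no missing values in the two columns
# that a LifeExpectancy ~ GDP model would need
n_complete <- sum(complete.cases(toy[, c("LifeExpectancy", "GDP")]))

print(na_counts)
print(n_complete)
```

Running the same two calls on allcountries, with the relevant column names, would show how many countries each model can actually use.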
Step 2: Simple Linear Regression — LifeExpectancy ~ GDP
# Fit simple linear regression: LifeExpectancy ~ GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = allcountries)
# View the model summary
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = allcountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
For the simple linear regression, I predict LifeExpectancy from GDP per capita. I do not remove missing values by hand: lm() drops incomplete rows automatically, and the summary reports that 38 observations were deleted due to missingness, leaving 179 countries in the model.
After fitting the model, I interpret the intercept and slope:
The intercept (68.42) is the predicted life expectancy in years when GDP is zero. No country has a GDP of zero, so this value is not meaningful on its own, but it anchors the regression line.
The slope (2.476e-04) is the expected change in life expectancy for a one-unit increase in GDP. It is positive and statistically significant (p < 2e-16), so life expectancy tends to rise with GDP: roughly 0.25 years for every additional 1,000 units of GDP per capita.
I also look at the R² value to see how much of the variation in life expectancy is explained by GDP alone. Here R² is 0.4304, so GDP accounts for about 43% of the variation, a moderate relationship that leaves more than half of the variation unexplained.
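Since the slope is expressed per single unit of GDP, it is easier to read after rescaling. The sketch below does the arithmetic with the intercept and slope copied from the summary output above; the GDP value of 10,000 is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary(simple_model) output above
intercept <- 68.42
slope     <- 2.476e-04

# Expected change in life expectancy (years) for a 1,000-unit increase in GDP
slope_per_1000 <- slope * 1000

# Predicted life expectancy at a hypothetical GDP of 10,000
pred_at_10000 <- intercept + slope * 10000

print(slope_per_1000)
print(pred_at_10000)
```

With the fitted model in memory, the same quantities should be available from coef(simple_model)["GDP"] * 1000 and predict(simple_model, newdata = data.frame(GDP = 10000)), without copying rounded numbers by hand.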
Step 3: Multiple Linear Regression — LifeExpectancy ~ GDP + Health + Internet
# Fit multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = allcountries)
# View the model summary
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = allcountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Next, I fit a multiple regression model predicting LifeExpectancy using three predictors:
GDP (economic strength)
Health (government spending on healthcare)
Internet (percentage of population with internet access)
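I interpret each coefficient while holding the other variables constant.
For example, the Health coefficient (0.2479) means that a one-unit increase in Health is associated with about 0.25 more years of life expectancy when GDP and Internet are held constant. The Internet coefficient (0.1903) is also significant, while GDP is no longer significant (p = 0.302) once Health and Internet are in the model.
I also compare the adjusted R² of the multiple regression with that of the simple model. It rises from 0.4272 to 0.7164, so the added predictors explain life expectancy considerably better, although the two models are fit to slightly different sets of countries (179 versus 173) because of missing values.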
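The simple and multiple models above drop different numbers of rows (38 versus 44), so their adjusted R² values come from slightly different samples. One way to make the comparison cleaner is to fit both models on the same complete rows. This is a sketch on simulated data; `toy`, its coefficients, and the injected missing values are all assumptions made for illustration.

```r
# Simulated data with the same column names as the real dataset (values invented)
set.seed(1)
n <- 60
toy <- data.frame(
  GDP      = runif(n, 500, 50000),
  Health   = runif(n, 2, 20),
  Internet = runif(n, 5, 95)
)
toy$LifeExpectancy <- 60 + 0.2 * toy$Health + 0.15 * toy$Internet + rnorm(n, sd = 3)
toy$Health[c(3, 10)] <- NA   # inject two missing values

# Keep only rows that are complete for every variable in the larger model
complete <- na.omit(toy)

# Fit both models on exactly the same rows
m1 <- lm(LifeExpectancy ~ GDP, data = complete)
m3 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = complete)

# Both models now use the same number of observations
print(c(nobs(m1), nobs(m3)))

# Adjusted R-squared for each model
adj_r2 <- c(simple = summary(m1)$adj.r.squared,
            multiple = summary(m3)$adj.r.squared)
print(adj_r2)

# F-test for the nested models (valid because the rows match)
print(anova(m1, m3))
```

Applied to the real data, the same pattern would mean building one na.omit() data frame with all four variables and passing it to both lm() calls.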
Step 4: Checking Assumptions — Homoscedasticity and Normality
simple_data <- allcountries[, c("LifeExpectancy", "GDP")]
simple_model <- lm(LifeExpectancy ~ GDP, data = simple_data)
# Extract residuals and fitted values
simple_resid <- resid(simple_model)
simple_fitted <- fitted(simple_model)
# 1) Residuals vs Fitted (Homoscedasticity)
plot(simple_fitted, simple_resid,
xlab = "Fitted values (Predicted LifeExpectancy)",
ylab = "Residuals",
main = "Residuals vs Fitted: Simple Model")
abline(h = 0, lty = 2)
# 2) Normal Q-Q Plot (Normality)
qqnorm(simple_resid, main = "Normal Q-Q Plot: Simple Model Residuals")
qqline(simple_resid, col = 1, lwd = 2)
I check the regression assumptions for the simple model (LifeExpectancy ~ GDP).
To check homoscedasticity, I look at the residuals vs. fitted values plot. The ideal outcome is a random, evenly spread cloud of points. A violation appears as funnel shapes or patterns, which indicates non-constant variance.
To check normality, I look at the Q–Q plot of residuals. Ideally, the points fall along the diagonal line. If the points curve or deviate heavily at the ends, the residuals are not normally distributed.
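After running these checks in R, I compare the plots against these ideal patterns. The residual summary for the simple model (minimum -16.35, median 1.55, maximum 9.33) already hints that the residuals are skewed toward large negative values, so I look closely at the lower tail of the Q-Q plot.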
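The plots can be supplemented with simple numeric checks. The sketch below runs a Shapiro-Wilk test on the residuals and compares the residual spread in the lower and upper halves of the fitted values. The second check is only an informal rule of thumb rather than a formal test, and the `toy` data and model are simulated for illustration.

```r
# Simulated data that satisfies the regression assumptions (values invented)
set.seed(42)
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(100)
toy_model <- lm(y ~ x, data = toy)

res <- resid(toy_model)
fit <- fitted(toy_model)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normally distributed)
sw <- shapiro.test(res)
print(sw$p.value)

# Constant variance (informal): ratio of residual spread in the upper half
# of the fitted values to the spread in the lower half; values far from 1
# suggest the funnel shape described above
low_half <- fit <= median(fit)
spread_ratio <- sd(res[!low_half]) / sd(res[low_half])
print(spread_ratio)
```

The same lines could be run with simple_model in place of toy_model to put numbers next to the two diagnostic plots.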
Step 5: Diagnosing Model Fit — RMSE and Residuals
# Build multiple regression dataset (Country + predictors + response)
mult_data <- allcountries[, c("Country", "LifeExpectancy", "GDP", "Health", "Internet")]
mult_data <- na.omit(mult_data)
# Fit multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet,
data = mult_data)
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = mult_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
# Get residuals from the model
mult_resid <- resid(multiple_model)
# Add residuals back to the data frame
mult_data_with_resid <- mult_data
mult_data_with_resid$resid <- mult_resid
# Sort by largest absolute residual
order_index <- order(-abs(mult_data_with_resid$resid))
mult_data_with_resid <- mult_data_with_resid[order_index, ]
# Show the 10 countries with the largest absolute residuals
head(mult_data_with_resid[, c("Country", "LifeExpectancy", "resid")], 10)
## Country LifeExpectancy resid
## 48 Cote d'Ivoire 54.1 -14.566175
## 112 Lesotho 54.6 -12.884749
## 145 Nigeria 53.9 -11.741767
## 171 Sierra Leone 52.2 -11.365464
## 64 Eswatini 58.3 -10.419868
## 178 South Africa 63.4 -9.827573
## 16 Bangladesh 72.8 9.416106
## 114 Libya 72.1 8.699673
## 39 Chad 53.2 -8.597045
## 97 Italy 83.2 8.302370
For the multiple regression model, I calculate the RMSE (root mean square error). RMSE is the square root of the average squared residual, so it summarizes the typical size of a prediction error in years of life expectancy. A lower RMSE means the model predicts more accurately.
I also review the residuals to check whether certain countries have unusually large errors.
Large residuals mark countries whose actual life expectancy is far from what the model predicts. In the table above, most of the largest residuals are negative: the model over-predicts life expectancy for Cote d'Ivoire, Lesotho, Nigeria, Sierra Leone, Eswatini, and South Africa by roughly 10 to 15 years. Points like these deserve further investigation, since they can influence the regression and reduce confidence in predictions.
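The code above extracts the residuals but does not show the RMSE calculation itself. A minimal sketch follows: `rmse()` is a hypothetical helper name, not a base R function, and the toy model is simulated so that the block runs on its own.

```r
# Hypothetical helper: RMSE = square root of the mean squared residual
rmse <- function(model) {
  sqrt(mean(resid(model)^2))
}

# Simulated data so the example is self-contained (values invented)
set.seed(7)
toy <- data.frame(x = 1:30)
toy$y <- 5 + 0.5 * toy$x + rnorm(30, sd = 2)
toy_model <- lm(y ~ x, data = toy)

toy_rmse  <- rmse(toy_model)
toy_sigma <- sigma(toy_model)   # residual standard error from summary()

# RMSE divides by n, the residual standard error divides by n - p,
# so RMSE is always slightly smaller
print(c(rmse = toy_rmse, residual_se = toy_sigma))
```

Calling rmse(multiple_model) would give the value for the real model. From the summary above (residual standard error 4.104 on 169 degrees of freedom, 173 observations), it should come out to roughly 4.104 * sqrt(169 / 173), or about 4.06 years.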
Step 6: Hypothetical Example — Multicollinearity
# Check correlation between Energy and Electricity
cor(allcountries[, c("Energy", "Electricity")], use = "complete.obs")
## Energy Electricity
## Energy 1.0000000 0.7970054
## Electricity 0.7970054 1.0000000
To understand multicollinearity, I imagine a regression model predicting CO₂ emissions using:
Energy (kilotons of oil equivalent)
Electricity (kWh per capita)
Energy and Electricity are strongly correlated (r ≈ 0.80 in the output above), so including both in the same model raises the risk of multicollinearity.
This means:
Coefficients may become unstable or flip signs.
Standard errors will increase.
Individual p-values may look insignificant even if the variables matter.
Interpretation becomes difficult because the predictors overlap too much.
Even though the model might still have a strong R², the reliability of each individual coefficient decreases. In this situation, I would consider removing one of the variables or combining them.
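A common way to quantify this overlap is the variance inflation factor (VIF), which is 1 / (1 - R²) from regressing one predictor on the others. The sketch below computes it by hand: `vif_manual()` is a hypothetical helper, the `toy` data are simulated, and the last lines plug in the correlation reported above.

```r
# Hypothetical helper: regress one predictor on the others,
# then compute VIF = 1 / (1 - R^2)
vif_manual <- function(data, predictor, others) {
  f  <- reformulate(others, response = predictor)
  r2 <- summary(lm(f, data = data))$r.squared
  1 / (1 - r2)
}

# Simulated, correlated predictors (values invented for illustration)
set.seed(3)
energy <- rnorm(80, mean = 2500, sd = 800)
toy <- data.frame(
  Energy      = energy,
  Electricity = 1.5 * energy + rnorm(80, sd = 700)
)

vif_energy <- vif_manual(toy, "Energy", "Electricity")

# With exactly two predictors, the VIF reduces to 1 / (1 - r^2)
r <- cor(toy$Energy, toy$Electricity)
print(c(vif_manual = vif_energy, from_correlation = 1 / (1 - r^2)))

# Plugging in the correlation from the output above (0.7970054)
vif_from_output <- 1 / (1 - 0.7970054^2)
print(vif_from_output)
```

For the reported correlation this gives a VIF of about 2.74, meaning the variance of each coefficient would be inflated by a factor of roughly 2.7 compared with uncorrelated predictors. Commonly quoted rules of thumb flag values above 5 or 10, so by that standard the overlap here is noticeable but not extreme. The exact figure in a CO2 model could differ slightly, because rows with missing CO2 would also be dropped.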