Overview

In this activity, I am using the AllCountries dataset, which contains 217 observations, one per country or territory, and 26 demographic, economic, and health-related variables. This dataset is useful for regression analysis because it lets me examine how key predictors such as GDP, healthcare spending, and internet access relate to important outcomes like life expectancy.

In my analysis, I load the AllCountries dataset from a CSV file, prepare it for modeling, fit a simple linear regression model and a multiple linear regression model, interpret the coefficients, check model assumptions, and diagnose overall model fit using residuals and RMSE. I also walk through a hypothetical example to understand how multicollinearity can affect regression results.

Step 1: Load and Prepare the Data

To begin, I load the dataset from a CSV file. Then I check the structure of the data so I can understand the types of variables I will work with, especially the predictors that will be used for the regression models.

# Load required packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Set the working directory explicitly so read.csv() can find the file
# (relative paths did not work for me)

setwd("~/Data 101/HW9")

# 1. Load the data set from a CSV file

allcountries <- read.csv("AllCountries.csv")

# Take a quick look

head(allcountries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1
# Check the structure

str(allcountries)
## 'data.frame':    217 obs. of  26 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num  652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num  37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num  56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : int  521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
##  $ Rural         : num  74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num  0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num  0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num  3.72 4.08 13.81 NA NA ...
##  $ Health        : num  2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : int  323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num  11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num  67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num  NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num  30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num  9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num  32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num  6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num  2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num  64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num  50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num  1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : int  NA 808 1328 NA NA 545 NA 2030 1016 NA ...
##  $ Electricity   : int  NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
##  $ Developed     : int  NA 1 1 NA NA 1 NA 2 1 NA ...
# Summary of key variables of interest

summary(allcountries[, c("LifeExpectancy", "GDP", "Health", "Internet",
                         "CO2", "Energy", "Electricity")])
##  LifeExpectancy       GDP             Health          Internet    
##  Min.   :52.20   Min.   :   275   Min.   : 0.000   Min.   : 1.30  
##  1st Qu.:66.90   1st Qu.:  2032   1st Qu.: 6.157   1st Qu.:29.18  
##  Median :74.30   Median :  5950   Median : 9.605   Median :58.35  
##  Mean   :72.46   Mean   : 14733   Mean   :10.597   Mean   :54.47  
##  3rd Qu.:77.70   3rd Qu.: 17298   3rd Qu.:13.713   3rd Qu.:78.92  
##  Max.   :84.70   Max.   :114340   Max.   :39.460   Max.   :98.90  
##  NA's   :18      NA's   :30       NA's   :29       NA's   :13     
##       CO2              Energy       Electricity   
##  Min.   : 0.0400   Min.   :   66   Min.   :   39  
##  1st Qu.: 0.8575   1st Qu.:  738   1st Qu.:  904  
##  Median : 2.7550   Median : 1574   Median : 2620  
##  Mean   : 4.9780   Mean   : 2664   Mean   : 4270  
##  3rd Qu.: 6.2525   3rd Qu.: 3060   3rd Qu.: 5600  
##  Max.   :43.8600   Max.   :17923   Max.   :53832  
##  NA's   :13        NA's   :82      NA's   :76

I look specifically at LifeExpectancy, GDP, Health, and Internet, since these will be used for the simple and multiple regression models.
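
As a quick optional check before modeling, I can also look at how these four variables correlate with one another; the short sketch below uses complete cases only.

# Pairwise correlations among the modeling variables (complete cases only)
cor(allcountries[, c("LifeExpectancy", "GDP", "Health", "Internet")],
    use = "complete.obs")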

Step 2: Simple Linear Regression — LifeExpectancy ~ GDP

# Fit simple linear regression: LifeExpectancy ~ GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = allcountries)

# View the model summary
summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = allcountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

For the simple linear regression, I predict LifeExpectancy from GDP per capita. I do not need to remove rows with missing values by hand: lm() drops incomplete rows automatically, which is why the summary reports 38 observations deleted due to missingness.

After fitting the model, I interpret the intercept and slope:

The intercept (about 68.4 years) represents the predicted life expectancy when GDP is zero. While a GDP of zero isn’t meaningful in real-world terms, the intercept is needed to anchor the regression line.

The slope (about 0.00025) tells me how much life expectancy is expected to change with every one-dollar increase in GDP per capita. Because the slope is positive, life expectancy increases as GDP increases, by roughly 2.5 years for every additional $10,000 of GDP per capita.

I also look at the R² value to understand how much of the variation in life expectancy can be explained by GDP alone. Here R² is about 0.43, so GDP by itself explains roughly 43% of the variation in life expectancy across countries.
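
To make these interpretations concrete, I can pull the estimated coefficients and predict life expectancy for a hypothetical GDP value; the $10,000 figure below is just an illustration, not a value from the dataset.

# Estimated intercept and slope
coef(simple_model)

# Predicted life expectancy for a hypothetical country with GDP of $10,000 per capita
predict(simple_model, newdata = data.frame(GDP = 10000))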

Step 3: Multiple Linear Regression — LifeExpectancy ~ GDP + Health + Internet

# Fit multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = allcountries)

# View the model summary
summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = allcountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Next, I fit a multiple regression model predicting LifeExpectancy using three predictors:

GDP (economic strength)

Health (government spending on healthcare)

Internet (percentage of population with internet access)

I interpret each coefficient while holding the other variables constant.

For example, the coefficient for Health (about 0.25) tells me how much life expectancy changes when healthcare spending increases by one unit, while controlling for GDP and Internet use. It is also worth noting that GDP is no longer statistically significant (p ≈ 0.30) once Health and Internet are in the model.

I also compare the adjusted R² from the multiple regression to the simple model. Here it rises from about 0.43 to 0.72, which means the added predictors help explain life expectancy considerably better.
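
A quick way to make this comparison explicit is to pull the adjusted R² values straight from the two model objects; this is a small sketch that assumes both models fitted above are still in memory.

# Compare adjusted R-squared for the simple and multiple models
summary(simple_model)$adj.r.squared     # GDP only
summary(multiple_model)$adj.r.squared   # GDP + Health + Internet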

Step 4: Checking Assumptions — Homoscedasticity and Normality

# Refit the simple model using only the two variables needed for the diagnostics
simple_data <- allcountries[, c("LifeExpectancy", "GDP")]
simple_model <- lm(LifeExpectancy ~ GDP, data = simple_data)

# Extract residuals and fitted values

simple_resid  <- resid(simple_model)
simple_fitted <- fitted(simple_model)

# 1) Residuals vs Fitted (Homoscedasticity)

plot(simple_fitted, simple_resid,
     xlab = "Fitted values (Predicted LifeExpectancy)",
     ylab = "Residuals",
     main = "Residuals vs Fitted: Simple Model")
abline(h = 0, lty = 2)

# 2) Normal Q-Q Plot (Normality)

qqnorm(simple_resid, main = "Normal Q-Q Plot: Simple Model Residuals")
qqline(simple_resid, col = 1, lwd = 2)

I check the regression assumptions for the simple model (LifeExpectancy ~ GDP).

To check homoscedasticity, I look at the residuals vs. fitted values plot. The ideal outcome is a random, evenly spread cloud of points. A violation appears as funnel shapes or patterns, which indicates non-constant variance.

To check normality, I look at the Q–Q plot of residuals. Ideally, the points fall along the diagonal line. If the points curve or deviate heavily at the ends, the residuals are not normally distributed.

After running these checks in R, I reflect on whether the results matched the ideal outcome.
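
If I wanted to supplement the plots with formal tests, one option is the Breusch-Pagan test for constant variance and the Shapiro-Wilk test for normality; this is an optional sketch that assumes the lmtest package is installed.

# Optional formal checks for the simple model
library(lmtest)

# Breusch-Pagan test: a small p-value suggests non-constant variance
bptest(simple_model)

# Shapiro-Wilk test on the residuals: a small p-value suggests non-normality
shapiro.test(simple_resid)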

Step 5: Diagnosing Model Fit — RMSE and Residuals

# Build multiple regression dataset (Country + predictors + response)
mult_data <- allcountries[, c("Country", "LifeExpectancy", "GDP", "Health", "Internet")]
mult_data <- na.omit(mult_data)

# Fit multiple linear regression: LifeExpectancy ~ GDP + Health + Internet
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet,
                     data = mult_data)

summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = mult_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16
# Get residuals from the model
mult_resid <- resid(multiple_model)

# Add residuals back to the data frame
mult_data_with_resid <- mult_data
mult_data_with_resid$resid <- mult_resid

# Sort by largest absolute residual
order_index <- order(-abs(mult_data_with_resid$resid))
mult_data_with_resid <- mult_data_with_resid[order_index, ]

# Show the 10 countries with the largest absolute residuals
head(mult_data_with_resid[, c("Country", "LifeExpectancy", "resid")], 10)
##           Country LifeExpectancy      resid
## 48  Cote d'Ivoire           54.1 -14.566175
## 112       Lesotho           54.6 -12.884749
## 145       Nigeria           53.9 -11.741767
## 171  Sierra Leone           52.2 -11.365464
## 64       Eswatini           58.3 -10.419868
## 178  South Africa           63.4  -9.827573
## 16     Bangladesh           72.8   9.416106
## 114         Libya           72.1   8.699673
## 39           Chad           53.2  -8.597045
## 97          Italy           83.2   8.302370

For the multiple regression model, I calculate the RMSE (root mean square error), as sketched below. RMSE tells me roughly how far, in years, my predictions tend to fall from the actual life expectancy values. A lower RMSE means the model predicts more accurately.
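
A minimal sketch of that calculation, using the residuals from multiple_model:

# RMSE: square root of the mean squared residual, in years of life expectancy
rmse <- sqrt(mean(resid(multiple_model)^2))
rmse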

I also review the residuals to check whether certain countries have unusually large errors.

If I see extreme outliers, meaning countries whose actual life expectancy is far above or below what the model predicts, I may need to investigate further, since these points can pull on the fitted coefficients and reduce confidence in the predictions.
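
One way to follow up on large residuals is to check whether those countries are also influential points; the sketch below uses Cook's distance with the common (but arbitrary) 4/n rule-of-thumb cutoff.

# Cook's distance measures each observation's influence on the fitted coefficients
cooks_d <- cooks.distance(multiple_model)
cutoff  <- 4 / nrow(mult_data)

# Countries whose influence exceeds the rule-of-thumb cutoff
mult_data$Country[cooks_d > cutoff]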

Step 6: Hypothetical Example — Multicollinearity

# Check correlation between Energy and Electricity

cor(allcountries[, c("Energy", "Electricity")], use = "complete.obs")
##                Energy Electricity
## Energy      1.0000000   0.7970054
## Electricity 0.7970054   1.0000000

To understand multicollinearity, I imagine a regression model predicting CO₂ emissions using:

Energy (kilotons of oil equivalent)

Electricity (kWh per capita)

If Energy and Electricity are highly correlated, then multicollinearity becomes a problem.

This means:

Coefficients may become unstable or flip signs.

Standard errors will increase.

Individual p-values may look non-significant even when the predictors jointly matter.

Interpretation becomes difficult because the predictors overlap too much.

Even though the model might still have a strong R², the reliability of each individual coefficient decreases. In this situation, I would consider removing one of the variables or combining them.
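
To check this in practice, I could fit the hypothetical CO2 model and compute variance inflation factors (VIFs); this sketch assumes the car package is installed.

# Hypothetical model: CO2 emissions predicted from Energy and Electricity
library(car)

co2_model <- lm(CO2 ~ Energy + Electricity, data = allcountries)

# With two predictors, VIF = 1 / (1 - r^2); a correlation of about 0.80
# implies VIFs near 2.7 for both Energy and Electricity
vif(co2_model)

A VIF well above about 5 (some use 10) is a common warning sign; values around 2.7, as implied by the correlation shown above, suggest noticeable but not extreme overlap between these two predictors.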