HW 9

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1. Uploading the Dataset

AllCountries <- read.csv("AllCountries.csv")

# Quick check

str(AllCountries)

## 'data.frame':    217 obs. of  26 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num  652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num  37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num  56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : int  521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
##  $ Rural         : num  74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num  0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num  0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num  3.72 4.08 13.81 NA NA ...
##  $ Health        : num  2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : int  323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num  11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num  67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num  NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num  30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num  9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num  32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num  6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num  2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num  64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num  50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num  1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : int  NA 808 1328 NA NA 545 NA 2030 1016 NA ...
##  $ Electricity   : int  NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
##  $ Developed     : int  NA 1 1 NA NA 1 NA 2 1 NA ...

summary(AllCountries$LifeExpectancy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   52.20   66.90   74.30   72.46   77.70   84.70      18

summary(AllCountries$GDP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     275    2032    5950   14733   17298  114340      30

2. Simple Linear Regression (Fitting and Interpretation)

# Simple linear regression
mod_simple <- lm(LifeExpectancy ~ GDP, data = AllCountries)

summary(mod_simple)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

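The intercept is 68.42 years, the predicted life expectancy for a country with GDP per capita equal to $0. No country in the data is close to that value (the minimum GDP is $275), so the intercept is better read as the baseline of the fitted line than as a realistic prediction.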

The slope for GDP is estimated at 0.0002476. This means that for each additional 1,000 US dollars of GDP per capita, the model predicts life expectancy to increase by about 0.25 years, or roughly 3 months.
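The conversion from the raw slope to a per-$1,000 effect is simple arithmetic. A minimal sketch, using the slope copied from the output above (the variable names are only illustrative):

```r
# Slope copied from the summary(mod_simple) output above (years per dollar)
slope_per_dollar <- 2.476e-04

# Predicted change in life expectancy for a $1,000 increase in GDP per capita
effect_years <- slope_per_dollar * 1000

# The same effect expressed in months
effect_months <- effect_years * 12

round(c(years = effect_years, months = effect_months), 2)
```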

The R² value is 0.4304, which means that about 43% of the variation in life expectancy across countries is explained by differences in GDP per capita in this model. The higher the R², the better GDP alone explains variation in life expectancy; a lower R² means there is still a lot of variation that GDP cannot account for.
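To make the "share of variation explained" reading concrete, R² can be rebuilt from sums of squares. The sketch below uses the built-in mtcars data as a stand-in (an assumption made so the example runs without AllCountries.csv); the same steps should apply to mod_simple.

```r
# mtcars is a built-in stand-in here, not the AllCountries data
fit <- lm(mpg ~ wt, data = mtcars)

sse <- sum(resid(fit)^2)                       # variation left unexplained
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in the response

# R-squared is the share of total variation the model accounts for
r2_manual <- 1 - sse / sst
r2_summary <- summary(fit)$r.squared

all.equal(r2_manual, r2_summary)
```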

3. Multiple Linear Regression (Fitting and Interpretation)

mod_multi <- lm(LifeExpectancy ~ GDP + Health + Internet, 
                data = AllCountries)

summary(mod_multi)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Health Estimate = 0.2479

Holding GDP and Internet constant, a one-percentage-point increase in healthcare spending is associated with an increase in life expectancy of about 0.248 years on average. This coefficient is positive and statistically significant (p = 0.000247), meaning Health is an important predictor of life expectancy after accounting for GDP and Internet usage.

Internet Estimate = 0.1903

Holding GDP and Health constant, a one-percentage-point increase in the share of the population using the Internet is associated with an increase of about 0.19 years in life expectancy. This relationship is also strongly statistically significant (p < 2e-16), suggesting that countries with wider Internet access tend to have higher life expectancies, even after adjusting for income and healthcare spending.

GDP Estimate = 0.00002367

The GDP coefficient is positive but not statistically significant (p = 0.302). After controlling for Health and Internet access, GDP does not appear to explain additional variation in life expectancy. A likely reason is that GDP is correlated with Internet access and Health spending, so once those predictors are included, GDP has little unique explanatory power left.
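The way a predictor can lose significance once a correlated predictor enters the model can be illustrated with simulated data. In this sketch the variables gdp, internet, and life are made up for illustration (they are not the AllCountries columns), and life is generated so that only internet affects it directly.

```r
set.seed(1)  # simulated data, not the AllCountries file
n <- 200
internet <- runif(n, 0, 100)
gdp <- 200 * internet + rnorm(n, sd = 3000)     # GDP tracks Internet use
life <- 60 + 0.2 * internet + rnorm(n, sd = 3)  # GDP has no direct effect

fit_simple <- lm(life ~ gdp)
fit_multi <- lm(life ~ gdp + internet)

cor(gdp, internet)        # the two predictors are strongly correlated
coef(fit_simple)["gdp"]   # GDP looks important on its own
coef(fit_multi)["gdp"]    # and shrinks toward zero once internet is included
```

On the real data, a call such as `cor(AllCountries[, c("GDP", "Health", "Internet")], use = "complete.obs")` would show how strongly the three predictors are actually correlated.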

Simple Model (GDP only)

R² = 0.4304
Adjusted R² = 0.4272

Multiple Model (GDP + Health + Internet)

R² = 0.7213
Adjusted R² = 0.7164

The adjusted R² rises from 0.4272 to 0.7164, meaning the multiple regression explains about 71.6% of the variation in life expectancy, compared to only 42.7% in the simple regression. Because adjusted R² penalizes each added predictor, this increase indicates that Health and Internet add real explanatory power rather than simply inflating R².
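One caveat on this comparison: the two summaries report different numbers of deleted observations (38 versus 44), so the models were fit to slightly different sets of countries. A possible way to compare them on identical rows is sketched below, using the built-in airquality data as a stand-in because it also has missing values; the object names are illustrative.

```r
# airquality is a built-in dataset with missing values, used as a stand-in
fit_small <- lm(Ozone ~ Temp, data = airquality)
fit_big <- lm(Ozone ~ Temp + Solar.R, data = airquality)

# lm() drops incomplete rows per model, so the two samples differ
c(small = nobs(fit_small), big = nobs(fit_big))

# Refit the smaller model on exactly the rows the larger model used
common <- model.frame(fit_big)
fit_small_common <- lm(Ozone ~ Temp, data = common)

# Adjusted R-squared values that are now based on the same observations
c(small = summary(fit_small_common)$adj.r.squared,
  big = summary(fit_big)$adj.r.squared)
```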

4. Checking Assumptions (Homoscedasticity and Normality)

# Residuals vs Fitted (homoscedasticity)
plot(mod_simple, which = 1)

# Normal Q-Q plot (normality)
plot(mod_simple, which = 2)

Homoscedasticity

The residuals-versus-fitted plot points to two problems. The residuals follow a curved, non-random pattern, which indicates that the relationship between GDP and life expectancy is not purely linear, and their spread is uneven across the fitted values, which violates the homoscedasticity assumption. The simple linear model therefore does not fully capture the pattern in the data, and its predictions may be less reliable at very low or very high GDP levels.
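A visual judgment about unequal spread can be backed up with a numeric check. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R, on simulated data whose error spread grows with the predictor. The data and names are assumptions for illustration, and this is one possible check rather than part of the assignment.

```r
set.seed(42)  # simulated data with error spread that grows with x
x <- runif(300, 1, 10)
y <- 2 + 3 * x + rnorm(300, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
e2 <- resid(fit)^2
aux <- lm(e2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared against a chi-squared distribution with 1 df (one predictor)
bp_stat <- nobs(fit) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_p)
```

A small p-value is evidence against constant variance. The same steps could be applied to mod_simple, with GDP in place of x.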

Normality of residuals

The normality assumption is moderately violated, especially in the tails. While the center of the distribution looks reasonably normal, the residuals show heavy tails, meaning extreme residuals are more common than would be expected under a normal distribution. This can affect the accuracy of hypothesis tests and confidence intervals in the simple model.
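The Q-Q plot can be supplemented with a formal test. A minimal sketch using shapiro.test() from base R on simulated data with heavy-tailed errors follows. The data are made up for illustration, and applying the same call to resid(mod_simple) is a suggestion, not something the original analysis reports.

```r
set.seed(3)  # simulated data with heavy-tailed errors
x <- runif(200, 0, 10)
y <- 1 + 2 * x + rt(200, df = 2)  # t-distributed errors with 2 df

fit <- lm(y ~ x)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
sw$p.value
```

A small p-value suggests the residuals depart from normality. With large samples the test can flag even minor departures, so it is best read alongside the plot.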

Reflection

Based on the diagnostic plots, the simple regression model does not fully meet the assumptions of linearity, homoscedasticity, and normality. The residuals show clear curvature and unequal variance, suggesting the relationship between GDP and life expectancy is more complex than a straight-line model captures. These issues indicate that the simple model may not be the best predictor, which is consistent with the much stronger performance of the multiple regression model in Question 3.

5. Diagnosing Model Fit (RMSE and Residuals)

# Predicted values
pred_multi <- predict(mod_multi)

# Residuals
res_multi <- resid(mod_multi)

# RMSE
rmse_multi <- sqrt(mean(res_multi^2))
rmse_multi

## [1] 4.056417

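An RMSE of 4.056417 means that the multiple regression model’s predictions of life expectancy typically miss the observed values by about 4.06 years. Because the errors are squared before averaging, large misses count more heavily than they would in a simple average of absolute errors.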

This RMSE indicates a reasonably strong in-sample fit. A typical error of about 4 years means the model is not perfect, but it performs substantially better than the simple regression (residual standard error of 4.104 versus 5.901) and, with an R² of about 0.72, captures most of the variation in life expectancy.
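The RMSE (4.056) is slightly smaller than the residual standard error in the summary output (4.104) because the RMSE divides the sum of squared residuals by n, while the residual standard error divides by the residual degrees of freedom. The sketch below converts one into the other using only numbers copied from the outputs above. The helper function name is illustrative.

```r
# Values copied from the two summary() outputs above
rse_multi <- 4.104;  df_multi <- 169;  p_multi <- 4   # intercept + 3 slopes
rse_simple <- 5.901; df_simple <- 177; p_simple <- 2  # intercept + 1 slope

# RSE = sqrt(SSE / df) and RMSE = sqrt(SSE / n), with n = df + p,
# so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- function(rse, df, p) rse * sqrt(df / (df + p))

rmse_multi_check <- rmse_from_rse(rse_multi, df_multi, p_multi)
rmse_simple_check <- rmse_from_rse(rse_simple, df_simple, p_simple)

round(c(simple = rmse_simple_check, multi = rmse_multi_check), 3)
```

The value recovered for the multiple model agrees with rmse_multi up to rounding in the printed standard error, and the simple model's RMSE works out to roughly 5.87 years.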

6. Hypothetical Example (Multicollinearity in Multiple Regression)

In this scenario, Energy and Electricity are highly correlated measures of national energy use. When two predictors are strongly related, the model experiences multicollinearity, which makes it difficult for the regression to determine the unique contribution of each variable to CO₂ emissions.

As a result, the estimated coefficients for Energy and Electricity can become unstable and unreliable. Even if CO₂ emissions are strongly related to overall energy consumption, the model may produce:

Large standard errors
Weak or insignificant t-values
Coefficients with unexpected signs (e.g., one positive and one negative)
Coefficients that change dramatically when small changes are made to the dataset

The overall model might still have a high R² and predict CO₂ emissions well, but the individual coefficients cannot be interpreted confidently because the model cannot separate the effects of Energy and Electricity.

To diagnose multicollinearity, we could calculate Variance Inflation Factors (VIFs) to measure how severe the issue is. To address it, we could remove one of the two variables or combine them into a single energy-use index.
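A VIF can be computed in base R without extra packages: regress each predictor on the other predictors and take 1 / (1 - R²). The sketch below uses simulated stand-ins for Energy and Electricity. The values and names are assumptions for illustration, not the AllCountries columns.

```r
set.seed(7)  # simulated stand-ins for the Energy and Electricity columns
energy <- rnorm(150, mean = 2000, sd = 800)
electricity <- 1.5 * energy + rnorm(150, sd = 200)  # nearly a rescaled copy

# VIF for a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on the remaining predictors
r2_energy <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

vif_energy
```

With only two predictors, both share the same VIF. Values above roughly 5 or 10 are commonly treated as a sign of problematic multicollinearity, though those cutoffs are rules of thumb rather than strict thresholds.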