Load Required Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(ggplot2)
library(Metrics)  # For RMSE

#1. Load Dataset

# Read AllCountries.csv from the working directory

AllCountries <- read_csv("AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(AllCountries)
## # A tibble: 6 × 26
##   Country Code  LandArea Population Density   GDP Rural   CO2 PumpPrice Military
##   <chr>   <chr>    <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 Afghan… AFG     653.       37.2      56.9   521  74.5  0.29      0.7      3.72
## 2 Albania ALB      27.4       2.87    105.   5254  39.7  1.98      1.36     4.08
## 3 Algeria DZA    2382.       42.2      17.7  4279  27.4  3.74      0.28    13.8 
## 4 Americ… ASM       0.2       0.055   277.     NA  12.8 NA        NA       NA   
## 5 Andorra AND       0.47      0.077   164.  42030  11.9  5.83     NA       NA   
## 6 Angola  AGO    1247.       30.8      24.7  3432  34.5  1.29      0.97     9.4 
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## #   Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## #   DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## #   Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>

#2. Simple Linear Regression: LifeExpectancy ~ GDP

Fit the model:

# Ensure numeric

AllCountries$GDP <- as.numeric(AllCountries$GDP)
AllCountries$LifeExpectancy <- as.numeric(AllCountries$LifeExpectancy)

# Fit simple linear regression

slr_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(slr_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

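Interpretation:

Intercept: The predicted LifeExpectancy when GDP = 0, about 68.4 years. No country has a GDP of zero, so this is an extrapolation rather than a meaningful prediction.

Slope: The expected change in LifeExpectancy for a one-unit increase in GDP ($1 US per capita), about 0.00025 years. Equivalently, each additional $1,000 of GDP per capita is associated with roughly 0.25 more years of life expectancy.

R²: Proportion of variation in LifeExpectancy explained by GDP. Here R² = 0.4304, so GDP alone accounts for about 43% of the variation.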
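
A slope expressed per single dollar is hard to read. One option is to rescale the predictor before fitting. The sketch below uses simulated stand-in data (the data frame `sim` and the column `GDP_k` are hypothetical, not part of AllCountries) to show that dividing GDP by 1000 multiplies the slope by 1000 and leaves R² unchanged.

```r
# Simulated stand-in data (hypothetical values, not the AllCountries data)
set.seed(1)
sim <- data.frame(GDP = runif(100, 500, 60000))
sim$LifeExpectancy <- 68 + 0.00025 * sim$GDP + rnorm(100, sd = 5)

# Model with GDP in dollars
fit_dollars <- lm(LifeExpectancy ~ GDP, data = sim)

# Rescale GDP to thousands of dollars so the slope reads as "years per $1,000"
sim$GDP_k <- sim$GDP / 1000
fit_thousands <- lm(LifeExpectancy ~ GDP_k, data = sim)

slope_dollars   <- unname(coef(fit_dollars)["GDP"])
slope_thousands <- unname(coef(fit_thousands)["GDP_k"])

# The slopes differ only by the factor of 1000; R-squared is identical
c(per_dollar = slope_dollars, per_thousand = slope_thousands)
c(r2_dollars   = summary(fit_dollars)$r.squared,
  r2_thousands = summary(fit_thousands)$r.squared)
```

Rescaling changes only the units of the coefficient, not the fit, so it is purely a readability choice.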

#3. Multiple Linear Regression: LifeExpectancy ~ GDP + Health + Internet

# Fit multiple linear regression

mlr_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(mlr_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

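Interpretation:

Health coefficient: Holding GDP and Internet constant, a one-unit increase in Health (one percentage point, if Health is recorded as the percentage of government spending that goes to health care) is associated with an increase of about 0.25 years (0.2479) in expected LifeExpectancy.

Adjusted R²: 0.7164 here versus 0.4272 for the simple regression, so adding Health and Internet improves the fit even after penalizing for the extra predictors. The two models were fit on slightly different rows (173 versus 179 complete cases), so the comparison is approximate.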
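
Because missing values differ by column, the simple and multiple models can end up fit on different rows. A minimal sketch of a like-for-like comparison, using simulated stand-in data (the object names `sim`, `complete`, `fit_small`, and `fit_large` are hypothetical): restrict both models to the same complete cases, then compare adjusted R² and run a nested F-test.

```r
# Simulated stand-in data with some missing values (hypothetical, not AllCountries)
set.seed(2)
n <- 120
sim <- data.frame(GDP      = runif(n, 500, 60000),
                  Health   = runif(n, 2, 25),
                  Internet = runif(n, 5, 95))
sim$LifeExpectancy <- 60 + 0.00005 * sim$GDP + 0.25 * sim$Health +
  0.19 * sim$Internet + rnorm(n, sd = 4)
sim$Health[sample(n, 8)] <- NA
sim$GDP[sample(n, 5)]    <- NA

# Keep only rows that are complete for every variable used by either model
vars     <- c("LifeExpectancy", "GDP", "Health", "Internet")
complete <- sim[complete.cases(sim[, vars]), ]

# Fit both models on the same rows
fit_small <- lm(LifeExpectancy ~ GDP, data = complete)
fit_large <- lm(LifeExpectancy ~ GDP + Health + Internet, data = complete)

# Adjusted R-squared side by side
adj_r2 <- c(simple   = summary(fit_small)$adj.r.squared,
            multiple = summary(fit_large)$adj.r.squared)
adj_r2

# Nested F-test: do Health and Internet jointly improve on GDP alone?
anova(fit_small, fit_large)
```

`anova()` requires both models to use the same observations, which is another reason to subset first.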

#4. Checking Assumptions (Homoscedasticity & Normality)

Homoscedasticity: Residuals vs Fitted Plot

plot(slr_model, which = 1)

Ideal outcome: Residuals scattered randomly around zero with constant spread.

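Violation: A funnel shape indicates heteroscedasticity (non-constant variance), which makes standard errors, p-values, and prediction intervals unreliable. A curved pattern suggests the linear form itself is misspecified.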

Normality: Q-Q Plot

plot(slr_model, which = 2)

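Ideal outcome: Points lie close to the diagonal reference line, consistent with normally distributed residuals.

Violation: An S-shape, or points peeling away from the line at the ends, indicates heavy tails or skew; p-values and confidence intervals may be less accurate, especially in small samples.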
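
The plots can be supplemented with numeric checks. The sketch below, on simulated data (all object names are hypothetical), runs a Shapiro-Wilk test on the residuals and a hand-computed studentized Breusch-Pagan statistic: regress the squared residuals on the predictor and compare n × R² from that auxiliary regression to a chi-squared distribution. This is one common form of the test, not the only one, and it is no substitute for looking at the plots.

```r
# Simulated data with constant-variance, normal errors by construction
set.seed(3)
x <- runif(150, 0, 10)
y <- 2 + 0.5 * x + rnorm(150, sd = 1)
fit <- lm(y ~ x)
res <- residuals(fit)

# Normality: Shapiro-Wilk test (null hypothesis: residuals are normal)
sw <- shapiro.test(res)

# Homoscedasticity: studentized Breusch-Pagan statistic computed by hand.
# Regress squared residuals on the predictor; n * R^2 is approximately
# chi-squared with df = number of predictors (1 here) under constant variance.
sq_res  <- res^2
aux     <- lm(sq_res ~ x)
bp_stat <- length(res) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
```

Small p-values point toward a violation; large ones mean only that the test found no evidence of one.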

#5. Diagnosing Model Fit (RMSE & Residuals) for Multiple Regression

# Keep only the rows the model was fit on (complete cases for all four variables)

mlr_data <- AllCountries %>% drop_na(LifeExpectancy, GDP, Health, Internet)

# Predicted values for those same rows, so they line up with the observed values
# (predicting on AllCountries would return 217 values, some NA, against 173 observed)

pred <- predict(mlr_model, newdata = mlr_data)

# Residuals

res <- residuals(mlr_model)

# RMSE

rmse_value <- rmse(mlr_data$LifeExpectancy, pred)
rmse_value

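RMSE: The typical size of a prediction error, in years of LifeExpectancy (the square root of the mean squared residual). Lower RMSE = better fit, when models are compared on the same data.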

Large residuals: Countries the model predicts poorly. A few very large residuals can inflate RMSE and reduce confidence in the model; check them for data errors, outliers, or missing predictors.
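
RMSE can also be computed directly from the residuals, and it is tied to the residual standard error reported by `summary()`: RMSE = RSE × sqrt(df / n), where df is the residual degrees of freedom and n the number of rows used. Applied to the multiple regression summary above (RSE 4.104, 169 df, 173 rows), that works out to roughly 4.06 years. The sketch below checks the identity on simulated data (all names are hypothetical) and lists the rows with the largest absolute residuals.

```r
# Simulated stand-in data (hypothetical, not AllCountries)
set.seed(4)
n <- 80
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
res <- residuals(fit)

# RMSE straight from the residuals
rmse_manual <- sqrt(mean(res^2))

# The same quantity recovered from the residual standard error
rse           <- sigma(fit)
rmse_from_rse <- rse * sqrt(df.residual(fit) / n)

c(rmse_manual = rmse_manual, rmse_from_rse = rmse_from_rse, rse = rse)

# Rows with the three largest absolute residuals, for closer inspection
worst <- head(order(abs(res), decreasing = TRUE), 3)
sim[worst, ]
```

RMSE is always slightly smaller than the residual standard error because it divides by n rather than by the residual degrees of freedom.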

#6. Hypothetical Example: Multicollinearity

Suppose we predict CO2 ~ Energy + Electricity:

Problem: Energy and Electricity are highly correlated.

Effect: Coefficient estimates become unstable; it’s difficult to determine the individual effect of each predictor.

Solution: Consider removing one predictor, combining them, or using techniques like Principal Component Regression.
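
A quick way to quantify the problem is the variance inflation factor (VIF), 1 / (1 − R²), where R² comes from regressing one predictor on the others. The sketch below uses simulated data in which Electricity is built to track Energy closely (the values are hypothetical and say nothing about the real AllCountries columns). It computes the VIF by hand and shows how the standard error of the Energy coefficient grows when Electricity is added.

```r
# Simulated, deliberately collinear predictors (hypothetical values)
set.seed(5)
n <- 200
Energy      <- rnorm(n, mean = 100, sd = 20)
Electricity <- 0.9 * Energy + rnorm(n, sd = 4)  # nearly a linear function of Energy
CO2         <- 1 + 0.03 * Energy + 0.02 * Electricity + rnorm(n)

# Correlation between the two predictors
predictor_cor <- cor(Energy, Electricity)

# VIF by hand: regress one predictor on the other, then 1 / (1 - R^2)
r2_aux <- summary(lm(Energy ~ Electricity))$r.squared
vif    <- 1 / (1 - r2_aux)

# Standard error of the Energy coefficient with and without Electricity
fit_both <- lm(CO2 ~ Energy + Electricity)
fit_one  <- lm(CO2 ~ Energy)
se_both  <- summary(fit_both)$coefficients["Energy", "Std. Error"]
se_one   <- summary(fit_one)$coefficients["Energy", "Std. Error"]

c(correlation = predictor_cor, vif = vif, se_both = se_both, se_one = se_one)
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic collinearity, though the threshold is a convention rather than a strict cutoff.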