library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(ggplot2)
library(Metrics) # For RMSE
#1. Load Dataset
# Read AllCountries.csv from the working directory
AllCountries <- read_csv("AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(AllCountries)
## # A tibble: 6 × 26
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice Military
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… AFG 653. 37.2 56.9 521 74.5 0.29 0.7 3.72
## 2 Albania ALB 27.4 2.87 105. 5254 39.7 1.98 1.36 4.08
## 3 Algeria DZA 2382. 42.2 17.7 4279 27.4 3.74 0.28 13.8
## 4 Americ… ASM 0.2 0.055 277. NA 12.8 NA NA NA
## 5 Andorra AND 0.47 0.077 164. 42030 11.9 5.83 NA NA
## 6 Angola AGO 1247. 30.8 24.7 3432 34.5 1.29 0.97 9.4
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## # Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## # DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## # Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>
#2. Simple Linear Regression: LifeExpectancy ~ GDP
Fit the model:
# Ensure numeric (read_csv already parsed both columns as <dbl>, so this is only a safeguard)
AllCountries$GDP <- as.numeric(AllCountries$GDP)
AllCountries$LifeExpectancy <- as.numeric(AllCountries$LifeExpectancy)
# Fit simple linear regression
slr_model <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(slr_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
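Interpretation: Intercept: The predicted LifeExpectancy when GDP = 0 is about 68.4 years. No country has a GDP of zero, so this is an extrapolation rather than a meaningful prediction.
Slope: Each additional $1 of GDP per capita is associated with an expected increase of 0.0002476 years in LifeExpectancy, or roughly 0.25 years per additional $1,000. The slope is significantly different from zero (p < 2e-16).
R²: 0.4304, so GDP alone explains about 43% of the variation in LifeExpectancy across the 179 countries with complete data. A higher R² would mean GDP explains more of that variation.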
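The per-dollar slope is hard to read because GDP is measured in single dollars. A minimal sketch, using simulated data rather than AllCountries.csv (the data frame `toy` and its generating values are made up for illustration), suggests that rescaling the predictor to thousands of dollars multiplies the slope by 1,000 and leaves R² unchanged:

```r
# Simulated stand-in for the real data (values are assumptions, not AllCountries)
set.seed(42)
toy <- data.frame(GDP = runif(150, min = 500, max = 60000))
toy$LifeExpectancy <- 68 + 0.00025 * toy$GDP + rnorm(150, sd = 5)

# Same model, with the predictor in dollars vs. thousands of dollars
fit_dollars   <- lm(LifeExpectancy ~ GDP, data = toy)
fit_thousands <- lm(LifeExpectancy ~ I(GDP / 1000), data = toy)

slope_per_dollar   <- coef(fit_dollars)[["GDP"]]
slope_per_thousand <- coef(fit_thousands)[[2]]

# Rescaling changes the unit of the slope, not the quality of the fit
r2_dollars   <- summary(fit_dollars)$r.squared
r2_thousands <- summary(fit_thousands)$r.squared

print(c(per_dollar = slope_per_dollar, per_thousand = slope_per_thousand))
print(c(r2_dollars = r2_dollars, r2_thousands = r2_thousands))
```

Reporting the slope per $1,000 is purely a presentation choice; t values and p-values are the same in both fits.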
#3. Multiple Linear Regression: LifeExpectancy ~ GDP + Health + Internet
# Fit multiple linear regression
mlr_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(mlr_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
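Interpretation: Health coefficient: Holding GDP and Internet constant, a one-unit increase in Health (one percentage point of government spending on health) is associated with an expected increase of about 0.25 years in LifeExpectancy (estimate 0.2479, p = 0.000247). GDP is no longer significant once Health and Internet are in the model (p = 0.30).
Adjusted R²: It rises from 0.4272 in the simple regression to 0.7164 here, so Health and Internet improve the fit even after the penalty for extra predictors. The two models are fitted to slightly different samples (179 vs. 173 countries) because of missing values, so the comparison is approximate.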
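Because `lm()` silently drops rows with missing values, two models with different predictors can end up fitted to different sets of countries. A hedged sketch on simulated data (the data frame `toy` and the injected NAs are assumptions for illustration) refits both models on the same complete cases before comparing adjusted R²:

```r
set.seed(7)
n <- 200
toy <- data.frame(
  GDP      = runif(n, 500, 60000),
  Health   = runif(n, 2, 25),
  Internet = runif(n, 5, 95)
)
toy$LifeExpectancy <- 60 + 0.00002 * toy$GDP + 0.25 * toy$Health +
  0.19 * toy$Internet + rnorm(n, sd = 4)

# Inject missing values so the two models would otherwise use different rows
toy$Health[1:10]    <- NA
toy$Internet[11:15] <- NA

# Keep only rows that are complete for every variable in the larger model
vars <- c("LifeExpectancy", "GDP", "Health", "Internet")
complete <- toy[complete.cases(toy[, vars]), ]

fit_small <- lm(LifeExpectancy ~ GDP, data = complete)
fit_big   <- lm(LifeExpectancy ~ GDP + Health + Internet, data = complete)

comparison <- data.frame(
  model  = c("GDP only", "GDP + Health + Internet"),
  n      = c(nobs(fit_small), nobs(fit_big)),
  r2     = c(summary(fit_small)$r.squared, summary(fit_big)$r.squared),
  adj_r2 = c(summary(fit_small)$adj.r.squared, summary(fit_big)$adj.r.squared)
)
print(comparison)
```

With both models on the same rows, any difference in adjusted R² reflects the added predictors rather than a change in sample.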
#4. Checking Assumptions (Homoscedasticity & Normality)
Homoscedasticity: Residuals vs Fitted Plot
plot(slr_model, which = 1)
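Ideal outcome: Residuals scattered randomly around zero with roughly constant spread across the fitted values.
Violation: A funnel shape indicates heteroscedasticity, which makes standard errors, p-values, and confidence intervals unreliable; a curved trend suggests the straight-line form does not fit the relationship.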
Normality: Q-Q Plot
plot(slr_model, which = 2)
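Ideal outcome: Standardized residuals fall close to the diagonal line, consistent with a normal distribution.
Violation: An S-shaped curve or points drifting off the line at the tails indicate skewed or heavy-tailed residuals; p-values and confidence intervals may be less accurate, especially in small samples.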
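The two plots can be complemented with rough numeric checks. The sketch below uses simulated data with a deliberately growing error spread (an assumption for illustration, not the AllCountries data). The correlation between absolute residuals and fitted values is an informal heuristic rather than a formal test, while `shapiro.test()` is the base R Shapiro-Wilk test of normality:

```r
set.seed(123)
x <- runif(200, min = 1, max = 10)
# Error standard deviation grows with x, which produces a funnel shape
y <- 2 + 3 * x + rnorm(200, sd = x)
fit <- lm(y ~ x)

# Informal spread check: do absolute residuals grow with the fitted values?
spread_cor <- cor(abs(residuals(fit)), fitted(fit))

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
normality <- shapiro.test(residuals(fit))

print(spread_cor)
print(normality$p.value)
```

A clearly positive `spread_cor` points in the same direction as a funnel in the Residuals vs Fitted plot; neither check replaces looking at the plots themselves.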
#5. Diagnosing Model Fit (RMSE & Residuals) for Multiple Regression
# Keep only the rows used to fit the model (complete cases for all model variables)
mlr_data <- AllCountries %>% drop_na(LifeExpectancy, GDP, Health, Internet)
# Predicted values for those same rows, so their length matches the observed values
pred <- predict(mlr_model, newdata = mlr_data)
# Residuals
res <- residuals(mlr_model)
# RMSE (in years of LifeExpectancy)
rmse_value <- rmse(mlr_data$LifeExpectancy, pred)
rmse_value
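RMSE: The typical size of a prediction error in years of LifeExpectancy (the square root of the mean squared residual). Lower RMSE = better fit. When computed on the rows used to fit the model, it sits slightly below the residual standard error (4.104 here), because it divides by n rather than by the residual degrees of freedom.
Large residuals: Countries the model misses by many years (the largest residual here is about -14.6 years) deserve a closer look; they may be outliers, data errors, or a sign that an important predictor is missing.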
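RMSE can also be computed directly from the model residuals, which avoids any length mismatch between predictions and observed values. A minimal sketch on simulated data (the data frame `toy` is hypothetical) also shows how RMSE relates to the residual standard error and how to list the largest residuals:

```r
set.seed(99)
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 70 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n, sd = 4)
fit <- lm(y ~ x1 + x2, data = toy)

res <- residuals(fit)

# RMSE: square root of the mean squared residual (divides by n)
rmse_manual <- sqrt(mean(res^2))

# The residual standard error divides by the residual degrees of freedom instead
rse <- sigma(fit)
rmse_from_rse <- rse * sqrt(df.residual(fit) / nobs(fit))

# Rows with the largest absolute residuals, worth inspecting individually
worst <- head(order(abs(res), decreasing = TRUE), 5)

print(c(rmse = rmse_manual, rse = rse))
print(toy[worst, ])
```

Working from `residuals()` keeps the calculation tied to exactly the rows `lm()` used, whatever rows were dropped for missingness.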
#6. Hypothetical Example: Multicollinearity
Suppose we predict CO2 ~ Energy + Electricity:
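Problem: Energy and Electricity are highly correlated, so they carry largely the same information.
Effect: Standard errors inflate and coefficient estimates become unstable (they can change size or even sign when a few rows change), so it is difficult to determine the individual effect of each predictor.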
Solution: Consider removing one predictor, combining them, or using techniques like Principal Component Regression.
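To make the hypothetical concrete, the sketch below simulates two strongly correlated predictors (the names `energy` and `electricity` and all values are made up, not taken from AllCountries.csv) and computes a variance inflation factor by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the other. A common rule of thumb treats values above about 5 to 10 as a warning sign:

```r
set.seed(2024)
n <- 200
energy      <- rnorm(n, mean = 100, sd = 20)
electricity <- energy + rnorm(n, sd = 2)   # nearly a copy of energy
co2         <- 1 + 0.03 * energy + 0.02 * electricity + rnorm(n, sd = 1)

predictor_cor <- cor(energy, electricity)

# VIF for energy: regress it on the other predictor
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Standard error of the energy coefficient with both predictors vs. energy alone
se_both  <- summary(lm(co2 ~ energy + electricity))$coefficients["energy", "Std. Error"]
se_alone <- summary(lm(co2 ~ energy))$coefficients["energy", "Std. Error"]

print(c(cor = predictor_cor, vif = vif_energy))
print(c(se_both = se_both, se_alone = se_alone))
```

The jump in standard error when both predictors are included is the instability described above; dropping or combining one of them is what brings it back down.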