1.) Upload CSV
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
all_countries <- read_csv("~/Downloads/AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(all_countries)
## # A tibble: 6 × 26
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice Military
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… AFG 653. 37.2 56.9 521 74.5 0.29 0.7 3.72
## 2 Albania ALB 27.4 2.87 105. 5254 39.7 1.98 1.36 4.08
## 3 Algeria DZA 2382. 42.2 17.7 4279 27.4 3.74 0.28 13.8
## 4 Americ… ASM 0.2 0.055 277. NA 12.8 NA NA NA
## 5 Andorra AND 0.47 0.077 164. 42030 11.9 5.83 NA NA
## 6 Angola AGO 1247. 30.8 24.7 3432 34.5 1.29 0.97 9.4
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## # Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## # DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## # Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>
str(all_countries)
## spc_tbl_ [217 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:217] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr [1:217] "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num [1:217] 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num [1:217] 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num [1:217] 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : num [1:217] 521 5254 4279 NA 42030 ...
## $ Rural : num [1:217] 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num [1:217] 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num [1:217] 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num [1:217] 3.72 4.08 13.81 NA NA ...
## $ Health : num [1:217] 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : num [1:217] 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num [1:217] 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num [1:217] 67.4 123.7 111 NA 104.4 ...
## $ HIV : num [1:217] NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num [1:217] 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num [1:217] 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num [1:217] 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num [1:217] 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num [1:217] 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num [1:217] 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num [1:217] 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num [1:217] 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : num [1:217] NA 808 1328 NA NA ...
## $ Electricity : num [1:217] NA 2309 1363 NA NA ...
## $ Developed : num [1:217] NA 1 1 NA NA 1 NA 2 1 NA ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. Code = col_character(),
## .. LandArea = col_double(),
## .. Population = col_double(),
## .. Density = col_double(),
## .. GDP = col_double(),
## .. Rural = col_double(),
## .. CO2 = col_double(),
## .. PumpPrice = col_double(),
## .. Military = col_double(),
## .. Health = col_double(),
## .. ArmedForces = col_double(),
## .. Internet = col_double(),
## .. Cell = col_double(),
## .. HIV = col_double(),
## .. Hunger = col_double(),
## .. Diabetes = col_double(),
## .. BirthRate = col_double(),
## .. DeathRate = col_double(),
## .. ElderlyPop = col_double(),
## .. LifeExpectancy = col_double(),
## .. FemaleLabor = col_double(),
## .. Unemployment = col_double(),
## .. Energy = col_double(),
## .. Electricity = col_double(),
## .. Developed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
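Several columns in the output above contain NA values, and lm() drops any row with a missing value in a model variable by default. A quick way to see how much data is missing is to count NA values per column. The sketch below uses a small made-up data frame (toy_countries is hypothetical, not the real file) so that it runs on its own; with the real data the same calls would be applied to all_countries.

```r
# Toy stand-in for all_countries (values are made up for illustration)
toy_countries <- data.frame(
  Country        = c("A", "B", "C", "D"),
  GDP            = c(521, NA, 4279, 42030),
  LifeExpectancy = c(64.0, 78.5, NA, 81.2)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_countries))
na_counts

# Rows that are complete for the variables a model would use
n_complete <- sum(complete.cases(toy_countries[, c("GDP", "LifeExpectancy")]))
n_complete
```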
2.) Simple Linear Regression (Fitting and Interpretation): LifeExpectancy based on GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = all_countries)
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Interpretation - The intercept (around 68.4) is the predicted life expectancy when GDP is 0. This is not practically meaningful, but mathematically it is the y-intercept.
The coefficient for GDP (around 0.00025) means that for every 1 dollar increase in GDP per capita, predicted life expectancy increases by about 0.00025 years on average.
Both p-values are < 0.05 (in fact < 2e-16), so the intercept and the slope are statistically significant.
R² (around 0.43) means that GDP alone explains about 43% of the variance in life expectancy: decent, but with room for improvement.
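Because a 1 dollar change is tiny, the slope is easier to read per 1,000 or 10,000 dollars. The arithmetic below simply rescales the estimate printed in the summary above (2.476e-04); the variable names are my own.

```r
# GDP slope reported by summary(simple_model), in years per dollar
slope_per_dollar <- 2.476e-04

# Same slope expressed per 1,000 and per 10,000 dollars of GDP per capita
slope_per_1k  <- slope_per_dollar * 1000
slope_per_10k <- slope_per_dollar * 10000

slope_per_1k   # about 0.25 years
slope_per_10k  # about 2.5 years
```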
3.) Multiple Linear Regression (Fitting and Interpretation): LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors.
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
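Interpretation - The estimated coefficient for Health is about 0.248. This means that if Health spending goes up by 1 percentage point, predicted life expectancy increases by about 0.25 years on average, assuming GDP and Internet stay the same. Note that GDP is no longer statistically significant (p ≈ 0.30) once Health and Internet are in the model.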
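To show the uncertainty around the Health estimate, a 95% confidence interval can be rebuilt by hand from the estimate, standard error, and residual degrees of freedom printed in the summary above. This is a sketch using the rounded printed values, so it may differ slightly from what confint(multiple_model) would return.

```r
# Values copied from the summary(multiple_model) output (rounded as printed)
est_health <- 0.2479   # Health coefficient
se_health  <- 0.06619  # its standard error
df_resid   <- 169      # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit    <- qt(0.975, df = df_resid)
ci_health <- est_health + c(-1, 1) * t_crit * se_health
ci_health
```

The interval comes out at roughly 0.12 to 0.38 years per percentage point, so it stays above zero, which is consistent with the small p-value for Health.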
4.) Checking Assumptions (Homoscedasticity and Normality)
Visual linearity check
plot(all_countries$GDP, all_countries$LifeExpectancy,
xlab = "GDP (US dollars per capita)", ylab = "Life Expectancy (years)", main = "Life Expectancy vs GDP")
abline(simple_model, col = 1, lwd = 2)
Clear positive trend: Life expectancy tends to increase as GDP rises,
and the points cluster around the regression line, so we can assume
linearity is reasonable.
par(mfrow = c(2, 2)); plot(simple_model); par(mfrow = c(1, 1))
Residuals vs Fitted: The residuals follow a curve and have a bigger spread at low fitted values.
Scale–Location: The points are more spread out for some fitted values than others, and the red line isn’t flat, which suggests the errors do not have constant variance.
Q–Q Plot: The tails deviate from the line. This means the residuals are not perfectly normal, but the departure is not disastrous, so I treat the normality assumption as approximately met.
Residuals vs Leverage: Most points have low leverage; a few are higher, but none seem to distort the results.
To check homoscedasticity, I look at the Residuals vs Fitted plot and the Scale–Location plot. The residuals should have constant variance across the fitted values.
To check normality, I look at the Q–Q plot. If the residuals are normally distributed, the points should follow the straight line, with only small deviations at the tails.
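The visual checks above can be backed up with simple numeric checks. The sketch below is not run on the country data; it uses simulated data (all names are hypothetical) in which the error spread grows with x. It then applies shapiro.test() to the residuals and a Breusch-Pagan-style auxiliary regression of the squared residuals on the fitted values, using base R only.

```r
set.seed(1)

# Simulated data whose error spread grows with x (heteroscedastic by design)
x <- runif(200, min = 1, max = 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 0.3 * x)
sim_model <- lm(y ~ x)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value is evidence against normal residuals)
shapiro_p <- shapiro.test(resid(sim_model))$p.value

# Constant variance: regress squared residuals on fitted values;
# a significant slope suggests the variance changes with the fitted value
sq_resid  <- resid(sim_model)^2
fit_vals  <- fitted(sim_model)
aux_model <- lm(sq_resid ~ fit_vals)
aux_p     <- summary(aux_model)$coefficients[2, 4]

c(shapiro = shapiro_p, aux_slope = aux_p)
```

On the real models the same calls would take simple_model or multiple_model in place of sim_model.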
5.) Diagnosing Model Fit (RMSE and Residuals)
# Calculate residuals for simple model (optional, to compare)
residuals_simple <- resid(simple_model)
# Calculate RMSE for simple model
rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 5.868172
# For multiple model
# Calculate residuals
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
Simple model: RMSE ≈ 5.87 years, meaning the predictions are typically off by about 5.87 years.
Multiple model: RMSE ≈ 4.06 years, so the predictions are typically off by about 4.06 years. The error drops by about 1.81 years.
Overall, the multiple model is better. It includes key factors (GDP, Health spending, and Internet access) that matter for real-world life expectancy, and it both raises R² (from about 0.43 to 0.72) and reduces RMSE (from about 5.87 to 4.06 years).
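The RMSE values are slightly smaller than the residual standard errors printed in the two summaries (5.901 and 4.104). That is expected: RMSE divides the residual sum of squares by the number of observations n, while the residual standard error divides by the residual degrees of freedom. The sketch below checks that relationship using the printed values; n is taken as the residual degrees of freedom plus the number of estimated coefficients, and the helper name rse_to_rmse is my own.

```r
# Convert a residual standard error into RMSE: RMSE = RSE * sqrt(df / n)
rse_to_rmse <- function(rse, df_resid, n_coef) {
  n <- df_resid + n_coef          # observations actually used in the fit
  rse * sqrt(df_resid / n)
}

rmse_simple_check   <- rse_to_rmse(5.901, df_resid = 177, n_coef = 2)
rmse_multiple_check <- rse_to_rmse(4.104, df_resid = 169, n_coef = 4)

c(simple = rmse_simple_check, multiple = rmse_multiple_check)
```

These agree with the values computed from the residuals (5.868 and 4.056) up to rounding. The two models are also fit on different numbers of rows (179 vs 173) because of missing values, so the comparison is not made on exactly the same set of countries.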
6.) Hypothetical Example (Multicollinearity in Multiple Regression)
cor(all_countries[, c("Energy", "Electricity")], use = "complete.obs")
## Energy Electricity
## Energy 1.0000000 0.7970054
## Electricity 0.7970054 1.0000000
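Because Energy and Electricity move together so strongly (r ≈ 0.80), multicollinearity could affect both the interpretation of the coefficients and the reliability of a model that uses them together to predict CO2. The model cannot cleanly tell which of the two is associated with the change in CO2, so their individual coefficients become unstable and their standard errors grow.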
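One common way to quantify this is the variance inflation factor (VIF). For a model with exactly two predictors, the VIF of each equals 1 / (1 - r²), where r is their correlation. The sketch below plugs in the correlation printed above; it assumes a hypothetical CO2 model with only Energy and Electricity as predictors, which is not fitted in this report.

```r
# Correlation between Energy and Electricity from the cor() output above
r_energy_elec <- 0.7970054

# With exactly two predictors, each predictor's VIF is 1 / (1 - r^2)
vif_two_predictors <- 1 / (1 - r_energy_elec^2)
vif_two_predictors   # about 2.74

# Standard errors are inflated by the square root of the VIF
se_inflation <- sqrt(vif_two_predictors)
se_inflation
```

A VIF near 2.7 means the variance of each coefficient is about 2.7 times what it would be with uncorrelated predictors, so the standard errors are roughly 1.66 times larger. That is noticeable, although it is below the rule-of-thumb cutoffs of 5 or 10 that are often quoted.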