1.) Upload CSV

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
all_countries <- read_csv("~/Downloads/AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(all_countries)
## # A tibble: 6 × 26
##   Country Code  LandArea Population Density   GDP Rural   CO2 PumpPrice Military
##   <chr>   <chr>    <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 Afghan… AFG     653.       37.2      56.9   521  74.5  0.29      0.7      3.72
## 2 Albania ALB      27.4       2.87    105.   5254  39.7  1.98      1.36     4.08
## 3 Algeria DZA    2382.       42.2      17.7  4279  27.4  3.74      0.28    13.8 
## 4 Americ… ASM       0.2       0.055   277.     NA  12.8 NA        NA       NA   
## 5 Andorra AND       0.47      0.077   164.  42030  11.9  5.83     NA       NA   
## 6 Angola  AGO    1247.       30.8      24.7  3432  34.5  1.29      0.97     9.4 
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## #   Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## #   DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## #   Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>
str(all_countries)
## spc_tbl_ [217 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country       : chr [1:217] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr [1:217] "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num [1:217] 652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num [1:217] 37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num [1:217] 56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : num [1:217] 521 5254 4279 NA 42030 ...
##  $ Rural         : num [1:217] 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num [1:217] 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num [1:217] 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num [1:217] 3.72 4.08 13.81 NA NA ...
##  $ Health        : num [1:217] 2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : num [1:217] 323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num [1:217] 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num [1:217] 67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num [1:217] NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num [1:217] 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num [1:217] 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num [1:217] 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num [1:217] 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num [1:217] 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num [1:217] 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num [1:217] 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num [1:217] 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : num [1:217] NA 808 1328 NA NA ...
##  $ Electricity   : num [1:217] NA 2309 1363 NA NA ...
##  $ Developed     : num [1:217] NA 1 1 NA NA 1 NA 2 1 NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Code = col_character(),
##   ..   LandArea = col_double(),
##   ..   Population = col_double(),
##   ..   Density = col_double(),
##   ..   GDP = col_double(),
##   ..   Rural = col_double(),
##   ..   CO2 = col_double(),
##   ..   PumpPrice = col_double(),
##   ..   Military = col_double(),
##   ..   Health = col_double(),
##   ..   ArmedForces = col_double(),
##   ..   Internet = col_double(),
##   ..   Cell = col_double(),
##   ..   HIV = col_double(),
##   ..   Hunger = col_double(),
##   ..   Diabetes = col_double(),
##   ..   BirthRate = col_double(),
##   ..   DeathRate = col_double(),
##   ..   ElderlyPop = col_double(),
##   ..   LifeExpectancy = col_double(),
##   ..   FemaleLabor = col_double(),
##   ..   Unemployment = col_double(),
##   ..   Energy = col_double(),
##   ..   Electricity = col_double(),
##   ..   Developed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

2.) Simple Linear Regression (Fitting and Interpretation): LifeExpectacy based on GDP

simple_model <- lm(LifeExpectancy ~ GDP, data = all_countries)

summary(simple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = all_countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation - The intercept (around 68.4) represents the predicted life expectancy when GDP is 0. This is not practically meaningful, but mathematically it’s the y-intercept.

3.) Multiple Linear Regression (Fitting and Interpretation): LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors.

multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = all_countries)

summary(multiple_model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = all_countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation The estimated coefficient for Health is about 0.247, this means if Health spending goes up by 1 percentage point, life expectancy increases by about 0.25 years, on average. assuming GDP and Internet stay the same.

4.) Checking Assumptions (Homoscedasticity and Normality) Visual linearly check

plot(all_countries$GDP, all_countries$LifeExpectancy,
     xlab = "GDP (US dollars per capita)", ylab = "Life Expectancy (years)", main = "Life Expectancy vs GDP")
abline(simple_model, col = 1, lwd = 2)

Clear positive trend: Life expectancy tends to increase as GDP rises, and the points cluster around the regression line, so we can assume linearity is reasonable.

par(mfrow = c(2, 2)); plot(simple_model); par(mfrow = c(1, 1))

Residuals vs Fitted: The residuals curve and have bigger spread at low fitted values.

Scale–Location: The points are more spread out for some fitted values than others, and the red line isn’t flat which suggests the errors are not equally spread out everywhere

Q–Q Plot: Tails deviate. This means that residuals are not perfectly normal, but not disastrous. So normality is checked.

Residuals vs Leverage: Most points have low leverage and a few are higher but none seem to hinder the results

To check homoscedasticity, I look at the Residuals vs Fitted plot and the Scale–Location plot. There should be a constant variance of residuals.

To check normality, I look at the Q–Q plot. The points should be normally distributed, following the straight line with only small deviations at the ends/tails

5.) Diagnosing Model Fit (RMSE and Residuals)

# Calculate residuals for simple model (optional, to compare)
residuals_simple <- resid(simple_model)

# Calculate RMSE for simple model
rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 5.868172
# For multiple model

# Calculate residuals
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417

Simple model: RMSE ≈ 5.87 years, meaning the predictions miss by ~5.87 years on average.

Multiple model: RMSE ≈ 4.06 years, so predictions miss by ~4.06 years on average. This means that the error drops by 1.81 years

Overall, the multiple model is better. It includes key factors (GDP, Health spending, and Internet access) that matter for real-world life expectancy, and it both raises R² (from about 0.43 to 0.72) and reduces RMSE (from about 5.87 to 4.06 years).

6.) Hypothetical Example (Multicollinearity in Multiple Regression

cor(all_countries[, c("Energy", "Electricity")], use = "complete.obs")
##                Energy Electricity
## Energy      1.0000000   0.7970054
## Electricity 0.7970054   1.0000000

The multicollinearity might affect the interpretation of the coefficients and the reliability of the model because Energy and Electricity move together so much, so the model can’t tell which one is causing the change in CO2.