library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
allco <- read_csv("AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Simple Linear Regression (Fitting and Interpretation)

simple_model_allco <- lm(LifeExpectancy ~ GDP, data = allco)

summary(simple_model_allco)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = allco)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation:

The intercept (6.842e+01, about 68.4 years) is the predicted life expectancy when GDP is 0. Mathematically, it is the y-intercept of the fitted line.

The GDP coefficient (2.476e-04) means that each 1-unit increase in GDP is associated with an increase of about 2.476e-04 years in predicted life expectancy, or about 2.5 years per 10,000 units of GDP.

The p-values for both the intercept and the GDP slope are below 2e-16, far below 0.05, indicating statistical significance.

R² is 0.4304 (adjusted R² 0.4272), so GDP explains about 43% of the variance in life expectancy.
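The quantities read off the summary above can also be pulled out in code. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in so it runs without AllCountries.csv, and `slope_per_step` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: slope rescaled to a `step`-unit change in the predictor.
slope_per_step <- function(model, term, step) {
  unname(coef(model)[term]) * step
}

# Stand-in data so the block runs on its own (built-in mtcars, not AllCountries).
demo_model <- lm(mpg ~ wt, data = mtcars)

slope_per_step(demo_model, "wt", 1)      # change in mpg per 1 unit of wt
slope_per_step(demo_model, "wt", 0.5)    # change in mpg per 0.5 units of wt
confint(demo_model, "wt", level = 0.95)  # 95% confidence interval for the slope
summary(demo_model)$r.squared            # multiple R-squared
summary(demo_model)$adj.r.squared        # adjusted R-squared
```

In this report, the same helper could be called as slope_per_step(simple_model_allco, "GDP", 10000), which rescales the tiny per-unit slope to a change per 10,000 units of GDP (2.476e-04 x 10000, about 2.5 years).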

Multiple Linear Regression (Fitting and Interpretation)

multiple_model_allco <- lm(LifeExpectancy ~ GDP + Health + Internet, data = allco)

summary(multiple_model_allco)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = allco)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Intercept (5.908e+01): the predicted life expectancy when GDP, Health, and Internet are all 0 is about 59.1 years.

Coefficients (slope):

GDP (2.367e-05): +1 GDP = +2.367e-05 years of life expectancy, holding Health and Internet constant. The p-value (0.302025) is not significant.

Health (2.479e-01): +1 unit of Health = +2.479e-01 years of life expectancy, holding GDP and Internet constant. The p-value (0.000247) is significant.

Internet (1.903e-01): +1 unit of Internet = +1.903e-01 years of life expectancy, holding GDP and Health constant. The p-value (< 2e-16) is significant.

P-values: Health and Internet are significant (< 0.05). GDP is not (0.302) once Health and Internet are in the model.

Adjusted R²: about 0.7164. This means about 72% of the variance in life expectancy is explained by this model, up from about 43% in the simple model.
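One caveat on this comparison: the two summaries report different amounts of missing data (38 rows dropped for the simple model, 44 for the multiple model), so the models were fit on 179 and 173 countries respectively. A minimal sketch of a like-for-like comparison is to keep only rows that are complete for every variable and refit both models on that subset. The example uses the built-in airquality data as a stand-in because it also contains missing values; the object names are illustrative.

```r
# Stand-in data with missing values (built-in airquality, not AllCountries).
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
complete_rows <- airquality[complete.cases(airquality[, vars]), vars]

small_model <- lm(Ozone ~ Temp, data = complete_rows)
large_model <- lm(Ozone ~ Temp + Wind + Solar.R, data = complete_rows)

# Both models now use the same observations, so the comparison is like for like.
nobs(small_model) == nobs(large_model)
summary(small_model)$adj.r.squared
summary(large_model)$adj.r.squared

# F-test for whether the added predictors improve the fit (models must be nested).
anova(small_model, large_model)
```

For this report the equivalent step would be filtering allco to rows with LifeExpectancy, GDP, Health, and Internet all present before fitting both models.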

Checking Assumptions (Homoscedasticity and Normality)

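Homoscedasticity: in the Residuals vs Fitted plot, the spread of the residuals should be constant across the fitted values.

Violation: standard errors, p-values, and confidence intervals might not be accurate.

Normality: in the Q-Q Residuals plot, the residuals should be normally distributed (points close to the reference line).

Violation: p-values and confidence intervals might not be accurate, especially with a small sample.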

par(mfrow=c(2,2)); plot(simple_model_allco); par(mfrow=c(1,1))

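Homoscedasticity: the Residuals vs Fitted plot shows a curved shape, which is a violation. A curve in this plot also suggests the relationship between GDP and life expectancy is not linear.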

Normality: mostly normal in the middle with some deviation in the tails, so the residuals are not perfectly normal but still relatively normal.
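The visual checks can be backed up with a numeric one, and a curved residual pattern is often handled by transforming the predictor. The sketch below is an assumption-laden illustration on simulated data, not a result for AllCountries: the response is generated from log(x), so a straight line in x shows the kind of curvature described above, and refitting on log(x) is one possible remedy to try.

```r
set.seed(42)

# Simulated stand-in: y depends on log(x), so a straight line in x fits poorly.
x <- exp(runif(200, min = 5, max = 11))
y <- 40 + 3 * log(x) + rnorm(200, sd = 2)
sim <- data.frame(x, y)

linear_fit <- lm(y ~ x, data = sim)
log_fit <- lm(y ~ log(x), data = sim)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals.
shapiro.test(resid(linear_fit))$p.value
shapiro.test(resid(log_fit))$p.value

# Share of variance explained by each fit.
summary(linear_fit)$r.squared
summary(log_fit)$r.squared

# Residuals vs Fitted for each fit (which = 1 selects that panel).
par(mfrow = c(1, 2))
plot(linear_fit, which = 1)
plot(log_fit, which = 1)
par(mfrow = c(1, 1))
```

Whether a log transform of GDP actually helps here would need to be checked on the real data, for example by fitting lm(LifeExpectancy ~ log(GDP), data = allco) and repeating the diagnostic plots.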

Diagnosing Model Fit

residuals_multiple_allco <- resid(multiple_model_allco)

rmse_multiple_allco <- sqrt(mean(residuals_multiple_allco^2))
rmse_multiple_allco
## [1] 4.056417

RMSE = 4.056417 for the multiple regression, meaning predictions of life expectancy based on GDP, Health, and Internet typically miss by about 4.1 years. Large residuals for certain countries could be outliers and might reduce confidence in the model's predictions.
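The RMSE here (4.056) is slightly smaller than the residual standard error in the summary (4.104) because they divide by different things: RMSE divides the sum of squared residuals by n, while the residual standard error divides by the residual degrees of freedom (169 here). With 173 observations, 4.056417 x sqrt(173 / 169) is about 4.104, which matches the summary. A minimal sketch of both calculations, using mtcars as a stand-in and two hypothetical helper names:

```r
# Hypothetical helpers; both start from the same sum of squared residuals.
rmse <- function(model) sqrt(mean(resid(model)^2))
residual_se <- function(model) sqrt(sum(resid(model)^2) / df.residual(model))

# Stand-in model with three predictors (built-in mtcars, not AllCountries).
demo_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

rmse(demo_model)         # divides by n
residual_se(demo_model)  # divides by n minus the number of coefficients
sigma(demo_model)        # built-in equivalent of residual_se()
```

In this report, rmse(multiple_model_allco) would reproduce the value above and sigma(multiple_model_allco) the residual standard error.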

In the future it might be interesting to look into how density and hunger might impact the model.

Hypothetical Example (Multicollinearity in Multiple Regression)

Using both Energy and Electricity as predictors for CO2 emissions reduces the reliability of a model if they are highly correlated, because it becomes difficult to know how each one separately impacts CO2 emissions. Coefficient standard errors might be inflated, and individual p-values can look nonsignificant even though the model as a whole fits well.
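A small simulation can make this concrete. Everything below is made up for illustration (the variable values, the 0.9 relationship, and the helper name `vif_by_hand` are assumptions, not AllCountries data). It builds two nearly identical predictors, computes the variance inflation factor from its definition, 1 / (1 - R²) where R² comes from regressing one predictor on the others, and compares the standard error of the energy coefficient with and without the second predictor.

```r
set.seed(1)
n <- 200

# Simulated predictors: electricity is nearly a rescaled copy of energy.
energy <- rnorm(n, mean = 100, sd = 20)
electricity <- 0.9 * energy + rnorm(n, sd = 3)
co2 <- 2 + 0.05 * energy + 0.05 * electricity + rnorm(n, sd = 1)
sim <- data.frame(co2, energy, electricity)

cor(sim$energy, sim$electricity)  # correlation between the two predictors

# VIF for one predictor: regress it on the other predictors, then 1 / (1 - R^2).
vif_by_hand <- function(predictor, others, data) {
  fit <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(fit)$r.squared)
}
vif_by_hand("energy", "electricity", sim)

# Standard error of the energy slope with and without the correlated predictor.
both <- lm(co2 ~ energy + electricity, data = sim)
alone <- lm(co2 ~ energy, data = sim)
summary(both)$coefficients["energy", "Std. Error"]
summary(alone)$coefficients["energy", "Std. Error"]
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity. Dropping one of the two predictors or combining them are typical responses.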