library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
allco <- read_csv("AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
simple_model_allco <- lm(LifeExpectancy ~ GDP, data = allco)
summary(simple_model_allco)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = allco)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
The intercept (6.842e+01, about 68.42 years) is the predicted life expectancy when GDP is 0. Mathematically, this is the y-intercept of the fitted line.
The coefficient for GDP (2.476e-04) means that each 1-unit increase in GDP is associated with an increase of about 2.476e-04 years in predicted life expectancy.
The p-values for both the intercept and GDP are far below 0.05 (< 2e-16), indicating statistical significance.
R² (0.4304; adjusted 0.4272): GDP explains about 43% of the variance in life expectancy.
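Because the per-unit GDP coefficient is so small, rescaling it can make the interpretation easier to read. The sketch below reuses the estimates printed by `summary()` above. The variable names and the GDP value of 10,000 are hypothetical, chosen only for illustration.

```r
# Estimates copied from the summary(simple_model_allco) output above
intercept <- 6.842e+01
slope_gdp <- 2.476e-04

# Change in predicted life expectancy for a 1,000-unit increase in GDP
change_per_1000 <- slope_gdp * 1000
change_per_1000

# Predicted life expectancy at a hypothetical GDP of 10,000
pred_at_10000 <- intercept + slope_gdp * 10000
pred_at_10000
```

With the fitted model object available, `predict(simple_model_allco, newdata = data.frame(GDP = 10000))` should give the same kind of prediction from the unrounded coefficients.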
multiple_model_allco <- lm(LifeExpectancy ~ GDP + Health + Internet, data = allco)
summary(multiple_model_allco)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = allco)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Intercept (5.908e+01): predicted life expectancy is about 59.08 years when GDP, Health, and Internet are all 0.
Coefficients (slopes), each holding the other two predictors constant:
GDP (2.367e-05): +1 unit of GDP = +2.367e-05 years of life expectancy. p-value (0.302025): not significant.
Health (2.479e-01): +1 unit of Health = +2.479e-01 years of life expectancy. p-value (0.000247): significant.
Internet (1.903e-01): +1 unit of Internet = +1.903e-01 years of life expectancy. p-value (< 2e-16): significant.
P-values: Health and Internet are significant (< 0.05); GDP is not.
Adjusted R²: 0.7164. About 72% of the variance in life expectancy is explained by this model, up from about 43% for the simple model.
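As a check on the comparison above, adjusted R² can be recomputed from the reported R², the number of observations, and the number of predictors. This is a minimal sketch: `adj_r2()` is a hypothetical helper, not a function used in the original analysis, and the inputs are the rounded values from the two `summary()` outputs.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simple model: 177 residual df + 2 estimated coefficients = 179 observations
adj_simple <- adj_r2(r2 = 0.4304, n = 179, p = 1)

# Multiple model: 169 residual df + 4 estimated coefficients = 173 observations
adj_multiple <- adj_r2(r2 = 0.7213, n = 173, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The two models were fit on slightly different sets of countries (179 versus 173 rows after dropping missing values), so the comparison of the two values is approximate rather than exact.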
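Homoscedasticity: In the Residuals vs Fitted plot, the spread of the residuals should be constant across fitted values.
Violation: standard errors, and therefore p-values and confidence intervals, may not be accurate.
Normality: In the Q-Q Residuals plot, the residuals should follow the reference line, meaning they are approximately normally distributed.
Violation: p-values and confidence intervals may not be accurate, especially with small samples.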
par(mfrow=c(2,2)); plot(simple_model_allco); par(mfrow=c(1,1))
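Homoscedasticity: The Residuals vs Fitted plot shows a curved shape, which is a violation. A curve also suggests that the relationship between GDP and life expectancy is not strictly linear.
Normality: The points are mostly on the line in the middle with some deviation in the tails, so the residuals are not perfectly normal but still relatively normal.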
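The visual checks above can be complemented with rough numeric checks. The sketch below is one possible approach, not part of the original analysis. It uses simulated stand-in data so that it runs on its own, and the names `fit`, `res`, and `spread_ratio` are hypothetical. Replacing `fit` with `simple_model_allco` would apply the same checks to the model above.

```r
# Simulated stand-in data (not the AllCountries data): linear trend plus normal noise
set.seed(42)
x <- runif(150, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(150, sd = 1.5)
fit <- lm(y ~ x)

res <- resid(fit)

# Normality: Shapiro-Wilk test; a small p-value suggests non-normal residuals
sw <- shapiro.test(res)
sw$p.value

# Constant variance: compare residual spread for low vs. high fitted values;
# a ratio far from 1 hints at heteroscedasticity
low_half <- fitted(fit) <= median(fitted(fit))
spread_ratio <- sd(res[!low_half]) / sd(res[low_half])
spread_ratio
```

These numbers are only a supplement; the diagnostic plots remain the primary tool, since a curved pattern like the one noted above would not be caught by either check.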
residuals_multiple_allco <- resid(multiple_model_allco)
rmse_multiple_allco <- sqrt(mean(residuals_multiple_allco^2))
rmse_multiple_allco
## [1] 4.056417
RMSE = 4.056417 for the multiple regression, meaning predictions of life expectancy based on GDP, Health, and Internet are typically off by about 4.1 years. Large residuals for certain countries could be outliers and might reduce confidence in the model’s predictions.
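The RMSE computed this way (4.056) is slightly smaller than the residual standard error in the summary output (4.104) because the RMSE divides the squared error by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below illustrates the relationship on the built-in `mtcars` data, used here only as a stand-in so the example runs on its own.

```r
# Stand-in model on the built-in mtcars data (not the AllCountries data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)          # observations used in the fit
df <- df.residual(fit)  # n minus the number of estimated coefficients

rmse <- sqrt(mean(resid(fit)^2))  # divides the squared error by n
rse <- sigma(fit)                 # residual standard error, divides by df

# The two differ only by the factor sqrt(df / n)
all.equal(rmse, rse * sqrt(df / n))
```

Applied to the values reported above, `4.104 * sqrt(169 / 173)` is approximately 4.056, which agrees with the RMSE up to rounding.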
In the future, it might be interesting to look into how density and hunger affect the model.
Using Energy and Electricity together as predictors of CO2 emissions reduces the reliability of the model if they are highly correlated (multicollinearity), because it becomes difficult to separate how each one affects CO2 emissions. Coefficient standard errors can be inflated, and individual p-values can look nonsignificant even though the model as a whole fits well.
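The effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the variables `energy`, `electricity`, and `co2` are simulated stand-ins, not the AllCountries columns, and the variance inflation factor is computed by hand as 1 / (1 - R²) rather than with an add-on package.

```r
# Simulated stand-in data (hypothetical; not the AllCountries columns)
set.seed(1)
n <- 100
energy <- rnorm(n)
electricity <- energy + rnorm(n, sd = 0.1)  # nearly a copy of energy
co2 <- 2 * energy + rnorm(n)

# Variance inflation factor for energy: 1 / (1 - R^2) from regressing it
# on the other predictor
r2_energy <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)
vif_energy

# Standard error of the energy coefficient with and without electricity
se_both <- summary(lm(co2 ~ energy + electricity))$coefficients["energy", "Std. Error"]
se_alone <- summary(lm(co2 ~ energy))$coefficients["energy", "Std. Error"]
c(both = se_both, alone = se_alone)
```

In this simulation the standard error of the `energy` coefficient is much larger when `electricity` is also in the model, even though the overall fit is essentially unchanged. A large variance inflation factor (a common rule of thumb is above 5 or 10) is a sign of the same problem.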