library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/tonge/Downloads")
countries <- read_csv("AllCountries.csv")
## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Simple linear regression: y = LifeExpectancy, x = GDP
simple_model <- lm(LifeExpectancy ~ GDP, data = countries)
# View the model summary
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Interpretation
The intercept (about 68.42) is the predicted life expectancy when GDP is 0. This is not practically meaningful, but mathematically it is the y-intercept.
The coefficient for GDP (about 0.00025) means that for every 1-unit increase in GDP, predicted life expectancy increases by about 0.00025 years, or roughly 0.25 years per 1,000 units of GDP.
Both p-values are < 0.05, indicating statistical significance.
R² (about 0.430; adjusted 0.427) means GDP alone explains about 43% of the variance in life expectancy. That is decent, but there is room for improvement.
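Because the GDP slope is so small per single unit, it can be easier to read after rescaling the predictor. The sketch below uses simulated data (the `demo` data frame and its values are made up for illustration, not taken from AllCountries.csv) to show that dividing GDP by 1,000 multiplies the slope by 1,000 and leaves R² unchanged.

```r
# Simulated stand-in for the countries data; the numbers are made up for illustration.
set.seed(1)
demo <- data.frame(GDP = runif(100, min = 500, max = 60000))
demo$LifeExpectancy <- 68 + 0.00025 * demo$GDP + rnorm(100, sd = 5)

fit_raw <- lm(LifeExpectancy ~ GDP, data = demo)            # slope per 1 unit of GDP
fit_k   <- lm(LifeExpectancy ~ I(GDP / 1000), data = demo)  # slope per 1,000 units of GDP

slope_raw <- unname(coef(fit_raw)[2])
slope_k   <- unname(coef(fit_k)[2])

# The rescaled slope is 1,000 times the raw slope; the fit itself is unchanged.
c(raw = slope_raw, per_1000 = slope_k)
c(r2_raw = summary(fit_raw)$r.squared, r2_k = summary(fit_k)$r.squared)
```

The same idea should carry over to the real model by writing `I(GDP / 1000)` in the formula, assuming GDP is stored as a plain numeric column.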
multiple_model <- lm(LifeExpectancy ~ Health + Internet + GDP, data = countries)
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ Health + Internet + GDP, data = countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpretation
Ideally, the observations (and their errors) would be independent of one another. The predictors also should not be highly correlated with one another. Additionally, there needs to be a linear relationship between each predictor and LifeExpectancy. Next, the residuals should be spread evenly across the fitted values, showing that the prediction errors for low and high life expectancies are evenly distributed. Lastly, the residuals should follow an approximately normal distribution.
Visual Linearity Check
plot(countries$GDP, countries$LifeExpectancy,
xlab="GDP", ylab="LifeExpectancy", main="Life Expectancy vs GDP")
abline(simple_model, col=1, lwd=2)
Positive trend: life expectancy increases as GDP increases. However, many
points lie far from the regression line, so a straight line does not
describe the relationship very well.
Independence
plot(resid(simple_model), type="b", main="Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)
Residuals vs Order: There are a lot of deviations from the zero line. Therefore, independence might be violated in this simple model. However, this may improve after adding variables to the model, such as Health.
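The visual check above can be paired with a number. A minimal sketch, using simulated data rather than the countries data, computes the Durbin-Watson statistic by hand in base R (the variables `x`, `y`, and `fit` are made up for illustration). Note that this statistic, like the Residuals vs Order plot, only makes sense when the row order carries meaning; for a table listed alphabetically it says little about independence.

```r
# Simulated data for illustration only (not the countries data).
set.seed(2)
x <- seq_len(80)
y <- 3 + 0.5 * x + rnorm(80)
fit <- lm(y ~ x)

e <- resid(fit)

# Durbin-Watson statistic: values near 2 suggest little first-order
# autocorrelation; values near 0 or 4 suggest positive or negative
# autocorrelation between neighbouring residuals.
dw <- sum(diff(e)^2) / sum(e^2)
dw
```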
Core diagnostics (covers: linearity, homoscedasticity, normality, influence)
par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))
Interpretations
-Residuals vs Fitted: The trend line is not straight and has a downward curve. Therefore, linearity does not look too good. The residuals fan out towards lower fitted values, meaning that lower predicted life expectancies are more likely to have large prediction errors. The residuals are not evenly scattered around zero, showing that the homoscedasticity assumption might not be met.
-Scale–Location: At the start, near the lower fitted values, the residuals fan out a lot. There is some heteroscedasticity, meaning the variance is larger for low Life Expectancy predictions.
-Q–Q plot: The points follow the line with a slight curve, but the tails deviate, so normality is only approximately met.
-Residuals vs Leverage: None of the high-leverage points are highly influential.
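The plot readings above can be backed up with numbers. The sketch below, on simulated data (the `fit` model here is a stand-in, not `simple_model`), runs a Shapiro-Wilk test on the residuals and flags observations by Cook's distance. The `4 / n` cutoff is a common rule of thumb, not a strict threshold.

```r
# Simulated data for illustration only (not the countries data).
set.seed(3)
x <- runif(60, min = 0, max = 10)
y <- 2 + 1.5 * x + rnorm(60)
fit <- lm(y ~ x)

# Normality: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(resid(fit))
sw$p.value

# Influence: one Cook's distance per observation used in the fit.
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / length(cd))  # rule-of-thumb cutoff, an assumption
flagged
```

Applied to the real model, the same calls would take `simple_model` in place of `fit`.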
Calculating the RMSE
# Calculate residuals
residuals_multiple <- resid(multiple_model)
# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
Multiple model: RMSE = 4.06 years, meaning predictions miss life expectancy by about 4.06 years on average. Because RMSE squares the residuals, a few large misses weigh heavily, so the model may be predicting some countries poorly. To investigate further, I would check the assumptions to see if a lack of independence or multicollinearity is affecting the model. Alternatively, I could do backward elimination and see if the RMSE improves.
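The RMSE (4.056) is slightly smaller than the residual standard error in the summary (4.104) because the RMSE divides by the number of rows used, while the residual standard error divides by the residual degrees of freedom. A quick arithmetic check with the values printed above (the variable names are only for illustration):

```r
# Values copied from the summary(multiple_model) output in this document.
sigma_hat <- 4.104     # residual standard error
df_resid  <- 169       # residual degrees of freedom
n_used    <- 217 - 44  # rows in the file minus rows dropped for missingness

# RMSE = sigma_hat * sqrt(df / n), since both are built from the same
# residual sum of squares with different denominators.
rmse_from_sigma <- sigma_hat * sqrt(df_resid / n_used)
rmse_from_sigma
```

The result agrees with `rmse_multiple` up to the rounding of the printed residual standard error.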
Multicollinearity is when predictors are highly correlated with each other. Their information overlaps, so the model cannot separate their individual effects, which can lead to unstable estimates and inaccurate predictions. In this scenario, the model cannot accurately predict CO2 emissions. The predictors' coefficients (which state their effect on CO2 emissions) could be wrong and lead to untrustworthy conclusions.
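One common way to check for multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below does this by hand in base R on simulated predictors (`x1`, `x2`, `x3` are made up; `x2` is built to be correlated with `x1`). Values near 1 suggest little overlap; thresholds such as 5 or 10 are rules of thumb rather than strict limits.

```r
# Simulated predictors for illustration only.
set.seed(4)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                 # unrelated to the others
preds <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the remaining predictors.
vif_manual <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

For the model in this report, the same loop could be run on the Health, Internet, and GDP columns after dropping rows with missing values.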