Loading Data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/tonge/Downloads")
countries <- read_csv("AllCountries.csv")

## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Simple Linear Regression model

y = LifeExpectancy, x = GDP

simple_model <- lm(LifeExpectancy ~ GDP, data = countries)

# View the model summary
summary(simple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = countries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation

The intercept (about 68.42) is the predicted life expectancy when GDP is 0. This is not practically meaningful, but mathematically it is the y-intercept.
The coefficient for GDP (about 0.00025) means that for every one-unit increase in GDP, predicted life expectancy increases by about 0.00025 years.
Both p-values are < 0.05, indicating statistical significance.
R² (about 0.430; adjusted 0.427) means GDP alone explains about 43% of the variance in life expectancy. That is decent, but it leaves room for improvement.
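Because the GDP slope is so small per single unit, it can help to restate it over a larger step. The sketch below only reuses the rounded coefficients printed in the summary output above; the GDP value of 10,000 is an illustrative assumption, not a value taken from the data.

```r
# Coefficients copied (rounded) from the summary(simple_model) output above
intercept <- 68.42        # 6.842e+01
slope_gdp <- 2.476e-04    # change in life expectancy per 1-unit change in GDP

# Slope restated for a 1,000-unit change in GDP
slope_per_1000 <- slope_gdp * 1000
slope_per_1000

# Predicted life expectancy at a hypothetical GDP value (assumption for illustration)
gdp_example <- 10000
predicted <- intercept + slope_gdp * gdp_example
predicted
```

Read this way, a 1,000-unit increase in GDP corresponds to roughly a quarter of a year of additional predicted life expectancy.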

Multiple Linear Regression

multiple_model <- lm(LifeExpectancy ~ Health + Internet + GDP, data = countries)

summary(multiple_model)

## 
## Call:
## lm(formula = LifeExpectancy ~ Health + Internet + GDP, data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Interpretation

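The coefficient for Health (about 0.248) means that for every one-unit increase in Health, predicted life expectancy increases by about 0.248 years, holding Internet and GDP constant. Health has the largest raw coefficient, but the three predictors are measured on different scales, so coefficient size alone does not show which one has the greatest impact. Internet has the strongest statistical evidence (t = 11.49), and GDP is no longer significant (p = 0.302) once Health and Internet are in the model.
The adjusted R² is now 71.6%, up from 42.7% in the simple model. This shows that the added predictors (Internet and Health) substantially improve the prediction of life expectancy.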
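The adjusted R² values in both summaries can be checked by hand from the reported R², the residual degrees of freedom, and the number of predictors. This is a minimal sketch; `adjusted_r2` is a hypothetical helper name, and the inputs are the rounded values printed above. Note that the two models were fit on slightly different samples (179 versus 173 countries) because of missing values, so the comparison is approximate.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# n = observations used in the fit, p = number of predictors
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
adj_simple   <- adjusted_r2(r2 = 0.4304, n = 177 + 2, p = 1)
adj_multiple <- adjusted_r2(r2 = 0.7213, n = 169 + 4, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```

The results should agree with the 0.4272 and 0.7164 shown in the summaries, up to rounding of the inputs.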

Check assumptions for simple regression

Ideally, the observations would be independent of one another, so that one country's residual carries no information about another's. Because there is only one predictor, multicollinearity is not a concern for this model. There also needs to be a linear relationship between the two variables. Next, the residuals should be spread evenly across the fitted values, showing that prediction errors for low and high predicted life expectancies have similar variance. Lastly, the residuals should follow an approximately normal distribution.

Visual Linearity Check

plot(countries$GDP, countries$LifeExpectancy,
     xlab="GDP", ylab="LifeExpectancy", main="Life Expectancy vs GDP")
abline(simple_model, col=1, lwd=2)

There is a positive trend: life expectancy increases as GDP increases. However, many points lie far from the regression line, so a straight line describes the relationship only moderately well.

Independence

plot(resid(simple_model), type="b", main="Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

Residuals vs Order: The residuals deviate considerably from the zero line. If those deviations follow a pattern rather than scattering randomly, independence might be violated in this simple model. Adding further variables to the model, such as Health, may reduce this.
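A numeric companion to the residuals-versus-order plot is the Durbin-Watson statistic, which compares successive residuals. The sketch below computes it by hand on simulated stand-in data (an assumption; it does not use AllCountries.csv), and it is only meaningful when the row order itself carries information, such as time.

```r
set.seed(123)

# Simulated stand-in data with independent errors (hypothetical, not the countries data)
n <- 500
x <- runif(n, 0, 100)
y <- 60 + 0.2 * x + rnorm(n, sd = 4)
demo_model <- lm(y ~ x)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
e  <- resid(demo_model)
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

Values near 2 suggest little first-order correlation between neighbouring residuals; values toward 0 or 4 suggest positive or negative correlation.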

Core diagnostics (covers: linearity, homoscedasticity, normality, influence)

par(mfrow=c(2,2)); plot(simple_model); par(mfrow=c(1,1))

Interpretations

Residuals vs Fitted: The trend line is not straight and curves downward, so linearity does not look good. The residuals fan out at lower fitted values, meaning that lower predicted life expectancies are more likely to have large prediction errors. The residuals are not evenly scattered around zero, so the homoscedasticity assumption might not be met.

Scale–Location: At the lower fitted values the residuals fan out considerably. There is some heteroscedasticity: the variance is larger for low life expectancy predictions.

Q–Q plot: The points follow a slightly curved path and deviate from the line in the tails, so normality holds only approximately.

Residuals vs Leverage: None of the high-leverage points is especially influential.
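The visual checks of normality and influence can be backed up with numbers. This is a minimal sketch on simulated stand-in data (an assumption, not the countries data); the 4/n cutoff for Cook's distance is a common rule of thumb rather than a strict threshold.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not AllCountries.csv)
n <- 100
x <- runif(n, 0, 100)
y <- 60 + 0.2 * x + rnorm(n, sd = 4)
demo_model <- lm(y ~ x)

# Influence: Cook's distance for each observation
cooks     <- cooks.distance(demo_model)
threshold <- 4 / n                      # rule-of-thumb cutoff
flagged   <- which(cooks > threshold)   # indices worth a closer look
flagged

# Normality: Shapiro-Wilk test on the residuals
normality <- shapiro.test(resid(demo_model))
normality$p.value
```

The same calls could be applied to `simple_model` in place of `demo_model`; a small Shapiro-Wilk p-value would point to non-normal residuals.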

Diagnosing Model Fit

Calculating the RMSE

# Calculate residuals
residuals_multiple <- resid(multiple_model)

# Calculate RMSE for multiple model
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple

## [1] 4.056417

For the multiple model, RMSE = 4.06 years, meaning a typical prediction misses the observed life expectancy by about 4.06 years. The largest residuals are much bigger than this (the minimum is -14.57), so the model predicts some countries poorly. To investigate further, I would check the assumptions to see whether a violation of independence or multicollinearity is affecting the model. Alternatively, I could use backward elimination and see whether the RMSE improves.
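The RMSE of 4.056 is slightly smaller than the residual standard error of 4.104 in the model summary because the RMSE divides the squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below reproduces the RMSE from the rounded summary values only.

```r
# Values copied from the summary(multiple_model) output
rse      <- 4.104          # residual standard error
df_resid <- 169            # residual degrees of freedom
n_obs    <- df_resid + 4   # 4 estimated coefficients (intercept + 3 predictors)

# RSE  = sqrt(RSS / df_resid), RMSE = sqrt(RSS / n_obs)
# so RMSE = RSE * sqrt(df_resid / n_obs)
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
round(rmse_from_rse, 2)
```

The result matches the computed RMSE to two decimals; the small remaining difference comes from rounding in the printed residual standard error.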

Hypothetical Example (paragraph)

Multicollinearity occurs when predictors are highly correlated with each other. Their effects overlap, so the model cannot separate each predictor's individual contribution, and the coefficient estimates become unstable with inflated standard errors. In this scenario, the coefficients describing each predictor's effect on CO2 emissions would be unreliable and could lead to untrustworthy conclusions, even if the overall predictions remain reasonable.
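One common way to detect multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below does this by hand on simulated predictors (an assumption; `x1`, `x2`, `x3`, and `vif_manual` are hypothetical names, not variables from the scenario).

```r
set.seed(42)

# Simulated predictors (hypothetical): x2 is nearly a copy of x1, x3 is unrelated
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)

# Hypothetical helper: VIF of one predictor given the remaining predictors
vif_manual <- function(target, others) {
  d  <- data.frame(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, data.frame(x2, x3))
vif_x3 <- vif_manual(x3, data.frame(x1, x2))
c(x1 = vif_x1, x3 = vif_x3)
```

A VIF far above the usual rule-of-thumb range of 5 to 10, as expected for `x1` here, signals that the predictor's coefficient is poorly determined; a VIF near 1, as expected for `x3`, signals no problem.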

HW 9

S Tonge

2026-04-07