#Load Datasets

data <- read.csv("AllCountries.csv")

#Simple Linear Regression

# Fit simple regression
model1 <- lm(LifeExpectancy ~ GDP, data = data)

# View results
summary(model1)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

#Estimated Coefficients

Intercept (β₀): 68.42

Slope for GDP (β₁): 0.000248

R²: 0.4304

#Interpretation

The intercept (68.42) represents the predicted life expectancy for a country with a GDP per capita of $0. Although unrealistic in practice, it provides a baseline for the regression line.

The slope (0.000248) means that each additional $1 in GDP per capita is associated with a predicted increase of 0.000248 years in life expectancy. Equivalently, a $1,000 difference in GDP per capita corresponds to a difference of about 0.248 years in life expectancy, on average.

The R² value of 0.4304 indicates that about 43.0% of the variation in life expectancy across countries is explained by GDP alone. This is a moderate positive relationship: more than half of the variation (about 57%) is left unexplained by GDP and must be attributed to other factors.
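To make the slope concrete, the fitted line can be evaluated by hand. The sketch below uses the rounded coefficients from the `summary(model1)` output; the GDP values are illustrative assumptions, not observations from the dataset.

```r
# Coefficients copied from summary(model1)
b0 <- 68.42      # intercept
b1 <- 2.476e-04  # slope for GDP

# Illustrative GDP per capita values (hypothetical, not from the dataset)
gdp <- c(1000, 10000, 40000)

# Predicted life expectancy from the fitted line
pred <- b0 + b1 * gdp
pred

# Difference in predicted life expectancy for a $1,000 gap in GDP per capita
per_thousand <- b1 * 1000
per_thousand
```

With the fitted object available, `predict(model1, newdata = data.frame(GDP = gdp))` should give the same predictions up to rounding of the coefficients.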

#Multiple Regression

# Fit multiple regression
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = data)

# View results
summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

#Estimated Coefficients

Intercept: 59.08

GDP: 0.0000237

Health: 0.2479

Internet: 0.1903

R²: 0.7213

Adjusted R²: 0.7164

#Interpretation of Health (Key Requirement)

Holding GDP and Internet access constant, a 1 percentage-point increase in government health spending is associated with an average increase of 0.248 years in life expectancy.

This means that countries devoting more of their government spending to health tend to have longer life expectancy, and the association is statistically significant (p ≈ 0.0002) even after accounting for economic wealth and digital access.
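The uncertainty around the Health coefficient can be summarised with a 95% confidence interval. The sketch below rebuilds it from the rounded estimate and standard error printed by `summary(model2)`, so the result is approximate.

```r
# Values copied from the Health row of summary(model2)
est <- 0.2479   # coefficient estimate
se  <- 0.06619  # standard error
df  <- 169      # residual degrees of freedom

# Two-sided 95% confidence interval: estimate +/- t critical value * SE
ci <- est + c(-1, 1) * qt(0.975, df) * se
ci
```

The interval runs from roughly 0.12 to 0.38 years per percentage point and excludes zero, which is consistent with the small p-value. With the fitted object available, `confint(model2)` should give the same interval up to rounding.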

#Adjusted R² Comparison

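Simple model Adjusted R² (GDP only): 0.4272

Multiple model Adjusted R²: 0.7164

The large increase in Adjusted R² (from 0.4272 to 0.7164) shows that Health and Internet dramatically improve the model’s explanatory power, confirming that life expectancy depends on much more than income alone. The two models were fit on slightly different samples (179 and 173 countries) because of missing values, so the comparison is approximate.

#Checking Assumptions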

# Residual vs Fitted (Homoscedasticity)
plot(model1, which = 1)

# Normal Q–Q plot (Normality)
plot(model1, which = 2)

#Homoscedasticity

Ideal outcome:

Random scatter around zero with constant vertical spread.

Violation would mean:

Unequal variance → inefficient estimates and unreliable hypothesis tests.

#Normality of Residuals

Histogram of residuals

Shapiro–Wilk test
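A minimal sketch of these two checks follows. It uses simulated stand-in data so that it runs on its own; the variable names and simulated values are assumptions, and in this report `res` would be `residuals(model1)`.

```r
# Simulated stand-in data (hypothetical); in the report use residuals(model1)
set.seed(1)
x <- runif(100, 0, 50000)
y <- 68 + 0.00025 * x + rnorm(100, sd = 6)
fit <- lm(y ~ x)
res <- residuals(fit)

# Histogram: look for a roughly symmetric, bell-shaped distribution
hist(res, breaks = 20, main = "Histogram of residuals", xlab = "Residual")

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw
```

The test is best read alongside the Q-Q plot, since a formal test alone does not show where the departure from normality occurs.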

Ideal outcome:

Points lie approximately on the straight line in the Q-Q plot.

Violation would mean:

Confidence intervals and p-values may be inaccurate.

#RMSE for Multiple Regression

# Residuals from the multiple regression (model2)
res2 <- residuals(model2)

# In-sample root mean squared error
RMSE <- sqrt(mean(res2^2))
RMSE
## [1] 4.056417

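The RMSE of about 4.06 years is the typical size of the gap between the model’s fitted life expectancy and the actual value for a country. It is computed on the same data used to fit the model, so it is likely to understate the error for new countries. This reflects moderate prediction accuracy for cross-country life expectancy.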
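The RMSE is slightly smaller than the residual standard error of 4.104 reported by `summary(model2)` because it divides the residual sum of squares by the number of observations rather than by the residual degrees of freedom. A short check using the rounded values from that output:

```r
# Values copied from summary(model2)
rse <- 4.104   # residual standard error
df  <- 169     # residual degrees of freedom
n   <- df + 4  # observations used: 169 + 4 estimated coefficients

# RMSE divides the residual sum of squares by n,
# the residual standard error divides it by df
rmse_from_rse <- rse * sqrt(df / n)
rmse_from_rse
```

The result is about 4.056, matching the computed RMSE up to rounding of the printed standard error.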

#Effect of Large Residuals

If certain countries have very large residuals:

Confidence in predictions for those countries decreases.

It may indicate:

Missing predictors (education, inequality, disease burden, conflict)

Measurement error

Unique country-specific circumstances

These should be examined using:

Cook’s Distance

Leverage

Outlier diagnostics

#Multicollinearity Check for Hypothetical Example

model3 <- lm(CO2 ~ Energy + Electricity, data = data)

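# Check correlation; both columns contain missing values, so a plain
# cor(data$Energy, data$Electricity) returns NA. Drop incomplete pairs instead.
cor(data$Energy, data$Electricity, use = "complete.obs")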
# Check VIF
library(car)
## Loading required package: carData
vif(model3)
##      Energy Electricity 
##     2.74052     2.74052
summary(model3)
## 
## Call:
## lm(formula = CO2 ~ Energy + Electricity, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7559  -1.1406  -0.2020   0.7143   7.3751 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.998e-01  2.655e-01   3.012  0.00311 ** 
## Energy       3.122e-03  1.066e-04  29.290  < 2e-16 ***
## Electricity -7.044e-04  5.526e-05 -12.747  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.331 on 131 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.899,  Adjusted R-squared:  0.8974 
## F-statistic: 582.8 on 2 and 131 DF,  p-value: < 2.2e-16

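Energy and Electricity measure closely related aspects of energy usage, so they are correlated and some multicollinearity is present. The VIF of about 2.74 for both predictors is below the common rule-of-thumb cutoffs of 5 or 10, so the problem is moderate rather than severe: the coefficient standard errors are somewhat inflated, and each coefficient must be read as an effect with the other predictor held fixed (which is why Electricity has a negative sign), but both coefficients are still estimated precisely and overall CO₂ prediction remains strong (R² = 0.899).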
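In a model with exactly two predictors, the VIF depends only on the correlation r between them: VIF = 1 / (1 − r²). The sketch below inverts that relation to recover the correlation implied by the reported VIF. It applies to the 134 complete cases used to fit `model3`, not to the full dataset.

```r
# VIF reported by vif(model3) for both predictors
vif_value <- 2.74052

# With two predictors, VIF = 1 / (1 - r^2), so r^2 = 1 - 1 / VIF
r_squared <- 1 - 1 / vif_value
r <- sqrt(r_squared)
r
```

This gives a correlation of about 0.80 in absolute value between Energy and Electricity in the fitted sample.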