#Load Datasets
data <- read.csv("AllCountries.csv")
#Simple Linear Regression
# Fit simple regression
model1 <- lm(LifeExpectancy ~ GDP, data = data)
# View results
summary(model1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
#Estimated Coefficients
Intercept (β₀): 68.42
Slope for GDP (β₁): 0.000248
R²: 0.4304
#Interpretation
The intercept (68.42) represents the predicted life expectancy, in years, for a country with a GDP per capita of $0. Although unrealistic in practice, it provides a baseline for the regression line.
The slope (0.000248) means that each additional $1 in GDP per capita is associated with a predicted increase of 0.000248 years in life expectancy. Equivalently, every $1,000 increase in GDP per capita is associated with about 0.248 more years of life expectancy, on average.
The R² value of 0.4304 indicates that 43.0% of the variation in life expectancy across countries is explained by GDP alone. This is a moderate positive relationship; the remaining 57% of the variation is not explained by GDP and reflects other factors.
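To make the slope interpretation concrete, the fitted line can be evaluated by hand. The sketch below hard-codes the rounded coefficients from the `summary(model1)` output rather than refitting the model; the GDP values of $10,000 and $11,000 are hypothetical inputs chosen only for illustration.

```r
# Coefficients copied from the summary(model1) output above
b0 <- 68.42        # intercept (years)
b1 <- 2.476e-04    # slope (years per $1 of GDP per capita)

# Predicted life expectancy from the fitted line
predict_life <- function(gdp) b0 + b1 * gdp

# Hypothetical GDP per capita values, for illustration only
gdp_values <- c(10000, 11000)
predicted <- predict_life(gdp_values)
predicted

# The difference for a $1,000 increase equals 1000 * b1
diff(predicted)
```

With the fitted object available, `predict(model1, newdata = data.frame(GDP = c(10000, 11000)))` should give the same predictions up to rounding of the coefficients.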
#Multiple Regression
# Fit multiple regression
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = data)
# View results
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
#Estimated Coefficients
Intercept: 59.08
GDP: 0.0000237
Health: 0.2479
Internet: 0.1903
R²: 0.7213
Adjusted R²: 0.7164
#Interpretation of Health (Key Requirement)
Holding GDP and Internet access constant, a 1 percentage-point increase in government health spending is associated with an average increase of 0.248 years in life expectancy.
This means that countries investing more in public health tend to have longer life expectancy, and the association is statistically significant (p = 0.000247) even after accounting for economic wealth and digital access. GDP, by contrast, is no longer significant once Health and Internet are in the model (p = 0.302).
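The size of the Health effect is easier to judge with an interval around it. The following is a minimal sketch of the usual 95% t-interval for a regression coefficient, using the estimate, standard error, and residual degrees of freedom copied from the `summary(model2)` output.

```r
# Values copied from the summary(model2) output above
estimate  <- 2.479e-01   # Health coefficient
std_error <- 6.619e-02   # its standard error
df_resid  <- 169         # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
round(ci, 3)
```

The interval runs from roughly 0.12 to 0.38 years per percentage point and excludes zero, consistent with the small p-value. With the fitted object available, `confint(model2, "Health")` should return the same interval.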
#Adjusted R² Comparison
Simple model Adjusted R² (GDP only): 0.4272
Multiple model Adjusted R²: 0.7164
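Adjusted R² penalizes R² for the number of predictors, which makes it the fairer basis for comparing models of different size. The sketch below reproduces the adjusted values from the two summary outputs, taking the sample size as the residual degrees of freedom plus the number of estimated coefficients (179 and 173 countries respectively). Because the two models are therefore fitted to slightly different samples, the comparison is approximate.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual df + number of coefficients (taken from the summaries above)
adj_simple   <- adj_r2(r2 = 0.4304, n = 177 + 2, p = 1)
adj_multiple <- adj_r2(r2 = 0.7213, n = 169 + 4, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 4)
```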
The large increase in Adjusted R² shows that Health and Internet dramatically improve the model’s explanatory power, confirming that life expectancy depends on much more than income alone.
#Checking Assumptions
# Residuals vs Fitted (Homoscedasticity)
plot(model1, which = 1)
# Normal Q–Q plot (Normality)
plot(model1, which = 2)
#Homoscedasticity
Ideal outcome:
Random scatter around zero with constant vertical spread.
Violation would mean:
Unequal variance → inefficient estimates and unreliable hypothesis tests.
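The visual check can be complemented with a formal test. The following is a minimal sketch of the studentized Breusch-Pagan test written in base R. It regresses the squared residuals on the predictors and compares n times the R² of that auxiliary regression to a chi-squared distribution. The helper name `bp_test()` and the simulated variables `x` and `y` are hypothetical stand-ins (not columns of the country dataset), used so the example runs on its own.

```r
# Studentized Breusch-Pagan test for a fitted lm object (base R only)
bp_test <- function(model) {
  X <- model.matrix(model)                 # predictors, including intercept
  e2 <- residuals(model)^2                 # squared residuals
  aux <- lm.fit(X, e2)                     # auxiliary regression of e^2 on X
  r2_aux <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2_aux              # LM statistic: n * R^2
  df <- ncol(X) - 1                        # number of predictors
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example: error spread grows with x (heteroscedastic by design)
set.seed(1)
x <- runif(200, 1, 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 0.3 * x)
sim_model <- lm(y ~ x)
bp_result <- bp_test(sim_model)
bp_result
```

A small p-value suggests non-constant variance. In this report the call would be `bp_test(model1)`; `lmtest::bptest()` and `car::ncvTest()` are packaged tests of the same assumption.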
#Normality of Residuals
Histogram of residuals
Shapiro–Wilk test
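Both checks are available in base R. A minimal sketch follows, wrapped in a hypothetical helper `check_normality()` and demonstrated on simulated data so that it is self-contained; in this report the helper would be called on `model1`.

```r
# Histogram and Shapiro-Wilk test for the residuals of a fitted lm object
check_normality <- function(model) {
  res <- residuals(model)
  hist(res, breaks = 20, main = "Histogram of residuals", xlab = "Residual")
  shapiro.test(res)   # H0: residuals come from a normal distribution
}

# Simulated stand-in data (not the country dataset)
set.seed(42)
x <- rnorm(150)
y <- 1 + 2 * x + rnorm(150)
sw <- check_normality(lm(y ~ x))
sw$p.value
```

A Shapiro-Wilk p-value below 0.05 indicates a departure from normality, while a roughly symmetric, bell-shaped histogram supports the assumption.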
Ideal outcome:
Points lie approximately on the straight line in the Q-Q plot.
Violation would mean:
Confidence intervals and p-values may be inaccurate.
#RMSE for Multiple Regression
# Residuals from the multiple regression (model2)
res2 <- residuals(model2)
# RMSE: square root of the mean squared residual
RMSE <- sqrt(mean(res2^2))
RMSE
## [1] 4.056417
The model’s predicted life expectancy typically differs from the actual value by about 4 years (RMSE = 4.06). This reflects moderate prediction accuracy for cross-country life expectancy.
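The RMSE is slightly smaller than the residual standard error of 4.104 reported by `summary(model2)` because the two divide the same sum of squared residuals by different quantities: RMSE by the number of observations, the residual standard error by the residual degrees of freedom. A short check of that relationship, taking n = 173 as 169 degrees of freedom plus 4 estimated coefficients:

```r
# Values copied from the summary(model2) output
rse      <- 4.104          # residual standard error
df_resid <- 169            # residual degrees of freedom
n        <- df_resid + 4   # observations used: df + 4 coefficients

# Both are based on the same sum of squared residuals (SSR):
#   rse  = sqrt(SSR / df_resid),  rmse = sqrt(SSR / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
round(rmse_from_rse, 3)
```

The result agrees with the RMSE computed from the residuals, up to rounding of the residual standard error.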
#Effect of Large Residuals
If certain countries have very large residuals:
Confidence in predictions for those countries decreases.
It may indicate:
Missing predictors (education, inequality, disease burden, conflict)
Measurement error
Unique country-specific circumstances
These should be examined using:
Cook’s Distance
Leverage
Outlier diagnostics
#Multicollinearity Check for Hypothetical Example
model3 <- lm(CO2 ~ Energy + Electricity, data = data)
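# Check correlation on complete cases only
# (without `use`, cor() returns NA because both columns contain missing values)
cor(data$Energy, data$Electricity, use = "complete.obs")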
# Check VIF
library(car)
## Loading required package: carData
vif(model3)
## Energy Electricity
## 2.74052 2.74052
summary(model3)
##
## Call:
## lm(formula = CO2 ~ Energy + Electricity, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7559 -1.1406 -0.2020 0.7143 7.3751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.998e-01 2.655e-01 3.012 0.00311 **
## Energy 3.122e-03 1.066e-04 29.290 < 2e-16 ***
## Electricity -7.044e-04 5.526e-05 -12.747 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.331 on 131 degrees of freedom
## (83 observations deleted due to missingness)
## Multiple R-squared: 0.899, Adjusted R-squared: 0.8974
## F-statistic: 582.8 on 2 and 131 DF, p-value: < 2.2e-16
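Because Energy and Electricity measure closely related aspects of energy usage, the two predictors are correlated, which inflates the standard errors of their coefficients and makes the individual estimates harder to interpret. The negative sign on Electricity, for example, may be an artifact of that overlap. The VIF of 2.74 is below the common rule-of-thumb thresholds of 5 and 10, so the multicollinearity here is moderate rather than severe, and both coefficients remain statistically significant. Overall CO₂ prediction remains strong (R² = 0.899).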
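With exactly two predictors, the VIF equals 1 / (1 - r²), where r is the correlation between them in the sample the model used. That relationship gives the correlation implied by the reported VIF (the plain `cor()` call returns `NA` because of missing values), and the square root of the VIF gives the factor by which the standard errors are inflated. A minimal sketch using the VIF from the output above:

```r
# VIF reported by car::vif(model3)
vif_value <- 2.74052

# Two-predictor case: VIF = 1 / (1 - r^2), so r^2 = 1 - 1 / VIF
r_squared <- 1 - 1 / vif_value
abs_r <- sqrt(r_squared)   # magnitude of the correlation (sign is not recoverable)

# Standard errors are inflated by sqrt(VIF) relative to uncorrelated predictors
se_inflation <- sqrt(vif_value)

round(c(abs_r = abs_r, se_inflation = se_inflation), 3)
```

The implied correlation is about 0.80 in magnitude, and the standard errors are roughly 1.66 times larger than they would be with uncorrelated predictors.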