1
# Load the data
AllCountries <- read.csv("AllCountries.csv")
# View structure
str(AllCountries)
## 'data.frame': 217 obs. of 26 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : int 521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
## $ Rural : num 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num 3.72 4.08 13.81 NA NA ...
## $ Health : num 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : int 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num 67.4 123.7 111 NA 104.4 ...
## $ HIV : num NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : int NA 808 1328 NA NA 545 NA 2030 1016 NA ...
## $ Electricity : int NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
## $ Developed : int NA 1 1 NA NA 1 NA 2 1 NA ...
2
# Fit simple linear regression
model1 <- lm(LifeExpectancy ~ GDP, data = AllCountries)
# Model summary
summary(model1)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Intercept: 68.42
Slope (GDP): 0.00025
R²: 0.430
Interpretation
Intercept (68.42): When GDP per capita is zero, the predicted life expectancy is about 68.4 years. No country has a GDP of zero, so this is an extrapolation that anchors the line rather than a meaningful prediction.
Slope (0.00025): Each additional $1 of GDP per capita is associated with an increase of about 0.00025 years in predicted life expectancy. Equivalently, a $10,000 increase in GDP per capita is associated with about a 2.5-year increase in life expectancy.
R² = 0.43: GDP explains about 43% of the variation in life expectancy across countries. This is a moderate relationship.
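The slope interpretation can be checked with a little arithmetic. The sketch below hard-codes the two coefficients printed by summary(model1); the helper name predict_life and the two GDP values are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(model1) output above
b0 <- 68.42       # intercept
b1 <- 2.476e-04   # slope for GDP

# Hypothetical helper: predicted life expectancy at a given GDP per capita
predict_life <- function(gdp) b0 + b1 * gdp

# Illustrative GDP values (assumed, not taken from the data)
gdp_values <- c(10000, 20000)
predicted <- predict_life(gdp_values)
print(predicted)

# Change in predicted life expectancy for a $10,000 increase in GDP
effect_10k <- predicted[2] - predicted[1]
print(effect_10k)
```

With the fitted model object available, predict(model1, newdata = data.frame(GDP = gdp_values)) should give essentially the same numbers, up to rounding of the printed coefficients.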
3
# Fit multiple linear regression
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
# Model summary
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Intercept: 59.08
GDP: 0.0000237
Health: 0.2479
Internet: 0.1903
Adjusted R²: 0.716
Interpretation of Health Coefficient
The Health coefficient (0.2479) means:
For every one-percentage-point increase in government health spending, predicted life expectancy is about 0.25 years higher, holding GDP and Internet constant.
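To make "holding GDP and Internet constant" concrete, here is a minimal sketch that plugs the printed coefficients of model2 into the regression equation for a made-up country. The helper predict_life2 and the input values (GDP 10,000, Health 10, Internet 50) are assumptions chosen only for illustration.

```r
# Coefficients copied from the summary(model2) output above
coefs <- c(intercept = 59.08, GDP = 2.367e-05,
           Health = 0.2479, Internet = 0.1903)

# Hypothetical helper: evaluate the fitted equation by hand
predict_life2 <- function(gdp, health, internet) {
  unname(coefs["intercept"] +
           coefs["GDP"] * gdp +
           coefs["Health"] * health +
           coefs["Internet"] * internet)
}

# Two made-up countries that differ only in Health (by one point)
base_case   <- predict_life2(gdp = 10000, health = 10, internet = 50)
more_health <- predict_life2(gdp = 10000, health = 11, internet = 50)

print(base_case)
print(more_health - base_case)  # equals the Health coefficient
```

Because only Health changes between the two cases, the difference in predictions is exactly the Health coefficient.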
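Comparison of R² Values
Model                                 R²      Adjusted R²
Simple (GDP only)                     0.430   0.427
Multiple (GDP + Health + Internet)    0.721   0.716
Adjusted R² rises from about 0.43 to 0.72, so adding Health and Internet substantially improves the model's ability to explain life expectancy. Once they are included, GDP is no longer statistically significant (p = 0.30), which suggests that much of its apparent effect in the simple model is shared with the other predictors. The two models are also fit to slightly different samples (179 and 173 countries) because of missing values, so the comparison is approximate.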
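Adjusted R² is the fairer number to compare when models have different numbers of predictors, because it penalizes each added term. As a rough check, the sketch below recomputes it from the printed R² values using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sample sizes are inferred from the residual degrees of freedom in the two summaries (177 + 2 and 169 + 4), and adjusted_r2 is a hypothetical helper name.

```r
# Standard adjusted R-squared formula:
#   r2 = multiple R-squared, n = observations used, p = predictors
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values read from the two summary() outputs above
adj_simple   <- adjusted_r2(r2 = 0.4304, n = 179, p = 1)
adj_multiple <- adjusted_r2(r2 = 0.7213, n = 173, p = 3)

print(round(c(simple = adj_simple, multiple = adj_multiple), 4))
```

The results should agree with the adjusted values R printed, up to rounding of the inputs.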
4
To check the assumption of homoscedasticity (equal variance), I use a residuals versus fitted values plot. I look for a random scatter of points with no clear pattern, which would indicate that the variance of the residuals is constant across all fitted values. If I observe a cone or funnel shape, this would indicate a violation of the assumption. The coefficient estimates would still be unbiased, but the standard errors, and therefore the p-values and confidence intervals, would be unreliable.
To check the assumption of normality of residuals, I use a Q–Q plot. I expect the points to closely follow a straight diagonal line, which would indicate that the residuals are approximately normally distributed. If the points curve away from the line, especially in the tails, this would indicate a violation of the normality assumption and suggest that the p-values and confidence intervals from the model may be inaccurate, particularly in small samples.
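# Residual plots for model1, two panels side by side
par(mfrow = c(1, 2))
# Left panel: residuals vs fitted values (checks constant variance)
plot(fitted(model1), resid(model1),
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0)  # reference line at zero
# Right panel: normal Q-Q plot (checks normality of residuals)
qqnorm(resid(model1))
qqline(resid(model1))
par(mfrow = c(1, 1))  # restore the single-panel layout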
Reflection
The residuals show some deviation from normality, consistent with the long lower tail in the residual summary (minimum -16.4 versus maximum 9.3).
Mild heteroscedasticity is present.
The model is still usable, but predictions may be less reliable at extreme GDP values.
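Visual checks can be backed up with formal tests. The sketch below is not run on AllCountries; it uses simulated data (an assumption, so that it is self-contained) in which the error spread grows with the predictor. It applies shapiro.test() to the residuals and a hand-computed studentized Breusch-Pagan-style statistic: the squared residuals are regressed on the predictor and n times the auxiliary R² is compared with a chi-squared distribution.

```r
set.seed(1)

# Simulated data (assumed for illustration): error spread grows with x
n <- 500
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.2 + x)
fit <- lm(y ~ x)

# Normality check: Shapiro-Wilk test on the residuals
sw <- shapiro.test(resid(fit))
print(sw$p.value)

# Constant-variance check (studentized Breusch-Pagan style):
# regress squared residuals on the predictor, then use n * R^2
sq_res <- resid(fit)^2
aux <- lm(sq_res ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = 1 predictor
print(bp_p)
```

A small p-value from the second check points to non-constant variance. The same steps could be applied to model1 by replacing fit with model1 and x with the GDP values of the rows the model actually used.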
5
# RMSE of model2 on the rows used to fit it (in-sample)
# resid() returns residuals only for the complete cases that lm() kept
rmse <- sqrt(mean(resid(model2)^2))
rmse
## [1] 4.056417
Interpretation
The typical prediction error is about 4 years of life expectancy. This is an in-sample figure, so the error on new data is likely to be somewhat larger.
Large residuals may point to countries with unusual healthcare systems, inequality, or political instability.
Further investigation may include:
Regional effects
War/conflict
Education levels
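The RMSE above can be cross-checked against the residual standard error in summary(model2). Both are built from the same residual sum of squares, but the residual standard error divides by the residual degrees of freedom (169) while the RMSE divides by the number of observations used (173, inferred here as 169 + 4 coefficients). A minimal sketch of that arithmetic, using only numbers printed earlier:

```r
# Values read from the summary(model2) output above
rse      <- 4.104   # residual standard error
df_resid <- 169     # residual degrees of freedom
n_used   <- df_resid + 4   # observations used (4 estimated coefficients)

# RSE^2 * df gives the residual sum of squares; divide by n for the MSE
rmse_check <- rse * sqrt(df_resid / n_used)
print(rmse_check)
```

The result should match the reported RMSE to about three decimal places; the small gap comes from rounding in the printed residual standard error.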
6
Model:
CO2 ~ Energy + Electricity
Explanation
If Energy and Electricity are highly correlated, then:
Coefficients become unstable
Standard errors increase
Individual predictor significance becomes unreliable
Model interpretation becomes misleading
Even if R² is high, you cannot trust the individual effects of Energy or Electricity separately.
Solution
Diagnose the problem with:
VIF (Variance Inflation Factor)
Then either:
Remove one predictor
Combine the predictors into an index
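The effect of multicollinearity can be illustrated with a small simulation. The data below are made up (an assumption, so the sketch does not depend on AllCountries): electricity is constructed to be almost a multiple of energy. The VIF is computed by hand as 1 / (1 - R²) from regressing one predictor on the other, which is the usual definition; packages such as car offer a vif() function for fitted models, but it is not needed here.

```r
set.seed(42)

# Simulated, strongly correlated predictors (assumed for illustration)
n <- 200
energy      <- rnorm(n, mean = 2000, sd = 500)
electricity <- 1.5 * energy + rnorm(n, sd = 100)
co2         <- 0.002 * energy + rnorm(n, sd = 1)

# VIF for energy: regress it on the other predictor
r2_aux <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_aux)
print(vif_energy)

# Compare the standard error of the energy coefficient
# with and without the correlated predictor in the model
fit_both <- lm(co2 ~ energy + electricity)
fit_one  <- lm(co2 ~ energy)
se_both <- summary(fit_both)$coefficients["energy", "Std. Error"]
se_one  <- summary(fit_one)$coefficients["energy", "Std. Error"]
print(c(both = se_both, energy_only = se_one))
```

A VIF above roughly 5 to 10 is a common rule of thumb for problematic collinearity. In this simulation the standard error for energy is several times larger when electricity is also in the model, which is the instability described above.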