1

# Load the data
AllCountries <- read.csv("AllCountries.csv")

# View structure
str(AllCountries)
## 'data.frame':    217 obs. of  26 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num  652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num  37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num  56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : int  521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
##  $ Rural         : num  74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num  0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num  0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num  3.72 4.08 13.81 NA NA ...
##  $ Health        : num  2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : int  323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num  11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num  67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num  NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num  30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num  9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num  32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num  6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num  2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num  64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num  50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num  1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : int  NA 808 1328 NA NA 545 NA 2030 1016 NA ...
##  $ Electricity   : int  NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
##  $ Developed     : int  NA 1 1 NA NA 1 NA 2 1 NA ...

2

# Fit simple linear regression
model1 <- lm(LifeExpectancy ~ GDP, data = AllCountries)
# Model summary
summary(model1)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Intercept: 68.42

Slope (GDP): 0.00025

R²: 0.430

Interpretation

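Intercept (68.42): When GDP is zero, the predicted life expectancy is about 68.4 years. No country has a GDP per capita of zero, so this value is an extrapolation that anchors the line rather than a realistic prediction.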

Slope (0.00025): For every $1 increase in GDP per capita, predicted life expectancy increases by about 0.00025 years. Equivalently, a $10,000 increase in GDP per capita is associated with about a 2.5-year increase in predicted life expectancy.

R² = 0.43: GDP explains about 43% of the variation in life expectancy across the 179 countries with complete data (38 were dropped for missing values). This is a moderate relationship.
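
The slope arithmetic above can be checked with a short sketch. The two estimates are typed in by hand from the summary(model1) output (an assumption made so the snippet runs without the data file), and the $20,000 GDP value is a hypothetical chosen only for illustration. With the fitted model in memory, coef(model1) would supply the same numbers.

```r
# Estimates copied by hand from the summary(model1) output
intercept <- 68.42
slope     <- 2.476e-04   # years of life expectancy per $1 of GDP per capita

# Change in predicted life expectancy for a $10,000 increase in GDP per capita
effect_10k <- slope * 10000
effect_10k

# Predicted life expectancy at a hypothetical GDP per capita of $20,000
pred_20k <- intercept + slope * 20000
pred_20k
```

Rescaling the slope to a $10,000 step gives roughly 2.5 years, which is easier to read than a per-dollar effect.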

3

# Fit multiple linear regression
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)

# Model summary
summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Intercept: 59.08

GDP: 0.0000237

Health: 0.2479

Internet: 0.1903

Adjusted R²: 0.716

Interpretation of Health Coefficient

The Health coefficient (0.2479) means:

For every 1-percentage-point increase in government health spending (a one-unit increase in Health), predicted life expectancy is about 0.25 years higher, holding GDP and Internet constant.

Comparison of R² Values

Model                                  R²      Adjusted R²
Simple (GDP only)                      0.430   0.427
Multiple (GDP + Health + Internet)     0.721   0.716

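This large increase shows that adding Health and Internet greatly improves the model’s ability to explain life expectancy. Once they are included, GDP is no longer statistically significant (p = 0.30). The two models are fit to slightly different sets of countries (179 versus 173 with complete data), so the comparison is approximate.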
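
Because the two models have different numbers of predictors, adjusted R² is the fairer basis for comparison. The sketch below recomputes it from the R² values and degrees of freedom printed in the two summaries. The helper name adj_r2 is hypothetical, and the inputs are copied by hand from the output rather than read from the fitted models.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of estimated coefficients
adj_simple   <- adj_r2(r2 = 0.4304, n = 177 + 2, p = 1)
adj_multiple <- adj_r2(r2 = 0.7213, n = 169 + 4, p = 3)

round(c(simple = adj_simple, multiple = adj_multiple), 3)
```

The penalty for the two extra predictors is small here, so the gain in explained variation is not just a side effect of adding terms.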

4

To check the assumption of homoscedasticity (equal variance), I use a residuals versus fitted values plot. I look for a random scatter of points with no clear pattern, which would indicate that the variance of the residuals is constant across all fitted values. If I observe a cone or funnel shape, this would indicate a violation of the assumption and suggest that the standard errors, and therefore the p-values and confidence intervals, may be unreliable.

To check the assumption of normality of residuals, I use a Q–Q plot. I expect the points to closely follow a straight diagonal line, which would indicate that the residuals are normally distributed. If the points curve away from the line, especially in the tails, this would indicate a violation of the normality assumption and suggest that the p-values and confidence intervals may be inaccurate, particularly in small samples.

# Residual plots
par(mfrow = c(1,2))

plot(model1$fitted.values, resid(model1),
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0)

qqnorm(resid(model1))
qqline(resid(model1))

Reflection

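The residuals show some deviation from normality: the residual summary for model1 (minimum -16.35, median 1.55, maximum 9.33) points to a longer left tail.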

Mild heteroscedasticity is present.

The model is still usable, but predictions may be less reliable at extreme GDP values.
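
The visual checks can be supplemented with numeric ones. The sketch below is a minimal illustration on simulated stand-in data (not AllCountries), so every variable name and value in it is an assumption. It runs a Shapiro-Wilk test on the residuals and then regresses the squared residuals on the fitted values, where a clearly nonzero slope hints at non-constant variance.

```r
set.seed(42)   # simulated stand-in data, not the AllCountries file
x <- runif(200, 0, 100)
y <- 60 + 0.2 * x + rnorm(200, sd = 1 + 0.05 * x)   # noise grows with x
fit <- lm(y ~ x)

# Normality: Shapiro-Wilk test (a small p-value is evidence against normality)
sw <- shapiro.test(resid(fit))
sw$p.value

# Constant variance: regress squared residuals on fitted values
sq_res <- resid(fit)^2
fv     <- fitted(fit)
aux    <- lm(sq_res ~ fv)
summary(aux)$coefficients
```

Replacing fit with model1 would apply the same two checks to the regression in this report.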

5

# RMSE calculation
# resid(model2) is observed minus fitted life expectancy for the rows used in the fit
# (rows with missing values were already dropped by lm)
rmse <- sqrt(mean(resid(model2)^2))

rmse
## [1] 4.056417

Interpretation

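The typical prediction error is about 4 years of life expectancy. Because the RMSE is computed on the same data used to fit the model, the error for new countries may be somewhat larger.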
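
The RMSE can be cross-checked against the residual standard error reported by summary(model2). Both come from the same residual sum of squares, but the RMSE divides by the number of observations and the residual standard error divides by the residual degrees of freedom. The sketch below uses values typed in by hand from that output.

```r
sigma_hat <- 4.104          # residual standard error from summary(model2)
df_resid  <- 169            # residual degrees of freedom
n         <- df_resid + 4   # observations used: df plus 4 estimated coefficients

# Rescale from a divisor of df_resid to a divisor of n
rmse_check <- sigma_hat * sqrt(df_resid / n)
rmse_check
```

The result agrees with the RMSE computed above to about three decimal places, with the small gap due to rounding in the printed residual standard error.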

Large residuals may point to countries with unusual healthcare systems, inequality, or political instability.

Further investigation may include:

Regional effects

War/conflict

Education levels

6

Model:

CO2 ~ Energy + Electricity

Explanation

If Energy and Electricity are highly correlated, then:

Coefficients become unstable

Standard errors increase

Individual predictor significance becomes unreliable

Model interpretation becomes misleading

Even if R² is high, you cannot trust the individual effects of Energy or Electricity separately.

Solution

Use VIF (Variance Inflation Factor) to detect the problem, then either:

Remove one predictor

Combine the predictors into a single index
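
The VIF idea can be sketched without extra packages. For a given predictor, VIF is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The data below are simulated stand-ins built so that the two predictors are strongly correlated. They are not the AllCountries values, and all names and numbers are assumptions for illustration.

```r
set.seed(1)   # simulated stand-in data, not the AllCountries file
n <- 150
energy      <- rnorm(n, mean = 2000, sd = 500)
electricity <- 1.5 * energy + rnorm(n, sd = 200)   # built to track energy closely
co2         <- 0.002 * energy + 0.001 * electricity + rnorm(n)

# VIF for energy: regress it on the other predictor and use 1 / (1 - R^2)
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)
vif_energy

# Standard error of the energy coefficient with and without electricity
se_both <- summary(lm(co2 ~ energy + electricity))$coefficients["energy", "Std. Error"]
se_one  <- summary(lm(co2 ~ energy))$coefficients["energy", "Std. Error"]
c(both = se_both, energy_only = se_one)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a warning sign. For a fitted lm, the vif() function in the car package reports the same quantity.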