R Markdown

#Simple Linear Regression (Fitting and Interpretation)
df2 <- read.csv("AllCountries.csv", stringsAsFactors = FALSE)
model1 <- lm(LifeExpectancy ~ GDP, data = df2)
summary(model1)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

The GDP coefficient means that each additional dollar of GDP per capita is associated with a predicted increase of about 0.0002476 years in average life expectancy, or roughly 0.25 years per additional $1,000. Because GDP is the only predictor in this model, no other factors are held constant. The relationship is positive and statistically significant (t = 11.56, p < 2e-16), although a regression on observational data does not by itself show that higher GDP causes longer lives. The Multiple R-squared is 0.4304, so about 43% of the variation in life expectancy across the 179 countries with complete data is explained by the linear relationship with GDP per capita.
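A per-dollar slope such as 2.476e-04 is easier to read after rescaling the predictor. The sketch below uses simulated stand-in data (the vectors `gdp` and `life` are hypothetical, not the AllCountries values) to show that dividing the predictor by 1,000 multiplies the slope by 1,000 and leaves R-squared unchanged.

```r
# Simulated stand-in data (hypothetical; not the AllCountries values)
set.seed(1)
gdp  <- runif(100, 500, 60000)                    # GDP per capita in dollars
life <- 68 + 0.00025 * gdp + rnorm(100, sd = 5)   # life expectancy in years

fit_dollars   <- lm(life ~ gdp)                   # slope in years per dollar
gdp_thousands <- gdp / 1000
fit_thousands <- lm(life ~ gdp_thousands)         # slope in years per $1,000

slope_dollars   <- unname(coef(fit_dollars)["gdp"])
slope_thousands <- unname(coef(fit_thousands)["gdp_thousands"])

# Rescaling the predictor rescales the slope by the same factor;
# the quality of the fit (R-squared) does not change.
print(all.equal(slope_thousands, 1000 * slope_dollars))
print(all.equal(summary(fit_dollars)$r.squared,
                summary(fit_thousands)$r.squared))
```

Applied to the fit above, the same arithmetic gives 0.0002476 × 1000 ≈ 0.25 years per additional $1,000 of GDP per capita.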

#Multiple Linear Regression (Fitting and Interpretation) 
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = df2)
summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

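This model fits better: the adjusted R-squared rises from 0.43 to 0.72 and the residual standard error falls from 5.901 to 4.104 years (the two models were fit to slightly different sets of countries, 179 versus 173, because of missing values). Together, GDP, Health, and Internet explain about 72% of the variation in life expectancy across countries. The key finding is the effect of Health spending. The Health coefficient of 0.2479 means that a one-percentage-point increase in the percentage of spending devoted to healthcare is associated with a predicted increase of about 0.25 years (about three months) in average life expectancy, holding GDP and Internet access constant. Once Health and Internet are included, the GDP coefficient drops from 2.476e-04 to 2.367e-05 and is no longer statistically significant (p = 0.30). This suggests that, for life expectancy, the percentage of money spent on health and access to technology such as the Internet may matter more than GDP alone. One possible reason is that GDP is correlated with both of these predictors, so they absorb much of the effect that GDP appeared to have on its own.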
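Because `lm()` silently drops rows with missing values, the one-predictor and three-predictor fits can end up using different countries. A minimal sketch of a like-for-like comparison follows, using simulated data; the data frame `d` and its values are hypothetical and only reuse the column names from the models above.

```r
# Simulated stand-in data with some missing values (hypothetical)
set.seed(2)
n <- 200
d <- data.frame(GDP      = runif(n, 500, 60000),
                Health   = runif(n, 2, 25),
                Internet = runif(n, 1, 95))
d$LifeExpectancy <- 60 + 0.00002 * d$GDP + 0.25 * d$Health +
  0.19 * d$Internet + rnorm(n, sd = 4)
d$Health[sample(n, 10)]   <- NA
d$Internet[sample(n, 10)] <- NA

# Keep only rows that are complete for every variable in the larger model
vars <- c("LifeExpectancy", "GDP", "Health", "Internet")
cc   <- d[complete.cases(d[, vars]), ]

small <- lm(LifeExpectancy ~ GDP, data = cc)
large <- lm(LifeExpectancy ~ GDP + Health + Internet, data = cc)

# Both models now use the same rows, so the comparison is fair
print(nobs(small) == nobs(large))
print(anova(small, large))                 # F-test for the two added predictors
print(c(small = summary(small)$adj.r.squared,
        large = summary(large)$adj.r.squared))
```

The nested-model F-test from `anova()` is only valid when both fits use identical rows, which is why the complete-case subset is built first.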

#Assumption checks
par(mfrow = c(1,2))   
plot(model1, which = 1)  
plot(model1, which = 2) 

par(mfrow = c(1,1))

For the simple linear regression (LifeExpectancy ~ GDP), the Residuals vs Fitted plot shows a curved pattern: the red smooth line bends downward, so the relationship is not well described by a straight line and the linearity assumption is violated. The spread of the residuals also changes across the fitted values, with the variance increasing at higher fitted values, which indicates a violation of the homoscedasticity (constant variance) assumption. The Q–Q plot shows the middle values following the line fairly closely, but the tails deviate noticeably, indicating some non-normality due to outliers or extreme cases. Linearity, homoscedasticity, and normality are therefore not fully satisfied in this model, which reduces confidence in using GDP alone to predict life expectancy across all countries.
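One common response to a curved residual pattern with a right-skewed predictor such as GDP is to model the logarithm of the predictor. This was not done in the analysis above; the sketch below only illustrates the idea on simulated data (the vectors `gdp` and `life` are hypothetical) where the true relationship is logarithmic by construction.

```r
# Simulated data with a logarithmic relationship (hypothetical values)
set.seed(3)
gdp  <- exp(runif(150, log(500), log(80000)))   # right-skewed, like GDP
life <- 30 + 4.5 * log(gdp) + rnorm(150, sd = 3)

fit_raw <- lm(life ~ gdp)        # straight line in raw GDP
fit_log <- lm(life ~ log(gdp))   # straight line in log(GDP)

# Compare how much variation each specification explains
print(c(raw = summary(fit_raw)$r.squared,
        log = summary(fit_log)$r.squared))

# Residuals vs Fitted for both fits, side by side
par(mfrow = c(1, 2))
plot(fit_raw, which = 1)
plot(fit_log, which = 1)
par(mfrow = c(1, 1))
```

Whether a log transform helps on the real data would have to be checked with the same diagnostic plots; the slope would then be read as the change in life expectancy per unit change in log(GDP).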

pred2 <- predict(model2)                              # fitted values for the 173 rows the model used

res2 <- model.frame(model2)$LifeExpectancy - pred2    # observed minus fitted on those same rows
RMSE <- sqrt(mean(res2^2))
RMSE                                                  # slightly below the residual standard error (4.104) from summary(model2)
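Because `lm()` drops rows with missing values, `predict()` without new data returns fewer values than the original column has rows, so the residuals have to come from the rows the model actually used. The sketch below shows three equivalent ways to get the in-sample RMSE, using R's built-in `airquality` data (which contains missing values) as a stand-in for the country data.

```r
# airquality ships with R and has missing values in Ozone and Solar.R
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality)

# Fewer fitted values than rows: incomplete rows were dropped
print(length(predict(fit)) < nrow(airquality))

# 1. Directly from the model's residuals
rmse_resid <- sqrt(mean(residuals(fit)^2))

# 2. Observed minus fitted, using only the rows the model kept
used        <- model.frame(fit)
rmse_manual <- sqrt(mean((used$Ozone - predict(fit))^2))

# 3. From the residual standard error, which divides by the residual
#    degrees of freedom instead of the number of observations
rmse_sigma <- sigma(fit) * sqrt(df.residual(fit) / nobs(fit))

print(c(rmse_resid, rmse_manual, rmse_sigma))
```

For model2, the third identity with the values printed by `summary(model2)` gives roughly 4.104 × sqrt(169/173) ≈ 4.06.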

When Energy and Electricity are highly correlated (multicollinearity), the model cannot separate their individual effects on CO₂ emissions. Their coefficient estimates become unstable, their standard errors increase, and their signs may flip between positive and negative. Even if the model predicts CO₂ emissions well, the individual coefficients are unreliable.
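The effect can be illustrated with a variance inflation factor (VIF), computed here by hand as 1 / (1 - R²) from regressing one predictor on the other. The data are simulated and the variables `energy`, `electricity`, and `co2` are hypothetical stand-ins, with `electricity` built to be almost a copy of `energy`.

```r
# Simulated, highly correlated predictors (hypothetical values)
set.seed(4)
n           <- 200
energy      <- rnorm(n, mean = 50, sd = 10)
electricity <- energy + rnorm(n, sd = 1)          # nearly a copy of energy
co2         <- 2 + 0.5 * energy + rnorm(n, sd = 5)

fit_one  <- lm(co2 ~ energy)                      # energy alone
fit_both <- lm(co2 ~ energy + electricity)        # both correlated predictors

# VIF for energy: how much its coefficient variance is inflated
r2_aux     <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_aux)

# Standard error of the energy coefficient, with and without electricity
se_one  <- summary(fit_one)$coefficients["energy", "Std. Error"]
se_both <- summary(fit_both)$coefficients["energy", "Std. Error"]

print(c(vif = vif_energy, se_alone = se_one, se_with_electricity = se_both))
```

A VIF above about 5 or 10 is a common rule-of-thumb warning sign. Typical remedies are to drop one of the two predictors or combine them into a single measure.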