library(dplyr)
data <- read.csv("~/Downloads/AllCountries.csv")

Introduction

This assignment uses the AllCountries dataset to study life expectancy across countries. The dataset has 217 observations and 26 variables. Each row represents a country, and the variables include GDP, healthcare spending, internet access, population, energy use, and life expectancy.

The main question is whether GDP can predict life expectancy, and whether adding healthcare spending and internet access improves the model.

1. Simple Linear Regression

We first fit a simple linear regression model to predict LifeExpectancy using GDP.

simple_data <- data %>%
  select(Country, LifeExpectancy, GDP) %>%
  filter(!is.na(LifeExpectancy), !is.na(GDP))

model1 <- lm(LifeExpectancy ~ GDP, data = simple_data)

summary(model1)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = simple_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

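The intercept is about 68.42. This means that when GDP per capita is 0, the model predicts a life expectancy of about 68.42 years. No country has a GDP of 0, so the intercept is an extrapolation that anchors the line rather than a realistic prediction.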

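The slope for GDP is about 0.000248. This means that for every 1 dollar increase in GDP per capita, life expectancy is predicted to increase by about 0.000248 years. Put in more practical units, a 1,000 dollar increase in GDP per capita is associated with an increase of about 0.25 years.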
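The unit conversion above can be checked with a short sketch. It uses simulated data (the data frame `sim` and its values are made up for illustration and are not the AllCountries data) to show that dividing the predictor by 1,000 multiplies the slope by 1,000 and leaves the fit itself unchanged.

```r
# Simulated stand-in for the real data (hypothetical values, not AllCountries)
set.seed(1)
sim <- data.frame(GDP = runif(100, 500, 60000))
sim$LifeExpectancy <- 68 + 0.00025 * sim$GDP + rnorm(100, sd = 5)

# Same model, with GDP measured in dollars and in thousands of dollars
fit_dollars   <- lm(LifeExpectancy ~ GDP, data = sim)
fit_thousands <- lm(LifeExpectancy ~ I(GDP / 1000), data = sim)

slope_dollars   <- unname(coef(fit_dollars)[2])
slope_thousands <- unname(coef(fit_thousands)[2])

# Same relationship, different units
c(per_dollar = slope_dollars, per_1000_dollars = slope_thousands)

# The same conversion applied to the slope reported in the output above
reported_slope_per_1000 <- 2.476e-04 * 1000
reported_slope_per_1000
```

Only the scale of the coefficient changes; R-squared, the t value, and the p-value are the same in both versions.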

The R-squared value is about 0.43. This means GDP explains about 43 percent of the variation in life expectancy across countries.
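To make "explains about 43 percent of the variation" concrete, the sketch below computes R-squared by hand as 1 minus the ratio of unexplained to total variation. It runs on simulated data (the `sim` data frame is hypothetical, not the AllCountries data), so the number it prints will differ from 0.43.

```r
# Simulated stand-in for the real data (hypothetical values)
set.seed(2)
sim <- data.frame(GDP = runif(100, 500, 60000))
sim$LifeExpectancy <- 68 + 0.00025 * sim$GDP + rnorm(100, sd = 5)
fit <- lm(LifeExpectancy ~ GDP, data = sim)

# Unexplained variation: squared residuals around the fitted line
rss <- sum(residuals(fit)^2)
# Total variation: squared deviations around the mean
tss <- sum((sim$LifeExpectancy - mean(sim$LifeExpectancy))^2)

r2_by_hand  <- 1 - rss / tss
r2_reported <- summary(fit)$r.squared
# With a single predictor, R-squared is also the squared correlation
r2_from_cor <- cor(sim$GDP, sim$LifeExpectancy)^2

c(by_hand = r2_by_hand, reported = r2_reported, from_cor = r2_from_cor)
```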

2. Multiple Linear Regression

Next, we fit a multiple linear regression model using GDP, Health, and Internet to predict LifeExpectancy.

multiple_data <- data %>%
  select(Country, LifeExpectancy, GDP, Health, Internet) %>%
  filter(!is.na(LifeExpectancy),
         !is.na(GDP),
         !is.na(Health),
         !is.na(Internet))

model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = multiple_data)

summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = multiple_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

The coefficient for Health is about 0.248. This means that, while holding GDP and Internet access constant, a 1 percentage point increase in government spending on healthcare is associated with about a 0.248 year increase in life expectancy.

The coefficient for Internet is about 0.190. This means that, while holding GDP and Health constant, a 1 percentage point increase in internet access is associated with about a 0.190 year increase in life expectancy.
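As a worked example of "holding the other variables constant", the sketch below plugs the coefficients from the output above into the regression equation for a made-up country. The GDP, Health, and Internet values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model2) output above
b <- c(intercept = 59.08, GDP = 2.367e-05, Health = 0.2479, Internet = 0.1903)

# Hypothetical country (made-up values, not a row of the dataset)
country <- c(GDP = 10000, Health = 10, Internet = 50)

# Predicted life expectancy: intercept plus each coefficient times its value
predicted <- b[["intercept"]] + sum(b[names(country)] * country)
predicted

# Raise Internet by 10 points while GDP and Health stay the same
country_more <- country
country_more[["Internet"]] <- 60
predicted_more <- b[["intercept"]] + sum(b[names(country_more)] * country_more)

# The difference is 10 times the Internet coefficient
predicted_more - predicted
```

The prediction for this hypothetical country is about 71.3 years, and the 10 point increase in internet access adds about 1.9 years.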

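The adjusted R-squared for the multiple regression model is about 0.716, which is higher than the adjusted R-squared of about 0.427 for the simple regression model. This suggests that adding Health and Internet improves the model and explains more variation in life expectancy than GDP alone. Two caveats apply. First, the models are fit on slightly different samples (179 countries for the simple model and 173 for the multiple model, based on the degrees of freedom), so the comparison is not exactly like-for-like. Second, GDP is no longer statistically significant once Health and Internet are included (p is about 0.30), which suggests that much of what GDP explained on its own overlaps with the other two predictors.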
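Because the two models drop different rows with missing values, one way to make the comparison stricter is to fit both on the same complete cases and then test the nested models against each other. The sketch below does this on simulated data; the `sim` data frame, its values, and the positions of the missing entries are all hypothetical.

```r
# Simulated stand-in for the real data (hypothetical values)
set.seed(3)
n <- 150
sim <- data.frame(
  GDP      = runif(n, 500, 60000),
  Health   = runif(n, 2, 25),
  Internet = runif(n, 1, 95)
)
sim$LifeExpectancy <- 59 + 0.00002 * sim$GDP + 0.25 * sim$Health +
  0.19 * sim$Internet + rnorm(n, sd = 4)
sim$Health[c(5, 40)] <- NA  # a few missing values, as in real data

# Keep only rows that are complete for every variable used by either model
vars   <- c("LifeExpectancy", "GDP", "Health", "Internet")
common <- sim[complete.cases(sim[, vars]), ]

fit_small <- lm(LifeExpectancy ~ GDP, data = common)
fit_large <- lm(LifeExpectancy ~ GDP + Health + Internet, data = common)

# Adjusted R-squared on identical rows
c(simple   = summary(fit_small)$adj.r.squared,
  multiple = summary(fit_large)$adj.r.squared)

# F-test: do Health and Internet jointly improve on the GDP-only model?
anova(fit_small, fit_large)
```

`anova()` only compares nested models meaningfully when both are fit to the same observations, which is why the filtering step comes first.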

3. Checking Assumptions

For the simple linear regression model, we check homoscedasticity and normality of residuals.

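For homoscedasticity, the ideal outcome is that the residuals are spread evenly across the fitted values with no clear pattern. A violation would show a funnel shape, where the spread of the residuals grows or shrinks as the fitted values increase, which would mean the model’s errors are not consistent across predictions. A curved pattern would instead point to a nonlinear relationship that a straight line does not capture.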

For normality, the ideal outcome is that the residuals follow the straight line on the Q-Q plot. A violation would show points strongly curving away from the line, especially at the ends, which would mean the residuals are not normally distributed.

par(mfrow = c(2, 2))
plot(model1)

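Based on the diagnostic plots, the assumptions are only partly met. The residual plots suggest that there may be some uneven spread and some countries that do not fit the model well. The Q-Q plot also shows some departure from normality, especially at the ends. The residual summary for the simple model points the same way: the median residual is 1.55, while the largest negative residual (-16.35) is much larger in magnitude than the largest positive one (9.33), which indicates a left-skewed distribution.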
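Reading plots is partly subjective, so a numerical check can complement them. The sketch below runs two such checks on simulated data (the `sim` data frame is hypothetical): a Shapiro-Wilk test for normality of the residuals, and a hand-computed version of the studentized Breusch-Pagan idea, which regresses the squared residuals on the predictor. This is a simplified illustration under those assumptions, not a replacement for the diagnostic plots.

```r
# Simulated stand-in for the real data (hypothetical values)
set.seed(4)
sim <- data.frame(GDP = runif(120, 500, 60000))
sim$LifeExpectancy <- 68 + 0.00025 * sim$GDP + rnorm(120, sd = 5)
fit <- lm(LifeExpectancy ~ GDP, data = sim)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normal)
normality <- shapiro.test(residuals(fit))

# Constant variance: regress the squared residuals on the predictor.
# n * R^2 from this auxiliary regression is compared with a chi-squared
# distribution with 1 degree of freedom (one predictor).
sim$sq_resid <- residuals(fit)^2
aux        <- lm(sq_resid ~ GDP, data = sim)
lm_stat    <- nobs(aux) * summary(aux)$r.squared
bp_p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = normality$p.value, bp_p = bp_p_value)
```

A small p-value in the second check would suggest that the residual variance changes with GDP, which is what a funnel shape in the residual plot indicates visually.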

4. RMSE and Residuals

rmse <- sqrt(mean(residuals(model2)^2))
rmse
## [1] 4.056417

The RMSE is about 4.06. This means that the multiple regression model is typically off by about 4.06 years when predicting life expectancy.
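The RMSE of 4.06 is slightly smaller than the residual standard error of 4.104 reported by summary(model2). The two use the same squared residuals but divide by different numbers: the RMSE computed above divides by the number of observations, while the residual standard error divides by the residual degrees of freedom. The sketch below recovers one from the other using only numbers printed in the output above.

```r
# Values copied from the summary(model2) output above
rse      <- 4.104          # residual standard error
df_resid <- 169            # residual degrees of freedom
n_obs    <- df_resid + 4   # observations = df + 4 estimated coefficients

# RSE = sqrt(RSS / df) and RMSE = sqrt(RSS / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_resid / n_obs)
rmse_from_rse
```

The result agrees with the 4.056 computed directly from the residuals, up to the rounding of 4.104.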

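Large residuals mean that some countries have life expectancy values that are much higher or lower than the model predicts. In the multiple regression model the largest residual is about -14.6 years, so at least one country has a life expectancy far below its predicted value. This suggests other important factors may not be included in the model.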
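A natural follow-up is to list which countries have the largest residuals. The sketch below shows one way to do that on simulated data; the `sim` data frame and the placeholder country names are hypothetical, and with the real data the same steps would be applied to model2 and multiple_data.

```r
# Simulated stand-in for the real data (hypothetical values and names)
set.seed(5)
sim <- data.frame(
  Country  = paste("Country", 1:60),
  GDP      = runif(60, 500, 60000),
  Health   = runif(60, 2, 25),
  Internet = runif(60, 1, 95)
)
sim$LifeExpectancy <- 59 + 0.00002 * sim$GDP + 0.25 * sim$Health +
  0.19 * sim$Internet + rnorm(60, sd = 4)

fit <- lm(LifeExpectancy ~ GDP + Health + Internet, data = sim)

# Attach predictions and residuals (observed minus predicted) to each row
sim$Predicted <- fitted(fit)
sim$Residual  <- residuals(fit)

# Five countries the model misses by the most, in either direction
cols  <- c("Country", "LifeExpectancy", "Predicted", "Residual")
worst <- head(sim[order(-abs(sim$Residual)), cols], 5)
worst
```

A negative residual means the country's life expectancy is lower than predicted, and a positive one means it is higher.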

5. Multicollinearity Example

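If Energy and Electricity are highly correlated, including both as predictors can cause multicollinearity. This makes it harder to interpret coefficients because the predictors overlap in what they explain, so the model cannot cleanly separate their individual effects. As a result, standard errors grow and coefficients can become unstable, changing size or even sign when a variable is added or removed.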
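Multicollinearity can be checked with the correlation between the predictors and with a variance inflation factor (VIF). The sketch below uses simulated variables that are built to be strongly correlated; the values are hypothetical and do not come from the AllCountries data. The VIF is computed by hand as 1 / (1 - R-squared), where R-squared comes from regressing one predictor on the other.

```r
# Simulated predictors, constructed so that Electricity tracks Energy closely
set.seed(6)
n <- 100
Energy      <- rnorm(n, mean = 2000, sd = 600)
Electricity <- 1.5 * Energy + rnorm(n, sd = 150)
sim <- data.frame(Energy, Electricity)

# Pairwise correlation between the two predictors
r <- cor(sim$Energy, sim$Electricity)

# VIF for Energy: 1 / (1 - R^2) from regressing it on the other predictor
r2_energy  <- summary(lm(Energy ~ Electricity, data = sim))$r.squared
vif_energy <- 1 / (1 - r2_energy)

c(correlation = r, vif = vif_energy)
```

A common rule of thumb treats a VIF above about 5 or 10 as a sign of problematic multicollinearity, although the exact cutoff is a convention rather than a strict rule.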

Conclusion

On its own, GDP helps predict life expectancy (R-squared of about 0.43), but it does not explain everything. The multiple regression model performs better (adjusted R-squared of about 0.72) because it includes additional important variables like healthcare spending and internet access.

Countries with higher healthcare spending and greater internet access tend to have higher life expectancy, even after accounting for GDP. However, the model still has limitations since many other factors influence life expectancy.

References

World Bank. AllCountries dataset. Data gathered from data.worldbank.org.