Cross-Sectional Regression Analysis (Task 7/8)

Import and Clean Data

I will analyze the relationship between GDP per Capita (Explanatory) and Life Expectancy (Outcome) for the year 2019.

Note: We use extra = TRUE in the WDI function to get region information, which allows us to filter out “Aggregates” (like “World” or “Latin America & Caribbean”) so we only analyze individual countries.

##          country iso2c iso3c year status lastupdated life_expectancy
## 1    Afghanistan    AF   AFG 2019         2025-10-07          62.941
## 2        Albania    AL   ALB 2019         2025-10-07          79.467
## 3        Algeria    DZ   DZA 2019         2025-10-07          75.682
## 4 American Samoa    AS   ASM 2019         2025-10-07          72.751
## 5        Andorra    AD   AND 2019         2025-10-07          84.098
## 6         Angola    AO   AGO 2019         2025-10-07          63.051
##   gdp_per_capita                     region          capital longitude latitude
## 1       557.8615                 South Asia            Kabul   69.1761  34.5228
## 2      4563.4674      Europe & Central Asia           Tirane   19.8172  41.3317
## 3      4672.6641 Middle East & North Africa          Algiers   3.05097  36.7397
## 4     12524.0160        East Asia & Pacific        Pago Pago  -170.691 -14.2846
## 5     39346.2750      Europe & Central Asia Andorra la Vella    1.5218  42.5075
## 6      2664.4385         Sub-Saharan Africa           Luanda    13.242 -8.81155
##                income        lending
## 1          Low income            IDA
## 2 Upper middle income           IBRD
## 3 Upper middle income           IBRD
## 4         High income Not classified
## 5         High income Not classified
## 6 Lower middle income           IBRD

Summarize the Association

Here we visualize the raw relationship between Wealth (GDP) and Health (Life Expectancy).

Comment on Linearity:

Visually, the relationship does not look perfectly linear. The data points follow a curved “logarithmic” path: life expectancy rises sharply as poor countries get slightly richer, but then plateaus for wealthy countries. A straight line (the red line) misses this nuance, underestimating values for middle-income countries and overestimating for very rich ones. This suggests a violation of the linearity assumption.

Run a Regression Model

We calculate the linear model: \(LifeExpectancy = \beta_0 + \beta_1(GDP)\).

# Fit the linear model
model_fit <- lm(life_expectancy ~ gdp_per_capita, data = clean_data)

# Show results in a clean table
kable(tidy(model_fit), caption = "Regression Results")

Regression Results
term	estimate	std.error	statistic	p.value
(Intercept)	69.5704919	0.5386633	129.15394	0
gdp_per_capita	0.0001967	0.0000178	11.03726	0

# Show model fit statistics (R-squared, etc.)
kable(glance(model_fit), caption = "Model Fit Statistics")

Model Fit Statistics
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.3750405	0.3719619	6.278971	121.8211	0	1	-666.5047	1339.009	1348.979	8003.372	203	205

Model Commentary:

Coefficients: The gdp_per_capita estimate is positive and significant (p-value < 0.05). This confirms that higher GDP is statistically associated with higher life expectancy.

Global Model Fit: The R-squared (seen in the second table) indicates how much of the variation in life expectancy is explained by GDP. If the value is around 0.30-0.60, it means GDP explains a moderate amount of the difference between countries, but not everything.

Residuals: We will analyze these in the next section to see where the model fails.

# We use 'tidy' from broom to get data for plotting
coef_data <- tidy(model_fit, conf.int = TRUE) %>%
  filter(term != "(Intercept)") # Remove intercept to focus on the variable

ggplot(coef_data, aes(x = term, y = estimate)) +
  geom_point(size = 3, color = "darkred") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  labs(title = "Coefficient Estimate for GDP per Capita",
       y = "Estimate (Impact on Life Expectancy)",
       x = "Variable") +
  theme_minimal()

Diagnose the Regression Model

We check the residuals (errors) to see if our model is trustworthy.

Distribution of Residuals

We want the residuals to be normally distributed (bell curve centered at zero).

# Add residuals to the original data
data_with_resid <- clean_data %>%
  mutate(residuals = residuals(model_fit))

ggplot(data_with_resid, aes(x = residuals)) +
  geom_histogram(binwidth = 2, fill = "orange", color = "white") +
  labs(title = "Histogram of Model Residuals",
       x = "Residual Value",
       y = "Count") +
  theme_minimal()

Diagnostic Plots

R provides standard plots to check assumptions:

Residuals vs Fitted: Checks for linearity (should be a flat cloud).

Q-Q Plot: Checks for normality (dots should follow the diagonal line).

par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid
plot(model_fit)

Diagnostic Commentary:

Trust in Results: Based on the Residuals vs Fitted plot (top left), we see a distinct “U-shape” or pattern rather than a random cloud. This confirms our earlier visual suspicion that the relationship is non-linear. The model systematically predicts poorly for specific income groups.

Normality: The Q-Q plot (top right) deviates from the dashed line at the tails, suggesting the errors are not perfectly normal.

Conclusion: While the general finding (wealth relates to health) is true, the linear model is technically biased. A logarithmic transformation of GDP would likely produce a valid model.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.