Multiple Regression Analysis (Task 9/10)

Import and Clean Data

We will import data for the year 2019 using three variables:

1. Outcome (Dependent): Life Expectancy

2. Explanatory 1: GDP per Capita (Economic Strength)

3. Explanatory 2: Urban Population (% of total) (Social structure)

## [1] "Number of countries for analysis: 215"

##          country iso2c iso3c year status lastupdated life_expectancy
## 1    Afghanistan    AF   AFG 2019         2025-12-04          62.941
## 2        Albania    AL   ALB 2019         2025-12-04          79.467
## 3        Algeria    DZ   DZA 2019         2025-12-04          75.682
## 4 American Samoa    AS   ASM 2019         2025-12-04          72.751
## 5        Andorra    AD   AND 2019         2025-12-04          84.098
## 6         Angola    AO   AGO 2019         2025-12-04          63.051
##   gdp_per_capita urban_pop                     region          capital
## 1       557.8615    25.754                 South Asia            Kabul
## 2      4563.4674    61.229      Europe & Central Asia           Tirane
## 3      4672.6641    73.189 Middle East & North Africa          Algiers
## 4     12524.0160    87.147        East Asia & Pacific        Pago Pago
## 5     39346.2750    87.984      Europe & Central Asia Andorra la Vella
## 6      2664.4385    66.177         Sub-Saharan Africa           Luanda
##   longitude latitude              income        lending
## 1   69.1761  34.5228          Low income            IDA
## 2   19.8172  41.3317 Upper middle income           IBRD
## 3   3.05097  36.7397 Upper middle income           IBRD
## 4  -170.691 -14.2846         High income Not classified
## 5    1.5218  42.5075         High income Not classified
## 6    13.242 -8.81155 Lower middle income           IBRD
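The import code is not shown in this report. As a rough sketch only, the cleaning step might look like the following. The column names match the output above, but the helper `clean_wdi`, the use of the `WDI` package, and the indicator codes in the commented download call are all assumptions. The toy data frame is made up purely to show what the helper does.

```r
# Hypothetical helper: keep real countries (not regional aggregates)
# that have complete data for all three analysis variables.
clean_wdi <- function(raw) {
  vars <- c("life_expectancy", "gdp_per_capita", "urban_pop")
  keep <- !is.na(raw$region) &
    raw$region != "Aggregates" &
    complete.cases(raw[, vars])
  raw[keep, , drop = FALSE]
}

# Made-up rows, only to demonstrate the filter (not real data).
toy <- data.frame(
  country = c("A", "B", "World", "C"),
  region = c("South Asia", "Sub-Saharan Africa", "Aggregates", "South Asia"),
  life_expectancy = c(70, 65, 72, NA),
  gdp_per_capita = c(2000, 1500, 11000, 3000),
  urban_pop = c(35, 40, 56, 50)
)
cleaned <- clean_wdi(toy)
print(paste("Number of countries for analysis:", nrow(cleaned)))

# Assumed download step (needs the WDI package and internet access;
# the indicator codes are a guess at what the report used):
# raw <- WDI::WDI(country = "all", start = 2019, end = 2019, extra = TRUE,
#   indicator = c(life_expectancy = "SP.DYN.LE00.IN",
#                 gdp_per_capita  = "NY.GDP.PCAP.CD",
#                 urban_pop       = "SP.URB.TOTL.IN.ZS"))
# df <- clean_wdi(raw)
```

The aggregate row ("World") and the row with a missing outcome are both dropped, which is why the country count can be smaller than the number of rows downloaded.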

Summarize the Association

Here we plot the outcome against both explanatory variables to visually inspect the relationships.

Comment on Linearity:

GDP per Capita (Left): As seen in previous tasks, this relationship is curved (logarithmic), not linear. The straight red line fits poorly at the extremes.

Urban Population (Right): This relationship looks more linear than GDP, though there is still a wide spread. As countries become more urbanized, life expectancy generally rises.
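The two panels described above could be drawn along these lines. This is a minimal base R sketch on simulated data, not the report's own plotting code. The data frame `sim` is synthetic and deliberately built with a logarithmic GDP effect so that the curvature is visible.

```r
# Synthetic data for illustration only (not the World Bank data).
set.seed(1)
n <- 200
sim <- data.frame(
  gdp_per_capita = exp(runif(n, log(500), log(80000))),
  urban_pop = runif(n, 15, 100)
)
sim$life_expectancy <- 35 + 4 * log(sim$gdp_per_capita) +
  0.05 * sim$urban_pop + rnorm(n, sd = 3)

# One simple regression per predictor, used only to draw the straight lines.
fit_gdp <- lm(life_expectancy ~ gdp_per_capita, data = sim)
fit_urban <- lm(life_expectancy ~ urban_pop, data = sim)

# Side-by-side scatterplots with the fitted straight line in red.
op <- par(mfrow = c(1, 2))
plot(sim$gdp_per_capita, sim$life_expectancy,
     xlab = "GDP per capita", ylab = "Life expectancy",
     main = "GDP per Capita")
abline(fit_gdp, col = "red")
plot(sim$urban_pop, sim$life_expectancy,
     xlab = "Urban population (% of total)", ylab = "Life expectancy",
     main = "Urban Population")
abline(fit_urban, col = "red")
par(op)
```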

Run a Multiple Regression Model

We now fit a model that includes both predictors:

\[ \text{LifeExpectancy} = \beta_0 + \beta_1 \text{GDP} + \beta_2 \text{UrbanPop} + \epsilon \]
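In R, a model of this form is typically fitted with `lm()`. The sketch below uses simulated data with the same column names as the report, so its estimates will not match the tables that follow. The column names in those tables look like output from `broom::tidy()` and `broom::glance()`, but that is an inference, not something the report states.

```r
# Synthetic data for illustration only (not the World Bank data).
set.seed(42)
n <- 215
sim <- data.frame(
  gdp_per_capita = exp(runif(n, log(500), log(80000))),
  urban_pop = runif(n, 15, 100)
)
sim$life_expectancy <- 62 + 0.00014 * sim$gdp_per_capita +
  0.12 * sim$urban_pop + rnorm(n, sd = 5)

# Multiple regression: outcome on both predictors at once.
model <- lm(life_expectancy ~ gdp_per_capita + urban_pop, data = sim)

# Coefficient table: estimate, standard error, t value, p-value.
coef_table <- summary(model)$coefficients
print(coef_table)

# If the broom package is installed, these give tidy data frames instead:
# broom::tidy(model)    # one row per term
# broom::glance(model)  # one row of model-level fit statistics
```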

Multiple Regression Results
term	estimate	std.error	statistic
(Intercept)	62.7843901	1.1356558	55.284699
gdp_per_capita	0.0001379	0.0000189	7.291982
urban_pop	0.1260808	0.0196092	6.429659

Global Model Fit
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.4825681	0.4776867	5.729791	98.85787	0	2	-678.8822	1365.764	1379.247	6960.067	212	215

Interpretation of Results:

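The Coefficients:

GDP: The estimate is positive, and its t-statistic (7.29) is far above the roughly 1.97 needed for significance at the 5% level with 212 residual degrees of freedom. Crucially, this is the association with GDP holding urbanization constant. Among countries with the same urban population share, each additional 10,000 in GDP per capita is associated with about 1.4 more years of life expectancy (0.0001379 × 10,000).

Urban Pop: The estimate is also positive and significant (t = 6.43). Among countries with the same GDP per capita, each additional percentage point of urban population is associated with about 0.13 more years of life expectancy, or roughly 1.3 years per 10 points. Note that this variable measures the share of people living in urban areas, not the number of cities.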
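To make the coefficients concrete, the fitted equation can be applied to one row of the data preview. The sketch below uses the rounded estimates from the results table and the Albania row shown earlier. It assumes Albania was part of the analysis sample, and the result is approximate because the printed coefficients are rounded.

```r
# Rounded estimates copied from the Multiple Regression Results table.
b0 <- 62.7843901
b_gdp <- 0.0001379
b_urban <- 0.1260808

# Effect sizes on a more readable scale.
effect_gdp_10k <- b_gdp * 10000    # years per 10,000 of GDP per capita
effect_urban_10pt <- b_urban * 10  # years per 10 percentage points urban

# Albania, 2019, values from the data preview.
albania_gdp <- 4563.4674
albania_urban <- 61.229
albania_actual <- 79.467

# Model prediction and the residual (actual minus predicted).
predicted <- b0 + b_gdp * albania_gdp + b_urban * albania_urban
residual <- albania_actual - predicted
print(round(c(predicted = predicted, residual = residual), 2))
```

Under these assumptions the model predicts a life expectancy of about 71.1 years for Albania, roughly 8.3 years below the observed value, which shows how large individual misses can be even when both coefficients are clearly significant.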

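Global Fit: The Adjusted R-squared (in the second table) is 0.478, so the model explains about 48% of the variance in life expectancy. Plain R-squared can never fall when a predictor is added, but Adjusted R-squared rises only if the new predictor improves the fit by more than the penalty for the extra parameter. Whether it is higher than in the GDP-only model should therefore be checked against the value from the previous task rather than assumed.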
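As a check on the second table, the Adjusted R-squared can be reproduced from the reported R-squared using the standard formula, with n = 215 observations and p = 2 predictors (both taken from the table):

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.4826)\,\frac{214}{212} \approx 0.4777 \]

This agrees with the reported adj.r.squared of 0.4776867.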

Residuals: The residual standard error (sigma in the second table) is 5.73, which means a typical prediction from this model misses the observed life expectancy by roughly 5.7 years.
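The fit statistics in the Global Model Fit table are tied together by standard formulas, which gives a quick consistency check. The sketch below recomputes sigma and the F statistic from other numbers in the same table. The variable names are illustrative only.

```r
# Values copied from the Global Model Fit table.
deviance_rss <- 6960.067  # residual sum of squares
df_residual <- 212        # n - p - 1 = 215 - 2 - 1
r_squared <- 0.4825681
p <- 2                    # number of predictors

# Residual standard error: square root of RSS per residual degree of freedom.
sigma_check <- sqrt(deviance_rss / df_residual)

# Overall F statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_check <- (r_squared / p) / ((1 - r_squared) / df_residual)

print(round(c(sigma = sigma_check, F = f_check), 3))
```

Both values agree with the table (sigma 5.729791, statistic 98.85787) up to rounding. The p.value shown as 0 is a display artifact: with an F statistic this large it is simply too small to print, not exactly zero.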

Plotting Model Coefficients

Diagnose the Regression Model

We inspect the residuals to see whether we can trust this multiple regression model.

Distribution of Residuals

Diagnostic Plots

Diagnostic Commentary:

Trust in Results: Looking at the Residuals vs Fitted plot (top left), we still see a pattern (heteroscedasticity or non-linearity) rather than a random cloud. This persists because we haven’t fixed the non-linear nature of the GDP variable.

Outliers: The Residuals vs Leverage plot (bottom right) helps identify countries that have an unusual combination of GDP and Urbanization that might be pulling the regression line excessively.
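The four-panel diagnostic display and a numeric follow-up on leverage could be produced roughly as follows. This is a sketch on simulated data, not the report's own code. The cutoffs used (leverage above 2p/n, Cook's distance above 4/n) are common rules of thumb and are an assumption here, not thresholds taken from the report.

```r
# Synthetic data for illustration only (not the World Bank data).
set.seed(7)
n <- 215
sim <- data.frame(
  gdp_per_capita = exp(runif(n, log(500), log(80000))),
  urban_pop = runif(n, 15, 100)
)
sim$life_expectancy <- 35 + 4 * log(sim$gdp_per_capita) +
  0.05 * sim$urban_pop + rnorm(n, sd = 3)

model <- lm(life_expectancy ~ gdp_per_capita + urban_pop, data = sim)

# Standard 2x2 diagnostic plots: Residuals vs Fitted (top left),
# Normal Q-Q (top right), Scale-Location (bottom left),
# Residuals vs Leverage (bottom right).
op <- par(mfrow = c(2, 2))
plot(model)
par(op)

# Numeric view of what the leverage plot shows.
lev <- hatvalues(model)         # leverage of each observation
cooks <- cooks.distance(model)  # influence of each observation
p <- length(coef(model))        # parameters, including the intercept

high_leverage <- which(lev > 2 * p / n)  # rule-of-thumb cutoff
influential <- which(cooks > 4 / n)      # rule-of-thumb cutoff
print(length(high_leverage))
print(length(influential))
```

Rows flagged by either rule are candidates for a closer look, not automatic deletions.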

Conclusion: While we have established that both GDP per capita and urbanization are positively associated with life expectancy, the diagnostic plots suggest the linear model is still misspecified. We should likely log-transform GDP in a future iteration to address the curved pattern.
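The suggested log transformation only requires a change to the model formula. The sketch below compares the two specifications on simulated data that was deliberately generated with a logarithmic GDP effect, so the log model wins by construction. It shows the mechanics of the comparison, not what the real data will show.

```r
# Synthetic data for illustration only, built with a log GDP effect.
set.seed(3)
n <- 215
sim <- data.frame(
  gdp_per_capita = exp(runif(n, log(500), log(80000))),
  urban_pop = runif(n, 15, 100)
)
sim$life_expectancy <- 35 + 4 * log(sim$gdp_per_capita) +
  0.05 * sim$urban_pop + rnorm(n, sd = 3)

# Original specification: GDP enters linearly.
linear_fit <- lm(life_expectancy ~ gdp_per_capita + urban_pop, data = sim)

# Proposed specification: GDP enters on the natural log scale.
log_fit <- lm(life_expectancy ~ log(gdp_per_capita) + urban_pop, data = sim)

# Compare fit; both models have the same number of parameters.
adj_linear <- summary(linear_fit)$adj.r.squared
adj_log <- summary(log_fit)$adj.r.squared
print(round(c(linear = adj_linear, log = adj_log), 3))
```

With a log-transformed predictor the coefficient is read differently: a 1% increase in GDP per capita corresponds to a change of roughly the coefficient divided by 100 in life expectancy.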
