Import and Clean Data
We will import data for the year 2019 using three
variables:
1. Outcome (Dependent): Life Expectancy
2. Explanatory 1: GDP per Capita (Economic Strength)
3. Explanatory 2: Urban Population (% of total) (Social
structure)
## [1] "Number of countries for analysis: 215"
## country iso2c iso3c year status lastupdated life_expectancy
## 1 Afghanistan AF AFG 2019 2025-12-04 62.941
## 2 Albania AL ALB 2019 2025-12-04 79.467
## 3 Algeria DZ DZA 2019 2025-12-04 75.682
## 4 American Samoa AS ASM 2019 2025-12-04 72.751
## 5 Andorra AD AND 2019 2025-12-04 84.098
## 6 Angola AO AGO 2019 2025-12-04 63.051
## gdp_per_capita urban_pop region capital
## 1 557.8615 25.754 South Asia Kabul
## 2 4563.4674 61.229 Europe & Central Asia Tirane
## 3 4672.6641 73.189 Middle East & North Africa Algiers
## 4 12524.0160 87.147 East Asia & Pacific Pago Pago
## 5 39346.2750 87.984 Europe & Central Asia Andorra la Vella
## 6 2664.4385 66.177 Sub-Saharan Africa Luanda
## longitude latitude income lending
## 1 69.1761 34.5228 Low income IDA
## 2 19.8172 41.3317 Upper middle income IBRD
## 3 3.05097 36.7397 Upper middle income IBRD
## 4 -170.691 -14.2846 High income Not classified
## 5 1.5218 42.5075 High income Not classified
## 6 13.242 -8.81155 Lower middle income IBRD
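The printed columns (`iso2c`, `lastupdated`, `region`, `income`, `lending`, and so on) match what the `WDI` package returns with `extra = TRUE`, so the import may have looked roughly like the sketch below. The helper names `fetch_wdi_2019` and `clean_wdi`, the choice of indicator codes (in particular which GDP per capita series was used), and the `"Aggregates"` filter are assumptions, not the original code. The cleaning step is demonstrated on the six rows printed above so that it runs without network access.

```r
# Hypothetical import helper. It needs the WDI package and network access,
# so it is only defined here, not called.
fetch_wdi_2019 <- function() {
  WDI::WDI(
    country   = "all",
    indicator = c(
      life_expectancy = "SP.DYN.LE00.IN",     # life expectancy at birth (years)
      gdp_per_capita  = "NY.GDP.PCAP.KD",     # assumed series; the source does not say which
      urban_pop       = "SP.URB.TOTL.IN.ZS"   # urban population (% of total)
    ),
    start = 2019, end = 2019,
    extra = TRUE                              # adds region, income, lending, ...
  )
}

# Hypothetical cleaning step: drop regional aggregates (assumed to be
# labelled "Aggregates") and rows missing any model variable.
clean_wdi <- function(df) {
  vars <- c("life_expectancy", "gdp_per_capita", "urban_pop")
  is_aggregate <- !is.na(df$region) & df$region == "Aggregates"
  df <- df[!is_aggregate, , drop = FALSE]
  df[complete.cases(df[, vars]), , drop = FALSE]
}

# The six rows shown in the output above (model variables and region only).
head_rows <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria",
              "American Samoa", "Andorra", "Angola"),
  life_expectancy = c(62.941, 79.467, 75.682, 72.751, 84.098, 63.051),
  gdp_per_capita  = c(557.8615, 4563.4674, 4672.6641,
                      12524.0160, 39346.2750, 2664.4385),
  urban_pop       = c(25.754, 61.229, 73.189, 87.147, 87.984, 66.177),
  region = c("South Asia", "Europe & Central Asia",
             "Middle East & North Africa", "East Asia & Pacific",
             "Europe & Central Asia", "Sub-Saharan Africa"),
  stringsAsFactors = FALSE
)

cleaned <- clean_wdi(head_rows)
print(paste("Rows kept after cleaning:", nrow(cleaned)))
```

All six sample rows are complete, so none are dropped here; on the full download the same filter would remove aggregates and countries with missing values.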
Summarize the Association
Here we plot the outcome against both explanatory
variables to visually inspect the relationships.

GDP per Capita (Left): As seen in previous tasks,
this relationship is curved (logarithmic), not linear. The straight red
line fits poorly at the extremes.
Urban Population (Right): This relationship looks
more linear than GDP, though there is still a wide spread. As countries
become more urbanized, life expectancy generally rises.
Run a Multiple Regression Model
We now fit a model that includes both predictors:
\[ LifeExpectancy = \beta_0 + \beta_1(GDP) +
\beta_2(UrbanPop) + \epsilon \]
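In R this specification is a single `lm()` call. The sketch below uses synthetic stand-in data (not the World Bank values) purely so that it runs on its own; the data frame name `toy` and the simulated coefficients are made up for illustration.

```r
# Synthetic stand-in data, NOT World Bank values.
set.seed(42)
n <- 200
toy <- data.frame(
  gdp_per_capita = exp(runif(n, log(500), log(80000))),
  urban_pop      = runif(n, 15, 100)
)
toy$life_expectancy <- 60 + 0.0001 * toy$gdp_per_capita +
  0.1 * toy$urban_pop + rnorm(n, sd = 4)

# Same form as the equation above: two additive linear predictors.
model <- lm(life_expectancy ~ gdp_per_capita + urban_pop, data = toy)

# Coefficient matrix: estimate, standard error, t statistic, p-value.
coef_table <- summary(model)$coefficients
print(round(coef_table, 4))
```

With the `broom` package installed, `broom::tidy(model)` returns the same information as a data frame, which is convenient for building report tables.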
Multiple Regression Results
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 62.7843901 | 1.1356558 | 55.284699 | 0 |
| gdp_per_capita | 0.0001379 | 0.0000189 | 7.291982 | 0 |
| urban_pop | 0.1260808 | 0.0196092 | 6.429659 | 0 |
Global Model Fit
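| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.4825681 | 0.4776867 | 5.729791 | 98.85787 | 0 | 2 | -678.8822 | 1365.764 | 1379.247 | 6960.067 | 212 | 215 |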
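The fit statistics are linked by standard identities, so they can be sanity-checked with a few lines of arithmetic. The sketch assumes the twelve values are, in order, R-squared, adjusted R-squared, residual standard error, F statistic, p-value, number of predictors, log-likelihood, AIC, BIC, residual sum of squares, residual degrees of freedom, and number of observations (the layout `broom::glance()` uses for an `lm` fit). The variable names are illustrative.

```r
# Values copied from the Global Model Fit table.
r2       <- 0.4825681
n_obs    <- 215          # observations
k        <- 2            # predictors
log_lik  <- -678.8822
rss      <- 6960.067     # residual sum of squares (deviance)
df_resid <- n_obs - k - 1

# Standard identities for ordinary least squares.
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / df_resid
sigma  <- sqrt(rss / df_resid)
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Information criteria count the intercept, both slopes, and the error variance.
n_par <- k + 2
aic   <- -2 * log_lik + 2 * n_par
bic   <- -2 * log_lik + log(n_obs) * n_par

print(round(c(adj_r2 = adj_r2, sigma = sigma, F = f_stat, AIC = aic, BIC = bic), 4))
```

Each recomputed value agrees with the reported one up to the rounding of the inputs, which supports that reading of the columns.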
Interpretation of Results:
The Coefficients:
GDP: The estimate is positive (0.0001379) and statistically significant (p < 0.05). Crucially, this is the association with GDP holding urbanization constant. Between two countries with the same urban population share, the richer one is predicted to have higher life expectancy, by about 1.4 years per additional US$10,000 of GDP per capita.
Urban Pop: The estimate is also positive (0.126) and significant. Between two countries with the same GDP per capita, the one with the larger share of its population living in urban areas tends to have higher life expectancy, by about 0.13 years per percentage point.
Global Fit: The Adjusted R-squared (in the second table) is 0.478, so the model explains roughly 48% of the variance in life expectancy. This is likely higher than for the model with GDP alone, since urbanization adds information.
Residuals: The residual standard error (sigma, 5.73) gives the typical size of a prediction "miss", about 5.7 years.
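To make the "holding constant" reading concrete, the reported estimates can be plugged into the model equation by hand. The function name `predict_life` and the example country (GDP per capita of 10,000 and 60% urban) are hypothetical; only the three coefficients come from the results table.

```r
# Estimates copied from the Multiple Regression Results table.
b0    <- 62.7843901
b_gdp <- 0.0001379
b_urb <- 0.1260808

# Fitted value for a given GDP per capita and urban population share (%).
predict_life <- function(gdp_per_capita, urban_pop) {
  b0 + b_gdp * gdp_per_capita + b_urb * urban_pop
}

# Hypothetical country: GDP per capita 10,000 and 60% urban.
baseline <- predict_life(10000, 60)

# Change one predictor at a time while the other is held fixed.
richer      <- predict_life(20000, 60) - baseline  # +10,000 GDP per capita
more_urban  <- predict_life(10000, 70) - baseline  # +10 percentage points urban

print(round(c(baseline = baseline, richer = richer, more_urban = more_urban), 2))
```

The baseline prediction is about 71.7 years; an extra 10,000 in GDP per capita adds about 1.4 years, and an extra 10 percentage points of urbanization adds about 1.3 years.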
Plotting Model Coefficients

Diagnose the Regression Model
We inspect the residuals to see whether we can trust this multiple regression.
Distribution of Residuals

Diagnostic Plots

Trust in Results: Looking at the Residuals vs
Fitted plot (top left), we still see a pattern (heteroscedasticity or
non-linearity) rather than a random cloud. This persists because we
haven’t fixed the non-linear nature of the GDP variable.
Outliers: The Residuals vs Leverage plot (bottom
right) helps identify countries that have an unusual combination of GDP
and Urbanization that might be pulling the regression line
excessively.
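The quantities behind the Residuals vs Leverage plot can also be inspected numerically. The sketch below runs on synthetic stand-in data (not the World Bank values), and the cut-offs 2p/n for leverage and 4/n for Cook's distance are common rules of thumb rather than thresholds taken from the original analysis.

```r
# Synthetic stand-in data, NOT World Bank values.
set.seed(1)
n <- 100
toy <- data.frame(
  gdp_per_capita = exp(runif(n, log(500), log(60000))),
  urban_pop      = runif(n, 15, 100)
)
toy$life_expectancy <- 50 + 2.5 * log(toy$gdp_per_capita) +
  0.05 * toy$urban_pop + rnorm(n, sd = 3)

model <- lm(life_expectancy ~ gdp_per_capita + urban_pop, data = toy)

lev  <- hatvalues(model)        # leverage: how unusual the predictor combination is
cook <- cooks.distance(model)   # influence: how much the fit moves if the row is dropped
p    <- length(coef(model))     # number of estimated coefficients

# Rule-of-thumb flags (assumed thresholds).
flagged <- which(lev > 2 * p / n | cook > 4 / n)
print(toy[flagged, ])

# The standard four-panel diagnostic display is produced by:
# par(mfrow = c(2, 2)); plot(model)
```

On the real data, the flagged rows would be the countries worth examining by name before deciding whether they distort the fit.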
Conclusion: While we have established that both GDP per capita and urbanization are positively associated with life expectancy, the diagnostic plots suggest the linear model is still misspecified. We should likely log-transform GDP in a future iteration to address the curved pattern.
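A minimal sketch of that next step is shown here on synthetic data that was deliberately generated with a logarithmic GDP effect, so it illustrates the mechanics rather than the result on the World Bank data. The object names `linear_fit` and `log_fit` are illustrative.

```r
# Synthetic stand-in data with a built-in logarithmic GDP effect,
# NOT World Bank values.
set.seed(2019)
n <- 200
toy <- data.frame(
  gdp_per_capita = exp(runif(n, log(500), log(80000))),
  urban_pop      = runif(n, 15, 100)
)
toy$life_expectancy <- 45 + 3 * log(toy$gdp_per_capita) +
  0.05 * toy$urban_pop + rnorm(n, sd = 2)

# Current specification versus the proposed log-transformed one.
linear_fit <- lm(life_expectancy ~ gdp_per_capita + urban_pop, data = toy)
log_fit    <- lm(life_expectancy ~ log(gdp_per_capita) + urban_pop, data = toy)

# Both models share the same response, so adjusted R-squared and AIC are comparable.
comparison <- data.frame(
  model  = c("linear GDP", "log GDP"),
  adj_r2 = c(summary(linear_fit)$adj.r.squared, summary(log_fit)$adj.r.squared),
  aic    = c(AIC(linear_fit), AIC(log_fit))
)
print(comparison)
```

In the log model the GDP coefficient is read differently: it is the change in life expectancy for a one-unit change in log GDP, so a 1% increase in GDP per capita corresponds to roughly the coefficient divided by 100 years.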
Comment on Linearity: