PART 2: Is the world getting better ?

This is the continuation and the final part of “Is the world getting better” Report. For this report, we’d be focusing on modelling and prediction for our data and understanding the causation behind some pattern.

Data Model

From our data exploration we can begin to ask questions like:

is there a relationship between lifeExpectancy, year, population, gdpPercapital ?
- We already have a subtle answer to this question from the data exploration section. We know there’s a relationship between life expectancy, year and gdpPercapital;
How strong is the relationship between the response and each predictors ?
- how strong is the relationship between these metrics - we don’t expect each of these metrics to have the same kind of robust relationship. We will figure this out with the p-value when we fit our model.
is the relationship linear ?
- If the relationship is linear then the linear model is the go-to model (you should know the answer to this by now).
How well does our model fit ?
- how accurate is our model - can we predict future metrics with our model ?

To answer these questions, we need to fit a statistical model to our data. Since we are dealing with a quantitative date with a linear relationship - i’ll be using the multiple linear regression model. One way to go about this is to fit our linear model to a single country and then replicate the same process for all other countries.

#fitting model with Nigeria (you already know why, Lol.)

ng <- gapminder %>% filter(country == "Nigeria")


#model 1: LifeExp as response - year, pop and gdpPercap as predictors.
ng.model <- lm(lifeExp ~ year + pop + gdpPercap, data = ng)

summary(ng.model)

## 
## Call:
## lm(formula = lifeExp ~ year + pop + gdpPercap, data = ng)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64988 -0.48872 -0.03423  0.47820  0.66983 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.142e+03  1.114e+02 -10.249 7.06e-06 ***
## year         6.068e-01  5.755e-02  10.545 5.71e-06 ***
## pop         -2.163e-07  2.888e-08  -7.489 7.00e-05 ***
## gdpPercap    4.208e-05  7.780e-04   0.054    0.958    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5867 on 8 degrees of freedom
## Multiple R-squared:  0.9845, Adjusted R-squared:  0.9787 
## F-statistic: 169.6 on 3 and 8 DF,  p-value: 1.405e-07

Now, let’s address each question above:

One: is there a relationship between life expectancy, year, population and gdpPercap ?

The F-statistic tells us if there’s a relationship between our response and the predictors. When there’s no relationship between the response and the predictors, the F-statistic will be very close to 1. On the other-hand, an F-statistic with a value above 1 tells us there’s is a relationship between the response and the predictors. The F-statistic of our model is 169.6 - which is far more than 1 (the evidence is substantial);

Two: How strong is the relationship between the response and each predictor ?

To determine the the strength of the relationship between our response and our predictors, we look at the individual t-value and p-values [Pr(>|t|)]. I will focus on the p-values as it is easier to understand and explain. A very small p-value infers that there’s a relationship between that predictor and the response. Also the asterick beside the p-values tells us how significant the relationship is between that predictor and the response.
- single asterick ’*’ = good relationship (p-value is around 5%);
- double asterick ’**’ = relationship is strong - (p-value is around 1%);
- triple asterick ’***’ = relationship is very strong - significant relationship - (p-value is less than 1%).
From our model, we can see that there’s a significant relationship between our predictors (year, pop) and our response (lifeExp). On the other hand, the p-value of gdpPercap is 0.958 (which means there’s no relationship between lifeExp and gdpPercap).Now don’t forget, whatever we discuss here is just for one country(Nigeria) and is not true for all other countries (Will come back to this later). A good practice is to always remove predictors that have no relationship with the response inother to make your/our model better.
Another question we might find ourself asking is: if two predictors have significant relationship with the response, how do we tell which is more significant ? In the case of our model, year and pop both have significant relationship with our response but the relationship between lifeExp and year (p-value = 0.00000571) is stronger than the relationship between lifeExp and pop (p-value = 0.00007). Inshort the the predictor with a p-value closest to 0 has the most significant relationship.
One final word on the intercept. The intercept is the value of our response when our predictor(s) is equal to zero. An intercept with a small p-value like our model simply means we can reject the null hypothesis. Rejecting the null hypothesis is what tells us if there’s a relationship between our predictor and response (you can read more about null hypothesis here: https://stats.stackexchange.com/questions/135564/null-hypothesis-for-linear-regression).

Three: Is the relationship linear ?

To confirm this, i will be fitting a single linear model(model with one predictor). And since the most significant predictor from our previous model is year, i will be fitting a model with lifeExp as the response and year as the predictor.

#model 2: LifeExp as response - year as predictor.

lm.fit <- lm(lifeExp ~ year, data = ng)

summary(lm.fit)

## 
## Call:
## lm(formula = lifeExp ~ year, data = ng)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4441 -1.2073  0.1505  1.3306  1.7442 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -368.28479   50.32489  -7.318 2.55e-05 ***
## year           0.20807    0.02542   8.184 9.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.52 on 10 degrees of freedom
## Multiple R-squared:  0.8701, Adjusted R-squared:  0.8571 
## F-statistic: 66.99 on 1 and 10 DF,  p-value: 9.638e-06

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This IS WHERE IT GETS INTERESTING - THIS PART IS VERY IMPORTANT SO PLEASE PAY ATTENTION.

Our linear model is a good fit for our data (this is confirmed via F-statistic, p-value and multiple r-squared value) but is it the best model for our data ? Unfortunately the answer is too complicated for me to answer (my intuition says no because of the residual plot), but let’s go over it. This is where the Residual plot comes in.

A basic definition of Residual is simply - difference between our actual and predicted values. We can tell from our residual plot that there’s a pattern (an arch shape) which is another way of saying the relationship between our response and predictor may be non-linear and the linear model may not be the best model (the technical word for this is Heteroscedasticity - non constant variances in the error).

I can understand why our residual plot thinks the relationship is non-linear (even though it’s obvious that the relationship is linear) - and the reason is the drop-off in the last 4 points of our data. If you look at the table below, notice how lifeExp increases year on year until 1997 when it starts to drop-off, drops further in 2002 and then increased in 2007 (from 1997 to 2007, the relationship is not linear but we know that’s a one-off period and it’s not sustainable - unfortunately our residual doesn’t know this - having more linear data points after 2007 could have given our residual a tip that 1997-2007 was a glitch).

This is one area where our intuition will override our residual plot but let’s still bring in a non-linear model to test our residual’s insight that a non-linear relationship exists between our predictor and response.

Tip: Your next question might be, how can you tell from the residual plot if a relationship is linear or non-linear? My answer to this is - you want a residual plot without a pattern, where the points are evenly distributed. This was not easy for me to grasp at first but the best way to go around this is to watch one or two tutorial vides on linear regression and residual plot on youtube like this one https://www.youtube.com/watch?v=gLENW2AdJWg&t=918s.

We can accomodate non-linear relationship within our linear model with polynomial regression. In short, we are still using a linear model but this time we are testing for a non-linear relationship. Polynomial regression adds extra predictors that are powers of the original variable (in our case, year is the predictor - what a polynomail regression does is add year, year², year³,…..yearⁿ to our model).

Let’s fit two polynomial models to our data (model_2 and model_3) and see how they compare to our linear model:

model_2: for this model, we are using 2 predictors - also called a quadratic polynomial fit. To call,we pass -> poly(year, 2);
- this is the same as: lm(lifeExp ~ year + year²)
model_3: for this model, we are using 3 predictors - also called a cubic polynomial fit. To call, we pass in -> poly(year, 3);
- this is the same as: lm(lifeExp ~ year + year² + year³)

#model 2: LifeExp as response - year as predictor.

model_2 <- lm(lifeExp ~ poly(year,2), data = ng)
summary(model_2)

## 
## Call:
## lm(formula = lifeExp ~ poly(year, 2), data = ng)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64243 -0.43265 -0.02896  0.34880  0.73539 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     43.5813     0.1559 279.471  < 2e-16 ***
## poly(year, 2)1  12.4405     0.5402  23.029 2.61e-09 ***
## poly(year, 2)2  -4.5253     0.5402  -8.377 1.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5402 on 9 degrees of freedom
## Multiple R-squared:  0.9852, Adjusted R-squared:  0.982 
## F-statistic: 300.3 on 2 and 9 DF,  p-value: 5.776e-09

#model 3 with
model_3 <- lm(lifeExp ~ poly(year,3), data = ng)
summary(model_3)

## 
## Call:
## lm(formula = lifeExp ~ poly(year, 3), data = ng)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67862 -0.11876  0.05468  0.14100  0.44322 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    43.58133    0.09298 468.719  < 2e-16 ***
## poly(year, 3)1 12.44052    0.32209  38.624 2.22e-10 ***
## poly(year, 3)2 -4.52529    0.32209 -14.050 6.39e-07 ***
## poly(year, 3)3 -1.34030    0.32209  -4.161  0.00316 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3221 on 8 degrees of freedom
## Multiple R-squared:  0.9953, Adjusted R-squared:  0.9936 
## F-statistic: 568.8 on 3 and 8 DF,  p-value: 1.164e-09

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

By plotting each model side by side we can compare our results and also the residuals. It’s not hard to figure out that model_3 is too perfect as a model (we can also see the improvement from our linear model residual plot to the cubic residual plot). The fact that the cubic regression curve passes through almost all our data points and has a residual standard error of 0.3221 tells us that our model_3 is almost a perfect fit (which is too good to be true and this is a red flag - i don’t usually trust any model with an R² of more than 0.9 and model_2 and model_3 are not any different). I will come back to R² later in this section but basically R² tells us how fit our model is and it takes on a value between 0 and 1.

You might be wondering: why did we decide to use a cubic fit poly(year, 3) and if a cubic fit is better than a quadratic fit poly(year, 2) - why not try poly(year, 4) or poly(year, 5).

If you look at our model_3 summary table, you should notice that our p-values starts to drop off at the 3rd order of our cubic fit (where our p-value is equal to 0.00316 with two asterick), now that is a warning for us to not go beyond that order. If we go beyond the 3rd order we will begin to see a situation where our model fitness starts to drop off and get less fitter. Another way to confirm this is by using the Anova function. We use this function to compare models. In the summary table below, we are comparing our polynomial models up to 5th order. Notice anything ? - yes the p-value. The p-value drops off at the 4th order and drops further at the 5th order. That’s enough to tell us that the best model is between the quadratic and cubic model.
Cross-validation is another alternative to anova (and it is more advanced than anova) but that is a topic for another day.

model_2 <- lm(lifeExp ~ poly(year,2), data = ng)
model_3 <- lm(lifeExp ~ poly(year,3), data = ng)
mod4 <- lm(lifeExp ~ poly(year,4), data = ng)
mod5 <- lm(lifeExp ~ poly(year,5), data = ng)

anova(model_2,model_3,mod4,mod5)

## Analysis of Variance Table
## 
## Model 1: lifeExp ~ poly(year, 2)
## Model 2: lifeExp ~ poly(year, 3)
## Model 3: lifeExp ~ poly(year, 4)
## Model 4: lifeExp ~ poly(year, 5)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1      9 2.62635                                  
## 2      8 0.82994  1   1.79641 70.381 0.0001562 ***
## 3      7 0.43658  1   0.39336 15.411 0.0077508 ** 
## 4      6 0.15314  1   0.28344 11.105 0.0157602 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Four: How well does our model fit ?

The R² measures the fitness of our model. It takes on a value between 0 and 1, the closer the R² is to 1 the fitter the model and vice-versa the closer it is to 0. The R² of our model_3 is 0.9953 (you can understand why i said it’s too good to be true - in football terms: imagine having an expected goal of 0.9953, that’s like having an empty net with the ball on the goal line waiting for a single touch) .
Another metric we can use to measure fitness is the Residual standard error (RSE) . It actually measures the lack of fit of a model. You want this metric to be as low as possible. How low ?, depends on your model and the unit of measure. In our case, the RSE of our model_3 is 0.3221(very low), RSE for our linear model is 1.52 (low).

Word of caution on our models: Earlier in this section i said model_2 and model_3 are almost too good to be true and you’ll understand why that is the case. If you look at our trend line, you will notice that lifeExp and year have a linear relationship until 1995 when it starts to drop-off but like i said earlier that’s a glitch and the trend will definitely be back to a linear trend. Now, our model_2 and model_3 captured the “drop-off” as the last point of contact. The problem with that is if we use model_2 or model_3 to predict future numbers, our prediction will definitely be “off” - how off ? You can imagine a situation where at some point our model_2 and model_3 will predict lifeExpectancy of 0years(almost impossible to fathom).

Let’s compare our 3 models. This way we can know which model to pick for our data and which is more reliable and accurate. We do this by:

comparing our models predictions. We give our models a time-period in years and see how they deal with forecasting lifeexpectancy say 10/30 years from now;

To start with the prediction, i will create a tibble with just one column (year), feed this tibble with our models (we do this with gather_predictions ) and then render our predicted numbers.

df <- tibble(year = seq(2012, 2042, by = 5))

#convert year to int
df$year <- as.integer(df$year)


#run predictions
pred <- df %>%
  data_grid(year) %>% gather_predictions(lm.fit, model_2, model_3)

pred %>% datatable(caption = tags$caption(
  style = 'caption-side: top; text-align: center;',
    'Our predcition table is too long. What we need is a shorter table, we do this by making it wider. ', htmltools::div('We can make it wider by having our models in columns (like a pivot table in excel).')
  )
)

The table above is hard to read. Let’s make it wider by using the spread function in R.

#Spread the data
spread(pred, key = "model", value = "pred") %>% 
  datatable(caption = tags$caption(
  style = 'caption-side: top; text-align: center;',
    'Our linear model prediction is more accurate and is close to the actual lifeExpectancy figures for NG.  ', htmltools::div('Model_2 and Model_3 are negatively correlated to year. This should not come as a surprise, we saw it coming.')
  )
)

Now we can compare our predictions. Our linear model(lm.fit) predicts life expectancy to be 52.42 years in 2022 and 56.5 years in 2042. Model_2 on the other hand predicts life expectancy to be 44.9 years in 2022 and 38.7 years 21 years from now. For model_3, life expectancy will be 39.5 years in 2022 and to drop-down to 17.69 years in 2042.

Our lm.fit predictions certainly looks more reliable. A quick search on google shows that life expectancy in Nigeria in 2012 was 51.79 years and in 2017 it was 53.95 years so our linear model is very close and a good predictor. Compare this to our polynomial model (model_2 and model_3), it’s the exact opposite of our linear model (decreasing and model_3 is decreasing faster).

From our predictions we can see that model_2 and model_3 are actually bad models so we are going to get rid of them and stick to our linear model going forward.

We could also use our linear model to predict the life expectancy of all countries that fit with our model and it will give us an accurate estimate (remember it’s dependent on the absense of war).

So What Next ?

Now that we have a good model (lm.fit), we can start thinking of fitting this model to all other countries. But this raises some new set of questions. We know we have a good model for Nigeria, but a good model for Nigeria is not necessarily a perfect model or even a good model for all other countries. From our data exploration section (PCA plot to be precise), we know that Nigeria has little in common with a lot of countries (especially countries outside Africa) - so our model might be good for countries that are positively correlated with Nigeria and a bad model for negatively correlated countries - the only way to know is to test. By the way these are all possibilities you should be wary of when fitting your model to other data set.

There are different ways we can approach fitting our linear model to other countries:

We can build several models for each country, check the fitness and pick the model with the highest R². We know this will be cubersome but on the other hand we can use the map function (i have not figured how to do this yet). With this approach, we can be sure to have a very good model for all countries;
Using our Hierarchical clustering, we can build a model for each cluster and then fit the model with every country in that cluster. This approach won’t give us a good model for all countries because we are using averages;
We can build a single model using our averages (for all countries) and then use this model for all countries. This is not a good approach because our model will definitely not fit (same issue with the Hierarchical clustering);
Another option is to build a good model for a single country (like we have done with Nigeria) and use this same model for all countries (This is the approach we’d be taking).

To fit our a linear model(like lm.fit) to every country, we need to nest our data such that every row is a group of data (on a country level). So instead of having 12 rows of data for each country, what we have instead is - 12 rows of data grouped into a single row for each country. All this data/information is stored in the data column and the [object Object] is actally a list of data frames.

all_ctries <- gapminder %>% 
  group_by(country, continent) %>% nest()

all_ctries %>% datatable(caption = tags$caption(
  style = 'caption-side: top; text-align: center;',
    'Each row contains grouped data for each country. ', htmltools::div('The grouped meta-data is stored in the data column inside the [object Object] which is a nested data-frame.')))

We can check our nested data, and see all the metadata we have for the first row (Afghanistan in this case).

We can also filter for a particular country (as done for Nigeria below):

Next thing we want to do is to fit a linear model (like our NG lm.fit) to every country. We write a function to get this done and then apply it to all the dataframe using the map functional iteration:

#model fitting function
linear_fit <- function(df){
  lm(lifeExp ~ year, data = df)
}


#fit data-model to every country
models <- map(all_ctries$data, linear_fit)

Now we are going to add our fitted model to our nested data-frame by creating a new column. This is what makes a nested data-frame really cool.

We can finally add our residuals to our dataframe. As always, we create a new list column and store our residuals in there.

#adding models to our nested dataframe
all_ctries <- all_ctries %>% mutate(resid = map2(data, model, add_residuals))

#printing out results
all_ctries %>% datatable()

To view our model residuals for all countries, we have to unnest our dataframe to a regular dataframe. And remember residuals is the difference between our predicted values and the actual values so we want a number that is as close to 0 as possible (A residual of 3 is better than a residual of 6 and a residual of 1 is even better. A very high residual points to a lack of fit of our model).

resid <- unnest(all_ctries,resid)

resid %>% dplyr::select(country, continent, year, resid) %>% datatable()

So can we view our model fit too ? Yes, we can. And for this we use the glance call. Same process as before, we create a new column, store our model fit metrics in the newly created column and then convert our list-column dataframe to a regular dataframe.

mod_fit <- all_ctries %>% 
  mutate(glance = map(model, glance)) %>%
  unnest(glance) %>% select(country, continent, r.squared,  p.value) 

mod_fit %>% datatable()

Remember we can measure the fitness of our model using r.squared. From the first data page, we can see that our model has a very good fit with all the countries on this page (0.9+, very good).

One thing i was interested in and looking out for in our model are countries that have an r.squared above 0.97 (countries that fit really well with our data). Like i said i’m always wary of any model with a fitness(r.squared) above 0.9 - but not in this case because as mentioned earlier before and even from our data exploration section, there is a linear relationship between year and life expectancy (so we have nothing to be scared about although it’s always a good practice to always test your assumptions).

What we want to do next, is to look out for countries that don’t have a good fit with our model.

Here, i will take an r.squared of >= 0.5 as a good fit and an r.squared of < 0.4 as lack of fitness

#countries that don't fit well with our model
mod_fit %>% arrange(r.squared) %>% datatable()

We can see countries that don’t have a good fit with our linear model (10 countries to be precise - have less than 0.4 r.squared). And all countries are african countries;

We could fit in another model (say a polynomial model) and compare the fitness of the least fit countries against the least fit countries of our polynomial model. You can expect a situation where our polynomial model fits very well with more countries than our linear model but suffer the same issues (where our predictions are inaccurate and mechanical). But we won’t be doing this (i just want to make you aware of it).

For our linear model, we can do a further deep-dive on countries that don’t fit well with our linear model to understand if there’s a common denominator among these countries. Let’s plot the r.squared for each country and try to understand each country better.

p <- mod_fit %>% filter(r.squared <0.4) %>% 
  ggplot(aes(r.squared, reorder(country, r.squared))) + geom_point(size = 3)

p + annotate("text", x = 0.32, y = 7, label = "Group 1", color = "darkred", size = 6) + 
  annotate("text", x = 0.08, y = 2, label = "Group 2", color = "darkred", size = 6) + 
  labs(title = "Even within our least fitted countries we can see two clusters", 
       subtitle = "Group 1 countries fitness can still be improved but Group 2 countries don't fit in at all") +
  theme(plot.title = element_text(hjust = .5,lineheight = .20, size = 20),
        plot.subtitle = element_text(hjust = .5,lineheight = .20, size = 15),
        panel.grid.major = element_line(linetype = "dotted",color = "black"),
        panel.grid.minor = element_line(linetype = "dotted",color = "black"),
        axis.text.y = element_text(size = 13),
        axis.text.x = element_text(size = 13))

We can start to see a cluster (i have assigned these clusters to a group). And this is important if we want to build a very good model for these countries - most likely build 2 models (1 model for Group 1 and another model for Group 2). Since our linear model does a good job in model fitness for most countries (except Rwanda and 9 others) there’s no need for us to start building new models. We can actually take our linear model, improve it then use it to model for Group 1. Whereas for Group 2, we need to build a new model from scratch.

We can slice and dice these groups to try and understand what makes them different.

poor_fit <- mod_fit %>% 
  filter(r.squared <0.4) %>% 
  arrange(r.squared) %>% 
  left_join(gapminder, by = "country") %>% 
  mutate(Group = ifelse(r.squared > 0.2,"Group 1", "Group 2")) %>%
  select(country, year, lifeExp, Group)

poor_fit %>% 
  filter(year == 2007) %>% 
  group_by(Group) %>% 
  summarise(life_Expectancy_Average = mean(lifeExp)) %>% kable(caption = "Our linear model is a bad fit for countries with very low life expectancy.")

Our linear model is a bad fit for countries with very low life expectancy.
Group	life_Expectancy_Average
Group 1	48.91775
Group 2	44.17433

Group 1 have an average lifeExpectancy of 48.9 years in 2007 and group 2 have an average lifeExpectancy of 44.2years in 2007. This might actually mean nothing or be the reason why some countries fit better but it’s a good place to start from. You can start making a guess that maybe our model is a bad fit for countries with very low lifeExpectancy.

We can try to confirm this by checking our model fitness for countries with low lifeexpectancy in 2007.

I will collate the gapminder data-set and sort it by lifeExpectancy(from lowest to highest) and then do a left join to our model to get the r.squared of each country.

gapminder %>% filter(year == 2007) %>% 
  arrange(lifeExp) %>% 
  left_join(mod_fit, by = "country") %>%
  select(country, continent.x, lifeExp, r.squared, p.value) %>% datatable()

Is our model a bad fit for countries with low lifeExpectancy ? yes but it’s not a resounding yes, our model is actually good for some countries with low lifeExpectancy but there’s a connection in there between our model lack of fitness and countries with low lifeExpectancy.

All countries that don’t fit well with our model are in the lowest 20% segment of lifeExpectancy. And from the first page we can see than we have 4 countries that don’t fit well with our model - so we can make an assumption and not a conclusion that countries with low Expectancy will most likely not fit very well with our model (and if the country is in africa ? Bingo, then our odds is even better)

Let’s proceed and see the lifeExpectancy trend from 1952 to 2007 for each of our least fit countries.

Note that, Line chart does a poor job with rendering more than three feature on a single plot (color coding issue), for that reason we’d be adding a heatmap too to supplement the line-chart.

We go first with the heatmap.

Now, one thing is obvious straight-away from our heatmap. The steep-decline in life expectancy in Rwanda. Also, countries from southern africa were in the 60 years range of life expectancy before the decline started (as seen by their color coding changing from blue to grey). It’s also a gradual decline.

We follow this up with a line-chart.

word of caution here: Line charts don’t do a good job in differentiating between multiple lines. From our plot, it’s not easy to know which line belongs to which country (the color coding are not intuitive - maybe it’s another area i have to improve on) but not to worry, we’ll be plotting a single line plot for each country

Here again, we can see the decline in Rwanda. We can also see that decline started for most countries in the early 90’s.

Another thing that might take you a while to figure out is that these countries (except Côte d’Ivoire) share borders.

We can show how these countries are interconnected with a map of africa.

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

Please note that Swaziland is now called eSwatini and Congo, Dem. Rep. is now called Democratic Republic of Congo

From looking at the map, a typical question can be: “How come other countries that share borders with South-Africa like Namibia didn’t get affected by the decline around that region”. My answer to that would be, the countries we are focused on are countries that don’t fit very well with our model and not countries that had a decline in life expectancy. Asides that, countries like Namibia that share a border with South Africa were also affected and their life expectancy also declined from 61 years in 1992 to 51.479 in 2002 (~10 years).

Even though these countries share borders, they don’t share the same causation. The type of “life expectancy decline” in East Africa is different from the decline in Southern Africa. This is another insight we might want to do a deep-dive on with a new plot.

Before we plot, we should assign each country to the region they belong to in africa.

I will be categorizing Congo, Dem. Republic as an Eastern africa country even though it is categorized as a country in central african. I’m doing this because it shares a border with Uganda and Rwanda (and the genocide in Rwanda also had an impact in the civilian war in the country)

It will be intuitive to have the plots of each region - side by side - that way we can see the type of decline in each region

As we can see from the plot, the decline in Southern africa was gradual. Eastern africa decline was very sharp.

We’ll go over each region, starting with the Eastern part of Africa.

library(spData)
library(tmap)

#change congo to DRC
reg <- region %>% filter(region == "Eastern Africa") %>% mutate(color = case_when(country == "Congo, Dem. Rep." ~ "Democratic Republic of the Congo",
                                    country == "Rwanda" ~ "Rwanda",
                                    country == "Uganda" ~ "Uganda"))

#filter for africa for map
africa <- world %>% filter(continent == "Africa")

#filter for specific countries to use as color coding
East_Africa_countries <- africa %>% filter(
  name_long %in% c(
    "Rwanda","Uganda","Democratic Republic of the Congo"
    ))

#line plot
p1 <- region %>% filter(region == "Eastern Africa") %>%
  ggplot(aes(year,lifeExp,group=country)) + 
  geom_line(aes(color=color), data = reg) +
  scale_x_continuous(breaks = c(1952, 1962, 1972, 1982, 1992, 2002)) +
  xlab(NULL) + ylab("Life Expectancy (years)") +
  theme(
    axis.title.y = element_text(size = 13), legend.position = "none",
    axis.text.y = element_text(size = 13),
    axis.text.x = element_text(size = 13))

#map plot
p2 <- ggplot(africa) + 
  geom_sf() + 
  geom_sf(
    aes(fill = name_long), data = East_Africa_countries) + 
  theme(legend.position = "top",axis.text.y = element_blank(),axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 13))+ theme_void()

p1 + p2 + theme(legend.title = element_blank(),legend.position = c(0,1), legend.justification = c(0.5,28),
                legend.background = element_rect(fill = "grey90"),legend.text = element_text(size = 13),
                legend.direction  = "horizontal", plot.title = element_text(hjust = -1.5,lineheight = .20,size = 15),
        plot.subtitle = element_text(hjust = 1.5,lineheight = .20, size = 13)) + 
  labs(fill = "Countries:", title = "Sharp Decline in Life Expectancy in East Africa",
       subtitle = "Decline can be traced back to the Civil War and genocide in Rwanda in the early 90's. Then the congo war followed.")

East Africa : The decline in lifeExp is more pronounced in eastern africa. The steep-decline in east africa is as a result of the civil war and the rwandan genocide (by the way, the genocide started way before the 90’s but it took a worse turn after the President was assassinated) which had a knock-on effect. After the RAF took over the government in Rwanda - they invaded Congo to go after Hutu Rebels (that committed the genocide) which led to the first congo war. These later contributed to the Second Congo War in 1998 (with Uganda, Rwanda as major players in the war). You can read more about the the congo war to understand the cat-mouse relationship between the 3 countries. Link here * https://en.wikipedia.org/wiki/Second_Congo_War *.

One more word on genocide (summarized this from Better Angels of our Nature on genocide):

After the rwandan genocide, US President Bill clinton commissioned Harff to analyze the risk factors of genocide. She assembled a data set of 41 genocieds between 1955 and 2004 and also a control group of 93 cases of state failure that did not results in genocide matched as closely as possible to the ones that did and ran a logisticregression model on them (logistic regression is used with two-class or binary response - like "Yes / No", "Good / Bad").
  Harff discover 6 risk factors that distinguished the genocidal from the nongenocidal crises in 3 quarters of the cases:
      1. A country's previous history of genocide;
      2. The country's immediate history of political instability - to be exact the number of regime crises and ethnic or revolutionary wars it had suffered in the preceeding 15 years,Government that feel threatened are tempted to eliminate or take revenge on groups they perceived to be subversive or contaminating and are more likely to exploit the ongoing chaos to accomplish those goals before opposition can mobilize;
      3. Third was a ruling elite that came from an ethnic minority, presumbly because that multiplies the leader's worries about the precariousness of their rule;
      4. Autocratic governement. Autocratic government were three and a half times more likely to commit genocides than were full or partial democracies;
      5. Openness to trade. Conuntries that depend more on international trade are less likely to commit genocides or even fight wars;
      6. Lastly, Exclusionary ideology. Ruling elites that have exclusionary ideology are far more likely to commit genocide than elites with a more pragmatic or ecletic governing philosophy. Exclusionary ideologies include Marxism, Islamism(in particular, Sharia law), militaristic anticommunism and nationalism.

Southern Africa: Whereas we get a sharp-decline in eastern african countries, for southern african countries we have a gradual decline in lifeExp that still continues till today. The decline in lifeExp in this region is due to the AIDS epidemic in the 80’s.

A word on this, as we can see the decline in southern africa started in the late 80’s (1987 to be precise). Remember, It takes 8-10 years for HIV to turn to AIDS, which means that the seed of the decline had been planted in the late 70’s (say 1977) in Southern Africa. What makes this kind of situation different from war is, epidemic is gradual and harder to control compared to war. Also, Southern africa government had a non-challant attitude towards the AIDS epidemic (especially Thabo Mbeki’s, south african president denial on AIDS) which made matters worse.

But good news is around the corner (we are getting there - Human progress) https://www.europeanpharmaceuticalreview.com/news/141892/novel-hiv-vaccine-approach-shows-promise-in-landmark-first-in-human-trial/

A typical question we might want to ask is “where and when did this decline start in Southern Africa” . For this we might want to check when each country first recorded a decline in life-expectancy.

#which country had the first decline in life-expectancy in southern africa

# Step 1: filter for Southern Africa from Region
# Step 2: find the year with the first drop for all countries
# Step 3: sort countries based on the year of drop
# Step 4: use row number to pick first year of drop


SA <- region %>% filter(region == "Southern Africa")

#region %>% 
#  filter(region == "Southern Africa") %>%
#  ggplot(aes(year, lifeExp)) + geom_point(color = "red", size = 2) + geom_line() + facet_wrap(~country, nrow=1)

SA_final <- SA %>% group_by(country) %>% #group by country
mutate(drop_years = lifeExp - lag(lifeExp),
       drop_prop = year - lag(year),
       life_exp_percent_drop = (drop_years/lag(lifeExp))*100
       ) %>% #create a new column with diffe (years diff)
  mutate(confirm = ifelse(drop_years <= 0,"Drop","Increase")) %>% #create another column, based on a condition of diffe
  filter(confirm == "Drop") %>% #filter
  mutate(count = row_number(country)) %>%
  filter(count == 1) %>%
  select(country, year, lifeExp, drop_years, life_exp_percent_drop) %>%
  arrange(year)

SA_final %>% datatable()

As we can see, the decline started in Zambia in 1987, then Botswana and Zimbabwe in 1992, and finally - not to be surprised - Swaziland, Lesotho and South Africa in the same year. Remember that Lesotho and Swaziland are literally inside South Africa (so you expect the decline to happen at the same time there).

Swaziland had a 7% drop in life expectancy (for their first decline) followed by Lesotho. This shouldn’t come as a surprise too as both countries had a small population then (and till now). Remember in the data exploration, we said the bigger the population the more spread out the life expectancy and the harder it is to move the life expectancy.

What i’m driving at is, South Africa lost more people than Swaziland and Lesotho due to the AIDS pandemic but because they have a bigger population their loss is likely to have a minimal impact on life expectancy of their population compared to Lesotho or Swaziland.

Finally, Cote d’ivoire is a special case because it’s not easy to pin-point the exact reason for the decline, and the fact that it’s the only country from west africa in our data set makes it harder to figure out.

We can actually make a guess based on the pattern - and my first guess is the primary reason for the decline is definitely not a civil war because war/civil war tend to create a sharp decline, so we can start thinking of anything but war like an epidemic or poor health facilities, or the kind of diet prevalent during that period.

Major take-away for you from this section is: War tends to cause a steep/sharp decline in lifeExpectancy while epidemic causes a gradual decline. So when next you hear about civil war in a particular country or an epidemic (like Corona Virus), you should have an idea of how that will play out over time . Although i don’t expect any epidemic to have as much devastating effect on the population as AIDS (except ebola which is not as prevalent as AIDS - you get the drill by now) so don’t expect corona virus to make a dent on life expectancy like AIDS.

Let’s end this section with a plot of each country (since our grouped line chart didn’t do a good job in differentiating between each country). I have divided the plot into 2 groups (remember our clusters from above - group 1 have an r.squared > 0.2 and group 2 have an r.squared - that is how i have divided the plot).

So What Now ?

I need to address a final point on lifeExpectancy. By now you probably think the most important predictor of lifeExpectancy is year. What if i tell you the most important predictor of lifeExpectancy is actually gdpPercap. Remember we only ran a model on one country (Nigeria). For a country like Nigeria, the biggest predictor of LifeExpectancy is year but every country is not Nigeria. For some countries, the biggest predictor is year while for others it is gdpPercap (and you shouldn’t be surprised if population is the biggest predictor of lifeExpectancy in some countries - although this will be absurd but you can never know.)

So the next question now becomes, if the world was one single country, what predictor drives lifeExpectancy (what is the biggest predictor for most countries). We can verify this using a decision tree:

A way to interpret this tree is: the biggest predictor of lifeExpectancy for most countries is gdpPercap. We can further segment countries into 2 batches - Countries that have a gdpPercap less than $3525.3 and countries that have a gdpPercap of more than 3525.3 dollars. For countries that have a gdpPercap of less than 3525.3 dollars, we can further segment them into 4 terminal nodes or leaves (leaves are the bottom of the tree).

Countries that have a gdpPercap of more than $3525.3 can be segmented into:

Countries that have a gdpPercap more than $15,141.3 tend to live longer (with a lifeExpectancy average of 74.87 years)
Countries that have a gdpPercap greater than $5,534.92 but less than 15,141.2 dollars have a lifeExpectancy average of 68.50 years;
Countries that have a gdpPercap greater than $3525.3 but less than 5534.92 have a lifeExpectancy average of 62.28 years;

Countries that have a gdpPercap of less than $3525.3 can be segmented into:

Countries that have a gdpPercap less than $1,314.51 before year 1969 tend to have a lifeExpectancy of 41.80 years;
Countries that have a gdpPercap less than $1,314.51 after year 1969 tend to have a lifeExpectancy of 49.23 years;
Countries that a gdpPercap greater than $1,314.51 but less than 3525.3 dollars before year 1985 have a life expectancy of 51.44 years;
Countries that a gdpPercap greater than $1,314.51 but less than 3525.3 dollars after year 1985 have a life expectancy of 60.37 years.

We can also run a linear model for all countries:

## 
## Call:
## lm(formula = lifeExp ~ ., data = gp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.497  -7.075   1.121   7.701  19.640 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.115e+02  2.767e+01 -14.872  < 2e-16 ***
## year         2.354e-01  1.400e-02  16.812  < 2e-16 ***
## pop          6.353e-09  2.218e-09   2.864  0.00423 ** 
## gdpPercap    6.729e-04  2.444e-05  27.529  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.673 on 1700 degrees of freedom
## Multiple R-squared:  0.4402, Adjusted R-squared:  0.4392 
## F-statistic: 445.6 on 3 and 1700 DF,  p-value: < 2.2e-16

All our predictors have significant relationship with lifeExpectancy (although gdpPercap and year have more significance than population). Our model r.squared is 0.44 which is not impressive but not also far off. We can certainly improve our model by slicing and dicing and we can even try running a single regression model with just 1 predictor and check the fitness but we won’t be doing this.

Just out of curiosity, we can check the biggest predictor of gdpPercap. A straight-forward guess should be lifeExpectancy but let’s verify with a decision tree.

I won’t go into details on how to interpret this tree, you can apply the same logic from our previous tree to interpret this tree. Although I will point out some insights i find interesting:

our model suggests countries with lifeExpectancy less than 55.509 have the lowest gdpPercap (which kind of makes sense);
And also countries with lifeExpectancy less than 67.7 years and a population size less than 358,583 and also before year 1969 have gdpPercap of 68,980 (which is huge - and this is linked to the oil prosperity in the middle east in the 50’s and 60’s. Remeber Kuwait from our exploratory section ?).

Act III

Conclusion

The world is in year 2021 and unfortunately we still have arguments on whether the world is getting better or not. It seems we have more people who think the world is not getting better. I understand better now - why people feel this way about human progress. The more you know, the better you understand.

A century or two before now, countries go to war on average of 2 or more wars in a year, I read “Napoleon the great” by Andrew Roberts and found out Napolean fought over 60 wars in less than 25 years(Napolean actually spent more time outside of france fighting wars than he spent ruling in france). 60 wars in 20-25 years is an average of 2+ wars in a year and by the way this was the thing back then and was the prevalent thing before 20th century that we take for granted now (it will be interesting to see the lifeExpectancy in the 18th or 19th century - you can bet that it will be less than 35 years) - and if you go back to 15th, 16th centuries or even before then you will find out that they fought more wars in those years and were more violent compared to 18th, 19th century (meaning 2 wars in a year in the 18th/19th centuries was actually an improvement and progress compared to centuries before then). Inshort, you start to see a linear trend where as years go by or as centuries go by, the urge to go to wars decreases to the extent that we’ve gone from a generation that go to war almost every 6 months to a generation that might not witness war in their life time.

As we have seen in our analysis of least fit countries in our model, wars tend to make a big dent in lifeExpectancy (and on population and gdpPercap). In africa, we are still licking our wounds from the wars in the 60’s, 70’s and even 90’s. The fact that wars are frowned upon in africa today says alot about how we have progressed on this side.

Centuries ago, Rape,mass killings and genocides were considered spoils of war but now it’s illegal, humans are treating animals better now than decades ago. This is the best period for gays and homosexuals (it’s even legalized in some countries and we are not scared of calling them our friends now). Torture is illegal now, women drive in Saudi Arabia now. In most cases, the progress starts in Europe or North America, then over time filters down to other continents. We found a vaccine for corona virus in less than a year - the spanish flu killed more than 10-18 million people in 12-15 months, we are more than a year into corona virus and less than 3 million people are dead (we might not want to say it - the fact that we have such a low figure for corona virus compared to the spanish flu around the same time period points to human progress). We can be sure that another epidemic 100 years from now, will not have as much devastating effect as corona virus because Humans would have progressed more compared to now. You are probably reading this from your phone or laptop - (you probably can’t do that 30/40 years ago). Today we own cars (which was a luxurious item and only for the super rich until Ford decided to commercialize it and make Model T affordable over 100 years ago), i believe at some point we will own Aeroplanes the same we own cars (say 100 years from now and they will be way smaller). We can go on and on - on decline in violence and human progress and the theme will be the same.

Human progress is a never ending process and there’s no limit. Every generation makes their contribution and pass the baton to the next generation.

10 years ago, i probably would have been in the camp of “the world is not getting better”, so what changed between now and then ? I will say books (reading about History, reading about Napolean and his conquests, about World Wars, about Julius Cesar, about Rwandan Genocide put things in perspective).

If you ever need a book to understand human progress, i will recommend The Better Angels of our Nature. If you don’t want to read the book, you can watch this video on youtube by Steve Pinker talking about the book https://www.youtube.com/watch?v=o5X2-i_poNU

Epilogue

Over the years, i have read a lot of books on wars, genocides, biographies, programming, data science that kind of made it easier for me to form an opinion on some of the subjects in this article. You might have a better opinion on some of the subjects in this report depending on your vast knowledge and mental models. However If you are interested in going deeper, i would recommend the following books:

Books Recommendation:

On Data Modelling:

I really enjoyed these books, some of these books are free.

Applied Predictive Modelling https://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485
An Introduction to Statistical learning https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370
R for Data Science. https://www.amazon.com/Data-Science-Transform-Visualize-Model/dp/1491910399
If you are interested in learning R, then you should check out Hadley Wickham and his books (he wrote R for Data Science). You can check his home-page here http://hadley.nz/
If you are thinking of replicating this on Python (might do another article using Python) or learning python, i’d recommend this book https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662
I also love Jose Portilla, he’s an udemy academic instructor.You can check out his Python, R and Postgres SQL courses on Udemy. His course on python is one of the best online-courses on the internet today. https://www.udemy.com/user/joseportilla/

On Diet:

I mentioned Japan and how their diet might be the difference maker in their life expectancy gap compared to other countries. I also believe sleeping very well is another plus for living longer. To learn more about diet and sleep and it’s impact on our well being you can pick up these books

On War & Genocide:

Wars and genocides were often mentioned in this report and they both have significant effect on life expectancy. While i can’t remember them all, these books are my personal favorites on this topic:

The Rwanda Crisis: History of a genocide https://www.amazon.com/Rwanda-Crisis-History-Genocide/dp/023110409X
Napoleon the Great https://www.amazon.com/dp/B01G8V1TMW/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
We Wish to Inform You That Tomorrow We Will be Killed With Our Families: Stories from Rwanda https://www.amazon.com/dp/B000OI0FI0/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
Guns of August https://www.amazon.com/dp/B002TXZS8A/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1 [About World War I];
Africa’s World War: Congo, the Rwandan Genocide, and the Making of a Continental Catastrophe https://www.amazon.com/gp/product/B015G2TEF8/ref=dbs_a_def_rwt_hsch_vapi_tkin_p1_i0
The Brothers’ War: Biafra and Nigeria https://www.amazon.com/dp/B008J3TDBC/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
Say Nothing https://www.amazon.com/dp/B07B2NZRL5/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1 [Really love this about the IRA]
Oil, politics and Violence https://www.amazon.com/Oil-Politics-Violence-Nigerias-1966-1976/dp/087586709X
Team of Rivals https://www.amazon.com/Team-Rivals-Political-Abraham-Lincoln-ebook/dp/B000N2HBSO [on Abraham Lincoln and America civil war]

On AIDS and Thabo Mbeki’s denial:

The Dream Deffered https://www.amazon.com/dp/B00DFM6WMU/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1

On Human Progress:

The major theme of this report is how we have progressed as a race. If you have to pick a book from this section, i hope you pick one of these books

Factfulness https://www.amazon.com/dp/B0769XK7D6/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
The Better Angels of our Nature https://www.amazon.com/dp/B005HHSYMW/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1

On Data Visualizations:

Fundamentals of Data Visualization https://www.amazon.com/Fundamentals-Data-Visualization-Informative-Compelling/dp/1492031089
Storytelling with Data https://www.amazon.com/dp/B016DHQSM2/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
I get a lot of my data visualizations mental models from Tom Worville in the athletic. His visuals are the most beautiful visuals you’d find on the internet today and he’s a good writer too. https://theathletic.com/author/tom-worville/
You can also check out Owen Phillips, his visuals are very good https://thef5.substack.com/

All codes are written in R, i have tried to expose the codes (but some codes for the Part 2 were hidden - will add to github). There are a lot of visuals i decided to cut off from the report which i think will be of good use (especially if you are into e-commerce or a data analyst), this visuals together with the full code of this report will be posted in my github account before the end of the week

If you have a book recommendation for me or want to say hi - you can reach me via email: olalekankazeem1@gmail.com and also my twitter account (@SabiBoy84 although i’m rarely on twitter but i make sure to pop in once in a while.)

github account -> https://github.com/Lekan-Ali

Email Add -> Olalekankazeem1@gmail.com

Twitter Account -> @SabiBoy84

If there’s one thing i didn’t plan for, it was how long this report would be or the intensity. The plan was to do a single report and get it done in a day (say 5/6 hours) but this took four days (and more than 36 hours) - it was worth it.

I’m currently reading Sam teach yourself SQL in 10 minutes and about to start reading R graphics cookbook and Steve pinker’s (sense of style - to improve my writing) so i had to rush this article because i don’t want a situation where i start making some changes because i saw a better visuals in the R graphics cookbook or read some tips about writing from sense of style.

Which leads me to my final point, my writing style or use of words may not be up to standard or good enough (maybe i used will where i should use would or use a particular word too much) i promise to improve on that and my next article will certainly be better than this.

Ending with this short video from Shawshank Redemption: Love it https://www.youtube.com/watch?v=bRKslrIQ7xI

“Hope is a good thing, maybe the best of things, and no good thing ever dies” - Andy Dufresne[Movie: Shawshank Redemption] <