On the Correlation between the Proportion of Internet Users and the Birth Rate per Country

Scraping Data from Wikipedia

Author

Andrea Carpignani

Published

20 January 2024

Introduction

In the following I am going to scrape Wikipedia to get the birth rate of each country, the percentage of internet users per country and an estimate of the income group of each country. The aim is to establish whether there is any relation between internet usage and birth rate, and whether this relation is stronger or weaker depending on the income group of the country.

Data collection

The birth rate per country can be found on the Wikipedia page “List of countries by birth rate”.
The percentage of internet usage can be found on the Wikipedia page “List of countries by number of internet users”.
To estimate the income group we can use the GDP estimated for each country and available on the Wikipedia page “List of countries by GDP”.

In order to scrape the data from these pages, we shall need the following packages:

library(tidyverse)
library(rvest)

Let us first save the URL of each page in a variable, and then save the corresponding HTML page in another variable.

# Url of each page
url1 <- "https://en.wikipedia.org/wiki/List_of_countries_by_birth_rate"
url2 <- "https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users"
url3 <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Saving the content of each page in a variable
html1 <- read_html(url1)
html2 <- read_html(url2)
html3 <- read_html(url3)

We can now utilise rvest functions to get the required tables from the web pages.

df1 <- html1 |>
    html_element("table") |>
    html_table()

df2 <- html2 |> 
    html_element(".sortable") |>
    html_table()

df3 <- html3 |> 
    html_element(".wikitable") |>
    html_table(header = FALSE) |>
    slice(c(-1,-3))

Notice that the first two tables are easier to collect, but the third table is more complicated, because the headers of the table are divided into two lines. This will require some extra work to tidy df3, before we can join the three tables together.
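To see what header = FALSE and slice() do for a table whose header spans two rows, here is a minimal sketch on a made-up inline table. The HTML, the country names and the figures are purely illustrative and are not taken from Wikipedia; only minimal_html(), html_element() and html_table() are rvest functions.

```r
library(rvest)
library(dplyr)

# Hypothetical miniature of a table with a two-row header (not the real page)
page <- minimal_html('
<table class="wikitable">
  <tr><th>Country</th><th colspan="2">Source A</th></tr>
  <tr><th>Country</th><th>Estimate</th><th>Year</th></tr>
  <tr><td>Xland</td><td>1,234</td><td>2022</td></tr>
  <tr><td>Yland</td><td>567</td><td>2022</td></tr>
</table>')

# With header = FALSE both header rows are read as ordinary rows,
# and the columns get the generic names X1, X2, X3
raw <- page |>
    html_element(".wikitable") |>
    html_table(header = FALSE)

# Drop the first header row; the second one stays as row 1
toy <- raw |> slice(-1)
toy
```

Reading the header as data keeps every column as text, which is why the numeric columns have to be converted by hand afterwards.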

First cleansing of the data

First table

Let us first have a look at the table:

# A tibble: 234 × 3
   `Country/territory` `PRB2022[1]` `CIA WF2023[2]`
   <chr>                      <int>           <dbl>
 1 Afghanistan                   36           34.8 
 2 Albania                       10           12.5 
 3 Algeria                       22           17.8 
 4 Andorra                        6            6.87
 5 Angola                        39           41.4 
 6 Antigua and Barbuda           12           15.0 
 7 Argentina                     14           15.4 
 8 Armenia                       12           10.8 
 9 Australia                     12           12.2 
10 Austria                       10            9.39
# ℹ 224 more rows

The first table is simple to tidy: we only need to keep the first two columns and give them shorter names.

tidy_df_1 <- df1 |>
    rename(country = 1, rate = 2) |>
    select(country, rate) |>
    arrange(country)

Second table

Let us first have a look at the table:

# A tibble: 214 × 8
   `Country or area` Subregion          Region   `Internet users` Pct  
   <chr>             <chr>              <chr>    <chr>            <chr>
 1 China             Eastern Asia       Asia     1,089,140,000    76.4%
 2 India             Southern Asia      Asia     881,250,000      62.6%
 3 United States     Northern America   Americas 311,300,000      92.4%
 4 Indonesia         South-eastern Asia Asia     215,626,156      78.8%
 5 Brazil            South America      Americas 165,300,000      77.1%
 6 Nigeria           Western Africa     Africa   136,203,231      63.8%
 7 Pakistan          Southern Asia      Asia     130,000,000      60.8%
 8 Russia            Eastern Europe     Europe   129,800,000      89.5%
 9 Bangladesh        Southern Asia      Asia     126,210,000      75.9%
10 Japan             Eastern Asia       Asia     117,400,000      94.2%
# ℹ 204 more rows
# ℹ 3 more variables:
#   `Population.mw-parser-output .nobold{font-weight:normal}(2021)[10][11]` <chr>,
#   Sources <chr>, Year <int>

The second table needs a little more work. We keep only the country name, region, number of internet users and population, and rename these columns. The last two are stored as text with “,” as the thousands separator, so we remove the commas and convert both columns to integers. Finally, we compute the proportion of internet users as a percentage of the population.

tidy_df_2 <- df2 |>
    # Keep country, region, internet users and population
    select(1, 3, 4, 6) |>
    rename(country = 1, region = 2, users = 3, population = 4) |>
    # Strip the thousands separators, then convert to integers
    mutate(across(c(users, population), \(x) as.integer(str_remove_all(x, ",")))) |>
    # Proportion of internet users, as a percentage with two decimals
    mutate(proportion = 100 * round(users / population, 4))
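As a small illustration of the conversion step, the sketch below applies it to a few strings copied from the rows of df2 printed above. It also shows readr's parse_number() as a possible alternative; using it here is a suggestion of this example, not something the analysis relies on.

```r
library(stringr)
library(readr)

# Values copied from the first rows of df2 printed above
users_chr <- c("1,089,140,000", "881,250,000")
pct_chr   <- c("76.4%", "62.6%")

# Approach used in this article: strip the separators, then convert
users_int <- as.integer(str_remove_all(users_chr, ","))

# Alternative: parse_number() skips grouping marks and trailing symbols
users_num <- parse_number(users_chr)
pct_num   <- parse_number(pct_chr)

# Integers in R stop at .Machine$integer.max (2147483647), so larger
# counts would need doubles rather than as.integer()
.Machine$integer.max
```

The counts in these tables fit comfortably in an integer, but a column holding world totals would not.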

Third table

Let us first have a look at the table:

# A tibble: 215 × 8
   X1                  X2        X3         X4        X5       X6    X7    X8   
   <chr>               <chr>     <chr>      <chr>     <chr>    <chr> <chr> <chr>
 1 Country/Territory   UN region Forecast   Year      Estimate Year  Esti… Year 
 2 United States       Americas  26,949,643 2023      25,462,… 2022  23,3… 2021 
 3 European Union[n 3] Europe    18,351,127 2023      16,746,… 2022  —     —    
 4 China               Asia      17,700,899 [n 1]2023 17,963,… [n 4… 17,7… [n 1…
 5 Germany             Europe    4,429,838  2023      4,072,1… 2022  4,25… 2021 
 6 Japan               Asia      4,230,862  2023      4,231,1… 2022  4,94… 2021 
 7 India               Asia      3,732,224  2023      3,385,0… 2022  3,20… 2021 
 8 United Kingdom      Europe    3,332,059  2023      3,070,6… 2022  3,13… 2021 
 9 France              Europe    3,049,016  2023      2,782,9… 2022  2,95… 2021 
10 Italy               Europe    2,186,082  2023      2,010,4… 2022  2,10… 2021 
# ℹ 205 more rows

As we can see, this table is messier: the column names are stored in the first row and several columns share the same name, so promoting the first row to column names would make tibble raise an error. We can sidestep the problem by selecting only the columns of interest, namely the country (column 1) and the estimate in column 7. Note that, in the rows printed above, the years attached to column 7 (listed in column 8) read 2021, whereas the birth rates refer to 2022; the estimate for 2022 sits in column 5.

tidy_df_3 <- df3 |>
    select(1, 7) |>
    slice(-1) |>
    rename(country = 1, GDP = 2) |>
    mutate(GDP = str_remove_all(GDP, ",")) |>
    mutate(GDP = as.integer(GDP)) |>
    drop_na()
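The rows of df3 printed above also contain footnote markers such as “[n 3]” and a dash for missing estimates. The sketch below shows one way such values could be cleaned explicitly, instead of relying on as.integer() to turn them into NA with a warning. The helpers strip_notes() and to_int() are hypothetical names introduced for this example only.

```r
library(stringr)

# Strings copied from the rows of df3 printed above
country <- c("United States", "European Union[n 3]", "China")
year    <- c("2023", "2023", "[n 1]2023")
# "\u2014" is the dash marking a missing estimate; it is borrowed from
# column X7 of the European Union row to show the missing-value case
value   <- c("26,949,643", "\u2014", "17,700,899")

# Hypothetical helper: drop bracketed footnote markers such as "[n 3]"
strip_notes <- function(x) str_squish(str_remove_all(x, "\\[[^\\]]*\\]"))

# Hypothetical helper: remove thousands separators and turn anything
# that is not purely digits (for example the dash) into NA
to_int <- function(x) {
    x <- str_remove_all(x, ",")
    x[!str_detect(x, "^[0-9]+$")] <- NA
    as.integer(x)
}

strip_notes(country)
as.integer(strip_notes(year))
to_int(value)
```

Cleaning the country names in this way matters for the join that follows, because a name carrying a footnote marker will not match the same name in another table.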

Joining the three tables in one single data set

Once we have tidied up the three tables, we need to join them into one single data set. Since we wish only to compare countries for which we have all data, we shall use the inner_join function and drop incomplete cases.

dat <- tidy_df_1 |>
    inner_join(tidy_df_2, by = join_by(country), unmatched = "drop") |>
    inner_join(tidy_df_3, by = join_by(country), unmatched = "drop") |>
    relocate(country, region, GDP, rate, population, users, proportion) |>
    drop_na()
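Because inner_join() silently discards countries whose names do not match exactly, it can be useful to list what is lost. The following sketch uses anti_join() on two tiny tables: the birth rates and populations are copied from the tables printed in this document, while the omission of Algeria and the footnote marker on Albania are contrived for the example.

```r
library(dplyr)

# Birth rates copied from the first rows of the first table
births <- tibble(country = c("Afghanistan", "Albania", "Algeria"),
                 rate = c(36L, 10L, 22L))

# Contrived second table: Algeria is left out on purpose and a footnote
# marker is attached to one name, to mimic mismatches between pages
other <- tibble(country = c("Afghanistan", "Albania[a]"),
                population = c(40099462L, 2854710L))

# Rows that survive the inner join
kept <- inner_join(births, other, by = join_by(country))

# Rows of births with no partner in other: what the inner join drops
dropped <- anti_join(births, other, by = join_by(country))
dropped
```

Running the same anti_join() on the real tables would show which countries are missing from the final data set and whether a spelling difference is to blame.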

Let us have a look at the final table.

country	region	GDP	rate	population	users	proportion
Afghanistan	Asia	14939	36	40099462	4068194	10.15
Albania	Europe	18260	10	2854710	2105339	73.75
Algeria	Africa	163473	22	44177969	26350000	59.65
Andorra	Europe	3325	6	79034	76095	96.28
Angola	Africa	70533	39	34503774	4271053	12.38
Antigua and Barbuda	Americas	1421	12	93219	77529	83.17

Establishing the Income Group

A good way to establish the income group of a country is to consider the GDP per inhabitant, that is, the ratio between the GDP and the population. A small country will usually have a smaller GDP than a big one, but relative to the number of inhabitants, the GDP of a small country with a strong economy may be much higher than that of a big country with a weak economy. Let us therefore add this column to the data set.

dat <- dat |> mutate(gpi = GDP / population)

Now we can calculate the quartiles of the gpi (GDP per inhabitant) and divide the countries into the following four categories

Low Income
Lower Middle Income
Upper Middle Income
High Income

depending on the ranking of the gpi of the country.

qu <- quantile(dat$gpi, probs = c(0.25, 0.50, 0.75))

dat <- dat |> 
    mutate(IncomeGroup = case_when(gpi < qu[1] ~ "Low Income",
                                   between(gpi, qu[1], qu[2]) ~ "Lower Middle Income",
                                   between(gpi, qu[2], qu[3]) ~ "Upper Middle Income",
                                   gpi > qu[3] ~ "High Income"))
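The same grouping can also be sketched with cut(), which returns a factor whose levels keep the order Low to High, so that legends and facets are not sorted alphabetically. This is an alternative to the case_when() call above, not the method used for the plots in this article. The six rows are copied from the joined table printed earlier, so the quartiles below are those of this small sample only.

```r
library(dplyr)

# Six rows copied from the joined table printed earlier
toy <- tibble(
    country = c("Afghanistan", "Albania", "Algeria",
                "Andorra", "Angola", "Antigua and Barbuda"),
    GDP = c(14939, 18260, 163473, 3325, 70533, 1421),
    population = c(40099462, 2854710, 44177969, 79034, 34503774, 93219)
) |>
    mutate(gpi = GDP / population)

# Quartiles of this six-row sample, not of the full data set
qu <- quantile(toy$gpi, probs = c(0.25, 0.50, 0.75))

# cut() returns a factor whose levels follow the order of the labels
toy <- toy |>
    mutate(IncomeGroup = cut(gpi,
                             breaks = c(-Inf, qu, Inf),
                             labels = c("Low Income", "Lower Middle Income",
                                        "Upper Middle Income", "High Income")))

levels(toy$IncomeGroup)
```

One difference from case_when() is that cut() closes its intervals on the right by default, so a value exactly equal to the first quartile would fall into Low Income rather than Lower Middle Income.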

Data Visualisation

We can now analyse the data to see whether there is any evidence of correlation between the proportion of internet users and the birth rate per country. We shall use ggplot2 to visualise the data in a scatter plot.

ggplot(dat, aes(x = proportion, y = rate)) +
    geom_point(aes(colour = IncomeGroup)) +
    labs(
        title = "Birth Rate against Proportion of Internet Users per Country",
        x = "Proportion of Internet Users",
        y = "Birth Rate",
        colour = "Income Group") +
    scale_x_continuous(breaks = seq(0, 150, 10)) +
    scale_y_continuous(breaks = seq(0, 50, 5)) +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
          axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
          axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)))
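A scatter plot shows the shape of the relation, and a correlation coefficient can put a number on it. The sketch below computes one on the six rows of the joined table printed earlier; these rows are only a small sample, so the values it produces say nothing reliable about the full data set. On the real data, dat$proportion and dat$rate would take the place of the two vectors.

```r
# Six rows copied from the joined table printed earlier
rate       <- c(36, 10, 22, 6, 39, 12)
proportion <- c(10.15, 73.75, 59.65, 96.28, 12.38, 83.17)

# Pearson measures linear association, Spearman monotone association
r_pearson  <- cor(proportion, rate, method = "pearson")
r_spearman <- cor(proportion, rate, method = "spearman")

# cor.test() adds a confidence interval and a p-value
cor.test(proportion, rate)
```

A negative coefficient means that countries with a higher proportion of internet users tend to have a lower birth rate.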

Let us now look at the same problem from a different perspective. First of all, let us break down the graph by income group and by region:

ggplot(dat, aes(x = proportion, y = rate)) +
    geom_point(aes(colour = IncomeGroup)) +
    facet_grid(IncomeGroup ~ region) +
    labs(
       title = "Birth Rate vs Proportion of Internet Users per Country by Region",
       x = "Proportion of Internet Users",
       y = "Birth Rate",
       colour = "Income Group"
    ) +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
          axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
          axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
          legend.position = "none")

Some patterns are clear here. Reading the panels column by column, Europe has a high proportion of internet users, whilst Africa has a low one. Europe also has a low birth rate overall, and to some extent so do the Americas, whereas the other regions are more scattered. The correlation between birth rate and proportion of internet users therefore does not look accidental: it appears to be linked to the region in which each country lies.

To check this claim, let us consider two more graphs: the birth rate and the proportion of internet users by region.

ggplot(dat, aes(x = reorder(region, -rate), y = rate)) +
    geom_point(aes(colour = region), position = "jitter") + 
    geom_boxplot(aes(colour = region), alpha = 0.5) +
    labs(
        title = "Birth Rate per Country by Region",
        x = "Region",
        y = "Birth Rate",
        colour = "Region"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
        axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
        legend.position = "none"
    )

This graph shows a clear pattern: the birth rate is strongly linked to the region, as the following table of regional averages confirms:

Region	Average
Europe	9.79
Americas	14.91
Asia	17.47
Oceania	22.38
Africa	30.72
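A table of this kind can be obtained with group_by() and summarise(). The sketch below does so on the six rows of the joined table printed earlier, so its averages differ from those above, which are computed over all countries. On the full data, dat would replace the small toy tibble.

```r
library(dplyr)

# Six rows copied from the joined table printed earlier
toy <- tibble(
    region = c("Asia", "Europe", "Africa", "Europe", "Africa", "Americas"),
    rate = c(36, 10, 22, 6, 39, 12)
)

# Average birth rate per region, sorted from lowest to highest
by_region <- toy |>
    group_by(region) |>
    summarise(Average = round(mean(rate), 2)) |>
    arrange(Average)

by_region
```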

Similarly, let us consider the graph of the proportion of internet users per country by region:

ggplot(dat, aes(x = reorder(region, -rate), y = proportion)) +
    geom_point(aes(colour = region), position = "jitter") + 
    geom_boxplot(aes(colour = region), alpha = 0.5) +
    labs(
       title = "Proportion of Internet Users per Country by Region",
       x = "Region",
       y = "Proportion of Internet Users per Country",
       colour = "Region"
    ) +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
          axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
          axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
          legend.position = "none")

This graph also shows a clear pattern, which is the reverse of the previous one, with the sole exception of Asia: the highest internet usage is in Europe and the lowest in Africa, with the averages summarised in the table below.

Region	Average
Africa	25.37
Oceania	45.30
Americas	60.10
Asia	63.65
Europe	83.59

Hence, with the sole exception of Asia, where the data are more scattered and the average proportion of internet users is higher than expected, internet usage clearly depends on the region.

Conclusions

The analysis of the data scraped from Wikipedia, which refer to the years around 2022, shows a clear negative association between birth rate and proportion of internet users, one that depends only loosely on the income group.

Further analysis shows that both variables are strongly linked to the region (Africa, Americas, Asia, Europe, Oceania): the birth rate is very high in Africa and very low in Europe and, conversely, the proportion of internet users is much smaller in Africa than in Europe.

In regions such as Asia and the Americas, however, the graphs show that the data are more scattered and the correlation is weaker, whereas in Africa and Europe it is stronger. In particular, in virtually all European countries the birth rate is very low and the proportion of internet users is very high, which suggests a strong correlation between these two variables.