On the Correlation between the Proportion of Internet Users and the Birth Rate per Country

Scraping Data from Wikipedia

Author

Andrea Carpignani

Published

20 January 2024

Introduction

In what follows I scrape Wikipedia for three tables: the birth rate of each country, the number of internet users per country, and the nominal GDP of each country, from which I estimate an Income Group. The aim is to establish whether there is any relation between internet usage and birth rate, and whether this relation is stronger or weaker depending on the Income Group of each country.

Data collection

In order to scrape the data from these pages, we shall need the following packages:

library(tidyverse)
library(rvest)

Let us first save the URL of each page into a variable, and then save the corresponding html page in a different variable.

# Url of each page
url1 <- "https://en.wikipedia.org/wiki/List_of_countries_by_birth_rate"
url2 <- "https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users"
url3 <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Saving the content of each page in a variable
html1 <- read_html(url1)
html2 <- read_html(url2)
html3 <- read_html(url3)

We can now utilise rvest functions to get the required tables from the web pages.

df1 <- html1 |>
    html_element("table") |>
    html_table()

df2 <- html2 |> 
    html_element(".sortable") |>
    html_table()

df3 <- html3 |> 
    html_element(".wikitable") |>
    html_table(header = FALSE) |>
    slice(c(-1,-3))

Notice that the first two tables are easy to collect, whereas the third is more complicated, because its header is spread over two rows, which is why we read it with header = FALSE. This will require some extra work to tidy df3 before we can join the three tables together.
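To see what header = FALSE does with a two-row header, here is a minimal sketch on a hand-written HTML fragment. The table contents and the names Xland and Yland are made up for illustration and have nothing to do with the Wikipedia pages; minimal_html() is the rvest helper that builds a small page from a string.

```r
library(rvest)

# A made-up table whose header is spread over two rows (not real data)
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Country</th><th>Estimate</th><th>Estimate</th></tr>
    <tr><th>Name</th><th>Source A</th><th>Source B</th></tr>
    <tr><td>Xland</td><td>1,200</td><td>1,100</td></tr>
    <tr><td>Yland</td><td>300</td><td>280</td></tr>
  </table>')

# With header = FALSE every row, header rows included, is kept as data
# and the columns receive the placeholder names X1, X2, X3
raw <- page |>
    html_element(".wikitable") |>
    html_table(header = FALSE)

# Drop the second header row by position; the first one stays as a label row
trimmed <- raw[-2, ]
```

Since the header rows are kept as ordinary rows, every column is read as text, which is why the numeric columns have to be converted by hand later on.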

First cleansing of the data

First table

Let us first have a look at the table:

# A tibble: 234 × 3
   `Country/territory` `PRB2022[1]` `CIA WF2023[2]`
   <chr>                      <int>           <dbl>
 1 Afghanistan                   36           34.8 
 2 Albania                       10           12.5 
 3 Algeria                       22           17.8 
 4 Andorra                        6            6.87
 5 Angola                        39           41.4 
 6 Antigua and Barbuda           12           15.0 
 7 Argentina                     14           15.4 
 8 Armenia                       12           10.8 
 9 Australia                     12           12.2 
10 Austria                       10            9.39
# ℹ 224 more rows

The first table is simple to tidy: we only need to keep the first two columns, rename them, and sort the rows by country.

tidy_df_1 <- df1 |>
    rename(country = 1, rate = 2) |>
    select(country, rate) |>
    arrange(country)  # bare column name: a quoted string would sort by a constant

Second table

Let us first have a look at the table:

# A tibble: 214 × 8
   `Country or area` Subregion          Region   `Internet users` Pct  
   <chr>             <chr>              <chr>    <chr>            <chr>
 1 China             Eastern Asia       Asia     1,089,140,000    76.4%
 2 India             Southern Asia      Asia     881,250,000      62.6%
 3 United States     Northern America   Americas 311,300,000      92.4%
 4 Indonesia         South-eastern Asia Asia     215,626,156      78.8%
 5 Brazil            South America      Americas 165,300,000      77.1%
 6 Nigeria           Western Africa     Africa   136,203,231      63.8%
 7 Pakistan          Southern Asia      Asia     130,000,000      60.8%
 8 Russia            Eastern Europe     Europe   129,800,000      89.5%
 9 Bangladesh        Southern Asia      Asia     126,210,000      75.9%
10 Japan             Eastern Asia       Asia     117,400,000      94.2%
# ℹ 204 more rows
# ℹ 3 more variables:
#   `Population.mw-parser-output .nobold{font-weight:normal}(2021)[10][11]` <chr>,
#   Sources <chr>, Year <int>

The second table requires a little more work. We keep only the country name, region, number of internet users and population, and rename these columns. The last two are stored as text, with “,” as the thousands separator, so we remove the commas and convert both columns to integers. Finally, we compute the proportion of internet users as a percentage of the population.

tidy_df_2 <- df2 |>
    select(1, 3, 4, 6) |>
    rename(country = 1, region = 2, users = 3, population = 4) |>
    # Strip the thousands separators, then convert the counts to integers
    mutate(across(c(users, population), \(x) as.integer(str_remove_all(x, ",")))) |>
    mutate(proportion = 100 * round(users / population, 4))
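The cleaning step can be tried in isolation on a toy tibble. This is only a sketch: the two rows below are invented values, not figures from the Wikipedia table.

```r
library(dplyr)
library(stringr)

# Toy data imitating the scraped character columns (values are illustrative)
toy <- tibble(
    country    = c("A", "B"),
    users      = c("1,500,000", "250,000"),
    population = c("2,000,000", "1,000,000")
)

clean <- toy |>
    # Remove the thousands separators and convert the text to integers
    mutate(across(c(users, population),
                  \(x) as.integer(str_remove_all(x, ",")))) |>
    # Percentage of internet users, rounded to two decimal places
    mutate(proportion = 100 * round(users / population, 4))
```

One caveat on the design: as.integer() returns NA for values above .Machine$integer.max (2,147,483,647). The counts in this table stay below that limit, but as.numeric() would be the safer choice for larger figures.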

Third table

Let us first have a look at the table:

# A tibble: 215 × 8
   X1                  X2        X3         X4        X5       X6    X7    X8   
   <chr>               <chr>     <chr>      <chr>     <chr>    <chr> <chr> <chr>
 1 Country/Territory   UN region Forecast   Year      Estimate Year  Esti… Year 
 2 United States       Americas  26,949,643 2023      25,462,… 2022  23,3… 2021 
 3 European Union[n 3] Europe    18,351,127 2023      16,746,… 2022  —     —    
 4 China               Asia      17,700,899 [n 1]2023 17,963,… [n 4… 17,7… [n 1…
 5 Germany             Europe    4,429,838  2023      4,072,1… 2022  4,25… 2021 
 6 Japan               Asia      4,230,862  2023      4,231,1… 2022  4,94… 2021 
 7 India               Asia      3,732,224  2023      3,385,0… 2022  3,20… 2021 
 8 United Kingdom      Europe    3,332,059  2023      3,070,6… 2022  3,13… 2021 
 9 France              Europe    3,049,016  2023      2,782,9… 2022  2,95… 2021 
10 Italy               Europe    2,186,082  2023      2,010,4… 2022  2,10… 2021 
# ℹ 205 more rows

As we can see, this table is messier: the column names are stored in the first row, and several columns share the same name, so promoting the first row to column names would cause a duplicate-name error in tibble. We can sidestep the problem by selecting only the columns of interest: the country name (column 1) and the GDP estimate in column 7. In the printout above, the year attached to this estimate (column 8) is 2021, which is close to, though not identical with, the 2022 reference year of the birth rate.

tidy_df_3 <- df3 |>
    select(1, 7) |>
    slice(-1) |>
    rename(country = 1, GDP = 2) |>
    mutate(GDP = str_remove_all(GDP, ",")) |>
    # Non-numeric placeholders become NA here (with a warning) and are dropped below
    mutate(GDP = as.integer(GDP)) |>
    drop_na()
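An alternative to selecting columns by position is to promote the first row to column names after making the duplicates unique. The sketch below does this on an invented table with the same shape of problem; the country names and figures are hypothetical, and make.unique() is the base R function that appends .1, .2, ... to repeated names.

```r
library(dplyr)
library(tidyr)

# Invented table: header stored in the first row, with repeated names
raw <- tibble(
    X1 = c("Country/Territory", "Xland", "Yland", "Zland"),
    X2 = c("Estimate", "1,200", "310", "34,500"),
    X3 = c("Year", "2022", "2022", "2022"),
    X4 = c("Estimate", "1,100", "\u2014", "33,000"),  # "\u2014" is the dash placeholder
    X5 = c("Year", "2021", "\u2014", "2021")
)

# "Estimate" and "Year" occur twice, so the repeats become Estimate.1 and Year.1
header <- make.unique(as.character(unlist(raw[1, ])))

tidy <- raw |>
    slice(-1) |>
    setNames(header) |>
    select(country = 1, GDP = Estimate.1) |>
    # The dash placeholder cannot be parsed and turns into NA
    mutate(GDP = suppressWarnings(as.integer(gsub(",", "", GDP)))) |>
    drop_na()
```

Selecting by name makes it explicit which estimate is used, at the cost of a slightly longer pipeline.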

Joining the three tables in one single data set

Once we have tidied up the three tables, we need to join them into one single data set. Since we wish only to compare countries for which we have all data, we shall use the inner_join function and drop incomplete cases.

dat <- tidy_df_1 |>
    inner_join(tidy_df_2, by = join_by(country), unmatched = "drop") |>
    inner_join(tidy_df_3, by = join_by(country), unmatched = "drop") |>
    relocate(country, region, GDP, rate, population, users, proportion) |>
    drop_na()
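Because the join matches on the exact spelling of the country name, rows can be lost silently when two tables spell a name differently. A minimal sketch of how to list such rows with anti_join(), using made-up tables (the numbers are illustrative and the spelling mismatch is a hypothetical example):

```r
library(dplyr)

# Two made-up tables that spell one country differently
rates <- tibble(
    country = c("France", "Czech Republic", "Italy"),
    rate    = c(11, 10, 7)
)
users <- tibble(
    country    = c("France", "Czechia", "Italy"),
    proportion = c(85, 83, 75)
)

# Rows that survive the inner join
joined <- inner_join(rates, users, by = join_by(country))

# Rows of the first table with no partner in the second one
lost <- anti_join(rates, users, by = join_by(country))
```

Inspecting lost before the final join shows which names would need to be harmonised, for instance with a small recoding step.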

Let us have a look at the final table.

country              region    GDP     rate  population  users     proportion
Afghanistan          Asia      14939   36    40099462    4068194   10.15
Albania              Europe    18260   10    2854710     2105339   73.75
Algeria              Africa    163473  22    44177969    26350000  59.65
Andorra              Europe    3325    6     79034       76095     96.28
Angola               Africa    70533   39    34503774    4271053   12.38
Antigua and Barbuda  Americas  1421    12    93219       77529     83.17

Establishing the Income Group

A reasonable way to establish the income group of a country is to consider the GDP per inhabitant (GDP per capita), that is, the ratio between the GDP and the population. A small country will usually have a smaller GDP than a big one, but relative to the number of inhabitants, the GDP of a small country with a strong economy may be much higher than that of a big country with a weak economy. Let us therefore add this column to the data set.

dat <- dat |> mutate(gpi = GDP / population)

Now we can calculate the quartiles of the gpi (GDP per inhabitant), and divide the countries into the following four categories

  • Low Income

  • Lower Middle Income

  • Upper Middle Income

  • High Income

depending on the quartile in which the gpi of the country falls.

qu <- quantile(dat$gpi, probs = c(0.25, 0.50, 0.75))

dat <- dat |> 
    mutate(IncomeGroup = case_when(gpi < qu[1] ~ "Low Income",
                                   between(gpi, qu[1], qu[2]) ~ "Lower Middle Income",
                                   between(gpi, qu[2], qu[3]) ~ "Upper Middle Income",
                                   gpi > qu[3] ~ "High Income"))
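The same grouping can be sketched with cut(), which has the side benefit of returning an ordered factor, so that legends and facets follow the order Low to High rather than alphabetical order. The gpi values below are invented, and the treatment of values lying exactly on a quartile (assigned to the lower group here) is an assumption that differs slightly from the case_when() version.

```r
# Invented GDP-per-inhabitant values, for illustration only
gpi <- c(0.5, 1.2, 2.0, 3.5, 5.0, 8.0, 12.0, 40.0)

# The three quartiles used as cut points
qu <- quantile(gpi, probs = c(0.25, 0.50, 0.75))

# Intervals are (-Inf, Q1], (Q1, Q2], (Q2, Q3], (Q3, Inf)
income_group <- cut(
    gpi,
    breaks = c(-Inf, unname(qu), Inf),
    labels = c("Low Income", "Lower Middle Income",
               "Upper Middle Income", "High Income"),
    ordered_result = TRUE
)

# Number of countries per group
counts <- table(income_group)
```

With eight values and quartile cut points, each group receives two values.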

Data Visualisation

We can now proceed to analyse the data, to see whether there is any evidence of correlation between the proportion of internet users and the birth rate per country. We shall do this using ggplot2 to visualise the data in a scatter plot.

ggplot(dat, aes(x = proportion, y = rate)) +
    geom_point(aes(colour = IncomeGroup)) +
    labs(
        title = "Birth Rate against Proportion of Internet Users per Country",
        x = "Proportion of Internet Users",
        y = "Birth Rate",
        colour = "Income Group") +
    scale_x_continuous(breaks = seq(0, 150, 10)) +
    scale_y_continuous(breaks = seq(0, 50, 5)) +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
          axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
          axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)))
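A scatter plot shows the shape of the relation; to attach a number to it, one option is the Pearson correlation coefficient computed with cor(), both overall and within groups. The sketch below uses an invented data set with two hypothetical regions, so the coefficients it produces say nothing about the real data.

```r
library(dplyr)

# Invented data: two hypothetical regions, four countries each
toy <- tibble(
    region     = rep(c("North", "South"), each = 4),
    proportion = c(60, 70, 80, 90, 10, 20, 30, 40),
    rate       = c(14, 12, 11, 9, 38, 35, 31, 30)
)

# Pearson correlation over all countries
overall <- cor(toy$proportion, toy$rate)

# Pearson correlation within each region
by_region <- toy |>
    group_by(region) |>
    summarise(r = cor(proportion, rate))
```

Applied to dat, the same pattern with group_by(IncomeGroup) or group_by(region) would quantify how the strength of the relation changes from group to group.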

Let us now look at the same problem from a different perspective. First of all, let us break down the graph by income group and by region:

ggplot(dat, aes(x = proportion, y = rate)) +
    geom_point(aes(colour = IncomeGroup)) +
    facet_grid(IncomeGroup ~ region) +
    labs(
       title = "Birth Rate vs Proportion of Internet Users per Country by Region",
       x = "Proportion of Internet Users",
       y = "Birth Rate",
       colour = "Income Group"
    ) +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
          axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
          axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
          legend.position = "none")

Here some clear patterns emerge. Reading the panels column by column, it is immediately evident that Europe has a high proportion of internet users, whilst Africa has a low one. Europe also has an overall low birth rate, and to some extent so do the Americas, whereas the other regions are more scattered. This suggests that the correlation between birth rate and internet usage is not accidental, but is linked to the region in which each country lies.

To verify this claim, let us consider two more graphs: the birth rate and the proportion of internet users by region.

ggplot(dat, aes(x = reorder(region, -rate), y = rate)) +
    geom_point(aes(colour = region), position = "jitter") + 
    geom_boxplot(aes(colour = region), alpha = 0.5) +
    labs(
        title = "Birth Rate per Country by Region",
        x = "Region",
        y = "Birth Rate",
        colour = "Region"
    ) +
    theme(
        plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
        axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
        legend.position = "none"
    )

This graph shows a clear pattern: the birth rate is strongly linked to the region, as confirmed by the average birth rate per region in the following table:

Region    Average
Europe      9.79
Americas   14.91
Asia       17.47
Oceania    22.38
Africa     30.72
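A table of this kind can be obtained with group_by() and summarise(). The sketch below shows the pattern on an invented data set with hypothetical region names; it is one way to compute such averages, not necessarily the code used for the table above.

```r
library(dplyr)

# Invented birth rates for two hypothetical regions
toy <- tibble(
    region = c("North", "North", "South", "South", "South"),
    rate   = c(9, 11, 28, 30, 35)
)

# Average rate per region, rounded to two decimals and sorted ascending
averages <- toy |>
    group_by(region) |>
    summarise(Average = round(mean(rate), 2)) |>
    arrange(Average)
```

Replacing rate with proportion gives the corresponding table for internet usage.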

Similarly, let us consider the graph of the proportion of internet users per country by region:

ggplot(dat, aes(x = reorder(region, -rate), y = proportion)) +
    geom_point(aes(colour = region), position = "jitter") + 
    geom_boxplot(aes(colour = region), alpha = 0.5) +
    labs(
       title = "Proportion of Internet Users per Country by Region",
       x = "Region",
       y = "Proportion of Internet Users per Country",
       colour = "Region"
    ) +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
          axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
          axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
          legend.position = "none")

This graph also shows a clear pattern, which is the opposite of the previous one, with the only exception of Asia: in this case, the highest internet usage is in Europe and the lowest is in Africa, with the averages summarised in the table below.

Region    Average
Africa     25.37
Oceania    45.30
Americas   60.10
Asia       63.65
Europe     83.59

Hence, with the only exception of Asia, where the data are more scattered and the average proportion of internet users is higher than expected, internet usage clearly depends on the region.

Conclusions

From the analysis carried out on the data scraped from Wikipedia, referring to the years around 2022, the scatter plots show a clear negative correlation between birth rate and proportion of internet users, which depends only weakly on the income group.

On further analysis, it turns out that both variables are strongly linked to the region (Africa, Americas, Asia, Europe, Oceania): the birth rate is very high in Africa and very low in Europe and, conversely, the proportion of internet users is much smaller in Africa than in Europe.

In regions such as Asia and the Americas, however, the graphs show that the data are more scattered and the correlation is weaker, whereas in Africa and Europe it is stronger. In particular, in virtually all European countries the birth rate is very low and the proportion of internet users is very high, which is consistent with a strong link between these two variables.