On the Correlation between the Proportion of Internet Users and the Birth Rate per Country
Scraping Data from Wikipedia
Introduction
In the following I am going to scrape Wikipedia to get the birth rate of each country, the percentage of internet usage per country and an estimate of the income group of each country, in order to establish whether there is any relation between internet usage and birth rate, and whether this correlation is stronger or weaker depending on the income group of each country.
Data collection
The birth rate per country can be found on the Wikipedia page “List of countries by birth rate”.
The percentage of internet usage can be found on the Wikipedia page “List of countries by number of Internet users”.
To estimate the income group we can use the GDP estimated for each country and available on the Wikipedia page “List of countries by GDP (nominal)”.
In order to scrape the data from these pages, we shall need the following packages:
library(tidyverse)
library(rvest)
Let us first save the URL of each page into a variable, and then save the corresponding html page in a different variable.
# Url of each page
url1 <- "https://en.wikipedia.org/wiki/List_of_countries_by_birth_rate"
url2 <- "https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users"
url3 <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
# Saving the content of each page in a variable
html1 <- read_html(url1)
html2 <- read_html(url2)
html3 <- read_html(url3)
We can now use rvest functions to get the required tables from the web pages.
df1 <- html1 |>
  html_element("table") |>
  html_table()
df2 <- html2 |>
  html_element(".sortable") |>
  html_table()
df3 <- html3 |>
  html_element(".wikitable") |>
  html_table(header = FALSE) |>
  slice(c(-1, -3))
Notice that the first two tables are easier to collect, but the third table is more complicated, because its headers are split over two rows. This will require some extra work to tidy df3 before we can join the three tables together.
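The two-row header problem can be reproduced offline. The following is a minimal sketch on a made-up HTML fragment (the table contents and the names `page`, `raw` and `toy` are illustrative assumptions, not taken from Wikipedia). It shows how `html_table(header = FALSE)` keeps both header rows as ordinary data rows, so that the unwanted one can be dropped with `slice()`.

```r
library(rvest)
library(dplyr)

# A made-up table whose header is spread over two rows
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Country</th><th>Source A</th><th>Source A</th></tr>
    <tr><th>Country</th><th>Estimate</th><th>Year</th></tr>
    <tr><td>Alpha</td><td>1,200</td><td>2022</td></tr>
    <tr><td>Beta</td><td>3,400</td><td>2021</td></tr>
  </table>')

# With header = FALSE the columns are named X1, X2, ... and
# both header rows are kept as data
raw <- page |>
  html_element(".wikitable") |>
  html_table(header = FALSE)

# Drop the first header row; the second one now sits in row 1
toy <- raw |> slice(-1)
print(toy)
```

Reading the table without headers avoids having to decide, at scraping time, which of the two rows should provide the column names.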
First cleansing of the data
First table
Let us first have a look at the table:
# A tibble: 234 × 3
`Country/territory` `PRB2022[1]` `CIA WF2023[2]`
<chr> <int> <dbl>
1 Afghanistan 36 34.8
2 Albania 10 12.5
3 Algeria 22 17.8
4 Andorra 6 6.87
5 Angola 39 41.4
6 Antigua and Barbuda 12 15.0
7 Argentina 14 15.4
8 Armenia 12 10.8
9 Australia 12 12.2
10 Austria 10 9.39
# ℹ 224 more rows
The first table is simple to tidy: we only want to change the names of the columns, and to take the first and second column.
tidy_df_1 <- df1 |>
  rename(country = 1, rate = 2) |>
  select(country, rate) |>
  arrange(country)  # unquoted: a quoted "country" is a constant and would not sort
Second table
Let us first have a look at the table:
# A tibble: 214 × 8
`Country or area` Subregion Region `Internet users` Pct
<chr> <chr> <chr> <chr> <chr>
1 China Eastern Asia Asia 1,089,140,000 76.4%
2 India Southern Asia Asia 881,250,000 62.6%
3 United States Northern America Americas 311,300,000 92.4%
4 Indonesia South-eastern Asia Asia 215,626,156 78.8%
5 Brazil South America Americas 165,300,000 77.1%
6 Nigeria Western Africa Africa 136,203,231 63.8%
7 Pakistan Southern Asia Asia 130,000,000 60.8%
8 Russia Eastern Europe Europe 129,800,000 89.5%
9 Bangladesh Southern Asia Asia 126,210,000 75.9%
10 Japan Eastern Asia Asia 117,400,000 94.2%
# ℹ 204 more rows
# ℹ 3 more variables:
# `Population.mw-parser-output .nobold{font-weight:normal}(2021)[10][11]` <chr>,
# Sources <chr>, Year <int>
The second table needs a little more work. We first select the country name, region, number of internet users and population, and rename these columns. We then remove the “,” thousands separators and convert the columns Internet users and Population to integers.
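Before applying this cleaning step to the scraped table, it can be illustrated on a few strings. This is only a sketch: the first two values are copied from the printout above, the Andorra figures are the ones that appear in the final table later in this article, and the names `raw_counts`, `counts`, `andorra`, `too_big` and `as_double` are introduced here for illustration.

```r
library(stringr)

# Counts as they appear on the page, with thousands separators
raw_counts <- c("1,089,140,000", "881,250,000", "76,095")

# Remove the separators, then convert to integer
counts <- as.integer(str_remove_all(raw_counts, ","))
print(counts)

# Proportion of internet users for Andorra: users / population, in percent
andorra <- 100 * round(76095 / 79034, 4)
print(andorra)

# Caveat: R integers are 32-bit, so values above 2,147,483,647 become NA
too_big <- suppressWarnings(as.integer("3000000000"))
# Converting to double instead avoids the overflow
as_double <- as.numeric("3000000000")
```

All the counts in this data set fit into a 32-bit integer, so `as.integer()` is safe here; `as.numeric()` would be the more defensive choice for larger figures.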
tidy_df_2 <- df2 |>
  select(1, 3, 4, 6) |>
  rename(country = 1, region = 2, users = 3, population = 4) |>
  # strip the thousands separators, then convert to integer
  mutate(across(c(users, population), function(x) as.integer(str_remove_all(x, ",")))) |>
  # percentage of internet users, rounded to two decimal places
  mutate(proportion = 100 * round(users / population, 4))
Third table
Let us first have a look at the table:
# A tibble: 215 × 8
X1 X2 X3 X4 X5 X6 X7 X8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Country/Territory UN region Forecast Year Estimate Year Esti… Year
2 United States Americas 26,949,643 2023 25,462,… 2022 23,3… 2021
3 European Union[n 3] Europe 18,351,127 2023 16,746,… 2022 — —
4 China Asia 17,700,899 [n 1]2023 17,963,… [n 4… 17,7… [n 1…
5 Germany Europe 4,429,838 2023 4,072,1… 2022 4,25… 2021
6 Japan Asia 4,230,862 2023 4,231,1… 2022 4,94… 2021
7 India Asia 3,732,224 2023 3,385,0… 2022 3,20… 2021
8 United Kingdom Europe 3,332,059 2023 3,070,6… 2022 3,13… 2021
9 France Europe 3,049,016 2023 2,782,9… 2022 2,95… 2021
10 Italy Europe 2,186,082 2023 2,010,4… 2022 2,10… 2021
# ℹ 205 more rows
As we can see, this table is messier: the column names are stored in the first row, and several columns share the same name, so using the first row as column names would raise an error in tibble. However, we can sidestep the problem by selecting only the columns of interest by position: column 1 (the country) and column 7 (a GDP estimate). The aim is a GDP figure close to 2022, the year to which the birth rate and the number of internet users refer; note that in the printout above the year attached to column 7 (shown in column 8) is mostly 2021.
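If more than one column were needed, an alternative to positional selection would be to make the repeated header names unique before using them. The sketch below does this with base R's `make.unique()` on a made-up table shaped like the one above; the values and the names `raw`, `header` and `named` are assumptions for illustration, and this is not the approach taken in the rest of this article.

```r
library(tibble)

# Made-up table: row 1 holds the headers, and two names are repeated
raw <- tibble(
  X1 = c("Country", "Alpha", "Beta"),
  X2 = c("Estimate", "1,200", "3,400"),
  X3 = c("Year", "2022", "2022"),
  X4 = c("Estimate", "1,100", "3,300"),
  X5 = c("Year", "2021", "2021")
)

# Turn the first row into unique names: repeats get a numeric suffix
header <- make.unique(as.character(unlist(raw[1, ])))

# Drop the header row and apply the repaired names
named <- raw[-1, ]
names(named) <- header
print(named)
```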
tidy_df_3 <- df3 |>
  select(1, 7) |>
  slice(-1) |>
  rename(country = 1, GDP = 2) |>
  mutate(GDP = str_remove_all(GDP, ",")) |>
  mutate(GDP = as.integer(GDP)) |>
  drop_na()
Joining the three tables in one single data set
Once we have tidied up the three tables, we need to join them into one single data set. Since we wish only to compare countries for which we have all data, we shall use the inner_join function and drop incomplete cases.
dat <- tidy_df_1 |>
  inner_join(tidy_df_2, by = join_by(country), unmatched = "drop") |>
  inner_join(tidy_df_3, by = join_by(country), unmatched = "drop") |>
  relocate(country, region, GDP, rate, population, users, proportion) |>
  drop_na()
Let us have a look at the final table.
| country | region | GDP | rate | population | users | proportion |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 14939 | 36 | 40099462 | 4068194 | 10.15 |
| Albania | Europe | 18260 | 10 | 2854710 | 2105339 | 73.75 |
| Algeria | Africa | 163473 | 22 | 44177969 | 26350000 | 59.65 |
| Andorra | Europe | 3325 | 6 | 79034 | 76095 | 96.28 |
| Angola | Africa | 70533 | 39 | 34503774 | 4271053 | 12.38 |
| Antigua and Barbuda | Americas | 1421 | 12 | 93219 | 77529 | 83.17 |
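Because an inner join silently drops every country whose name is not spelled identically in all tables (for instance names carrying a footnote marker, such as “European Union[n 3]” in the third table), it may be worth checking which rows were lost. The following is a minimal sketch on made-up tables; `rates`, `gdp`, `lost` and the cleaning rule are illustrative assumptions, not part of the analysis above.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for two tidied tables; all values are made up
rates <- tibble(country = c("Alpha", "Beta", "Gamma"), rate = c(10, 20, 30))
gdp <- tibble(country = c("Alpha", "Beta[n 1]", "Delta"), GDP = c(100L, 200L, 300L))

# Only exact name matches survive the inner join
joined <- inner_join(rates, gdp, by = join_by(country))

# anti_join lists the rows of the first table with no match in the second
lost <- anti_join(rates, gdp, by = join_by(country))
print(lost)

# One possible repair: strip a trailing bracketed footnote marker
gdp_clean <- gdp |> mutate(country = str_remove(country, "\\[.*\\]$"))
joined_clean <- inner_join(rates, gdp_clean, by = join_by(country))
print(joined_clean)
```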
Establishing the Income Group
A good way to establish the income group of a country is to consider the GDP per inhabitant, that is, the ratio between the GDP and the population. A small country will usually have a smaller total GDP than a big country, but relative to the number of inhabitants, the GDP of a small country with a strong economy may be much higher than that of a big country with a weak economy. Let us therefore add this column to the data set.
dat <- dat |> mutate(gpi = GDP / population)
Now we can calculate the quartiles of the gpi (GDP per inhabitant) and divide the countries into the following four categories
Low Income
Lower Middle Income
Upper Middle Income
High Income
depending on the ranking of the gpi of the country.
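The quartile split can be tried on a toy vector first. This sketch uses made-up values and an equivalent chain of conditions; turning the result into a factor with explicit levels is an addition of this sketch (an assumption about what is convenient, not something done in this article), so that the groups are ordered by income rather than alphabetically in legends and facets.

```r
library(dplyr)

# Made-up GDP-per-inhabitant values
toy <- tibble(gpi = c(1, 2, 3, 4, 5, 6, 7, 8))

# The three quartile boundaries
qu <- quantile(toy$gpi, probs = c(0.25, 0.50, 0.75))

levels_in_order <- c("Low Income", "Lower Middle Income",
                     "Upper Middle Income", "High Income")

toy <- toy |>
  mutate(
    # case_when uses the first condition that is TRUE for each row
    IncomeGroup = case_when(
      gpi < qu[1] ~ "Low Income",
      gpi <= qu[2] ~ "Lower Middle Income",
      gpi <= qu[3] ~ "Upper Middle Income",
      TRUE ~ "High Income"
    ),
    # explicit level order, from lowest to highest income
    IncomeGroup = factor(IncomeGroup, levels = levels_in_order)
  )

# Each quartile group should hold a quarter of the rows
counts <- count(toy, IncomeGroup)
print(counts)
```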
qu <- quantile(dat$gpi, probs = c(0.25, 0.50, 0.75))
dat <- dat |>
  mutate(IncomeGroup = case_when(
    gpi < qu[1] ~ "Low Income",
    between(gpi, qu[1], qu[2]) ~ "Lower Middle Income",
    between(gpi, qu[2], qu[3]) ~ "Upper Middle Income",
    gpi > qu[3] ~ "High Income"
  ))
Data Visualisation
We can now proceed to analyse the data, to see whether there is any evidence of correlation between the proportion of internet users and the birth rate per country. We shall do this using ggplot2 to visualise the data in a scatter plot.
ggplot(dat, aes(x = proportion, y = rate)) +
  geom_point(aes(colour = IncomeGroup)) +
  labs(
    title = "Birth Rate against Proportion of Internet Users per Country",
    x = "Proportion of Internet Users",
    y = "Birth Rate",
    colour = "Income Group") +
  scale_x_continuous(breaks = seq(0, 150, 10)) +
  scale_y_continuous(breaks = seq(0, 50, 5)) +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
        axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)))
Let us now look at the same problem from a different perspective. First of all, let us break down the graph by income group and by region:
ggplot(dat, aes(x = proportion, y = rate)) +
  geom_point(aes(colour = IncomeGroup)) +
  facet_grid(IncomeGroup ~ region) +
  labs(
    title = "Birth Rate vs Proportion of Internet Users per Country by Region",
    x = "Proportion of Internet Users",
    y = "Birth Rate",
    colour = "Income Group"
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
        axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
        legend.position = "none")
Here some clear patterns emerge. Reading the panels column by column, Europe has a high proportion of internet users, whilst Africa has a low one. Europe also has a low birth rate overall, and to some extent so do the Americas, whereas the other regions are more scattered. This suggests that the correlation between birth rate and internet usage is not accidental, but is linked to the region in which each country lies.
To check this claim, let us consider two more graphs: the birth rate by region and the proportion of internet users by region.
ggplot(dat, aes(x = reorder(region, -rate), y = rate)) +
  geom_point(aes(colour = region), position = "jitter") +
  geom_boxplot(aes(colour = region), alpha = 0.5) +
  labs(
    title = "Birth Rate per Country by Region",
    x = "Region",
    y = "Birth Rate",
    colour = "Region"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
    axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
    legend.position = "none"
  )
This graph now shows a clear pattern: the birth rate is strongly linked to the region, as shown by the following table of regional averages:
| Region | Average birth rate |
|---|---|
| Europe | 9.79 |
| Americas | 14.91 |
| Asia | 17.47 |
| Oceania | 22.38 |
| Africa | 30.72 |
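The code that produced these averages is not shown; one plausible way to compute them is a grouped summary. The sketch below is run on the first six rows of the final table shown earlier only, so its output is illustrative and does not reproduce the figures above; `sample_dat` and `region_avg` are names introduced here.

```r
library(dplyr)

# First six rows of the final table shown earlier in this article
sample_dat <- tibble(
  country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  region = c("Asia", "Europe", "Africa", "Europe", "Africa", "Americas"),
  rate = c(36, 10, 22, 6, 39, 12)
)

# Mean birth rate per region, sorted from lowest to highest
region_avg <- sample_dat |>
  group_by(region) |>
  summarise(average = round(mean(rate), 2), n = n()) |>
  arrange(average)
print(region_avg)
```

Replacing `sample_dat` with the full data set would give the averages over all countries; the same summary works for any other numeric column.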
Similarly, let us consider the graph of the proportion of internet users per country by region:
ggplot(dat, aes(x = reorder(region, -rate), y = proportion)) +
  geom_point(aes(colour = region), position = "jitter") +
  geom_boxplot(aes(colour = region), alpha = 0.5) +
  labs(
    title = "Proportion of Internet Users per Country by Region",
    x = "Region",
    y = "Proportion of Internet Users per Country",
    colour = "Region"
  ) +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        axis.title.x = element_text(margin = margin(t = 10, b = 0, l = 0, r = 0)),
        axis.title.y = element_text(margin = margin(t = 0, b = 0, l = 0, r = 10)),
        legend.position = "none")
This graph also shows a clear pattern, which is the opposite of the previous one, with the only exception of Asia: the highest internet usage is in Europe and the lowest is in Africa, with the averages summarised in the table below.
| Region | Average proportion of internet users (%) |
|---|---|
| Africa | 25.37 |
| Oceania | 45.30 |
| Americas | 60.10 |
| Asia | 63.65 |
| Europe | 83.59 |
Hence, with the only exception of Asia, where the data are more scattered and the average proportion of internet users is higher than expected, internet usage clearly depends on the region.
Conclusions
From the analysis carried out on the data scraped from Wikipedia, relative to the year 2022, there is a clear negative correlation between birth rate and proportion of internet users, which depends only loosely on the income group.
On further analysis, it turns out that both variables are strongly associated with the region (Africa, Americas, Asia, Europe, Oceania): the birth rate is very high in Africa and very low in Europe and, conversely, the proportion of internet users is much smaller in Africa than in Europe.
In regions such as Asia and the Americas, however, the graphs show that the data are more scattered and the correlation is weaker, whereas in Africa and Europe it is stronger. In particular, virtually all countries in Europe have a very low birth rate and a very high proportion of internet users, suggesting a strong association between these two variables.
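The correlation discussed above is read off the graphs; it could also be quantified with a correlation coefficient. The sketch below uses base R's `cor()` on the first six rows of the final table shown earlier, so it only illustrates the computation and is not a result for the full data set; `sample_dat`, `r_pearson` and `r_spearman` are names introduced here.

```r
# First six rows of the final table shown earlier in this article
sample_dat <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Angola",
              "Antigua and Barbuda"),
  rate = c(36, 10, 22, 6, 39, 12),
  proportion = c(10.15, 73.75, 59.65, 96.28, 12.38, 83.17)
)

# Pearson correlation: strength of the linear relationship
r_pearson <- cor(sample_dat$rate, sample_dat$proportion)

# Spearman correlation: based on ranks, less sensitive to outliers
r_spearman <- cor(sample_dat$rate, sample_dat$proportion, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

A negative coefficient means that a higher proportion of internet users goes together with a lower birth rate. Computing the coefficient within each region or income group would show how much of the overall correlation remains once the grouping is taken into account.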