Ethical Web Scraping Assignment: Worldometer Populations

Web Scraping: Worldometer Country Population Analysis

(Using an HTML website table)

Part 1: My interest

I have always found population data interesting, especially as birth rates fluctuate across the world and resource and energy scarcity continue to shape political and ethical discussions. Since the 1400s, the world population has increased by about 7.5 billion people, and during that time population density across countries has also shifted as different groups migrated out of what is now the Middle East and Africa. I would like to use the data set provided by Worldometer to explore which countries make up most of the world's population (besides China and India) and how these numbers compare to the actual size of each country. I would also like to know whether a lower median age for a country correlates with a higher fertility rate, and whether fertility rates are distributed the same way across country sizes.

Part 2: My inquiry

Which countries have the ten highest populations, and how do these high numbers compare with the countries' relative sizes?
Is there a relationship between country median age and country fertility rate?
Do fertility rate values have the same distribution when grouped by country size?

Part 3: The process

First, I scraped data from https://www.worldometers.info/world-population/population-by-country/ using the structured table method and successfully grabbed all 13 columns. Most of the columns were stored as character strings, so I removed the unnecessary commas and converted them to numeric. I also changed the form of the negative signs so that those values could be read as numeric. I then put all of my scraping and cleaning code into one function so the data could be loaded as a data frame in a single step.
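The cleaning step described above could look roughly like the sketch below. The helper name `clean_numeric` is hypothetical, and two details are assumptions rather than facts from this report: that negative values arrive with a typographic minus sign (U+2212), and that some columns carry a trailing percent sign.

```r
# Hypothetical helper: turn scraped number strings into numeric values.
clean_numeric <- function(x) {
  # Assumption: negatives use the typographic minus (U+2212); swap in "-".
  x <- gsub("\u2212", "-", x, fixed = TRUE)
  # Drop thousands separators, plus percent signs and spaces (assumed present).
  x <- gsub("[,%[:space:]]", "", x)
  as.numeric(x)
}

# Made-up sample strings in the style of a scraped table column.
clean_numeric(c("1,234,567", "0.89 %", "\u22120.23 %"))
```

Applying a helper like this to each character column inside the scraping function keeps the conversion rules in one place.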

To begin my inquiry analysis, I added another categorical variable that sorts each country into a size group by area: tiny, small, medium, large, or massive. With this column I could easily see the distribution of countries of different sizes and how population relates to general size.
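One way to build such a size column is with `cut()`. This is only a sketch: the cut-off values, the column name `land_area_km2`, and the toy areas are all assumptions, since the report does not state the thresholds it used.

```r
# Toy land areas in square km; values and column name are illustrative only.
areas <- data.frame(land_area_km2 = c(300, 50000, 500000, 3000000, 16000000))

# Assumed cut-offs (not from the report); each interval is (lower, upper].
areas$area_category <- cut(
  areas$land_area_km2,
  breaks = c(0, 1e3, 1e5, 1e6, 5e6, Inf),
  labels = c("tiny", "small", "medium", "large", "massive"),
  include.lowest = TRUE
)

# Count how many countries fall into each size group.
table(areas$area_category)
```

Because `cut()` returns an ordered set of factor levels, plots faceted or grouped by this column keep the tiny-to-massive order rather than sorting alphabetically.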

Part 4: CSV file load
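The loading code is not shown in this report. As a minimal sketch, assuming the scraped table was saved with headers such as "Country (or dependency)" and "Fert. Rate", a CSV round trip with `read.csv()` would look like this. The temporary file and the toy values stand in for the real file and data.

```r
# Toy table with assumed original headers; the values are made up.
raw <- data.frame(
  "Country (or dependency)" = c("A", "B"),
  "Fert. Rate" = c(1.9, 4.3),
  "Median Age" = c(38, 18),
  check.names = FALSE
)

csv_path <- tempfile(fileext = ".csv")  # stand-in for the real file path
write.csv(raw, csv_path, row.names = FALSE)

# read.csv() defaults to check.names = TRUE, so headers go through make.names(),
# which replaces spaces and parentheses with dots.
countries_demo <- read.csv(csv_path)
names(countries_demo)
```

This renaming is consistent with the dotted column names used in the plotting code, such as `Country..or.dependency.` and `Fert..Rate`.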

Part 5: Inquiry Visualizations

Viz 1: For my first visualization, I made a smaller data frame with just the 10 most populated countries. I then passed it to ggplot, using geom_bar to plot each country as a column with population on the y axis.

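# Bar chart of 2025 population for the ten most populated countries
top_10_countries_df %>% 
  ggplot(aes(x = Country..or.dependency., y = Population.2025)) +
  geom_bar(stat = "summary", fun = sum) +        # one bar per country
  scale_y_continuous(labels = scales::number) +  # plain numbers, not scientific notation
  labs(title = "Top 10 populations by country",
       x = "Country",
       y = "Population")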
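The report does not show how the top 10 data frame was built. A base R sketch of one possible approach follows; `countries_demo` and its population values are invented stand-ins for `countries_df`.

```r
# Toy stand-in for countries_df; populations are invented for illustration.
countries_demo <- data.frame(
  Country..or.dependency. = paste("Country", LETTERS[1:12]),
  Population.2025 = c(50, 1200, 30, 900, 75, 410, 8, 260, 130, 15, 640, 95)
)

# Sort by population, largest first, then keep the first ten rows.
ranked <- countries_demo[order(countries_demo$Population.2025, decreasing = TRUE), ]
top_10_demo <- head(ranked, 10)

top_10_demo$Country..or.dependency.
```

With dplyr loaded, `slice_max(countries_df, Population.2025, n = 10)` is an equivalent one-step alternative.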

Viz 2: For this visualization, I used the same smaller data frame with just the ten countries. I created essentially the same graph as above but used facet_wrap to compare the area category of each country. The top 5 most populated countries are classified as having either medium or large areas.

top_10_countries_df %>% 
  ggplot(aes(x = Country..or.dependency., y = Population.2025)) +
  geom_col() +
  facet_wrap(~area_category, nrow = 2) +
  labs(
    title = "Top 10 Populations by Country",
    x = "Country",
    y = "Population"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Viz 3: Before building visualization 3, I grouped the top 10 data frame by area category and piped the groups into summarise to get each category's average population density (people per km²). I then graphed the averages using geom_col.

#| echo: true
#| eval: false

# Average population density per area category, top 10 countries only
top_10_countries_df %>% 
  group_by(area_category) %>%
  summarise(avg_ppKm = mean(Density..P.Km..)) %>%
  ggplot(aes(x = area_category, y = avg_ppKm)) +
  geom_col(fill = "blue") +
  labs(title = "Average population density by area category (top 10 countries)",
       x = "Area category",
       y = "People per km²")

Viz 4: Visualization 4 is an almost exact replica of the prior graph; however, I wanted to see if there would be a similar story using every country from the original countries_df.

#| echo: true
#| eval: false

# Average population density per area category, all countries
countries_df %>%
  group_by(area_category) %>%
  summarise(avg_ppKm = mean(Density..P.Km..)) %>%
  ggplot(aes(x = area_category, y = avg_ppKm)) +
  geom_col(fill = "red") +
  labs(title = "Average population density by area category (all countries)",
       x = "Area category",
       y = "People per km²")

Viz 5: For this visualization, I piped the top 10 countries data frame into ggplot to see whether there was a relationship between median age and fertility rate. The plot supported my hypothesis that fertility rate goes up as median age decreases.

#| echo: true
#| eval: false

top_10_countries_df %>%
  ggplot(aes(x = Fert..Rate, y = Median.Age)) +
  geom_point(color = "red") + 
  geom_smooth(method = "lm", se = TRUE) +
  labs( title = "Top 10 populated countries median age compared to fertility", 
        x = "Fertility Rate",
        y = "Median Age")

`geom_smooth()` using formula = 'y ~ x'
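The trend line gives a visual impression of the relationship. A correlation coefficient puts a number on it. The sketch below uses base R's `cor.test()` on synthetic values that are invented for illustration and are not taken from the scraped table.

```r
# Synthetic values for illustration only; not from the scraped data.
demo <- data.frame(
  Median.Age = c(18, 20, 25, 29, 33, 38, 40),
  Fert..Rate = c(4.5, 4.0, 3.0, 2.3, 2.0, 1.6, 1.4)
)

# Pearson correlation with a significance test; the estimate lies in [-1, 1],
# and a negative value means fertility falls as median age rises.
result <- cor.test(demo$Median.Age, demo$Fert..Rate)
result$estimate
result$p.value
```

Running the same call on `top_10_countries_df` would show how strong the relationship is in the real data, keeping in mind that ten countries is a small sample.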

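Viz 6: For my last visualization, I plotted fertility rate against population for every country in countries_df and used facet_wrap to split the points by area category, so I could see how fertility rate varied across country sizes.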

#| echo: true
#| eval: false

countries_df %>%
  ggplot(aes(x = Fert..Rate, y = Population.2025)) +
  geom_point(color = "red") +
  scale_y_continuous(labels = scales::number) +
  facet_wrap(~area_category, nrow = 2) +  # trailing + so labs() is added to the plot
  labs(title = "fertility rate by population and country size",
       x = "fertility rate",
       y = "Population")
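To compare how widely fertility rates vary within each size group, the spread can also be computed directly. This is a sketch on invented values; only the column names `area_category` and `Fert..Rate` come from the plotting code.

```r
# Invented fertility rates for three size groups, for illustration only.
demo <- data.frame(
  area_category = c("tiny", "tiny", "small", "small", "large", "large"),
  Fert..Rate = c(1.2, 4.8, 1.5, 5.5, 1.6, 2.4)
)

# Spread (max minus min) of fertility rate within each area category.
spread <- tapply(demo$Fert..Rate, demo$area_category,
                 function(x) diff(range(x)))
spread
```

A larger spread for a category means its countries cover a wider range of fertility rates, which is the pattern the faceted scatter plot is meant to reveal.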

Summary of Findings:

The first inquiry into my data was simply to find which countries in my data frame had the highest populations. In order from most to least populated, these were India, China, the United States, Indonesia, Pakistan, Nigeria, Brazil, Bangladesh, Ethiopia, and Russia. I fully expected China to be on the list but had no idea of the scale by which China and India outpace all the other countries. Even within the top 10, China and India each have about 5 to 6 times as many people as most of the other 8 countries. Size-wise, the top 5 most populated countries fit into either the large or the medium category. Even though Russia is the globe's largest country, it was last on the list.

From my third and fourth visualizations, which used both data frames in my environment, I found that population density was higher among smaller countries. I had expected this, since small countries have less uninhabited land.

With my fifth visualization, I identified a strong negative correlation between median age and fertility rate, suggesting that as median age increases, even only into the 30s and 40s, fertility rate drops. Across the world, most kids are likely being had by parents in their late teens or early twenties, although the data does not record parents' ages directly. In America, having kids at a young age may be becoming less popular, but it is still common across the world.

For my last visualization, I wanted to see how fertility rate varied across country size categories. The larger countries actually had lower fertility rates overall, while medium, small, and tiny countries had a wider range of fertility rate values. In these graphs you can see India and China as population outliers, but to my surprise, in 2025 they have lower fertility rates than most countries.