Scraping & Analyzing World Population Data Using Python

Which Are the Most and Least Populous Countries and Continents?

Independent Data Analysis Project
Published: April 18, 2023

1 Background

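In this analysis, I scrape and analyze population and country data from three sites:

  1. https://en.wikipedia.org/wiki/World_population (world population milestones).

  2. https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population (population by country).

  3. https://cloford.com/resources/codes/index.htm (country codes, continents and regions).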

2 Objectives

The objectives of the analysis are as follows:

  1. To illustrate how to scrape data tables from websites using Python and Pandas.

  2. To demonstrate the mechanics of analyzing and plotting data in Python, Pandas and Matplotlib.

  3. To illustrate how to merge data in the Python module Pandas.

  4. To visualise the trend in global population from 1800 to 2050.

  5. To visualise world population by continent using the Python module Matplotlib.

I start by loading the required modules.

# if(!require(pacman)){
#   install.packages("pacman")
# }

# pacman::p_load(reticulate)

## Import required modules 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import pycountry as pc

3 Data

I start by scraping the data from the three sites. The first table is laid out horizontally, so I transpose it and convert some columns from strings to floats.

## World population 1800-2050
###################################################################
## Global population trends 
# help(pd.read_html)
scrape = pd.read_html('https://en.wikipedia.org/wiki/World_population', header=0)

## Transpose the data as it is horizontal
population1 = scrape[0].transpose()

## Drop the "Population" row (a label row, not data)
population1 = population1.drop(["Population"])

## View the type and index of population data
type(population1)
<class 'pandas.core.frame.DataFrame'>
population1.index
Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], dtype='object')

## Rename columns
population1.columns = ["year", "years_passed"]

## The index labels 1-10 are the population milestones in billions
population1['population'] = list(range(1, 11))

## Convert the year and population columns to numeric
population1['year'] = population1["year"].astype(float)
population1["population"] = population1["population"].astype(float)

population1.head()
     year years_passed  population
1  1804.0     200,000+         1.0
2  1930.0          126         2.0
3  1960.0           30         3.0
4  1974.0           14         4.0
5  1987.0           13         5.0
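
The `years_passed` column is still text, and its first entry (`200,000+`) carries a thousands separator and a plus sign. The following is a minimal sketch of one way to make it numeric. It uses a hand-typed copy of the five rows shown above rather than the live scrape, and it assumes every cell is a string. The name `years_passed_num` is illustrative.

```python
import pandas as pd

# Hand-typed copy of the five rows shown above (not a fresh scrape).
sample = pd.DataFrame({
    "year": [1804.0, 1930.0, 1960.0, 1974.0, 1987.0],
    "years_passed": ["200,000+", "126", "30", "14", "13"],
})

# Strip everything except digits, then convert; unparseable cells become NaN.
cleaned = sample["years_passed"].str.replace(r"[^0-9]", "", regex=True)
sample["years_passed_num"] = pd.to_numeric(cleaned, errors="coerce")

print(sample)
```

Note that `200,000+` is a lower bound, so the converted value for the first billion is only approximate.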

In the next scraping exercise, I scrape population data by country, which I later summarise by continent.

##################################################################
## Scrape world population by country 
population = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")

## Get the proper table 
population = population[1]

## View the columns 
population.columns
MultiIndex([(                                        'Rank', ...),
            (                        'Country / Dependency', ...),
            (                                  'Population', ...),
            (                                  'Population', ...),
            (                                        'Date', ...),
            ('Source (official or from the United Nations)', ...),
            (                                       'Notes', ...)],
           )

## Drop the top row
population = population.drop(index=0)

## Reset the row index to start at 0
population = population.reset_index()
population = population.drop("index", axis = "columns")
<string>:1: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.

## Rename the column
population = population.rename(columns = {"Country / Dependency": "Country"})
population.head()
  Rank        Country  ... Source (official or from the United Nations)      Notes
  Rank        Country  ... Source (official or from the United Nations)      Notes
0    1          China  ...                         Official estimate[4]        [b]
1    2          India  ...                       Official projection[5]  [c] [[1]]
2    3  United States  ...                 National population clock[6]        [d]
3    4      Indonesia  ...                         Official estimate[7]        NaN
4    5       Pakistan  ...                             UN projection[3]        [e]

[5 rows x 7 columns]

## Select a few columns
population = population[['Rank','Country', 'Population']]

## Rename the columns; selecting 'Population' returned both columns under that header
population.columns = ['rank', 'country', 'population', 'world_perc']
population.head()
  rank        country  population world_perc
0    1          China  1411750000        NaN
1    2          India  1392329000        NaN
2    3  United States   334640000        NaN
3    4      Indonesia   275773800        NaN
4    5       Pakistan   235825000        NaN
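
Selecting three labels produced four columns because the scraped table has a two-level header, and two columns share the top-level label `Population`. The sketch below reproduces that behaviour on a small stand-in table. The second-level labels (`Numbers`, `% of the world`) are hypothetical, since the real ones are elided in the output above.

```python
import pandas as pd

# Hypothetical two-level header imitating the structure of the scraped table.
columns = pd.MultiIndex.from_tuples([
    ("Rank", "Rank"),
    ("Country", "Country"),
    ("Population", "Numbers"),
    ("Population", "% of the world"),
])
table = pd.DataFrame(
    [[1, "China", 1411750000, None], [2, "India", 1392329000, None]],
    columns=columns,
)

# Selecting a top-level label returns every column that sits under it.
subset = table[["Rank", "Country", "Population"]]
print(subset.shape)

# Assigning a flat list replaces the MultiIndex with a plain Index.
flat = subset.copy()
flat.columns = ["rank", "country", "population", "world_perc"]
print(flat.columns.tolist())
```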

For each country, I use the pycountry module to create a new variable that contains the ISO 3166-1 alpha-3 country code.

## Insert country code 
def country_code(country):
    ## Return the alpha-3 code, or "NA" when pycountry has no exact name match
    try:
        return pc.countries.get(name = country).alpha_3
    except (LookupError, AttributeError):
        return "NA"
    
population['code'] = [country_code(x) for x in population['country']]
population.head()
  rank        country  population world_perc code
0    1          China  1411750000        NaN  CHN
1    2          India  1392329000        NaN  IND
2    3  United States   334640000        NaN  USA
3    4      Indonesia   275773800        NaN  IDN
4    5       Pakistan   235825000        NaN  PAK

## Create a continent column ----
## Get a list of codes 
codes = pd.read_html('https://cloford.com/resources/codes/index.htm')[3]
codes = codes.drop('Note', axis = 'columns')
codes = codes.rename(columns = {"ISO (2)": "ISO2", "ISO (3)": "code", "ISO (No)": "ISO_No"})

## preview codes 
codes.head()
  Continent             Region         Country  ... code ISO_No Internet
0      Asia         South Asia     Afghanistan  ...  AFG    4.0       AF
1    Europe  South East Europe         Albania  ...  ALB    8.0       AL
2    Africa    Northern Africa         Algeria  ...  DZA   12.0       DZ
3   Oceania            Pacific  American Samoa  ...  ASM   16.0       AS
4    Europe  South West Europe         Andorra  ...  AND   20.0       AD

[5 rows x 9 columns]
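
The scrapes above pick tables by position (`[0]`, `[1]`, `[3]`), which breaks if a page adds or reorders tables. One possible safeguard is to pick the table by the columns it must contain. The helper `find_table` below is hypothetical and not part of Pandas, and the `tables` list is a stand-in for what `pd.read_html` returns.

```python
import pandas as pd

def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        # For a multi-level header, compare against the top-level labels.
        names = set(table.columns.get_level_values(0))
        if required <= names:
            return table
    raise ValueError(f"No table contains columns {sorted(required)}")

# Stand-in for the list returned by pd.read_html (hypothetical contents).
tables = [
    pd.DataFrame({"Key": ["a"], "Meaning": ["b"]}),
    pd.DataFrame({
        "Continent": ["Asia"],
        "Country": ["Afghanistan"],
        "ISO (3)": ["AFG"],
    }),
]

codes_table = find_table(tables, ["Continent", "ISO (3)"])
print(codes_table)
```

Raising an error when nothing matches makes a changed page layout fail loudly instead of silently selecting the wrong table.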

Next, I merge the population and codes datasets on the ISO code, which adds the continent and sub-region variables to the population data.

## Join the two dataframes ----
final_data = pd.merge(population, codes, on = "code")
final_data.head()
  rank        country  population world_perc  ... FIPS ISO2 ISO_No Internet
0    1          China  1411750000        NaN  ...   CH   CN  156.0       CN
1    2          India  1392329000        NaN  ...   IN   IN  356.0       IN
2    3  United States   334640000        NaN  ...   US   US  840.0       US
3    4      Indonesia   275773800        NaN  ...   ID   ID  360.0       ID
4    5       Pakistan   235825000        NaN  ...   PK   PK  586.0       PK

[5 rows x 13 columns]
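
By default `pd.merge` performs an inner join, so any country whose code is "NA" or absent from the codes table drops out of `final_data` without notice. A minimal sketch of how the unmatched rows could be listed follows. The two miniature tables are hypothetical, and "Atlantis" is a made-up entry standing in for a country that failed the lookup.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
population_demo = pd.DataFrame({
    "country": ["China", "India", "Atlantis"],
    "population": [1411750000, 1392329000, 1000],
    "code": ["CHN", "IND", "NA"],
})
codes_demo = pd.DataFrame({
    "code": ["CHN", "IND"],
    "Continent": ["Asia", "Asia"],
})

# A left join keeps every population row; _merge records where each row matched.
checked = pd.merge(population_demo, codes_demo, on="code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```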

The analysis below uses two datasets: population1 (from the first web scraping exercise) and final_data (from the second scraping and data merging exercise).

4 Analysis

In this plot, I visualise the trend in world population from 1804 to 2057 (projected). The population has risen steeply: the second billion took 126 years to arrive, while the fifth took only 13. However, growth is likely to slow in the coming years.

## Population trends, 1800-2050
## Visualise world population 
plt.plot(population1['year'], population1['population'], c = "red")

## Add points to the plot 
plt.scatter(population1['year'], population1['population'], c = "black", s= population1["population"])

plt.xlabel("Year")

plt.ylabel("Population, Billions")

plt.title("Population Growth, 1804-2057")

plt.show()
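
The intervals between milestones can be recomputed from the year column alone, which gives a quick check on the scraped `years_passed` values. This is a minimal sketch using only the five rows printed earlier. The column name `years_to_add_billion` is illustrative.

```python
import pandas as pd

# Milestone years for the first five billions, copied from the table above.
milestones = pd.DataFrame({
    "population": [1.0, 2.0, 3.0, 4.0, 5.0],  # billions
    "year": [1804.0, 1930.0, 1960.0, 1974.0, 1987.0],
})

# Years needed to add each successive billion (the first row has no predecessor).
milestones["years_to_add_billion"] = milestones["year"].diff()
print(milestones)
```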

Let us examine the countries with the highest and lowest populations in the data. China has the highest population; its other details are shown below.

final_data.loc[final_data['population'].idxmax()]
rank                   1
country            China
population    1411750000
world_perc           NaN
code                 CHN
Continent           Asia
Region         East Asia
Country            China
Capital          Beijing
FIPS                  CH
ISO2                  CN
ISO_No             156.0
Internet              CN
Name: 0, dtype: object

Niue, with a population of 1,549, is the least populous country in the data.

final_data.loc[final_data['population'].idxmin()]
rank                –
country          Niue
population       1549
world_perc        NaN
code              NIU
Continent     Oceania
Region        Pacific
Country          Niue
Capital         Alofi
FIPS               NE
ISO2               NU
ISO_No          570.0
Internet           NU
Name: 171, dtype: object
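
`idxmax` and `idxmin` return a single row each. To see several of the largest or smallest countries at once, `nlargest` and `nsmallest` are one option. The sketch below uses six rows copied from the outputs above, not the full `final_data`.

```python
import pandas as pd

# Six rows copied from the outputs above; the full final_data has many more.
demo = pd.DataFrame({
    "country": ["China", "India", "United States", "Indonesia", "Pakistan", "Niue"],
    "population": [1411750000, 1392329000, 334640000, 275773800, 235825000, 1549],
})

# Three most populous and the single least populous country in the demo data.
top3 = demo.nlargest(3, "population")
bottom1 = demo.nsmallest(1, "population")
print(top3)
print(bottom1)
```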

I create a summary of population by continent and sub-region. Oceania has the lowest population, while Asia holds the bulk of the global human population.

final_data.groupby('Continent')['population'].sum().sort_values()
Continent
Oceania       43167638
Europe       558240991
Americas     990112726
Africa      1142650420
Asia        4326108638
Name: population, dtype: int64

I run the same analysis for subregions. South Asia has the most people.

final_data.groupby('Region')['population'].sum().sort_values()
Region
Northern Asia           3492683
Northern Europe        27866753
Indian Ocean           29048450
South East Europe      30930328
Central Europe         33065326
West Indies            38798442
Pacific                43167638
Central Africa         54242768
South West Europe      58040893
Southern Europe        59386426
Central Asia           79044819
Eastern Europe         94214670
Southern Africa       171266283
Central America       178685136
Eastern Africa        217897638
Northern Africa       251540961
Western Europe        254736595
South West Asia       261421725
North America         374455281
South America         398173867
Western Africa        418654320
South East Asia       562087416
East Asia            1536697000
South Asia           1883364995
Name: population, dtype: int64

Next, I visualise the population by continent and subregions. I start with a plot of population by continent.

## Plot sorted population by continent
## Set the style before plotting so that it applies to this figure
plt.style.use("tableau-colorblind10")
final_data.groupby('Continent')['population'].sum().sort_values().plot()
plt.xlabel("Continent")
plt.ylabel("Population")
plt.title("Population by Continent")
plt.show()
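
A line drawn between continents can suggest a trend across categories that does not exist. One alternative, sketched here with Matplotlib's object-oriented interface, is a horizontal bar chart of each continent's share of the total. The totals are copied from the continent summary above, so they cover only the countries in the merged data. The `Agg` backend is an assumption made only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: off-screen backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Continent totals copied from the summary above (merged data only).
continent_totals = pd.Series({
    "Oceania": 43167638,
    "Europe": 558240991,
    "Americas": 990112726,
    "Africa": 1142650420,
    "Asia": 4326108638,
})

# Share of the merged total held by each continent, in percent.
shares = continent_totals / continent_totals.sum() * 100

fig, ax = plt.subplots()
ax.barh(list(shares.index), shares.values)
ax.set_xlabel("Share of total population (%)")
ax.set_ylabel("Continent")
ax.set_title("Population Share by Continent")
fig.tight_layout()
```

To view or keep the figure, call `plt.show()` in an interactive session or `fig.savefig(...)` with a file name of your choice.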

Then I do a boxplot of population by continent.

## Box plot of population by continent
## Set the style before plotting so that it applies to this figure
plt.style.use("tableau-colorblind10")
final_data.boxplot(column="population", by='Continent', color = 'blue')
plt.xlabel("Continent")
plt.ylabel("Population (Log Scale)")
plt.yscale('log')
plt.title("Population by Continent")
plt.show()
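
The log scale matters here because country populations span several orders of magnitude, so a linear axis would squash most boxes against zero. A quick check using the two extremes reported earlier (China and Niue):

```python
import math

# Largest and smallest populations reported earlier (China and Niue).
largest = 1411750000
smallest = 1549

# Ratio between the extremes and the number of powers of ten it spans.
ratio = largest / smallest
orders_of_magnitude = math.log10(ratio)
print(f"ratio: {ratio:,.0f}, orders of magnitude: {orders_of_magnitude:.2f}")
```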

Lastly, I do a boxplot of population by regions.

## Population by regions
## Set the style before plotting so that it applies to this figure
plt.style.use("tableau-colorblind10")
final_data.boxplot(column="population", by=['Region'], color = 'blue')
plt.yscale('log')
plt.xlabel("Region")
plt.ylabel("Population (Log Scale)")
plt.title("Population by Region")
plt.show()

5 Conclusion

In this analysis, I illustrated how to:

  1. Scrape data tables from websites using Python and Pandas.

  2. Analyze and plot data in Python, Pandas and Matplotlib.

  3. Merge data in the Python module Pandas.

  4. Visualise the trend in global population from 1800 to 2050.

  5. Visualise world population by continent using the Python module Matplotlib.

I hope you find this write-up useful.