# if(!require(pacman)){
# install.packages("pacman")
# }
# pacman::p_load(reticulate)
1 Background
In this analysis, I scrape and analyze population and country data from three sites:
World population from 1800 to present on Wikipedia, available here: https://en.wikipedia.org/wiki/World_population
World population by country on Wikipedia, available here: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
Country codes on Cloford, available here: https://cloford.com/resources/codes/index.htm
2 Objectives
The objectives of the analysis are as follows:
To illustrate how to scrape data tables from websites using Python and Pandas.
To demonstrate the mechanics of analyzing and plotting data in Python, Pandas and Matplotlib.
To illustrate how to merge data in the Python module Pandas.
To visualise the trend in global population from 1800 to 2050.
To visualise world population by continent using the Python module Matplotlib.
I start by loading the required modules.
## Import required modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pycountry as pc
3 Data
I start by scraping the data from the three sites. The scraping required that I transpose the data and convert some columns from strings to floats.
## World population 1800-2050
###################################################################
## Global population trends
# help(pd.read_html)
scrape = pd.read_html('https://en.wikipedia.org/wiki/World_population', header=0)
## Transpose the data as it is horizontal
population1 = scrape[0].transpose()
## Drop the "Population" row, which holds the original row labels
population1 = population1.drop(["Population"])
## View the type and index of population data
type(population1)
<class 'pandas.core.frame.DataFrame'>
population1.index
Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'], dtype='object')
## Rename columns
population1.columns = ["year", "years_passed"]
population1['population'] = list(range(1, 11))
## Convert year and population columns to numeric
population1['year'] = population1["year"].astype(float)
population1["population"] = population1["population"].astype(float)
population1.head()
year years_passed population
1 1804.0 200,000+ 1.0
2 1930.0 126 2.0
3 1960.0 30 3.0
4 1974.0 14 4.0
5 1987.0 13 5.0
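The reshaping steps above can be sketched on a small hand-made table. This is only an illustration: the `wide` frame is a hypothetical stand-in for `scrape[0]`, its layout is an assumption about how the Wikipedia table is parsed, and the values are the first three milestones shown in the preview.

```python
import pandas as pd

# Hypothetical stand-in for scrape[0]: a "wide" table whose first column
# holds the row labels and whose other columns are the billion milestones.
wide = pd.DataFrame(
    {
        "Population": ["Year", "Years elapsed"],
        "1": ["1804", "200,000+"],
        "2": ["1930", "126"],
        "3": ["1960", "30"],
    }
)

tidy = wide.transpose()            # columns become rows
tidy = tidy.drop(["Population"])   # drop the row that held the old row labels
tidy.columns = ["year", "years_passed"]
tidy["population"] = list(range(1, len(tidy) + 1))  # billions reached

# The scraped values arrive as strings, so convert them before plotting.
tidy["year"] = tidy["year"].astype(float)
tidy["population"] = tidy["population"].astype(float)
print(tidy)
```

Note that `years_passed` is left as text here, since a value such as "200,000+" cannot be converted with `astype(float)` directly.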
In this next scraping exercise, I scrape the data on population by country.
##################################################################
## Scrape world population by country
population = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")
## Get the proper table
population = population[1]
## View the columns
population.columns
MultiIndex([( 'Rank', ...),
( 'Country / Dependency', ...),
( 'Population', ...),
( 'Population', ...),
( 'Date', ...),
('Source (official or from the United Nations)', ...),
( 'Notes', ...)],
)
## Drop the top row
population = population.drop(index=0, axis='index')
## Reset the row index to start at 0
population = population.reset_index()
population = population.drop("index", axis = "columns")
<string>:1: PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
## Rename the column
population = population.rename(columns = {"Country / Dependency": "Country"})
population.head()
Rank Country ... Source (official or from the United Nations) Notes
Rank Country ... Source (official or from the United Nations) Notes
0 1 China ... Official estimate[4] [b]
1 2 India ... Official projection[5] [c] [[1]]
2 3 United States ... National population clock[6] [d]
3 4 Indonesia ... Official estimate[7] NaN
4 5 Pakistan ... UN projection[3] [e]
[5 rows x 7 columns]
## Select a few columns
population = population[['Rank','Country', 'Population']]
## Rename the columns
population.columns = [['rank', 'country', 'population', 'world_perc']]
population.columns = [x[0] for x in population.columns]
population.head()
rank country population world_perc
0 1 China 1411750000 NaN
1 2 India 1392329000 NaN
2 3 United States 334640000 NaN
3 4 Indonesia 275773800 NaN
4 5 Pakistan 235825000 NaN
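The column selection and renaming above rely on the scraped table having a two-level header, which is why picking 'Population' returns two columns. A minimal sketch of that behaviour follows; the second-level names in `columns` are assumptions made for illustration, and the two rows reuse the figures from the preview.

```python
import pandas as pd

# Hypothetical two-level header; the second-level names are assumed.
columns = pd.MultiIndex.from_tuples(
    [
        ("Rank", "Rank"),
        ("Country", "Country"),
        ("Population", "Numbers"),
        ("Population", "% of the world"),
        ("Date", "Date"),
    ]
)
table = pd.DataFrame(
    [
        [1, "China", 1411750000, None, None],
        [2, "India", 1392329000, None, None],
    ],
    columns=columns,
)

# Selecting by top-level name keeps every column under that name,
# so "Population" contributes two columns.
subset = table[["Rank", "Country", "Population"]].copy()

# Assigning a flat list replaces the two-level header in one step.
subset.columns = ["rank", "country", "population", "world_perc"]
print(subset)
```

Assigning a flat list directly gives the same result as setting a nested list and then taking the first element of each column tuple.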
For each country, I use the pycountry module to create a new variable that contains the three-letter ISO (alpha-3) country code.
## Insert country code
def country_code(country):
    try:
        return pc.countries.get(name = country).alpha_3
    except Exception:
        # No exact name match in pycountry
        return "NA"
population['code'] = [country_code(x) for x in population['country']]
population.head()
rank country population world_perc code
0 1 China 1411750000 NaN CHN
1 2 India 1392329000 NaN IND
2 3 United States 334640000 NaN USA
3 4 Indonesia 275773800 NaN IDN
4 5 Pakistan 235825000 NaN PAK
## Create a continent column ----
## Get a list of codes
codes = pd.read_html('https://cloford.com/resources/codes/index.htm')[3]
codes = codes.drop('Note', axis = 'columns')
codes = codes.rename(columns = {"ISO (2)": "ISO2", "ISO (3)": "code", "ISO (No)": "ISO_No"})
## preview codes
codes.head()
Continent Region Country ... code ISO_No Internet
0 Asia South Asia Afghanistan ... AFG 4.0 AF
1 Europe South East Europe Albania ... ALB 8.0 AL
2 Africa Northern Africa Algeria ... DZA 12.0 DZ
3 Oceania Pacific American Samoa ... ASM 16.0 AS
4 Europe South West Europe Andorra ... AND 20.0 AD
[5 rows x 9 columns]
Next, I merge the population and codes datasets to get data that has variables for continents and continent sub-regions.
## Join the two dataframes ----
final_data = pd.merge(population, codes, on = "code")
final_data.head()
rank country population world_perc ... FIPS ISO2 ISO_No Internet
0 1 China 1411750000 NaN ... CH CN 156.0 CN
1 2 India 1392329000 NaN ... IN IN 356.0 IN
2 3 United States 334640000 NaN ... US US 840.0 US
3 4 Indonesia 275773800 NaN ... ID ID 360.0 ID
4 5 Pakistan 235825000 NaN ... PK PK 586.0 PK
[5 rows x 13 columns]
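Because `pd.merge` defaults to an inner join, any country whose code is "NA" or is missing from the codes table drops out of the merged data without notice. The sketch below shows one way to check for that using a left join with `indicator=True`. The two small frames are hypothetical stand-ins, and "Placeholder Land" is an invented row used only to trigger a mismatch.

```python
import pandas as pd

# Hypothetical stand-ins for the population and codes tables.
population = pd.DataFrame(
    {
        "country": ["China", "India", "Placeholder Land"],
        "population": [1411750000, 1392329000, 1],
        "code": ["CHN", "IND", "NA"],  # "NA" marks a failed lookup
    }
)
codes = pd.DataFrame(
    {
        "code": ["CHN", "IND", "AFG"],
        "Continent": ["Asia", "Asia", "Asia"],
    }
)

# Default inner join: rows without a matching code are dropped.
inner = pd.merge(population, codes, on="code")

# Left join with an indicator column keeps every country and records
# whether a match was found.
checked = pd.merge(population, codes, on="code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "country"].tolist()

print(len(inner), "rows matched")
print("Unmatched countries:", unmatched)
```

Inspecting the unmatched list before aggregating helps explain why continent totals may come out lower than the world total.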
We now work with two datasets in the analysis below: population1 (from the first web scraping exercise) and final_data (from the second web scraping and data merging exercise).
4 Analysis
In this plot, we visualise the trend in world population from 1804 to 2057 (projected). We see that the population has risen exponentially. However, growth is likely to slow in the coming years.
## Population trends, 1800-2050
## Visualise world population
plt.plot(population1['year'], population1['population'], c = "red")
## Add points to the plot
plt.scatter(population1['year'], population1['population'], c = "black", s= population1["population"])
plt.xlabel("Year")
plt.ylabel("Population, Billions")
plt.title("Population Growth, 1804-2057")
plt.show()
Let us examine the countries with the highest and lowest population in the data. China has the highest population. See the other relevant details about China below.
final_data.loc[final_data['population'].idxmax()]
rank 1
country China
population 1411750000
world_perc NaN
code CHN
Continent Asia
Region East Asia
Country China
Capital Beijing
FIPS CH
ISO2 CN
ISO_No 156.0
Internet CN
Name: 0, dtype: object
Niue, with a population of 1,549, is the country with the lowest population in the merged data.
final_data.loc[final_data['population'].idxmin()]
rank –
country Niue
population 1549
world_perc NaN
code NIU
Continent Oceania
Region Pacific
Country Niue
Capital Alofi
FIPS NE
ISO2 NU
ISO_No 570.0
Internet NU
Name: 171, dtype: object
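The two lookups above combine `idxmax` or `idxmin`, which return a row label, with `.loc`, which fetches that row. A minimal sketch on a three-row stand-in for final_data (the figures are the ones shown above) also shows `nlargest`, which may be handier when more than one row is wanted:

```python
import pandas as pd

# Small stand-in for final_data, reusing figures from the output above.
sample = pd.DataFrame(
    {
        "country": ["China", "India", "Niue"],
        "population": [1411750000, 1392329000, 1549],
    }
)

# idxmax/idxmin give the index label; .loc turns it into a row (a Series).
largest = sample.loc[sample["population"].idxmax()]
smallest = sample.loc[sample["population"].idxmin()]

# nlargest returns the top n rows as a DataFrame, already sorted.
top_two = sample.nlargest(2, "population")

print(largest["country"], smallest["country"])
print(top_two)
```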
I create a summary of population by continent and sub-regions. The summary shows Oceania having the lowest population while Asia holds the bulk of global human population.
final_data.groupby('Continent').sum()['population'].sort_values()
Continent
Oceania 43167638
Europe 558240991
Americas 990112726
Africa 1142650420
Asia 4326108638
Name: population, dtype: int64
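pandas prints a FutureWarning for this call because `.sum()` is applied to every column, including text columns, before 'population' is picked out. A minimal sketch of an alternative is to select the column first and then sum it. The three-row `sample` frame is a stand-in for final_data, reusing figures shown earlier.

```python
import pandas as pd

# Small stand-in for final_data.
sample = pd.DataFrame(
    {
        "country": ["China", "India", "Niue"],
        "Continent": ["Asia", "Asia", "Oceania"],
        "population": [1411750000, 1392329000, 1549],
    }
)

# Select the numeric column before aggregating, so only it is summed.
by_continent = sample.groupby("Continent")["population"].sum().sort_values()
print(by_continent)
```

The same pattern applies to the grouping by 'Region'.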
<string>:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
I run the same analysis for subregions. South Asia has the most people.
final_data.groupby('Region').sum()['population'].sort_values()
Region
Northern Asia 3492683
Northern Europe 27866753
Indian Ocean 29048450
South East Europe 30930328
Central Europe 33065326
West Indies 38798442
Pacific 43167638
Central Africa 54242768
South West Europe 58040893
Southern Europe 59386426
Central Asia 79044819
Eastern Europe 94214670
Southern Africa 171266283
Central America 178685136
Eastern Africa 217897638
Northern Africa 251540961
Western Europe 254736595
South West Asia 261421725
North America 374455281
South America 398173867
Western Africa 418654320
South East Asia 562087416
East Asia 1536697000
South Asia 1883364995
Name: population, dtype: int64
<string>:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
Next, I visualise the population by continent and subregions. I start with a plot of population by continent.
## Plot sorted population by continent
final_data.groupby('Continent').sum()['population'].sort_values().plot()
plt.style.use("tableau-colorblind10")
plt.xlabel("Continent")
plt.ylabel("Population")
plt.title("Population by Continent")
plt.show()
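One possible variation on this plot is sketched below, under two assumptions: that a bar chart suits categorical continents better than a line, and that the style sheet should be set before the figure is created, since as far as I understand matplotlib styles mainly affect artists created afterwards. The `totals` series simply reuses the continent sums printed earlier.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Continent totals copied from the summary output above.
totals = pd.Series(
    {
        "Oceania": 43167638,
        "Europe": 558240991,
        "Americas": 990112726,
        "Africa": 1142650420,
        "Asia": 4326108638,
    },
    name="population",
)

# Set the style first so that it applies to the figure created next.
plt.style.use("tableau-colorblind10")

fig, ax = plt.subplots()
# Express the totals in billions to keep the axis readable.
(totals / 1e9).sort_values().plot(kind="bar", ax=ax)
ax.set_xlabel("Continent")
ax.set_ylabel("Population (billions)")
ax.set_title("Population by Continent")
fig.tight_layout()
```

Calling `plt.show()` afterwards displays the figure in an interactive session.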
Then I do a boxplot of population by continent.
## Box plot of population by continent
final_data.boxplot(column="population", by='Continent', color = 'blue')
plt.style.use("tableau-colorblind10")
plt.xlabel("Continent")
plt.ylabel("Population (Log Scale)")
plt.yscale('log')
plt.title("Population by Continent")
plt.show()
Lastly, I do a boxplot of population by regions.
## Population by regions
final_data.boxplot(column="population", by=['Region'], color = 'blue')
plt.style.use("tableau-colorblind10")
plt.yscale('log')
plt.xlabel("Region")
plt.ylabel("Population (Log Scale)")
plt.title("Population by Region")
plt.show()
5 Conclusion
In this analysis, I illustrate how to:
Scrape data tables from websites using Python and Pandas.
Analyze and plot data in Python, Pandas and Matplotlib.
Merge data in the Python module Pandas.
Visualise the trend in global population from 1800 to 2050.
Visualise world population by continent using the Python module Matplotlib.
I hope you find this write-up useful.