Quarto Tutorial 3

1 Introduction

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto, see https://quarto.org.

1.1 Definitions

Before getting started, let’s explain what each option in the YAML header means.

toc adds a table of contents to your document.
number-sections adds numbers to the section headings when set to true.
LaTeX equations are rendered using MathJax by default; however, you can change this to another renderer in the YAML header.
highlight-style sets the syntax-highlighting theme used for source code.
code-overflow controls how source code lines wider than the available space are handled. When set to wrap, long lines wrap onto the next line; when set to scroll, the code block scrolls horizontally instead.

There are numerous options to style and format your document, so we recommend reading the documentation on the Quarto website.

1.2 Loading the Libraries

Here we will load pandas, seaborn, and matplotlib.

# Loading packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Formatting
sns.set_context('notebook')
sns.set_style('white')

1.3 Importing the Dataset

# Import the data
ss_2008_census_df = pd.read_csv('../00_data/ss_2008_census_data_raw.csv')

# Inspect the first 5 rows
ss_2008_census_df.head()

	Region	Region Name	Region - RegionId	Variable	Variable Name	Age	Age Name	Scale	Units	2008
0	KN.A2	Upper Nile	SS-NU	KN.B2	Population, Total (Number)	KN.C1	Total	units	Persons	964353.0
1	KN.A2	Upper Nile	SS-NU	KN.B2	Population, Total (Number)	KN.C2	0 to 4	units	Persons	150872.0
2	KN.A2	Upper Nile	SS-NU	KN.B2	Population, Total (Number)	KN.C3	5 to 9	units	Persons	151467.0
3	KN.A2	Upper Nile	SS-NU	KN.B2	Population, Total (Number)	KN.C4	10 to 14	units	Persons	126140.0
4	KN.A2	Upper Nile	SS-NU	KN.B2	Population, Total (Number)	KN.C5	15 to 19	units	Persons	103804.0

# Inspect the last 5 rows
ss_2008_census_df.tail()

	Region	Region Name	Region - RegionId	Variable	Variable Name	Age	Age Name	Scale	Units	2008
448	KN.A11	Eastern Equatoria	SS-EE	KN.B8	Population, Female (Number)	KN.C14	60 to 64	units	Persons	5274.0
449	KN.A11	Eastern Equatoria	SS-EE	KN.B8	Population, Female (Number)	KN.C22	65+	units	Persons	8637.0
450	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
451	Source:	National Bureau of Statistics, South Sudan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
452	Download URL:	http://southsudan.opendataforafrica.org/fvjqdp...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Above, we see that the last three rows are not census records and consist mostly of NaNs (missing values). The first is entirely empty, the second names the data source from which we obtained this dataset, and the third gives the download URL.

2 Cleaning and Transforming the Data

2.1 Checking for Missing Values

Now that we have imported our dataset, we will clean and manipulate it. First, we confirm where the missing values are; then we proceed with the data wrangling.

# Check for missing values
ss_2008_census_df.isna().sum()

Region               1
Region Name          1
Region - RegionId    3
Variable             3
Variable Name        3
Age                  3
Age Name             3
Scale                3
Units                3
2008                 3
dtype: int64
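
The counts line up with the tail of the table: Region and Region Name hold the footer labels and text in two of the three trailing rows, so only the fully blank row is missing there, while every other column is empty in all three. A minimal sketch on a hypothetical three-column miniature of the raw file (the `raw` frame and the `is_footer` name are illustrative, not the real dataset) reproduces the pattern and shows one possible way to flag the footer rows:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw export: two data rows, then a blank row
# and two footer rows that only fill the first column
raw = pd.DataFrame({
    'Region': ['KN.A2', 'KN.A11', np.nan, 'Source:', 'Download URL:'],
    'Age Name': ['Total', '65+', np.nan, np.nan, np.nan],
    '2008': [964353.0, 8637.0, np.nan, np.nan, np.nan],
})

# Missing values per column: the footer text keeps 'Region' mostly filled
na_counts = raw.isna().sum()
print(na_counts)

# Rows with no value in any of the data columns are treated as footer rows
is_footer = raw[['Age Name', '2008']].isna().all(axis=1)
print(raw[~is_footer])
```

In the tutorial itself, the same rows are removed later by filtering on a missing age category.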

2.2 Wrangling the Data Using `Method Chaining`

# Select desired columns
cols = ['Region Name', 'Variable Name', 'Age Name', '2008']

# Rename columns
cols_names = {'Region Name':'state', 
             'Variable Name':'gender', 
             'Age Name':'age_cat', 
             '2008':'population'}
             
# Create new age categories           
new_age_cats = {'0 to 4':'0-14', 
                '5 to 9':'0-14',
                '10 to 14':'0-14',
                '15 to 19':'15-29', 
                '20 to 24':'15-29',
                '25 to 29':'15-29',
                '30 to 34':'30-49', 
                '35 to 39':'30-49',
                '40 to 44':'30-49',
                '45 to 49':'30-49', 
                '50 to 54':'50-64', 
                '55 to 59':'50-64',
                '60 to 64':'50-64', 
                '65+':'>= 65'
                }
             

# Clean the data
df = (ss_2008_census_df
      [cols]
      .rename(columns = cols_names)
      .query('~age_cat.isna()')
      .assign(gender = lambda x:x['gender'].str.split(r'\s+').str[1],  # raw string avoids an invalid-escape warning
             age_cat = lambda x:x['age_cat'].replace(new_age_cats),
             population = lambda x:x['population'].astype('int')
      )
      .query('gender != "Total" & age_cat != "Total"')
      .groupby(['state', 'gender', 'age_cat'])['population']
      .sum()
      .reset_index()
     )

# Inspect the first 5 rows
df.head()

	state	gender	age_cat	population
0	Central Equatoria	Female	0-14	221216
1	Central Equatoria	Female	15-29	166887
2	Central Equatoria	Female	30-49	101676
3	Central Equatoria	Female	50-64	23460
4	Central Equatoria	Female	>= 65	8596
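
To make the `assign` and `query` steps of the chain concrete, here is a small sketch on hypothetical rows (the `toy` frame, its numbers, and the shortened `age_map` are illustrative; only the label format follows the raw data). Splitting `'Population, Female (Number)'` on whitespace gives three pieces, and the second piece is the gender; `replace` then maps the five-year bands onto the wider categories, and the final `query` drops the aggregate rows.

```python
import pandas as pd

# Hypothetical rows in the same shape as the renamed raw data
toy = pd.DataFrame({
    'gender': ['Population, Female (Number)',
               'Population, Male (Number)',
               'Population, Total (Number)'],
    'age_cat': ['0 to 4', '65+', 'Total'],
    'population': [10.0, 20.0, 30.0],
})

# Shortened mapping, for the sketch only
age_map = {'0 to 4': '0-14', '65+': '>= 65'}

cleaned = (toy
           .assign(gender=lambda x: x['gender'].str.split(r'\s+').str[1],
                   age_cat=lambda x: x['age_cat'].replace(age_map),
                   population=lambda x: x['population'].astype('int'))
           .query('gender != "Total" & age_cat != "Total"')
           )
print(cleaned)
```

Each step returns a new DataFrame, which is what allows the operations to be chained without intermediate variables.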

3 Summarizing Census Data

3.1 Population by State

# Calculate census data by state
st_df = (df  
         .groupby(['state'])['population']
         .sum()
         .reset_index()
         .sort_values('population', 
                      ascending=False, 
                      ignore_index=True)
         )

# Display the output
st_df

	state	population
0	Jonglei	1358602
1	Central Equatoria	1103557
2	Warrap	972928
3	Upper Nile	964353
4	Eastern Equatoria	906161
5	Northern Bahr el Ghazal	720898
6	Lakes	695730
7	Western Equatoria	619029
8	Unity	585801
9	Western Bahr el Ghazal	333431
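
One way to sanity-check these totals is to compare them with the 'Total' rows that the raw file already contains; for Upper Nile, the raw 'Total' row (964353) matches the aggregated figure above. The sketch below uses hypothetical numbers (the `toy` frame is illustrative) to show why the 'Total' rows have to be filtered out before grouping, and how they can still serve as a reference value:

```python
import pandas as pd

# Hypothetical single-state data: one 'Total' row plus its two components
toy = pd.DataFrame({
    'state': ['A', 'A', 'A'],
    'gender': ['Total', 'Male', 'Female'],
    'population': [100, 55, 45],
})

# Summing every row double counts, because 'Total' already includes both genders
naive_total = toy.groupby('state')['population'].sum()

# Dropping the 'Total' row first gives the correct figure
clean_total = (toy
               .query('gender != "Total"')
               .groupby('state')['population']
               .sum())

# The 'Total' row itself acts as a reference value for the check
reported = toy.query('gender == "Total"').set_index('state')['population']
print(naive_total['A'], clean_total['A'], reported['A'])
```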

3.2 Population by State and Gender

# Calculate census data by state and gender
gender_df = (df  
         .groupby(['state', 'gender'])['population']
         .sum()
         .reset_index()
         .sort_values('population', 
                      ascending=False, 
                      ignore_index=True)
         )

# Display the output
gender_df.head()

	state	gender	population
0	Jonglei	Male	734327
1	Jonglei	Female	624275
2	Central Equatoria	Male	581722
3	Upper Nile	Male	525430
4	Central Equatoria	Female	521835

3.3 Population by State, Gender, and Age Category

# Calculate census data by state, gender, and age category
age_df = (df  
         .groupby(['state', 'gender', 'age_cat'])['population']
         .sum()
         .reset_index()
         .sort_values(['state','population'], 
                      ascending = [True, False], 
                      ignore_index = True)
         )

# Display the output
age_df.head(5)

	state	gender	age_cat	population
0	Central Equatoria	Male	0-14	242247
1	Central Equatoria	Female	0-14	221216
2	Central Equatoria	Male	15-29	179903
3	Central Equatoria	Female	15-29	166887
4	Central Equatoria	Male	30-49	119355
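
The `sort_values` call in this section sorts by two keys with different directions: `state` in alphabetical (ascending) order and, within each state, `population` from largest to smallest. A small sketch on hypothetical values (the `toy` frame is illustrative) shows how the `ascending` list pairs up with the list of sort keys:

```python
import pandas as pd

# Hypothetical values: two states with two population figures each
toy = pd.DataFrame({
    'state': ['B', 'A', 'B', 'A'],
    'population': [5, 1, 9, 7],
})

# First key ascending, second key descending; ignore_index renumbers the rows
ordered = toy.sort_values(['state', 'population'],
                          ascending=[True, False],
                          ignore_index=True)
print(ordered)
```

This is why the output above lists the largest age groups of Central Equatoria first, regardless of gender.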

4 Closing Remarks

This tutorial demonstrated how to create a Quarto document and use various YAML options to style and format the output. We hope you find it helpful. Please let us know if there are topics you would like us to cover in a future tutorial.

With that said, our next tutorial will be on R.