NumPy and Pandas II

HSS 611: Programming for HSS

Taegyoon Kim

Sep 23, 2025

Agenda

  • We had a broad look at NumPy/Pandas fundamentals
  • We will look at key data manipulation/wrangling approaches with Pandas
  • Subsetting (with conditions), join, concatenate, reshape, aggregate, etc.

Subsetting with conditions

Single condition

  • Use Boolean conditions to subset rows based on a single criterion
  • Create a condition using comparison operators like >, <, ==, !=
  • Apply the condition inside the data frame’s square brackets
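A minimal sketch, assuming a hypothetical data frame df with made-up name and age columns:

    import pandas as pd

    df = pd.DataFrame({'name': ['Ann', 'Bo', 'Cal'], 'age': [22, 35, 41]})
    df[df['age'] > 30]   # keep only the rows where the condition is True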

Subsetting with conditions

Subsetting with multiple conditions

  • Combine multiple conditions using & (AND) and | (OR) operators
  • Use parentheses around each condition to ensure correct precedence
  • Combine conditions for more specific subsetting
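Continuing with the hypothetical df above:

    # age above 30 AND name not equal to 'Cal'
    df[(df['age'] > 30) & (df['name'] != 'Cal')]

    # age below 25 OR above 40
    df[(df['age'] < 25) | (df['age'] > 40)]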

Subsetting with conditions

Subsetting rows and columns simultaneously

  • Use .loc[] to subset both rows and columns at the same time
  • Useful when you want to filter rows based on a condition and select specific columns simultaneously
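A sketch with the same hypothetical df:

    # rows where age > 30, keeping only the 'name' column
    df.loc[df['age'] > 30, ['name']]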

Joining data

Data are often spread across multiple files or databases

  • We often want to combine data sets by linking rows
  • We use common column(s) to join data sets
  • For instance
    • Publications (title, abstract, author name)
    • Authors (author name, field of research, institution name)
    • Institutions (institution name, location, budget)

Joining data

Let’s create two example data frames representing two data sets
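A minimal sketch of two hypothetical data frames sharing a key column (the names and values are made up):

    import pandas as pd

    df_left = pd.DataFrame({'key': ['a', 'b', 'a', 'c'],
                            'value_left': [1, 2, 3, 4]})
    df_right = pd.DataFrame({'key': ['a', 'b', 'd'],
                             'value_right': [10, 20, 30]})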


Joining data

Use pd.merge to join the two data frames

  • Join based on the key column using the on argument
  • In this case, we are doing a many-to-one join
    • One data frame has multiple rows that correspond to a single row in another data frame based on a shared key
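With the hypothetical df_left and df_right above (df_left has several rows per key, df_right has one):

    pd.merge(df_left, df_right, on='key')   # many-to-one join on the shared 'key' column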

Joining data

If the column(s) to join on are not specified, pd.merge automatically uses the overlapping column(s)

  • Not recommended though

Joining data

When the key column names differ between the two data frames

  • Specify them separately
    • Alternatively, you could change the name(s) of the column(s)

Joining data

Use the left_on and right_on arguments
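A sketch assuming the key columns carry different (made-up) names in each data frame:

    df_l = df_left.rename(columns={'key': 'key_l'})
    df_r = df_right.rename(columns={'key': 'key_r'})
    pd.merge(df_l, df_r, left_on='key_l', right_on='key_r')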

Joining data

Where have the 'c' and 'd' rows gone?

  • This is because the how argument is set to 'inner' by default
  • With an 'inner' join, the result only contains rows whose keys find a match in both data frames ('a', 'b')
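With the hypothetical frames above, the default behaves like this:

    # how='inner' is the default: only keys present in both frames ('a', 'b') survive
    pd.merge(df_left, df_right, on='key', how='inner')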

Joining data

outer join takes the union of the keys

  • Rows that do not match on keys in the other data frame will get missing values
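Again with the hypothetical frames:

    # union of the keys; rows without a match get NaN in the other frame's columns
    pd.merge(df_left, df_right, on='key', how='outer')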

Joining data

left join keeps all rows of the data frame on the left

  • If a row has no match in the data frame on the right, the right-hand columns are filled with missing values

Joining data

Vice versa for right join
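A sketch of both directions with the hypothetical frames:

    # keep every row of df_left; unmatched right-hand values become NaN
    pd.merge(df_left, df_right, on='key', how='left')

    # keep every row of df_right instead
    pd.merge(df_left, df_right, on='key', how='right')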

Joining data

Joining on multiple columns

Joining data

Simply provide a list of columns we want to join on
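A minimal sketch with two made-up frames that share two key columns:

    df_a = pd.DataFrame({'key1': ['x', 'x', 'y'], 'key2': [1, 2, 1], 'v1': [10, 20, 30]})
    df_b = pd.DataFrame({'key1': ['x', 'y'], 'key2': [1, 1], 'v2': [100, 200]})
    pd.merge(df_a, df_b, on=['key1', 'key2'])   # match on both columns at once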

Concatenating data

pd.concat puts chunks of data together row-wise or column-wise

  • Note that, unlike pd.merge, we do not match on values
    • Also, the data frames are passed to pd.concat as a list
  • With axis taking 0, we are concatenating row-wise (vertically)
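A minimal sketch with two made-up chunks:

    import pandas as pd

    part1 = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
    part2 = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})
    pd.concat([part1, part2], axis=0)   # stack the rows vertically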

Concatenating data

With axis taking 1, we are concatenating column-wise (horizontally)
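With the same hypothetical chunks:

    pd.concat([part1, part2], axis=1)   # place the frames side by side, aligned on the index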

Concatenating data

What if the indices of the two data sources do not match?

Concatenating data

If necessary, use reset_index(drop=True) before concatenating

Concatenating data

Use reset_index to assign a new default index
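A sketch, assuming a hypothetical chunk whose index does not start at 0:

    part3 = pd.DataFrame({'x': [5, 6], 'y': [7, 8]}, index=[10, 11])
    part3 = part3.reset_index(drop=True)   # back to the default 0, 1, ... index
    pd.concat([part1, part3], axis=1)      # now the rows line up with part1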

Concatenating data

Set the join argument to 'inner' to keep only the rows whose index labels appear in every data frame (i.e., rows without missing values)
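A sketch with a hypothetical chunk whose index only partially overlaps with part1:

    part4 = pd.DataFrame({'z': [9, 10, 11]}, index=[1, 2, 3])
    pd.concat([part1, part4], axis=1, join='inner')   # keep only the shared index label(s)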

Concatenating data

Use the keys argument to create a hierarchical index

  • This is useful when you want to keep track of the original data sources
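With the hypothetical chunks from above:

    combined = pd.concat([part1, part2], keys=['source1', 'source2'])
    combined                   # the outer index level records the original source
    combined.loc['source2']    # recover the rows that came from part2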


Reshaping

stack rotates from the columns to the rows

  • Let’s create an example data frame
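A minimal sketch of such a frame, with made-up values for three countries across three years:

    import pandas as pd

    gdp = pd.DataFrame(
        {'2019': [1.0, 2.0, 3.0], '2020': [1.1, 2.1, 3.1], '2021': [1.2, 2.2, 3.2]},
        index=['USA', 'UK', 'China'],
    )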

Reshaping

stack rotates from the columns to the rows

  • stack works around the data frame's index
  • For each entry in the index ('USA', 'UK', 'China'), the column names ('2019', '2020', '2021') become the inner index
  • The index of the original data frame becomes the outer index of the resulting series
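With the hypothetical gdp frame above:

    stacked = gdp.stack()   # a Series: country is the outer index, year the inner index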

Reshaping

stack rotates from the columns to the rows

  • What is returned is a series

Reshaping

reset_index() transforms the Series into a DataFrame
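Continuing with the hypothetical stacked series:

    tidy = stacked.reset_index()
    tidy.columns = ['country', 'year', 'value']   # give the new columns informative names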

Reshaping

unstack does the opposite

  • Pick the index level to be used as column names
  • The innermost index (-1) works as column names (by default)

Reshaping

unstack does the opposite

  • Use of unstack with level argument (default: -1)
  • Now the outer index works as column names with level = 0
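With the hypothetical stacked series from above:

    stacked.unstack()          # the innermost level (-1, the years) becomes the columns
    stacked.unstack(level=0)   # the outer level (the countries) becomes the columns instead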

Wide/long data format

Wide format provides a quick overview of all variables

Wide/long data format

Long format is ideal for time-series data or repeated measures

Wide/long data format

From wide to long

  • An example wide-format data frame about coffee intake
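A minimal sketch of such a frame (the column names and values are made up):

    import pandas as pd

    coffee = pd.DataFrame({
        'person': ['Ann', 'Bo', 'Cal'],
        'mon': [2, 1, 3],
        'tue': [3, 0, 2],
        'wed': [1, 2, 4],
    })   # cups per day, one column per day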

Wide/long data format

From wide to long

  • Use melt method
  • Specify id_vars
    • This is the identifier column
    • Columns listed in id_vars keep their name and values; the remaining columns are unpivoted into variable/value pairs
  • (Optionally) var_name and value_name as well

Wide/long data format

From wide to long

  • Specifying only id_vars
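With the hypothetical coffee frame:

    coffee.melt(id_vars='person')   # unpivoted columns get the default names 'variable' and 'value'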

Wide/long data format

From wide to long

  • Specifying id_vars, var_name, and value_name
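Again with the hypothetical coffee frame:

    coffee.melt(id_vars='person', var_name='day', value_name='cups')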

Wide/long data format

From wide to long

  • This is equivalent to setting the identifier variable as the index with set_index and then running stack
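A sketch of the equivalence (up to column names and row order), using the hypothetical coffee frame:

    coffee.set_index('person').stack().reset_index()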

Wide/long data format

From long to wide

  • An example long-format data frame about sales
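A minimal sketch of such a frame (the values are made up):

    import pandas as pd

    sales = pd.DataFrame({
        'store': ['A', 'A', 'B', 'B'],
        'year': [2023, 2024, 2023, 2024],
        'revenue': [100, 120, 80, 90],
        'units': [10, 12, 8, 9],
    })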

Wide/long data format

From long to wide

  • Use the pivot method
  • Specify index, columns, and values
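With the hypothetical sales frame:

    sales.pivot(index='store', columns='year', values='revenue')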

Wide/long data format

From long to wide

  • Multiple values
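Passing a list of value columns produces hierarchical column labels:

    sales.pivot(index='store', columns='year', values=['revenue', 'units'])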

Wide/long data format

From long to wide

  • This is equivalent to creating a hierarchical index using set_index and then running unstack
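A sketch of the equivalence, using the hypothetical sales frame:

    sales.set_index(['store', 'year'])['revenue'].unstack()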

Wide/long data format

From long to wide

  • pivot will raise an error if the specified index and columns pair is not unique
  • pivot_table allows you to reshape data and also aggregate it
  • Useful when there are duplicates that need to be aggregated
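A sketch with the hypothetical sales frame; pivot_table aggregates any duplicate index/columns pairs instead of raising an error:

    sales.pivot_table(index='store', columns='year', values='revenue', aggfunc='mean')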

Wide/long data format

From long to wide

  • aggfunc can take 'sum', 'mean', 'max', etc., or even a custom function
  • Also a list of these (e.g., ['min', 'max'])
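Two variations with the hypothetical sales frame:

    sales.pivot_table(index='store', values='revenue', aggfunc=['min', 'max'])
    sales.pivot_table(index='store', values='revenue', aggfunc=lambda x: x.max() - x.min())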

Group operations and aggregation

Split-Apply-Combine framework

Group operations and aggregation

Grouping

  • Let’s create an example data frame
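A minimal sketch of a hypothetical frame with a grouping key (it includes one missing key, used again below):

    import numpy as np
    import pandas as pd

    people = pd.DataFrame({
        'group': ['a', 'a', 'b', 'b', np.nan],
        'score': [1.0, 2.0, 3.0, 4.0, 5.0],
    })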

Group operations and aggregation

Grouping

  • Generate a GroupBy object
  • An intermediate object that represents the grouping information of the original data frame
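With the hypothetical frame above:

    grouped = people.groupby('group')   # a GroupBy object, not yet an aggregated result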

Group operations and aggregation

Grouping

  • We can actually see the keys and groups in a GroupBy object
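Continuing with the hypothetical GroupBy object:

    grouped.groups           # mapping from each key to the row labels in that group
    grouped.get_group('a')   # the sub-frame for a single key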

Group operations and aggregation

Grouping

  • Subset a GroupBy object and do some simple aggregation
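For example, with the hypothetical grouped object:

    grouped['score'].mean()   # mean score within each group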

Group operations and aggregation

Grouping

  • Set the dropna argument to False to keep rows with missing group keys (as their own group)
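With the hypothetical frame, which contains one missing group key:

    people.groupby('group', dropna=False)['score'].mean()   # NaN keys form their own group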

Group operations and aggregation

Aggregation

  • Aggregation refers to any data transformation that produces scalar values from arrays
  • With groupby, there are many ways to aggregate
  • mean, median, sum, size, min, max, count, etc.

Group operations and aggregation

Aggregation

  • An example data frame on voter turnout by region and district
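A minimal sketch of such a frame (the values are made up):

    import pandas as pd

    turnout = pd.DataFrame({
        'region': ['East', 'East', 'West', 'West'],
        'district': ['E1', 'E2', 'W1', 'W2'],
        'turnout': [0.62, 0.58, 0.71, 0.66],
    })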

Group operations and aggregation

Aggregation

  • For instance, get various statistics on turnout for each region
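With the hypothetical turnout frame:

    turnout.groupby('region')['turnout'].agg(['mean', 'min', 'max', 'count'])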

Group operations and aggregation

Aggregation

  • We can also use our own aggregation functions
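For example, a made-up range function:

    def value_range(x):
        return x.max() - x.min()

    turnout.groupby('region')['turnout'].agg(value_range)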

Group operations and aggregation

Aggregation

  • Index as a column using as_index
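With the hypothetical turnout frame:

    turnout.groupby('region', as_index=False)['turnout'].mean()   # 'region' stays a regular column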

Group operations and aggregation

For more complex tasks involving multiple columns

  • We can use the named-aggregation pattern: new_column_name = (column_name, function)
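A sketch with the hypothetical turnout frame:

    turnout.groupby('region').agg(
        mean_turnout=('turnout', 'mean'),
        n_districts=('district', 'count'),
    )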

Group operations and aggregation

apply operations

  • Apply a function along an axis of a data frame or series
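A sketch with the hypothetical turnout frame:

    # column-wise (axis=0): center each numeric column
    turnout[['turnout']].apply(lambda col: col - col.mean())

    # row-wise (axis=1): build a label from each row
    turnout.apply(lambda row: f"{row['region']}-{row['district']}", axis=1)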
