NumPy and Pandas II

HSS 611: Programming for HSS

Taegyoon Kim

Sep 23, 2025

Agenda

  • We had a broad look at NumPy/Pandas fundamentals
  • We will look at key data manipulation/wrangling approaches with Pandas
  • Subsetting (with conditions), join, concatenate, reshape, aggregate, etc.

Subsetting with conditions

Single condition

  • Use Boolean conditions to subset rows based on a single criterion
  • Create a condition using comparison operators like >, <, ==, !=
  • Apply the condition inside the data frame’s square brackets
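A minimal sketch, assuming a hypothetical data frame df with made-up name and age columns:

    import pandas as pd

    df = pd.DataFrame({'name': ['Ann', 'Bo', 'Cal'], 'age': [22, 35, 41]})
    df[df['age'] > 30]   # keep only the rows where the condition is True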

Subsetting with conditions

Subsetting with multiple conditions

  • Combine multiple conditions using & (AND) and | (OR) operators
  • Use parentheses around each condition to ensure correct precedence
  • Combine conditions for more specific subsetting
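Continuing with the hypothetical df above:

    # age above 30 AND name not equal to 'Cal'
    df[(df['age'] > 30) & (df['name'] != 'Cal')]

    # age below 25 OR above 40
    df[(df['age'] < 25) | (df['age'] > 40)]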

Subsetting with conditions

Subsetting rows and columns simultaneously

  • Use .loc[] to subset both rows and columns at the same time
  • Useful when you want to filter rows based on a condition and select specific columns simultaneously
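A sketch with the same hypothetical df:

    # rows where age > 30, keeping only the 'name' column
    df.loc[df['age'] > 30, ['name']]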

Joining data

Data are often spread across multiple files or databases

  • We often want to combine data sets by linking rows
  • We use common column(s) to join data sets
  • For instance
    • Publications (title, abstract, author name)
    • Authors (author name, field of research, institution name)
    • Institutions (institution name, location, budget)

Joining data

Let’s create two example data frames representing two data sets
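A minimal sketch of two hypothetical data frames sharing a key column (the names and values are made up):

    import pandas as pd

    df_left = pd.DataFrame({'key': ['a', 'b', 'a', 'c'],
                            'value_left': [1, 2, 3, 4]})
    df_right = pd.DataFrame({'key': ['a', 'b', 'd'],
                             'value_right': [10, 20, 30]})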


Joining data

Use pd.merge to join the two data frames

  • Join based on the key column using the on argument
  • In this case, we are doing a many-to-one join
    • One data frame has multiple rows that correspond to a single row in another data frame based on a shared key
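With the hypothetical df_left and df_right above (df_left has several rows per key, df_right has one):

    pd.merge(df_left, df_right, on='key')   # many-to-one join on the shared 'key' column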

Joining data

If the column(s) to join on are not specified, pd.merge automatically uses the overlapping column(s)

  • Not recommended though

Joining data

When the key column names differ between the two data frames

  • Specify them separately
    • Alternatively, you could change the name(s) of the column(s)

Joining data

Use the left_on and right_on arguments
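A sketch assuming the key columns carry different (made-up) names in each data frame:

    df_l = df_left.rename(columns={'key': 'key_l'})
    df_r = df_right.rename(columns={'key': 'key_r'})
    pd.merge(df_l, df_r, left_on='key_l', right_on='key_r')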

Joining data

Where have the 'c' and 'd' rows gone?

  • This is because the how argument is set to 'inner' by default
  • With an 'inner' join, the result only contains rows whose keys find a match in both data frames ('a', 'b')
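With the hypothetical frames above, the default behaves like this:

    # how='inner' is the default: only keys present in both frames ('a', 'b') survive
    pd.merge(df_left, df_right, on='key', how='inner')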

Joining data

outer join takes the union of the keys

  • Rows that do not match on keys in the other data frame will get missing values
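Again with the hypothetical frames:

    # union of the keys; rows without a match get NaN in the other frame's columns
    pd.merge(df_left, df_right, on='key', how='outer')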

Joining data

left join keeps all rows of the data frame on the left

  • If a row has no match in the data frame on the right, the right-hand columns are filled with missing values

Joining data

Vice versa for right join
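A sketch of both directions with the hypothetical frames:

    # keep every row of df_left; unmatched right-hand values become NaN
    pd.merge(df_left, df_right, on='key', how='left')

    # keep every row of df_right instead
    pd.merge(df_left, df_right, on='key', how='right')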

Joining data

Joining on multiple columns

Joining data

Simply provide a list of columns we want to join on
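A minimal sketch with two made-up frames that share two key columns:

    df_a = pd.DataFrame({'key1': ['x', 'x', 'y'], 'key2': [1, 2, 1], 'v1': [10, 20, 30]})
    df_b = pd.DataFrame({'key1': ['x', 'y'], 'key2': [1, 1], 'v2': [100, 200]})
    pd.merge(df_a, df_b, on=['key1', 'key2'])   # match on both columns at once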

Concatenating data

pd.concat puts chunks of data together row-wise or column-wise

  • Note that, unlike pd.merge, we do not match on values
    • Also, the data frames are passed to pd.concat as a list
  • With axis taking 0, we are concatenating row-wise (vertically)
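A minimal sketch with two made-up chunks:

    import pandas as pd

    part1 = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
    part2 = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})
    pd.concat([part1, part2], axis=0)   # stack the rows vertically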

Concatenating data

With axis taking 1, we are concatenating column-wise (horizontally)
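With the same hypothetical chunks:

    pd.concat([part1, part2], axis=1)   # place the frames side by side, aligned on the index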

Concatenating data

What if the indices of the two data sources do not match?

Concatenating data

If necessary, use reset_index(drop=True) before concatenating

Concatenating data

Use reset_index to assign a new default index
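A sketch, assuming a hypothetical chunk whose index does not start at 0:

    part3 = pd.DataFrame({'x': [5, 6], 'y': [7, 8]}, index=[10, 11])
    part3 = part3.reset_index(drop=True)   # back to the default 0, 1, ... index
    pd.concat([part1, part3], axis=1)      # now the rows line up with part1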

Concatenating data

Set the join argument to 'inner' to keep only the rows whose index labels appear in every data frame (i.e., rows without missing values)
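A sketch with a hypothetical chunk whose index only partially overlaps with part1:

    part4 = pd.DataFrame({'z': [9, 10, 11]}, index=[1, 2, 3])
    pd.concat([part1, part4], axis=1, join='inner')   # keep only the shared index label(s)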

Concatenating data

Use the keys argument to create a hierarchical index

  • This is useful when you want to keep track of the original data sources
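With the hypothetical chunks from above:

    combined = pd.concat([part1, part2], keys=['source1', 'source2'])
    combined                   # the outer index level records the original source
    combined.loc['source2']    # recover the rows that came from part2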


Reshaping

stack rotates from the columns to the rows

  • Let’s create an example data frame
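A minimal sketch of such a frame, with made-up values for three countries across three years:

    import pandas as pd

    gdp = pd.DataFrame(
        {'2019': [1.0, 2.0, 3.0], '2020': [1.1, 2.1, 3.1], '2021': [1.2, 2.2, 3.2]},
        index=['USA', 'UK', 'China'],
    )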

Reshaping

stack rotates from the columns to the rows

  • stack works around the data frame's index
  • For each entry in the index ('USA', 'UK', 'China'), the column names ('2019', '2020', '2021') become the inner index
  • The index of the original data frame becomes the outer index of the resulting series
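With the hypothetical gdp frame above:

    stacked = gdp.stack()   # a Series: country is the outer index, year the inner index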

Reshaping

stack rotates from the columns to the rows

  • What is returned is a series

Reshaping

reset_index() transforms the Series into a DataFrame
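Continuing with the hypothetical stacked series:

    tidy = stacked.reset_index()
    tidy.columns = ['country', 'year', 'value']   # give the new columns informative names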

Reshaping

unstack does the opposite

  • Pick the index level to be used as column names
  • The innermost index (-1) works as column names (by default)

Reshaping

unstack does the opposite

  • Use of unstack with level argument (default: -1)
  • Now the outer index works as column names with level = 0
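With the hypothetical stacked series from above:

    stacked.unstack()          # the innermost level (-1, the years) becomes the columns
    stacked.unstack(level=0)   # the outer level (the countries) becomes the columns instead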

Wide/long data format

Wide format provides a quick overview of all variables

Wide/long data format

Long format is ideal for time-series data or repeated measures

Wide/long data format

From wide to long

  • An example wide-format data frame about coffee intake
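A minimal sketch of such a frame (the column names and values are made up):

    import pandas as pd

    coffee = pd.DataFrame({
        'person': ['Ann', 'Bo', 'Cal'],
        'mon': [2, 1, 3],
        'tue': [3, 0, 2],
        'wed': [1, 2, 4],
    })   # cups per day, one column per day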

Wide/long data format

From wide to long

  • Use melt method
  • Specify id_vars
    • This is the identifier column
    • Columns listed in id_vars keep their name and values; the remaining columns are unpivoted into variable/value pairs
  • (Optionally) var_name and value_name as well

Wide/long data format

From wide to long

  • Specifying only id_vars
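With the hypothetical coffee frame:

    coffee.melt(id_vars='person')   # unpivoted columns get the default names 'variable' and 'value'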

Wide/long data format

From wide to long

  • Specifying id_vars, var_name, and value_name
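Again with the hypothetical coffee frame:

    coffee.melt(id_vars='person', var_name='day', value_name='cups')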

Wide/long data format

From wide to long

  • This is equivalent to setting the identifier variable as the index with set_index and then running stack
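A sketch of the equivalence (up to column names and row order), using the hypothetical coffee frame:

    coffee.set_index('person').stack().reset_index()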

Wide/long data format

From long to wide

  • An example long-format data frame about sales
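A minimal sketch of such a frame (the values are made up):

    import pandas as pd

    sales = pd.DataFrame({
        'store': ['A', 'A', 'B', 'B'],
        'year': [2023, 2024, 2023, 2024],
        'revenue': [100, 120, 80, 90],
        'units': [10, 12, 8, 9],
    })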

Wide/long data format

From long to wide

  • Use the pivot method
  • Specify index, columns, and values
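With the hypothetical sales frame:

    sales.pivot(index='store', columns='year', values='revenue')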

Wide/long data format

From long to wide

  • Multiple values
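Passing a list of value columns produces hierarchical column labels:

    sales.pivot(index='store', columns='year', values=['revenue', 'units'])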

Wide/long data format

From long to wide

  • This is equivalent to creating a hierarchical index using set_index and then running unstack
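A sketch of the equivalence, using the hypothetical sales frame:

    sales.set_index(['store', 'year'])['revenue'].unstack()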

Wide/long data format

From long to wide

  • pivot will raise an error if the specified index and columns pair is not unique
  • pivot_table allows you to reshape data and also aggregate it
  • Useful when there are duplicates that need to be aggregated
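A sketch with the hypothetical sales frame; pivot_table aggregates any duplicate index/columns pairs instead of raising an error:

    sales.pivot_table(index='store', columns='year', values='revenue', aggfunc='mean')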

Wide/long data format

From long to wide

  • aggfunc can take 'sum', 'mean', 'max', etc., or even a custom function
  • Also a list of these (e.g., ['min', 'max'])
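Two variations with the hypothetical sales frame:

    sales.pivot_table(index='store', values='revenue', aggfunc=['min', 'max'])
    sales.pivot_table(index='store', values='revenue', aggfunc=lambda x: x.max() - x.min())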

Group operations and aggregation

Split-Apply-Combine framework

Group operations and aggregation

Grouping

  • Let’s create an example data frame
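A minimal sketch of a hypothetical frame with a grouping key (it includes one missing key, used again below):

    import numpy as np
    import pandas as pd

    people = pd.DataFrame({
        'group': ['a', 'a', 'b', 'b', np.nan],
        'score': [1.0, 2.0, 3.0, 4.0, 5.0],
    })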

Group operations and aggregation

Grouping

  • Generate a GroupBy object
  • An intermediate object that represents the grouping information of the original data frame
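With the hypothetical frame above:

    grouped = people.groupby('group')   # a GroupBy object, not yet an aggregated result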

Group operations and aggregation

Grouping

  • We can actually see the keys and groups in a GroupBy object
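Continuing with the hypothetical GroupBy object:

    grouped.groups           # mapping from each key to the row labels in that group
    grouped.get_group('a')   # the sub-frame for a single key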

Group operations and aggregation

Grouping

  • Subset a GroupBy object and do some simple aggregation
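For example, with the hypothetical grouped object:

    grouped['score'].mean()   # mean score within each group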

Group operations and aggregation

Grouping

  • Set the dropna argument to False to keep rows with missing group keys (as their own group)
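With the hypothetical frame, which contains one missing group key:

    people.groupby('group', dropna=False)['score'].mean()   # NaN keys form their own group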

Group operations and aggregation

Aggregation

  • Aggregation refers to any data transformation that produces scalar values from arrays
  • With groupby, there are many ways to aggregate
  • mean, median, sum, size, min, max, count, etc.

Group operations and aggregation

Aggregation

  • An example data frame on voter turnout by region and district
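A minimal sketch of such a frame (the values are made up):

    import pandas as pd

    turnout = pd.DataFrame({
        'region': ['East', 'East', 'West', 'West'],
        'district': ['E1', 'E2', 'W1', 'W2'],
        'turnout': [0.62, 0.58, 0.71, 0.66],
    })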

Group operations and aggregation

Aggregation

  • For instance, get various statistics on turnout for each region
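With the hypothetical turnout frame:

    turnout.groupby('region')['turnout'].agg(['mean', 'min', 'max', 'count'])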

Group operations and aggregation

Aggregation

  • We can also use our own aggregation functions
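For example, a made-up range function:

    def value_range(x):
        return x.max() - x.min()

    turnout.groupby('region')['turnout'].agg(value_range)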

Group operations and aggregation

Aggregation

  • Index as a column using as_index
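With the hypothetical turnout frame:

    turnout.groupby('region', as_index=False)['turnout'].mean()   # 'region' stays a regular column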

Group operations and aggregation

For more complex tasks involving multiple columns

  • We can use the named-aggregation pattern: new_column_name = (column_name, function)
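A sketch with the hypothetical turnout frame:

    turnout.groupby('region').agg(
        mean_turnout=('turnout', 'mean'),
        n_districts=('district', 'count'),
    )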

Group operations and aggregation

apply operations

  • Apply a function along an axis of a data frame or series
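A sketch with the hypothetical turnout frame:

    # column-wise (axis=0): center each numeric column
    turnout[['turnout']].apply(lambda col: col - col.mean())

    # row-wise (axis=1): build a label from each row
    turnout.apply(lambda row: f"{row['region']}-{row['district']}", axis=1)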
