HSS 611: Programming for HSS
Sep 23, 2025
Single condition
Subsetting with multiple conditions
& (AND) and | (OR) operators
Subsetting rows and columns simultaneously
.loc[] to subset both rows and columns at the same time
Data are often spread across multiple files or databases
Let’s create two example data frames representing two data sets
Let’s create two example data frames representing two data sets
Use pd.merge to join the two data frames
key column using the on argument
Without column(s) to join on specified, pd.merge uses overlapping column(s) automatically
When the column names are different in each data Frame
Use the left_on and right_on arguments
Where have 'c's and 'd's gone?
'how' is set as 'inner''inner' join, the result only contains rows whose keys find a match ('a', 'b')
outer join takes the union of the keys
left join keeps the data frame on the left
Vice versa for right join
Joining on multiple columns
Simply provide a list of columns we want to join on
pd.concat puts chunks of data together row-wise or column-wise
pd.merge, we do not match on values
pd.concataxis taking 0, we are concatenating row-wise (vertically)
With axis taking 1, we are concatenating column-wise (horizontally)
What if the indices on two data source do not match?
If necessary, use the reset_index(drop = True) before concatenating
Use reset_index to assign a new default index
Put 'inner' for the join argument to focus on rows without a missing value
Use key argument to create a hierarchical index
Use key argument to create a hierarchical index
stack rotates from the columns to the rows
stack rotates from the columns to the rows
stack rotates from the columns to the rows
reset_index() transforms the Series into a DataFrame
unstack does the opposite
unstack does the opposite
unstack with level argument (default: -1)level = 0
Wide format provides a quick overview of all variables
Long format is ideal for time-series data or repeated measures
From wide to long
From wide to long
melt methodid_vars
var_name and value_name as well
From wide to long
id_vars
From wide to long
id_vars, var_name, and value_name
From wide to long
set_index (converting the identifier variable as a new index) and running stack
From long to wide
From long to wide
pivot methodindex, columns, and values
From long to wide
values
From long to wide
set_index and then running unstack
From long to wide
pivot will raise an error if the specified index and columns pair is not uniquepivot_table allows you to reshape data and also aggregate it
From long to wide
aggfunc can take 'sum', 'mean', 'max', etc., or even a custom function['min', 'max'])
Split-Apply-Combine framework
Grouping
Grouping
GroupBy object
Grouping
GroupBy object
Grouping
GroupBy object and do some simple aggregation
Grouping
dropna argument to keep missing values
Aggregation
groupby, there are many ways to aggregatemean, median, sum, size, min, max, count, etc.
Aggregation
Aggregation
Aggregation
Aggregation
as_index
For more complex tasks involving multiple columns
new column name = (column name, function) pattern
apply operations
apply operations
apply operations
apply operations
apply operations