HSS 611: Programming for HSS
Sep 23, 2025
Single condition
Subsetting with multiple conditions
&
(AND) and |
(OR) operators
Subsetting rows and columns simultaneously
.loc[]
to subset both rows and columns at the same time
Data are often spread across multiple files or databases
Let’s create two example data frames representing two data sets
Let’s create two example data frames representing two data sets
Use pd.merge
to join the two data frames
key
column using the on
argument
Without column(s) to join on specified, pd.merge
uses overlapping column(s) automatically
When the column names are different in each data Frame
Use the left_on
and right_on
arguments
Where have 'c'
s and 'd'
s gone?
'how'
is set as 'inner'
'inner'
join, the result only contains rows whose keys find a match ('a'
, 'b'
)
outer
join takes the union of the keys
left
join keeps the data frame on the left
Vice versa for right
join
Joining on multiple columns
Simply provide a list of columns we want to join on
pd.concat
puts chunks of data together row-wise or column-wise
pd.merge
, we do not match on values
pd.concat
axis
taking 0, we are concatenating row-wise (vertically)
With axis
taking 1, we are concatenating column-wise (horizontally)
What if the indices on two data source do not match?
If necessary, use the reset_index(drop = True)
before concatenating
Use reset_index
to assign a new default index
Put 'inner'
for the join
argument to focus on rows without a missing value
Use key
argument to create a hierarchical index
Use key
argument to create a hierarchical index
stack
rotates from the columns to the rows
stack
rotates from the columns to the rows
stack
rotates from the columns to the rows
reset_index()
transforms the Series
into a DataFrame
unstack
does the opposite
unstack
does the opposite
unstack
with level
argument (default: -1
)level
= 0
Wide format provides a quick overview of all variables
Long format is ideal for time-series data or repeated measures
From wide to long
From wide to long
melt
methodid_vars
var_name
and value_name
as well
From wide to long
id_vars
From wide to long
id_vars
, var_name
, and value_name
From wide to long
set_index
(converting the identifier variable as a new index) and running stack
From long to wide
From long to wide
pivot
methodindex
, columns
, and values
From long to wide
value
s
From long to wide
set_index
and then running unstack
From long to wide
pivot
will raise an error if the specified index and columns pair is not uniquepivot_table
allows you to reshape data and also aggregate it
From long to wide
aggfunc
can take 'sum'
, 'mean'
, 'max'
, etc., or even a custom function['min', 'max']
)
Split-Apply-Combine framework
Grouping
Grouping
GroupBy
object
Grouping
GroupBy
object
Grouping
GroupBy
object and do some simple aggregation
Grouping
dropna
argument to keep missing values
Aggregation
groupby
, there are many ways to aggregatemean
, median
, sum
, size
, min
, max
, count
, etc.
Aggregation
Aggregation
Aggregation
Aggregation
as_index
For more complex tasks involving multiple columns
new column name = (column name, function)
pattern
apply
operations
apply
operations
apply
operations
apply
operations
apply
operations