NumPy and Pandas I

HSS 611: Programming for HSS

Taegyoon Kim

Sep 18, 2025

Agenda

  • Introduction to NumPy fundamentals
  • Getting started with Pandas

NumPy and Pandas

What is NumPy good for?

  • NumPy stands for Numerical Python
  • Foundational package for numerical computing in Python
    • E.g., basic arithmetic, linear algebra, statistical analysis
  • Many computational packages use NumPy’s array as one of the standard interfaces for data exchange
    • E.g., Pandas, scikit-learn, statsmodels, etc.
  • ndarray (n-dimensional array) from NumPy is typically faster and more memory-efficient than Python's native lists for numerical work
    • Reasons include homogeneity (one data type per array), vectorization, etc.

NumPy and Pandas

ndarray, Series, and DataFrame

  • NumPy provides support for large multidimensional data (ndarray), and Pandas’ key data structures (Series and DataFrame) are built on it

NumPy and Pandas

Broad use cases of NumPy and Pandas

  • NumPy is mainly used for lower-level mathematical and numerical operations
  • Pandas provides a comprehensive toolkit for structured data manipulation
    • Reading/writing data, handling missing data, merging/joining data sets, reshaping data, grouping and aggregating data, etc.

NumPy

Generating ndarray

  • Use np.array() function
  • Converts input data to an ndarray
  • Pass lists, tuples, or other sequence type data objects
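
A minimal sketch of the call described above. The input values are made up for illustration and are not from the slides.

```python
import numpy as np

# Illustrative inputs (assumed for this sketch)
from_list = np.array([1, 2, 3])         # list -> 1-D array
from_tuple = np.array((1.5, 2.5))       # tuple -> 1-D array
nested = np.array([[1, 2], [3, 4]])     # list of lists -> 2-D array

print(type(from_list))
print(nested.shape)
```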

NumPy

Generating ndarray

  • Data should be homogeneous: all elements of an ndarray share one data type, and mixed inputs are converted to a common type
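
As a rough illustration of homogeneity, the sketch below passes mixed inputs to np.array() and inspects the resulting dtype. The values are invented for the example.

```python
import numpy as np

# Integers mixed with a float: every element becomes a float
mixed_numbers = np.array([1, 2.5, 3])

# Numbers mixed with a string: every element becomes a string
mixed_text = np.array([1, "a", 2.5])

print(mixed_numbers.dtype)
print(mixed_text.dtype.kind)   # 'U' marks a Unicode string dtype
```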

NumPy

Do not get confused with lists

  • Note the commas: a printed list shows commas between elements, a printed ndarray does not
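
One way to see the difference, assuming the slide refers to printed output: print the same values as a list and as an array.

```python
import numpy as np

values = [1, 2, 3]          # illustrative values

print(values)               # [1, 2, 3]
print(np.array(values))     # [1 2 3]
```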

NumPy

Other ways to generate ndarray

  • np.zeros(), np.ones()
    • Populate with 0s and 1s
  • For multidimensional arrays, pass a tuple
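
A short sketch of both functions. The sizes are arbitrary choices for the example.

```python
import numpy as np

zeros_1d = np.zeros(3)        # three 0.0 values
ones_2d = np.ones((2, 3))     # 2 rows x 3 columns of 1.0; the shape is a tuple

print(zeros_1d)
print(ones_2d)
```

Both functions produce float64 values unless a dtype argument is given.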

NumPy

Other ways to generate ndarray

  • np.arange() (similar to range())
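
A brief sketch of np.arange() next to the built-in range(); the arguments are illustrative.

```python
import numpy as np

first_five = np.arange(5)       # like range(5), but returns an ndarray
evens = np.arange(0, 10, 2)     # start, stop (exclusive), step

print(first_five)
print(evens)
```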

NumPy

Shape, dimension, and dtype

  • .ndim: number of dimensions (axes)
  • .shape: size along each dimension, as a tuple
  • .dtype: data type of the elements
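
The three attributes can be checked on a small two-dimensional array. The array below is made up for the example.

```python
import numpy as np

grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

print(grid.ndim)    # 2
print(grid.shape)   # (2, 3)
print(grid.dtype)   # float64
```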

NumPy

Casting with astype() method

  • Converting a variable from one data type to another
  • E.g., from string to float
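
A minimal sketch of the string-to-float case, with invented values.

```python
import numpy as np

as_text = np.array(["1.5", "2.0", "3.25"])   # array of strings
as_float = as_text.astype(float)             # new array; as_text is unchanged

print(as_text.dtype.kind)   # 'U'
print(as_float.dtype)       # float64
```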

NumPy

Casting with astype() method

  • Be careful though
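
The slide text does not say which pitfall it has in mind. Two that commonly come up with astype() are sketched here as an assumption: casting floats to integers drops the decimal part instead of rounding, and strings that are not numbers cannot be cast.

```python
import numpy as np

prices = np.array([1.9, -1.9, 2.5])    # illustrative values
truncated = prices.astype(int)         # decimal part is dropped, not rounded
print(truncated.tolist())              # [1, -1, 2]

try:
    np.array(["1.5", "abc"]).astype(float)
    cast_failed = False
except ValueError:
    cast_failed = True                 # "abc" cannot be read as a number
print(cast_failed)                     # True
```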

NumPy

Precision and memory consumption

  • float32, float64: the number is how many bits are used to store each floating-point value (more bits give more precision but use more memory)
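
A small sketch of the trade-off, using an arbitrary array of 1,000 values: halving the bits halves the memory, and the stored value of 0.1 differs between the two types.

```python
import numpy as np

x64 = np.ones(1000, dtype=np.float64)
x32 = x64.astype(np.float32)

print(x64.itemsize, x64.nbytes)   # 8 bytes per value, 8000 bytes in total
print(x32.itemsize, x32.nbytes)   # 4 bytes per value, 4000 bytes in total

# 0.1 is stored with less precision in float32 than in float64
same_value = bool(np.float32(0.1) == np.float64(0.1))
print(same_value)                 # False
```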

NumPy

Vectorization

  • Performing an operation on a whole collection of data at once (as opposed to writing explicit for loops)
    • Arithmetic on arrays is written as if they were scalars
  • Let's create some example data

NumPy

Loop, comprehension, and vectorized array arithmetic

  • See how long it takes with for loop, list comprehension, and vectorized arithmetic
  • Let’s set up

NumPy

Loop, comprehension, and vectorized array arithmetic

  • For loop

NumPy

Loop, comprehension, and vectorized array arithmetic

  • Comprehension

NumPy

Loop, comprehension, and vectorized array arithmetic

  • Vectorized array arithmetic
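
The comparison can be sketched as follows. The data size, the doubling operation, and the timed() helper are assumptions made for this example, and the timings will vary by machine.

```python
import time
import numpy as np

n = 200_000                       # illustrative size
data_list = list(range(n))
data_array = np.arange(n)

def timed(func):
    """Hypothetical helper: run func once, return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

def with_loop():
    out = []
    for x in data_list:           # explicit for loop
        out.append(x * 2)
    return out

loop_result, loop_time = timed(with_loop)
comp_result, comp_time = timed(lambda: [x * 2 for x in data_list])
vec_result, vec_time = timed(lambda: data_array * 2)   # vectorized

print(f"for loop:      {loop_time:.4f} s")
print(f"comprehension: {comp_time:.4f} s")
print(f"vectorized:    {vec_time:.4f} s")
```

All three produce the same numbers; the vectorized version is usually the fastest by a wide margin.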

NumPy

Indexing and slicing

  • One-dimensional arrays behave similarly to lists
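
A short sketch with an illustrative array. The bracket syntax matches lists, with one difference worth noting: assigning a scalar to a slice changes every element in that slice.

```python
import numpy as np

arr = np.arange(10)

third = arr[3]        # single element
middle = arr[2:5]     # slice; the end position is excluded
last = arr[-1]        # negative positions count from the end
print(third, middle, last)

arr[2:5] = 0          # broadcast a scalar into the slice, in place
print(arr)
```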

NumPy

Indexing and slicing

  • N-dimensional arrays
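
For a two-dimensional case, a minimal sketch with a made-up 3 x 3 array:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

row = grid[1]           # second row
element = grid[1, 2]    # row 1, column 2
same = grid[1][2]       # equivalent, but two separate indexing steps
block = grid[:2, 1:]    # first two rows, last two columns

print(row, element, same)
print(block)
```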

NumPy

Indexing and slicing

  • There are many more approaches to indexing/slicing for n-dimensional arrays
  • For the full set of options, see the indexing section of the NumPy documentation

Pandas Series

One-dimensional array-like object

  • Contains a sequence of values of the same type
  • Also contains a sequence of data labels, called index

Pandas Series

index

  • Create a Series with an index identifying each data point with a string
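
A minimal sketch, with invented values and labels:

```python
import pandas as pd

# Values plus a string label for each data point
scores = pd.Series([90, 85, 77], index=["a", "b", "c"])

print(scores)
print(scores["b"])   # look up a value by its label
```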

Pandas Series

index

  • See the index

Pandas Series

index

  • Update the index
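
Viewing and then replacing the index can be sketched like this; the labels are illustrative.

```python
import pandas as pd

scores = pd.Series([90, 85, 77], index=["a", "b", "c"])
print(scores.index)               # the current labels

# Assign a new list of labels; it must have the same length as the Series
scores.index = ["x", "y", "z"]
print(scores)
```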

Pandas Series

name, values, etc.
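
A rough sketch of these attributes. The names "score" and "student" are made up for the example.

```python
import pandas as pd

scores = pd.Series([90, 85, 77], index=["a", "b", "c"], name="score")

print(scores.name)      # name of the Series
print(scores.values)    # the underlying values as an array

scores.index.name = "student"   # the index can carry a name as well
print(scores)
```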

Pandas Series

Indexing and slicing

  • []-based indexing treats integers as labels if the index contains integers
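
The behavior can be seen with an integer index whose labels are deliberately out of order (an illustrative setup). Using loc and iloc makes the intent explicit.

```python
import pandas as pd

s = pd.Series(["x", "y", "z"], index=[2, 0, 1])

by_brackets = s[0]        # 0 is looked up as a label, not a position
by_label = s.loc[0]       # explicit label lookup
by_position = s.iloc[0]   # explicit position lookup

print(by_brackets, by_label, by_position)
```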

Pandas Series

Using NumPy functions or NumPy-like operations
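
A minimal sketch with invented values. The result of each operation is again a Series, and the index labels are preserved.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

roots = np.sqrt(s)     # NumPy function applied element-wise
doubled = s * 2        # scalar arithmetic
large = s[s > 2]       # Boolean filtering

print(roots)
print(large)
```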

Pandas Series

Commonly used methods
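
The methods shown on the original slides are not preserved in this text. The sketch below uses a few widely used Series methods, chosen as an assumption, on made-up data.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red"])
numbers = pd.Series([3, 1, 2])

distinct = colors.unique()                     # distinct values, in order of appearance
counts = colors.value_counts()                 # frequency of each value
is_warm = colors.isin(["red", "green"])        # membership test, element-wise

average = numbers.mean()
ordered = numbers.sort_values()                # ascending by default

print(counts)
print(numbers.describe())                      # summary statistics
```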

Pandas DataFrame

Represents a rectangular table of data

  • Can be thought of as a dictionary of Series all sharing the same index
  • Contains an ordered, named collection of columns

Pandas DataFrame

Many ways to construct a DataFrame

  • One of the most common ways is from a dictionary of equal-length lists or NumPy arrays
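
A minimal sketch of this construction. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

data = {
    "city": ["Seoul", "Busan", "Daejeon"],   # list
    "year": [2023, 2024, 2025],              # list
    "score": np.array([1.5, 2.0, 3.5]),      # NumPy array of the same length
}
df = pd.DataFrame(data)   # each key becomes a column

print(df)
```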

Pandas DataFrame

Indexing and slicing

  • loc indexer works with labels (integer or string)
  • Get rows with loc
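
Row selection with loc can be sketched as follows, using an illustrative DataFrame with string labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2023, 2024, 2025], "score": [1.5, 2.0, 3.5]},
    index=["a", "b", "c"],
)

one_row = df.loc["b"]             # single label -> Series
two_rows = df.loc[["a", "c"]]     # list of labels -> DataFrame
label_slice = df.loc["a":"b"]     # label slices include the end label

print(one_row)
print(label_slice)
```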

Pandas DataFrame

Indexing and slicing

  • What will this return?

Pandas DataFrame

Indexing and slicing

  • Get rows and columns with loc
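
With two arguments, loc takes row labels first and column labels second. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2023, 2024, 2025], "score": [1.5, 2.0, 3.5]},
    index=["a", "b", "c"],
)

cell = df.loc["b", "score"]                 # one row, one column -> scalar
subset = df.loc[["a", "c"], ["score"]]      # lists of labels -> DataFrame
column = df.loc[:, "year"]                  # all rows, one column -> Series

print(cell)
print(subset)
```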

Pandas DataFrame

Indexing and slicing

  • iloc indexer works with positions (0, 1, 2, etc.)
  • This is the case regardless of labels
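
A short sketch showing that iloc ignores the labels entirely. The DataFrame is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2023, 2024, 2025], "score": [1.5, 2.0, 3.5]},
    index=["a", "b", "c"],
)

first_row = df.iloc[0]       # first row, whatever its label is
first_two = df.iloc[0:2]     # positional slices exclude the end position
last_row = df.iloc[-1]       # negative positions count from the end

print(first_row)
print(first_two)
```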

Pandas DataFrame

Indexing and slicing

  • Let’s assign numeric labels

Pandas DataFrame

Indexing and slicing

  • Get rows and columns with iloc, like this
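
With numeric labels the difference between loc and iloc is easy to miss. The sketch below assigns the labels 10, 20, 30 (an arbitrary choice) and compares the two indexers.

```python
import pandas as pd

df = pd.DataFrame({"year": [2023, 2024, 2025], "score": [1.5, 2.0, 3.5]})
df.index = [10, 20, 30]            # numeric labels that differ from positions

by_position = df.iloc[0, 1]        # row position 0, column position 1
block = df.iloc[:2, [0, 1]]        # first two rows, both columns
by_label = df.loc[10, "score"]     # row label 10, column label "score"

try:
    df.loc[0]                      # there is no row labeled 0
    label_missing = False
except KeyError:
    label_missing = True

print(by_position, by_label, label_missing)
```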

Pandas DataFrame

Indexing and slicing

  • Subset columns and then rows
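
The slide's own code is not preserved. One plausible reading, sketched here as an assumption, is to select columns with [] first and then pick rows from the result.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2023, 2024, 2025], "score": [1.5, 2.0, 3.5]},
    index=["a", "b", "c"],
)

cols_then_rows = df[["year", "score"]].iloc[:2]   # two columns, then first two rows
single_value = df["score"]["b"]                   # one column, then one label

print(cols_then_rows)
print(single_value)
```

For reading values this works fine; when assigning values, a single df.loc[rows, columns] call is the safer choice.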

Pandas DataFrame

Indexing and slicing

  • What will this return?

Pandas DataFrame

Sorting

  • By index
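
Sorting by index can be sketched with sort_index(); the data and labels are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [2.0, 3.5, 1.5], "age": [30, 25, 41]},
    index=["b", "c", "a"],
)

by_row = df.sort_index()                        # rows ordered by label
by_row_desc = df.sort_index(ascending=False)    # reverse order
by_col = df.sort_index(axis=1)                  # columns ordered by name

print(by_row)
print(by_col)
```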

Pandas DataFrame

Sorting

  • By value
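
Sorting by value can be sketched with sort_values(); the columns and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [2.0, 3.5, 1.5, 3.0],
})

by_score = df.sort_values("score")                       # ascending
by_score_desc = df.sort_values("score", ascending=False)
by_both = df.sort_values(["group", "score"])             # group first, then score

print(by_score)
print(by_both)
```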

Pandas DataFrame

Handling missing data

  • isna() or its alias isnull() returns True wherever a value is missing
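
A minimal sketch on a made-up DataFrame with one missing value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": ["a", None, "c"],
})

mask = df.isna()                       # True where a value is missing
missing_per_column = df.isna().sum()   # count of missing values per column

print(mask)
print(missing_per_column)
```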

Pandas DataFrame

Handling missing data

  • dropna()
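
The arguments used on the original slides are not preserved. The sketch below shows the default behavior and two common options, on invented data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [1.0, np.nan, np.nan, 4.0],
})

any_missing = df.dropna()               # drop rows with at least one missing value
all_missing = df.dropna(how="all")      # drop rows only if every value is missing
by_subset = df.dropna(subset=["x"])     # look at column x only

print(any_missing)
print(all_missing)
```

dropna() returns a new DataFrame and leaves df unchanged.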

Pandas DataFrame

Removing duplicates

  • duplicated() returns a Boolean value for each row, marking rows that repeat an earlier one
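
A short sketch on made-up data in which the row ("one", 1) appears three times:

```python
import pandas as pd

df = pd.DataFrame({
    "k1": ["one", "two", "one", "two", "one"],
    "k2": [1, 1, 1, 2, 1],
})

is_repeat = df.duplicated()                 # first occurrence is not flagged
is_repeat_last = df.duplicated(keep="last") # last occurrence is not flagged

print(is_repeat)
```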

Pandas DataFrame

Removing duplicates

  • drop_duplicates() removes duplicate rows, keeping the first occurrence by default

Pandas DataFrame

Removing duplicates

  • With keep=False, every row that has a duplicate is dropped, including the first occurrence
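
The effect of the keep argument can be sketched on made-up data in which the row ("one", 1) appears three times.

```python
import pandas as pd

df = pd.DataFrame({
    "k1": ["one", "two", "one", "two", "one"],
    "k2": [1, 1, 1, 2, 1],
})

keep_first = df.drop_duplicates()               # default: keep the first occurrence
keep_last = df.drop_duplicates(keep="last")     # keep the last occurrence
keep_none = df.drop_duplicates(keep=False)      # drop every duplicated row
by_k1 = df.drop_duplicates(subset=["k1"])       # compare on column k1 only

print(keep_first)
print(keep_none)
```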
