Workshop 2, Algorithms and Data Analysis

Author

Alberto Dorantes

Published

September 29, 2025

Abstract
This is an INDIVIDUAL workshop. In this workshop we practice algorithmic thinking and programming with loops and functions, and we introduce programming for data management. We practice with financial problems such as bond valuation, calculation of financial ratios, and doing summary tables of financial variables.

1 General Directions for each workshop

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a google colab document for each workshop.

You must share each Colab document (workshop) with the following account:

  • cdorante@tec.mx

You must give Edit privileges to these accounts.

In Google Colab you can work with Python or R notebooks. The default is Python notebooks.

Your Notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W2-Algorithms-YourFirstName-YourLastname”.

In your Workshop Notebook you have to:

  • You have to replicate all the code of the workshop, and do all the challenges to get full credit for the workshop. The accuracy of the challenge will not significantly affect your grade; completion will have more weight for your workshop grade.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for your further study.

Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks in one-shot with Ctrl-F9. You have to submit to Canvas the web link of your Google Colab workshop.

2 Loops

The for statement is used to repeat certain tasks a specific number of times. We can loop over a sequence of numbers, or we can loop over specific elements of a vector.

In a similar manner, it is possible to program a loop with while statement. The main difference is that you need to specify what is the exit condition at the beginning of the loop. The while loop is useful when we do not know exactly how many iterations are needed to find a result.

Here is an example. If you want to know how many years you need to keep an investment with a fixed interest rate in order to duplicate your initial investment:

# Define variables
APR = 0.10        # Annual Percentage Rate = 10%
INV = 100         # Initial investment = $100
MULTIPLE = 2      # Multiple = 2 (investment should double)
BALANCE = INV     # Start balance equal to initial investment
year = 0          # Start with year = 0

# Loop until balance reaches multiple of initial investment
while BALANCE < MULTIPLE * INV:
    year += 1                           # Increase year by 1
    BALANCE = BALANCE * (1 + APR)       # Apply growth factor

# Print results
print(f"To multiply your investment times {MULTIPLE}, at an interest rate of {100*APR}%, you need {year} years.")
To multiply your investment times 2, at an interest rate of 10.0%, you need 8 years.
print(f"Your balance after {year} years will be ${BALANCE:.2f}")
Your balance after 8 years will be $214.36

It is up to you to practice with the different types of loops. When you have to write a loop, you first have to understand in detail what you are looking for. Then, you have to identify the logic you need to do the iterations and which type of calculations you need to do.

It is very recommended first write a pseudo-code with your own words specifying the details of your logic. The pseudo-code is just a step of sequential tasks with your own words. You can use some names of commands, but if you do not remember, you can just write what you want to do line by line, and specify what are the iterations and conditions of your algorithm.

Once you have your pseudo-code, you can use it as your prompt for your favorite LLM (such as Gemini, chatGPT, etc)

The more structured and clear you write your prompts to generate code, the more effective you will be writing code to automate any interesting and complex task in the context of Finance.

3 Looping vs vectorization

Interestingly, Python and other languages such as R, are designed to easily perform repetitive tasks for data frames, vectors and matrices without using a loop.

In Python, unlike other traditional computer languages such as C, there is an alternative way to do loops without writing loops!

This sounds weird. The way you can do a loop without writing a loop is using “vectorization”. Python can perform vectorization in a natural way. For example, if you want to do a sum of 2 vectors or matrices in a traditional language, you might think in a loop to do an element-by-element operation.

Imagine you want calculate the net present value of a new project in your company. The project involves 2 new products, A and B. Each product will bring its own income. You already estimated the expected annual free cash flow (income minus all expenses) of both products for the next 5 years. The following table shows the expected free cash flow for each year for each product (in millions MXP )

Product Year 1 Year 2 Year 3 Year 4 Year 5
A 3 3 4 6 8
B -2 -3 -1 3 5

The discount rate (or cost of capital) for this project (annual) is 15%. The initial investment of the project is 10 million pesos.

How can you estimate the present value of these cash flows?

You first have to add the cash flows of each product and get an annual free cash flow for the project. You can use a loop that for each year, you do the sum of the 2 corresponding free-cash flows. Something like:

# Define vectors of cash flows as Python lists
A = [3, 3, 4, 6, 8]
B = [-2, -3, -1, 3, 5]

# Initialize list (with zeros) of same length of A 
sumcashflows = [0] * len(A)
#sumcashflows = {}

# Loop through indices
for i in range(len(A)):
  # I sum the cash flows of A and B for the time period i, and store it in sumcashflows:
    sumcashflows[i] = A[i] + B[i]

# I print the resulting stream of cash flows of the project:

print(sumcashflows)
[1, 0, 3, 9, 13]

Now, instead of using a loop, we can use vectorization in a natural way to do the sum of both vectors:

import numpy as np

A = np.array([3, 3, 4, 6, 8])
B = np.array([-2, -3, -1, 3, 5])

# Element-wise sum using vectorization:

sumcashflows = A + B
print(sumcashflows)
[ 1  0  3  9 13]

I used numpy arrays numpy library) instead of lists since lists cannot be vectorized.

Check that the statement A + B performs the vectorization by summing element by element of each array. Internally, this vectorization of arrays actually does a kind of loop, but we do not need to program the loop!!

Data frames (from pandas library) are also vectorized since the operations you do can be done to all columns or all rows.

4 Functions

In any programming language we write an algorithm to automate a process that receive a set of inputs, process the inputs with specific steps, and then generate or return an output or result, which can be one number, a list of numbers, a dataset or any set of bits of information.

If we find that we need to run that algorithm more than one time with different input values, then it is a good idea to write a function that automate the process, but we can easily re-use the function to solve similar problems but with different input values.

In a function the input variables are defined as parameters, and the output is explicitly return at the end of the function.

Here is an example:

Imagine a function that receive a number n and return the factorial of that number. I can write this function in Python as follows:

def factorial(n):
  # There are 3 possibilities about n: n can be negative, zero or positive
  # I will use a conditional if to calculate the value of the factorial of n
  
  if n < 0:
     # If the number is negative, I send a message that the factorial cannot be calculated and return nothing (None)
     print("The parameter n must be positive.")
     return None
  else: # Here n can be zero or positive
   # If n = 0, then I return 1. Why? by definition, in mathematics the 
   # factorial of 0 is 1:
     if n==0: 
       return 1
     else:  # if n>0 I calculate the factorial as the multiplication from 1 to n  
       # Step 1: I start with a variable result = 1 
       result = 1
       # Step 2: I do a for loop to iterate from 1 to n, and for each iteration I multiply result times the iterator (counter) 
       for i in range(1, n+1): 
         result = result * i
       # Step 4: I return the result variable   
       return result   

Once I defined my function, I can call my function more than one time. For example:

num=-5
factnum = factorial(num)
The parameter n must be positive.
print(f"The factorial of {num} is {factnum}")
The factorial of -5 is None
num = 5 
print(f"The factorial of {num} is {factorial(num)}")
The factorial of 5 is 120
num = 10
print(f"The factorial of {num} is {factorial(num)}")
The factorial of 10 is 3628800

Many times it is useful to call a function from a loop. Let’s work with the following Bond valuation problem.

Imagine I want to write a function to value a bond, and then see how sensitive the bond is to changes in market interests rates.

I start by writing a function to calcualte the value (price) of a bond. I will use the following parameters as inputs:

Parameters (inputs) of my function:

  • Year-to-maturity: # of years of the contract

  • Annual interest rate

  • Frequency: # of times a year for the coupon payments

  • Annual coupon rate: % of the Face Value to calculate the annual coupon amount

  • Face value: Principal of the bond

Output:

  • Bond price / value

Instead of writing the steps before, I will write the steps within my function as comments:

def bond_value(years_to_maturity, annual_interest_rate, frequency, annual_coupon_rate, face_value):
  """
  Calculates the value of a bond.

  Arguments:
    years_to_maturity: The number of years until the bond matures.
    annual_interest_rate: The annual yield to maturity (discount rate).
    frequency: The number of coupon payments per year.
    annual_coupon_rate: The annual coupon rate as a decimal.
    face_value: The face value (par value) of the bond.

  Returns:
    The calculated value of the bond.
  """
  # Calculate periodic interest rate and number of periods 
  periodic_interest_rate = annual_interest_rate / frequency
  number_of_periods = years_to_maturity * frequency

  # Calculate periodic coupon payment
  periodic_coupon_payment = (annual_coupon_rate * face_value) / frequency

  # Calculate the present value of coupon payments (annuity)
  pv_coupon_payments = periodic_coupon_payment * (1 - (1 + periodic_interest_rate)**(-number_of_periods)) / periodic_interest_rate

  # Calculate the present value of the face value
  pv_face_value = face_value / (1 + periodic_interest_rate)**number_of_periods

  # Total bond value is the sum of present values of coupon payments and face value
  total_bond_value = pv_coupon_payments + pv_face_value

  return total_bond_value

# Example usage:
# Bond with 10 years to maturity, 5% annual interest rate, semi-annual payments,
# 6% annual coupon rate, and $1000 face value.
years = 10
interest_rate = 0.05
freq = 2
coupon_rate = 0.06
face = 1000

bond_val = bond_value(years, interest_rate, freq, coupon_rate, face)
print(f"The value of the bond is: ${bond_val:,.2f}")
The value of the bond is: $1,077.95

As you see I can use this function to value almost any type of bond. Now, imagine that I want to see how much the value of a specific bond changes when market interest rate changes.

Consider the calculation of a bond price with the following parameters: facevalue = 1,000, frequency = 4, coupon rate = 10%, and year-to-maturity = 10 years. Now, with these characteristics, I want to see how much the bond value changes if interests change from 4% to 20%.

I can write a loop to calculate the bond value by changing the interest rate from 4% to 20% jumping the interest rate by 0.5%. For each iteration save the interest rate and the bond value. At the end of the loop, do a plot to see how the bond price changes when the interest rate changes:

import matplotlib.pyplot as plt

# Bond parameters
face_value = 1000
frequency = 4
annual_coupon_rate = 0.10
years_to_maturity = 10

# Interest rate range
start_interest_rate = 0.04
end_interest_rate = 0.20
step_interest_rate = 0.005

# Lists to store results
interest_rates = []
bond_values = []

# Loop through interest rates and calculate bond value
interest_rate = start_interest_rate
while interest_rate <= end_interest_rate:
    bond_val = bond_value(years_to_maturity, interest_rate, frequency, annual_coupon_rate, face_value)
    interest_rates.append(interest_rate)
    bond_values.append(bond_val)
    interest_rate += step_interest_rate

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(interest_rates, bond_values)
plt.xlabel("Annual Interest Rate")
plt.ylabel("Bond Value")
plt.title("Bond Value vs. Annual Interest Rate")
plt.grid(True)
plt.show()

5 CHALLENGE 1

You are planning to save for retirement over the next 30 years. To do this, you will invest $6,000 a month in a stock account and $3,000 a month in a bond account. The return of the stock account is expected to be 12% compounded monthly, and the bond account will pay 5% annual interest rate also compounded monthly. When you retire, you will combine your money into an account with a 9% annual return. How much can you withdraw each month from your account assuming a 25-year withdrawal period? At the end of 25 years your balance must be zero. Assume that all payments are done at the end of each month.

You have to:

  1. Write/draw your mental map to visualize this problem with your own lens. You can do this in a paper, take a picture and include it in your Colab, or you can use any note software (like OneNote) to do this and paste it as image.

  2. Write down the input and the output

  3. Write the logical steps you propose to solve the problem. Use your own words. Write short and clear sentences for each step.

  4. Paste your inputs, output, and your steps in Gemini to generate Python code. Make sure that your solution is correct. You can check a solution in Excel or carefully review the Python solution.

  5. Make sure you understand the generated Python code. Write down any question about Python code to the next class.

6 Data structures

In Finance there are basically the following dataset structures or data types:

  1. Time-series: in this structure you can have many periods, and information for one “subject” (subject can be company, index, industry, etc). Also, you can have more than one subject, but the information is placed as new column for each subject. For example: the dataset created after running getsymbols:
       p_stock1 r_stock1 p_stock2 r_stock2
2014m1       10     0.02       20     0.01
2014m2       11     0.10       21     0.05
  1. Cross-sectional structure: in this structure, you usually have many “subjects”, for ONLY ONE period For example, key performance indicators of Mexican companies for the last quarter of 2016:
    Ticker   ROA ATO ProfitMargin
1 ALFAA.MX 0.023 0.9         0.15
2  AMXL.MX 0.015 1.1         0.10
  1. Panel-data structure: it is combination of time-series and cross-sectional. In this structure, we have more than one subject, and for each subject we can have more than one period. Example:
     Ticker quarter  ROA ATO ProfitMargin
1  ALFAA.MX  2014q1  0.2 1.1         0.17
2  ALFAA.MX  2014q2 0.21 1.2         0.16
3       ...     ...  ... ...          ...
4 BIMBOA.MX  2014q1 0.15 0.8         0.10
5 BIMBOA.MX  2014q2 0.20 0.9         0.05
6       ...     ...  ... ...          ...
7 BIMBOA.MX  2017q1 0.15 1.1         0.09

7 Data management for panel-data

For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, it is needed to do data wrangling: data collection/importing, data cleanning, data merging, data transformation.

In this example we will learn how to merge two datasets, and then do summary tables that gives us understanding of the data.

We will merge 2 dataset: a cross-sectional dataset with general information of firms from the S&P500 Market index, and a panel dataset with annual financial information for US public firms (A public firm is a company that issue shares in the market)

Download the following csv datasets:

Download and import these 2 datasets with the following code:

import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Download the first file
url1 = 'https://www.apradie.com/datos/usfirms.csv'
response1 = requests.get(url1, headers=headers)
with open('usfirms.csv', 'wb') as f:
    f.write(response1.content)
633061
# Download the second file
url2 = 'https://www.apradie.com/datos/usdata.csv'
response2 = requests.get(url2, headers=headers)
with open('usdata.csv', 'wb') as f:
    f.write(response2.content)
11317275
# Import as data frames
usfirms = pd.read_csv('usfirms.csv')
usdata = pd.read_csv('usdata.csv')

# Display the first few rows of each DataFrame to confirm
print("usfirms DataFrame:")
usfirms DataFrame:
usfirms.head()
    empresa  ...  SectorEconomatica
0         A  ...  Electric Electron
1        AA  ...  Basic & Fab Metal
2  AABA_old  ...    Software & Data
3   AAC_old  ...              Other
4  AAIC_old  ...              Funds

[5 rows x 7 columns]
print("\nusdata DataFrame:")

usdata DataFrame:
usdata.head()
  firm  year     revenue  ...  originalprice  sharesoutstanding  fixedassets
0    A  2000  10773000.0  ...          54.75         456366.381    1741000.0
1    A  2001   8396000.0  ...          28.51         463695.160    1848000.0
2    A  2002   6010000.0  ...          17.96         467024.421    1579000.0
3    A  2003   6056000.0  ...          29.24         476149.083    1447000.0
4    A  2004   7181000.0  ...          24.10         486841.087    1258000.0

[5 rows x 28 columns]

The first dataset contains a catalog of all US firms along with the corresponding industry type and status (active or cancelled), while the second dataset contains the historical financial data of the firms.

The usdata is a panel-dataset, which is called to be a long-format dataset. Each row has financial information for one US firm and 1 period (a year). All $ amounts are in thousands (’1000s).

Here is a data dictionary of the columns:

Data dictionary of historical annual financial data.
Variable Description
firm Firm ticker - Unique code of the company
year Fiscal Year
revenue Total sales of the firm in the fiscal year
cogs Cost of good sold in the fiscal year - variable costs of the products sold
sgae Sales and general administrative expenses in the fiscal year
depreciation Depreciation and ammortization in the fiscal year
otherincome Other operational income/expenses that are not directly from the core operations of the firm
extraordinaryitems Extra income/expenses not related to regular operations
finexp Financial expenses - interest expenses paid in the year (generated from loans)
incometax Income tax paid in the year
cashflowoper Cash flow generated from operations during the year
cashinbanks Cash balance in banks at the end of the year
totalassets Total assets of the firm at the end of the year
currentassets Current assets of the firm at the end of the year
accountsreceivables Accounts receivables at the end of the year
inventory Inventory at the end of the year
totalliabilities Total liabilities of the firm at the end of the year
currentliabilities Current liabilities of the firm at the end of the year
accountspayable Accounts payable at the end of the year
shortdebt Short-term financial debt at the end of the year
longdebt Balance of long-term financial debt (loans to pay longer than 1 year)
retainedearnings Retained earnings in the year
adjprice Stock adjusted price at the end of the year; adjusted for stock splits and dividend payments; used to calculate stock returns
originalprice Historical stock price (not adjusted); used to calculate historical market value
sharesoutstanding Historical number of shares available in the market
fixedassets Fixed assets value at the end of the year

Each row of this dataset has yearly financial data of one firm in one year. All firms have years 2000 to 2023. Not all firms have existed since 2000. Then, it is possible to know when each firm went public to issue shares in the financial market: the first year with some non-empty data.

Earnings before interest and Taxes (ebit) and Net Income (netincome) must be calculated as:

ebit = revenue - cogs - sgae - depreciation

netincome = ebit + otherincome + extraordinaryitems - finexp - incometax

The usfirms.csv is a catalog of all active and cancelled US firms:

Variable Description
empresa Firm ticker - Unique code of the company
Nombre Name of the firm
status Status of the firm: active or cancelled
partind Percent participation in the S&P500 market index
naics1 North American Industry Classification Code - Level 1
naics2 North American Industry Classification Code - Level 2
SectorEconomatica Economatica Industry classification

In the dataus dataset we have annual financial data. For income-statement variables, the information is the amount generated in the corresponding year fiscal year. For balance-sheet variables, the information is the value of the variable at the end of the year (since the beginning when the firm was created).

8 Data merging

In the usdata panel dataset there is no information about firm name, industry, status (active or cancelled), etc. That information is in the usfirms dataset. So, we start by integrating this information into the panel data.

Both datasets have a common/match column: firm ticker, but the column names for firm ticker in the datasets are different. the firm ticker column is called empresa for the usfirms, and firm for the usdata.

Merge the usfirm into the usdata indicating that empresa and firm columns are the matching columns. Only pull the following columns from the usfirm: Nombre, status, naics1 (industry naics level 1) and SectorEconomatica.

You can do this with the following code:

merged_data = pd.merge(usdata, usfirms[['empresa', 'Nombre', 'status', 'naics1']], left_on='firm', right_on='empresa', how='left')

# Drop the redundant 'empresa' column from the merged DataFrame
merged_data = merged_data.drop('empresa', axis=1)

# Display the first few rows of the merged DataFrame to confirm
print("Merged DataFrame:")
Merged DataFrame:
merged_data.head()
  firm  year     revenue  ...                     Nombre  status         naics1
0    A  2000  10773000.0  ...  Agilent Technologies, Inc  active  Manufacturing
1    A  2001   8396000.0  ...  Agilent Technologies, Inc  active  Manufacturing
2    A  2002   6010000.0  ...  Agilent Technologies, Inc  active  Manufacturing
3    A  2003   6056000.0  ...  Agilent Technologies, Inc  active  Manufacturing
4    A  2004   7181000.0  ...  Agilent Technologies, Inc  active  Manufacturing

[5 rows x 31 columns]

This type of merging is called left-join merging since we use the left dataset as base to pull data from the right dataset.

When doing a left-join merging of 2 dataset, it is very useful to visualize the left and the right datasets. The left dataset in this case is the usdata, and the right dataset is the usfirms. The left dataset is the base dataset, and we pull columns from the right dataset according to the matching columns.

When we do a vlookup in Excel, we do a left-join considering the left dataset the current sheet, and the right dataset the sheet you indicate to pull a column.

When we do a left-join, the resulting merged dataset must have the same number of rows than the left dataset. It is alwasy recommended to check the number of rows and columns of the resulting merged dataset to make sure that the merged opration was performed correctly:

We can use the shape method of pandas to show the # of rows and columns for each dataset:

print("Dimention of the usdata: ")
Dimention of the usdata: 
print(usdata.shape)
(62453, 28)
print("Dimention of the usfirm: ")
Dimention of the usfirm: 
print(usfirms.shape)
(5691, 7)
print("Dimention of the merged_data: ")
Dimention of the merged_data: 
print(merged_data.shape)
(62453, 31)

We see that the merged_data has the same # of rows than the left dataset usdata. The # of columns of the merged_data is 28 since we pulled 3 columns from the right dataset. Then, we can continue with our data analysis.

9 CHALLENGE 2: Basic Data analysis and visualization

Using the merged dataset, create a dataset selecting historical information for Apple. Call this dataset apple_financials. The firm code (ticker) for Apple Inc Corporation is AAPL.

Using apple_financials, calculate a column for gross profit and ebit:

grossprofit = revenue - cogs

ebit = grossprofit - sgae - depreciation

Calculate the annual growth for all years for the following: 1) revenue, 2)grossprofit, and 3)ebit

Design and show a plot to visualize how revenue, gross profit and ebit have been growing over the years

Show a table with the annual growth for revenue, gross profit and ebit for all years

Design and show a plot to visualize annual growth variables from the previous table

Create a new column for market value:

marketvalue = originalprice * sharesoutstanding

For revenue, grossprofit, ebit, and marketvalue create a growth index starting with 1.0 in the first year. In other words, the index of each variable must represent how many times the variable has grown with respect to its value in the first year,

Show a plot to see the growth indexes over the years.

Use an LLM to generate the corresponding code to the previous steps. Make sure you understand the Python code. You can ask the LLM to explain the code line by line.

Fix any error generated by the LLM, or any calculation that can be improved, and re-generate the code.

Provide an insight about the previous plots. What can you say about Apple performance in terms of revenue, gross profit, ebit, and marketvalue? Explain with your words.

(Optional) Do an online check, find the most recent 10-K report (annual financial statement) of Apple, and compare 2023 annual sales and EBIT with what you got. You must get the same amount for the annual sales reported at the end of fiscal year (for Apple, the fiscal year is Q3).

10 CHALLENGE 3. Summary tables to provide data understanding

Using the merged dataset (with all firms), do the following:

  • Create columns for grossprofit, ebit and marketvalue as you did with the apple_financials dataset

  • Create a summary table by year with the sum of

    • revenue

    • grossprofit

    • ebit

    • marketvalue

  • Using the previous summary table, do a plot to show revenue, grossprofit, ebit and marketvalue over time

  • Using the previous summary table, create a growth index for each variable to show how many times each variable has grown with respect to the first year. Show a plot to see these indexes over time

Use an LLM to generate the corresponding code to the previous steps. Make sure you understand the Python code. You can ask the LLM to explain the code line by line.

  • Provide an insight about the previous plots. What can you say about the whole financial market performance in terms of revenue, gross profit, ebit, and marketvalue? Explain with your words.

11 Datacamp online courses

YOU MUST TAKE Chapter 1 : Getting to know a dataset from the course: Exploratory Data Analysis in Python

12 W2 submission

The grade of this Workshop will be the following:

  • Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions

  • Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions and/or you did not do all activities and respond to some of the questions.

  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or you completed more but parts of your work is a copy-paste from other workshops.

  • Not submitted (0%)

Remember that you have to submit your Google Colab LINK, and you have to SHARE it with me.