Workshop 2, Business Analytics for Decision Making

Author

Alberto Dorantes D., Ph.D.

Published

January 15, 2026

Abstract

In this workshop we review the basics of a) data management, b) descriptive statistics, c) return calculation, d) the histogram, and d) the normal distribution.

1 Creating your Google Colab Notebook

You have to work on Google Colab for this and the next workshop. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a google colab document for each workshop.

You must share each Colab document (workshop) with the following accounts:

cdorante@tec.mx

You must give Edit privileges to this account.

Rename your Notebook as “W2-YourFirstName YourLastname”.

You must submit your workshop by uploading your Google Colab link in Canvas. What you have to write in your workshop? You have to:

You have to REPLICATE and RUN all the Python code, and
DO ALL CHALLENGES stated in sections. These challenges can be Python code or just responding QUESTIONS with your own words and in CAPITAL LETTERS. You have to WRITE CLEARLY so that I can see your LINE OF THINKING!

I strongly recommended you to write your OWN NOTES about the topics as if it were your study NOTEBOOK.

2 What you have to read from the eBook:

(https://www.apradie.com/StatsBAbook/)

Introduction to Business Analytics

Chapter 3 - Descriptive Analytics

Chapter 5 - The Histogram and the Normal Distribution

Chapter 6 - The Central Limit Theorem

3 Directions

You have to do all excercises and challenges in your Google Colab Notebook

4 Data collection

We will collect real financial data from Yahoo Finance.

Import the following Python libraries:

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

You might have to install the yfinance library. The yfinance library has functions to download online data from Yahoo Finance where you can find financial real historical data for stocks, ETFs, cryptocurrencies, etc from most of the financial markets around the world.

Download monthly stock prices for Nvidia and Microsoft from 2020 to Jan 2026:

data=yf.download(tickers='NVDA, MSFT', start='2020-01-01', end='2026-01-31',interval='1mo', auto_adjust=True)

[                       0%                       ][*********************100%***********************]  2 of 2 completed

I specify auto_adjust=True to indicate that I want to bring Adjusted Closing prices in the Close column. Remember that adjusted closing prices are prices adjusted for dividends and stock splits.

We got monthly stock prices for Nvidia and Microsoft from Jan 2020 into the data data frame.

We can see the first 5 rows of data:

data.head(5)

Price	Close		High		Low		Open		Volume
Ticker	MSFT	NVDA	MSFT	NVDA	MSFT	NVDA	MSFT	NVDA	MSFT	NVDA
Date
2020-01-01	161.630219	5.884465	165.257244	6.458651	148.603332	5.756039	150.758659	5.942207	558530000	6125412000
2020-02-01	153.825455	6.721724	181.066075	7.872832	144.321152	5.860073	161.820085	5.866295	887522300	11848652000
2020-03-01	150.151718	6.564618	166.613084	7.094818	126.168952	4.499602	157.387478	6.895837	1612695500	15773952000
2020-04-01	170.621323	7.278855	171.754279	7.575707	143.153960	5.936794	145.667437	6.366632	984705000	11278304000
2020-05-01	174.467728	8.841315	178.523562	9.146386	165.470623	6.993958	167.374772	7.081370	688845000	12548876000

Yahoo Finance keeps track of open, close, high, low and adjusted prices for each period. We can keep only the adjusted stock prices, which are the prices we must use to calculate financial returns.

We can create another data frame with only the adjusted price columns for both stocks:

adjprices = data['Close']

We see the first 5 and last 5 monthly prices of both stocks:

adjprices

Ticker	MSFT	NVDA
Date
2020-01-01	161.630219	5.884465
2020-02-01	153.825455	6.721724
2020-03-01	150.151718	6.564618
2020-04-01	170.621323	7.278855
2020-05-01	174.467728	8.841315
...	...	...
2025-09-01	516.982422	186.559097
2025-10-01	516.842651	202.478729
2025-11-01	491.090881	176.990143
2025-12-01	483.619995	186.489624
2026-01-01	459.859985	186.229996

73 rows × 2 columns

We can visualize the stock price of Microsoft:

plt.plot(adjprices['MSFT'])
plt.title('Microsoft adjusted prices')
plt.show()

5 Return calculation

5.1 Simple and continuously compounded (cc) return

A financial simple return for a stock in period t (R_{t}) is usually calculated as the closing stock price in t plus any dividend payment at t, and then divide this sum by the previous closing price. It is calculated as a percentage change from the previous period (t-1) to the present period (t):

R_{t}=\frac{\left(price_{t}-price_{t-1}+ dividend_t\right)}{price_{t-1} }=\frac{price_{t}+dividend_t}{price_{t-1}}-1

When a stock pays dividends of do a stock split, the financial exchange make an adjustment to the historical stock prices. This adjustment to the stock prices is made so that we do not need to use dividends nor splits to calculate simple stock returns. Then, it is always recommended to use adjusted prices to calculate stock returns, unless you have information about all dividends payed in the past.

Then, with adjusted prices the formula for simple returns is easier:

R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1} }=\frac{Adjprice_{t}}{price_{t-1}}-1

For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:

R_{Jan2021}=\frac{Adprice_{Jan2021}}{Adprice_{Dec2020}}-1=\frac{100}{80}-1=0.25

We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.

Although the arithmetic mean of simple returns R gives us an idea of average past return, in the case of multi-period average return, this method of calculation can be misleading. Let’s see why this is the case.

Imagine you have only 2 periods and you want to calculate the average return of an investment per period:

Returns over time
Period	Investment value (at the end of the period)	Simple period Return (R)
0	$100	NA
1	$50	-0.50
2	$75	+0.50

Calculating the average simple return of this investment:

\bar{R}=\frac{-0.5+0.5}{2}=0%

Then, the simple average return gives me 0%, while I end up with $75, losing 25% of my initial investment ($100) over the first 2 periods. If I lost 25% of my initial investment over 2 periods, then the average mean return per period might a midpoint between 0 and 25%. The accurate mean return of an investment over time (multi-periods) is the “Geometric Mean” return.

The total return of the investment in the whole period -also called the holding-period return (HPR)- can be calculated as:

HPR=\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)-1

Using the example, the HPR for this investment is:

HPR=\left(1-0.50\right)\left(1+0.50\right)-1=0.75 - 1 = -0.25

And the formula for the geometric average of returns will be:

\bar{R_{g}}=\sqrt[N]{\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)}-1

Caculating the geometric average for this investment:

\bar{R_{g}}=\sqrt[2]{\left(1-0.5\right)\left(1+0.5\right)}-1= -0.13397

Then, the right average return per year is about -13.4% and the HPR for the 2 years is -25%.

However, if we use continuosuly compounded returns (r) instead of simple returns (R), then the arithmetic mean of r is an accurate measure that can be converted to simple returns to get the geometric mean, which is the accurate mean return. Let’s do the same example using continuously compounded returns:

Continuously compounded returns
Period	Investment value (at the end)	Continuously compounded return (r)
0	$100	NA
1	$50	=log(50)-log(100)=-0.6931
2	$75	=log(75)-log(50)=+0.4054

In Finance it is very recommended to calculate continuously compounded returns (cc returns) and using cc returns instead of simple returns for data analysis, statistics and econometric models.

One way to calculate cc returns is by subtracting the natural log of the current adjusted price (at t) minus the natural log of the previous adjusted price (at t-1):

r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})

This is also called as the difference of the log of the price.

We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):

r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)

cc returns are usually represented by small r, while simple returns are represented by capital R.

But why we use natural logarithm to calculate cc returns? First we need to remember what is a natural logarithm.

5.2 Reviewing the concept of natural logarithm

Generate a new dataset with the natural logarithm (log) of MSFT and NVDA

lnprices = np.log(adjprices)

We can see the first rows of this new data frame:

lnprices.head(5)

Ticker	MSFT	NVDA
Date
2020-01-01	5.085311	1.772316
2020-02-01	5.035819	1.905345
2020-03-01	5.011646	1.881694
2020-04-01	5.139447	1.984974
2020-05-01	5.161740	2.179436

As you see, when we apply a mathematical function (in this case, the log function) to a data frame, the function is calculated to all rows of all columns of the data frame.

Let’s see the plot of both, the prices and the log prices of NVDA:

plt.clf()
plt.plot(adjprices['NVDA'])
plt.title('Stock price of NVDA')
plt.show()

plt.clf()
plt.plot(lnprices['NVDA'])
plt.title('Log price of NVDA')
plt.show()

As you see, the log prices have a much smaller scale (from about 1.7 to 4.7). Remember that the natural log price is actually an exponent we raise the number e to get the stock price.

With this plot, we can better appreciate the % growth over time per period. For example, from Jan 2020 to July 2020 the log price increased fro about 1.7 to 2.5. The difference between 2.5 and 1.7, which is 0.8 is the approximate % increase in price, in this case, 80% from Jan 2020 to July 2020.

We have historical monthly adjusted prices for each stock. We can easily calculate simple returns for both stocks:

# adjprices.shift(1) refers to the previous value of adjusted price (at t-1)
R = adjprices / adjprices.shift(1) - 1
# We delete the NA values located at the first period:
R = R.dropna()

We can also calculate continuously compounded returns (r) as follows:

#  For the cc returns we get the difference of the current log price at t and the log price at t-1:
r = np.log(adjprices) - np.log(adjprices.shift(1))
# We delete the NA values located at the first period:
r = r.dropna()

Interestingly, instead of calculating returns, we can use a built-in Python function to do this. For example, the pct_change() function from Pandas calculates simple returns:

plt.clf()
plt.plot(adjprices['MSFT'].pct_change().dropna())
plt.title('Microsoft monthly returns')
plt.show()

6 Descriptive statistics of returns

We can use the describe function to all columns of the r dataframe. Remember that the r dataframe has cc returns of both Microsoft and Nvidia.

# The describe function gets the main descriptive statistics of 1 or more variables
sumret = r.describe()
print(sumret)

Ticker       MSFT       NVDA
count   72.000000  72.000000
mean     0.014522   0.047981
std      0.063162   0.135561
min     -0.113590  -0.385896
25%     -0.038561  -0.026414
50%      0.012064   0.055925
75%      0.059315   0.134034
max      0.162366   0.310008

We see that the mean of monthly cc returns of Microsoft is 1.4522374155134417%, while its standard deviation (volatility) is around 6.316223461838631%! The worse month of Microsoft had a return of -11.359012046271566%! and the best month had a return of 16.236613540090605%.

To know which days were the worse, we can do a selection based on a condition. Let’s see which months Microsoft had return less than 5%:

r[r['MSFT']< -0.05]

Ticker	MSFT	NVDA
Date
2020-09-01	-0.067361	0.011596
2021-09-01	-0.066494	-0.077308
2022-01-01	-0.078334	-0.183145
2022-04-01	-0.105213	-0.385896
2022-06-01	-0.054584	-0.208431
2022-08-01	-0.071109	-0.185089
2022-09-01	-0.113590	-0.217874
2022-12-01	-0.059109	-0.146438
2024-04-01	-0.077540	-0.044700
2024-07-01	-0.066128	-0.054138
2024-10-01	-0.057268	0.089208
2025-03-01	-0.053942	-0.142030
2025-08-01	-0.051560	-0.020964
2025-11-01	-0.051109	-0.134541
2026-01-01	-0.050377	-0.001393

We can see which was the worse month for Microsoft:

r[r['MSFT']==r['MSFT'].min()]

Ticker	MSFT	NVDA
Date
2022-09-01	-0.11359	-0.217874

The worst month for the Microsoft was Sep 2022.

To know the best months for Microsoft:

r[r['MSFT']>0.10].sort_values(by=['MSFT'], ascending=False)

Ticker	MSFT	NVDA
Date
2021-10-01	0.162366	0.210396
2025-05-01	0.152465	0.215624
2023-03-01	0.147365	0.179367
2020-04-01	0.127800	0.103279
2023-11-01	0.113927	0.137050
2020-06-01	0.107645	0.067760

We can also get the main descriptive statistics using methods of panda dataframes. In this case, we get the descriptive statistics of continuously compounded returns:

print("The monthly average cc return of Microsoft is ", r['MSFT'].mean())
print("The monthly variance of Microsoft cc return is ", r['MSFT'].var())
print("The monthly standard deviation (volatility) of Microsoft cc return is ", r['MSFT'].std())
print("The monthly median cc return of Microsoft is ",r['MSFT'].median())

The monthly average cc return of Microsoft is  0.014522374155134417
The monthly variance of Microsoft cc return is  0.003989467881988078
The monthly standard deviation (volatility) of Microsoft cc return is  0.06316223461838631
The monthly median cc return of Microsoft is  0.012063553823854711

7 The Histogram

The histogram was invented to illustrate how the values of a random variable are distributed in its whole range of values. The histogram is a frequency plot. The ranges of values of a variable that are more frequent will have a higher vertical bar compared with the ranges that are less frequent.

With the histogram of a random variable we can appreciate which are the most common values, the least common values, the possible mean and standard deviation of the variable.

In my opinion, the most important foundations/pillars of both, Statistics and the theory of Probability are:

The invention of the Histogram
The discovery of the Central Limit Theorem

Although the idea of a histogram sounds a very simple idea, it took many centuries to be developed, but it has profound impact in the development of Probability theory and Statistics, which both are the pillars of all sciences.

Read HERE chapter 3 about the Histogram and Probability distributions for more details.

Why the histogram is so important in Statistics? I hope we will find this out during this course!

We use the histogram to visualize random variables with historical values. For expected values of random variables we can use the concept of probability density function, which is analogous to the concept of the histogram, but applied to the expectation of possible values of a random variable.

8 The Central Limit Theorem

The Central Limit Theorem is one of the most important discoveries in the history of mathematics and statistics. Actually, thanks to this discovery, the field of modern Statistics was developed at the beginning of the 20th century.

The central limit theorem says that for any random variable with ANY probability distribution, when you take groups (of at least 30 elements from the original distribution), take the mean of each group, then the probability distribution of these means will have the following characteristics:

1) The distribution of the sample means will be close to normal distribution when you take many groups (the size of the groups should be each equal or bigger than 25). Actually, this happens not only with the sample mean , but also with other linear combinations such as the sum or weighted average of the variable.

2) The standard deviation of the sample means will be much less than the standard deviation of the individuals. Being more specifically, the standard deviation of the sample mean will shrink with a factor of 1/\sqrt{N}.

Then, the central limit theorem says that, no matter the original probability distribution of any random variable, if we take groups of this variable, a) the means of these groups will have a probability distribution close to the normal distribution, and b) the standard deviation of the mean will shrink according to the number of elements of each group.

An interesting question about why the standard deviation shrinks so much with a factor of 1/\sqrt{N}? We can prove this with basic probability theory and intuition. Let’s start with intuition.

When you take groups and then take the mean of each group, then extreme values that you could have in each group will cancel out when you take the average of the group. Then, it is expected that the variance of the mean of the group will be much less than variance of the variable. But how much less?

Read Chapter 6 for more detailed explanation for the CLT.

9 CHALLENGE 1 - Descriptive Statistics

Import a dataset from the World Bank site about the GDP per capita for all countries for the year 2024. (Ask Gemini for the code).

Once you download it, review the dataset and decide whether you need to do any data treatment/transformation. Make sure that you only have countries (not regions).

Do you need to do any data treatment/transformation? Explain why yes or why not. If yes, do the data treatments/transformations.

Applying what you learned about descriptive statistics, do the following:

In the world what is the typical GDP per-capita ? justify which measure(s) you used and explain the results with your words
How much (on average) the GDP per-capita varies across the countries? Justify which measure(s) you used and explain the results with your words
In terms of productivity, what can you say about the gaps that exist between the rich and the poor countries? Select only one insight and explain with your words based on the data

10 CHALLENGE 2 - Return calculation and descriptive statistics

Import monthly quotations of the Bitcoin from Jan, 2021 to Jan 15, 2026. You can use the yfinance library.

Once you download it, review the dataset and decide whether you need to do any initial data treatment/transformation.