import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as pltWorkshop 2, Business Analytics for Decision Making
1 Creating your Google Colab Notebook
You have to work on Google Colab for this and the next workshop. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a google colab document for each workshop.
You must share each Colab document (workshop) with the following accounts:
- cdorante@tec.mx
You must give Edit privileges to this account.
Rename your Notebook as “W2-YourFirstName YourLastname”.
You must submit your workshop by uploading your Google Colab link in Canvas. What you have to write in your workshop? You have to:
You have to REPLICATE and RUN all the Python code, and
DO ALL CHALLENGES stated in sections. These challenges can be Python code or just responding QUESTIONS with your own words and in CAPITAL LETTERS. You have to WRITE CLEARLY so that I can see your LINE OF THINKING!
I strongly recommended you to write your OWN NOTES about the topics as if it were your study NOTEBOOK.
2 What you have to read from the eBook:
(https://www.apradie.com/StatsBAbook/)
Introduction to Business Analytics
Chapter 3 - Descriptive Analytics
Chapter 5 - The Histogram and the Normal Distribution
Chapter 6 - The Central Limit Theorem
3 Directions
You have to do all excercises and challenges in your Google Colab Notebook
4 Data collection
We will collect real financial data from Yahoo Finance.
Import the following Python libraries:
You might have to install the yfinance library. The yfinance library has functions to download online data from Yahoo Finance where you can find financial real historical data for stocks, ETFs, cryptocurrencies, etc from most of the financial markets around the world.
Download monthly stock prices for Nvidia and Microsoft from 2020 to Jan 2026:
data=yf.download(tickers='NVDA, MSFT', start='2020-01-01', end='2026-01-31',interval='1mo', auto_adjust=True)[ 0% ][*********************100%***********************] 2 of 2 completed
I specify auto_adjust=True to indicate that I want to bring Adjusted Closing prices in the Close column. Remember that adjusted closing prices are prices adjusted for dividends and stock splits.
We got monthly stock prices for Nvidia and Microsoft from Jan 2020 into the data data frame.
We can see the first 5 rows of data:
data.head(5)| Price | Close | High | Low | Open | Volume | |||||
|---|---|---|---|---|---|---|---|---|---|---|
| Ticker | MSFT | NVDA | MSFT | NVDA | MSFT | NVDA | MSFT | NVDA | MSFT | NVDA |
| Date | ||||||||||
| 2020-01-01 | 161.630219 | 5.884465 | 165.257244 | 6.458651 | 148.603332 | 5.756039 | 150.758659 | 5.942207 | 558530000 | 6125412000 |
| 2020-02-01 | 153.825455 | 6.721724 | 181.066075 | 7.872832 | 144.321152 | 5.860073 | 161.820085 | 5.866295 | 887522300 | 11848652000 |
| 2020-03-01 | 150.151718 | 6.564618 | 166.613084 | 7.094818 | 126.168952 | 4.499602 | 157.387478 | 6.895837 | 1612695500 | 15773952000 |
| 2020-04-01 | 170.621323 | 7.278855 | 171.754279 | 7.575707 | 143.153960 | 5.936794 | 145.667437 | 6.366632 | 984705000 | 11278304000 |
| 2020-05-01 | 174.467728 | 8.841315 | 178.523562 | 9.146386 | 165.470623 | 6.993958 | 167.374772 | 7.081370 | 688845000 | 12548876000 |
Yahoo Finance keeps track of open, close, high, low and adjusted prices for each period. We can keep only the adjusted stock prices, which are the prices we must use to calculate financial returns.
We can create another data frame with only the adjusted price columns for both stocks:
adjprices = data['Close'] We see the first 5 and last 5 monthly prices of both stocks:
adjprices| Ticker | MSFT | NVDA |
|---|---|---|
| Date | ||
| 2020-01-01 | 161.630219 | 5.884465 |
| 2020-02-01 | 153.825455 | 6.721724 |
| 2020-03-01 | 150.151718 | 6.564618 |
| 2020-04-01 | 170.621323 | 7.278855 |
| 2020-05-01 | 174.467728 | 8.841315 |
| ... | ... | ... |
| 2025-09-01 | 516.982422 | 186.559097 |
| 2025-10-01 | 516.842651 | 202.478729 |
| 2025-11-01 | 491.090881 | 176.990143 |
| 2025-12-01 | 483.619995 | 186.489624 |
| 2026-01-01 | 459.859985 | 186.229996 |
73 rows × 2 columns
We can visualize the stock price of Microsoft:
plt.plot(adjprices['MSFT'])
plt.title('Microsoft adjusted prices')
plt.show()5 Return calculation
5.1 Simple and continuously compounded (cc) return
A financial simple return for a stock in period t (R_{t}) is usually calculated as the closing stock price in t plus any dividend payment at t, and then divide this sum by the previous closing price. It is calculated as a percentage change from the previous period (t-1) to the present period (t):
R_{t}=\frac{\left(price_{t}-price_{t-1}+ dividend_t\right)}{price_{t-1} }=\frac{price_{t}+dividend_t}{price_{t-1}}-1
When a stock pays dividends of do a stock split, the financial exchange make an adjustment to the historical stock prices. This adjustment to the stock prices is made so that we do not need to use dividends nor splits to calculate simple stock returns. Then, it is always recommended to use adjusted prices to calculate stock returns, unless you have information about all dividends payed in the past.
Then, with adjusted prices the formula for simple returns is easier:
R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1} }=\frac{Adjprice_{t}}{price_{t-1}}-1
For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:
R_{Jan2021}=\frac{Adprice_{Jan2021}}{Adprice_{Dec2020}}-1=\frac{100}{80}-1=0.25
We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
Although the arithmetic mean of simple returns R gives us an idea of average past return, in the case of multi-period average return, this method of calculation can be misleading. Let’s see why this is the case.
Imagine you have only 2 periods and you want to calculate the average return of an investment per period:
| Period | Investment value (at the end of the period) | Simple period Return (R) |
|---|---|---|
| 0 | $100 | NA |
| 1 | $50 | -0.50 |
| 2 | $75 | +0.50 |
Calculating the average simple return of this investment:
\bar{R}=\frac{-0.5+0.5}{2}=0%
Then, the simple average return gives me 0%, while I end up with $75, losing 25% of my initial investment ($100) over the first 2 periods. If I lost 25% of my initial investment over 2 periods, then the average mean return per period might a midpoint between 0 and 25%. The accurate mean return of an investment over time (multi-periods) is the “Geometric Mean” return.
The total return of the investment in the whole period -also called the holding-period return (HPR)- can be calculated as:
HPR=\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)-1
Using the example, the HPR for this investment is:
HPR=\left(1-0.50\right)\left(1+0.50\right)-1=0.75 - 1 = -0.25
And the formula for the geometric average of returns will be:
\bar{R_{g}}=\sqrt[N]{\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)}-1
Caculating the geometric average for this investment:
\bar{R_{g}}=\sqrt[2]{\left(1-0.5\right)\left(1+0.5\right)}-1= -0.13397
Then, the right average return per year is about -13.4% and the HPR for the 2 years is -25%.
However, if we use continuosuly compounded returns (r) instead of simple returns (R), then the arithmetic mean of r is an accurate measure that can be converted to simple returns to get the geometric mean, which is the accurate mean return. Let’s do the same example using continuously compounded returns:
| Period | Investment value (at the end) | Continuously compounded return (r) |
|---|---|---|
| 0 | $100 | NA |
| 1 | $50 | =log(50)-log(100)=-0.6931 |
| 2 | $75 | =log(75)-log(50)=+0.4054 |
In Finance it is very recommended to calculate continuously compounded returns (cc returns) and using cc returns instead of simple returns for data analysis, statistics and econometric models.
One way to calculate cc returns is by subtracting the natural log of the current adjusted price (at t) minus the natural log of the previous adjusted price (at t-1):
r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})
This is also called as the difference of the log of the price.
We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):
r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)
cc returns are usually represented by small r, while simple returns are represented by capital R.
But why we use natural logarithm to calculate cc returns? First we need to remember what is a natural logarithm.
5.2 Reviewing the concept of natural logarithm
Generate a new dataset with the natural logarithm (log) of MSFT and NVDA
lnprices = np.log(adjprices)We can see the first rows of this new data frame:
lnprices.head(5)| Ticker | MSFT | NVDA |
|---|---|---|
| Date | ||
| 2020-01-01 | 5.085311 | 1.772316 |
| 2020-02-01 | 5.035819 | 1.905345 |
| 2020-03-01 | 5.011646 | 1.881694 |
| 2020-04-01 | 5.139447 | 1.984974 |
| 2020-05-01 | 5.161740 | 2.179436 |
As you see, when we apply a mathematical function (in this case, the log function) to a data frame, the function is calculated to all rows of all columns of the data frame.
Let’s see the plot of both, the prices and the log prices of NVDA:
plt.clf()
plt.plot(adjprices['NVDA'])
plt.title('Stock price of NVDA')
plt.show()plt.clf()
plt.plot(lnprices['NVDA'])
plt.title('Log price of NVDA')
plt.show()As you see, the log prices have a much smaller scale (from about 1.7 to 4.7). Remember that the natural log price is actually an exponent we raise the number e to get the stock price.
With this plot, we can better appreciate the % growth over time per period. For example, from Jan 2020 to July 2020 the log price increased fro about 1.7 to 2.5. The difference between 2.5 and 1.7, which is 0.8 is the approximate % increase in price, in this case, 80% from Jan 2020 to July 2020.
We have historical monthly adjusted prices for each stock. We can easily calculate simple returns for both stocks:
# adjprices.shift(1) refers to the previous value of adjusted price (at t-1)
R = adjprices / adjprices.shift(1) - 1
# We delete the NA values located at the first period:
R = R.dropna()We can also calculate continuously compounded returns (r) as follows:
# For the cc returns we get the difference of the current log price at t and the log price at t-1:
r = np.log(adjprices) - np.log(adjprices.shift(1))
# We delete the NA values located at the first period:
r = r.dropna()Interestingly, instead of calculating returns, we can use a built-in Python function to do this. For example, the pct_change() function from Pandas calculates simple returns:
plt.clf()
plt.plot(adjprices['MSFT'].pct_change().dropna())
plt.title('Microsoft monthly returns')
plt.show()6 Descriptive statistics of returns
We can use the describe function to all columns of the r dataframe. Remember that the r dataframe has cc returns of both Microsoft and Nvidia.
# The describe function gets the main descriptive statistics of 1 or more variables
sumret = r.describe()
print(sumret) Ticker MSFT NVDA
count 72.000000 72.000000
mean 0.014522 0.047981
std 0.063162 0.135561
min -0.113590 -0.385896
25% -0.038561 -0.026414
50% 0.012064 0.055925
75% 0.059315 0.134034
max 0.162366 0.310008
We see that the mean of monthly cc returns of Microsoft is 1.4522374155134417%, while its standard deviation (volatility) is around 6.316223461838631%! The worse month of Microsoft had a return of -11.359012046271566%! and the best month had a return of 16.236613540090605%.
To know which days were the worse, we can do a selection based on a condition. Let’s see which months Microsoft had return less than 5%:
r[r['MSFT']< -0.05]| Ticker | MSFT | NVDA |
|---|---|---|
| Date | ||
| 2020-09-01 | -0.067361 | 0.011596 |
| 2021-09-01 | -0.066494 | -0.077308 |
| 2022-01-01 | -0.078334 | -0.183145 |
| 2022-04-01 | -0.105213 | -0.385896 |
| 2022-06-01 | -0.054584 | -0.208431 |
| 2022-08-01 | -0.071109 | -0.185089 |
| 2022-09-01 | -0.113590 | -0.217874 |
| 2022-12-01 | -0.059109 | -0.146438 |
| 2024-04-01 | -0.077540 | -0.044700 |
| 2024-07-01 | -0.066128 | -0.054138 |
| 2024-10-01 | -0.057268 | 0.089208 |
| 2025-03-01 | -0.053942 | -0.142030 |
| 2025-08-01 | -0.051560 | -0.020964 |
| 2025-11-01 | -0.051109 | -0.134541 |
| 2026-01-01 | -0.050377 | -0.001393 |
We can see which was the worse month for Microsoft:
r[r['MSFT']==r['MSFT'].min()]| Ticker | MSFT | NVDA |
|---|---|---|
| Date | ||
| 2022-09-01 | -0.11359 | -0.217874 |
The worst month for the Microsoft was Sep 2022.
To know the best months for Microsoft:
r[r['MSFT']>0.10].sort_values(by=['MSFT'], ascending=False)| Ticker | MSFT | NVDA |
|---|---|---|
| Date | ||
| 2021-10-01 | 0.162366 | 0.210396 |
| 2025-05-01 | 0.152465 | 0.215624 |
| 2023-03-01 | 0.147365 | 0.179367 |
| 2020-04-01 | 0.127800 | 0.103279 |
| 2023-11-01 | 0.113927 | 0.137050 |
| 2020-06-01 | 0.107645 | 0.067760 |
We can also get the main descriptive statistics using methods of panda dataframes. In this case, we get the descriptive statistics of continuously compounded returns:
print("The monthly average cc return of Microsoft is ", r['MSFT'].mean())
print("The monthly variance of Microsoft cc return is ", r['MSFT'].var())
print("The monthly standard deviation (volatility) of Microsoft cc return is ", r['MSFT'].std())
print("The monthly median cc return of Microsoft is ",r['MSFT'].median())The monthly average cc return of Microsoft is 0.014522374155134417
The monthly variance of Microsoft cc return is 0.003989467881988078
The monthly standard deviation (volatility) of Microsoft cc return is 0.06316223461838631
The monthly median cc return of Microsoft is 0.012063553823854711
7 The Histogram
The histogram was invented to illustrate how the values of a random variable are distributed in its whole range of values. The histogram is a frequency plot. The ranges of values of a variable that are more frequent will have a higher vertical bar compared with the ranges that are less frequent.
With the histogram of a random variable we can appreciate which are the most common values, the least common values, the possible mean and standard deviation of the variable.
In my opinion, the most important foundations/pillars of both, Statistics and the theory of Probability are:
The invention of the Histogram
The discovery of the Central Limit Theorem
Although the idea of a histogram sounds a very simple idea, it took many centuries to be developed, but it has profound impact in the development of Probability theory and Statistics, which both are the pillars of all sciences.
Read HERE chapter 3 about the Histogram and Probability distributions for more details.
Why the histogram is so important in Statistics? I hope we will find this out during this course!
We use the histogram to visualize random variables with historical values. For expected values of random variables we can use the concept of probability density function, which is analogous to the concept of the histogram, but applied to the expectation of possible values of a random variable.
8 The Central Limit Theorem
The Central Limit Theorem is one of the most important discoveries in the history of mathematics and statistics. Actually, thanks to this discovery, the field of modern Statistics was developed at the beginning of the 20th century.
The central limit theorem says that for any random variable with ANY probability distribution, when you take groups (of at least 30 elements from the original distribution), take the mean of each group, then the probability distribution of these means will have the following characteristics:
1) The distribution of the sample means will be close to normal distribution when you take many groups (the size of the groups should be each equal or bigger than 25). Actually, this happens not only with the sample mean , but also with other linear combinations such as the sum or weighted average of the variable.
2) The standard deviation of the sample means will be much less than the standard deviation of the individuals. Being more specifically, the standard deviation of the sample mean will shrink with a factor of 1/\sqrt{N}.
Then, the central limit theorem says that, no matter the original probability distribution of any random variable, if we take groups of this variable, a) the means of these groups will have a probability distribution close to the normal distribution, and b) the standard deviation of the mean will shrink according to the number of elements of each group.
An interesting question about why the standard deviation shrinks so much with a factor of 1/\sqrt{N}? We can prove this with basic probability theory and intuition. Let’s start with intuition.
When you take groups and then take the mean of each group, then extreme values that you could have in each group will cancel out when you take the average of the group. Then, it is expected that the variance of the mean of the group will be much less than variance of the variable. But how much less?
Read Chapter 6 for more detailed explanation for the CLT.
9 CHALLENGE 1 - Descriptive Statistics
Import a dataset from the World Bank site about the GDP per capita for all countries for the year 2024. (Ask Gemini for the code).
Once you download it, review the dataset and decide whether you need to do any data treatment/transformation. Make sure that you only have countries (not regions).
- Do you need to do any data treatment/transformation? Explain why yes or why not. If yes, do the data treatments/transformations.
Applying what you learned about descriptive statistics, do the following:
In the world what is the typical GDP per-capita ? justify which measure(s) you used and explain the results with your words
How much (on average) the GDP per-capita varies across the countries? Justify which measure(s) you used and explain the results with your words
In terms of productivity, what can you say about the gaps that exist between the rich and the poor countries? Select only one insight and explain with your words based on the data
10 CHALLENGE 2 - Return calculation and descriptive statistics
Import monthly quotations of the Bitcoin from Jan, 2021 to Jan 15, 2026. You can use the yfinance library.
Once you download it, review the dataset and decide whether you need to do any initial data treatment/transformation.
Applying what you learned about descriptive statistics, do the following:
- If you had invested in Bitcoin from Jan 2021 up to Jan 2026:
what is the monthly average % simple return you would have gained?
What is the holding-period return (total % return) you had gained over the whole period?
How much variation (on average) in % simple monthly return you would had experienced in this period?
Do any data transformation you need, and explain why and how you did so.
Ask Gemini for the code. Bellow is an example of a simple code to bring the Bitcoin quotations:
import yfinance as yf
BTC=yf.download(tickers='BTC-USD', start='2021-01-01', end='2026-01-15', interval='1mo')
# I calculate simple returns :
BTC["R"] = (BTC["Close"] / BTC["Close"].shift(1)) - 1
# I calculate the continuously compounded returns:
# This time I use the diff function, which gets the difference between the log value of today and the previous log value:
BTC["r"] = np.log(BTC['Close']).diff(1)
# I keep a new object with only returns:
BTCR = BTC[['R','r']].copy()YF.download() has changed argument auto_adjust default to True
[*********************100%***********************] 1 of 1 completed