Workshop 2, Business Analytics for Decision Making
1 Workshop Directions
You have to work on Google Colab for this workshop. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab document for each workshop.
You must share each Colab document (workshop) with the following accounts:
- cdorante@tec.mx
You must give Edit privileges to this account.
Rename your Notebook as “W2-YourFirstName YourLastname”.
You must submit your workshop by uploading your Google Colab link in Canvas. What do you have to write in your workshop? You have to:
You have to REPLICATE and RUN all the Python code, and
DO ALL CHALLENGES stated in the sections. These challenges can be Python code or just responding to QUESTIONS with your own words and in CAPITAL LETTERS. You have to WRITE CLEARLY so that I can see your LINE OF THINKING!
I strongly recommend you write your OWN NOTES about the topics as if this were your study NOTEBOOK.
2 Descriptive Statistics
Descriptive statistics is a set of summaries of raw data related to one or several variables of a phenomenon. Descriptive statistics usually gives us a first general idea of a phenomenon by looking at summaries such as averages and variability of variables that represent different aspects of a phenomenon.
In Economics, for example, we might be interested in learning about the economic development of a country over the past 10 years; for example, we can calculate the annual average growth rate of its GDP. In Finance we might be interested in knowing the annual average return of an investment for the last 5 years, and also the variability of annual returns over time (the volatility).
There are two types of descriptive statistics measures:
Measures of central tendency, and
Measures of dispersion
2.1 Central tendency measures
The main central tendency measures are:
Arithmetic mean
Geometric mean
Median
Mode
2.1.1 Arithmetic mean
An arithmetic mean of a variable X is a simple measure that tells us the average value of all valid values of X, assuming that each value has the same importance (or weight). The variable X can represent any attribute of a subject. A subject can be an individual, a group, a team, a business unit, a company, a financial portfolio, an industry, a region, a country, etc.
An example of a variable X can be the monthly sales amount of a company for the last 3 years. In this case, the variable X will have 36 observations (36 monthly sales). The subject here is a company and the variable or attribute is the company sales over time. Another example can be a variable that represents the monthly returns of a financial portfolio over the last 3 years. In this case, the variable might have about 36 observations, one for each month. The subject in this example is a financial portfolio, that might be composed of more than one stock.
To calculate the arithmetic mean of a variable X we simply sum all the non-missing values of the variable and then divide the sum by the number of non-missing values. The calculation is as follows:
\bar{X}=\frac{\sum_{i=1}^{N}X_{i}}{N}
Where N is the number of non-missing values (observations) of X. A missing value happens when the variable X has no value for a specific observation. It is important to note that a missing value is not a zero value. When we work with real-world datasets, it is very common to find missing values in many variables.
One of the disadvantages of the arithmetic mean is that it is very sensitive to extreme values. If a variable has a few extreme values, the arithmetic mean might not be a good representation of an average or mid point. In the presence of a few very extreme values in a variable, the best measure of central tendency is the median, not the arithmetic mean.
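To see how missing values are handled in practice, here is a minimal sketch (with made-up numbers) of how pandas computes the mean by ignoring NaN values:

# Minimal sketch with made-up numbers: pandas ignores missing values (NaN) in the mean
import numpy as np
import pandas as pd
X = pd.Series([10.0, 20.0, np.nan, 30.0])
print(X.mean())            # 20.0 -> sum of the 3 valid values divided by N=3
print(X.fillna(0).mean())  # 15.0 -> wrongly treating the missing value as a zero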
2.1.2 The Median
Another measure of central tendency is the median. The median of a variable is its 50th percentile, which is the mid point of its values when the values are sorted in ascending order. When we have an even number of observations, there will be 2 mid points, so the median will be equal to the arithmetic mean of these 2 mid points. When we have an odd number of observations there will be only 1 value in the middle, which is the median.
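A quick sketch (with made-up numbers) of the odd and even cases:

# Quick check of the odd vs even case (made-up numbers)
import numpy as np
odd = np.array([3, 1, 9, 5, 7])       # 5 values: the median is the 3rd sorted value
even = np.array([3, 1, 9, 5, 7, 11])  # 6 values: the median averages the 3rd and 4th sorted values
print(np.median(odd))   # 5.0
print(np.median(even))  # 6.0 = (5 + 7) / 2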
For example, if we want to know the typical size of all companies that trade shares in the Mexican stock market, we can calculate the median of firm size. These firms are called public firms. Firm size can be measured with different variables: we can use the total value of assets (total assets), the market value, or the number of employees. In this example we will use total assets at the end of 2018 for all public Mexican firms. At the end of 2018 there were 146 Mexican public firms in the market exchange (“Bolsa Mexicana de Valores”). I will show how to calculate the median total assets of these 146 firms.
The 2018 total assets of the 146 Mexican public firms are shown below (sorted alphabetically):
Firm | row # | Industry | 2018 Total Assets (in thousand pesos) |
---|---|---|---|
ACCEL | 1 | Services | $6,454,560.00 |
AEROMEXICO | 2 | Transport Services | $76,772,848.00 |
… | … | … | … |
VOLARIS | 148 | Transport Services | $22,310,652.00 |
WALMART | 149 | Retail | $306,528,832.00 |
We sort the list from the lowest to the highest value of 2018 total assets:
Firm | row # | Industry | Size Rank | 2018 Total Assets (in thousand pesos) |
---|---|---|---|---|
INGEAL | 98 | Food & Beverages | 1 | $171,104.00 |
HIMEXSA | 88 | Textile | 2 | $494,378.00 |
… | … | … | … | … |
FHIPO14 | 45 | Real Estate | 73 | $27,979,184.00 |
TVAZTECA | 139 | Telecommunications | 74 | $27,988,054.00 |
… | … | … | … | … |
AMERICA MOVIL | 8 | Telecommunications | 145 | $1,429,223,392.00 |
GFBANORTE | 69 | Financial Services | 146 | $1,620,470,400.00 |
The median total assets is the mid point of the list. However, in this case I have 146 firms, so there is no exact mid point. I need to calculate the arithmetic average of the assets of the 2 firms in the middle (positions 73 and 74). The median is then $27,983,619.00 thousand pesos (about 27 billion pesos), which is the average of the FHIPO14 and TVAZTECA assets. The arithmetic mean of total assets for the 146 firms is $97,860,896.23 thousand pesos (about 97.8 billion pesos), which is much bigger than the median. Then, which measure better represents the typical size of Mexican public firms? In this case, the best measure is the median, so we can say that the typical size of a Mexican public firm is about $27.9 thousand million pesos.
Then, what is the difference between the mean and the median? When the distribution of the values of a variable is very close to a normal distribution, the mean and the median will be very similar, so we can use either one to represent the typical value of the variable. When the variable has a few very extreme values, the distribution of values will not be similar to a normal distribution; it will have fat tails due to the presence of extreme values. In this case the best measure of central tendency is the median, not the mean.
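We can check this with a handful of the 2018 total assets reported in the table above (in thousand pesos); one extreme firm is enough to pull the mean far away from the median:

# A few 2018 total assets from the table above (in thousand pesos); one extreme firm skews the sample
import numpy as np
assets = np.array([171_104, 494_378, 6_454_560, 22_310_652, 27_988_054, 1_620_470_400])
print(np.mean(assets))    # about 279,648,191: pulled up by the extreme firm
print(np.median(assets))  # 14,382,606: a much better "typical" value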
What is a normal distribution? It is a very common probability distribution of random variables. We will explain probability distributions in more detail later. For now, just consider that many aggregate variables of all disciplines and natures follow a close-to-normal distribution.
The median gives a better representation of the “average” value of a variable compared with the arithmetic mean when the distribution of the values does NOT follow a normal distribution. In the case of 2018 total assets, we can explore its distribution using a histogram:
I will later explain in more detail what a histogram is.
In a histogram we see how often different ranges of values of a variable appear. This histogram does not look like that of a normally distributed variable. This histogram is said to be “skewed” to the right since there are very few firms with very high values of total assets. Normally distributed variables look like a bell-shaped curve where most of the values are around the arithmetic mean. In this case, we can see that most of the firms (about 100 firms) have total assets between 0 and $25 thousand million pesos. Since there are 146 firms in total, only about 46 firms have assets higher than $25 thousand million (or 25 billion pesos). Actually, I can see that there are very few firms with assets greater than $1,000 thousand million (greater than $1 trillion pesos), and one above $1,500 thousand million ($1.5 trillion). Looking at the previous table we can see that AMERICA MOVIL and GFBANORTE have assets greater than $1.4 trillion pesos.
With the histogram we can see that most of the firms (about 67%, 100 out of 146) have assets of less than 25 billion pesos. The arithmetic mean of total assets is more than $97 billion, and the median total assets (or 50th percentile) is about $27 billion. The arithmetic mean is very sensitive to extreme values, while the median is not. If we used the mean as a measure of the typical size of a Mexican firm, we would be very far from the most common values of total assets. Then, the best measure of typical size is the median, which is about $27 billion pesos.
In sum, for skewed distributions the median will always be the best measure of central tendency, while the arithmetic mean will be a biased measure that does not represent the central or typical value.
In the case of a variable with a close-to-normal distribution, the median will be very close to the mean, so the median is always a good measure of central tendency.
Examples of business variables with a skewed distribution similar to total assets are employee salaries, income of families in a region or country, any variable from the income statement such as firm sales, firm profits, etc.
2.1.3 Mode
The mode is the value that appears most often in a variable. The mode can be calculated only for discrete variables, not for continuous variables. The mode is rarely used as a central tendency measure.
2.2 Geometric mean
The geometric mean is mainly used to calculate the average growth rate of a variable per period over time. It is usually applied to historical rates of growth. For example, the following table shows the annual percentage growth rate of Mexican real GDP (adjusted for inflation) (source: https://datos.bancomundial.org/indicador/NY.GDP.MKTP.KD.ZG?locations=MX):
Year | GDP % growth = R_t |
---|---|
2019 | -0.39% |
2020 | -8.35% |
2021 | 6.05% |
2022 | 3.69% |
2023 | 3.2% |
2024 | 1.3% |
These annual % growth rates are cumulative since each rate is applied to the previous year's GDP. In this case, the geometric mean is the best measure of the average annual growth rate. In Economics, this average growth is called the compound annual growth rate (CAGR). The formula for the geometric average is:
\bar{R}_{g}=\sqrt[N]{(1+R_{1})*(1+R_{2})*...*(1+R_{N})}-1
Where
R_t = % growth rate for period t
N = Number of periods
Then, for this case, the annual average growth rate of the Mexican real GDP is:
CAGR=\sqrt[6]{(1-0.0039)*(1-0.0835)*(1+0.0605)*(1+0.0369)*(1+0.032)*(1+0.013)}-1
CAGR=0.0081 = 0.81\%
We can say that the Mexican economy -measured by its real GDP- grew on average 0.8% per year (less than 1%) over the 6-year period 2019-2024.
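A minimal sketch that replicates this CAGR calculation with numpy, using the growth rates from the table above:

# CAGR of Mexican real GDP, 2019-2024 (rates from the table above, in decimals)
import numpy as np
R = np.array([-0.0039, -0.0835, 0.0605, 0.0369, 0.032, 0.013])
cagr = np.prod(1 + R) ** (1 / len(R)) - 1
print(round(cagr, 4))  # about 0.0081, i.e. 0.81% average annual growth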
2.3 Dispersion measures
2.3.1 Variance and standard deviation
Standard deviation is used to measure how much, on average, the individual values of a variable deviate from their mean.
The variance of a variable X is the average of the squared deviations of each individual value X_i from the mean:
Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}
Where:
X_i = Value i of the variable X
\overline{X}=\frac{1}{N}\sum_{i=1}^{N}X_{i} = Arithmetic average of X
Why is the variance the average of squared deviations? Because if we do not square the deviations, they will cancel each other out, since some deviations are positive and others negative. The squaring is just a trick to avoid canceling the positive deviations with the negative ones.
The result of the variance is a number that our brain cannot easily interpret. To have a more intuitive measure of linear deviation, we take the square root of the variance; then we can interpret that number as the average deviation of all points from their mean. This measure is called the standard deviation:
SD(X)=\sqrt{Var(X)}= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}=\sigma_{X}
The variance can also be expressed as the expected value of squared deviations:
Var(X)=E[(X-\bar{X})^2]
Doing the multiplication of the squared term:
Var(X)=E[(X^2-X\bar{X}-\bar{X}X+\bar{X}^2)]
Since \bar{X} is a constant, I can take it out of the expectation:
Var(X)=E[X^2]-\bar{X}E[X]-\bar{X}E[X]+\bar{X}^2
Since E[X]=\bar{X}, then:
Var(X)=E[X^2]-\bar{X}^2
Then, the variance can be defined as the expected value of X squared minus its squared mean.
Also, we can express the variance of X as:
Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}\right)^2-\bar{X}^2
Most Statistics books and Statistics software use (N-1) instead of N as the denominator of the variance formula to get a more conservative value of the variance. This measure is called sample variance. When we divide by N in the variance formula, we are calculating the population variance. Both formulas provide very similar results, but the sample variance will be a bit bigger than the population variance, so it is a more conservative value.
In Statistics, the sample variance is an unbiased measure of the underlying (real) variance.
Then, we can re-write the formula for sample variance of X as:
Var(X)=\frac{1}{n-1}\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}
And the sample standard deviation of X can be written as:
SD(X)=\sqrt{Var(X)}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}
SD(X)=\frac{\sqrt{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}}{\sqrt{(n-1)}}=\sigma_{X}
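A short sketch of the population vs sample versions using numpy's ddof parameter (made-up numbers):

# Population vs sample variance with numpy (made-up numbers)
import numpy as np
x = np.array([2.0, 4.0, 6.0, 8.0])
print(np.var(x))          # population variance, divides by N    -> 5.0
print(np.var(x, ddof=1))  # sample variance, divides by (N-1)    -> 6.6667
print(np.std(x, ddof=1))  # sample standard deviation            -> 2.5820
# Note: pandas .var() and .std() use the sample version (ddof=1) by default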
2.4 CHALLENGE: Data management and Descriptive Statistics
For this Section 2.4, you have to replicate and run all the code.
We will use real financial data.
2.4.1 Data collection and visualization
Import the following Python libraries:

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
You might have to install the yfinance library. The yfinance library has functions to download online data from Yahoo Finance, where you can find real historical financial data for stocks, ETFs, cryptocurrencies, etc., from most financial markets around the world.
Download monthly stock prices for Nvidia and Microsoft from 2020 to Jan 2025:
data = yf.download(tickers='NVDA, MSFT', start='2020-01-01', end='2025-01-31', interval='1mo', auto_adjust=True)
I specify auto_adjust=True to indicate that I want to bring Adjusted Closing prices in the Close column. Remember that adjusted closing prices are prices adjusted for dividends and stock splits.
We got monthly stock prices for Nvidia and Microsoft from Jan 2020 into the data data frame.
We can see the first 5 rows of data:
data.head(5)
Close | High | Low | Open | Volume | ||||||
---|---|---|---|---|---|---|---|---|---|---|
MSFT | NVDA | MSFT | NVDA | MSFT | NVDA | MSFT | NVDA | MSFT | NVDA | |
Date | ||||||||||
2020-01-01 | 162.822495 | 5.886074 | 166.476275 | 6.460416 | 149.699515 | 5.757612 | 151.870740 | 5.943832 | 5.585300e+08 | 6.125412e+09 |
2020-02-01 | 154.960220 | 6.723563 | 182.401793 | 7.874986 | 145.385805 | 5.861676 | 163.013827 | 5.867900 | 8.875223e+08 | 1.184865e+10 |
2020-03-01 | 151.259338 | 6.566415 | 167.842135 | 7.096760 | 127.099660 | 4.500834 | 158.548474 | 6.897725 | 1.612696e+09 | 1.577395e+10 |
2020-04-01 | 171.879959 | 7.280847 | 173.021273 | 7.577780 | 144.209976 | 5.938419 | 146.741994 | 6.368375 | 9.847050e+08 | 1.127830e+10 |
2020-05-01 | 175.754700 | 8.843733 | 179.840452 | 9.148887 | 166.691227 | 6.995871 | 168.609423 | 7.083306 | 6.888450e+08 | 1.254888e+10 |
Yahoo Finance keeps track of open, close, high, low and adjusted prices for each period. We can keep only the adjusted stock prices, which are the prices we must use to calculate financial returns.
We can create another data frame with only the adjusted price columns for both stocks:
adjprices = data['Close']
We see the first 5 monthly prices of both stocks:
adjprices.head(5)
MSFT | NVDA | |
---|---|---|
Date | ||
2020-01-01 | 162.822495 | 5.886074 |
2020-02-01 | 154.960220 | 6.723563 |
2020-03-01 | 151.259338 | 6.566415 |
2020-04-01 | 171.879959 | 7.280847 |
2020-05-01 | 175.754700 | 8.843733 |
We can see the most recent 5 monthly prices for both stocks:
adjprices.tail(5)
MSFT | NVDA | |
---|---|---|
Date | ||
2024-09-01 | 429.440399 | 121.421249 |
2024-10-01 | 405.538269 | 132.750854 |
2024-11-01 | 422.614075 | 138.240479 |
2024-12-01 | 421.500000 | 134.280746 |
2025-01-01 | 415.059998 | 120.070000 |
We can visualize the stock price of Microsoft:
plt.plot(adjprices['MSFT'])
plt.title('Microsoft adjusted prices')
plt.show()
3 Return calculation
3.1 Simple and continuously compounded (cc) return
A financial simple return for a stock in period t (R_{t}) is calculated as the closing stock price at t plus any dividend paid at t, divided by the previous closing price, minus 1. It is the percentage change from the previous period (t-1) to the present period (t):
R_{t}=\frac{\left(price_{t}-price_{t-1}+ dividend_t\right)}{price_{t-1} }=\frac{price_{t}+dividend_t}{price_{t-1}}-1
When a stock pays dividends or does a stock split, the financial exchange makes an adjustment to the historical stock prices. This adjustment is made so that we do not need the dividend and split information to calculate simple stock returns. Then, it is always recommended to use adjusted prices to calculate stock returns, unless you have information about all dividends paid in the past.
Then, with adjusted prices the formula for simple returns is easier:
R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1}}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1
For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:
R_{Jan2021}=\frac{Adprice_{Jan2021}}{Adprice_{Dec2020}}-1=\frac{100}{80}-1=0.25
We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
Although the arithmetic mean of simple returns R gives us an idea of average past return, in the case of multi-period average return, this method of calculation can be misleading. Let’s see why this is the case.
Imagine you have only 2 periods and you want to calculate the average return of an investment per period:
Period | Investment value (at the end of the period) | Simple period Return (R) |
---|---|---|
0 | $100 | NA |
1 | $50 | -0.50 |
2 | $75 | +0.50 |
Calculating the average simple return of this investment:
\bar{R}=\frac{-0.5+0.5}{2}=0\%
Then, the simple average return gives me 0%, while I actually end up with $75, losing 25% of my initial investment ($100) over the 2 periods. If I lost 25% of my investment over 2 periods, the average return per period must be negative, somewhere between 0% and -25%. The accurate mean return of an investment over time (multiple periods) is the “Geometric Mean” return.
The total return of the investment in the whole period -also called the holding-period return (HPR)- can be calculated as:
HPR=\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)-1
Using the example, the HPR for this investment is:
HPR=\left(1-0.50\right)\left(1+0.50\right)-1=0.75 - 1 = -0.25
And the formula for the geometric average of returns will be:
\bar{R_{g}}=\sqrt[N]{\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)}-1
Calculating the geometric average for this investment:
\bar{R_{g}}=\sqrt{\left(1-0.5\right)\left(1+0.5\right)}-1=-0.13397
Then, the right average return per period is about -13.4%, and the HPR for the 2 periods is -25%.
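A tiny sketch replicating this example in Python:

# Replicating the 2-period example: $100 -> $50 -> $75
R1, R2 = -0.50, 0.50
HPR = (1 + R1) * (1 + R2) - 1
Rg = ((1 + R1) * (1 + R2)) ** (1 / 2) - 1
print(HPR)                  # -0.25
print(Rg)                   # about -0.13397
print(100 * (1 + Rg) ** 2)  # 75.0 -> growing $100 at Rg for 2 periods recovers the final value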
However, if we use continuously compounded returns (r) instead of simple returns (R), then the arithmetic mean of r is an accurate measure that can be converted back to simple returns to get the geometric mean, which is the accurate mean return. Let's do the same example using continuously compounded returns:
Period | Investment value (at the end) | Continuously compounded return (r) |
---|---|---|
0 | $100 | NA |
1 | $50 | =log(50)-log(100)=-0.6931 |
2 | $75 | =log(75)-log(50)=+0.4054 |
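To verify the claim, a short sketch: the arithmetic mean of the cc returns, converted back with the exponential, equals the geometric mean computed above:

# The arithmetic mean of cc returns converts back to the geometric mean of simple returns
import numpy as np
r = np.array([np.log(50) - np.log(100), np.log(75) - np.log(50)])
mean_r = r.mean()          # about -0.1438
print(np.exp(mean_r) - 1)  # about -0.13397, the geometric mean found above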
In Finance it is strongly recommended to calculate continuously compounded returns (cc returns) and to use cc returns instead of simple returns for data analysis, statistics, and econometric models.
One way to calculate cc returns is by subtracting the natural log of the current adjusted price (at t) minus the natural log of the previous adjusted price (at t-1):
r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})
This is also called the difference of the log of the price.
We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):
r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)
cc returns are usually represented by small r, while simple returns are represented by capital R.
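A quick numeric check (using the $80 to $100 example above) that both formulas give the same cc return:

# Both cc return formulas agree; the price moves from $80 to $100
import numpy as np
print(np.log(100) - np.log(80))  # 0.2231...
print(np.log(100 / 80))          # 0.2231..., the same value
# The simple return for the same move is 0.25; cc and simple returns are close only for small changes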
But why do we use the natural logarithm to calculate cc returns? First we need to remember what a natural logarithm is.
3.2 Reviewing the concept of natural logarithm
Generate a new dataset with the natural logarithm (log) of the MSFT and NVDA prices:
lnprices = np.log(adjprices)
We can see the first rows of this new data frame:
lnprices.head(5)
MSFT | NVDA | |
---|---|---|
Date | ||
2020-01-01 | 5.092661 | 1.772589 |
2020-02-01 | 5.043168 | 1.905618 |
2020-03-01 | 5.018996 | 1.881968 |
2020-04-01 | 5.146796 | 1.985247 |
2020-05-01 | 5.169089 | 2.179709 |
As you see, when we apply a mathematical function (in this case, the log function) to a data frame, the function is applied to all rows of all columns of the data frame.
Let’s see the plot of both, the prices and the log prices of NVDA:
plt.clf()
plt.plot(adjprices['NVDA'])
plt.title('Stock price of NVDA')
plt.show()

plt.clf()
plt.plot(lnprices['NVDA'])
plt.title('Log price of NVDA')
plt.show()
As you see, the log prices have a much smaller scale (from about 1.7 to 4.7). Remember that the natural log price is actually the exponent to which we raise the number e to obtain the stock price.
With this plot, we can better appreciate the % growth per period over time. For example, from Jan 2020 to July 2020 the log price increased from about 1.7 to 2.5. The difference between 2.5 and 1.7, which is 0.8, is the approximate % increase in price; in this case, about 80% from Jan 2020 to July 2020.
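Note that the log difference is the cc return, which only approximates the simple % change; for a move this large the exact % increase is recovered with the exponential, as this quick check shows:

# The log-price difference is the cc return; the exact simple % change needs the exponential
import numpy as np
cc = 2.5 - 1.7
print(np.exp(cc) - 1)  # about 1.2255: the exact increase is about 123%, so 80% is only a rough approximation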
We have historical monthly adjusted prices for each stock. We can easily calculate simple returns for both stocks:
# adjprices.shift(1) refers to the previous value of adjusted price (at t-1)
R = adjprices / adjprices.shift(1) - 1
# We delete the NA values located at the first period:
R = R.dropna()
We can also calculate continuously compounded returns (r) as follows:
# For the cc returns we get the difference of the current log price at t and the log price at t-1:
r = np.log(adjprices) - np.log(adjprices.shift(1))
# We delete the NA values located at the first period:
r = r.dropna()
Interestingly, instead of calculating returns manually, we can use a built-in pandas function. For example, the pct_change() function from pandas calculates simple returns:
plt.clf()
plt.plot(adjprices['MSFT'].pct_change().dropna())
plt.title('Microsoft monthly returns')
plt.show()
3.2.1 Descriptive statistics of returns
We can apply the describe function to all columns of the r dataframe. Remember that the r dataframe has the cc returns of both Microsoft and Nvidia.
# The describe function gets the main descriptive statistics of 1 or more variables
sumret = r.describe()
print(sumret)
MSFT NVDA
count 60.000000 60.000000
mean 0.015596 0.050258
std 0.062877 0.141391
min -0.113590 -0.385895
25% -0.027247 -0.027567
50% 0.017801 0.062048
75% 0.059315 0.137295
max 0.162366 0.310008
We see that the mean of monthly cc returns of Microsoft is about 1.56%, while its standard deviation (volatility) is around 6.29%! The worst month for Microsoft had a return of -11.36%, and the best month had a return of 16.24%.
To know which months were the worst, we can do a selection based on a condition. Let's see in which months Microsoft had a return of less than -5%:
r[r['MSFT'] < -0.05]
MSFT | NVDA | |
---|---|---|
Date | ||
2020-09-01 | -0.067361 | 0.011597 |
2021-09-01 | -0.066494 | -0.077308 |
2022-01-01 | -0.078335 | -0.183145 |
2022-04-01 | -0.105213 | -0.385895 |
2022-06-01 | -0.054583 | -0.208431 |
2022-08-01 | -0.071110 | -0.185089 |
2022-09-01 | -0.113590 | -0.217874 |
2022-12-01 | -0.059109 | -0.146438 |
2024-04-01 | -0.077540 | -0.044700 |
2024-07-01 | -0.066128 | -0.054138 |
2024-10-01 | -0.057268 | 0.089208 |
We can see which was the worst month for Microsoft:
r[r['MSFT'] == r['MSFT'].min()]
MSFT | NVDA | |
---|---|---|
Date | ||
2022-09-01 | -0.11359 | -0.217874 |
The worst month for Microsoft was Sep 2022.
To know the best months for Microsoft:
r[r['MSFT'] > 0.10].sort_values(by=['MSFT'], ascending=False)
MSFT | NVDA | |
---|---|---|
Date | ||
2021-10-01 | 0.162366 | 0.210396 |
2023-03-01 | 0.147365 | 0.179367 |
2020-04-01 | 0.127800 | 0.103279 |
2023-11-01 | 0.113927 | 0.137050 |
2020-06-01 | 0.107645 | 0.067761 |
We can also get the main descriptive statistics using methods of pandas data frames. In this case, we get the descriptive statistics of continuously compounded returns:
print("The monthly average cc return of Microsoft is ", r['MSFT'].mean())
print("The monthly variance of Microsoft cc return is ", r['MSFT'].var())
print("The monthly standard deviation (volatility) of Microsoft cc return is ", r['MSFT'].std())
print("The monthly median cc return of Microsoft is ",r['MSFT'].median())
The monthly average cc return of Microsoft is 0.015596041092359443
The monthly variance of Microsoft cc return is 0.003953546260745166
The monthly standard deviation (volatility) of Microsoft cc return is 0.06287723165618192
The monthly median cc return of Microsoft is 0.01780116810050414
4 The Histogram
The histogram was invented to illustrate how the values of a random variable are distributed in its whole range of values. The histogram is a frequency plot. The ranges of values of a variable that are more frequent will have a higher vertical bar compared with the ranges that are less frequent.
With the histogram of a random variable we can appreciate the most common values and the least common values, and get a sense of the possible mean and standard deviation of the variable.
In my opinion, the most important foundations/pillars of both Statistics and Probability theory are:
The invention of the Histogram
The discovery of the Central Limit Theorem
Although the histogram sounds like a very simple idea, it took many centuries to be developed, and it had a profound impact on the development of Probability theory and Statistics, which are the pillars of all sciences.
I enjoy learning about the origins of the great ideas of humanity. The idea of the histogram was invented to decipher encrypted messages.
4.1 Interesting facts about the History of the Histogram
It is documented that the encryption of messages -cryptography- was commonly used since the beginning of civilizations. Unfortunately, it seems cryptography was invented by ancient Kingdoms mainly for war strategies. According to Herodotus, in the 500s BC, the Greeks used cryptography to win a war against the powerful Persian troops (Singh 2000).
Cryptography refers to the methods of ciphering messages, while cryptanalysis refers to the methods to decipher encrypted messages.
The Arabs in the years 800-900 AD were among the first to decipher encrypted messages, thanks to their invention of the idea of the histogram. According to Singh (2000) and Al-Kadit (1992), in 1987 several ancient Arabic manuscripts related to cryptography and cryptanalysis (written between 800 AD and 1500 AD) were discovered in Istanbul, Turkey (they were not translated into English until 2002). This is a very fascinating story!
Below is an example of a frequency plot by the Arab philosopher Al-Kindi from around 850 AD, compared with a recent frequency plot by Al-Kadi:
The encrypted messages at that time were written with the Caesar shift method. To decipher an encrypted message, the Arabs counted all the characters to create a frequency plot, and then tried to match the most frequent encrypted characters with the most frequent Arabic characters. Finally, they replaced the matched Arabic characters in the encrypted message to decipher it.
Interestingly, the idea of the frequency plot waited about 1,000 years to be used by French mathematicians to develop the foundations of Probability and Statistics.
In the 1700s and early 1800s, the French mathematicians Abraham de Moivre and Pierre-Simon Laplace used this idea to develop the Central Limit Theorem (CLT).
I believe the CLT is one of the most important and fascinating mathematical discoveries of all time.
The English scientist Karl Pearson coined the term histogram in 1891 when he was developing statistical methods applied to Biology.
Why is the histogram so important in Statistics? I hope we will find this out during this course!
4.2 CHALLENGES: Histogram
Do a histogram for daily Bitcoin cc returns using data from 2017 to date. Hints: use the plot.hist function for pandas dataframes, and the BTC dataframe.
INTERPRET the histogram with your own words and in CAPITAL LETTERS
To get daily data instead of monthly, I change the interval parameter to “1d”:
import yfinance as yf
BTC = yf.download(tickers="BTC-USD", start="2017-01-01", interval="1d")
# Since crypto-currencies do not pay dividends nor have splits, I did not indicate adjusted prices.
# I calculate simple returns:
BTC["R"] = (BTC["Close"] / BTC["Close"].shift(1)) - 1
# I calculate the continuously compounded returns:
# This time I use the diff function, which gets the difference between the log value of today and the previous log value:
BTC["r"] = np.log(BTC["Close"]).diff(1)
# I keep a new object with only the returns:
BTCR = BTC[['R','r']].copy()
R_bitcoin = pd.DataFrame(BTCR[["R"]])
hist = R_bitcoin.plot.hist(bins=12, alpha=0.5, title="Histogram of daily Bitcoin Returns")
We use the histogram to visualize random variables with historical values. For expected values of random variables we can use the concept of probability density function, which is analogous to the concept of the histogram, but applied to the expectation of possible values of a random variable.
5 Probability Density Functions (PDF)
5.1 Probability Density Function of a Discrete random variable
The Probability Density Function (PDF) of a discrete random variable X is the probability of X to be equal to a specific value x_{i}:
f(x)=P(X=x_{i})
For example, when throwing a die there are six possible outcomes: 1, 2, 3, 4, 5 and 6, all of them with the same probability since the die is fair. Every outcome has a 1/6 chance of happening. The PDF for a fair six-sided die can be defined as:
f(x)=P(X=x_{i})=\frac{1}{6}
Where x_i=1,2,3,4,5,6
Now, instead of considering the probability of every individual outcome, you might wonder about the probability of getting a number less than or equal to x_{i} when throwing a die. It seems pretty obvious that the probability would be 50% for x_{i}=3 and 100% for x_{i}=6. In plain words, getting 1, 2 or 3 covers 50% of the cases when throwing a die, and the range from 1 to 6 covers all the possibilities.
Mathematically we can express the Cumulative Density Function (CDF) as:
F(x)=P(X\leq x)=\sum_{x_{i}\leq x}P(X=x_{i})
Following the example of the dice, we can compute the CDF for every possible outcome as follows:
P(X\leq1)=\frac{1}{6}=0.17
P(X\leq2)=\frac{2}{6}=0.33
P(X\leq3)=\frac{3}{6}=0.50
P(X\leq4)=\frac{4}{6}=0.67
P(X\leq5)=\frac{5}{6}=0.83
P(X\leq6)=\frac{6}{6}=1
We have covered the PDF and the CDF for a six-sided die, whose results are very intuitive if you have ever played with dice. But what about combining the results of two dice? Knowing the possibilities for the combination of two dice is even more useful than the results of a single die, since most games and casinos around the world use a pair of dice, right? When we consider the sum of two dice (S), the range of possible outcomes goes from 2 to 12, so the PDF is defined as f(S)=P(S=x_{i}), where x_{i}=2,3,4,...,12. In this case we have a total of 36 possible combinations, where only one combination gives an outcome equal to 2 or 12, two different combinations give a 3 or an 11, and so on. The outcome with the highest probability is 7; there are six combinations that result in a 7, as you can see in the table and graph below:
S | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
f(S) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |
We can see this PDF as follows:
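Since the original graph is not reproduced here, a minimal sketch that enumerates the 36 equally likely combinations and recovers the table above:

# Enumerate the 36 equally likely outcomes of two dice and count each possible sum
from collections import Counter
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
for s in range(2, 13):
    print(s, f"{counts[s]}/36")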
The shape of this PDF for the sum of two dice looks like the famous “bell shape” of the normal distribution. However, there is a fundamental difference between this PDF and the normal distribution: the normal distribution is the probability distribution of a continuous random variable, not of a discrete random variable such as the sum of two dice.
5.2 Probability density function (PDF) of a Continuous Random Variable
As seen in the previous section, the CDF of a discrete random variable is defined as the sum of the probabilities of the individual outcomes. However, for a continuous random variable the CDF is defined as the integral of the function f(x) (f(x) is the PDF).
In this case we will not compute the probability of the variable X taking a particular value, as we did with a discrete variable. Instead, we will calculate the probability of the continuous variable X being within a specific range limited by a and b. The probability of a continuous variable taking one specific value is zero. The integral of the PDF over all possible values of x must be 1 (100%), so:
\int_{-\infty}^{\infty}f\left(x\right)dx=1
\int_{a}^{b}f\left(x\right)dx=P(a\leq x\leq b)
The probability that a continuous random variable X is between a and b is equal to the area under the PDF on the interval [a,b]. For example, if a PDF is defined as f(x)=3x^{2};\;0\leq x\leq1, we can then compute the CDF for the interval between 0.5 and 1 as follows:
PDF=f(x)=3x^{2}
CDF=\int_{0.5}^{1}3x^{2}dx=\frac{3x^{3}}{3}\mid_{0.5}^{1}
CDF=x^{3}\mid_{0.5}^{1}=1^{3}-0.5^3=0.875=87.5\%
As demonstrated, given a PDF=f(x)=3x^{2}, the probability that X is between 0.5 and 1 is equal to 87.5%. In the same way, if you would like to know the probability that X is between 0 and 0.5, all we have to do is to evaluate the CDF for those limits.
CDF=\int_{0}^{0.5}3x^{2}dx=\frac{3x^{3}}{3}\mid_{0}^{0.5}
CDF=x^{3}\mid_{0}^{0.5}=0.5^{3}-0=0.125=12.5\%
As you can see, this makes sense since 12.5% is the complement of 87.5%, so the probability that x lies between 0 and 1 is 1. This is true since the possible range of x as defined for this PDF is 0\leq x\leq1, so the probability of x being anywhere in this range must be one, or 100%. Note that not any arbitrary function is a PDF: a very important condition is that the integral of the function over the possible range of x must be equal to one.
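A quick numeric check of these integrals with scipy (assuming scipy is available, as it is by default in Colab):

# Numeric check of the two probabilities for f(x) = 3x^2 on [0, 1]
from scipy.integrate import quad
pdf = lambda x: 3 * x**2
prob_a, _ = quad(pdf, 0.5, 1)  # P(0.5 <= X <= 1)
prob_b, _ = quad(pdf, 0, 0.5)  # P(0 <= X <= 0.5)
print(prob_a)                  # 0.875
print(prob_b)                  # 0.125
print(prob_a + prob_b)         # 1.0 -> the PDF integrates to one over its range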
Now we move to the most famous PDF, the Normal Distribution Function.
6 The Normal Distribution Function
In statistics, the most popular continuous PDF is the well-known “bell-shaped” normal distribution, whose PDF is defined as:
f\left(x\right)=\frac{1}{\sigma_{x}\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu_{x}}{\sigma_{x}}\right)^{2}}
where x can take any real value -\infty<x<\infty
where \mu is the mean of the distribution and \sigma^{2} is the variance of the distribution. For simplification purposes, the normal distribution can also be denoted as X\sim N(\mu,\sigma^{2}), where X is the continuous random variable, "\sim" means "distributed as", and N means normal distribution.
So the only two parameters we need to define in order to know the behavior of the continuous random variable x are:
- The mean of x, and
- the variance of x.
The normal distribution is symmetric around \mu_x.
We can do a mathematical transformation (a trick) of the variable X so that its mean becomes 0 (zero) and its standard deviation (and variance) becomes 1. To do this, I subtract the mean of X from each value of X, and then divide by its standard deviation:
z_{i}=\frac{\left(x_{i}-\mu_x\right)}{\sigma_{x}}
I call this new transformed variable z. This transformation is called a normalization of the variable x. It makes z a normally distributed variable with mean = 0 and standard deviation = 1.
The PDF of z is easier to remember:
f\left(z\right)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}\left(z\right)^{2}}
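A short sketch of the normalization in practice, using simulated data with assumed parameters \mu=10 and \sigma=3:

# Normalizing a simulated variable X ~ N(10, 3^2) into z with mean 0 and standard deviation 1
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=10, scale=3, size=100_000)
z = (x - x.mean()) / x.std()
print(round(z.mean(), 4), round(z.std(), 4))  # about 0.0 and 1.0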
Another interesting property of the normal distribution is the probability that X falls within certain ranges around the mean:
• For the range (\mu-\sigma)<=x<=(\mu+\sigma), the area under the curve is approximately 68%
• For the range (\mu-2\sigma)<=x<=(\mu+2\sigma), the area under the curve is approximately 95%
• For the range (\mu-3\sigma)<=x<=(\mu+3\sigma), the area under the curve is approximately 99.7%.
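These three probabilities can be checked empirically by simulation; a minimal sketch:

# Empirical check of the 68-95-99.7 rule with a simulated standard normal variable
import numpy as np
np.random.seed(0)
z = np.random.normal(size=1_000_000)
for k in (1, 2, 3):
    share = np.mean(np.abs(z) <= k)
    print(f"within {k} standard deviations: {share:.3f}")  # about 0.683, 0.954, 0.997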
If our variable of interest is cc returns (r) and these returns follow a normal distribution r\sim N(\mu,\sigma^{2}), the expected value of future cc returns will be the mean of the distribution, while the standard deviation can be seen as a measure of risk.
6.1 Interesting facts about the History of the Normal Distribution
The Normal Distribution function is one of the most interesting findings in Statistics. The normal distribution can explain many phenomena in our real world, from financial returns to human behavior to human characteristics such as height.
Many people believe that Carl Friedrich Gauss was the person who discovered it, but that is not quite true. The French mathematician Abraham de Moivre was the first to discover the normal distribution, when he found the distribution of the sum of independent binomial random variables.
6.2 CHALLENGE: Simulating the normal distribution
Use the mean and standard deviation of the historical cc returns of Bitcoin and simulate the same number of returns as the days we downloaded in the BTCR dataframe.
In one plot, show both the real distribution of historical cc returns and the simulated normal distribution.
Code
from matplotlib import pyplot
pyplot.clf()
rmean = BTCR["r"].mean()
rsd = BTCR["r"].std()
N = BTCR["r"].count()
simr = np.random.normal(loc=rmean, scale=rsd, size=N)
realr = BTCR["r"].to_numpy()

bins = 12

pyplot.hist(simr, bins, alpha=0.5, label='simulated rets')
pyplot.hist(realr, bins, alpha=0.5, label='real rets')
pyplot.legend(loc='upper left')
pyplot.title(label='Histogram of real and simulated cc returns of Bitcoin')
pyplot.show()
DO YOU SEE A DIFFERENCE BETWEEN THE REAL VS THE SIMULATED RETURNS? BRIEFLY EXPLAIN.