Workshop 1, Advanced AI - Statistics Module

Authors

Alberto Dorantes D., Ph.D.

Monterrey Tech, Queretaro Campus

Published

August 6, 2024

Abstract

In this workshop we review the basics of a) descriptive statistics, b) the histogram, c) data trasformations and cleaning, and d) the normal distribution.

1 Workshop Directions

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a google colab document for each workshop.

You must share each Colab document (workshop) with the following accounts:

cdorante.tec@gmail.com
cdorante@tec.mx

You must give Edit privileges to these accounts.

You have to follow this workshop in class to learn about topics. You have to do your own Markdown note/document for every workshop we cover.

Rename your Notebook as “W1-Statistics-AI-YourFirstName-YourLastname”.

You must submit your workshop before we start with the next workshop. What you have to write in your workshop? You have to:

You have to REPLICATE and RUN all the Python code, and
DO ALL CHALLENGES stated in sections. These challenges can be Python code or just responding QUESTIONS with your own words and in CAPITAL LETTERS. You have to WRITE CLEARLY so that I can see your LINE OF THINKING!

The submissions of your workshops is a REQUISITE for grading your final deliverable document of the Statistics Module.

I strongly recommended you to write your OWN NOTES about the topics as if it were your study NOTEBOOK.

2 Descriptive Statistics

Descriptive statistics is a set of summaries of raw data related to one or several variables of a phenomenon. Descriptive statistics usually gives us a first general idea of a phenomenon by looking at summaries such as averages and variability of variables that represent different aspects of a phenomenon.

In Economics, for example, we might be interested in knowing what has been the economic development of a country over the past 10 years. Then, we can calculate an annual average percentage growth of the gross domestic product. In Finance we might be interested in knowing the annual average return of an investment for the last 5 years and also the variability of annual returns over time.

Then, the more important measures of descriptive statistics are:

Measures of central tendency, and
Measures of dispersion

2.1 Central tendency measures

The main central tendency measures are:

Arithmetic mean
Median
Mode

2.1.1 Arithmetic mean

An arithmetic mean of a variable X is a simple measure that tells us the average value of all valid values of X, assuming that each value has the same importance (or weight). The variable X can be representing any attribute of a subject. A subject can be an individual, a group, a team, a business unit, a company, a financial portfolio, an industry, a region, a country, etc.

An example of a variable X can be the monthly sales amount of a company for the last 3 years. In this case, the variable X will have 36 observations (36 monthly sales). The subject here is a company and the variable or attribute is the company sales over time. Another example can be a variable that represents the daily returns of a financial portfolio over the last 2 years. In this case, the variable might have about 500 observations considering 250 business days each year. The subject in this example is a financial portfolio, that might be composed of more than one stock and/or bond.

To calculate the arithmetic mean of a variable X we simply sum all the non-missing values of the variable and then divide them by the number of non-missing values. Then, the calculation is as follows:

\bar{X}=\frac{{\displaystyle {\displaystyle \sum_{i=1}^{N}X_{i}}}}{N}

Where N is the number of non-missing values (observations) of X. A missing value of a variable happens when the variable X for a specific observation has no value. It is important to note that a missing value is not a zero value. When we work with real world datasets, it is very common to find non-missing values in many variables.

One of the disadvantage of the arithmetic mean is that it is very sensible to extreme values. If a variable has few extreme values, the arithmetic mean might not be a good representation of an average or mid point. In the prescence of few very extreme values in a variable, the best measure of central tendency is the median, not the arithmetic mean.

2.1.2 The Median

Another measure of central tendency is the median. The median of a variable is its 50 percentile, which is the mid point of its values when the values are sorted in ascending order. When we have an even number of observations, there will be 2 mid points, so the median will be equal to the arithmetic mean of these 2 mid points. When we have an odd number of observations there will be only 1 value in the middle, which is the median.

For example, if we want to know what is the typical size of all companies that trade shares in the Mexican stock market we can calculate the median of firm size. These firms are called public firms. Firm size can be measured with different variables. We can use the total value of its assets (total assets), the market value, or the number of employees. In this example we will use total assets at the end of 2018 for all public Mexican firms. At the end of 2018 there were 146 Mexican public firms in the market exchange (“Bolsa Mexicana de Valores”). I will show how to calculate the median total assets of these 146 firms.

The 2018 total assets of the 146 Mexican public firms for 2018 are shown below (sorted alphabetically)

Mexican firms in the BMV
Firm	row #	Industry	2018 Total Assets (in thousand pesos)
ACCEL	1	Services	$6,454,560.00
AEROMEXICO	2	Transport Services	$76,772,848.00
…	…	…	…
VOLARIS	148	Transport Services	$22,310,652.00
WALMART	149	Retail	$306,528,832.00

We sort the list from the lowest to the highest value of 2018 total assets:

Firm	row #	Industry	Size Rank	2018 Total Assets (in thousand pesos)
INGEAL	98	Food & Beverages	1	$171,104.00
HIMEXSA	88	Textile	2	$494,378.00
…	…	…	…	…
FHIPO14	45	Real State	73	$27,979,1184.00
TVAZTECA	139	Telecommunications	74	$27,988,054.00
…	…	…	…	…
AMERICA MOVIL	8	Telecommunications	145	$1,429,223,392.00
GFBANORTE	69	Financial Services	146	$1,620,470,400.00

The median total assets is the mid point of the list. However, in this case, I have 146 firms, so it is not possible to find an exact mid point. Then, I need to calculate the arithmetic average assets of the 2 firms that are in the middle (firms in positions 73 and 74). Then the median will be equal to $27,983,619.00 thousand pesos (about 27 billion pesos), which is the average value between FHIPO14 and TVAZTECA assets. The arithmetic mean for total assets considering the 146 firms is $97,860,896.23 thousand pesos (about 97.8 billion pesos), which is much bigger than the median. Then, which might be the best measure that better represents the typical size of Mexican firms? In this case, the best measure is the median, so we can say that the typical size of a Mexican public firm is about $27.9 thousand million pesos.

Then, what is the difference between the mean and the median? When the distribution of the values of a variable is very close to a normal distribution, the mean and the median will be very similar, so we can use the mean or median to represent the typical value of the variable. When the variable has few very extreme values, then the distribution of values will not be similar to a normal distribution; it will have fat tails due to the presence of extreme values. In this case the best measure of central tendency is the median, not the mean.

What is a normal distribution? It is a very common probability distribution of random variables. We will further explain probability distributions later. For now, just consider that many variables of all disciplines and nature follow a close-to-normal distribution.

The median gives of a better representation of the “average” value of a variable compared with the arithmetic mean when the distribution of the values does NOT follow a normal distribution. In the case of 2018 total assets we can explore its distribution using a histogram:

I will later explain in more detail what a histogram is.

I a histogram we see how often different ranges of values of a variable appear. This histogram does not look like a normal distributed variable. This histogram is said to be “skewed” to the right since there are very few firms with very high values of total assets. Normal distributed variable look like a bell shape curve where most of the values are around the arithmetic mean. In this case, we can see that most of the firms (about 100 firms) have a range of total assets between 0 and $25 thousand million pesos. Since the total of firms is 146 then, only about 46 firms have assets higher than $25 thousand million (or 25 billion pesos). Actually I can see that there are very few firms with assets greater than $1,000 thousand million (or greater than $1 trillion pesos), and one above $1,500 trillion. Looking at the previous table we can see that AMERICA MOVIL and GFBANORTE have assets greater than $1,400 trillion pesos.

With the histogram we can see that most of the firms (about 67%, 100 out of 146) have assets less than 25 billion pesos. The arithmetic mean of total assets is more than $97 billion, and the median total assets (or 50 percentile) is about $27 billion. The arithmetic mean is very sensible to extreme values, while the median is not. If we use the mean as a measure of a typical size of a Mexican firm we would be very far from the most common values of total assets. Then, the best measure of a typical size will be the median, which is about $27 billion pesos.

In sum, for skewed distributions the median will always be the best measure for central tendency, while the arithmetic mean will be a biased measure that will not represent the central or typical value. Actually, in the case of normal distributed variables, the median will be very close to the mean, so the median is always a good measure of central tendency.

Examples of business variables with a skewed distribution similar to total assets are employee salaries, income of families in a region or country, any variable from the income statement such as firm sales, firm profits.

2.1.3 Mode

Mode is the value that most appear in the variable. Mode can be calculated only for discrete variables, not for continuous variables. Mode is rarely used as a central tendency measure.

2.2 Dispersion measures

2.2.1 Variance and standard deviation

Standard deviation is used to measure how much on average the individual values of a variable change from the mean.

The variance of a variable X is the average of squared deviations from each individual value X_i from its mean:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

Where:

X_i = Value i of the variable X

\overline{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i} = Arithmetic average of X

Why the variance is the average of squared deviations? The reason is because if we do not square the deviations, then they will cancel out each other since some deviations are positive and other negative. Then, the squaring is just a trick to avoid canceling the positive with the negative deviations.

The result of the variance will be a number that our brain cannot easily interpret. To have a more reasonable measure of linear deviation, then we just take the square root of the variance, and then, we will be able to interpret that number as the average deviations of all points from their mean. This measure is called standard devation:

SD(X)=\sqrt{Var(X)}= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}=\sigma_{X}

The variance can also be expressed as the expected value of squared deviations:

Var(X)=E[(X-\bar{X})^2]

Doing the multiplication of the squared term:

Var(X)=E[(X^2-X\bar{X}-\bar{X}X+\bar{X}^2)]

Since \bar{X} and \bar{Y} are constants, I can take them out of the expectation:

Var(X)=E[X^2]-\bar{X}E[X]-\bar{X}E[X]+\bar{X}^2

Since E[X]=\bar{X}, then:

Var(X)=E[X^2]-\bar{X}^2

Then, the variance can be defined as the expected value of X squared minus its squared mean.

Also, we can express the variance of X as:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}\right)^2-\bar{X}^2

Most Statistics books and Statistics software use (N-1) instead of N as the denominator of the variance formula to get a more conservative value of the variance. This measure is called sample variance. When we divide by N in the variance formula, we are calculating the population variance. Both formulas provide very similar results, but the sample variance will be a bit bigger than the population variance, so it is a more conservative value.

In Statistics, the sample variance is an unbiased measure of the underlying (real) variance.

Then, we can re-write the formula for sample variance of X as:

Var(X)=\frac{1}{\left(N-1\right)}\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2} And the sample standard deviation of X can be written as:

SD(X)=\sqrt[2]{Var(X)}=\sqrt{\frac{1}{(n-1)}\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}

SD(X)=\frac{\sqrt{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}}}{\sqrt{(n-1)}}=\sigma_{X}

2.3 CHALLENGE: Data management and Descriptive Statistics

For this Section 2.3, you have to replicate and run all the code.

We will use real financial data.

2.3.1 Data collection and visualization

Import the following Python libraries:

import numpy as np
import pandas as pd
import yfinance as yf

You might have to install the yfinance library. The yfinance library has functions to download online data from Yahoo Finance where you can find financial real historical data for stocks, ETFs, cryptocurrencies, etc from most of the financial markets around the world.

Download monthly prices for Bitcoin from 2017:

BTC=yf.download(tickers="BTC-USD", start="2017-01-01",interval="1mo")

[*********************100%%**********************]  1 of 1 completed

Show the content of the data:

BTC.head()

	Open	High	Low	Close	Adj Close	Volume
Date
2017-01-01	963.658020	1191.099976	755.755981	970.403015	970.403015	5143971692
2017-02-01	970.940979	1200.390015	946.690979	1179.969971	1179.969971	4282761200
2017-03-01	1180.040039	1280.310059	903.713013	1071.790039	1071.790039	10872455960
2017-04-01	1071.709961	1347.910034	1061.089966	1347.890015	1347.890015	9757448112
2017-05-01	1348.300049	2763.709961	1348.300049	2286.409912	2286.409912	34261856864

BTC.tail()

	Open	High	Low	Close	Adj Close	Volume
Date
2024-04-01	71333.484375	72715.359375	59120.066406	60636.855469	60636.855469	1016068331704
2024-05-01	60609.496094	71946.460938	56555.292969	67491.414062	67491.414062	874291509757
2024-06-01	67489.609375	71907.851562	58601.699219	62678.292969	62678.292969	726773965644
2024-07-01	62673.605469	69987.539062	53717.375000	64619.250000	64619.250000	953395573307
2024-08-01	64625.839844	65593.242188	49121.238281	56368.433594	56368.433594	317602246589

For each period, Yahoo Finance keeps track of the open, high, low, close and adjusted prices. Also, it keeps track of volume that was traded in every period. The adjusted price is the closing price that is used to calculate returns. Adjusted prices consider dividend payments and also stock splits, while close prices do not. Adjusted prices are used to calculate return - how much gain (in percentage) you would have make-

Import the matplotlib and do a plot for the Bitcoin adjusted prices:

import matplotlib
from matplotlib.pyplot import*
plot(BTC["Adj Close"])
show()

We can check the data types of each variable (column) in the dataset:

BTC.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 92 entries, 2017-01-01 to 2024-08-01
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       92 non-null     float64
 1   High       92 non-null     float64
 2   Low        92 non-null     float64
 3   Close      92 non-null     float64
 4   Adj Close  92 non-null     float64
 5   Volume     92 non-null     int64  
dtypes: float64(5), int64(1)
memory usage: 5.0 KB

2.3.2 Data transformations

We will calculate a) the natural logarithm of prices and b) returns.

The logarithm of a variable is a very useful mathematical transformation for statistical analysis. The return of a price or an investment is the percentage change of the price from one period to the next.

2.3.2.1 Return calculation

A financial simple return for a stock (R_{t}) is calculated as the simple percentage change of price from the previous period (t-1) to the present period (t):

R_{t}=\frac{\left(price_{t}-price_{t-1}\right)}{price_{t-1}}=\frac{price_{t}}{price_{t-1}}-1

For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:

R_{Jan2021}=\frac{price_{Jan2021}}{price_{Dec2020}}-1=\frac{100}{80}-1=0.25

We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.

In Finance it is very recommended to calculate continuously compounded returns (cc returns) and using cc returns instead of simple returns for data analysis, statistics and econometric models. cc returns are calculated from the natural logarithm of prices.

2.3.2.2 Reviewing the concept of natural logarithm

What is a natural logarithm?

The natural logarithm of a number is the exponent that the number e (=2.71…) needs to be raised to get another number. For example, let’s name x=natural logarithm of a stock price p. Then:

e^x = p

The way to get the value of x that satisfies this equality is actually getting the natural log of p:

x = log_e(p)

Then, we have to remember that the natural logarithm is actually an exponent that you need to raise the number e to get a result or a specific number.

The natural log is the logarithm of base e (=2.71…). The number e is an irrational number (it cannot be expressed as a division of 2 natural numbers), and it is also called the Euler constant. Leonard Euler (1707-1783) took the idea of the logarithm from the great mathematician Jacob Bernoulli, and discovered very astonishing features of the e number. Euler is considered the most productive mathematician of all times. Some historians believe that Jacob Bernoulli discovered the number e around 1690 when he was playing with calculations to know how an amount of money grows over time with an interest rate.

How e is related to the grow of financial amounts over time? It is mainly related with the concept of compounding

2.3.2.3 The effect of compounding in calculating returns

Here is a simple example:

If I invest $100.00 today (t=0) with an annual interest rate of 50%, then the end balance of my investment at the end of the first year will be:

I_1=100*(1+0.50)=150

If the interest rate is 100%, then I would get:

I_1=100*(1+1)=200

Then, the general formula to get the final amount of my investment at the beginning of year 2, for any interest rate R can be:

I_1=I_0*(1+R)

The (1+R) is the growth factor of my investment.

In Finance, the investment amount is called principal. If the interests are calculated (compounded) each month instead of each year, then I would end up with a higher amount at the end of the year.

Monthly compounding means that a monthly interest rate is applied to the amount to get the interest of the month, and then the interest of the month is added to the investment (principal). Then, at the beginning of month 2 the principal will be higher than the initial investment. At the end of month 2 the interest will be calculated using the updated principal amount. Putting in simple math terms, the final balance of an investment at the end of month 1 when doing monthly compounding will be:

I_1=I_0*\left(1+\frac{R}{12}\right)

We can do the same for month 2:

I_2=I_1*\left(1+\frac{R}{12}\right)^{1}

We can plug the calculation for I_1 in this formula to express I_2 in terms of the initial investment:

I_2=I_0*\left(1+\frac{R}{12}\right)\left(1+\frac{R}{12}\right)

We group the growth factor using an exponent:

I_2=I_0*\left(1+\frac{R}{12}\right)^{2}

We can see the pattern to calculate the end balance of the investment in month 12 when comounding monthly. The monthly interest rate is equal to the annual interest rate R divided by 12 (R/N). Then, with an annual rate of 100% and monthly compounding (N=12), the end value of the investment will be:

I_{12}=100*\left(1+\frac{1}{12}\right)^{1*12}=100*(2.613..)

In this case, the growth factor is (1+1/12)^{12}, which is equal to 2.613.

Instead of compounding each month, if the compounding is every moment, then we are calculating a continuously compounded rate.

If we do a continuously compounding for the previous example, then the growth factor for one year becomes the astonishing Euler constant e:

Let’s do an example for a compounding of each second (1 year has 31,536,000 seconds). The investment at the end of the year 1 (or month 12) will be:

I_{12}=100*\left(1+\frac{1}{31536000}\right)^{1*31536000}=100*(2.718282..)\cong100*e^1

Now we see that e^1 is the GROWTH FACTOR after 1 year if we do the compounding of the interests every moment!

We can generalize to any other annual interest rate R, so that e^R is the growth factor for an annual nominal rate R when the interest is compounded every moment.

When compounding every instant, we use small r instead of R for the interest rate. Then, the growth factor will be: e^r

Then we can do a relationship between this growth rate and an effective equivalent rate:

\left(1+EffectiveRate\right)=e^{r}

If we apply the natural logarithm to both sides of the equation:

ln\left(1+EffectiveRate\right)=ln\left(e^r\right)

Since the natural logarithm function is the inverse of the exponential function, then:

ln\left(1+EffectiveRate\right)=r

In the previous example with a nominal rate of 100%, when doing a continuously compounding, then the effective rate will be:

\left(1+EffectiveRate\right)=e^{r}=2.7182

EffectiveRate=e^{r}-1

Doing the calculation of the effective rate for this example:

EffectiveRate=e^{1}-1 = 2.7182.. - 1 = 1.7182 = 171.82\%

Then, when compounding every moment, starting with a nominal rate of 100% annual interest rate, the actual effective annual rate would be 171.82%!

2.3.2.4 Continuously compounded returns

One way to calculate cc returns is by subtracting the log of the current price (at t) minus the log of the previous price (at t-1):

r_{t}=log(price_{t})-log(price_{t-1})

This is also called as the difference of the log of the price.

We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):

r_{t}=log\left(\frac{price_{t}}{price_{t-1}}\right)

cc returns are usually represented by small r, while simple returns are represented by capital R.

In Python we can get the previous value of a variable using the shift(n) function. This function works for Panda dataframes. Then we can calculate a new column for the simple returns of Bitcoin as follows:

BTC["R"] = (BTC["Adj Close"] / BTC["Adj Close"].shift(1)) - 1
print(BTC["R"])

Date
2017-01-01         NaN
2017-02-01    0.215959
2017-03-01   -0.091680
2017-04-01    0.257606
2017-05-01    0.696288
                ...   
2024-04-01   -0.149954
2024-05-01    0.113043
2024-06-01   -0.071315
2024-07-01    0.030967
2024-08-01   -0.127684
Name: R, Length: 92, dtype: float64

We calculate cc returns in a new column using the shift function, or the diff function:

BTC["r"] = np.log(BTC['Adj Close']) - np.log(BTC['Adj Close'].shift(1))
# Another way to do the same as above is appying the diff method to the log:  
BTC["r2"] = np.log(BTC['Adj Close']).diff(1)

# We can see that r and r2 are the same: 
print(BTC.tail())

# I keep a new object with only returns:
BTCR = BTC[['R','r']].copy()

                    Open          High           Low         Close  \
Date                                                                 
2024-04-01  71333.484375  72715.359375  59120.066406  60636.855469   
2024-05-01  60609.496094  71946.460938  56555.292969  67491.414062   
2024-06-01  67489.609375  71907.851562  58601.699219  62678.292969   
2024-07-01  62673.605469  69987.539062  53717.375000  64619.250000   
2024-08-01  64625.839844  65593.242188  49121.238281  56368.433594   

               Adj Close         Volume         R         r        r2  
Date                                                                   
2024-04-01  60636.855469  1016068331704 -0.149954 -0.162465 -0.162465  
2024-05-01  67491.414062   874291509757  0.113043  0.107098  0.107098  
2024-06-01  62678.292969   726773965644 -0.071315 -0.073985 -0.073985  
2024-07-01  64619.250000   953395573307  0.030967  0.030497  0.030497  
2024-08-01  56368.433594   317602246589 -0.127684 -0.136603 -0.136603

We have a null value for the first day since we cannot calculate returns for day 1. We can drop the rows with NA values to ease data calculations:

BTCR= BTCR.dropna()

2.3.2.5 Descriptive statistics of returns

We can use the describe function applied to the r column as follows:

sumret = BTC["R"].describe()
print(sumret)

count    91.000000
mean      0.070230
std       0.233737
min      -0.377688
25%      -0.077135
50%       0.030967
75%       0.223136
max       0.696288
Name: R, dtype: float64

We see that the mean of monthly Bitcoin returns is 7.023035207843672%, while standard deviation is around 23.373705824478996%! The worse month of Bitcoin had a return of -37.76883082049309%! and the best month had a return of 69.62881891411045%.

To know which days were the worse, we can do a selection based on a condition. Let’s see which days had a daily return less than 15%:

BTCR[BTCR["R"]<-0.15]

	R	r
Date
2018-01-01	-0.277987	-0.325713
2018-03-01	-0.329333	-0.399482
2018-05-01	-0.188991	-0.209476
2018-11-01	-0.364116	-0.452739
2019-11-01	-0.177177	-0.195014
2020-03-01	-0.251278	-0.289387
2021-05-01	-0.353546	-0.436253
2021-12-01	-0.187684	-0.207865
2022-01-01	-0.168947	-0.185061
2022-04-01	-0.171806	-0.188507
2022-05-01	-0.157035	-0.170830
2022-06-01	-0.377688	-0.474314
2022-11-01	-0.162336	-0.177139

BTCR[BTCR["R"]==BTCR["R"].min()]

	R	r
Date
2022-06-01	-0.377688	-0.474314

The worst month for the Bitcoin was June 2022.

To know the best months for Bitcoin:

BTCR[BTCR["R"]>0.15].sort_values(by=['R'], ascending=False)

	R	r
Date
2017-05-01	0.696288	0.528442
2017-08-01	0.635768	0.492113
2019-05-01	0.602493	0.471561
2017-11-01	0.582091	0.458748
2017-10-01	0.490858	0.399352
2020-12-01	0.477732	0.390508
2024-02-01	0.437169	0.362675
2020-11-01	0.424123	0.353556
2021-10-01	0.400267	0.336663
2023-01-01	0.398356	0.335297
2017-12-01	0.383326	0.324490
2021-02-01	0.363088	0.309752
2020-04-01	0.344779	0.296230
2018-04-01	0.325089	0.281480
2021-03-01	0.305311	0.266441
2019-04-01	0.303337	0.264928
2020-01-01	0.299840	0.262241
2023-10-01	0.285519	0.251163
2020-10-01	0.277853	0.245181
2019-06-01	0.261549	0.232340
2017-04-01	0.257606	0.229210
2020-07-01	0.239163	0.214436
2023-03-01	0.230313	0.207268
2017-02-01	0.215959	0.195533
2018-07-01	0.214934	0.194690
2021-07-01	0.187934	0.172216
2022-07-01	0.179541	0.165125
2024-03-01	0.165613	0.153247
2017-07-01	0.159019	0.147574

We can also get the main descriptive statistics using methods of panda dataframes. In this case, we get the descriptive statistics of continuously compounded returns:

print("The monthly average cc return of Bitcoin is ", BTCR["r"].mean())
print("The monthly variance of Bitcoin cc return is ", BTCR["r"].var())
print("The monthly standard deviation (volatility) of Bitcoin cc return is ", BTCR["r"].std())
print("The monthly median cc return of Bitcoin is ",BTCR["r"].median())

The monthly average cc return of Bitcoin is  0.04463684756061698
The monthly variance of Bitcoin cc return is  0.047070853539325976
The monthly standard deviation (volatility) of Bitcoin cc return is  0.21695818384962107
The monthly median cc return of Bitcoin is  0.030497170964087772

3 The Histogram

The histogram was invented to illustrate how the values of a random variable are distributed in its whole range of values. The histogram is a frequency plot. The ranges of values of a variable that are more frequent will have a higher vertical bar compared with the ranges that are less frequent.

With the histogram of a random variable we can appreciate which are the most common values, the least common values, the possible mean and standard deviation of the variable.

In my opinion, the most important foundations/pillars of both, Statistics and the theory of Probability are:

The invention of the Histogram
The discovery of the Central Limit Theorem

Although the idea of a histogram sounds a very simple idea, it took many centuries to be developed, but it has profound impact in the development of Probability theory and Statistics, which both are the pillars of all sciences.

I enjoy learning about the origins of the great ideas of humanity. The idea of the histogram was invented to decipher encrypted messages.

3.1 Interesting facts about the History of the Histogram

It is documented that the encryption of messages -cryptography- was commonly used since the beginning of civilizations. Unfortunately, it seems cryptography was invented by ancient Kingdoms mainly for war strategies. According to Herodotus, in the 500s BC, the Greeks used cryptography to win a war against the powerful Persian troops (Singh 2000).

Cryptography refers to the methods of ciphering messages, while cryptanalysis refers to the methods to decipher encrypted messages.

The Arabs in the years 800-900 AD were among the first to decipher encrypted messages thanks to their invention about the idea of the histogram. According to Singh (2000) and Al-Kadit (1992), in 1987 several ancient Arabic manuscripts related to cryptography and cryptanalysis (written between the year 800 AD and 1,500 AD) were discovered in Istanbul, Turkey (they were translated into English until 2002). This is a very fascinating story!

Below is an example of a frequency plot by Arabic philosopher Al-Kindi in the 850 AD compared with a recent frequency plot by Al-Kadi:

The encrypted messages at that time were written with the Caesar shift method. Then, to decipher an encrypted message, the Arabs used to count all characters to create a frequency plot, and then try to match the encrypted characters with the Arab characters. Finally, the replaced the corresponding Arabic matched characters in the original message to decipher it.

Interestingly, the idea of the frequency waited about 1,000 years to be used by French mathematicians to develop the foundations of the Statistics discipline. In the1700s and early 1,800s, the french mathematicians Abraham De Moivre and Pierre-Simon Laplace used this idea to develop the Central Limit Theory (CLT).

I believe the CLT is one of the most important and fascinating mathematical discoveries of all time.

The English scientist Karl Pearson coined the term histogram in 1891 when he was developing statistical methods applied to Biology.

Why the histogram is so important in Statistics? I hope we will find this out during this course!

3.2 CHALLENGES: Histogram

Do a histogram for daily Bitcoin cc returns. Hints: use the plot.hist function for pandas dataframes, and the BTC dataframe.
Interpret the histogram with your own words and in CAPITAL LETTERS

To get daily data instead of monthly, I change the interval parameter to “1d”:

BTC=yf.download(tickers="BTC-USD", start="2017-01-01",interval="1d")
# I calculate simple and cc return columns:
BTC["R"] = (BTC["Adj Close"] / BTC["Adj Close"].shift(1)) - 1
BTC["r"] = np.log(BTC['Adj Close']).diff(1)
# I keep a new object with only returns:
BTCR = BTC[['R','r']].copy()

[*********************100%%**********************]  1 of 1 completed

Code

R_bitcoin = pd.DataFrame(BTCR[["R"]])
hist=R_bitcoin.plot.hist(bins=12,alpha=0.5,title="Histogram of daily Bitcoin Returns")

We use the histogram to visualize random variables with historical values. For expected values of random variables we can use the concept of probability density function, which is analogous to the concept of the histogram, but applied to the expectation of possible values of a random variable.

4 Probability Density Functions (PDF)

4.1 Probability Density Function of a Discrete random variable

The Probability Density Function (PDF) of a discrete random variable X is the probability of X to be equal to a specific value x_{i}:

f(x)=P(X=x_{i})

For example, when throwing a dice there are six possible outcomes: 1,2,3,4,5 and 6. All of them with the same probability since these outcomes are independent events. Every outcome has a 1/6 chance of happening. The PDF for a fair six-sided dice can be defined as:

f(x)=P(X=x_{i})=\frac{1}{6}

Where x_i=1,2,3,4,5,6

Now, instead of considering the probability of every independent outcome to take place, you might wonder about the probability of getting any number equal or less than x_{i} when throwing a dice. It seems pretty obvious that the probability would be 50% for x_{i}=3 and 100% for x_{i}=6. In plain words, we can say that getting 1,2 or 3 are 50% of the cases when throwing a dice, and a range from 1 to 6 will cover all the possibilities.

Mathematically we can express the Cumulative Density Function (CDF) as:

f(x)={\sum_{i=1}^{n}}P(X=x_{i})

Following the example of the dice, we can compute the CDF for every possible outcome as follows:

P(X\leq1)=\frac{1}{6}=0.17

P(X\leq2)=\frac{2}{6}=0.33

P(X\leq3)=\frac{3}{6}=0.50

P(X\leq4)=\frac{4}{6}=0.67

P(X\leq5)=\frac{5}{6}=0.83

P(X\leq6)=\frac{1}{6}=1

We have covered the PDF and the CDF for a six-sided dice which results are very intuitive if you have played with a dice, but what about if we combine the results from two dices? In any case, knowing about the possibilities of the combination of two dices will be most useful than the results of one single dice since most of the games and casinos around the world use a couple of dices, right? When we consider the sum of two dices (S), the range of possible outcomes goes from 2 to 12, so the PDF is defined as f(S)=P(S=x_{i}), where i=2,3,4..,12. In this case we have a total of 36 possible combinations, where only one combination will give an outcome equal to 2 or 12, there are two different combinations to get a 3 or 11, and so on. The outcome with higher probability to happen is a 7, there are six combinations that will result in a 7, as you can see in the table and graph below:

S	2	3	4	5	6	7	8	9	10	11	12
f(S)	1/36	2/36	3/36	4/36	5/36	6/36	5/36	4/36	3/36	2/36	1/36

We can see this PDF as follows:

The shape of this PDF for the combination outcome of two dices looks like the famous “bell-shaped” of the normal distribution. However there is an elemental difference between this PDF and the normal distribution; the normal distribution is the probabilistic distribution of a continuous random variable, but not a discrete random variable such as the outcome from two dices.

4.2 Probability density function (PDF) of a Continuous Random Variable

As seen in previous section, the CDF of a discrete random variable is defined as the sum of the probabilities of the independent outcomes. However, when using a continuous random variable the CDF will be defined as the integration of the function f(x) (f(x) is the PDF).

In this case we will not compute the probability of the variable X to take particular value, as we did with a discrete variable. Instead of doing that, we will calculate the probability of the continuous variable X to be within a specific range limited by a and b. The probability of a continuous variable to take a specific value is zero. The CDF will be 1 or 100% for all possible values that x can take, so:

\sideset{}{_{-\infty}^{\infty}}\intop f(x)\,dx=1

\sideset{}{_{a}^{b}}\intop f(x),dx=P(a\leq x\leq b)

The probability that a continuous random variable X is between a and b, is equal to the area under the PDF on the interval [a,b]. For example, if a PDF is defined as f(x)=3x^{2};0<=x<=1 we can then compute the CDF for the interval between 0.5 and 1 as follows:

PDF=f(x)=3x^{2}

CDF=\sideset{}{_{0.5}^{1}}\intop3x^{2}\,dx=3(x^{3})/3\mid_{0.5}^{1}

CDF=x^{3}\mid_{0.5}^{1}=1^{3}-0.5^{3}=1-0.125=0.875=87.5\%

As demonstrated, given a PDF=f(x)=3x^{2}, the probability that X is between 0.5 and 1 is equal to 87.5%. In the same way, if you would like to know the probability that X is between 0 and 0.5, all we have to do is to evaluate the CDF for those limits.

CDF=\sideset{}{_{0}^{0.5}}\intop3x^{2}\,dx

CDF=\sideset{}{_{0}^{0.5}}\intop3x^{2}\,dx=3(x^{3})/3\mid_{0}^{0.5}

CDF=x^{3}\mid_{0}^{0.5}=0.5^{3}-0=0.125=12.5\%

If you see, that makes sense since 12.5% is the complement of 87.5%, so that the probability that x lies between 0 and 1 will be 1. This is true since the possible range of x as defined for this PDF is 0<=x<=1, so the probability of x being in any place of this range must be one or 100%. Note that not any arbitrary function is a PDF. A very important condition is that the integral of the function between the possible range of x must be equal to one.

Now we move to the most famous PDF, the Normal Distribution Function.

5 The Normal Distribution Function

In statistics, the most popular continuous PDF is the well-known “bell-shaped” normal distribution, which PDF is defined as:

f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{\left(-\frac{1}{2}\frac{(x-\mu)^{2}}{\sigma^{2}}\right)}\

where x can take any real value -\infty<x<\infty

where \mu is the mean of the distribution and \sigma^{2} is the variance of the distribution. For simplification purposes, the normal distribution can also be denoted as X\sim N(\mu,\sigma^{2}) where X is the continuous random variable, "\sim" means distributed as and N means normal distribution.

So the only two parameters to be defined in order to know the behavior of the continuous random variable X are:

The mean of X and
the the variance of X.

The normal distribution is symmetric around \mu.

Other interesting property of the normal distribution is the probabilities according to different ranges of X:

• For the range (\mu-\sigma)<=x<=(\mu+\sigma), the area under the curve is approximately 68%

• For the range (\mu-2\sigma)<=x<=(\mu+2\sigma), the area under the curve is approximately 95%

• For the range (\mu-3\sigma)<=x<=(\mu+3\sigma), the area under the curve is approximately 99.7%.

If our variable of interest is simple returns (R) and these returns follow a normal distribution r\sim N(\mu,\sigma^{2}), the expected value of the future cc returns will be the mean of the distribution, while the standard deviation can be seen as a measure of risk.

5.1 Interesting facts about the History of the Normal Distribution

The Normal Distribution function is one of the most interesting finding in Statistics. The normal distribution can explain many phenomena in our real world; from financial returns to human behavior, to human characteristics such as height, etc.

Many people believe that Carl Friedrich Gauss was the person who discover it, but it is not quite true. The French mathematician Abraham de Moivre was the first to discover the normal distribution when he found the distribution of the sum of binomial independent random variables.

Very interesting Story… it’s pending … check later for an update!

5.2 CHALLENGE: Simulating the normal distribution

Use the mean and standard deviation of the historical cc returns of Bitcoin and simulate the same # of returns as the days we downloaded in the BTCR dataframe.

In one plot show both, the real distribution of historical cc returns and the simulated normal distribution.

Code

from matplotlib import pyplot
pyplot.clf()
rmean = BTCR["r"].mean()
rsd = BTCR["r"].std()
N = BTCR["r"].count()
simr= np.random.normal(loc=rmean,scale=rsd, size=N)
realr = BTCR["r"].to_numpy()

bins = 12

pyplot.hist(simr,bins,alpha=0.5,label='simulated rets')
pyplot.hist(realr,bins,alpha=0.5,label='real rets')
pyplot.legend(loc='upper left')
pyplot.title(label='Histogram of real and simulated cc returns of Bitcoin')

pyplot.show()

DO YOU SEE A DIFFERENCE BEWEEN THE REAL VS THE SIMULATED RETURNS? BRIEFLY EXPLAIN.

6 References

Al-Kadit, Ibrahim A. 1992. “Origins of Cryptology: The Arab Contributions.” Cryptologia 16 (2): 97–126.

Singh, Simon. 2000. The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography. Anchor Books.