Workshop 1 Solution, Econometric Models

Author

Alberto Dorantes

Published

February 17, 2025

Abstract
This is an INDIVIDUAL workshop. In this workshop we learn about 1) calculation of returns, 2) descriptive statistics, 3) the Central Limit Theorem, and 4) hypothesis testing. In addition, we learn the basics of the Python language, more specifically data management.

1 Introduction to data management in Finance

In Python we use data frames to do data analysis such as descriptive statistics and econometrics. A data frame is like a worksheet where each row is an observation and each column represents a variable, feature, or characteristic of that observation.

There are several Python packages used for data collection and data management in Finance. We will use the yfinance package to download market data from the Yahoo Finance website.

We also use the numpy package for mathematical calculations such as logarithms.

Usually you would need to install a Python package and then import it. However, since we use Google Colab, hundreds of popular Python packages are already installed.

If you need a package that is not installed, you can install it with the command !pip install.

1.1 Importing Python packages

The packages we will use in this workshop are already installed in Google Colab, so we only need to import them:

import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

1.2 Downloading real financial prices

# Download stock prices
data = yf.download('NVDA, MSFT', start='2020-01-01', end='2024-07-31', interval='1mo', auto_adjust=True)

[*********************100%***********************]  2 of 2 completed
# auto_adjust=True means that close prices are adjusted for dividends and stock splits.

We got monthly stock prices for Nvidia and Microsoft from Jan 2020 to Jul 2024 into the data data frame.

We can see the first 5 rows of data:

data.head(5)
Price                           Close            ...      Volume             
Ticker                           MSFT      NVDA  ...        MSFT         NVDA
Date                                             ...                         
2020-01-01 00:00:00+00:00  162.822510  5.886074  ...   558530000   6125412000
2020-02-01 00:00:00+00:00  154.960175  6.723564  ...   887522300  11848652000
2020-03-01 00:00:00+00:00  151.259293  6.566414  ...  1612695500  15773952000
2020-04-01 00:00:00+00:00  171.880005  7.280848  ...   984705000  11278304000
2020-05-01 00:00:00+00:00  175.754745  8.843730  ...   688845000  12548876000

[5 rows x 10 columns]

Yahoo Finance keeps track of open, close, high, low and adjusted prices for each period. We can keep only the adjusted stock prices, which are the prices we must use to calculate financial returns.

We can create another data frame with only the adjusted closing price columns for both stocks:

adjprices = data['Close']  

We see the first 5 monthly prices of both stocks:

adjprices.head(5)
Ticker                           MSFT      NVDA
Date                                           
2020-01-01 00:00:00+00:00  162.822510  5.886074
2020-02-01 00:00:00+00:00  154.960175  6.723564
2020-03-01 00:00:00+00:00  151.259293  6.566414
2020-04-01 00:00:00+00:00  171.880005  7.280848
2020-05-01 00:00:00+00:00  175.754745  8.843730

We can see the most recent 5 monthly prices for both stocks:

adjprices.tail(5)
Ticker                           MSFT        NVDA
Date                                             
2024-03-01 00:00:00+00:00  418.369476   90.330391
2024-04-01 00:00:00+00:00  387.154877   86.381561
2024-05-01 00:00:00+00:00  412.810730  109.607071
2024-06-01 00:00:00+00:00  445.254639  123.510773
2024-07-01 00:00:00+00:00  416.763123  117.001923

We can visualize the stock price of Microsoft:

plt.plot(adjprices['MSFT'])
plt.title('Microsoft adjusted prices')
plt.show()

We can plot Microsoft monthly returns using the pct_change() function of Pandas:

plt.clf()
plt.plot(adjprices['MSFT'].pct_change().dropna())
plt.title('Microsoft monthly returns')
plt.show()

2 Return calculation

2.1 Simple and continuously compounded (cc) return

A financial simple return for a stock in period t (R_{t}) is usually calculated as the closing stock price in t plus any dividend payment at t, divided by the previous closing price, minus 1. It is the percentage change from the previous period (t-1) to the present period (t):

R_{t}=\frac{\left(price_{t}-price_{t-1}+ dividend_t\right)}{price_{t-1} }=\frac{price_{t}+dividend_t}{price_{t-1}}-1

When a stock pays dividends or does a stock split, the financial exchange makes an adjustment to the historical stock prices. This adjustment is made so that we do not need to use dividends or splits to calculate simple stock returns. Therefore, it is always recommended to use adjusted prices to calculate stock returns, unless you have information about all dividends paid in the past.

Then, with adjusted prices the formula for simple returns is easier:

R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1}}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1

For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:

R_{Jan2021}=\frac{Adjprice_{Jan2021}}{Adjprice_{Dec2020}}-1=\frac{100}{80}-1=0.25

We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
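We can quickly verify this example in Python (the prices 80 and 100 are just the numbers from the example above):

# Simple return when the adjusted price goes from $80 to $100
adjprice_dec2020 = 80.0
adjprice_jan2021 = 100.0
R_jan2021 = adjprice_jan2021 / adjprice_dec2020 - 1
print(R_jan2021)  # 0.25, which is a 25% simple return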

Although the arithmetic mean of simple returns R gives us an idea of average past return, in the case of multi-period average return, this method of calculation can be misleading. Let’s see why this is the case.

Imagine you have only 2 periods and you want to calculate the average return of an investment per period:

Returns over time

Period   Investment value (at the end of the period)   Simple period return (R)
0        $100                                          NA
1        $50                                           -0.50
2        $75                                           +0.50

Calculating the average simple return of this investment:

\bar{R}=\frac{-0.5+0.5}{2}=0%

Then, the simple average return gives me 0%, while I end up with $75, losing 25% of my initial investment ($100) over the 2 periods. If I lost 25% of my initial investment over 2 periods, then the average return per period must be negative, somewhere between 0% and -25%. The accurate mean return of an investment over time (multiple periods) is the “Geometric Mean” return.

The total return of the investment in the whole period -also called the holding-period return (HPR)- can be calculated as:

HPR=\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)-1

Using the example, the HPR for this investment is:

HPR=\left(1-0.50\right)\left(1+0.50\right)-1=0.75 - 1 = -0.25

And the formula for the geometric average of returns will be:

\bar{R_{g}}=\sqrt[N]{\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)}-1

Calculating the geometric average for this investment:

\bar{R_{g}}=\sqrt[2]{\left(1-0.5\right)\left(1+0.5\right)}-1= -0.13397

Then, the right average return per period is about -13.4%, and the HPR for the 2 periods is -25%.
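We can verify these numbers with a quick calculation (the two period returns come from the table above):

import numpy as np

period_returns = np.array([-0.50, 0.50])
# Holding-period return: product of the growth factors minus 1
HPR = np.prod(1 + period_returns) - 1
# Geometric mean return per period
R_geo = np.prod(1 + period_returns) ** (1 / len(period_returns)) - 1
print(HPR)    # -0.25
print(R_geo)  # about -0.1340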

However, if we use continuously compounded returns (r) instead of simple returns (R), then the arithmetic mean of r is an accurate measure that can be converted back to a simple return to get the geometric mean, which is the accurate mean return. Let’s do the same example using continuously compounded returns:

Continuously compounded returns

Period   Investment value (at the end)   Continuously compounded return (r)
0        $100                            NA
1        $50                             log(50) - log(100) = -0.6931
2        $75                             log(75) - log(50) = +0.4054
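We can quickly check this claim with the numbers of the example (the investment values 100, 50 and 75 come from the table above):

import numpy as np

values = np.array([100.0, 50.0, 75.0])
cc_returns = np.diff(np.log(values))  # [-0.6931, 0.4054]
mean_cc = cc_returns.mean()           # arithmetic mean of the cc returns
print(np.exp(mean_cc) - 1)            # about -0.1340, the geometric mean return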

In Finance it is highly recommended to calculate continuously compounded returns (cc returns) and to use cc returns instead of simple returns for data analysis, statistics and econometric models.

One way to calculate cc returns is by subtracting the natural log of the current adjusted price (at t) minus the natural log of the previous adjusted price (at t-1):

r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})

This is also known as the first difference of the log price.

We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):

r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)

cc returns are usually represented by small r, while simple returns are represented by capital R.

But why do we use the natural logarithm to calculate cc returns? First we need to remember what a natural logarithm is.

2.2 Reviewing the concept of natural logarithm

We generate a new dataset with the natural logarithm (log) of the MSFT and NVDA prices:

lnprices = np.log(adjprices)

We can see the first rows of this new data frame:

lnprices.head(5)
Ticker                         MSFT      NVDA
Date                                         
2020-01-01 00:00:00+00:00  5.092661  1.772589
2020-02-01 00:00:00+00:00  5.043168  1.905618
2020-03-01 00:00:00+00:00  5.018996  1.881968
2020-04-01 00:00:00+00:00  5.146797  1.985247
2020-05-01 00:00:00+00:00  5.169090  2.179709

As you see, when we apply a mathematical function (in this case, the log function) to a data frame, the function is applied to all rows of all columns of the data frame.

Let’s see the plots of both the prices and the log prices of NVDA:

plt.clf()
plt.plot(adjprices['NVDA'])
plt.title('Stock price of NVDA')
plt.show()

plt.clf()
plt.plot(lnprices['NVDA'])
plt.title('Log price of NVDA')
plt.show()

As you see, the log prices have a much smaller scale (from about 1.7 to 4.7). Remember that the natural log price is actually the exponent we need to raise the number e to in order to get the stock price.

With this plot, we can better appreciate the % growth per period over time. For example, from Jan 2020 to July 2020 the log price increased from about 1.7 to 2.5. The difference between 2.5 and 1.7, which is 0.8, is the approximate % increase in price: in this case, about 80% from Jan 2020 to July 2020.

What is a natural logarithm?

The natural logarithm of a number is the exponent that the number e (=2.71…) needs to be raised to get another number. For example, let’s name x=natural logarithm of a stock price p. Then:

e^x = p

The way to get the value of x that satisfies this equality is actually taking the natural log of p:

x = log_e(p)

Then, we have to remember that the natural logarithm is actually an exponent that you need to raise the number e to get a specific number.

The natural log is the logarithm of base e (=2.71…). The number e is an irrational number (it cannot be expressed as a ratio of 2 integers), and it is also called the Euler constant. Leonhard Euler (1707-1783) took the idea of the logarithm from the great mathematician Jacob Bernoulli, and discovered very astonishing features of the number e. Euler is considered one of the most prolific mathematicians of all time. Some historians believe that Jacob Bernoulli discovered the number e around 1690 when he was playing with calculations to know how an amount of money grows over time with an interest rate.

How is e related to the growth of financial amounts over time?

Here is a simple example:

If I invest $100.00 with an annual interest rate of 50%, then the end balance of my investment at the end of the first year (at the beginning of year 2) will be:

I_2=100*(1+0.50)^1

If the interest rate is 100%, then I would get:

I_2=100*(1+1)^1=200

Then, the general formula to get the final amount of my investment at the beginning of year 2, for any interest rate R can be:

I_2=I_1*(1+R)^1

The term (1+R) is the growth factor of my investment.

In Finance, the investment amount is called principal. If the interests are calculated (compounded) each month instead of each year, then I would end up with a higher amount at the end of the year.

Monthly compounding means that a monthly interest rate is applied to the amount to get the interest of the month, and then the interest of the month is added to the investment (principal). Then, for month 2 the principal will be higher than the initial investment. At the end of month 2 the interest will be calculated using the updated principal amount. Putting it in simple math terms, the final balance of the investment at the beginning of year 2 with monthly compounding will be:

I_2=I_1*\left(1+\frac{R}{N}\right)^{1*N}

For monthly compounding, N=12, so the monthly interest rate is equal to the annual interest rate R divided by N (R/N). Then, with an annual rate of 100% and monthly compounding (N=12):

I_2=100*\left(1+\frac{1}{12}\right)^{1*12}=100*(2.613..)

In this case, the growth factor is (1+1/12)^{12}, which is equal to 2.613.

Instead of compounding each month, if the compounding happens every instant, then we are using a continuously compounded rate.

If we do continuous compounding for the previous example, then the growth factor for one year becomes the astonishing Euler constant e.

Let’s do an example with compounding every second (1 year has 31,536,000 seconds). The investment at the end of year 1 (or at the beginning of year 2) will be:

I_2=100*\left(1+\frac{1}{31536000}\right)^{1*31536000}=100*(2.718282..)\cong100*e^1

Now we see that e^1 is the GROWTH FACTOR after 1 year if we do the compounding of the interests every moment!

We can generalize to any other annual interest rate R, so that e^R is the growth factor for an annual nominal rate R when the interests are compounded every moment.
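We can see this convergence numerically with a short sketch (the compounding frequencies N below are arbitrary choices):

import numpy as np

R = 1.0  # nominal annual rate of 100%
for N in [1, 12, 365, 31536000]:
    growth_factor = (1 + R / N) ** N
    print(N, growth_factor)
# As N grows, the growth factor approaches np.exp(R) = e**1 = 2.71828...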

When compounding every instant, we use small r instead of R for the interest rate. Then, the growth factor will be: e^r

Then we can establish a relationship between this growth factor and an equivalent effective rate:

\left(1+EffectiveRate\right)=e^{r}

If we apply the natural logarithm to both sides of the equation:

ln\left(1+EffectiveRate\right)=ln\left(e^r\right)

Since the natural logarithm function is the inverse of the exponential function, then:

ln\left(1+EffectiveRate\right)=r

In the previous example, with a nominal rate of 100% and continuous compounding, the effective rate will be:

\left(1+EffectiveRate\right)=e^{r}=2.7182

EffectiveRate=e^{r}-1

Doing the calculation of the effective rate for this example:

EffectiveRate=e^{1}-1 = 2.7182.. - 1 = 1.7182 = 171.82\%

Then, when compounding every moment, starting with a nominal rate of 100% annual interest rate, the actual effective annual rate would be 171.82%!

2.3 Return calculation

We have historical monthly adjusted prices for each stock. We can easily calculate simple returns for both stocks:

R = adjprices / adjprices.shift(1) - 1
# I delete the NA values located at the first period:
R = R.dropna()

The shift(1) function gets the previous price value (1 period ago), and it does so for all periods.

We can also calculate continuously compounded returns (r) as follows:

r = np.log(adjprices) - np.log(adjprices.shift(1))
# I delete the NA values located at the first period:
r = r.dropna()
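As a quick consistency check (optional), cc returns are just the natural log of 1 plus the simple return, so both data frames should match:

# r equals log(1 + R); the maximum absolute difference should be practically zero
check = (r - np.log(1 + R)).abs().max()
print(check)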

3 Descriptive statistics for Finance

Descriptive statistics is a set of summaries of raw data related to one or several variables of a phenomenon. Descriptive statistics usually gives us a first general idea of a phenomenon by looking at summaries such as averages and variability of variables that represent different aspects of a phenomenon.

In Finance we are interested in knowing the average past return of an investment or how much investment returns usually move up and down over time. We can use descriptive statistics to a) learn about past return and risk of investments, and b) make inferences about average return and risk for estimating expected values for the future.

In Economics, for example, we might be interested in knowing what has been the economic development of a country over the past 10 years. Then, we can calculate an annual average percentage growth of the gross domestic product. In Finance we might be interested in knowing the annual average return of an investment for the last 5 years and also the variability of annual returns over time.

Then, the most important measures of descriptive statistics are:

  • Measures of central tendency, and

  • Measures of dispersion

3.1 Central tendency measures

The main central tendency measures are:

  • Arithmetic mean

  • Weighted mean

  • Median

  • Mode

3.1.1 Arithmetic mean

An arithmetic mean of a variable X is a simple measure that tells us the average value of all valid values of X, assuming that each value has the same importance (or weight). The variable X can be representing any attribute of a subject. A subject can be an individual, a group, a team, a business unit, a company, a financial portfolio, an industry, a region, a country, etc.

An example of a variable X can be the monthly sales amount of a company for the last 3 years. In this case, the variable X will have 36 observations (36 monthly sales). The subject here is a company and the variable or attribute is the company sales over time. Another example can be a variable that represents the daily returns of a financial portfolio over the last 2 years. In this case, the variable might have about 500 observations considering 250 business days each year. The subject in this example is a financial portfolio, that might be composed of more than one stock and/or bond.

To calculate the arithmetic mean of a variable X we simply sum all the non-missing values of the variable and then divide them by the number of non-missing values. Then, the calculation is as follows:

\bar{X}=\frac{\sum_{i=1}^{N}X_{i}}{N}

Where N is the number of non-missing values (observations) of X. A missing value of a variable happens when the variable X for a specific observation has no value. It is important to note that a missing value is not a zero value. When we work with real-world datasets, it is very common to find missing values in many variables.

One of the disadvantages of the arithmetic mean is that it is very sensitive to extreme values. If a variable has a few extreme values, the arithmetic mean might not be a good representation of an average or mid point. In the presence of a few very extreme values in a variable, the best measure of central tendency is the median, not the arithmetic mean.
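A small illustration of both points with made-up numbers (the values below are not from any real dataset):

import numpy as np
import pandas as pd

x = pd.Series([2.0, 3.0, np.nan, 4.0, 100.0])  # one missing value and one extreme value
print(x.mean())    # 27.25: the missing value is ignored (we divide by 4, not 5)
print(x.median())  # 3.5: much less affected by the extreme value 100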

3.1.2 Weighted Mean

The weighted mean is an average, but each observation can have different weight or level of importance. In the arithmetic mean, each observation has the same weight, which is 1/N. We can say that the arithmetic mean is an equally-weighted average.

The formula for the weighted average is the following:

\bar{X}_{w}=\sum_{i=1}^{N}w_{i}X_{i}

The sum of all weights w_i must be 1 (100%).
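For example, the return of a portfolio is a weighted mean of the returns of its stocks. Here is a minimal sketch with hypothetical returns and weights:

import numpy as np

returns = np.array([0.02, 0.05])  # hypothetical monthly returns of 2 stocks
weights = np.array([0.60, 0.40])  # portfolio weights; they must sum to 1
weighted_mean = np.sum(weights * returns)
print(weighted_mean)              # 0.032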

3.1.3 The Median

Another measure of central tendency is the median. The median of a variable is its 50th percentile, which is the mid point of its values when they are sorted in ascending order. When we have an even number of observations, there are 2 mid points, so the median is the arithmetic mean of these 2 mid points. When we have an odd number of observations there is only 1 value in the middle, which is the median.

For example, if we want to know what is the typical size of all companies that trade shares in the Mexican stock market we can calculate the median of firm size. These firms are called public firms. Firm size can be measured with different variables. We can use the total value of its assets (total assets), the market value, or the number of employees. In this example we will use total assets at the end of 2018 for all public Mexican firms. At the end of 2018 there were 146 Mexican public firms in the market exchange (“Bolsa Mexicana de Valores”). I will show how to calculate the median total assets of these 146 firms.

The 2018 total assets of the 146 Mexican public firms are shown below (sorted alphabetically):

Mexican firms in the BMV

Firm         Row #   Industry             2018 Total Assets (in thousand pesos)
ACCEL        1       Services             $6,454,560.00
AEROMEXICO   2       Transport Services   $76,772,848.00
...
VOLARIS      148     Transport Services   $22,310,652.00
WALMART      149     Retail               $306,528,832.00

We sort the list from the lowest to the highest value of 2018 total assets:

Firm            Row #   Industry             Size Rank   2018 Total Assets (in thousand pesos)
INGEAL          98      Food & Beverages     1           $171,104.00
HIMEXSA         88      Textile              2           $494,378.00
...
FHIPO14         45      Real Estate          73          $27,979,184.00
TVAZTECA        139     Telecommunications   74          $27,988,054.00
...
AMERICA MOVIL   8       Telecommunications   145         $1,429,223,392.00
GFBANORTE       69      Financial Services   146         $1,620,470,400.00

The median total assets is the mid point of the list. However, in this case, I have 146 firms, so there is no exact mid point. Then, I need to calculate the arithmetic average of the assets of the 2 firms that are in the middle (the firms in positions 73 and 74). Then the median will be equal to $27,983,619.00 thousand pesos (about 27.9 billion pesos), which is the average of the FHIPO14 and TVAZTECA assets. The arithmetic mean of total assets considering the 146 firms is $97,860,896.23 thousand pesos (about 97.8 billion pesos), which is much bigger than the median. Then, which measure better represents the typical size of Mexican firms? In this case, the best measure is the median, so we can say that the typical size of a Mexican public firm is about $27.9 billion pesos.

Then, what is the difference between the mean and the median? When the distribution of the values of a variable is very close to a normal distribution, the mean and the median will be very similar, so we can use the mean or median to represent the typical value of the variable. When the variable has few very extreme values, then the distribution of values will not be similar to a normal distribution; it will have fat tails due to the presence of extreme values. In this case the best measure of central tendency is the median, not the mean.

What is a normal distribution? It is a very common probability distribution of random variables. We will further explain probability distributions later. For now, just consider that many variables of all disciplines and nature follow a close-to-normal distribution.

The median gives us a better representation of the “average” value of a variable compared with the arithmetic mean when the distribution of the values does NOT follow a normal distribution. In the case of 2018 total assets, we can explore its distribution using a histogram:

I will later explain in more detail what a histogram is.

In a histogram we see how often different ranges of values of a variable appear. This histogram does not look like that of a normally distributed variable. The histogram is said to be “skewed” to the right since there are very few firms with very high values of total assets. Normally distributed variables look like a bell-shaped curve where most of the values are around the arithmetic mean. In this case, we can see that most of the firms (about 100 firms) have total assets between 0 and $25 thousand million pesos (25 billion pesos). Since there are 146 firms in total, only about 46 firms have assets higher than $25 billion pesos. Actually, I can see that there are very few firms with assets greater than $1,000 thousand million pesos (greater than $1 trillion pesos), and one above $1,500 thousand million ($1.5 trillion). Looking at the previous table we can see that AMERICA MOVIL and GFBANORTE have assets greater than $1,400 thousand million pesos ($1.4 trillion).

With the histogram we can see that most of the firms (about 67%, 100 out of 146) have assets of less than 25 billion pesos. The arithmetic mean of total assets is more than $97 billion, and the median total assets (the 50th percentile) is about $27 billion. The arithmetic mean is very sensitive to extreme values, while the median is not. If we use the mean as a measure of the typical size of a Mexican firm we would be very far from the most common values of total assets. Then, the best measure of typical size is the median, which is about $27 billion pesos.
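Since the original firm-size dataset is not included here, the following sketch uses a simulated right-skewed (lognormal) variable just to illustrate how the mean, the median and the histogram behave for this kind of distribution:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
assets = np.random.lognormal(mean=10, sigma=1.5, size=146)  # simulated right-skewed "firm sizes"
print(np.mean(assets))    # pulled up by the few extreme values
print(np.median(assets))  # much smaller; a better "typical" value
plt.hist(assets, bins=20)
plt.title('Histogram of a right-skewed variable')
plt.show()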

In sum, for skewed distributions the median will always be the best measure of central tendency, while the arithmetic mean will be a biased measure that does not represent the central or typical value. Actually, in the case of normally distributed variables, the median will be very close to the mean, so the median is always a good measure of central tendency.

Examples of business variables with a skewed distribution similar to total assets are employee salaries, income of families in a region or country, and any variable from the income statement, such as firm sales or firm profits.

3.1.4 Mode

The mode is the value that appears most often in the variable. The mode can be calculated only for discrete variables, not for continuous variables. The mode is rarely used as a central tendency measure.

3.2 Dispersion measures

3.2.1 Variance and standard deviation

The standard deviation is used to measure how much, on average, the individual values of a variable deviate from the mean.

The variance of a variable X is the average of the squared deviations of each individual value X_i from the mean:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

Where:

X_i = Value i of the variable X

\overline{X}=\frac{1}{n}\sum_{i=1}^{n}X_{i} = Arithmetic average of X

Why is the variance the average of squared deviations? The reason is that if we do not square the deviations, then they will cancel each other out, since some deviations are positive and others negative. Then, the squaring is just a trick to avoid canceling the positive with the negative deviations.

The result of the variance is a number that our brain cannot easily interpret. To have a more reasonable measure of linear deviation, we take the square root of the variance; then we can interpret that number as the average deviation of all points from their mean. This measure is called the standard deviation:

SD(X)=\sqrt{Var(X)}= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}=\sigma_{X}

The variance can also be expressed as the expected value of squared deviations:

Var(X)=E[(X-\bar{X})^2]

Doing the multiplication of the squared term:

Var(X)=E[(X^2-X\bar{X}-\bar{X}X+\bar{X}^2)]

Since \bar{X} is a constant, I can take it out of the expectation:

Var(X)=E[X^2]-\bar{X}E[X]-\bar{X}E[X]+\bar{X}^2

Since E[X]=\bar{X}, then:

Var(X)=E[X^2]-\bar{X}^2

Then, the variance can be defined as the expected value of X squared minus its squared mean.

Also, we can express the variance of X as:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}\right)^2-\bar{X}^2

Most Statistics books and Statistics software use (N-1) instead of N as the denominator of the variance formula to get a more conservative value of the variance. This measure is called sample variance. When we divide by N in the variance formula, we are calculating the population variance. Both formulas provide very similar results, but the sample variance will be a bit bigger than the population variance, so it is a more conservative value.

In Statistics, the sample variance is an unbiased measure of the underlying (real) variance.

Then, we can re-write the formula for sample variance of X as:

Var(X)=\frac{1}{\left(N-1\right)}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

And the sample standard deviation of X can be written as:

SD(X)=\sqrt{Var(X)}=\sqrt{\frac{1}{(N-1)}\sum_{i=1}^{N}(X_{i}-\bar{X})^{2}}

SD(X)=\frac{\sqrt{\sum_{i=1}^{N}(X_{i}-\bar{X})^{2}}}{\sqrt{(N-1)}}=\sigma_{X}
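In Python, the ddof argument controls which denominator is used. A quick sketch using the simple returns R calculated earlier:

# Population (ddof=0) vs sample (ddof=1) variance of MSFT monthly returns
pop_var = R['MSFT'].var(ddof=0)   # divides by N
samp_var = R['MSFT'].var(ddof=1)  # divides by N-1 (the pandas default)
print(pop_var, samp_var)
print(np.sqrt(samp_var), R['MSFT'].std())  # two equivalent ways to get the sample standard deviation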

4 The Histogram

The histogram was invented to illustrate how the values of a random variable are distributed in its whole range of values. The histogram is a frequency plot. The ranges of values of a variable that are more frequent will have a higher vertical bar compared with the ranges that are less frequent.

With the histogram of a random variable we can appreciate which are the most common values, the least common values, the possible mean and standard deviation of the variable.

In my opinion, the most important foundations/pillars of both, Statistics and the theory of Probability are:

  • The invention of the Histogram

  • The discovery of the Central Limit Theorem

Although the histogram sounds like a very simple idea, it took many centuries to be developed, and it has had a profound impact on the development of Probability theory and Statistics, both of which are pillars of all sciences.

I enjoy learning about the origins of the great ideas of humanity. The idea of the histogram was invented to decipher encrypted messages.

4.1 Interesting facts about the History of the Histogram

It is documented that the encryption of messages -cryptography- has been commonly used since the beginning of civilization. Unfortunately, it seems cryptography was invented by ancient kingdoms mainly for war strategies. According to Herodotus, in the 500s BC the Greeks used cryptography to win a war against the powerful Persian troops [@2000_SinghSimon_BOOK].

Cryptography refers to the methods of ciphering messages, while cryptanalysis refers to the methods to decipher encrypted messages.

In the 800s AD, the Arabs deciphered encrypted messages thanks to their invention of the histogram. According to @2000_SinghSimon_BOOK and @1992_AlKaditIbrahimA, in 1987 several ancient Arabic manuscripts related to cryptography and cryptanalysis (written between 800 AD and 1500 AD) were discovered in Istanbul, Turkey (they were not translated into English until 2002). This is a fascinating story!

Below is an example of a frequency plot by the Arabic philosopher Al-Kindi from around 850 AD, compared with a recent frequency plot by Al-Kadi:

Figure taken from @1992_AlKaditIbrahimA

The encrypted messages at that time were written with the Caesar shift method. Then, to decipher an encrypted message, the Arabs counted all characters to create a frequency plot, and then tried to match the encrypted characters with the Arabic characters. Finally, they replaced the corresponding matched Arabic characters in the original message to decipher it.

Interestingly, the idea of the frequency plot waited about 1,000 years to be used by French mathematicians to develop the foundations of the Statistics discipline. In the 1700s and early 1800s, the French mathematicians Abraham De Moivre and Pierre-Simon Laplace used this idea to develop the Central Limit Theorem (CLT).

I believe the CLT is one of the most important and fascinating mathematical discoveries of all time.

The English scientist Karl Pearson coined the term histogram in 1891 when he was developing statistical methods applied to Biology.

Why is the histogram so important in Statistics? I hope we will find this out during this course!

5 The Central Limit Theorem

The Central Limit Theorem is one of the most important discoveries in the history of mathematics and statistics. Actually, thanks to this discovery, the field of modern Statistics was developed at the beginning of the 20th century.

The central limit theorem says that for any random variable with ANY probability distribution, if you take many groups (each with at least 30 elements) from the original distribution and take the mean of each group, then the distribution of these means will have the following characteristics:

1) The distribution of the sample means will be close to a normal distribution when you take many groups (each group should have at least 30 elements). Actually, this happens not only with the sample mean, but also with other linear combinations such as the sum or a weighted average of the variable.

2) The standard deviation of the sample means will be much less than the standard deviation of the individuals. More specifically, the standard deviation of the sample mean will shrink by a factor of 1/\sqrt{N}.

Then, the central limit theorem says that, no matter the original probability distribution of any random variable, if we take groups of this variable, a) the means of these groups will have a probability distribution close to the normal distribution, and b) the standard deviation of the mean will shrink according to the number of elements of each group.

An interesting question is why the standard deviation shrinks by a factor of 1/\sqrt{N}. We can show this with basic probability theory and intuition. Let’s start with intuition.

When you take groups and then take the mean of each group, the extreme values you could have in each group will tend to cancel out when you take the average of the group. Then, it is expected that the variance of the group mean will be much less than the variance of the variable. But how much less?

Now let’s use simple math and probability theory to examine this relationship between these variances:

Let’s define a random variable X as the weight of students, with observations X_1, X_2, …, X_N. The sample mean will be:

\bar{X}=\frac{1}{N}\left(X_{1}+X_{2}+...+X_{N}\right)

We can estimate the variance of this mean as follows:

VAR\left(\bar{X}\right)=VAR\left(\frac{1}{N}\left(X_{1}+X_{2}+...+X_{N}\right)\right)

Applying basic probability rules I can express the variance as:

VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}VAR\left(X_{1}+X_{2}+...+X_{N}\right)

VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}\left[VAR\left(X_{1}\right)+VAR\left(X_{2}\right)+...+VAR\left(X_{N}\right)\right]

Since the variance of X_1 is the same as the variance of X_2, and the same for any X_i, then:

VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}N\left[VAR\left(X\right)\right]

Then we can express the variance of the mean as:

VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)\left[VAR\left(X\right)\right]

We can say that the expected variance of the sample mean is equal to the variance of the individuals divided by N, that is the sample size.

Finally, we can get the standard deviation of the sample mean by taking the square root of the variance:

SD(\bar{X})=\sqrt{\frac{1}{N}}\left[SD(X)\right]

SD(\bar{X})=\frac{SD(X)}{\sqrt{N}}

Then, the expected standard deviation of the sample mean of a random variable is equal to the individual standard deviation of the variable divided by the square root of N.
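We can see the theorem in action with a small simulation. The sketch below draws many groups from a very non-normal (exponential) distribution; the distribution and the sample sizes are arbitrary choices:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
N = 30           # size of each group
n_groups = 2000  # number of groups
# Draw the groups from a skewed exponential distribution and take the mean of each group:
sample_means = np.random.exponential(scale=1.0, size=(n_groups, N)).mean(axis=1)

# The SD of an exponential(scale=1) variable is 1, so the SD of the means should be close to 1/sqrt(30) = 0.1826
print(np.std(sample_means, ddof=1))
plt.hist(sample_means, bins=30)
plt.title('Distribution of sample means (close to normal)')
plt.show()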

Thanks to the CLT we can make inferences (good guesses) about:

  1. The mean of any random variable
  2. The standard deviation of the means
  3. Since the sample means follow approximately a normal distribution, we can estimate how much the sample mean can vary! This variation is called the standard error (the standard deviation of the mean)

After the CLT, the concepts of hypothesis testing and linear regression were further developed. With the CLT we have a theory to make inferences about the population mean and standard deviation of any random variable using samples.

The method to make these types of inferences is called Hypothesis Testing.

6 HYPOTHESIS TESTING

The idea of hypothesis testing is to provide strong evidence - using facts - about a specific belief that is usually supported by a theory or common sense. This belief is usually the belief of the person conducting the hypothesis testing. This belief is called the Alternative Hypothesis.

The person who wants to show evidence about his/her belief is supposed to be very humble so the only way to be convincing is by using data and a rigorous statistical method.

Let’s imagine 2 individuals, Juanito and Diablito. Juanito wants to convince Diablito about a belief. Diablito is very, very skeptical and intolerant. Diablito is also an expert in Statistics! Then, Juanito needs a very strong statistical method to convince Diablito. Juanito also needs to be very humble so Diablito does not get angry.

Then, Juanito decides to start by assuming that his belief is NOT TRUE, so Diablito will be receptive and continue listening to Juanito. Juanito decides to collect real data about his belief and defines 2 hypotheses:

  • H0: The Null Hypothesis. This hypothesis is the opposite of Juanito's belief, and he starts by accepting that this hypothesis is TRUE. This is also called the HYPOTHETICAL mean.

  • Ha: The Alternative Hypothesis. This is what Juanito believes, but he starts by accepting that this is NOT TRUE.

Diablito is an expert in Statistics, so he knows the Central Limit Theorem very well! However, Juanito is humble, but he also knows the CLT very well.

Then Juanito does the following to try to convince Diablito:

  • Juanito collects a random sample related to his belief. His belief is about the mean of a variable X; he believes that the mean of X is greater than zero.

  • He calculates the mean and standard deviation of the sample.

Since he collected a random sample, then he and Diablito know that, thanks to the CLT:

  1. The mean of this sample will behave VERY SIMILARLY to a normal distribution,

  2. The standard deviation of this sample mean is much less than the standard deviation of the individuals of the sample, and it can be calculated by dividing the individual standard deviation by the square root of the sample size.

  3. With a probability of about 95%, the sample mean will have a value between 2 standard deviations below its TRUE mean (in this case 0, the mean under H0) and 2 standard deviations above its mean.

Then, if Juanito shows that the calculated sample mean of X is higher than zero (the hypothetical mean of H0) plus 2 standard deviations, then Juanito will have very powerful statistical evidence to show Diablito that, with more than 95% confidence, the true mean of X is bigger than zero!

Juanito is very smart (as is Diablito). Then, he calculates an easy measure to quickly know how far his sample mean is from the hypothetical mean (zero), measured in # of standard deviations of the mean. This standardized distance is usually called z or t. If z is 2 or more, he can say that the sample mean is 2 standard deviations above the supposed true mean (zero), so he can convince Diablito of his belief that the actual TRUE mean is greater than zero.

Although the CLT says that the mean of a sample behaves as normal, when the population standard deviation is unknown and we estimate it with the sample standard deviation, the standardized sample mean actually follows the t-Student probability distribution. The t-Student distribution is very, very similar (almost the same) to the normal distribution when the sample size is bigger than 30. But for small samples, the t-Student does a much better job of describing the behavior of the standardized sample mean than the z normal distribution.

Then, for hypothesis testing we will use the t-Student distribution instead of the z normal distribution.

The hypothesis test that compares the mean of a variable with a specific value is called the one-sample t-test.

6.1 One-sample t-test

We start with the simple case of hypothesis testing: the One-Sample t-test.

We will learn about this with an example.

We download historical monthly data for the S&P500 market index and calculate cc returns:

import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Download stock data
data = yf.download('^GSPC', start='2020-01-01', end='2025-01-31', interval='1mo')

[*********************100%***********************]  1 of 1 completed
adjprices = data['Close']  # Get adjusted close prices

# Calculate continuously compounded returns

ccr = np.log(adjprices) - np.log(adjprices.shift(1))
ccr = ccr.dropna()

Here is an example of a t-test to check whether the S&P 500 has an average monthly return significantly greater than zero:

# H0: mean(ccr) = 0
# Ha: mean(ccr) > 0

# Standard error
se_GSPC = np.std(ccr,ddof=1) / np.sqrt(len(ccr))
print(f"Standard error S&P 500 = {se_GSPC}")
Standard error S&P 500 = 0.006789653490941936
# t-value
mean_ccr = np.mean(ccr)
t_GSPC = (mean_ccr - 0) / se_GSPC
print(f"t-value S&P 500 = {t_GSPC}")
t-value S&P 500 = 1.540082367227796

Since the t-value of the mean return of S&P 500 is lower than 2, I can’t reject the null hypothesis at the 95% confidence level. Therefore, at the 95% confidence level, S&P 500 mean return is not statistically greater than 0.

We can calculate the p-value of the test.

What is the p-value of a test?

The p-value of a test is the probability that I will be wrong if I reject the NULL hypothesis. In other words, (1-pvalue) will be the probability that MY HYPOTHESIS (the alternative hypothesis) is true!

(1-pvalue)% is called the confidence level I can use to reject the null hypothesis.
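Before using the built-in function, we can get this p-value directly from the t-Student distribution with scipy as a manual check (t_GSPC and ccr come from the code above):

from scipy import stats as st

# 1-tailed p-value: probability of observing a t-value greater than the one we got,
# under a t-Student distribution with N-1 degrees of freedom
pvalue_manual = st.t.sf(t_GSPC, df=len(ccr) - 1)
print(pvalue_manual)  # about 0.064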

Fortunately, there is a Python function that does the same we did, but faster and it gets the p-value of the test:

from scipy import stats as st

# One-sided t-test
ttest_GSPC = st.ttest_1samp(ccr, 0, alternative='greater')

# Showing the t-Statistics and the p-value:
ttest_GSPC
TtestResult(statistic=1.540082367227796, pvalue=0.0644434328348415, df=59)

I got the same result with the ttest_1samp function.

But what does this mean? Does it mean that investing in the S&P 500 will not give you positive returns over time? No, that is not quite the right conclusion.

We got a p-value of 6.4%. This means that we can accept our hypothesis that the monthly mean return of the S&P 500 is >0 at the 93.6% confidence level (=1-pvalue)!

When we get a p-value < 0.05 we can say that we have strong evidence (at least at the 95% confidence level) to reject the null hypothesis (and accept our hypothesis). In this case we say that our result is statistically significant.

When we get a p-value between 5% and 10% (0.05 and 0.10) we can say that we have evidence at the 90% confidence level to reject the null hypothesis. We can also say that our results are marginally significant.

6.2 Hypothesis testing - two-sample t-test

Now we will do a hypothesis testing to compare the means of two groups. This test is usually named two-sample t-test.

In the case of the two-sample t-test we check whether the mean of one group is significantly different from (or greater than) the mean of another group.

Imagine we have two random variables X and Y and we take a random sample of each variable to check whether the mean of X is greater than the mean of Y.

We start writing the null and alternative hypothesis as follows:

H0:\mu_{x}=\mu_{y}

Ha:\mu_{x}\neq\mu_{y}

We do simple algebra to leave a number on the right-hand side of the equality and a random variable on the left-hand side. Then, we re-write these hypotheses as:

H0:(\mu_{x}-\mu_{y})=0

Ha:(\mu_{x}-\mu_{y})\neq0

The Greek letter \mu is used to represent the population mean of a variable.

To test this hypothesis we take a random sample of X and Y and calculate their means.

Then, in this case, the variable of study is the difference of 2 means! Then, we can name the variable of study as diff:

diff = (\mu_{x}-\mu_{y})

Since we use sample means instead of population means, we can re-define this difference as:

diff = (\bar{X}-\bar{Y})

The steps for all hypothesis tests are basically the same. What changes is the calculation of the standard deviation of the variable of study, which is usually named the standard error.

For the case of one-sample t-test, the standard error was calculated as \frac{SD}{\sqrt{N}}, where SD is the individual sample standard deviation of the variable, and N is the sample size.

In the case of two-sample t-test, the standard error SE can be calculated using different formulas depending on the assumptions of the test. In this workshop, we will assume that the population variances of both groups are NOT EQUAL, and the sample size of both groups is the same (N). For these assumptions, the formula is the following:

SD(diff)=SE=\sqrt{\frac{Var(X)+Var(Y)}{N}}

But, where does this formula come from?

We can easily derive this formula by applying basic probability rules to the variance of a difference of 2 means. Let’s do so.

The variances of each group of X and Y might be different, so we can estimate the variance of the DIFFERENCE as:

Var(\bar{X}-\bar{Y})=Var(\bar{X})+Var(\bar{Y})

This is true if \bar{X} and \bar{Y} are independent. We will assume that both random variables are independent of each other. This might not hold in certain real-world problems, but we assume it for simplicity. If there is dependence, we need to subtract a term equal to 2 times the covariance between both means.

Why is the variance of a difference of 2 random variables the SUM of their variances? This sounds counter-intuitive, but it is correct. The intuition is that both variables bring their own uncertainty to the difference: if a deviation of \bar{Y} happens to be negative, then subtracting it adds to (instead of reducing) the deviation of the difference.

As we learned with the CLT, the variance of the mean of a random variable is reduced according to its sample size: Var(\bar{X})=\frac{Var(X)}{N}. Then:

Var(\bar{X}-\bar{Y})=\frac{Var(X)}{N}+\frac{Var(Y)}{N}

Factorizing the expression:

Var(\bar{X}-\bar{Y})=\frac{1}{N}\left[Var(X)+Var(Y)\right]

We take the square root to get the expected standard deviation of (\bar{X}-\bar{Y}):

SD(\bar{X}-\bar{Y})=\sqrt{\frac{1}{N}\left[Var(X)+Var(Y)\right]}

Then, the method for hypothesis testing is the same we did in the case of one-sample t-test. We just need to use this formula as the denominator of the t-statistic.

Then, the t-statistic for the two-sample t-test is calculated as:

t=\frac{(\bar{X}-\bar{Y})-0}{\sqrt{\frac{Var(X)+Var(Y)}{N}}}

Remember that the value of t is the # of standard deviations of the variable of study (in this case, the difference of the 2 means) that the empirical difference we got from the data is away from the hypothetical value, zero.

The rule of thumb we have used is that if |t|>2 we have statistical evidence, at least at the 95% confidence level, to reject the null hypothesis (or to support our alternative hypothesis).

6.3 EXAMPLE - IS AMD MEAN RETURN HIGHER THAN ORACLE MEAN RETURN?

Do a t-test to check whether the mean monthly cc return of AMD (AMD) is significantly greater than the mean monthly return of Oracle (ORCL). Use data from Jan 2020 to date.

# Getting price data. I set auto_adjust=True so that close prices are adjusted prices:
sprices=yf.download(tickers='AMD ORCL', start='2020-01-01', end='2025-01-31',interval='1mo', auto_adjust=True)

[*********************100%***********************]  2 of 2 completed
# I select the Close columns for both stocks:
sprices=sprices['Close']

The sprices data frame has 2 columns: the close price for AMD and the close price for ORCL:

sprices.head(5)
Ticker                           AMD       ORCL
Date                                           
2020-01-01 00:00:00+00:00  47.000000  48.435822
2020-02-01 00:00:00+00:00  45.480000  45.877945
2020-03-01 00:00:00+00:00  45.480000  44.829803
2020-04-01 00:00:00+00:00  52.389999  49.133751
2020-05-01 00:00:00+00:00  53.799999  50.112736

Now we calculate monthly continuously compounded returns for both stocks:

# Calculating cc returns as the difference of the log price and the log price of the previous month:
sr = np.log(sprices) - np.log(sprices.shift(1))
# we can also calculate cc returns using the diff function:
# sr = np.log(sprices).diff(1)

# Deleting the first month with NAs:
sr=sr.dropna()

sr.head()
Ticker                          AMD      ORCL
Date                                         
2020-02-01 00:00:00+00:00 -0.032875 -0.054255
2020-03-01 00:00:00+00:00  0.000000 -0.023111
2020-04-01 00:00:00+00:00  0.141443  0.091673
2020-05-01 00:00:00+00:00  0.026558  0.019729
2020-06-01 00:00:00+00:00 -0.022367  0.027515

We can calculate the monthly mean return for both stocks:

amd_mean = sr['AMD'].mean()
orcl_mean = sr['ORCL'].mean()
print(f"AMD mean cc % return= {100*amd_mean}%")
AMD mean cc % return= 1.5050190594531336%
print(f"Oracle mean cc % return= {100*orcl_mean}%")
Oracle mean cc % return= 2.089094493784317%

We can see that, on average, Oracle mean return is higher than the AMD mean return. However, we need to check whether this difference is statistically significant. Then, we need to do a 2-sample t-test.

We state the hypothesis and calculate the t-Statistic:

# Stating the hypotheses: 
# H0: (mean(rAMD) - mean(rORACLE)) = 0
# Ha: (mean(rAMD) - mean(rORACLE)) <> 0

# Calculating the standard error of the difference of the means:
# Getting the number of non-missing observations for the sample:
N = sr['AMD'].count()
# Getting the variances of both columns: 
amd_var = sr['AMD'].var()
orcl_var = sr['ORCL'].var()
# Now we get the standard error for the mean difference:
sediff = np.sqrt((1/N) * (amd_var + orcl_var ) )

# Calculating the t-Statistic:
t = (sr['AMD'].mean() - sr['ORCL'].mean()) / sediff
print(f"t-Statistic = {t}")
t-Statistic = -0.2678501343596106

Fortunately we can use a Python function to easily calculate the t-value along with its p-value and 95% confidence interval:

# I do the 2-way sample t-test using the ttest_ind function from stats:
st.ttest_ind(sr['AMD'],sr['ORCL'],equal_var=False)
TtestResult(statistic=-0.2678501343596106, pvalue=0.7894068913063057, df=93.15007762175287)
# With this function we avoid calculating all steps of the hypothesis test!

We got the same t-value as above, and we also got the p-value of the test. Since the absolute value of the t-statistic is much less than 2, and the p-value is much greater than 0.05, we cannot reject the null hypothesis.

Then, although the ORCL mean return is higher than that of AMD, we do not have significant evidence (at the 95% confidence level) to say that the mean of the ORCL return is greater than the mean of the AMD return.

We can use another Python function that displays more results for this t-test:

import researchpy as rp
# Using the ttest function from researchpy:
rp.ttest(sr['AMD'],sr['ORCL'],equal_variances=False)
(   Variable      N      Mean        SD        SE  95% Conf.  Interval
0       AMD   60.0  0.015050  0.147082  0.018988  -0.022945  0.053045
1      ORCL   60.0  0.020891  0.083049  0.010722  -0.000563  0.042345
2  combined  120.0  0.017971  0.118970  0.010860  -0.003534  0.039475,          Satterthwaite t-test  results
0  Difference (AMD - ORCL) =   -0.0058
1       Degrees of freedom =   93.1501
2                        t =   -0.2679
3    Two side test p value =    0.7894
4   Difference < 0 p value =    0.3947
5   Difference > 0 p value =    0.6053
6                Cohen's d =   -0.0489
7                Hedge's g =   -0.0486
8           Glass's delta1 =   -0.0397
9         Point-Biserial r =   -0.0277)
# With this function we avoid calculating all steps of the hypothesis test!

If you receive an error, you might need to install the researchpy package with the command:

!pip install researchpy

We got the same results as above, but now we see details about the mean of each stock, the difference, the t-value, and the p-value of the difference!

7 Confidence level, Type I Error and pvalue

The confidence level of a test is related to the error level of the test. For a 95% confidence level there is a 5% probability that we make the mistake of rejecting the null hypothesis when it is actually true. This error is also called the Type I Error.

The pvalue of the test is actually the exact probability of making a Type I Error after we calculate the exact t-statistic. In other words, the pvalue is the probability that we will be wrong if we reject the null hypothesis (and support our hypothesis).

For each value of a t-statistic, there is a corresponding pvalue. We can relate both values in the following figure of the t-Student PDF:

Illustrating t-Statistics vs pvalue

For a 95% confidence level and a 2-tailed p-value, the critical t-value is close to 2 (not exactly 2); it changes according to N, the # of observations in the sample.

When the sample size N>30, the t-Student distribution approximates the Z normal distribution. In the above figure we can see that when N>30 and t=2, the approximate p-values are: 1-tailed p-value = 2.5%, and 2-tailed p-value = 5%.

Then, what are 1-tailed and 2-tailed p-values? The 2-tailed p-value will always be twice the 1-tailed p-value since the t-Student distribution is symmetric.

We always want a very small p-value in order to reject H0. Then, the 1-tailed p-value seems to be the one to use. However, the 2-tailed p-value is a more conservative value (the diablito will feel ok with this value). Most statistical software and computer languages report the 2-tailed p-value.

Then, which p-value is the right one to use? It depends on the context. When there is a theory that supports the alternative hypothesis, we can use the 1-tailed p-value. For now, we can be conservative and use the 2-tailed p-value for our t-tests.

Then, we can define the p-value of a t-test (in terms of the confidence level of the test) as:

pvalue=(1-ConfidenceLevel)

In the case of a 1-tailed p-value and a 95% confidence level, the critical t-value is less than 2; it is approximately 1.65:

Illustrating 1-tailed t and pvalues

The p-value cannot be calculated with an analytic formula since the integral of the Z normal or t-Student PDF has no closed-form analytic solution. Traditionally, we would need to use tables. Fortunately, all statistical software and most computer languages can easily calculate p-values for any hypothesis test.
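For example, scipy can give us critical t-values and p-values directly (the sample size of 60 is just the one from the S&P 500 example above):

from scipy import stats as st

df = 60 - 1                  # degrees of freedom (N - 1)
print(st.t.ppf(0.95, df))    # 1-tailed critical t at the 95% confidence level, about 1.67
print(st.t.ppf(0.975, df))   # 2-tailed critical t at the 95% confidence level, about 2.00
print(2 * st.t.sf(2.0, df))  # 2-tailed p-value for t=2, about 0.05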

8 CHALLENGES

8.1 Challenge 1 SOLUTION

  • Using Python code, you have to calculate the mean, standard deviation and variance of the simple and continuously compounded returns of MSFT and NVDA.

We can easily calculate the main measures of descriptive statistics with the describe function as follows:

For simple returns, we have the R data frame, so we apply the describe function to R:

R.describe()
Ticker       MSFT       NVDA
count   54.000000  54.000000
mean     0.019683   0.067837
std      0.066578   0.150722
min     -0.107376  -0.320158
25%     -0.032025  -0.026493
50%      0.019096   0.076195
75%      0.066142   0.178234
max      0.176291   0.363437
  • WHICH STOCK WOULD YOU PREFER ACCORDING TO THEIR MEAN AND STANDARD DEVIATION OF RETURNS? JUSTIFY YOUR ANSWER

LET’S START ANALYZING THE MEAN AND STANDARD DEVIATION OF BOTH STOCKS:

WE CAN SEE THAT THE MONTHLY MEAN RETURN FOR MICROSOFT IS ABOUT 1.96% (0.019683). IF WE ANNUALIZE THIS MONTHLY RETURN, WE COULD GET AN ANNUAL AVERAGE RETURN OF APPROX. 24% (12*1.96%).

THE MONTHLY STANDARD DEVIATION OF MSFT RETURNS IS ABOUT 6.65% (0.0665). IN FINANCE, THE STANDARD DEVIATION OF RETURNS IS USUALLY CALLED VOLATILITY. THIS MEANS THAT, ON AVERAGE, THE MONTHLY DEVIATIONS OF MSFT RETURNS FROM THEIR MEAN ARE ABOUT 6.65%. IF WE WANT TO ANNUALIZE THE STANDARD DEVIATION (VOLATILITY), WE NEED TO MULTIPLY THE MONTHLY STANDARD DEVIATION TIMES THE SQUARE ROOT OF 12 (# OF MONTHS IN THE YEAR). IN THIS CASE, THE APPROX. ANNUAL STANDARD DEVIATION WILL BE AROUND 6.65% * SQRT(12) = 0.2303 = 23.03%!
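These approximate annualizations can be computed directly from the R data frame for both stocks (this is just the rule of thumb described above):

# Approximate annualized mean return and volatility from monthly simple returns
annual_mean = 12 * R.mean()
annual_vol = np.sqrt(12) * R.std()
print(annual_mean)  # MSFT about 0.24, NVDA about 0.81
print(annual_vol)   # MSFT about 0.23, NVDA about 0.52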

IT IS EASIER TO INTERPRET ANNUAL RETURNS AND ANNUAL VOLATILITY SO THAT WE CAN HAVE A QUICK COMPARISON WITH OTHER INVESTMENTS SUCH AS RISK-FREE RATE (CETES IN MEXICO).

THEN, WE CAN SAY THAT THE ANNUAL AVERAGE RETURN OF MSFT HAS BEEN AROUND 24%, BUT THIS RETURN CAN MOVE ON AVERAGE 23% ABOVE OR BELOW THE ANNUAL AVERAGE. IT SOUNDS LIKE AN INTERESTING INVESTMENT: THE ANNUAL VOLATILITY IS A LITTLE BIT LESS THAN THE ANNUAL AVERAGE RETURN, SO ZERO IS ABOUT ONE STANDARD DEVIATION BELOW THE MEAN. IF MSFT RETURNS BEHAVE LIKE A NORMALLY DISTRIBUTED VARIABLE, MSFT WOULD OFFER POSITIVE ANNUAL RETURNS (>0) IN ABOUT 84% OF THE CASES!

INTERPRETATIONS FOR NVDA (NVIDIA):

WE CAN SEE THAT THE MONTHLY MEAN RETURN FOR NVDA IS ABOUT 6.78% (0.06783). IF WE ANNUALIZE THIS MONTHLY RETURN, WE COULD GET AN ANNUAL AVERAGE RETURN OF APPROX. 81.39% (12*6.78%)!

THE MONTHLY STANDARD DEVIATION OF NVDA RETURNS IS ABOUT 15.07% (0.1507). IN FINANCE, THE STANDARD DEVIATION OF RETURNS IS USUALLY CALLED VOLATILITY. THIS MEANS THAT, ON AVERAGE, THE MONTHLY DEVIATIONS OF NVDA RETURNS FROM THEIR MEAN ARE ABOUT 15.07%. IF WE WANT TO ANNUALIZE THE STANDARD DEVIATION (VOLATILITY), WE NEED TO MULTIPLY THE MONTHLY STANDARD DEVIATION TIMES THE SQUARE ROOT OF 12 (# OF MONTHS IN THE YEAR). IN THIS CASE, THE APPROX. ANNUAL STANDARD DEVIATION WILL BE AROUND 15.07% * SQRT(12) = 0.5221 = 52.21%!

IT IS EASIER TO INTERPRET ANNUAL RETURNS AND ANNUAL VOLATILITY SO THAT WE CAN HAVE A QUICK COMPARISON WITH OTHER INVESTMENTS SUCH AS RISK-FREE RATE (CETES IN MEXICO).

THEN, WE CAN SAY THAT THE ANNUAL AVERAGE RETURN OF NVDA HAS BEEN AROUND 81%, BUT THIS RETURN CAN MOVE ON AVERAGE 52% ABOVE OR BELOW THE ANNUAL AVERAGE. IT SOUNDS LIKE AN INTERESTING INVESTMENT: THE ANNUAL VOLATILITY IS LESS THAN THE ANNUAL AVERAGE RETURN, SO ZERO IS ABOUT 1.5 STANDARD DEVIATIONS BELOW THE MEAN. IF NVDA RETURNS BEHAVE LIKE A NORMALLY DISTRIBUTED VARIABLE, NVDA WOULD OFFER POSITIVE ANNUAL RETURNS (>0) IN ABOUT 94% OF THE CASES!

THEN, WHICH STOCK WOULD I PREFER?

COMPARING MSFT WITH NVDA, NVDA'S ANNUAL AVERAGE RETURN IS MUCH BIGGER THAN THAT OF MSFT (ABOUT 81% VS 24%). HOWEVER, NVDA'S RISK IS MORE THAN DOUBLE THAT OF MICROSOFT, SINCE THE VOLATILITY OF NVDA IS MORE THAN 2 TIMES THE VOLATILITY OF MSFT!

THEN, WHICH INVESTMENT LOOKS BETTER FOR YOU? IT DEPENDS ON YOUR RISK AVERSION!! ARE YOU WILLING TO LOSE MORE THAN 50% OF YOUR MONEY IN 1 YEAR, AND WAIT TO RECOVER YOUR INVESTMENT AND MAKE HIGH RETURNS IN THE LONG RUN? THEN, YOU CAN GO FOR NVDA!!

  • WHICH STOCK IS MORE VOLATILE? EXPLAIN AND INTERPRET THE VOLATILITY - THE STANDARD DEVIATION - OF THIS STOCK

AS MENTIONED ABOVE, NVIDIA IS MUCH MORE VOLATILE; NVIDIA'S MONTHLY VOLATILITY (ABOUT 15.07%) IS MORE THAN DOUBLE THAT OF MICROSOFT (ABOUT 6.66%).
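
A minimal sketch of these annualization calculations, assuming the R data frame of monthly simple returns built above (the last step also assumes that returns are roughly normally distributed):

# Approximate annualization: multiply the monthly mean by 12
# and the monthly standard deviation by the square root of 12
annual_mean = 12 * R.mean()
annual_vol = np.sqrt(12) * R.std()
print("Approx. annual mean return:")
print(annual_mean)
print("Approx. annual volatility:")
print(annual_vol)

# Under a normality assumption, the probability of a positive annual return
# is the normal CDF evaluated at mean/volatility; for both stocks this is above 68%
from scipy import stats as st
print("Approx. probability of a positive annual return:")
print(st.norm.cdf(annual_mean / annual_vol))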

The describe function does not show the variance. We can just square the standard deviations or we can calculate variance as follows:

R.var()
Ticker
MSFT    0.004433
NVDA    0.022717
dtype: float64

The interpretation of the variance of returns is not very informative. We can say that the average of the squared deviations of returns for MSFT is about 0.004433, while that of NVDA is about 0.0227. We mainly use the variance to calculate the standard deviation (by taking its square root).
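
As a quick check of that relationship, we can take the square root of the variances and confirm that it reproduces the standard deviations shown by describe (a small sketch using the same R data frame):

# The standard deviation is the square root of the variance
np.sqrt(R.var())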

Actually, we can also calculate mean and standard deviation as follows:

R.mean()
Ticker
MSFT    0.019683
NVDA    0.067837
dtype: float64
R.std()
Ticker
MSFT    0.066578
NVDA    0.150722
dtype: float64

For cc returns we have the r data frame, so:

r.describe()
Ticker       MSFT       NVDA
count   54.000000  54.000000
mean     0.017405   0.055363
std      0.065182   0.146783
min     -0.113590  -0.385895
25%     -0.032560  -0.026850
50%      0.018915   0.073416
75%      0.064046   0.163906
max      0.162366   0.310008

Comparing the means and standard deviations of the simple vs cc returns for each stock, the values are very similar. We can see that the cc returns have a mean that is a little bit smaller than the mean of the simple returns.

In sum, the continuously compounded returns are always more conservative than the simple returns:

  • For positive returns, cc returns are always smaller in magnitude (less positive)

  • For negative returns, cc returns are always greater in magnitude (more negative)

If the returns are daily or monthly, the simple and cc returns are very similar. If you calculate annual returns, you can start observing important differences.
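
A small numeric sketch of this relationship, using hypothetical simple returns (not taken from the data above):

# Compare hypothetical simple returns with their cc equivalents: r = ln(1 + R)
simple = pd.Series([0.50, 0.05, -0.05, -0.50])
cc = np.log(1 + simple)
print(pd.DataFrame({"simple": simple, "cc": cc}))
# For the +50% gain the cc return is only about +40.5% (ln(1.5)),
# while for the -50% loss the cc return is about -69.3% (ln(0.5));
# for the small +/-5% returns the two measures are almost identical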

8.2 Challenge 2 SOLUTION

  • Do a histogram for daily Bitcoin simple returns.

Hints: use the plot.hist function for pandas data frames.

  • INTERPRET the histogram with your own words and in CAPITAL LETTERS
BTC=yf.download(tickers="BTC-USD", start="2017-01-01",end="2025-02-17",interval="1d")

[*********************100%***********************]  1 of 1 completed
# I calculate simple returns from the daily close prices:
RBTC = (BTC["Close"] / BTC["Close"].shift(1)) - 1

hist=RBTC.plot.hist(bins=12,alpha=0.5,title="Histogram of daily Bitcoin Returns")
hist

INTERPRETATION OF THE HISTOGRAM:

WE CAN SEE THAT ON THE VAST MAJORITY OF DAYS, BITCOIN HAS OFFERED RETURNS BETWEEN -5% AND +5% IN ONE DAY.

LOOKING AT THE LEFT SIDE OF THE HISTOGRAM, WE CAN ALSO SEE THAT CLOSE TO 125 DAYS, BITCOIN HAS OFFERED RETURNS BETWEEN -10% AND -5% IN ONLY 1 DAY!

LOOKING AT THE RIGHT SIDE OF THE HISTOGRAM, WE CAN SEE THAT CLOSE TO 25 DAYS, BITCOIN HAS OFFERED RETURNS BETWEEN +10% AND +15% IN 1 DAY.

WE CAN ALSO SEE THAT AROUND 1750 DAYS BITCOIN HAS OFFERED POSITIVE DAILY RETURNS BETWEEN 0 AND +5%, AND ABOUT 800 DAYS BITCOIN HAS OFFERED NEGATIVE DAILY RETURNS BETWEEN -5% AND 0%.

WE CAN INCREASE THE # OF BINS OF THE HISTOGRAM TO BETTER APPRECIATE HOW CLOSE THIS DISTRIBUTION IS TO THE NORMAL DISTRIBUTION:

hist=RBTC.plot.hist(bins=24,alpha=0.5,title="Histogram of daily Bitcoin Returns")
hist

WITH 24 BINS, THE DISTRIBUTION OF DAILY RETURNS LOOKS CLOSE TO A NORMAL DISTRIBUTION. HOWEVER, WE CAN SEE THAT THE DISTRIBUTION IS NOT QUITE SYMMETRIC; IT SEEMS THAT THERE HAVE BEEN MORE DAYS WITH NEGATIVE RETURNS THAN DAYS WITH POSITIVE RETURNS!

WE CAN COUNT HOW MANY DAYS BITCOIN HAS OFFERED POSITIVE AND NEGATIVE RETURNS:

NUMBER OF DAYS WITH POSITIVE DAILY RETURNS:

RBTC[RBTC>=0].count()
1560

NUMBER OF DAYS WITH NEGATIVE RETURNS:

RBTC[RBTC<0].count() 
1408

CONTRARY TO OUR FIRST IMPRESSION FROM THE HISTOGRAM, WE CAN SEE THAT BETWEEN 2017 AND FEBRUARY 2025, BITCOIN HAS HAD MORE DAYS WITH POSITIVE RETURNS THAN DAYS WITH NEGATIVE RETURNS.
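
As a quick check on the histogram reading above, we can compute the share of days falling in each band directly (a small sketch that reuses the RBTC series; the band limits are the approximate ones discussed above):

# Drop the first row (NaN return) before computing shares
rbtc = RBTC.dropna()

# Share of days with daily returns between -5% and +5%
print("Share of days between -5% and +5%:", ((rbtc > -0.05) & (rbtc < 0.05)).mean())

# Share of days with non-negative daily returns
print("Share of days with non-negative returns:", (rbtc >= 0).mean())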

8.3 Challenge 3 SOLUTION

Do a one-sample t-test to check whether the mean monthly returns of MSFT are significantly greater than zero.

I DOWNLOAD MICROSOFT PRICES FROM 2020 TO THE MOST RECENT COMPLETE MONTH, AND CALCULATE HISTORICAL CC RETURNS:

# Download a dataset with prices for Microsoft
msft = yf.download("MSFT", start="2020-01-01", end="2025-01-31", interval='1mo', auto_adjust=True)

[*********************100%***********************]  1 of 1 completed

# I create another dataset with the Adjusted Closing price:
msftprices = msft['Close']

# I calculate cc returns of MSFT:
msftreturns = np.log(msftprices).diff(1)
# The diff function calculates the difference between the log price at t and the log price at t-1

# I drop the first row since the return cannot be calculated for the first month:
msftreturns = msftreturns.dropna()

a. I START DEFINING THE HYPOTHESES AND CALCULATING THE T-VALUE:

H0: mean(msftreturns) = 0
Ha: mean(msftreturns) > 0

msft_meanret = msftreturns.mean()
print("The mean return of Microsoft is ", msft_meanret)
The mean return of Microsoft is  0.015596039530455238

b. I CALCULATE THE STANDARD ERROR, WHICH IS THE STANDARD DEVIATION OF THE MEAN RETURN OF MSFT:

se_msftreturns = msftreturns.std() / np.sqrt(msftreturns.count())
print(f"Standard Error (Std. Deviation) of the AVERAGE of MSFT returns is: ",se_msftreturns)
Standard Error (Std. Deviation) of the AVERAGE of MSFT returns is:  0.008117413260483671

c. CALCULATE THE t-value:

THE t-value IS THE DISTANCE BETWEEN THE ACTUAL (REAL) HISTORICAL MEAN AND THE HYPOTHETICAL MEAN, WHICH IS ZERO, BUT THIS DISTANCE IS MEASURED IN # OF STANDARD ERRORS, SO I NEED TO DIVIDE THE DISTANCE BY THE STANDARD ERROR:

t = (msftreturns.mean() - 0) / se_msftreturns 
print(f"The t-value of the test is ",t)
The t-value of the test is  1.9213065825265074

INTERPRETATION OF THE t-value:

THE ACTUAL HISTORICAL MEAN RETURN OF MSFT IS 1.9213 STANDARD ERRORS AWAY FROM THE HYPOTHETICAL MEAN, WHICH IS ZERO. IN OTHER WORDS, THE DISTANCE BETWEEN THE ACTUAL MEAN RETURN OF 0.0156 AND 0.00 IS 1.9213 STANDARD ERRORS.

IF WE WANT TO USE A CONFIDENCE LEVEL OF 95%, THE CRITICAL VALUE OF t IS A NUMBER VERY CLOSE TO 2. WHAT IS THE CRITICAL t VALUE? IT IS THE MINIMUM VALUE OF t TO REJECT THE NULL HYPOTHESIS.

WE CAN CALCULATE THE EXACT CRITICAL VALUE OF t WITH A 95% CONFIDENCE IN PYTHON AS FOLLOWS:

from scipy import stats as st

# The degrees of freedom in this case are equal to N - 1:
df = len(msftreturns) - 1

# Two-tailed t-critical value for 95% confidence
alpha = 0.05
t_critical = st.t.ppf(1 - alpha / 2, df)

print(f"T-critical value at 95% confidence: {t_critical:.4f}")
T-critical value at 95% confidence: 2.0010

Since the absolute value of the t-value of the mean return of MICROSOFT is LESS than the critical value 2.0010, I cannot reject the null hypothesis at the 95% confidence level. Therefore, the MICROSOFT mean return IS NOT significantly greater than 0 at the 95% confidence level.

NOTE THAT THE t-value WAS VERY CLOSE TO 2. IN THIS CASE, I APPLIED THE RULE OF THUMB OF NOT REJECTING THE NULL WHEN THE t-VALUE IS LESS THAN THE t-CRITICAL VALUE (ABOUT 2). WHEN YOU HAVE A t-VALUE THIS CLOSE TO 2, IT IS RECOMMENDED TO CALCULATE THE p-value TO FINALLY DECIDE WHETHER TO REJECT THE NULL HYPOTHESIS OR NOT.
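
A small sketch of that recommendation: we can compute the p-value by hand from the t-value and the degrees of freedom obtained above, using scipy's t distribution (st.t.sf is the survival function, 1 - CDF):

# Two-tailed p-value: probability of a |t| at least this large if the true mean were zero
p_two_tailed = 2 * st.t.sf(abs(t), df)
# One-tailed p-value (for Ha: mean > 0), which is half of the two-tailed p-value
p_one_tailed = st.t.sf(t, df)
print("Two-tailed p-value:", p_two_tailed)
print("One-tailed p-value:", p_one_tailed)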

I CAN ALSO CALCULATE THE t-CRITICAL VALUE FOR A 90% CONFIDENCE LEVEL TO SEE WHETHER I CAN REJECT THE NULL HYPOTHESIS AT THAT CONFIDENCE LEVEL:

# The degrees of freedom in this case are equal to N - 1:
df = len(msftreturns) - 1

# Two-tailed t-critical value for 90% confidence

# I define alpha = 10% since I want to get the 90% confidence level.
# alpha = (1 - ConfidenceLevel)
alpha = 0.10 
t_critical = st.t.ppf(1 - alpha / 2, df)

print(f"T-critical value at 90% confidence: {t_critical:.4f}")
T-critical value at 90% confidence: 1.6711

THE t-CRITICAL VALUE FOR A 90% CONFIDENCE LEVEL IS LESS THAN 2!

THEN, AT THE 90% CONFIDENCE LEVEL, SINCE THE t-VALUE I GOT (1.9213) IS GREATER THAN THE t-CRITICAL VALUE (1.6711), I CAN REJECT THE NULL HYPOTHESIS!! IN OTHER WORDS, I HAVE STATISTICAL EVIDENCE AT THE 90% CONFIDENCE LEVEL TO SAY THAT THE MEAN RETURN OF MSFT IS GREATER THAN ZERO!

  1. Run the t-test using a Python function.
from scipy import stats as st
# Two-sided t-test
ttest = st.ttest_1samp(msftreturns,popmean=0)
print(ttest)
TtestResult(statistic=1.9213065825265074, pvalue=0.05952820847809123, df=59)

WITH THE ttest_1samp FUNCTION I GOT THE SAME VALUE FOR t.

CONCLUSION OF THIS TEST: SINCE THE p-value IS GREATER THAN 0.05, I DO NOT HAVE ENOUGH STATISTICAL EVIDENCE AT THE 95% CONFIDENCE LEVEL TO REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, I DO NOT HAVE ENOUGH STATISTICAL EVIDENCE TO SAY THAT THE AVERAGE MONTHLY RETURN OF MSFT FROM JAN 2020 TO JAN 2025 IS SIGNIFICANTLY GREATER THAN ZERO AT THE 95% CONFIDENCE LEVEL.

THE p-value IS THE PROBABILITY OF MAKING A MISTAKE IF I REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, FOR THIS TEST, THERE IS A 5.95% PROBABILITY THAT I WILL BE WRONG IF I REJECT THE NULL AND CONCLUDE THAT THE MEAN MONTHLY RETURN OF MSFT IS DIFFERENT FROM ZERO.

THEN, WITH THE p-VALUE I GOT, I CAN USE MY CRITICAL THINKING TO FINALLY DECIDE WHETHER TO REJECT THE NULL OR NOT. IN THIS CASE, SINCE THE p-VALUE IS ONLY ABOUT 6%, I WOULD REJECT THE NULL HYPOTHESIS, BECAUSE THE IMPLIED CONFIDENCE LEVEL IS ABOUT 94.05% (1 - pvalue)!!

NOTE ABOUT THE 2-tailed pvalue AND THE 1-tailed pvalue:

IN THE PREVIOUS RESULT, I USED A 2-tailed pvalue. WHAT DOES THAT MEAN?

BY DEFAULT, ANY STATISTICAL SOFTWARE CALCULATES THE 2-tailed pvalue TO BE CONSERVATIVE! HOWEVER, YOU CAN USE THE 1-tailed pvalue, WHICH IS HALF OF THE 2-tailed pvalue, IF YOU HAVE EXTERNAL SUPPORT (FROM THEORY OR EXPERIENCE) TO SAY THAT THE MEAN OF MSFT RETURNS IS POSITIVE!

HOW CAN YOU CALCULATE THE 1-tailed pvalue? JUST ADD THE alternative PARAMETER TO THE PYTHON ttest_1samp FUNCTION:

# One-sided t-test
ttest = st.ttest_1samp(msftreturns,popmean=0, alternative='greater')
print(ttest)
TtestResult(statistic=1.9213065825265074, pvalue=0.029764104239045616, df=59)

THEN, USING A 1-SIDED t-test, I CAN REJECT THE NULL HYPOTHESIS AT THE 95% CONFIDENCE LEVEL! IN THIS CASE, I CAN MAKE AN ARGUMENT BASED ON MSFT'S SOLID FINANCIAL STATEMENT RESULTS OVER RECENT YEARS TO SAY THAT MSFT MEAN RETURNS ARE GREATER THAN ZERO.
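
As a quick consistency check (a sketch reusing the same returns), we can verify that the one-sided p-value reported above is exactly half of the two-sided p-value:

# Run both versions of the test and compare their p-values
ttest_two_sided = st.ttest_1samp(msftreturns, popmean=0)
ttest_one_sided = st.ttest_1samp(msftreturns, popmean=0, alternative='greater')
print("Half of the two-sided p-value:", ttest_two_sided.pvalue / 2)
print("One-sided p-value:", ttest_one_sided.pvalue)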