Workshop 1, Econometric Models

Author

Alberto Dorantes

Published

February 10, 2024

Abstract
This is an INDIVIDUAL workshop. In this workshop we learn about 1) calculation of returns, 2) descriptive statistics, 3) the Central Limit Theorem, and 4) hypothesis testing. In addition, we learn the basics of the Python language, more specifically data management.

0.1 General Directions for each workshop

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab document for each workshop.

You must share each Colab document (workshop) with the following account:

  • cdorante@tec.mx

You must give Edit privileges to this account.

In Google Colab you can work with Python or R notebooks. The default is Python notebooks.

Your Notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W1-Econometrics-YourFirstName-YourLastname”.

Pay attention in class to learn how to write text and Python code in your Notebook.

In your Workshop Notebook you have to:

  • Replicate all the Python code along with its output.

  • You have to do whatever is asked in the workshop. It can be responses to specific questions and/or an exercise/challenge.

For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for your further study.

Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks in one shot with Ctrl-F9. You have to submit to Canvas the web link of your Google Colab workshop.

1 Introduction to data management in Finance

In Python we use data frames to do data analysis such as descriptive statistics and econometrics. A data frame is like a worksheet where each row is an observation and each column represents a variable, feature, or characteristic of the observation.

There are several Python packages used for data collection and data management in Finance. We will use the yfinance package to download real market data from the Yahoo Finance web site.

We also use the numpy package to do mathematical calculations such as logarithms.

Usually you need to install a Python package and then import it. However, since we use Google Colab, hundreds of popular Python packages are already installed.

If you find a package that is not installed, you need to install it with the command !pip install.

1.1 Importing Python packages

The packages we will use in this workshop are already installed in Google Colab, so we only need to import them:

import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

1.2 Downloading real financial prices

# Download stock prices
data = yf.download('NVDA, MSFT', start='2020-01-01', end='2024-07-31', interval='1mo', auto_adjust=True)
# The auto_adjust=True means that it will bring adjusted close prices after considering dividends and stock splits.
[*********************100%***********************]  2 of 2 completed

We got monthly stock prices for Nvidia and Microsoft from Jan 2020 to Jul 2024 into the data data frame.

We can see the first 5 rows of data:

data.head(5)
Price Close High Low Open Volume
Ticker MSFT NVDA MSFT NVDA MSFT NVDA MSFT NVDA MSFT NVDA
Date
2020-01-01 00:00:00+00:00 162.822525 5.886074 166.476307 6.460417 149.699543 5.757613 151.870769 5.943832 558530000 6125412000
2020-02-01 00:00:00+00:00 154.960190 6.723562 182.401757 7.874984 145.385777 5.861675 163.013795 5.867899 887522300 11848652000
2020-03-01 00:00:00+00:00 151.259338 6.566414 167.842135 7.096759 127.099660 4.500833 158.548474 6.897724 1612695500 15773952000
2020-04-01 00:00:00+00:00 171.879944 7.280848 173.021258 7.577781 144.209963 5.938420 146.741981 6.368375 984705000 11278304000
2020-05-01 00:00:00+00:00 175.754715 8.843732 179.840467 9.148886 166.691241 6.995870 168.609438 7.083306 688845000 12548876000

Yahoo Finance keeps track of open, close, high, low prices and volume for each period. Since we downloaded the data with auto_adjust=True, these prices are already adjusted. We keep only the adjusted close prices, which are the prices we must use to calculate financial returns.

We can create another data frame with only the adjusted closing price columns for both stocks:

adjprices = data['Close']  

We see the first 5 monthly prices of both stocks:

adjprices.head(5)
Ticker MSFT NVDA
Date
2020-01-01 00:00:00+00:00 162.822525 5.886074
2020-02-01 00:00:00+00:00 154.960190 6.723562
2020-03-01 00:00:00+00:00 151.259338 6.566414
2020-04-01 00:00:00+00:00 171.879944 7.280848
2020-05-01 00:00:00+00:00 175.754715 8.843732

We can see the most current 5 monthly prices for both stocks:

adjprices.tail(5)
Ticker MSFT NVDA
Date
2024-03-01 00:00:00+00:00 418.369476 90.330383
2024-04-01 00:00:00+00:00 387.154846 86.381561
2024-05-01 00:00:00+00:00 412.810730 109.607071
2024-06-01 00:00:00+00:00 445.254639 123.510773
2024-07-01 00:00:00+00:00 416.763092 117.001923

We can visualize the stock price of Microsoft:

plt.plot(adjprices['MSFT'])
plt.title('Microsoft adjusted prices')
plt.show()

We can plot Microsoft monthly returns using the pct_change() function of Pandas:

plt.clf()
plt.plot(adjprices['MSFT'].pct_change().dropna())
plt.title('Microsoft monthly returns')
plt.show()

2 Return calculation

2.1 Simple and continuously compounded (cc) return

A financial simple return for a stock in period t (R_{t}) is usually calculated as the closing stock price in t plus any dividend paid in t, divided by the previous closing price, minus one. In other words, it is the percentage change from the previous period (t-1) to the present period (t):

R_{t}=\frac{\left(price_{t}-price_{t-1}+ dividend_t\right)}{price_{t-1} }=\frac{price_{t}+dividend_t}{price_{t-1}}-1

When a stock pays dividends or does a stock split, the financial exchange makes an adjustment to the historical stock prices. This adjustment to the stock prices is made so that we do not need to use dividends nor splits to calculate simple stock returns. Then, it is always recommended to use adjusted prices to calculate stock returns, unless you have information about all dividends paid in the past.

Then, with adjusted prices the formula for simple returns is easier:

R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1} }=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1

For example, if the adjusted price of a stock at the end of January 2021 was $100.00, and its previous (December 2020) adjusted price was $80.00, then the monthly simple return of the stock in January 2021 will be:

R_{Jan2021}=\frac{Adjprice_{Jan2021}}{Adjprice_{Dec2020}}-1=\frac{100}{80}-1=0.25

We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.

Although the arithmetic mean of simple returns R gives us an idea of average past return, in the case of multi-period average return, this method of calculation can be misleading. Let’s see why this is the case.

Imagine you have only 2 periods and you want to calculate the average return of an investment per period:

Returns over time
Period Investment value (at the end of the period) Simple period Return (R)
0 $100 NA
1 $50 -0.50
2 $75 +0.50

Calculating the average simple return of this investment:

\bar{R}=\frac{-0.5+0.5}{2}=0%

Then, the simple average return gives me 0%, while I end up with $75, losing 25% of my initial investment ($100) over the 2 periods. If I lost 25% of my initial investment over 2 periods, then the average return per period should be somewhere between 0% and -25%. The accurate mean return of an investment over time (multiple periods) is the “Geometric Mean” return.

The total return of the investment in the whole period -also called the holding-period return (HPR)- can be calculated as:

HPR=\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)-1

Using the example, the HPR for this investment is:

HPR=\left(1-0.50\right)\left(1+0.50\right)-1=0.75 - 1 = -0.25

And the formula for the geometric average of returns will be:

\bar{R_{g}}=\sqrt[N]{\left(1+R_{1}\right)\left(1+R_{2}\right)...\left(1+R_{N}\right)}-1

Calculating the geometric average for this investment:

\bar{R_{g}}=\sqrt[2]{\left(1-0.5\right)\left(1+0.5\right)}-1= -0.13397

Then, the right average return per period is about -13.4%, and the HPR for the 2 periods is -25%.

However, if we use continuously compounded returns (r) instead of simple returns (R), then the arithmetic mean of r is an accurate measure that can be converted back to simple returns to get the geometric mean, which is the accurate mean return. Let’s do the same example using continuously compounded returns:

Continuously compounded returns
Period Investment value (at the end) Continuously compounded return (r)
0 $100 NA
1 $50 =log(50)-log(100)=-0.6931
2 $75 =log(75)-log(50)=+0.4054
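
To verify this numerically, here is a minimal Python sketch (using the investment values from the example above; the variable names are just illustrative) showing that the arithmetic mean of cc returns, converted back with the exponential function, gives the geometric mean of simple returns:

import numpy as np

# Investment values at the end of each period, from the example above
values = np.array([100.0, 50.0, 75.0])

# Simple returns and their geometric mean
R_ex = values[1:] / values[:-1] - 1                    # [-0.5, +0.5]
geo_mean = np.prod(1 + R_ex) ** (1 / len(R_ex)) - 1    # about -0.1340

# Continuously compounded returns and their arithmetic mean
r_ex = np.diff(np.log(values))                         # [-0.6931, +0.4054]
mean_r = r_ex.mean()                                   # about -0.1438

# Converting the mean cc return back to a simple return gives the geometric mean
print(geo_mean, np.exp(mean_r) - 1)                    # both are about -0.1340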

In Finance it is strongly recommended to calculate continuously compounded returns (cc returns) and to use cc returns instead of simple returns for data analysis, statistics, and econometric models.

One way to calculate cc returns is as the natural log of the current adjusted price (at t) minus the natural log of the previous adjusted price (at t-1):

r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})

This is also called the first difference of the log price.

We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):

r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)

cc returns are usually represented by small r, while simple returns are represented by capital R.

But why do we use the natural logarithm to calculate cc returns? First we need to remember what a natural logarithm is.

2.2 Reviewing the concept of natural logarithm

We generate a new data frame with the natural logarithm (log) of the MSFT and NVDA prices:

lnprices = np.log(adjprices)

We can see the first rows of this new data frame:

lnprices.head(5)
Ticker MSFT NVDA
Date
2020-01-01 00:00:00+00:00 5.092661 1.772589
2020-02-01 00:00:00+00:00 5.043168 1.905618
2020-03-01 00:00:00+00:00 5.018996 1.881968
2020-04-01 00:00:00+00:00 5.146796 1.985247
2020-05-01 00:00:00+00:00 5.169089 2.179709

As you see, when we apply a mathematical function (in this case, the log function) to a data frame, the function is applied to all rows of all columns of the data frame.

Let’s see the plot of both, the prices and the log prices of NVDA:

plt.clf()
plt.plot(adjprices['NVDA'])
plt.title('Stock price of NVDA')
plt.show()

plt.clf()
plt.plot(lnprices['NVDA'])
plt.title('Log price of NVDA')
plt.show()

As you see, the log prices have a much smaller scale (from about 1.7 to 4.7). Remember that the log price is actually the exponent to which we raise the number e to get the stock price.

With this plot, we can better appreciate the % growth over time per period. For example, from Jan 2020 to July 2020 the log price increased from about 1.7 to 2.5. The difference between 2.5 and 1.7, which is 0.8, is the approximate % increase in price: in this case, about 80% from Jan 2020 to July 2020.

What is a natural logarithm?

The natural logarithm of a number is the exponent to which the number e (=2.71…) needs to be raised to get that number. For example, let’s name x = the natural logarithm of a stock price p. Then:

e^x = p

The way to get the value of x that satisfies this equality is to take the natural log of p:

x = log_e(p)

Then, we have to remember that the natural logarithm is the exponent to which we need to raise the number e to get a specific number.

The natural log is the logarithm of base e (=2.71…). The number e is an irrational number (it cannot be expressed as a ratio of 2 integers), and it is also called the Euler constant. Leonhard Euler (1707-1783) took the idea of the logarithm from the great mathematician Jacob Bernoulli, and discovered very astonishing features of the number e. Euler is considered the most productive mathematician of all time. Some historians believe that Jacob Bernoulli discovered the number e around 1690 when he was playing with calculations to know how an amount of money grows over time with an interest rate.

How is e related to the growth of financial amounts over time?

Here is a simple example:

If I invest $100.00 with an annual interest rate of 50%, then the end balance of my investment at the end of the first year (at the beginning of year 2) will be:

I_2=100*(1+0.50)^1

If the interest rate is 100%, then I would get:

I_2=100*(1+1)^1=200

Then, the general formula to get the final amount of my investment at the beginning of year 2, for any interest rate R can be:

I_2=I_1*(1+R)^1

The (1+R) term is the growth factor of my investment.

In Finance, the investment amount is called principal. If the interests are calculated (compounded) each month instead of each year, then I would end up with a higher amount at the end of the year.

Monthly compounding means that a monthly interest rate is applied to the amount to get the interest of the month, and then the interest of the month is added to the investment (principal). Then, for month 2 the principal will be higher than the initial investment. At the end of month 2 the interest will be calculated using the updated principal amount. Putting it in simple math terms, the final balance of an investment at the beginning of year 2 when doing monthly compounding will be:

I_2=I_1*\left(1+\frac{R}{N}\right)^{1*N}

For monthly compounding, N=12, so the monthly interest rate is equal to the annual interest rate R divided by N (R/N). Then, with an annual rate of 100% and monthly compounding (N=12):

I_2=100*\left(1+\frac{1}{12}\right)^{1*12}=100*(2.613..)

In this case, the growth factor is (1+1/12)^{12}, which is equal to 2.613.

Instead of compounding each month, if the compounding is every moment, then we are doing a continuously compounded rate.

If we do continuous compounding for the previous example, then the growth factor for one year becomes the astonishing Euler constant e:

Let’s do an example with compounding every second (1 year has 31,536,000 seconds). The investment at the end of year 1 (or at the beginning of year 2) will be:

I_2=100*\left(1+\frac{1}{31536000}\right)^{1*31536000}=100*(2.718282..)\cong100*e^1

Now we see that e^1 is the GROWTH FACTOR after 1 year if we do the compounding of the interests every moment!

We can generalize to any other annual interest rate R, so that e^R is the growth factor for an annual nominal rate R when the interests are compounded every moment.

When compounding every instant, we use small r instead of R for the interest rate. Then, the growth factor will be: e^r

Then we can establish a relationship between this growth factor and an equivalent effective rate:

\left(1+EffectiveRate\right)=e^{r}

If we apply the natural logarithm to both sides of the equation:

ln\left(1+EffectiveRate\right)=ln\left(e^r\right)

Since the natural logarithm function is the inverse of the exponential function, then:

ln\left(1+EffectiveRate\right)=r

In the previous example with a nominal rate of 100%, when doing continuous compounding, the effective rate will be:

\left(1+EffectiveRate\right)=e^{r}=2.7182

EffectiveRate=e^{r}-1

Doing the calculation of the effective rate for this example:

EffectiveRate=e^{1}-1 = 2.7182.. - 1 = 1.7182 = 171.82\%

Then, when compounding every moment, starting with a nominal rate of 100% annual interest rate, the actual effective annual rate would be 171.82%!
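
As a quick numerical check of the compounding argument above (a minimal sketch; the variable names are just illustrative), we can see how the discrete growth factor approaches e as the compounding frequency increases, and that the effective rate for a 100% nominal rate with continuous compounding is e^1 - 1:

import numpy as np

R_nom = 1.0  # nominal annual rate of 100%

# The growth factor with more and more frequent compounding approaches e^R
for N in [1, 12, 365, 31536000]:   # yearly, monthly, daily, every second
    print(N, (1 + R_nom / N) ** N)

print(np.exp(R_nom))       # continuous-compounding growth factor: e^1 = 2.71828...
print(np.exp(R_nom) - 1)   # effective annual rate: about 1.7183, i.e., 171.83%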

2.3 Return calculation

We have historical monthly adjusted prices for each stock. We can easily calculate simple returns for both stocks:

R = adjprices / adjprices.shift(1) - 1
# I delete the NA values located at the first period:
R = R.dropna()

The shift(1) function gets the previous price value (1 period ago), and it does so for all periods.

We can also calculate continuously compounded returns (r) as follows:

r = np.log(adjprices) - np.log(adjprices.shift(1))
# I delete the NA values located at the first period:
r = r.dropna()
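
As a quick check (a minimal sketch that assumes the R and r data frames created above are still in memory), cc returns are simply the natural log of one plus the simple returns:

# cc returns are the natural log of (1 + simple returns)
r_check = np.log(1 + R)
print(np.allclose(r_check, r))    # expected to print True

# For small returns, r and R are very close; the gap grows with the size of the return
print((r - R).abs().max())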

3 Descriptive statistics for Finance

Descriptive statistics is a set of summaries of raw data related to one or several variables of a phenomenon. Descriptive statistics usually gives us a first general idea of a phenomenon by looking at summaries such as averages and variability of variables that represent different aspects of a phenomenon.

In Finance we are interested in knowing the average past return of an investment or how much investment returns usually move up and down over time. We can use descriptive statistics to a) learn about past return and risk of investments, and b) make inferences about average return and risk for estimating expected values for the future.

In Economics, for example, we might be interested in knowing what has been the economic development of a country over the past 10 years. Then, we can calculate an annual average percentage growth of the gross domestic product. In Finance we might be interested in knowing the annual average return of an investment for the last 5 years and also the variability of annual returns over time.

Then, the most important measures of descriptive statistics are:

  • Measures of central tendency, and

  • Measures of dispersion

3.1 Central tendency measures

The main central tendency measures are:

  • Arithmetic mean

  • Median

  • Mode

3.1.1 Arithmetic mean

An arithmetic mean of a variable X is a simple measure that tells us the average value of all valid values of X, assuming that each value has the same importance (or weight). The variable X can be representing any attribute of a subject. A subject can be an individual, a group, a team, a business unit, a company, a financial portfolio, an industry, a region, a country, etc.

An example of a variable X can be the monthly sales amount of a company for the last 3 years. In this case, the variable X will have 36 observations (36 monthly sales). The subject here is a company and the variable or attribute is the company sales over time. Another example can be a variable that represents the daily returns of a financial portfolio over the last 2 years. In this case, the variable might have about 500 observations considering 250 business days each year. The subject in this example is a financial portfolio, that might be composed of more than one stock and/or bond.

To calculate the arithmetic mean of a variable X we simply sum all the non-missing values of the variable and then divide them by the number of non-missing values. Then, the calculation is as follows:

\bar{X}=\frac{\sum_{i=1}^{N}X_{i}}{N}

Where N is the number of non-missing values (observations) of X. A missing value of a variable happens when the variable X for a specific observation has no value. It is important to note that a missing value is not a zero value. When we work with real-world datasets, it is very common to find missing values in many variables.

One of the disadvantages of the arithmetic mean is that it is very sensitive to extreme values. If a variable has a few extreme values, the arithmetic mean might not be a good representation of an average or mid point. In the presence of a few very extreme values in a variable, the best measure of central tendency is the median, not the arithmetic mean.
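
As a minimal illustration (with a small made-up array, not the stock data), note that numpy ignores missing values only when we explicitly ask it to:

import numpy as np

# A small made-up variable with one missing value (np.nan)
x = np.array([2.0, 4.0, np.nan, 6.0])

print(np.mean(x))      # nan: the plain mean is contaminated by the missing value
print(np.nanmean(x))   # 4.0: the arithmetic mean of the 3 non-missing values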

3.1.2 The Median

Another measure of central tendency is the median. The median of a variable is its 50th percentile, which is the mid point of its values when the values are sorted in ascending order. When we have an even number of observations, there will be 2 mid points, so the median will be equal to the arithmetic mean of these 2 mid points. When we have an odd number of observations there will be only 1 value in the middle, which is the median.

For example, if we want to know what is the typical size of all companies that trade shares in the Mexican stock market we can calculate the median of firm size. These firms are called public firms. Firm size can be measured with different variables. We can use the total value of its assets (total assets), the market value, or the number of employees. In this example we will use total assets at the end of 2018 for all public Mexican firms. At the end of 2018 there were 146 Mexican public firms in the market exchange (“Bolsa Mexicana de Valores”). I will show how to calculate the median total assets of these 146 firms.

The 2018 total assets of the 146 Mexican public firms are shown below (sorted alphabetically):

Mexican firms in the BMV
Firm row # Industry 2018 Total Assets (in thousand pesos)
ACCEL 1 Services $6,454,560.00
AEROMEXICO 2 Transport Services $76,772,848.00
VOLARIS 148 Transport Services $22,310,652.00
WALMART 149 Retail $306,528,832.00

We sort the list from the lowest to the highest value of 2018 total assets:

Firm row # Industry Size Rank 2018 Total Assets (in thousand pesos)
INGEAL 98 Food & Beverages 1 $171,104.00
HIMEXSA 88 Textile 2 $494,378.00
FHIPO14 45 Real Estate 73 $27,979,184.00
TVAZTECA 139 Telecommunications 74 $27,988,054.00
AMERICA MOVIL 8 Telecommunications 145 $1,429,223,392.00
GFBANORTE 69 Financial Services 146 $1,620,470,400.00

The median total assets is the mid point of the list. However, in this case, I have 146 firms, so it is not possible to find an exact mid point. Then, I need to calculate the arithmetic average assets of the 2 firms that are in the middle (firms in positions 73 and 74). Then the median will be equal to $27,983,619.00 thousand pesos (about 27 billion pesos), which is the average value between FHIPO14 and TVAZTECA assets. The arithmetic mean for total assets considering the 146 firms is $97,860,896.23 thousand pesos (about 97.8 billion pesos), which is much bigger than the median. Then, which measure better represents the typical size of Mexican firms? In this case, the best measure is the median, so we can say that the typical size of a Mexican public firm is about $27.9 thousand million pesos.

Then, what is the difference between the mean and the median? When the distribution of the values of a variable is very close to a normal distribution, the mean and the median will be very similar, so we can use the mean or median to represent the typical value of the variable. When the variable has few very extreme values, then the distribution of values will not be similar to a normal distribution; it will have fat tails due to the presence of extreme values. In this case the best measure of central tendency is the median, not the mean.

What is a normal distribution? It is a very common probability distribution of random variables. We will further explain probability distributions later. For now, just consider that many variables of all disciplines and nature follow a close-to-normal distribution.

The median gives a better representation of the “average” value of a variable compared with the arithmetic mean when the distribution of the values does NOT follow a normal distribution. In the case of 2018 total assets we can explore its distribution using a histogram:

I will later explain in more detail what a histogram is.

In a histogram we see how often different ranges of values of a variable appear. This histogram does not look like that of a normally distributed variable. It is said to be “skewed” to the right since there are very few firms with very high values of total assets. Normally distributed variables look like a bell-shaped curve where most of the values are around the arithmetic mean. In this case, we can see that most of the firms (about 100 firms) have total assets between 0 and $25 thousand million pesos. Since there are 146 firms in total, only about 46 firms have assets higher than $25 thousand million (or 25 billion pesos). Actually, there are very few firms with assets greater than $1,000 thousand million (greater than $1 trillion pesos), and one above $1,500 thousand million. Looking at the previous table we can see that AMERICA MOVIL and GFBANORTE have assets greater than $1,400 thousand million pesos (over 1.4 trillion pesos).

With the histogram we can see that most of the firms (about 67%, 100 out of 146) have assets less than 25 billion pesos. The arithmetic mean of total assets is more than $97 billion, and the median total assets (the 50th percentile) is about $27 billion. The arithmetic mean is very sensitive to extreme values, while the median is not. If we use the mean as a measure of the typical size of a Mexican firm we would be very far from the most common values of total assets. Then, the best measure of typical size will be the median, which is about $27 billion pesos.

In sum, for skewed distributions the median will always be the best measure of central tendency, while the arithmetic mean will be a biased measure that does not represent the central or typical value. Actually, in the case of normally distributed variables, the median will be very close to the mean, so the median is always a good measure of central tendency.

Examples of business variables with a skewed distribution similar to total assets are employee salaries, income of families in a region or country, and any variable from the income statement such as firm sales and firm profits.
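
Here is a minimal sketch with made-up, right-skewed numbers (hypothetical values, not the BMV data) showing how a few extreme values pull the mean far above the median:

import pandas as pd

# Made-up, right-skewed "total assets" (in billions of pesos); two extreme firms
assets = pd.Series([2, 5, 8, 10, 12, 15, 20, 25, 300, 1500])

print(assets.mean())     # 189.7: pulled up by the two extreme values
print(assets.median())   # 13.5: a better "typical" value for a skewed variable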

3.1.3 Mode

The mode is the value that appears most often in a variable. The mode can be calculated only for discrete variables, not for continuous variables. The mode is rarely used as a central tendency measure.
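
A minimal example of the mode using pandas with a small made-up discrete variable:

import pandas as pd

# Number of analysts covering each of 7 made-up firms (a discrete variable)
n_analysts = pd.Series([3, 5, 5, 7, 5, 3, 8])

print(n_analysts.mode())   # 5 is the value that appears most often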

3.2 Dispersion measures

3.2.1 Variance and standard deviation

Standard deviation is used to measure how much, on average, the individual values of a variable deviate from their mean.

The variance of a variable X is the average of squared deviations of each individual value X_i from its mean:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

Where:

X_i = Value i of the variable X

\overline{X}=\frac{1}{N}\sum_{i=1}^{N}X_{i} = Arithmetic average of X

Why is the variance the average of squared deviations? The reason is that if we do not square the deviations, they will cancel each other out since some deviations are positive and others negative. Then, squaring is just a trick to avoid canceling the positive with the negative deviations.

The result of the variance will be a number that our brain cannot easily interpret. To have a more reasonable measure of linear deviation, we just take the square root of the variance, and then we can interpret that number as the average deviation of all points from their mean. This measure is called the standard deviation:

SD(X)=\sqrt{Var(X)}= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}=\sigma_{X}

The variance can also be expressed as the expected value of squared deviations:

Var(X)=E[(X-\bar{X})^2]

Doing the multiplication of the squared term:

Var(X)=E[(X^2-X\bar{X}-\bar{X}X+\bar{X}^2)]

Since \bar{X} is a constant, I can take it out of the expectation:

Var(X)=E[X^2]-\bar{X}E[X]-\bar{X}E[X]+\bar{X}^2

Since E[X]=\bar{X}, then:

Var(X)=E[X^2]-\bar{X}^2

Then, the variance can be defined as the expected value of X squared minus its squared mean.

Also, we can express the variance of X as:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}\right)^2-\bar{X}^2

Most Statistics books and Statistics software use (N-1) instead of N as the denominator of the variance formula to get a more conservative value of the variance. This measure is called sample variance. When we divide by N in the variance formula, we are calculating the population variance. Both formulas provide very similar results, but the sample variance will be a bit bigger than the population variance, so it is a more conservative value.

In Statistics, the sample variance is an unbiased measure of the underlying (real) variance.

Then, we can re-write the formula for sample variance of X as:

Var(X)=\frac{1}{\left(N-1\right)}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}=\sigma_{X}^{2}

And the sample standard deviation of X can be written as:

SD(X)=\sqrt{Var(X)}=\sqrt{\frac{1}{(N-1)}\sum_{i=1}^{N}(X_{i}-\bar{X})^{2}}

SD(X)=\frac{\sqrt{\sum_{i=1}^{N}(X_{i}-\bar{X})^{2}}}{\sqrt{(N-1)}}=\sigma_{X}
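
As a minimal sketch (assuming the r data frame of cc returns created earlier is still in memory), we can compare the sample and population versions of the variance and standard deviation for the MSFT cc returns:

# Sample (N-1) versus population (N) variance and standard deviation of MSFT cc returns
msft_r = r['MSFT']

print(msft_r.var())             # pandas uses the sample variance (ddof=1) by default
print(msft_r.var(ddof=0))       # population variance (divides by N)
print(msft_r.std())             # sample standard deviation
print(np.std(msft_r, ddof=1))   # same value with numpy, setting ddof=1 explicitly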

4 The Histogram

The histogram was invented to illustrate how the values of a random variable are distributed in its whole range of values. The histogram is a frequency plot. The ranges of values of a variable that are more frequent will have a higher vertical bar compared with the ranges that are less frequent.

With the histogram of a random variable we can appreciate which are the most common values, the least common values, the possible mean and standard deviation of the variable.

In my opinion, the most important foundations/pillars of both Statistics and Probability theory are:

  • The invention of the Histogram

  • The discovery of the Central Limit Theorem

Although the histogram sounds like a very simple idea, it took many centuries to be developed, and it had a profound impact on the development of Probability theory and Statistics, both of which are pillars of all sciences.

I enjoy learning about the origins of the great ideas of humanity. The idea of the histogram was invented to decipher encrypted messages.

4.1 Interesting facts about the History of the Histogram

It is documented that the encryption of messages -cryptography- was commonly used since the beginning of civilizations. Unfortunately, it seems cryptography was invented by ancient Kingdoms mainly for war strategies. According to Herodotus, in the 500s BC, the Greeks used cryptography to win a war against the powerful Persian troops [@2000_SinghSimon_BOOK].

Cryptography refers to the methods of ciphering messages, while cryptoanalysis refers to the methods to decipher encrypted messages.

In the 800s AD, the Arabs deciphered encrypted messages thanks to their invention of the histogram. According to @2000_SinghSimon_BOOK and @1992_AlKaditIbrahimA, in 1987 several ancient Arabic manuscripts related to cryptography and cryptoanalysis (written between 800 AD and 1,500 AD) were discovered in Istanbul, Turkey (they were not translated into English until 2002). This is a fascinating story!

Below is an example of a frequency plot by the Arab philosopher Al-Kindi from around 850 AD compared with a recent frequency plot by Al-Kadi:

Figure taken from @1992_AlKaditIbrahimA

The encrypted messages at that time were written with the Caesar shift method. Then, to decipher an encrypted message, the Arabs used to count all characters to create a frequency plot, and then tried to match the encrypted characters with the Arabic characters. Finally, they replaced the matched Arabic characters in the original message to decipher it.

Interestingly, the idea of the frequency plot waited about 1,000 years to be used by French mathematicians to develop the foundations of the Statistics discipline. In the 1700s and early 1800s, the French mathematicians Abraham de Moivre and Pierre-Simon Laplace used this idea to develop the Central Limit Theorem (CLT).

I believe the CLT is one of the most important and fascinating mathematical discoveries of all time.

The English scientist Karl Pearson coined the term histogram in 1891 when he was developing statistical methods applied to Biology.

Why is the histogram so important in Statistics? I hope we will find out during this course!

5 The Central Limit Theorem

The Central Limit Theorem is one of the most important discoveries in the history of mathematics and statistics. Actually, thanks to this discovery, the field of modern Statistics was developed at the beginning of the 20th century.

The central limit theorem says that for any random variable with ANY probability distribution, if you take groups (samples) of at least 30 elements from the original distribution and compute the mean of each group, then the probability distribution of these means will have the following characteristics:

1) The distribution of the sample means will be close to a normal distribution when you take many groups (each group should have 30 or more elements). Actually, this happens not only with the sample mean, but also with other linear combinations such as the sum or a weighted average of the variable.

2) The standard deviation of the sample means will be much less than the standard deviation of the individuals. More specifically, the standard deviation of the sample mean will shrink by a factor of 1/\sqrt{N}.

Then, the central limit theorem says that, no matter the original probability distribution of any random variable, if we take groups of this variable, a) the means of these groups will have a probability distribution close to the normal distribution, and b) the standard deviation of the mean will shrink according to the number of elements of each group.

An interesting question is why the standard deviation shrinks by a factor of 1/\sqrt{N}. We can show this with basic probability theory and intuition. Let’s start with intuition.

When you take groups and then take the mean of each group, the extreme values that you could have in each group will cancel out when you take the average of the group. Then, it is expected that the variance of the mean of a group will be much less than the variance of the variable. But how much less?

Now let’s use simple math and probability theory to examine this relationship between these variances:

Let’s define a random variable X as the weight of students, with observations X_1, X_2, …, X_N. The sample mean will be:

\bar{X}=\frac{1}{N}\left(X_{1}+X_{2}+...+X_{N}\right)

We can estimate the variance of this mean as follows:

VAR\left(\bar{X}\right)=VAR\left(\frac{1}{N}\left(X_{1}+X_{2}+...+X_{N}\right)\right)

Applying basic probability rules I can express the variance as:

VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}VAR\left(X_{1}+X_{2}+...+X_{N}\right)

VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}\left[VAR\left(X_{1}\right)+VAR\left(X_{2}\right)+...+VAR\left(X_{N}\right)\right]

Since the variance of X_1 is the same as the variance of X_2 and it is also the same for any X_N, then:

VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)^{2}N\left[VAR\left(X\right)\right]

Then we can express the variance of the mean as:

VAR\left(\bar{X}\right)=\left(\frac{1}{N}\right)\left[VAR\left(X\right)\right]

We can say that the expected variance of the sample mean is equal to the variance of the individuals divided by N, the sample size.

Finally, we can get the standard deviation of the sample mean by taking the square root of this variance:

SD(\bar{X})=\sqrt{\frac{1}{N}}\left[SD(X)\right]

SD(\bar{X})=\frac{SD(X)}{\sqrt{N}}

Then, the expected standard deviation of the sample mean of a random variable is equal to the individual standard deviation of the variable divided by the square root of N.
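
We can also check this result with a quick simulation (a minimal sketch with made-up simulated data, not stock returns): we draw many groups from a clearly non-normal distribution and compare the standard deviation of the sample means with SD(X)/\sqrt{N}:

import numpy as np

rng = np.random.default_rng(123)

# A clearly non-normal variable: exponential "weights" with standard deviation 10
N = 36           # size of each group (sample)
groups = 10000   # number of groups
x = rng.exponential(scale=10, size=(groups, N))

sample_means = x.mean(axis=1)

print(x.std())                 # SD of the individuals, about 10
print(sample_means.std())      # SD of the sample means, about 10/sqrt(36) = 1.67
print(x.std() / np.sqrt(N))    # the 1/sqrt(N) shrinkage predicted by the CLT

A histogram of sample_means would also look close to a bell shape, even though the individual values come from a very skewed distribution.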

Thanks to the CLT we can make inferences (good guesses) about:

  1. The mean of any random variable
  2. The standard deviation of the means
  3. Since the sample means behave as a normal distribution, we can estimate how much the sample mean can vary from sample to sample! This variation is called the standard error (the standard deviation of the mean)

After the CLT, the concepts of hypothesis testing and linear regression were further developed. With the CLT we have a theory to make inferences about the population mean and standard deviation of any random variable using samples.

The method to make these types of inferences is called Hypothesis Testing.

6 HYPOTHESIS TESTING

The idea of hypothesis testing is to provide strong evidence - using facts - about a specific belief that is usually supported by a theory or common sense. This belief is usually the belief of the person conducting the hypothesis testing. This belief is called the Alternative Hypothesis.

The person who wants to show evidence about his/her belief is supposed to be very humble so the only way to be convincing is by using data and a rigorous statistical method.

Let’s imagine 2 individuals, Juanito and Diablito. Juanito wants to convince Diablito about a belief. Diablito is very, very skeptical and intolerant. Diablito is also an expert in Statistics! Then, Juanito needs a very strong statistical method to convince Diablito. Juanito also needs to be very humble so Diablito does not get angry.

Then, Juanito decides to start by assuming that his belief is NOT TRUE, so Diablito will be receptive and continue listening to Juanito. Juanito decides to collect real data about his belief and defines 2 hypotheses:

  • H0: The Null Hypothesis. This hypothesis is the opposite of Juanito's belief, and he starts by accepting that this hypothesis is TRUE. The value stated in H0 is also called the HYPOTHETICAL mean.

  • Ha: The Alternative Hypothesis. This is what Juanito believes, but he starts by accepting that it is NOT TRUE.

Diablito is an expert in Statistics, so he knows the Central Limit Theorem very well! However, Juanito is humble, but he also knows the CLT very well.

Then Juanito does the following to try to convince Diablito:

  • Juanito collects a random sample related to his belief. His belief is about the mean of a variable X; he believes that the mean of X is greater than zero.

  • He calculates the mean and standard deviation of the sample.

Since he collected a random sample, then he and Diablito know that, thanks to the CLT:

  1. The mean of this sample will behave VERY SIMILARLY to a normal distribution,

  2. The standard deviation of the sample mean is much less than the standard deviation of the individuals of the sample, and it can be calculated by dividing the individual standard deviation by the square root of the sample size.

  3. With a probability of about 95%, the sample mean will have a value between 2 standard deviations below its TRUE mean (in this case = 0, the mean under H0) and 2 standard deviations above its mean.

Then, if Juanito shows that the calculated sample mean of X is higher than zero (the hypothetical mean of H0) plus 2 standard deviations, then Juanito will have very powerful statistical evidence to show Diablito that the probability that the true mean of X is bigger than zero is above 95%!

Juanito is very smart (as is Diablito). Then, he calculates an easy measure to quickly know how far his sample mean is from the hypothetical mean (zero), measured in # of standard deviations of the mean. This standardized distance is usually called z or t. If t is 2 or more, he can say that the sample mean is 2 standard deviations above the supposed true mean (zero), so he can convince Diablito about his belief that the actual TRUE mean is greater than zero.

Although the CLT says that the mean of a sample behaves as a normal distribution, when the standard deviation is estimated from the sample the standardized sample mean actually follows the t-Student probability distribution. The t-Student distribution is very, very similar (almost the same) to the normal distribution when the sample size is bigger than 30, but for small samples the t-Student does a much better job of representing the behavior of the sample mean than the z normal distribution.

Then, for hypothesis testing we will use the t-Student distribution instead of the z normal distribution.

The hypothesis test used to compare the mean of one variable with a hypothetical value is called the one-sample t-test.

6.1 One-sample t-test

We start with the simple case of hypothesis testing: the One-Sample t-test.

We will learn about this with an example.

We download historical monthly data for the S&P500 market index and calculate cc returns:

import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Download stock data
data = yf.download('^GSPC', start='2020-01-01', end='2025-01-31', interval='1mo')
adjprices = data['Close']  # Get adjusted close prices

# Calculate continuously compounded returns

ccr = np.log(adjprices) - np.log(adjprices.shift(1))
ccr = ccr.dropna()

Here is an example of a t-test to check whether the S&P 500 has an average monthly return significantly greater than zero:

# H0: mean(ccr) = 0
# Ha: mean(ccr) > 0

# Standard error
se_GSPC = np.std(ccr,ddof=1) / np.sqrt(len(ccr))
print(f"Standard error S&P 500 = {se_GSPC}")

# t-value
mean_ccr = np.mean(ccr)
t_GSPC = (mean_ccr - 0) / se_GSPC
print(f"t-value S&P 500 = {t_GSPC}")
Standard error S&P 500 = 0.006789653490941936
t-value S&P 500 = 1.540082367227796

Since the t-value of the mean return of S&P 500 is lower than 2, I can’t reject the null hypothesis at the 95% confidence level. Therefore, at the 95% confidence level, S&P 500 mean return is not statistically greater than 0.

We can calculate the p-value of the test.

What is the p-value of a test?

The p-value of a test is the probability that I will be wrong if I reject the NULL hypothesis. In other words, (1-pvalue) will be the probability that MY HYPOTHESIS (the alternative hypothesis) is true!

(1-pvalue), expressed as a percentage, is the confidence level at which I can reject the null hypothesis.

Fortunately, there is a Python function that does the same as we did, but faster, and it also gives the p-value of the test:

from scipy import stats as st

# One-sided t-test
ttest_GSPC = st.ttest_1samp(ccr, 0, alternative='greater')

# Showing the t-Statistics and the p-value:
ttest_GSPC
TtestResult(statistic=1.540082367227796, pvalue=0.0644434328348415, df=59)

I got the same result with the ttest_1samp function.

But what does this mean? Does this mean that investing in the S&P is not going to give you positive returns over time? No, this is not quite a good conclusion.

We got a p-value = 6.4%. This means that we can accept our hypothesis that the monthly mean return of the S&P500 is >0 at the 93.6% confidence level (=1-pvalue)!

When we get a p-value<0.05 we can say that we have strong evidence (at least at the 95% confidence level) to reject the null hypothesis (accept my hypothesis). In this case we say that our result is statistically significant.

When we get a p-value between 5% and 10% (0.05 and 0.10) we can say that we have evidence at the 90% confidence level to reject the null hypothesis. We can also say that our results are marginally significant.
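
The same p-value can be recovered by hand from the t-statistic (a minimal sketch assuming the ccr series, the t_GSPC value, and the st alias for scipy.stats defined above are still in memory):

# Recovering the 1-tailed p-value from the t-statistic using the t-Student distribution
df = len(ccr) - 1                      # degrees of freedom (N - 1)
pvalue_1tailed = st.t.sf(t_GSPC, df)   # P(T > t) under the null hypothesis

print(pvalue_1tailed)                  # about 0.0644, the same p-value as ttest_1samp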

6.2 Hypothesis testing - two-sample t-test

Now we will do a hypothesis testing to compare the means of two groups. This test is usually named two-sample t-test.

In the case of the two-sample t-test we try to check whether the mean of a group is greater than the mean of another group.

Imagine we have two random variables X and Y and we take a random sample of each variable to check whether the mean of X is greater than the mean of Y.

We start writing the null and alternative hypothesis as follows:

H0:\mu_{x}=\mu_{y}

Ha:\mu_{x}\neq\mu_{y}

We do simple algebra to leave a number on the right-hand side of the equality, and a random variable on the left-hand side of the equation. Then, we re-write these hypotheses as:

H0:(\mu_{x}-\mu_{y})=0

Ha:(\mu_{x}-\mu_{y})\neq0

The Greek letter \mu is used to represent the population mean of a variable.

To test this hypothesis we take a random sample of X and Y and calculate their means.

Then, in this case, the variable of study is the difference of 2 means! Then, we can name the variable of study as diff:

diff = (\mu_{x}-\mu_{y})

Since we use sample means instead of population means, we can re-define this difference as:

diff = (\bar{X}-\bar{Y})

The steps for all hypothesis tests are basically the same. What changes is the calculation of the standard deviation of the variable of study, which is usually named the standard error.

For the case of one-sample t-test, the standard error was calculated as \frac{SD}{\sqrt{N}}, where SD is the individual sample standard deviation of the variable, and N is the sample size.

In the case of two-sample t-test, the standard error SE can be calculated using different formulas depending on the assumptions of the test. In this workshop, we will assume that the population variances of both groups are NOT EQUAL, and the sample size of both groups is the same (N). For these assumptions, the formula is the following:

SD(diff)=SE=\sqrt{\frac{Var(X)+Var(Y)}{N}}

But, where does this formula come from?

We can easily derive this formula by applying basic probability rules to the variance of a difference of 2 means. Let’s do so.

The variances of each group of X and Y might be different, so we can estimate the variance of the DIFFERENCE as:

Var(\bar{X}-\bar{Y})=Var(\bar{X})+Var(\bar{Y})

This is true if and only if \bar{X} and \bar{Y} are independent. We will assume that both random variables are independent of each other. This might not apply to certain real-world problems, but we will assume it for simplicity. If there is dependence, I need to include another term equal to minus 2 times the covariance between both variables.

Why is the variance of a difference of 2 random variables the SUM of their variances? This sounds counter-intuitive, but it is correct. The intuition is that when we take the difference we do not know which random variable will deviate up or down. If a deviation of \bar{Y} is negative, then we will end up with a SUM of deviations instead of a difference!

As we learned with the CLT, the variance of the mean of a random variable is reduced according to its sample size: Var(\bar{X})=\frac{Var(X)}{N}. Then:

Var(\bar{X}-\bar{Y})=\frac{Var(X)}{N}+\frac{Var(Y)}{N}

Factorizing the expression:

Var(\bar{X}-\bar{Y})=\frac{1}{N}\left[Var(X)+Var(Y)\right]

We take the square root to get the expected standard deviation of (\bar{X}-\bar{Y}):

SD(\bar{X}-\bar{Y})=\sqrt{\frac{1}{N}\left[Var(X)+Var(Y)\right]}

Then, the method for hypothesis testing is the same we did in the case of one-sample t-test. We just need to use this formula as the denominator of the t-statistic.

Then, the t-statistic for the two-sample t-test is calculated as:

t=\frac{(\bar{X}-\bar{Y})-0}{\sqrt{\frac{Var(X)+Var(Y)}{N}}}

Remember that the value of t is the # of standard deviations of the variable of study (in this case, the difference of the 2 means) that the empirical difference we got from the data is away from the hypothetical value, zero.

The rule of thumb we have used is that if |t|>2 we have statistical evidence, at least at the 95% confidence level, to reject the null hypothesis (and support our alternative hypothesis).

6.3 EXAMPLE - IS AMD MEAN RETURN HIGHER THAN ORACLE MEAN RETURN?

Do a t-test to check whether the mean monthly cc return of AMD (AMD) is significantly greater than the mean monthly cc return of Oracle (ORCL). Use data from Jan 2020 to date.

# Getting price data. I indicate getting adjusting prices as close prices:
sprices=yf.download(tickers='AMD ORCL', start='2020-01-01', end='2025-01-31',interval='1mo', auto_adjust=True)
# I select the Close columns for both stocks:
sprices=sprices['Close']
[*********************100%***********************]  2 of 2 completed

The sprices data frame has 2 columns: the close price for AMD and the close price for ORCL:

sprices.head(5)
Ticker AMD ORCL
Date
2020-01-01 00:00:00+00:00 47.000000 48.435822
2020-02-01 00:00:00+00:00 45.480000 45.877949
2020-03-01 00:00:00+00:00 45.480000 44.829792
2020-04-01 00:00:00+00:00 52.389999 49.133751
2020-05-01 00:00:00+00:00 53.799999 50.112755

Now we calculate monthly continuously compounded returns for both stocks:

# Calculating cc returns as the difference of the log price and the log price of the previous month:
sr = np.log(sprices) - np.log(sprices.shift(1))
# we can also calculate cc returns using the diff function:
# sr = np.log(sprices).diff(1)

# Deleting the first month with NAs:
sr=sr.dropna()

sr.head()
Ticker AMD ORCL
Date
2020-02-01 00:00:00+00:00 -0.032875 -0.054255
2020-03-01 00:00:00+00:00 0.000000 -0.023112
2020-04-01 00:00:00+00:00 0.141443 0.091673
2020-05-01 00:00:00+00:00 0.026558 0.019729
2020-06-01 00:00:00+00:00 -0.022367 0.027514

We can calculate the monthly mean return for both stocks:

amd_mean = sr['AMD'].mean()
orcl_mean = sr['ORCL'].mean()
print(f"AMD mean cc % return= {100*amd_mean}%")
print(f"Oracle mean cc % return= {100*orcl_mean}%")
AMD mean cc % return= 1.5050190594531336%
Oracle mean cc % return= 2.089094493784317%

We can see that, on average, Oracle mean return is higher than the AMD mean return. However, we need to check whether this difference is statistically significant. Then, we need to do a 2-sample t-test.

We state the hypothesis and calculate the t-Statistic:

# Stating the hypotheses: 
# H0: (mean(rAMD) - mean(rORACLE)) = 0
# Ha: (mean(rAMD) - mean(rORACLE)) <> 0

# Calculating the standard error of the difference of the means:
# Getting the number of non-missing observations for the sample:
N = sr['AMD'].count()
# Getting the variances of both columns: 
amd_var = sr['AMD'].var()
orcl_var = sr['ORCL'].var()
# Now we get the standard error for the mean difference:
sediff = np.sqrt((1/N) * (amd_var + orcl_var ) )

# Calculating the t-Statistic:
t = (sr['AMD'].mean() - sr['ORCL'].mean()) / sediff
print(f"t-Statistic = {t}")
t-Statistic = -0.2678501133489891

Fortunately we can use a Python function to easily calculate the t-value along with its p-value and 95% confidence interval:

# I do the 2-way sample t-test using the ttest_ind function from stats:
st.ttest_ind(sr['AMD'],sr['ORCL'],equal_var=False)
# With this function we avoid calculating all steps of the hypothesis test!
TtestResult(statistic=-0.2678501133489891, pvalue=0.789406907315596, df=93.15009569369988)

We got the same t-value as above, and we also got the p-value of the test. Since the absolute value of the t-statistic is much less than 2, and the p-value is much greater than 0.05, we cannot reject the null hypothesis.

Then, although the ORCL mean return is higher than that of AMD, we do not have significant evidence (at the 95% confidence level) to say that the mean ORCL return is greater than the mean AMD return.

We can use another Python function that displays more results for this t-test:

import researchpy as rp
# Using the ttest function from researchpy:
rp.ttest(sr['AMD'],sr['ORCL'],equal_variances=False)
# With this function we avoid calculating all steps of the hypothesis test!
(   Variable      N      Mean        SD        SE  95% Conf.  Interval
 0       AMD   60.0  0.015050  0.147082  0.018988  -0.022945  0.053045
 1      ORCL   60.0  0.020891  0.083049  0.010722  -0.000563  0.042345
 2  combined  120.0  0.017971  0.118970  0.010860  -0.003534  0.039475,
          Satterthwaite t-test  results
 0  Difference (AMD - ORCL) =   -0.0058
 1       Degrees of freedom =   93.1501
 2                        t =   -0.2679
 3    Two side test p value =    0.7894
 4   Difference < 0 p value =    0.3947
 5   Difference > 0 p value =    0.6053
 6                Cohen's d =   -0.0489
 7                Hedge's g =   -0.0486
 8           Glass's delta1 =   -0.0397
 9         Point-Biserial r =   -0.0277)

We got the same results as above, but now we see details about the mean of each stock, the difference, the t-value, and the p-value of the difference!

7 Confidence level, Type I Error and pvalue

The confidence level of a test is related to the error level of the test. For a confidence level of 95%, there is a 5% probability that we make the mistaken conclusion of rejecting the null hypothesis when it is actually true. In other words, for a 95% confidence level, we can end up with a mistaken conclusion 5% of the time. This error is also called the Type I Error.

The pvalue of the test is actually the exact probability of making a Type I Error after we calculate the exact t-statistic. In other words, the pvalue is the probability that we will be wrong if we reject the null hypothesis (and support our hypothesis).

For each value of a t-statistic, there is a corresponding pvalue. We can relate both values in the following figure of the t-Student PDF:

Illustrating t-Statistics vs pvalue

For a 95% confidence level and a 2-tailed p-value, the critical t value is close to 2 (it is not exactly 2); it changes according to N, the # of observations in the sample.

When the sample size N>30, the t-Student distribution approximates the Z normal distribution. In the above figure we can see that when N>30 and t=2, the approximate p-values are: 1-tailed p-value = 2.5%, and 2-tailed p-value = 5%.

Then, what are the 1-tailed and 2-tailed p-values? The 2-tailed p-value will always be twice the value of the 1-tailed p-value since the t-Student distribution is symmetric.

We always want to have a very small pvalue in order to reject H0. Then, the 1-tailed pvalue seems to be the one to use. However, the 2-tailed pvalue is a more conservative value (the diablito will feel ok with this value). Most of the statistical software and computer languages report 2-tailed pvalue.

Then, which pvalue is the right to use? It depends on the context. When there is a theory that supports the alternative hypothesis, we can use the 1-tailed pvalue. For now, we can be conservative and use the 2-tailed pvalue for our t-tests.

Then, we can define the p-value of a t-test (in terms of the confidence level of the test) as:

pvalue=(1-ConfidenceLevel)

In the case of a 1-tailed p-value and a 95% confidence level, the critical t-value is less than 2; it is approximately 1.65:

Illustrating 1-tailed t and pvalues

The p-value cannot be calculated with an analytic formula since the integral of the Z normal or t-Student PDF has no closed-form analytic solution; traditionally, we needed to use tables. Fortunately, all statistical software and most computer languages can easily calculate p-values for any hypothesis test.
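
For example, here is a minimal sketch with scipy (using the st alias imported earlier, and assuming df = 59 as in the S&P 500 example) to get the critical t-values instead of looking them up in a table:

# Critical t-values for a sample of N = 60 (df = 59), as in the S&P 500 example
df = 59

t_crit_1tailed = st.t.ppf(0.95, df)    # about 1.67: 1-tailed test, 95% confidence
t_crit_2tailed = st.t.ppf(0.975, df)   # about 2.00: 2-tailed test, 95% confidence

print(t_crit_1tailed, t_crit_2tailed)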

8 CHALLENGES

8.1 Challenge 1

  • Using Python code, you have to calculate the mean, standard deviation and variance of the simple and continuously compounded returns of MSFT and NVDA.

  • WHICH STOCK WOULD YOU PREFER ACCORDING TO THEIR MEAN AND STANDARD DEVIATION OF RETURNS? JUSTIFY YOUR ANSWER

  • WHICH STOCK IS MORE VOLATILE? EXPLAIN AND INTERPRET THE VOLATILITY (THE STANDARD DEVIATION) OF THIS STOCK

Hint: Remember that you already have a data frame for simple (R) and continuously compounded returns (r). You can apply the describe function, or the mean, std, and var functions, to the r and R data frames.

8.2 Challenge 2

  • Do a histogram for daily Bitcoin simple returns.

Hints: use the plot.hist function for pandas data frames.

  • INTERPRET the histogram with your own words and in CAPITAL LETTERS

Hint: You can use Gemini or ChatGPT to come up with the Python code to get daily prices, get returns and show a histogram of returns.

BTC=yf.download(tickers="BTC-USD", start="2017-01-01",interval="1mo", auto_adjust=True)
# I calculate simple return columns using closing prices:
RBTC = (BTC["Close"] / BTC["Close"].shift(1)) - 1
[*********************100%***********************]  1 of 1 completed
hist=RBTC.plot.hist(bins=12,alpha=0.5,title="Histogram of monthly Bitcoin Returns")

8.3 Challenge 3

Do a one-sample t-test to check whether the mean monthly returns of MSFT are significantly greater than zero.

Hint: you have to bring monthly prices from Yahoo, get returns, and do the t-test according to what we learned in this workshop