Workshop 3, Business Analytics for Decision Making

Author

Alberto Dorantes D., Ph.D.

Published

January 16, 2026

Abstract

In this workshop we learn about the Central Limit Theorem, Hypothesis Testing, and measures of linear relationships.

1 Workshop Directions

For this Workshop you have to read the following chapters of my e-book:

Chapter 5 - The Central Limit Theorem

Chapter 6 - Hypothesis Testing

Chapter 7 - Measures of Linear Relationships

In your Google Colab Notebook you have to replicate any Python code and do all CHALLENGES of this Workshop.

2 Hypothesis testing exercises

2.1 One-sample t-test

We start with the simple case of hypothesis testing: the One-Sample t-test.

We will learn about this with an example.

We download historical monthly data for the S&P500 market index and calculate continuously compounded (cc) returns:


import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Download stock data
data = yf.download('^GSPC', start='2020-01-01', end='2025-01-31', interval='1mo', auto_adjust=True)
adjprices = data['Close']  # Get adjusted close prices

# Calculate continuously compounded returns

ccr = np.log(adjprices) - np.log(adjprices.shift(1))
ccr = ccr.dropna()

Here is an example of a t-test to check whether the average monthly return of the S&P 500 is significantly greater than zero:

# H0: mean(ccr['^GSPC']) = 0
# Ha: mean(ccr['^GSPC']) > 0

# Standard error (ccr is a one-column DataFrame, so the results are pandas Series)
se_GSPC = ccr.std(ddof=1) / np.sqrt(len(ccr))
print(f"Standard error S&P 500 = {se_GSPC}")

# t-value
mean_ccr = ccr.mean()
t_GSPC = (mean_ccr - 0) / se_GSPC
print(f"t-value S&P 500 = {t_GSPC}")

Standard error S&P 500 = Ticker
^GSPC    0.00679
dtype: float64
t-value S&P 500 = Ticker
^GSPC    1.540082
dtype: float64


Since the t-value of the S&P 500 mean return (1.54) is lower than 2, I can’t reject the null hypothesis at the 95% confidence level. Therefore, at the 95% confidence level, the S&P 500 mean monthly return is not statistically greater than 0.
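The rule of thumb of comparing the t-value against 2 comes from the Student-t distribution. The following is a minimal sketch, assuming 59 degrees of freedom (60 monthly returns minus 1, an assumption based on the sample used here), that computes the critical t-values for a 95% confidence level with the scipy t.ppf function:

```python
from scipy import stats as st

df = 59  # assumed degrees of freedom: 60 monthly returns - 1

# Critical t-value for a two-sided test at the 95% confidence level
t_crit_two_sided = st.t.ppf(0.975, df)

# Critical t-value for a one-sided (greater than) test at the 95% confidence level
t_crit_one_sided = st.t.ppf(0.95, df)

print(f"Two-sided critical value = {t_crit_two_sided:.3f}")
print(f"One-sided critical value = {t_crit_one_sided:.3f}")
```

The two-sided critical value is very close to 2, which is why 2 works as a quick threshold. The one-sided critical value is a bit lower (about 1.67), and the t-value of 1.54 is below both.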

We can calculate the p-value of the test.

What is the p-value of a test?

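The p-value of a test is the probability of getting a t-value at least as extreme as the one we observed, assuming that the NULL hypothesis is true. A small p-value tells us that our data would be very unlikely if the NULL hypothesis were true, so we have evidence in favor of MY HYPOTHESIS (the alternative hypothesis). Informally, the p-value can be seen as the risk of being wrong when I reject the NULL hypothesis. Strictly speaking, however, it is not the probability that the NULL hypothesis is true, and (1-pvalue) is not the probability that my hypothesis is true.

(1-pvalue), expressed as a percentage, is commonly used as the highest confidence level at which I can reject the null hypothesis.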

Fortunately, there is a Python function that does the same calculation we did, but faster, and it also returns the p-value of the test:

from scipy import stats as st

# One-sided t-test
ttest_GSPC = st.ttest_1samp(ccr, 0, alternative='greater')

# Showing the t-Statistics and the p-value:
ttest_GSPC

TtestResult(statistic=array([1.54008237]), pvalue=array([0.06444343]), df=array([59]))

I got the same t-value with the ttest_1samp function.
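As a cross-check, the one-sided p-value can be recovered from the t-value and the degrees of freedom reported above. This is a small sketch that uses the survival function of the Student-t distribution (st.t.sf), which gives the probability of getting a t-value greater than the observed one if the null hypothesis were true:

```python
from scipy import stats as st

t_value = 1.54008237  # t-Statistic reported by ttest_1samp above
df = 59               # degrees of freedom reported by ttest_1samp above

# Survival function = P(T > t_value) under the null hypothesis
p_one_sided = st.t.sf(t_value, df)
print(f"One-sided p-value = {p_one_sided:.4f}")
```

The result should agree with the p-value of about 0.0644 returned by ttest_1samp.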

But what does this mean? Does this mean that investing in the S&P is not going to give you positive returns over time? No, that would not be a good conclusion.

We got a p-value of 6.4%. This means that we can reject the null hypothesis (and accept our hypothesis that the monthly mean return of the S&P500 is greater than 0) at the 93.6% confidence level (=1-pvalue), but not at the 95% level. When we get a p-value<0.05 we can say that we have strong evidence (at least at the 95% confidence level) to reject the null hypothesis (accept my hypothesis). In this case we say that our result is statistically significant.

When we get a p-value between 5% and 10% (0.05 and 0.10) we can say that we have evidence at the 90% confidence level to reject the null hypothesis. We can also say that our results are marginally significant. This is the case for the S&P 500 monthly returns in this sample.

2.2 Two-sample t-test exercise

You have to respond to the following question:

IS AMD MEAN RETURN HIGHER THAN INTEL MEAN RETURN?

To respond to this question, do a t-test to check whether the mean monthly cc return of AMD (ticker AMD) is greater than the mean monthly cc return of Intel (ticker INTC). Use data from Jan 2020 to date.

import pandas as pd
import numpy as np
import yfinance as yf

# Getting price data; with auto_adjust=True the 'Close' column holds adjusted prices:
sprices = yf.download(tickers="AMD INTC", start="2020-01-01", interval="1mo", auto_adjust=True)

sprices = sprices['Close']


# Calculating returns:
sr = np.log(sprices) - np.log(sprices.shift(1))
# Deleting the first month with NAs:
sr=sr.dropna()

# Stating the hypotheses: 
# H0: (mean(rAMD) - mean(rINTEL)) = 0
# Ha: (mean(rAMD) - mean(rINTEL)) <> 0

# Calculating the standard error of the difference of the means:
N = sr['AMD'].count()
amdvar = sr['AMD'].var()
intelvar = sr['INTC'].var()
sediff = np.sqrt((1/N) * (amdvar + intelvar ) )

# Calculating the t-Statistic:
t = (sr['AMD'].mean() - sr['INTC'].mean()) / sediff
t

np.float64(1.0267858003305392)

# Calculating the pvalue from the t-Statistic:
from scipy import stats as st
# The st.t.sf function calculates the 1-tailed pvalue, so we multiply it by 2 to get the 2-tailed pvalue
# the degrees of freedom for 2-independent-means t-test is calculated with the following formula:
df = ( ((N-1) / N**2) * (amdvar + intelvar)**2  / ( (amdvar/N)**2 + (intelvar/N)**2  ) )
# Now we calculate the pvalue with the t and df:
pvalue = 2 * st.t.sf(np.abs(t), df)
pvalue

np.float64(0.3063223387519934)

# Using the ttest_ind function from stats:
st.ttest_ind(sr['AMD'],sr['INTC'],equal_var=False)
# We got the same result as above!
# With this function we avoid calculating all steps of the hypothesis test!

TtestResult(statistic=np.float64(1.0267858003305392), pvalue=np.float64(0.3063223387519934), df=np.float64(137.64725101316557))
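The manual steps above assume that both samples have the same number of observations N. As a hedged sketch, the same Welch (unequal variances) test can be written for samples of different sizes. The function name welch_ttest and the simulated returns below are hypothetical and only serve as an illustration:

```python
import numpy as np
from scipy import stats as st

def welch_ttest(x, y):
    """Two-sided Welch t-test for two independent samples (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Variance of each sample mean (sample variance divided by sample size)
    vx = x.var(ddof=1) / len(x)
    vy = y.var(ddof=1) / len(y)
    # t-Statistic of the difference of the means
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    pvalue = 2 * st.t.sf(np.abs(t), df)
    return t, df, pvalue

# Simulated (not real) monthly returns with different sample sizes
rng = np.random.default_rng(123)
sample_a = rng.normal(0.02, 0.15, size=72)
sample_b = rng.normal(0.00, 0.12, size=60)

t, df, pvalue = welch_ttest(sample_a, sample_b)
ref = st.ttest_ind(sample_a, sample_b, equal_var=False)
print(f"manual: t = {t:.4f}, df = {df:.2f}, pvalue = {pvalue:.4f}")
print(f"scipy:  t = {ref.statistic:.4f}, pvalue = {ref.pvalue:.4f}")
```

When both samples have the same size, the degrees of freedom formula in this sketch reduces to the one used in the manual calculation above.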

import researchpy as rp
# Using the ttest function from researchpy:
rp.ttest(sr['AMD'],sr['INTC'],equal_variances=False)
# We got the same result as above!
# With this function we avoid calculating all steps of the hypothesis test!


(   Variable      N      Mean        SD        SE  95% Conf.  Interval
 0       AMD   72.0  0.021928  0.152673  0.017993  -0.013948  0.057805
 1      INTC   72.0 -0.002146  0.127556  0.015033  -0.032120  0.027829
 2  combined  144.0  0.009891  0.140703  0.011725  -0.013286  0.033069,
          Satterthwaite t-test   results
 0  Difference (AMD - INTC) =     0.0241
 1       Degrees of freedom =   137.6473
 2                        t =     1.0268
 3    Two side test p value =     0.3063
 4   Difference < 0 p value =     0.8468
 5   Difference > 0 p value =     0.1532
 6                Cohen's d =     0.1711
 7                Hedge's g =     0.1702
 8           Glass's delta1 =     0.1577
 9         Point-Biserial r =     0.0872)
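The question of this exercise is one-sided (is the AMD mean return HIGHER than the Intel mean return?), while the p-value of 0.3063 is for a two-sided test. The researchpy output reports the one-sided p-value in the row "Difference > 0 p value" (0.1532, which is half of 0.3063). A minimal sketch of how to get the one-sided test directly with scipy is shown below. It uses simulated returns (not real AMD or Intel data) so that it runs without downloading anything, and it assumes a scipy version that supports the alternative argument in ttest_ind:

```python
import numpy as np
from scipy import stats as st

# Simulated (not real) monthly returns for two hypothetical stocks
rng = np.random.default_rng(2024)
ret_a = rng.normal(0.02, 0.15, size=72)
ret_b = rng.normal(0.00, 0.13, size=72)

# Two-sided test: Ha: mean(ret_a) - mean(ret_b) <> 0
two_sided = st.ttest_ind(ret_a, ret_b, equal_var=False)

# One-sided test: Ha: mean(ret_a) - mean(ret_b) > 0
one_sided = st.ttest_ind(ret_a, ret_b, equal_var=False, alternative='greater')

print(f"t = {two_sided.statistic:.4f}")
print(f"two-sided p-value = {two_sided.pvalue:.4f}")
print(f"one-sided p-value = {one_sided.pvalue:.4f}")
```

The t-Statistic is the same in both tests. When t is positive, the one-sided p-value is half of the two-sided p-value; when t is negative, it is 1 minus half of the two-sided p-value.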

3 Measures of linear relationship

We might be interested in learning whether there is a pattern of movement of a random variable when another random variable moves up or down. An important pattern we can measure is the linear relationship. The main two measures of linear relationship between 2 random variables are:

Covariance and
Correlation

Let’s start with an example. Imagine we want to see whether there is a relationship between the S&P500 and Microsoft stock.

The S&P500 is an index that represents 500 of the largest publicly traded US companies, so it is a good representation of the US stock market. We will use monthly data from January 2019 to date.

Let’s download the price data and do the corresponding return calculation. Instead of pandas, we will use yfinance to download online data from Yahoo Finance.

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib
import matplotlib.pyplot as plt

# We download price data for Microsoft and the S&P500 index:
prices = yf.download(tickers="MSFT ^GSPC", start="2019-01-01", interval="1mo", auto_adjust=True)
# With auto_adjust=True the 'Close' column holds adjusted prices; we drop any row with NA values:
adjprices = prices['Close'].dropna()


GSPC stands for Global Standard & Poors Composite, which is the S&P500 index.

Now we will do some informative plots to start learning about the possible relationship between GSPC and MSFT.

Unfortunately, the range of stock prices and market indexes can vary a lot, which makes it difficult to compare price movements in one plot. For example, if we plot the MSFT prices and the S&P500:

plt.clf()
adjprices.plot(y=['MSFT','^GSPC'])
plt.show()

[Figure: line plot of MSFT and ^GSPC adjusted prices over time]

It looks like the GSPC has had a better performance, but this is misleading since both investments have very different price ranges.

When comparing the performance of 2 or more stock prices and/or indexes, it is a good idea to generate an index for each series, so that we can emulate how much $1.00 invested in each stock/index would have moved over time. We can divide the stock price of any month by the stock price of the first month to get a growth factor:

adjprices['iMSFT'] = adjprices['MSFT'] / adjprices['MSFT'].iloc[0]
adjprices['iGSPC'] = adjprices['^GSPC'] / adjprices['^GSPC'].iloc[0]

This growth factor is like an index of the original variable. Now we can plot these 2 new indexes over time and see which investment was better:

plt.clf()
adjprices.plot(y=['iMSFT','iGSPC'])
plt.show()

[Figure: line plot of the iMSFT and iGSPC growth indexes over time]

Now we have a much better picture of which instrument has had better performance over time. The line of each instrument represents how the value of $1.00 invested in the instrument would have changed over time.
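The growth index can also be calculated for all columns in one step. The following is a minimal sketch with made-up prices for two hypothetical instruments (A and B); dividing a DataFrame by its first row applies the division to each column with its own starting price:

```python
import pandas as pd

# Hypothetical prices for two instruments with very different price ranges
prices = pd.DataFrame(
    {"A": [100.0, 110.0, 121.0], "B": [3000.0, 3150.0, 3300.0]},
    index=pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
)

# Each column is divided by its own first value, so every index starts at 1.00
growth = prices / prices.iloc[0]
print(growth)
```

In this made-up example, $1.00 invested in A would have grown to $1.21 and $1.00 invested in B to $1.10, even though B has much higher prices.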

Now we calculate continuously compounded monthly returns. With pandas, most data management functions work on whole columns at once. In other words, the operation below is applied to every column, row by row:

r = np.log(adjprices) - np.log(adjprices.shift(1))
# Dropping rows with NA values (the first month will have NAs)
r = r.dropna()
# Selecting only 2 columns (out of the 4 columns):
r = r[['MSFT','^GSPC']]
# Renaming the columns:
r.columns = ['MSFT','GSPC']

Now the r dataframe will have 2 columns for both cc historical returns:

r.head()

	MSFT	GSPC
Date
2019-02-01	0.070250	0.029296
2019-03-01	0.055671	0.017766
2019-04-01	0.101963	0.038560
2019-05-01	-0.054442	-0.068041
2019-06-01	0.083538	0.066658
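One useful property of cc returns is that they add up over time: the sum of the monthly cc returns is equal to the log of the total growth factor of the price. The sketch below illustrates this with made-up prices, and it also shows an equivalent way to calculate cc returns with the pandas diff function:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 105.0, 99.0, 108.0])  # hypothetical monthly prices

# Same calculation as np.log(prices) - np.log(prices.shift(1)), written with diff()
cc = np.log(prices).diff().dropna()

# The sum of the cc returns equals the log of the total growth factor
total_cc = cc.sum()
growth_factor = prices.iloc[-1] / prices.iloc[0]
print(f"Sum of cc returns    = {total_cc:.6f}")
print(f"log of growth factor = {np.log(growth_factor):.6f}")
```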

To learn about the possible relationship between the GSPC and MSFT we can look at their prices and also we can look at their returns.

We start with a scatter plot to see whether there is a linear relationship between the MSFT returns and the GSPC returns:

plt.clf()
r.plot.scatter(x='GSPC', y='MSFT',c='DarkBlue')
plt.show()

[Figure: scatter plot of MSFT returns (y axis) against GSPC returns (x axis)]

What do you see?

We can also do a scatter plot to visualize the relationship between the MSFT prices and the GSPC index level:

plt.clf()
adjprices.plot.scatter(x='^GSPC', y='MSFT',c='DarkBlue')
plt.show()

[Figure: scatter plot of MSFT prices (y axis) against the ^GSPC index level (x axis)]

What do you see? Which plot conveys a stronger linear relationship?

The scatter plot using the prices conveys an apparent stronger linear relationship compared to the scatter plot using returns.

Stock returns are variables that usually do NOT grow over time; their plot looks like a plot of heartbeats:

plt.clf()
r.plot(y=['MSFT','GSPC'])
plt.show()

[Figure: line plot of MSFT and GSPC monthly cc returns over time]

Stock returns behave like a stationary variable since they do not have a growing or declining trend over time. A stationary variable is a variable that has a similar average and standard deviation in any time period.

Stock prices (and indexes) are variables that usually grow over time (sooner or later). These variables are called non-stationary variables. A non-stationary variable usually changes its mean depending on the time period.

In statistics, we have to be very careful when looking at linear relationships between non-stationary variables, like stock prices. It is very likely that we end up with spurious measures of linear relationships when we use non-stationary variables. We will cover the risk of estimating spurious relationships in the topic of time-series regression models (covered in a more advanced module).

Then, in this case it is better to look at the linear relationship between stock returns (not prices).
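To illustrate the risk of spurious relationships, here is a small simulation sketch. It generates pairs of series that are independent by construction, accumulates them into random walks (which behave like non-stationary prices), and compares the average absolute correlation of the levels against that of the changes. The number of simulations, the number of periods, and the seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_periods = 500, 120

corr_levels = []
corr_changes = []
for _ in range(n_sims):
    # Two independent series of changes (stationary and unrelated by construction)
    x_changes = rng.normal(size=n_periods)
    y_changes = rng.normal(size=n_periods)
    # Accumulating the changes gives two independent random walks (non-stationary)
    x_levels = x_changes.cumsum()
    y_levels = y_changes.cumsum()
    corr_levels.append(abs(np.corrcoef(x_levels, y_levels)[0, 1]))
    corr_changes.append(abs(np.corrcoef(x_changes, y_changes)[0, 1]))

mean_abs_corr_levels = np.mean(corr_levels)
mean_abs_corr_changes = np.mean(corr_changes)
print(f"Average |correlation| of levels  = {mean_abs_corr_levels:.3f}")
print(f"Average |correlation| of changes = {mean_abs_corr_changes:.3f}")
```

Even though the series are unrelated, the correlations of the levels tend to be far from zero, while the correlations of the changes stay close to zero.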

3.1 Covariance

The Covariance between 2 random variables, X and Y, is a measure of linear relationship.

The Covariance is the average of the products of the deviations of X and Y from their corresponding means.

For a sample of N observations of 2 random variables X and Y, we can calculate the population covariance as:

Cov(X,Y)=\frac{1}{N}\left[(X_{1}-\bar{X})(Y_{1}-\bar{Y})+...+(X_{N}-\bar{X})(Y_{N}-\bar{Y})\right]
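The formula above divides by N (population covariance). When we work with a sample, it is common to divide by N-1 instead (sample covariance). The following sketch uses a tiny made-up dataset to show both versions, and how the ddof argument of the numpy cov function selects between them:

```python
import numpy as np

# Small made-up dataset
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 9.0])
n = len(x)

# Sum of the products of the deviations from the means
sum_of_products = ((x - x.mean()) * (y - y.mean())).sum()

cov_population = sum_of_products / n        # divides by N
cov_sample = sum_of_products / (n - 1)      # divides by N-1

# numpy: ddof=0 divides by N; the default divides by N-1
cov_population_np = np.cov(x, y, ddof=0)[0, 1]
cov_sample_np = np.cov(x, y)[0, 1]

print(cov_population, cov_population_np)
print(cov_sample, cov_sample_np)
```

With many observations, the difference between dividing by N and by N-1 becomes very small.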

3.2 Correlation

Correlation is a very practical measure of linear relationship between 2 random variables. It is actually a scaled version of the Covariance:

Corr(X,Y)=\frac{Cov(X,Y)}{SD(X)SD(Y)}

If we divide Cov(X,Y) by the product of the standard deviations of X and Y, we get the correlation, which can have values only between -1 and +1.

-1<=Corr(X,Y)<=1

If Corr(X,Y) = +1, that means that Y moves exactly in line with X: Y is an exact linear function of X with a positive slope (Y = a + bX, with b > 0).

If Corr(X,Y) = -1, that means that Y is an exact linear function of X, but in the opposite direction (negative slope).

If Corr(X,Y) = 0, that means that the movements of Y are not linearly related to the movements of X; there is no clear linear pattern of how Y moves when X moves. Note that this does not necessarily mean that X and Y are independent, since they could still have a non-linear relationship.

If 0<Corr(X,Y)<1, that means that there is a positive linear relationship between X and Y. The strength of this relationship is given by the magnitude of the correlation. For example, if Corr(X,Y) = 0.50 and X is 1 standard deviation above its mean, a linear prediction places Y about 0.5 standard deviations above its mean.

If -1<Corr(X,Y)<0, that means that there is a negative linear relationship between X and Y. The strength of this relationship is given by the magnitude of the correlation. For example, if Corr(X,Y) = -0.50 and X is 1 standard deviation above its mean, a linear prediction places Y about 0.5 standard deviations below its mean (and vice versa).
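The extreme cases described above can be checked with a few made-up numbers. In this sketch, Y is first an exact increasing linear function of X, then an exact decreasing linear function of X, and finally a non-linear function of X (Y = X squared) that depends completely on X but has zero correlation with it:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

corr_pos = np.corrcoef(x, 3 + 2 * x)[0, 1]  # exact increasing linear function of x
corr_neg = np.corrcoef(x, 3 - 2 * x)[0, 1]  # exact decreasing linear function of x
corr_sq = np.corrcoef(x, x ** 2)[0, 1]      # depends on x, but not linearly

print(f"Y = 3 + 2X: corr = {corr_pos:.2f}")
print(f"Y = 3 - 2X: corr = {corr_neg:.2f}")
print(f"Y = X^2:    corr = {corr_sq:.2f}")
```

The last case shows why a correlation of zero only rules out a linear relationship, not every type of relationship.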

If we want to test whether Corr(X,Y) is positive and significant, we need to do a hypothesis test. The formula for the standard error (the standard deviation of the correlation estimate) is:

SD(corr)=\sqrt{\frac{(1-corr^{2})}{(N-2)}}

Then, the t-Statistic for this hypothesis test will be:

t=\frac{corr}{\sqrt{\frac{(1-corr^{2})}{(N-2)}}}

If Corr(X,Y)>0 and t>2 (its 2-tailed pvalue will be <0.05), then we can say with 95% confidence that there is a positive linear relationship; in other words, that the correlation is positive and statistically significant (significantly greater than zero).
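The hypothesis test for the correlation can be sketched in a few lines. The example below uses simulated data (a hypothetical X and a Y that is partly driven by X), applies the standard error and t-Statistic formulas above, and compares the resulting 2-tailed pvalue with the one returned by the scipy pearsonr function:

```python
import numpy as np
from scipy import stats as st

# Simulated data with a moderate linear relationship (not real returns)
rng = np.random.default_rng(7)
n = 60
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

corr = np.corrcoef(x, y)[0, 1]

# Standard error and t-Statistic of the correlation
se_corr = np.sqrt((1 - corr ** 2) / (n - 2))
t_corr = corr / se_corr

# 2-tailed pvalue with N-2 degrees of freedom
p_two_sided = 2 * st.t.sf(abs(t_corr), n - 2)

# pearsonr returns the correlation and its 2-tailed pvalue
ref = st.pearsonr(x, y)
print(f"manual: corr = {corr:.4f}, t = {t_corr:.4f}, pvalue = {p_two_sided:.6f}")
print(f"scipy:  corr = {ref[0]:.4f}, pvalue = {ref[1]:.6f}")
```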

3.3 Calculating covariance and correlation

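We can program the covariance of 2 variables according to the formula. Since we are working with a sample of returns, we divide by (N-1) instead of N to get the sample covariance, which is also what numpy and pandas calculate by default: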

msft_mean = r['MSFT'].mean()
gspc_mean = r['GSPC'].mean()
N = r['GSPC'].count()
sum_of_prod = ((r['MSFT'] - msft_mean) * (r['GSPC'] - gspc_mean) ).sum()  
cov = sum_of_prod / (N-1)
cov

np.float64(0.00207974541822867)

Fortunately, we have the numpy function cov to calculate the covariance:

covm = np.cov(r['MSFT'],r['GSPC'])
covm

array([[0.0037534 , 0.00207975],
       [0.00207975, 0.0022462 ]])

The cov function calculates the covariance matrix using both returns. We can find the covariance in the non-diagonal elements, which have the same value since the covariance matrix is symmetric.

The diagonal values have the variances of each return since the covariance of one variable with itself is actually its variance (Cov(X,X) = Var(X)).

Then, to get the covariance between MSFT and GSPC returns we can extract the element in row 1 and column 2 of the matrix (position [0,1] in Python, since indexes start at 0):

cov = covm[0,1]
cov

np.float64(0.00207974541822867)

This value is exactly the same as the one we calculated manually.

We can use the corrcoef function of numpy to calculate the correlation matrix:

corr = np.corrcoef(r['MSFT'],r['GSPC'])
corr

array([[1.        , 0.71626433],
       [0.71626433, 1.        ]])

The correlation matrix will have +1 in its diagonal since the correlation of one variable with itself is +1. The non-diagonal value will be the actual correlation between the corresponding 2 variables (the one in the row, and the one in the column).

We could also manually calculate correlation using the previous covariance:

corr2 = cov / (r['MSFT'].std() * r['GSPC'].std())
corr2

np.float64(0.7162643268078946)
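When the returns are already in a pandas dataframe, the cov and corr methods of the dataframe give the same matrices with the column names as labels. This is a small sketch with simulated returns for a hypothetical stock and market index (the column names STOCK and MARKET are made up for illustration):

```python
import numpy as np
import pandas as pd

# Simulated (not real) monthly returns for a hypothetical stock and market index
rng = np.random.default_rng(1)
market = rng.normal(0.01, 0.05, size=48)
stock = 0.005 + 1.1 * market + rng.normal(0, 0.04, size=48)
returns = pd.DataFrame({"STOCK": stock, "MARKET": market})

cov_matrix = returns.cov()    # sample covariances (divides by N-1)
corr_matrix = returns.corr()  # Pearson correlations

# Elements can be selected by column name instead of by position
cov_value = cov_matrix.loc["STOCK", "MARKET"]
corr_value = corr_matrix.loc["STOCK", "MARKET"]
print(cov_matrix)
print(corr_matrix)
```

Selecting by name avoids having to remember which row and column of the matrix correspond to each variable.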

We can use the scipy pearsonr function to calculate the correlation and also the 2-tailed pvalue to see whether the correlation is statistically different from zero:

from scipy.stats import pearsonr
corr2 = pearsonr(r['MSFT'],r['GSPC'])
corr2

PearsonRResult(statistic=np.float64(0.7162643268078945), pvalue=np.float64(1.867580152446407e-14))

The pvalue is almost zero (about 1.9 * 10^{-14}). MSFT and GSPC returns have a positive and very significant correlation (at the 99.9999…% confidence level).